Choose Number of Topics for LDA Model
This example shows how to decide on a suitable number of topics for a latent Dirichlet allocation (LDA) model.

To decide on a suitable number of topics, you can compare the goodness of fit of LDA models fit with varying numbers of topics. You can evaluate the goodness of fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents; a lower perplexity suggests a better fit.
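As a rough sketch of what that number means (based on the standard definition of perplexity, not on code from this example), the perplexity of a held-out set corresponds to the exponential of the negative per-word log-likelihood. Given a fitted model mdl (created later in this example), you could compute something close to the reported value from the per-document log-probabilities returned by logp:

% Sketch (assumption): relate per-document log-probabilities to perplexity.
% logProb is the first output of logp, and doclength returns the number of
% tokens in each held-out document.
logProb = logp(mdl,documentsValidation);
numWordsTotal = sum(doclength(documentsValidation));
perplexitySketch = exp(-sum(logProb)/numWordsTotal);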
Extract and Preprocess Text Data
Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event. Extract the text data from the field Description.
filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
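To confirm the import, you can preview the first few rows of the table (an optional check, not part of the original listing):

head(data)   % display the first eight rows of the table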
Tokenize and preprocess the text data using the function preprocessText, which is listed at the end of this example.
documents = preprocessText(textData);
documents(1:5)
ans = 
  5×1 tokenizedDocument:

    6 tokens: item occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
Set aside 10% of the documents at random for validation.
numDocuments = numel(documents);
cvp = cvpartition(numDocuments,'HoldOut',0.1);
documentsTrain = documents(cvp.training);
documentsValidation = documents(cvp.test);
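The partition is random. For a reproducible split, you can seed the random number generator before calling cvpartition (an optional step, not shown in the listing above):

rng('default')   % seed the generator so the holdout split is repeatable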
Create a bag-of-words model from the training documents. Remove the words that do not appear more than two times in total. Remove any documents containing no words.
bag = bagOfWords(documentsTrain);
bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag);
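To sanity-check the pruned vocabulary, you can list the most frequent words in the bag-of-words model (an optional check using topkwords):

tbl = topkwords(bag,10)   % table of the 10 most frequent words and their counts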
Choose Number of Topics
The goal is to choose a number of topics that minimizes the perplexity compared to other numbers of topics. This is not the only consideration: models fit with larger numbers of topics can take longer to converge. To see the effects of the tradeoff, calculate both the goodness of fit and the fitting time. If the optimal number of topics is high, then you might want to choose a lower value to speed up the fitting process.
Fit some LDA models for a range of values for the number of topics. Compare the fitting time and the perplexity of each model on the held-out set of validation documents. The perplexity is the second output of the logp function. To obtain the second output without assigning the first output to anything, use the ~ symbol. The fitting time is the TimeSinceStart value for the last iteration. This value is in the History struct of the FitInfo property of the LDA model.
For a quicker fit, specify 'Solver' to be 'savb'. To suppress verbose output, set 'Verbose' to 0. This can take a few minutes to run.
numTopicsRange = [5 10 15 20 40];
for i = 1:numel(numTopicsRange)
    numTopics = numTopicsRange(i);
    
    mdl = fitlda(bag,numTopics, ...
        'Solver','savb', ...
        'Verbose',0);
    
    [~,validationPerplexity(i)] = logp(mdl,documentsValidation);
    timeElapsed(i) = mdl.FitInfo.History.TimeSinceStart(end);
end
Show the perplexity and elapsed time for each number of topics in a plot. Plot the perplexity on the left axis and the time elapsed on the right axis.
figure
yyaxis left
plot(numTopicsRange,validationPerplexity,'+-')
ylabel("Validation Perplexity")

yyaxis right
plot(numTopicsRange,timeElapsed,'o-')
ylabel("Time Elapsed (s)")
legend(["Validation Perplexity" "Time Elapsed (s)"],'Location','southeast')
xlabel("Number of Topics")
The plot suggests that fitting a model with 10–20 topics can be a good choice. The perplexity is low compared with the models fit with other numbers of topics. With this solver, the elapsed time for this many topics is also reasonable. With other solvers, you might find that increasing the number of topics leads to a better fit, but fitting the model takes longer to converge.
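If you prefer to select the value programmatically rather than by eye, a minimal sketch (not part of the original example) is to take the number of topics with the lowest validation perplexity and refit a final model:

% Pick the number of topics that minimizes the validation perplexity.
% This simple rule ignores the fitting time, so weigh that separately.
[~,idx] = min(validationPerplexity);
bestNumTopics = numTopicsRange(idx);

% Refit a final model with the chosen number of topics.
mdlFinal = fitlda(bag,bestNumTopics,'Solver','savb','Verbose',0);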
Example Preprocessing Function
The function preprocessText performs the following steps in order:

1. Convert the text data to lowercase using lower.
2. Tokenize the text using tokenizedDocument.
3. Erase punctuation using erasePunctuation.
4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
7. Lemmatize the words using normalizeWords.
function documents = preprocessText(textData)

% Convert the text data to lowercase.
cleanTextData = lower(textData);

% Tokenize the text.
documents = tokenizedDocument(cleanTextData);

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or greater
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

% Lemmatize the words. Adding part-of-speech details first improves
% lemmatization.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

end
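As a quick check of the pipeline, you can pass a sample sentence through the function (the sentence below is made up for illustration):

str = "The sorting machine is making lots of loud noises.";
documents = preprocessText(str)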
See Also

tokenizedDocument | removeStopWords | erasePunctuation | normalizeWords | addPartOfSpeechDetails