Choose Number of Topics for LDA Model
This example shows how to decide on a suitable number of topics for a latent Dirichlet allocation (LDA) model.

To decide on a suitable number of topics, you can compare the goodness of fit of LDA models fit with varying numbers of topics. You can evaluate the goodness of fit of an LDA model by calculating the perplexity of a held-out set of documents. The perplexity indicates how well the model describes a set of documents; a lower perplexity suggests a better fit.
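As a rough sketch of what that number means (based on the standard definition of perplexity, not on code from this example), the perplexity of a held-out set corresponds to the exponential of the negative per-word log-likelihood. Given a fitted model mdl (created later in this example), you could compute something close to the reported value from the per-document log-probabilities returned by logp:

% Sketch (assumption): relate per-document log-probabilities to perplexity.
% logProb is the first output of logp, and doclength returns the number of
% tokens in each held-out document.
logProb = logp(mdl,documentsValidation);
numWordsTotal = sum(doclength(documentsValidation));
perplexitySketch = exp(-sum(logProb)/numWordsTotal);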
Extract and Preprocess Text Data
Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event. Extract the text data from the field Description.
filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
textData = data.Description;
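To confirm the import, you can preview the first few rows of the table (an optional check, not part of the original listing):

head(data)   % display the first eight rows of the table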
Tokenize and preprocess the text data using the function preprocessText, which is listed at the end of this example.
documents = preprocessText(textData);
documents(1:5)
ans = 
  5×1 tokenizedDocument:

    6 tokens: item occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
Set aside 10% of the documents at random for validation.
numDocuments = numel(documents);
cvp = cvpartition(numDocuments,'HoldOut',0.1);
documentsTrain = documents(cvp.training);
documentsValidation = documents(cvp.test);
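The partition is random. For a reproducible split, you can seed the random number generator before calling cvpartition (an optional step, not shown in the listing above):

rng('default')   % seed the generator so the holdout split is repeatable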
Create a bag-of-words model from the training documents. Remove the words that do not appear more than two times in total. Remove any documents containing no words.
bag = bagOfWords(documentsTrain);
bag = removeInfrequentWords(bag,2);
bag = removeEmptyDocuments(bag);
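To sanity-check the pruned vocabulary, you can list the most frequent words in the bag-of-words model (an optional check using topkwords):

tbl = topkwords(bag,10)   % table of the 10 most frequent words and their counts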
Choose Number of Topics
The goal is to choose a number of topics that minimizes the perplexity compared to other numbers of topics. This is not the only consideration: models fit with larger numbers of topics can take longer to converge. To see the effects of the tradeoff, calculate both the goodness of fit and the fitting time. If the optimal number of topics is high, then you might want to choose a lower value to speed up the fitting process.
Fit some LDA models for a range of values for the number of topics. Compare the fitting time and the perplexity of each model on the held-out set of validation documents. The perplexity is the second output of the logp function. To obtain the second output without assigning the first output to anything, use the ~ symbol. The fitting time is the TimeSinceStart value for the last iteration. This value is in the History struct of the FitInfo property of the LDA model.
For a quicker fit, specify 'Solver' to be 'savb'. To suppress verbose output, set 'Verbose' to 0. This can take a few minutes to run.
numTopicsRange = [5 10 15 20 40];
for i = 1:numel(numTopicsRange)
    numTopics = numTopicsRange(i);
    
    mdl = fitlda(bag,numTopics, ...
        'Solver','savb', ...
        'Verbose',0);
    
    [~,validationPerplexity(i)] = logp(mdl,documentsValidation);
    timeElapsed(i) = mdl.FitInfo.History.TimeSinceStart(end);
end
Show the perplexity and elapsed time for each number of topics in a plot. Plot the perplexity on the left axis and the time elapsed on the right axis.
figure
yyaxis left
plot(numTopicsRange,validationPerplexity,'+-')
ylabel("Validation Perplexity")

yyaxis right
plot(numTopicsRange,timeElapsed,'o-')
ylabel("Time Elapsed (s)")
legend(["Validation Perplexity" "Time Elapsed (s)"],'Location','southeast')
xlabel("Number of Topics")
The plot suggests that fitting a model with 10–20 topics can be a good choice. The perplexity is low compared with the models fit with other numbers of topics. With this solver, the elapsed time for this many topics is also reasonable. With other solvers, you might find that increasing the number of topics leads to a better fit, but fitting the model takes longer to converge.
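If you prefer to select the value programmatically rather than by eye, a minimal sketch (not part of the original example) is to take the number of topics with the lowest validation perplexity and refit a final model:

% Pick the number of topics that minimizes the validation perplexity.
% This simple rule ignores the fitting time, so weigh that separately.
[~,idx] = min(validationPerplexity);
bestNumTopics = numTopicsRange(idx);

% Refit a final model with the chosen number of topics.
mdlFinal = fitlda(bag,bestNumTopics,'Solver','savb','Verbose',0);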
Example Preprocessing Function
The function preprocessText performs the following steps in order:

1. Convert the text data to lowercase using lower.
2. Tokenize the text using tokenizedDocument.
3. Erase punctuation using erasePunctuation.
4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
7. Lemmatize the words using normalizeWords.
function documents = preprocessText(textData)

% Convert the text data to lowercase.
cleanTextData = lower(textData);

% Tokenize the text.
documents = tokenizedDocument(cleanTextData);

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove a list of stop words.
documents = removeStopWords(documents);

% Remove words with 2 or fewer characters, and words with 15 or greater
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

% Lemmatize the words. Adding part-of-speech details first improves
% lemmatization.
documents = addPartOfSpeechDetails(documents);
documents = normalizeWords(documents,'Style','lemma');

end
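As a quick check of the pipeline, you can pass a sample sentence through the function (the sentence below is made up for illustration):

str = "The sorting machine is making lots of loud noises.";
documents = preprocessText(str)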
See Also

tokenizedDocument | removeStopWords | erasePunctuation | normalizeWords | addPartOfSpeechDetails