Analyze Text Data Using Multiword Phrases
This example shows how to analyze text using n-gram frequency counts.
An n-gram is a tuple of n consecutive words. For example, a bigram (the case when n = 2) is a pair of consecutive words such as "heavy rainfall". A unigram (the case when n = 1) is a single word. A bag-of-n-grams model records the number of times that different n-grams appear in document collections.

Using a bag-of-n-grams model, you can retain more information on word ordering in the original text data. For example, a bag-of-n-grams model is better suited for capturing short phrases which appear in the text, such as "heavy rainfall" and "thunderstorm winds".
To create a bag-of-n-grams model, use bagOfNgrams. You can input bagOfNgrams objects into other Text Analytics Toolbox functions such as wordcloud and fitlda.
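For instance, a minimal sketch of a bag-of-bigrams model on two toy documents (the sentences are made up for illustration):

% Count bigrams in two short example documents.
docs = tokenizedDocument([
    "heavy rainfall caused flooding"
    "thunderstorm winds and heavy rainfall"]);
bag = bagOfNgrams(docs)

% The bigram "heavy rainfall" occurs once in each document, so its
% column of bag.Counts sums to 2 across the two rows.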
Load and Extract Text Data
Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each event. Remove the rows with empty reports.
filename = "factoryreports.csv"; data = readtable(filename,texttype="string");
Extract the text data from the table and view the first few reports.
textData = data.Description;
textData(1:5)
ans = 5×1 string
    "Items are occasionally getting stuck in the scanner spools."
    "Loud rattling and banging sounds are coming from assembler pistons."
    "There are cuts to the power when starting the plant."
    "Fried capacitors in the assembler."
    "Mixer tripped the fuses."
Prepare Text Data for Analysis
Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessText, listed at the end of the example, performs the following steps:

1. Convert the text data to lowercase using lower.
2. Tokenize the text using tokenizedDocument.
3. Erase punctuation using erasePunctuation.
4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
7. Lemmatize the words using normalizeWords.
Use the example preprocessing function preprocessText to prepare the text data.
documents = preprocessText(textData);
documents(1:5)
ans = 
  5×1 tokenizedDocument:

    6 tokens: item occasionally get stuck scanner spool
    7 tokens: loud rattling bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
Create Word Cloud of Bigrams
Create a word cloud of bigrams by first creating a bag-of-n-grams model using bagOfNgrams, and then inputting the model to wordcloud.
To count the n-grams of length 2 (bigrams), use bagOfNgrams with the default options.
bag = bagOfNgrams(documents)
bag = 
  bagOfNgrams with properties:

          Counts: [480×921 double]
      Vocabulary: ["item" "occasionally" "get" "stuck" "scanner" "loud" "rattling" "bang" "sound" "come" "assembler" "cut" "power" "start" "fry" "capacitor" "mixer" "trip" "burst" "pipe" …]
          Ngrams: [921×2 string]
    NgramLengths: 2
       NumNgrams: 921
    NumDocuments: 480
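You can also list the most frequent bigrams numerically using topkngrams, the same function applied to trigrams later in this example:

% View the top 5 bigrams and their frequency counts.
topkngrams(bag,5)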
Visualize the bag-of-n-grams model using a word cloud.

figure
wordcloud(bag);
title("Text Data: Preprocessed Bigrams")
Fit Topic Model to Bag-of-N-Grams
A latent Dirichlet allocation (LDA) model is a topic model that discovers underlying topics in a collection of documents and infers the word probabilities in topics.
Create an LDA topic model with 10 topics using fitlda. The function fits an LDA model by treating the n-grams as single words.
mdl = fitlda(bag,10,Verbose=0);
Visualize the first four topics as word clouds.
figure
tiledlayout("flow");
for i = 1:4
    nexttile
    wordcloud(mdl,i);
    title("LDA Topic " + i)
end
The word clouds highlight commonly co-occurring bigrams in the LDA topics. The function plots the bigrams with sizes according to their probabilities for the specified LDA topics.
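To see how strongly each report relates to the discovered topics, you can infer per-document topic probabilities with transform; a minimal sketch (the variable name topicMixtures is illustrative):

% Infer topic probabilities for each document. Each row of topicMixtures
% sums to 1 and gives the mixture of the 10 topics for one report.
topicMixtures = transform(mdl,bag);
topicMixtures(1:3,:)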
Analyze Text Using Longer Phrases
To analyze text using longer phrases, specify the NgramLengths option in bagOfNgrams to be a larger value.
When working with longer phrases, it can be useful to keep stop words in the model. For example, to detect the phrase "is not happy", keep the stop words "is" and "not" in the model.
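For example, a quick sketch of what removeStopWords does to such a phrase (the sentence is made up for illustration):

% Stop words such as "is" and "not" are removed, so the negation is lost
% and the trigram "is not happy" can no longer be counted.
doc = tokenizedDocument("the machine is not happy");
removeStopWords(doc)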
Preprocess the text. Erase the punctuation using erasePunctuation, and tokenize using tokenizedDocument.
cleanTextData = erasePunctuation(textData);
documents = tokenizedDocument(cleanTextData);
To count the n-grams of length 3 (trigrams), use bagOfNgrams and specify NgramLengths to be 3.
bag = bagOfNgrams(documents,NgramLengths=3);
Visualize the bag-of-n-grams model using a word cloud. The word cloud of trigrams better shows the context of the individual words.

figure
wordcloud(bag);
title("Text Data: Trigrams")
View the top 10 trigrams and their frequency counts using topkngrams.
tbl = topkngrams(bag,10)
tbl=10×3 table
                  Ngram                   Count    NgramLength
    __________________________________    _____    ___________

    "in"       "the"         "mixer"       14           3
    "in"       "the"         "scanner"     13           3
    "blown"    "in"          "the"          9           3
    "the"      "robot"       "arm"          7           3
    "stuck"    "in"          "the"          6           3
    "is"       "spraying"    "coolant"      6           3
    "from"     "time"        "to"           6           3
    "time"     "to"          "time"         6           3
    "heard"    "in"          "the"          6           3
    "on"       "the"         "floor"        6           3
Example Preprocessing Function
The function preprocessText performs the following steps in order:

1. Convert the text data to lowercase using lower.
2. Tokenize the text using tokenizedDocument.
3. Erase punctuation using erasePunctuation.
4. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
7. Lemmatize the words using normalizeWords.
function documents = preprocessText(textData)

    % Convert the text data to lowercase.
    cleanTextData = lower(textData);

    % Tokenize the text.
    documents = tokenizedDocument(cleanTextData);

    % Erase punctuation.
    documents = erasePunctuation(documents);

    % Remove a list of stop words.
    documents = removeStopWords(documents);

    % Remove words with 2 or fewer characters, and words with 15 or more
    % characters.
    documents = removeShortWords(documents,2);
    documents = removeLongWords(documents,15);

    % Lemmatize the words.
    documents = addPartOfSpeechDetails(documents);
    documents = normalizeWords(documents,Style="lemma");

end
See Also
tokenizedDocument | bagOfNgrams | removeStopWords | erasePunctuation | removeShortWords | removeLongWords | normalizeWords | addPartOfSpeechDetails | fitlda | wordcloud | topkngrams