Create Simple Text Model for Classification
This example shows how to train a simple text classifier on word frequency counts using a bag-of-words model.
You can create a simple classification model that uses word frequency counts as predictors. This example trains a simple classification model to predict the category of factory reports using text descriptions.
Load and Extract Text Data
Load the example data. The file factoryReports.csv contains factory reports, including a text description and categorical labels for each report.
filename = "factoryReports.csv";
data = readtable(filename,'TextType','string');
head(data)
ans=8×5 table
                                 Description                                       Category           Urgency          Resolution          Cost
    _____________________________________________________________________    ____________________    ________    ____________________    _____

    "Items are occasionally getting stuck in the scanner spools."             "Mechanical Failure"    "Medium"    "Readjust Machine"         45
    "Loud rattling and banging sounds are coming from assembler pistons."     "Mechanical Failure"    "Medium"    "Readjust Machine"         35
    "There are cuts to the power when starting the plant."                    "Electronic Failure"    "High"      "Full Replacement"      16200
    "Fried capacitors in the assembler."                                      "Electronic Failure"    "High"      "Replace Components"      352
    "Mixer tripped the fuses."                                                "Electronic Failure"    "Low"       "Add to Watch List"        55
    "Burst pipe in the constructing agent is spraying coolant."               "Leak"                  "High"      "Replace Components"      371
    "A fuse is blown in the mixer."                                           "Electronic Failure"    "Low"       "Replace Components"      441
    "Things continue to tumble off of the belt."                              "Mechanical Failure"    "Low"       "Readjust Machine"         38
Convert the labels in the Category column of the table to categorical and view the distribution of the classes in the data using a histogram.
data.Category = categorical(data.Category);
figure
histogram(data.Category)
xlabel("Class")
ylabel("Frequency")
title("Class Distribution")
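If you also want the class frequencies as numbers rather than a plot, a minimal sketch (not part of the original example) using categories and countcats is shown below. The variable names classNames and classCounts are illustrative.

% Sketch: tabulate the class names and their counts.
classNames = categories(data.Category);
classCounts = countcats(data.Category);
table(classNames,classCounts,'VariableNames',{'Class','Count'})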
Partition the data into a training partition and a held-out test set. Specify the holdout percentage to be 10%.
cvp = cvpartition(data.Category,'Holdout',0.1);
dataTrain = data(cvp.training,:);
dataTest = data(cvp.test,:);
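To confirm how many observations land in each partition, you can query the cvpartition object directly. This check is a sketch and is not part of the original example.

% Sketch: inspect the number of training and held-out observations.
cvp.TrainSize
cvp.TestSize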
Extract the text data and labels from the tables.
textDataTrain = dataTrain.Description;
textDataTest = dataTest.Description;
YTrain = dataTrain.Category;
YTest = dataTest.Category;
Prepare Text Data for Analysis
Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocessText performs the following steps in order:

1. Tokenize the text using tokenizedDocument.
2. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
3. Lemmatize the words using normalizeWords.
4. Erase punctuation using erasePunctuation.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
Use the example preprocessing function preprocessText to prepare the text data.
documents = preprocessText(textDataTrain);
documents(1:5)
ans =
  5×1 tokenizedDocument:

    6 tokens: items occasionally get stuck scanner spool
    7 tokens: loud rattle bang sound come assembler piston
    4 tokens: cut power start plant
    3 tokens: fry capacitor assembler
    3 tokens: mixer trip fuse
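To sanity-check the preprocessing, you can look at the distribution of document lengths after cleaning. The snippet below is a sketch, not part of the original example, and uses doclength to count the tokens in each preprocessed document.

% Sketch: histogram of token counts per preprocessed document.
figure
histogram(doclength(documents))
xlabel("Number of Tokens")
ylabel("Number of Documents")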
Create a bag-of-words model from the tokenized documents.
bag = bagOfWords(documents)
bag =
  bagOfWords with properties:

          Counts: [432×336 double]
      Vocabulary: [1×336 string]
        NumWords: 336
    NumDocuments: 432
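For a quick qualitative view of the vocabulary, you can pass the bag-of-words model to wordcloud, which accepts bagOfWords input. This visualization is a sketch and is not part of the original example; the title string is illustrative.

% Sketch: visualize the most frequent words in the training documents.
figure
wordcloud(bag);
title("Factory Reports")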
Remove words from the bag-of-words model that do not appear more than two times in total. Remove any documents containing no words from the bag-of-words model, and remove the corresponding entries in the labels.
bag = removeInfrequentWords(bag,2);
[bag,idx] = removeEmptyDocuments(bag);
YTrain(idx) = [];
bag
bag =
  bagOfWords with properties:

          Counts: [432×155 double]
      Vocabulary: [1×155 string]
        NumWords: 155
    NumDocuments: 432
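To check which terms dominate the reduced vocabulary, you can list the most frequent words with topkwords. This check is a sketch and is not part of the original example.

% Sketch: show the ten most frequent words remaining in the model.
tbl = topkwords(bag,10)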
Train Supervised Classifier
Train a supervised classification model using the word frequency counts from the bag-of-words model and the labels.
Train a multiclass linear classification model using fitcecoc. Specify the Counts property of the bag-of-words model to be the predictors, and the event type labels to be the response. Specify the learners to be linear. These learners support sparse data input.
XTrain = bag.Counts;
mdl = fitcecoc(XTrain,YTrain,'Learners','linear')
mdl =
  CompactClassificationECOC
      ResponseName: 'Y'
        ClassNames: [Electronic Failure    Leak    Mechanical Failure    Software Failure]
    ScoreTransform: 'none'
    BinaryLearners: {6×1 cell}
      CodingMatrix: [4×6 double]

  Properties, Methods
For a better fit, you can try specifying different parameters of the linear learners. For more information on linear classification learner templates, see templateLinear.
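For example, a sketch of training with a customized linear-learner template is shown below. The parameter values and the variable names t and mdlCustom are illustrative, not tuned for this data, and are not part of the original example.

% Sketch: customize the linear learners with templateLinear, then retrain.
t = templateLinear('Learner','logistic','Regularization','ridge');
mdlCustom = fitcecoc(XTrain,YTrain,'Learners',t)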
Test Classifier
Predict the labels of the test data using the trained model and calculate the classification accuracy. The classification accuracy is the proportion of the labels that the model predicts correctly.
Preprocess the test data using the same preprocessing steps as the training data. Encode the resulting test documents as a matrix of word frequency counts according to the bag-of-words model.
documentsTest = preprocessText(textDataTest);
XTest = encode(bag,documentsTest);
Predict the labels of the test data using the trained model and calculate the classification accuracy.
YPred = predict(mdl,XTest);
acc = sum(YPred == YTest)/numel(YTest)
acc = 0.8542
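Accuracy alone can hide per-class behavior. As a sketch (not part of the original example), you can see which classes are confused with one another using confusionchart.

% Sketch: per-class breakdown of the test-set predictions.
figure
confusionchart(YTest,YPred)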
Predict Using New Data
Classify the event type of new factory reports. Create a string array containing the new factory reports.
str = [ ...
    "Coolant is pooling underneath sorter."
    "Sorter blows fuses at start up."
    "There are some very loud rattling sounds coming from the assembler."];
documentsNew = preprocessText(str);
XNew = encode(bag,documentsNew);
labelsNew = predict(mdl,XNew)
labelsNew = 3×1 categorical
     Leak
     Electronic Failure
     Mechanical Failure
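If you classify new reports repeatedly, the preprocess, encode, and predict steps can be wrapped in a small helper. The function below is a sketch; classifyReports is a hypothetical name and is not part of the example.

function labels = classifyReports(str,bag,mdl)
% Sketch: classify new report text with the trained model.
% Preprocess and encode the text the same way as the training data,
% then predict with the trained ECOC model.
documents = preprocessText(str);
X = encode(bag,documents);
labels = predict(mdl,X);
end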
Example Preprocessing Function
The function preprocessText performs the following steps in order:

1. Tokenize the text using tokenizedDocument.
2. Remove a list of stop words (such as "and", "of", and "the") using removeStopWords.
3. Lemmatize the words using normalizeWords.
4. Erase punctuation using erasePunctuation.
5. Remove words with 2 or fewer characters using removeShortWords.
6. Remove words with 15 or more characters using removeLongWords.
function documents = preprocessText(textData)

% Tokenize the text.
documents = tokenizedDocument(textData);

% Remove a list of stop words then lemmatize the words. To improve
% lemmatization, first use addPartOfSpeechDetails.
documents = addPartOfSpeechDetails(documents);
documents = removeStopWords(documents);
documents = normalizeWords(documents,'Style','lemma');

% Erase punctuation.
documents = erasePunctuation(documents);

% Remove words with 2 or fewer characters, and words with 15 or more
% characters.
documents = removeShortWords(documents,2);
documents = removeLongWords(documents,15);

end
See Also
erasePunctuation | tokenizedDocument | removeStopWords | normalizeWords | addPartOfSpeechDetails