Train a Sentiment Classifier
This example shows how to train a classifier for sentiment analysis using an annotated list of positive and negative sentiment words and a pretrained word embedding.
The pretrained word embedding plays several roles in this workflow. It converts words into numeric vectors and forms the basis for a classifier. You can then use the classifier to predict the sentiment of other words using their vector representation, and use these classifications to calculate the sentiment of a piece of text. There are four steps in training and using the sentiment classifier:
Load a pretrained word embedding.
Load an opinion lexicon listing positive and negative words.
Train a sentiment classifier using the word vectors of the positive and negative words.
Calculate the mean sentiment scores of the words in a piece of text.
Load Pretrained Word Embedding
Word embeddings map words in a vocabulary to numeric vectors. These embeddings can capture semantic details of the words so that similar words have similar vectors. They also model relationships between words through vector arithmetic. For example, the relationship Rome is to Paris as Italy is to France is described by the equation Rome − Italy + France = Paris.
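As a rough illustration of this kind of vector arithmetic, the following sketch uses a made-up two-dimensional embedding and checks which word lies nearest to Rome − Italy + France. The variable names and all vector values are hypothetical; they are not fastText values.

```matlab
% Toy embedding: hypothetical 2-D vectors, one row per word.
% Column 1 separates Italy-related from France-related words,
% column 2 separates cities from countries.
toywords = ["rome" "paris" "italy" "france"];
toyvecs = [1 1; 3 1; 1 4; 3 4];

% Pick out the vectors needed for the analogy.
rome = toyvecs(toywords == "rome",:);
italy = toyvecs(toywords == "italy",:);
france = toyvecs(toywords == "france",:);

% Analogy by vector arithmetic.
query = rome - italy + france;

% Find the word whose vector is closest (Euclidean distance) to the query.
dists = vecnorm(toyvecs - query,2,2);
[~,nearest] = min(dists);
closestword = toywords(nearest);   % "paris" for these toy vectors
```

With a real embedding, the result of the arithmetic rarely coincides exactly with a word vector, so a nearest-word lookup of this kind is what makes the analogy usable in practice.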
Load a pretrained word embedding using the fastTextWordEmbedding function. This function requires Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.
emb = fastTextWordEmbedding;
Load Opinion Lexicon
Load the positive and negative words from the opinion lexicon (also known as a sentiment lexicon) of Hu and Liu [1]. First, extract the files from the .rar file into a folder named opinion-lexicon-english, and then import the text.
Load the data using the function readlexicon, listed at the end of this example. The output data is a table with variables word, containing the words, and label, containing a categorical sentiment label, positive or negative.
data = readlexicon;
View the first few words labeled as positive.
idx = data.label == "positive";
head(data(idx,:))
ans=8×2 table
word label
____________ ________
"a+" positive
"abound" positive
"abounds" positive
"abundance" positive
"abundant" positive
"accessable" positive
"accessible" positive
"acclaim" positive
View the first few words labeled as negative.
idx = data.label == "negative";
head(data(idx,:))
ans=8×2 table
word label
_____________ ________
"2-faced" negative
"2-faces" negative
"abnormal" negative
"abolish" negative
"abominable" negative
"abominably" negative
"abominate" negative
"abomination" negative
Prepare Data for Training
To train the sentiment classifier, convert the words to word vectors using the pretrained word embedding emb. First, remove the words that do not appear in the word embedding emb.
idx = ~isVocabularyWord(emb,data.word);
data(idx,:) = [];
Set aside 10% of the words at random for testing.
numwords = size(data,1);
cvp = cvpartition(numwords,'holdout',0.1);
datatrain = data(training(cvp),:);
datatest = data(test(cvp),:);
Convert the words in the training data to word vectors using word2vec.
wordstrain = datatrain.word;
xtrain = word2vec(emb,wordstrain);
ytrain = datatrain.label;
Train Sentiment Classifier
Train a support vector machine (SVM) classifier that classifies word vectors into positive and negative categories.
mdl = fitcsvm(xtrain,ytrain);
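To give some intuition for the scores used later in this example, here is a minimal sketch of how a linear SVM scores a vector, assuming a linear kernel and no standardization of the predictors. The weights beta, the bias, and the toy vectors are hypothetical values chosen for illustration; they are not parameters of mdl.

```matlab
% Hypothetical linear SVM parameters (not taken from the trained model mdl).
beta = [1; -1];    % one weight per embedding dimension
bias = 0.5;

% Two toy "word vectors", one per row.
x = [2 1; 0 3];

% Linear decision function: a score above zero favors the positive class.
toyscores = x*beta + bias;   % [1.5; -2.5]

% Map the sign of each score to a label.
toylabels = repmat("negative",size(toyscores));
toylabels(toyscores > 0) = "positive";
```

The magnitude of a score reflects how far the vector lies from the decision boundary, which is why scores can be used as word sizes in the word clouds and averaged across the words of a document.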
Test Classifier
Convert the words in the test data to word vectors using word2vec.
wordstest = datatest.word;
xtest = word2vec(emb,wordstest);
ytest = datatest.label;
Predict the sentiment labels of the test word vectors.
[ypred,scores] = predict(mdl,xtest);
Visualize the classification accuracy in a confusion matrix.
figure
confusionchart(ytest,ypred);
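A single accuracy figure can complement the confusion matrix. The sketch below uses made-up labels (ytruedemo and ypreddemo are hypothetical names) so that it runs on its own; in the example workspace, the same expression could be applied to ytest and ypred.

```matlab
% Made-up true and predicted labels (hypothetical, for illustration only).
ytruedemo = categorical(["positive";"negative";"negative";"positive";"negative"]);
ypreddemo = categorical(["positive";"negative";"positive";"positive";"negative"]);

% Fraction of predictions that match the true labels.
accuracy = mean(ypreddemo == ytruedemo);
```

Accuracy alone hides which class is misclassified more often, so it is best read alongside the confusion matrix rather than instead of it.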
Visualize the classifications in word clouds. Plot the words with positive and negative sentiments in word clouds, with word sizes corresponding to the prediction scores.
figure
subplot(1,2,1)
idx = ypred == "positive";
wordcloud(wordstest(idx),scores(idx,1));
title("Predicted Positive Sentiment")
subplot(1,2,2)
wordcloud(wordstest(~idx),scores(~idx,2));
title("Predicted Negative Sentiment")
Calculate Sentiment of Collections of Text
To calculate the sentiment of a piece of text, for example an update on social media, predict the sentiment score of each word in the text and take the mean sentiment score.
filename = "weekendUpdates.xlsx";
tbl = readtable(filename,'TextType','string');
textdata = tbl.TextData;
textdata(1:10)
ans = 10×1 string array
"happy anniversary! ❤ next stop: paris! ✈ #vacation"
"haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
"getting ready for saturday night 🍕 #yum #weekend 😎"
"say it with me - i need a #vacation!!! ☹"
"😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"
"my last #weekend before the exam 😢 👎."
"can’t believe my #vacation is over 😢 so unfair"
"can’t wait for tennis this #weekend 🎾🍓🥂 😀"
"i had so much fun! 😀😀😀 best trip ever! 😀😀😀 #vacation #weekend"
"hot weather and air con broke in car 😢 #sweaty #roadtrip #vacation"
Create a function that tokenizes and preprocesses the text data so it can be used for analysis. The function preprocesstext, listed at the end of the example, performs the following steps in order:
Tokenize the text using tokenizedDocument.
Erase punctuation using erasePunctuation.
Remove stop words (such as "and", "of", and "the") using removeStopWords.
Convert to lowercase using lower.
Use the preprocessing function preprocesstext to prepare the text data. This step can take a few minutes to run.
documents = preprocesstext(textdata);
Remove the words from the documents that do not appear in the word embedding emb.
idx = ~isVocabularyWord(emb,documents.Vocabulary);
documents = removeWords(documents,idx);
To visualize how well the sentiment classifier generalizes to the new text, classify the sentiments of the words that occur in the text but not in the training data, and visualize them in word clouds. Use the word clouds to manually check that the classifier behaves as expected.
words = documents.Vocabulary;
words(ismember(words,wordstrain)) = [];
vec = word2vec(emb,words);
[ypred,scores] = predict(mdl,vec);
figure
subplot(1,2,1)
idx = ypred == "positive";
wordcloud(words(idx),scores(idx,1));
title("Predicted Positive Sentiment")
subplot(1,2,2)
wordcloud(words(~idx),scores(~idx,2));
title("Predicted Negative Sentiment")
Calculate the mean sentiment score of the updates. For each document, convert the words to word vectors, predict the sentiment score of each word vector, and then calculate the mean sentiment score. The scores are the raw SVM scores for the positive class (the first column of scores); no score-to-posterior transformation is applied.
sentimentscore = zeros(1,numel(documents)); % preallocate one score per document
for i = 1:numel(documents)
    words = string(documents(i));
    vec = word2vec(emb,words);
    [~,scores] = predict(mdl,vec);
    sentimentscore(i) = mean(scores(:,1));
end
View the predicted sentiment scores with the text data. Scores greater than 0 correspond to positive sentiment, scores less than 0 correspond to negative sentiment, and scores close to 0 correspond to neutral sentiment.
table(sentimentscore', textdata)
ans=50×2 table
var1 textdata
__________ ___________________________________________________________________________________________________________________________
1.8382 "happy anniversary! ❤ next stop: paris! ✈ #vacation"
1.294 "haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
1.0922 "getting ready for saturday night 🍕 #yum #weekend 😎"
0.094709 "say it with me - i need a #vacation!!! ☹"
1.4073 "😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"
-0.8356 "my last #weekend before the exam 😢 👎."
-1.3556 "can’t believe my #vacation is over 😢 so unfair"
1.4312 "can’t wait for tennis this #weekend 🎾🍓🥂 😀"
3.0458 "i had so much fun! 😀😀😀 best trip ever! 😀😀😀 #vacation #weekend"
-0.39243 "hot weather and air con broke in car 😢 #sweaty #roadtrip #vacation"
0.8028 "🎉 check the out-of-office crew, we are officially on #vacation!! 😎"
0.38217 "well that wasn’t how i expected this #weekend to go 👎 total washout!! 😢"
3.03 "so excited for my bestie to visit this #weekend! 😀 ❤ 😀"
2.3849 "who needs a #vacation when the weather is this good ☀ 😎"
-0.0006176 "i love meetings in summer that run into the weekend! wait that was sarcasm. bring on the aircon apocalypse! 👎 ☹ #weekend"
0.52992 "you know we all worked hard for this! we totes deserve this 🎉 #vacation 🎉 ibiza ain’t gonna know what hit em 😎"
⋮
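The averaging step can be sketched without the embedding or the classifier. The per-word scores below are hypothetical stand-ins for the first column of scores returned by predict, and the variable names are made up for this illustration; only the averaging logic mirrors the loop above.

```matlab
% Hypothetical per-word sentiment scores (stand-ins for classifier output).
scoredwords = ["happy" "fun" "broke" "unfair"];
wordscores = [1.5 2.0 -1.0 -1.5];

% Tokens of one toy document; "car" has no score and is skipped.
doctokens = ["happy" "fun" "car" "broke"];

% Look up the score of each token that has one, then average.
[found,loc] = ismember(doctokens,scoredwords);
docscore = mean(wordscores(loc(found)));   % (1.5 + 2.0 - 1.0)/3
```

If none of a document's words has a score, the selection is empty and mean returns NaN, so such documents may need separate handling before the scores are compared.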
Sentiment Lexicon Reading Function
This function reads the positive and negative words from the sentiment lexicon and returns a table. The table contains variables word and label, where label contains categorical values positive and negative corresponding to the sentiment of each word.
function data = readlexicon
% Read positive words
fidpositive = fopen(fullfile('opinion-lexicon-english','positive-words.txt'));
c = textscan(fidpositive,'%s','CommentStyle',';');
wordspositive = string(c{1});
% Read negative words
fidnegative = fopen(fullfile('opinion-lexicon-english','negative-words.txt'));
c = textscan(fidnegative,'%s','CommentStyle',';');
wordsnegative = string(c{1});
fclose all;
% Create table of labeled words
words = [wordspositive;wordsnegative];
labels = categorical(nan(numel(words),1));
labels(1:numel(wordspositive)) = "positive";
labels(numel(wordspositive)+1:end) = "negative";
data = table(words,labels,'VariableNames',{'word','label'});
end
Preprocessing Function
The function preprocesstext performs the following steps:
Tokenize the text using tokenizedDocument.
Erase punctuation using erasePunctuation.
Remove stop words (such as "and", "of", and "the") using removeStopWords.
Convert to lowercase using lower.
function documents = preprocesstext(textdata)
% Tokenize the text.
documents = tokenizedDocument(textdata);
% Erase punctuation.
documents = erasePunctuation(documents);
% Remove a list of stop words.
documents = removeStopWords(documents);
% Convert to lowercase.
documents = lower(documents);
end
Bibliography
[1] Hu, Minqing, and Bing Liu. "Mining and Summarizing Customer Reviews." In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168-177. ACM, 2004.
See Also
tokenizedDocument | erasePunctuation | removeStopWords | word2vec | fastTextWordEmbedding
Related Topics
- Analyze Sentiment in Text
- Generate Domain Specific Sentiment Lexicon
- Create Simple Text Model for Classification
- Analyze Text Data Containing Emojis
- Analyze Text Data Using Topic Models
- Analyze Text Data Using Multiword Phrases
- Classify Text Data Using Deep Learning
- Generate Text Using Deep Learning (Deep Learning Toolbox)