Train a Sentiment Classifier

This example shows how to train a classifier for sentiment analysis using an annotated list of positive and negative sentiment words and a pretrained word embedding.

The pretrained word embedding plays a central role in this workflow: it converts words into numeric vectors, and those vectors form the input features of the classifier. You can then use the classifier to predict the sentiment of other words from their vector representations, and use these predictions to calculate the sentiment of a piece of text. There are four steps in training and using the sentiment classifier:

  • Load a pretrained word embedding.

  • Load an opinion lexicon listing positive and negative words.

  • Train a sentiment classifier using the word vectors of the positive and negative words.

  • Calculate the mean sentiment scores of the words in a piece of text.

Load Pretrained Word Embedding

Word embeddings map words in a vocabulary to numeric vectors. These embeddings can capture semantic details of the words so that similar words have similar vectors. They also model relationships between words through vector arithmetic. For example, the relationship Rome is to Italy as Paris is to France is described by the equation Rome - Italy + France = Paris.
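
As a rough illustration of this arithmetic, the following sketch uses made-up two-dimensional vectors (not values from any real embedding) and finds the word nearest to Rome - Italy + France by Euclidean distance. The variable names and numbers are assumptions chosen for the example. It needs only base MATLAB (string arrays and implicit expansion, so R2017a or later).

```matlab
% Toy 2-D "embedding" with made-up values (not taken from fastText).
words = ["Rome" "Italy" "Paris" "France"];
vecs  = [1.0 3.0;    % Rome
         1.0 1.0;    % Italy
         4.0 3.2;    % Paris
         4.0 1.0];   % France

% Analogy arithmetic: Rome - Italy + France should land near Paris.
query = vecs(1,:) - vecs(2,:) + vecs(4,:);

% Find the nearest word by Euclidean distance.
dists = sqrt(sum((vecs - query).^2, 2));
[~,nearest] = min(dists);
disp(words(nearest))
```

With a real embedding the result is only approximately equal to the target vector, which is why the lookup is a nearest-neighbor search rather than an exact match.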

Load a pretrained word embedding using the fastTextWordEmbedding function. This function requires the Text Analytics Toolbox™ Model for fastText English 16 Billion Token Word Embedding support package. If this support package is not installed, then the function provides a download link.

emb = fastTextWordEmbedding;

Load Opinion Lexicon

Load the positive and negative words from the opinion lexicon (also known as a sentiment lexicon) of Hu and Liu [1]. First, extract the files from the .rar file into a folder named opinion-lexicon-english, and then import the text.

Load the data using the function readlexicon, listed at the end of this example. The output data is a table with the variables word, containing the words, and label, containing a categorical sentiment label, positive or negative.

data = readlexicon;

View the first few words labeled as positive.

idx = data.label == "positive";
head(data(idx,:))
ans=8×2 table
        word         label  
    ____________    ________
    "a+"            positive
    "abound"        positive
    "abounds"       positive
    "abundance"     positive
    "abundant"      positive
    "accessable"    positive
    "accessible"    positive
    "acclaim"       positive

View the first few words labeled as negative.

idx = data.label == "negative";
head(data(idx,:))
ans=8×2 table
        word          label  
    _____________    ________
    "2-faced"        negative
    "2-faces"        negative
    "abnormal"       negative
    "abolish"        negative
    "abominable"     negative
    "abominably"     negative
    "abominate"      negative
    "abomination"    negative

Prepare Data for Training

To train the sentiment classifier, convert the words to word vectors using the pretrained word embedding emb. First, remove the words that do not appear in the word embedding emb.

idx = ~isVocabularyWord(emb,data.word);
data(idx,:) = [];
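
The filtering step matters because word2vec returns a row of NaN values for words outside the embedding vocabulary, and such rows carry no information for training. The sketch below mimics the same logical-index deletion in base MATLAB, with ismember standing in for isVocabularyWord. The vocabulary and words are hypothetical.

```matlab
% Hypothetical vocabulary standing in for the embedding's word list.
vocab = ["good" "bad" "happy"];

% Small labeled table in the same layout as the lexicon table.
word  = ["good"; "zzzqx"; "happy"];
label = categorical(["positive"; "negative"; "positive"]);
data  = table(word,label);

% Mark the rows whose word is not in the vocabulary, then delete them.
idx = ~ismember(data.word, vocab);
data(idx,:) = [];
disp(data)
```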

Set aside 10% of the words at random for testing.

numwords = size(data,1);
cvp = cvpartition(numwords,'holdout',0.1);
datatrain = data(training(cvp),:);
datatest = data(test(cvp),:);
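
To make the holdout split concrete, here is a minimal base-MATLAB sketch of the same idea using randperm instead of cvpartition. It is an approximation for illustration only: the word count is hypothetical, and cvpartition may choose the test set size and members differently.

```matlab
rng(0);                           % fix the seed so the split is repeatable
numwords = 20;                    % hypothetical number of lexicon words
numtest  = round(0.1*numwords);   % hold out 10% for testing

% Shuffle the row indices and mark the first numtest of them as test rows.
shuffled = randperm(numwords);
testidx  = false(numwords,1);
testidx(shuffled(1:numtest)) = true;
trainidx = ~testidx;              % every remaining row is a training row

fprintf("train: %d, test: %d\n", sum(trainidx), sum(testidx));
```

The logical vectors trainidx and testidx play the same role as training(cvp) and test(cvp): each row of the table lands in exactly one of the two sets.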

Convert the words in the training data to word vectors using word2vec.

wordstrain = datatrain.word;
xtrain = word2vec(emb,wordstrain);
ytrain = datatrain.label;

Train Sentiment Classifier

Train a support vector machine (SVM) classifier that classifies word vectors into positive and negative categories.

mdl = fitcsvm(xtrain,ytrain);
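
For two-class problems, fitcsvm uses a linear kernel unless told otherwise, so the trained model scores a word vector with a weighted sum plus a bias, and the sign of that score decides the class. The following sketch shows that decision rule with hypothetical weights and two-dimensional inputs. It does not read anything from mdl, and the real model works on the full embedding dimension.

```matlab
% Hypothetical linear SVM parameters for 2-D inputs (not taken from mdl).
beta = [0.8; -0.5];    % weight vector
bias = 0.1;            % bias term

% Two example "word vectors", one per row.
x = [ 1.0  0.2;
     -0.5  1.0];

% Signed score: positive values fall on the "positive" side of the boundary.
f = x*beta + bias;

% Turn the sign of the score into a label.
labels = repmat("negative", size(f));
labels(f > 0) = "positive";
disp(table(f,labels))
```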

Test Classifier

Convert the words in the test data to word vectors using word2vec.

wordstest = datatest.word;
xtest = word2vec(emb,wordstest);
ytest = datatest.label;

Predict the sentiment labels of the test word vectors.

[ypred,scores] = predict(mdl,xtest);

Visualize the classification accuracy in a confusion matrix.

figure
confusionchart(ytest,ypred);
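
The chart can be complemented with a single accuracy number and the four confusion counts. The sketch below computes them in base MATLAB on small made-up label vectors. In the example itself you would use the real ytest and ypred instead of these hypothetical values.

```matlab
% Hypothetical true and predicted labels for five test words.
ytest = categorical(["positive";"positive";"negative";"negative";"negative"]);
ypred = categorical(["positive";"negative";"negative";"negative";"positive"]);

% Fraction of test words whose predicted label matches the true label.
accuracy = mean(ypred == ytest);

% Confusion counts, treating "positive" as the class of interest.
tp = sum(ypred == "positive" & ytest == "positive");   % true positives
fn = sum(ypred == "negative" & ytest == "positive");   % false negatives
fp = sum(ypred == "positive" & ytest == "negative");   % false positives
tn = sum(ypred == "negative" & ytest == "negative");   % true negatives

fprintf("accuracy = %.2f (tp=%d, fn=%d, fp=%d, tn=%d)\n", accuracy, tp, fn, fp, tn);
```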

Visualize the classifications in word clouds. Plot the words with positive and negative sentiments in word clouds with word sizes corresponding to the prediction scores.

figure
subplot(1,2,1)
idx = ypred == "positive";
wordcloud(wordstest(idx),scores(idx,1));
title("predicted positive sentiment")
subplot(1,2,2)
wordcloud(wordstest(~idx),scores(~idx,2));
title("predicted negative sentiment")

Calculate Sentiment of Collections of Text

To calculate the sentiment of a piece of text, for example an update on social media, predict the sentiment score of each word in the text and take the mean sentiment score.

filename = "weekendUpdates.xlsx";
tbl = readtable(filename,'TextType','string');
textdata = tbl.TextData;
textdata(1:10)
ans = 10×1 string array
    "happy anniversary! ❤ next stop: paris! ✈ #vacation"
    "haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
    "getting ready for saturday night 🍕 #yum #weekend 😎"
    "say it with me - i need a #vacation!!! ☹"
    "😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"
    "my last #weekend before the exam 😢 👎."
    "can’t believe my #vacation is over 😢 so unfair"
    "can’t wait for tennis this #weekend 🎾🍓🥂 😀"
    "i had so much fun! 😀😀😀 best trip ever! 😀😀😀 #vacation #weekend"
    "hot weather and air con broke in car 😢 #sweaty #roadtrip #vacation"

Create a function that tokenizes and preprocesses the text data so that it can be used for analysis. The function preprocesstext, listed at the end of the example, performs the following steps in order:

  1. Tokenize the text using tokenizedDocument.

  2. Erase punctuation using erasePunctuation.

  3. Remove stop words (such as "and", "of", and "the") using removeStopWords.

  4. Convert to lowercase using lower.
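
For intuition about what these four steps do to a single string, here is a simplified base-MATLAB sketch. It is not equivalent to the toolbox functions: it splits on whitespace only, strips every non-word character, and uses a tiny hypothetical stop word list, whereas tokenizedDocument also recognizes details such as hashtags and emoji.

```matlab
str = "Best trip ever, and the weather was great!";
stopwords = ["and" "the" "was" "of"];   % tiny hypothetical stop word list

tokens = split(str);                                    % 1. tokenize on whitespace
tokens = regexprep(tokens, "[^\w\s]", "");              % 2. erase punctuation
tokens = tokens(~ismember(lower(tokens), stopwords));   % 3. remove stop words
tokens = lower(tokens);                                 % 4. convert to lowercase
tokens(tokens == "") = [];                              % drop tokens left empty

disp(tokens')
```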

Use the preprocessing function preprocesstext to prepare the text data. This step can take a few minutes to run.

documents = preprocesstext(textdata);

Remove the words from the documents that do not appear in the word embedding emb.

idx = ~isVocabularyWord(emb,documents.Vocabulary);
documents = removeWords(documents,idx);

To visualize how well the sentiment classifier generalizes to new text, classify the words that occur in the text but not in the training data, and visualize them in word clouds. Use the word clouds to manually check that the classifier behaves as expected.

words = documents.Vocabulary;
words(ismember(words,wordstrain)) = [];
vec = word2vec(emb,words);
[ypred,scores] = predict(mdl,vec);
figure
subplot(1,2,1)
idx = ypred == "positive";
wordcloud(words(idx),scores(idx,1));
title("predicted positive sentiment")
subplot(1,2,2)
wordcloud(words(~idx),scores(~idx,2));
title("predicted negative sentiment")

Calculate the mean sentiment score of the updates. For each document, convert the words to word vectors, predict the sentiment score of each word vector, and take the mean of the positive-class scores (the first column of scores). The loop uses the raw SVM scores. It does not apply a score-to-posterior transform, so the results are not probabilities.

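% Preallocate one score per document.
sentimentscore = zeros(1,numel(documents));
for i = 1:numel(documents)
    words = string(documents(i));
    vec = word2vec(emb,words);
    [~,scores] = predict(mdl,vec);
    % Mean of the positive-class scores (column 1) over the document's words.
    sentimentscore(i) = mean(scores(:,1));
end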
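
The averaging step itself is simple, as the following base-MATLAB sketch with hypothetical per-word scores shows. It also covers one edge case worth knowing about: if a document has no words left after preprocessing and vocabulary filtering, the mean of an empty score vector is NaN.

```matlab
% Hypothetical positive-class scores for the three words of one document.
wordscores = [1.9; 0.4; -0.2];
docscore = mean(wordscores);      % one strongly positive word dominates

% A document with no remaining words has an empty score vector.
emptyscore = mean(zeros(0,1));    % the mean of an empty column is NaN

fprintf("document score: %.2f, empty document: %g\n", docscore, emptyscore);
```

Because the score is a plain mean, a single word with a large score can outweigh several mildly scored words in a short update.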

View the predicted sentiment scores with the text data. Scores greater than 0 correspond to positive sentiment, scores less than 0 correspond to negative sentiment, and scores close to 0 correspond to neutral sentiment. Because these are means of raw SVM scores, their magnitudes are not bounded and are not probabilities.

table(sentimentscore', textdata)
ans=50×2 table
       var1                                                                textdata                                                          
    __________    ___________________________________________________________________________________________________________________________
        1.8382    "happy anniversary! ❤ next stop: paris! ✈ #vacation"                                                                       
         1.294    "haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"                                                           
        1.0922    "getting ready for saturday night 🍕 #yum #weekend 😎"                                                                     
      0.094709    "say it with me - i need a #vacation!!! ☹"                                                                                 
        1.4073    "😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"                                          
       -0.8356    "my last #weekend before the exam 😢 👎."                                                                                  
       -1.3556    "can’t believe my #vacation is over 😢 so unfair"                                                                          
        1.4312    "can’t wait for tennis this #weekend 🎾🍓🥂 😀"                                                                            
        3.0458    "i had so much fun! 😀😀😀 best trip ever! 😀😀😀 #vacation #weekend"                                                      
      -0.39243    "hot weather and air con broke in car 😢 #sweaty #roadtrip #vacation"                                                      
        0.8028    "🎉 check the out-of-office crew, we are officially on #vacation!! 😎"                                                     
       0.38217    "well that wasn’t how i expected this #weekend to go 👎 total washout!! 😢"                                                
          3.03    "so excited for my bestie to visit this #weekend! 😀 ❤ 😀"                                                                 
        2.3849    "who needs a #vacation when the weather is this good ☀ 😎"                                                                 
    -0.0006176    "i love meetings in summer that run into the weekend! wait that was sarcasm. bring on the aircon apocalypse! 👎 ☹ #weekend"
       0.52992    "you know we all worked hard for this! we totes deserve this 🎉 #vacation 🎉 ibiza ain’t gonna know what hit em 😎"        
      ⋮
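
If discrete labels are more convenient than raw scores, one option is to threshold the mean scores. The sketch below applies a neutral band to three of the scores shown in the table. The band width of 0.1 is an arbitrary assumption for illustration, not a value from this example, and should be tuned for real data.

```matlab
% Three mean sentiment scores copied from the table above.
scores = [1.8382; -0.0006176; -1.3556];

% Hypothetical half-width of the band treated as neutral.
band = 0.1;

% Start with every document neutral, then overwrite the clear cases.
labels = repmat("neutral", size(scores));
labels(scores >  band) = "positive";
labels(scores < -band) = "negative";

disp(table(scores,labels))
```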

Sentiment Lexicon Reading Function

This function reads the positive and negative words from the sentiment lexicon and returns a table. The table contains the variables word and label, where label contains the categorical values positive and negative corresponding to the sentiment of each word.

function data = readlexicon
% read positive words
fidpositive = fopen(fullfile('opinion-lexicon-english','positive-words.txt'));
c = textscan(fidpositive,'%s','commentstyle',';');
wordspositive = string(c{1});
% read negative words
fidnegative = fopen(fullfile('opinion-lexicon-english','negative-words.txt'));
c = textscan(fidnegative,'%s','commentstyle',';');
wordsnegative = string(c{1});
fclose all;
% create table of labeled words
words = [wordspositive;wordsnegative];
labels = categorical(nan(numel(words),1));
labels(1:numel(wordspositive)) = "positive";
labels(numel(wordspositive)+1:end) = "negative";
data = table(words,labels,'variablenames',{'word','label'});
end

Preprocessing Function

The function preprocesstext performs the following steps:

  1. Tokenize the text using tokenizedDocument.

  2. Erase punctuation using erasePunctuation.

  3. Remove stop words (such as "and", "of", and "the") using removeStopWords.

  4. Convert to lowercase using lower.

function documents = preprocesstext(textdata)
% Tokenize the text.
documents = tokenizedDocument(textdata);
% Erase punctuation.
documents = erasePunctuation(documents);
% Remove a list of stop words.
documents = removeStopWords(documents);
% Convert to lowercase.
documents = lower(documents);
end

Bibliography

  1. Hu, Minqing, and Bing Liu. "Mining and Summarizing Customer Reviews." In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168-177. ACM, 2004.
