Generate Domain Specific Sentiment Lexicon

This example shows how to generate a lexicon for sentiment analysis using 10-K and 10-Q financial reports.

Sentiment analysis allows you to automatically summarize the sentiment in a given piece of text. For example, you can assign positive sentiment to the text "This company is showing strong growth." and negative sentiment to the text "This other company is accused of misleading consumers." You can also assign the text "This company is showing extremely strong growth." a stronger sentiment score than the text "This company is showing strong growth."

Sentiment analysis algorithms such as VADER rely on annotated lists of words called sentiment lexicons. For example, VADER uses a sentiment lexicon with words annotated with a sentiment score ranging from -1 to 1, where scores close to 1 indicate strong positive sentiment, scores close to -1 indicate strong negative sentiment, and scores close to zero indicate neutral sentiment.

To analyze the sentiment of text using the VADER algorithm, use the vaderSentimentScores function. If the sentiment lexicon used by the vaderSentimentScores function does not suit the data you are analyzing, for example, if you have a domain-specific data set like medical or engineering data, then you can generate your own custom sentiment lexicon using a small set of seed words.
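As a minimal sketch of the default behavior, the following code scores two sentences using the built-in VADER lexicon. The variable names strexample, docsexample, and scoresdefault are illustrative only and are not used elsewhere in this example.

```matlab
% Two sentences that differ only by the intensifier "extremely".
strexample = [
    "This company is showing strong growth."
    "This company is showing extremely strong growth."];

% vaderSentimentScores operates on tokenized documents.
docsexample = tokenizedDocument(strexample);

% Compound scores computed with the default lexicon, one per document.
scoresdefault = vaderSentimentScores(docsexample)
```

Scores obtained this way can serve as a baseline when you judge whether a custom lexicon suits your data better.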

This example shows how to generate a sentiment lexicon given a collection of seed words using a graph-based approach based on [1]:

1. Train a word embedding that models the similarity between words using the training data.
2. Create a simplified graph representing the embedding with nodes corresponding to words and edges weighted by similarity.
3. To determine words with strong polarity, identify the words connected to multiple seed words through short but heavily weighted paths.

Load Data

Download the 10-K and 10-Q financial reports data from the Securities and Exchange Commission (SEC) via the Electronic Data Gathering, Analysis, and Retrieval (EDGAR) API [2] using the financeReports helper function attached to this example as a supporting file. To access this file, open this example as a live script. The financeReports function downloads 10-K and 10-Q reports for the specified year, quarter, and maximum character length.

Download a set of 20,000 reports from the fourth quarter of 2019. Depending on the sizes of the reports, this can take some time to run.

year = 2019;
qtr = 4;
textdata = financeReports(year,qtr,'MaxNumReports',20000);

Downloading 10-K and 10-Q reports...
Done.
Elapsed time is 1799.718710 seconds.

Define sets of positive and negative seed words to use with this data. The seed words must appear at least once in the text data; otherwise, they are ignored.

seedspositive = ["achieve" "advantage" "better" "creative" "efficiency" ...
    "efficiently" "enhance" "greater" "improved" "improving" ...
    "innovation" "innovations" "innovative" "opportunities" "profitable" ...
    "profitably" "strength" "strengthen" "strong" "success"]';
seedsnegative = ["adverse" "adversely" "against" "complaint" "concern" ...
    "damages" "default" "deficiencies" "disclosed" "failure" ...
    "fraud" "impairment" "litigation" "losses" "misleading" ...
    "omit" "restated" "restructuring" "termination" "weaknesses"]';

Prepare Text Data

Create a function named preprocesstext that prepares the text data for analysis. The preprocesstext function, listed at the end of the example, performs the following steps:

1. Erase any URLs.
2. Tokenize the text.
3. Remove tokens containing digits.
4. Convert the text to lowercase.
5. Remove any words with two or fewer characters.
6. Remove any stop words.

Preprocess the text using the preprocesstext function. Depending on the size of the text data, this can take some time to run.

documents = preprocesstext(textdata);

Visualize the preprocessed text data in a word cloud.

figure
wordcloud(documents);

Train Word Embedding

Word embeddings map words in a vocabulary to numeric vectors. These embeddings can capture semantic details of the words so that similar words have similar vectors.
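To make "similar vectors" concrete, the following sketch compares hand-written vectors using cosine similarity. The three vectors are hypothetical values chosen for illustration; a trained embedding learns its vectors from the data and typically uses many more dimensions.

```matlab
% Hypothetical 3-D word vectors (not taken from a trained embedding).
vecstrong  = [0.9 0.1 0.2];
vecrobust  = [0.8 0.2 0.1];
veclawsuit = [-0.1 0.9 0.3];

% Cosine similarity is close to 1 for vectors that point the same way.
cossim = @(a,b) dot(a,b) / (norm(a)*norm(b));

% Related words are expected to score higher than unrelated words.
simrelated   = cossim(vecstrong,vecrobust)
simunrelated = cossim(vecstrong,veclawsuit)
```

With these values, simrelated is much larger than simunrelated, which is the property the word graph relies on later in this example.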

Train a word embedding that models the similarity between words using the training data. Specify a context window of size 25 and discard words that appear fewer than 20 times. Depending on the size of the text data, this can take some time to run.

emb = trainWordEmbedding(documents,'Window',25,'MinCount',20);

Training: 100% Loss: 1.44806  Remaining time: 0 hours 0 minutes.
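Seed words that do not make it into the embedding vocabulary, for example because they appear fewer than 20 times, are ignored when the sentiment scores are computed. The following sketch shows one way to list such seeds. It uses a hypothetical toy vocabulary (vocabularytoy) and seed list (seedstoy) so that it runs on its own; in this example, you would pass the seed words defined earlier and the vocabulary of the trained embedding instead.

```matlab
% Hypothetical vocabulary and seed words, for illustration only.
vocabularytoy = ["strong" "growth" "losses" "litigation"];
seedstoy = ["strong" "profitably" "losses"];

% Seeds that are absent from the vocabulary contribute nothing later.
missingseeds = seedstoy(~ismember(seedstoy,vocabularytoy))
```

Here, only "profitably" is reported as missing. If many seeds are missing in practice, consider choosing more frequent seed words or lowering the minimum count.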

Create Word Graph

Create a simplified graph representing the embedding with nodes corresponding to words and edges weighted by similarity.

Create a weighted graph with nodes corresponding to words in the vocabulary, edges denoting whether the words are within a neighborhood of 7 of each other, and weights corresponding to the cosine distance between the corresponding word vectors in the embedding.

For each word in the vocabulary, find the nearest 7 words and their cosine distances.

numneighbors = 7;
vocabulary = emb.Vocabulary;
wordvectors = word2vec(emb,vocabulary);
[nearestwords,dist] = vec2word(emb,wordvectors,numneighbors);

To create the graph, use the graph function and specify pairwise source and target nodes, and specify their edge weights.

Define the source and target nodes.

sourcenodes = repelem(vocabulary,numneighbors);
targetnodes = reshape(nearestwords,1,[]);

Calculate the edge weights.

edgeweights = reshape(dist,1,[]);

Create a graph connecting each word with its neighbors, with edge weights given by the distances computed above.

wordgraph = graph(sourcenodes,targetnodes,edgeweights,vocabulary);

Remove the repeated edges using the simplify function.

wordgraph = simplify(wordgraph);
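The effect of simplify can be seen on a small graph. The sketch below uses hypothetical words and weights, and assumes that a nearest-neighbor list contains each word itself (which produces self-loops) as well as mutual neighbors (which produce duplicate edges).

```matlab
% Hypothetical node names and neighbor pairs, for illustration only.
wordstoy = ["strong" "growth" "losses"];
stoy = ["strong" "strong" "growth" "growth" "losses" "losses"];
ttoy = ["strong" "growth" "growth" "strong" "losses" "growth"];
wtoy = [0 0.2 0 0.2 0 0.5];

% The raw graph contains three self-loops and one duplicated edge.
graphtoy = graph(stoy,ttoy,wtoy,wordstoy);
numedgesbefore = numedges(graphtoy)

% simplify removes self-loops and keeps one edge per node pair.
graphsimple = simplify(graphtoy);
numedgesafter = numedges(graphsimple)
```

In this toy case, the six raw edges reduce to two: one between "strong" and "growth" and one between "growth" and "losses".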

Visualize the section of the word graph connected to the word "losses".

word = "losses";
idx = findnode(wordgraph,word);
nbrs = neighbors(wordgraph,idx);
wordsubgraph = subgraph(wordgraph,[idx; nbrs]);
figure
plot(wordsubgraph)
title("Words connected to """ + word + """")

Generate Sentiment Scores

To determine words with strong polarity, identify the words connected to multiple seed words through short but heavily weighted paths.

Initialize an array of sentiment scores corresponding to each word in the vocabulary.

sentimentscores = zeros([1 numel(vocabulary)]);

Iteratively traverse the graph and update the sentiment scores.

Traverse the graph at different depths. For each depth, calculate the positive and negative polarity of the words by using the positive and negative seeds to propagate sentiment to the rest of the graph.

For each depth:

1. Calculate the positive and negative polarity scores.
2. Account for the difference in overall mass of positive and negative flow in the graph.
3. For each node-word, normalize the difference of its two scores.

After running the algorithm, if a phrase has a higher positive than negative polarity score, then its final polarity will be positive, and negative otherwise.

Specify a maximum path length of 4.

maxpathlength = 4;

Iteratively traverse the graph and calculate the sum of the sentiment scores.

for depth = 1:maxpathlength
    
    % calculate polarity scores.
    polaritypositive = polarityscores(seedspositive,vocabulary,wordgraph,depth);
    polaritynegative = polarityscores(seedsnegative,vocabulary,wordgraph,depth);
    
    % account for difference in overall mass of positive and negative flow
    % in the graph.
    b = sum(polaritypositive) / sum(polaritynegative);
        
    % calculate new sentiment scores.
    sentimentscoresnew = polaritypositive - b * polaritynegative;
    sentimentscoresnew = normalize(sentimentscoresnew,'range',[-1,1]);
    
    % add scores to sum.
    sentimentscores = sentimentscores + sentimentscoresnew;
end

Normalize the sentiment scores by the number of iterations.

sentimentscores = sentimentscores / maxpathlength;
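The balancing and rescaling steps inside the loop can be checked on a few numbers. The following sketch uses hypothetical polarity values (polpositivetoy and polnegativetoy) in which the negative flow is deliberately larger overall than the positive flow.

```matlab
% Hypothetical polarity scores for four words, for illustration only.
polpositivetoy = [1.0 0.6 0.1 0.0];
polnegativetoy = [0.0 0.2 0.8 2.0];

% Ratio of the total positive mass to the total negative mass.
btoy = sum(polpositivetoy) / sum(polnegativetoy);

% Scale the negative scores so both sides carry the same total mass,
% take the difference, and map the result to the range [-1, 1].
scorestoy = polpositivetoy - btoy * polnegativetoy;
scorestoy = normalize(scorestoy,'range',[-1,1])
```

Without the factor btoy, the larger negative mass would pull every word toward a negative score. With it, the second word stays positive and the third stays negative.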

Create a table containing the vocabulary and the corresponding sentiment scores.

tbl = table;
tbl.Token = vocabulary';
tbl.SentimentScore = sentimentscores';

To remove tokens with neutral sentiment from the lexicon, remove the tokens whose sentiment scores have an absolute value less than a threshold of 0.1.

thr = 0.1;
idx = abs(tbl.SentimentScore) < thr;
tbl(idx,:) = [];

Sort the table rows by descending sentiment score and view the first few rows.

tbl = sortrows(tbl,'SentimentScore','descend');
head(tbl)

ans=8×2 table
         Token         SentimentScore
    _______________    ______________
    "opportunities"       0.95633    
    "innovative"          0.89635    
    "success"             0.84362    
    "focused"             0.83768    
    "strong"              0.81042    
    "capabilities"        0.79174    
    "innovation"          0.77698    
    "improved"            0.77176

You can use this table, with its Token and SentimentScore variables, as a custom sentiment lexicon for the vaderSentimentScores function.

Visualize the sentiment lexicon in word clouds. Display tokens with a positive score in one word cloud and tokens with negative scores in another. Display the words with sizes given by the absolute value of their corresponding sentiment score.

figure
subplot(1,2,1);
idx = tbl.SentimentScore > 0;
tblpositive = tbl(idx,:);
wordcloud(tblpositive,'Token','SentimentScore')
title('Positive Words')
subplot(1,2,2);
idx = tbl.SentimentScore < 0;
tblnegative = tbl(idx,:);
tblnegative.SentimentScore = abs(tblnegative.SentimentScore);
wordcloud(tblnegative,'Token','SentimentScore')
title('Negative Words')

Export the table to a CSV file.

filename = "financesentimentlexicon.csv";
writetable(tbl,filename)
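To reuse an exported lexicon in a later session, you can read the file back into a table. The following sketch performs a round trip with a hypothetical two-word lexicon written to the temporary folder. It assumes that you want the tokens imported as strings, so it sets the 'TextType' option of readtable to 'string'.

```matlab
% Hypothetical two-word lexicon, for illustration only.
tbltoy = table(["strong";"losses"],[0.8;-0.7], ...
    'VariableNames',{'Token','SentimentScore'});

% Write the lexicon to a temporary CSV file.
filenametoy = fullfile(tempdir,"toylexicon.csv");
writetable(tbltoy,filenametoy)

% Read it back, importing the text column as a string array.
tblloaded = readtable(filenametoy,'TextType','string');
```

The loaded table has the same variables as the original, so it can be passed on in the same way as a table created in the current session.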

Analyze Sentiment in Text

To analyze the sentiment in previously unseen text data, preprocess the text using the same preprocessing steps and use the vaderSentimentScores function.

Create a string array containing the text data and preprocess it using the preprocesstext function.

textdatanew = [
    "this innovative company is continually showing strong growth."
    "this other company is accused of misleading consumers."];
documentsnew = preprocesstext(textdatanew);

Evaluate the sentiment using the vaderSentimentScores function. Specify the sentiment lexicon created in this example using the 'SentimentLexicon' option.

compoundscores = vaderSentimentScores(documentsnew,'SentimentLexicon',tbl)

compoundscores = 2×1
    0.4360
   -0.1112

Positive and negative scores indicate positive and negative sentiment, respectively. The magnitude of the value corresponds to the strength of the sentiment.

Supporting Functions

Text Preprocessing Function

The preprocesstext function performs the following steps:

1. Erase any URLs.
2. Tokenize the text.
3. Remove tokens containing digits.
4. Convert the text to lowercase.
5. Remove any words with two or fewer characters.
6. Remove any stop words.

function documents = preprocesstext(textdata)
% Erase URLs.
textdata = eraseURLs(textdata);
% Tokenize.
documents = tokenizedDocument(textdata);
% Remove tokens containing digits.
pat = textBoundary + wildcardPattern + digitsPattern + wildcardPattern + textBoundary;
documents = replace(documents,pat,"");
% Convert to lowercase.
documents = lower(documents);
% Remove short words.
documents = removeShortWords(documents,2);
% Remove stop words.
documents = removeStopWords(documents);
end

Polarity Scores Function

The polarityscores function returns a vector of polarity scores given a set of seed words, vocabulary, graph, and a specified depth. The function computes the sum over the maximum weighted path from every seed word to each node in the vocabulary. A high polarity score indicates phrases connected to multiple seed words via both short and strongly weighted paths.

The function performs the following steps:

1. Initialize the scores to one for the seeds and to zero otherwise.
2. Loop over the seeds. For each seed, iteratively traverse the graph at different depth levels. For the first iteration, set the search space to the seed itself, so that its immediate neighbors are scored first.
3. For each depth level, loop over the nodes in the search space and, for each node, identify its neighbors in the graph.
4. Loop over the neighbors and update the corresponding scores. The updated score is the maximum of the current score for the seed and neighbor, and the score for the seed and search node weighted by the corresponding graph edge.
5. After processing each node at the depth level, append its neighbors to the search space. This increases the depth of the search for the next iteration.

The output polarity is the sum of the scores connected to the input seeds.
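The update rule can be traced by hand on a three-node chain. The following sketch is a simplified, single-seed illustration of the steps above and is not the polarityscores function itself. The node names and edge weights are hypothetical, and the weights are treated as similarity-like values between 0 and 1.

```matlab
% Hypothetical chain a - b - c with similarity-like edge weights.
namestoy = ["a" "b" "c"];
chaintoy = graph(["a" "b"],["b" "c"],[0.8 0.5],namestoy);

% The seed "a" (node 1) starts with score 1; all others start at 0.
scoretoy = [1 0 0];
frontier = 1;

for d = 1:2
    nextfrontier = [];
    for node = frontier
        % Find the neighbors of this node and the connecting weights.
        nbrstoy = neighbors(chaintoy,node);
        idxedge = findedge(chaintoy,repmat(node,size(nbrstoy)),nbrstoy);
        wtoy = chaintoy.Edges.Weight(idxedge);

        % Keep the best weighted path found so far to each neighbor.
        for j = 1:numel(nbrstoy)
            scoretoy(nbrstoy(j)) = max(scoretoy(nbrstoy(j)), ...
                scoretoy(node)*wtoy(j));
        end
        nextfrontier = [nextfrontier nbrstoy'];
    end
    % Extend the search space for the next depth level.
    frontier = [frontier nextfrontier];
end
scoretoy
```

At depth 1, node b receives 1*0.8 = 0.8. At depth 2, node c receives 0.8*0.5 = 0.4, and the seed keeps its score of 1 because 0.8*0.8 is smaller. Longer or weaker paths therefore contribute smaller scores.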

function polarity = polarityscores(seeds,vocabulary,wordgraph,depth)
% remove seeds missing from vocabulary.
idx = ~ismember(seeds,vocabulary);
seeds(idx) = [];
% initialize scores.
vocabularysize = numel(vocabulary);
scores = zeros(vocabularysize);
idx = ismember(vocabulary,seeds);
scores(idx,idx) = eye(numel(seeds));
% loop over seeds.
for i = 1:numel(seeds)
    
    % initialize search space.
    seed = seeds(i);
    idxseed = vocabulary == seed;
    searchspace = find(idxseed);
    
    % search at different depths.
    for d = 1:depth
    
        % loop over nodes in search space.
        numnodes = numel(searchspace);
        
        for k = 1:numnodes
            
            idxnew = searchspace(k);
            
            % find neighbors and weights.
            nbrs = neighbors(wordgraph,idxnew);
            idxweights = findedge(wordgraph,idxnew,nbrs);
            weights = wordgraph.Edges.Weight(idxweights);
            
            % loop over neighbors.
            for j = 1:numel(nbrs)
                
                % calculate scores.
                score = scores(idxseed,nbrs(j));
                scorenew = scores(idxseed,idxnew);
                
                % update score.
                scores(idxseed,nbrs(j)) = max(score,scorenew*weights(j));
            end
            
            % Append neighbors to search space for next depth iteration.
            searchspace = [searchspace nbrs'];
        end
    end
end
% find seeds in vocabulary.
[~,idx] = ismember(seeds,vocabulary);
% sum scores connected to seeds.
polarity = sum(scores(idx,:));
end

Bibliography

[1] Velikovich, Leonid. "The Viability of Web-Derived Polarity Lexicons." In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 777-785.
[2] Accessing EDGAR Data.