Analyze Text Data Containing Emojis

This example shows how to analyze text data containing emojis.

Emojis are pictorial symbols that appear inline in text. When writing text on mobile devices such as smartphones and tablets, people use emojis to keep the text short and to convey emotion and feelings.

You can also use emojis to analyze text data. For example, use them to identify relevant strings of text or to visualize the sentiment or emotion of the text.

When working with text data, emojis can behave unpredictably. Depending on your system fonts, your system might not display some emojis correctly. Therefore, if an emoji is not displayed correctly, the data is not necessarily missing; your system might simply be unable to display the emoji in the current font.

Composing Emojis

In most cases, you can read emojis from a file (for example, by using extractFileText, extractHTMLText, or readtable) or by copying and pasting them directly into MATLAB®. Otherwise, you must compose the emoji using Unicode UTF-16 code units.

Some emojis consist of multiple Unicode UTF-16 code units. For example, the "smiling face with sunglasses" emoji (😎 with code point U+1F60E) is a single glyph but comprises two UTF-16 code units, "D83D" and "DE0E". Create a string containing this emoji using the compose function, and specify the two code units with the prefix "\x".

emoji = compose("\xd83d\xde0e")
emoji = 
"😎"

If you do not know the Unicode UTF-16 code units of an emoji, you can recover them from the emoji itself. Use char to get the numeric representation of the emoji, and then use dec2hex to get the corresponding hex values.

codeunits = dec2hex(char(emoji))
codeunits = 2×4 char array
    'D83D'
    'DE0E'

Reconstruct the composition string by prepending "\x" to each code unit and joining the results using the strjoin function with the empty delimiter "".

formatspec = strjoin("\x" + codeunits,"")
formatspec = 
"\xD83D\xDE0E"
emoji = compose(formatspec)
emoji = 
"😎"

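The link between the code point U+1F60E and its two code units can be sketched with the standard UTF-16 surrogate-pair arithmetic. This is a minimal sketch, not part of the original example: it assumes a code point above U+FFFF, and the variable names (codePoint, offset, highUnit, lowUnit, codeUnitsFromPoint, emojiFromPoint) are illustrative only.

```matlab
% Code point of the "smiling face with sunglasses" emoji, U+1F60E.
codePoint = hex2dec('1F60E');

% Code points above U+FFFF are stored as a surrogate pair in UTF-16.
offset   = codePoint - hex2dec('10000');
highUnit = hex2dec('D800') + floor(offset/1024);  % carries the upper 10 bits
lowUnit  = hex2dec('DC00') + mod(offset,1024);    % carries the lower 10 bits

% Hex code units, ready to use with the "\x" prefix in compose.
codeUnitsFromPoint = dec2hex([highUnit; lowUnit])

% Build the emoji directly from the numeric code units.
emojiFromPoint = string(char([highUnit lowUnit]))
```

The resulting code units should match the output of dec2hex(char(emoji)) for the same emoji, which gives a quick way to cross-check a code point found in a Unicode chart.
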
Import Text Data

Extract the text data in the file weekendupdates.xlsx using readtable. The file contains status updates that include the hashtags "#weekend" and "#vacation".

filename = "weekendupdates.xlsx";
tbl = readtable(filename,'texttype','string');
head(tbl)
ans=8×2 table
    ID                                         TextData                                     
    __    __________________________________________________________________________________
    1     "happy anniversary! ❤ next stop: paris! ✈ #vacation"                              
    2     "haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"                  
    3     "getting ready for saturday night 🍕 #yum #weekend 😎"                            
    4     "say it with me - i need a #vacation!!! ☹"                                        
    5     "😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"
    6     "my last #weekend before the exam 😢 👎."                                         
    7     "can’t believe my #vacation is over 😢 so unfair"                                 
    8     "can’t wait for tennis this #weekend 🎾🍓🥂 😀"                                   

Extract the text data from the TextData variable of the table and view the first few status updates.

textdata = tbl.TextData;
textdata(1:5)
ans = 5×1 string
    "happy anniversary! ❤ next stop: paris! ✈ #vacation"
    "haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
    "getting ready for saturday night 🍕 #yum #weekend 😎"
    "say it with me - i need a #vacation!!! ☹"
    "😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"

Visualize the text data in a word cloud.

figure
wordcloud(textdata);

Filter Text Data by Emoji

Identify the status updates containing a particular emoji using the contains function. Find the indices of the documents containing the "smiling face with sunglasses" emoji (😎 with code point U+1F60E). This emoji comprises the two Unicode UTF-16 code units "D83D" and "DE0E".

emoji = compose("\xd83d\xde0e");
idx = contains(textdata,emoji);
textdatasunglasses = textdata(idx);
textdatasunglasses(1:5)
ans = 5×1 string
    "haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
    "getting ready for saturday night 🍕 #yum #weekend 😎"
    "😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"
    "🎉 check the out-of-office crew, we are officially on #vacation!! 😎"
    "who needs a #vacation when the weather is this good ☀ 😎"

Visualize the extracted text data in a word cloud.

figure
wordcloud(textdatasunglasses);
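
The same approach extends to several emojis at once, because contains returns true for an element that includes any of the patterns in a string array. The following is a minimal sketch under stated assumptions: the three sample strings are copied from the status updates shown above, and the grouping of 😢 and 👎 as "sad" emojis, along with the names sample, sadEmojis, idxSad, and sampleSad, is hypothetical.

```matlab
% Small stand-in sample copied from the status updates shown above.
sample = [
    "getting ready for saturday night 🍕 #yum #weekend 😎"
    "my last #weekend before the exam 😢 👎."
    "can’t believe my #vacation is over 😢 so unfair"];

% Hypothetical list of emojis to treat as "sad" in this sketch.
sadEmojis = ["😢" "👎"];

% contains is true where an element includes at least one of the patterns.
idxSad = contains(sample,sadEmojis);
sampleSad = sample(idxSad)
```

If your editor cannot display the emojis, you can build the same patterns from code units instead, for example compose("\xD83D\xDE22") for 😢.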

Extract and Visualize Emojis

Visualize all the emojis in the text data using a word cloud.

Extract the emojis. First, tokenize the text using tokenizedDocument, and then view the first few documents.

documents = tokenizedDocument(textdata);
documents(1:5)
ans = 
  5×1 tokenizedDocument:
    11 tokens: happy anniversary ! ❤ next stop : paris ! ✈ #vacation
    16 tokens: haha , bbq on the beach , engage smug mode ! 😍 😎 ❤ 🎉 #vacation
     9 tokens: getting ready for saturday night 🍕 #yum #weekend 😎
    13 tokens: say it with me - i need a #vacation ! ! ! ☹
    19 tokens: 😎 chilling 😎 at home for the first time in ages … this is the life ! 👍 #weekend

The tokenizedDocument function automatically detects emojis and assigns them the token type "emoji". View the first few token details of the documents using the tokenDetails function.

tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
        Token        DocumentNumber    LineNumber       Type        Language
    _____________    ______________    __________    ___________    ________
    "happy"                1               1         letters           en   
    "anniversary"          1               1         letters           en   
    "!"                    1               1         punctuation       en   
    "❤"                    1               1         emoji             en   
    "next"                 1               1         letters           en   
    "stop"                 1               1         letters           en   
    ":"                    1               1         punctuation       en   
    "paris"                1               1         letters           en   

Visualize the emojis in a word cloud by extracting the tokens with token type "emoji" and passing them to the wordcloud function.

idx = tdetails.Type == "emoji";
tokens = tdetails.Token(idx);
figure
wordcloud(tokens);
title("emojis")
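
A word cloud shows relative frequency only by size, so a frequency table can be a useful companion. The following is a minimal sketch, not part of the original example: it uses a small hand-written stand-in for the extracted emoji tokens so that it runs without the data file, and the names sampleTokens, emojiList, groupIdx, counts, and emojiCounts are illustrative.

```matlab
% Illustrative stand-in for the emoji tokens extracted from the documents.
sampleTokens = ["😎";"❤";"😎";"🎉";"😎";"❤"];

% Map each token to the index of its unique value.
[emojiList,~,groupIdx] = unique(sampleTokens);

% Count how many tokens fall into each group.
counts = accumarray(groupIdx,1);

% Collect the counts in a table, most frequent emoji first.
emojiCounts = sortrows(table(emojiList,counts),"counts","descend")
```

To apply the same steps to the real data, replace sampleTokens with the tokens variable computed from the token details.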
