Analyze Text Data Containing Emojis
This example shows how to analyze text data containing emojis.
Emojis are pictorial symbols that appear inline in text. When writing text on mobile devices such as smartphones and tablets, people use emojis to keep the text short and to convey emotion and feelings.
You can also use emojis to analyze text data. For example, use them to identify relevant strings of text or to visualize the sentiment or emotion of the text.
When working with text data, emojis can behave unpredictably. Depending on your system fonts, your system might not display some emojis correctly. Therefore, if an emoji is not displayed correctly, the data is not necessarily missing; your system might simply be unable to display the emoji in the current font.
Composing Emojis
In most cases, you can read emojis from a file (for example, by using extractFileText, extractHTMLText, or readtable) or copy and paste them directly into MATLAB®. Otherwise, you must compose the emoji using Unicode UTF-16 code units.
Some emojis consist of multiple Unicode UTF-16 code units. For example, the "smiling face with sunglasses" emoji (😎 with code point U+1F60E) is a single glyph but comprises the two UTF-16 code units "D83D" and "DE0E". Create a string containing this emoji using the compose function, and specify the two code units with the prefix "\x".
emoji = compose("\xd83d\xde0e")
emoji = "😎"
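When only the code point of an emoji is known, the two code units can be derived with the standard UTF-16 surrogate-pair arithmetic. The following is a minimal sketch of that calculation using only base MATLAB functions; the variable names codepoint, offset, high, and low are illustrative and do not come from the original example.

```matlab
% Code point of the "smiling face with sunglasses" emoji, U+1F60E.
codepoint = hex2dec('1F60E');

% Code points above U+FFFF are split into a surrogate pair.
offset = codepoint - hex2dec('10000');
high = hex2dec('D800') + floor(offset/1024);  % high (lead) surrogate
low  = hex2dec('DC00') + mod(offset,1024);    % low (trail) surrogate

% Hex code units for use with the "\x" prefix in compose.
codeunits = dec2hex([high; low])

% Build the emoji directly from the two code units.
emoji = string(char([high low]))
```

For U+1F60E, this arithmetic yields the code units D83D and DE0E, the same pair passed to compose above.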
First, get the Unicode UTF-16 code units of an emoji. Use char to get the numeric representation of the emoji, and then use dec2hex to get the corresponding hex values.
codeunits = dec2hex(char(emoji))
codeunits = 2×4 char array
'D83D'
'DE0E'
Reconstruct the composition string using the strjoin function with the empty delimiter "".
formatspec = strjoin("\x" + codeunits,"")
formatspec = "\xD83D\xDE0E"
emoji = compose(formatspec)
emoji = "😎"
Import Text Data
Extract the text data in the file weekendUpdates.xlsx using readtable. The file weekendUpdates.xlsx contains status updates containing the hashtags "#weekend" and "#vacation".
filename = "weekendUpdates.xlsx";
tbl = readtable(filename,'TextType','string');
head(tbl)
ans=8×2 table
ID TextData
__ __________________________________________________________________________________
1 "happy anniversary! ❤ next stop: paris! ✈ #vacation"
2 "haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
3 "getting ready for saturday night 🍕 #yum #weekend 😎"
4 "say it with me - i need a #vacation!!! ☹"
5 "😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"
6 "my last #weekend before the exam 😢 👎."
7 "can’t believe my #vacation is over 😢 so unfair"
8 "can’t wait for tennis this #weekend 🎾🍓🥂 😀"
Extract the text data from the field TextData and view the first few status updates.
textdata = tbl.TextData;
textdata(1:5)
ans = 5×1 string
"happy anniversary! ❤ next stop: paris! ✈ #vacation"
"haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
"getting ready for saturday night 🍕 #yum #weekend 😎"
"say it with me - i need a #vacation!!! ☹"
"😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"
Visualize the text data in a word cloud.
figure
wordcloud(textdata);
Filter Text Data by Emoji
Identify the status updates containing a particular emoji using the contains function. Find the indices of the documents containing the "smiling face with sunglasses" emoji (😎 with code point U+1F60E). This emoji comprises the two Unicode UTF-16 code units "D83D" and "DE0E".
emoji = compose("\xd83d\xde0e");
idx = contains(textdata,emoji);
textdatasunglasses = textdata(idx);
textdatasunglasses(1:5)
ans = 5×1 string
"haha, bbq on the beach, engage smug mode! 😍 😎 ❤ 🎉 #vacation"
"getting ready for saturday night 🍕 #yum #weekend 😎"
"😎 chilling 😎 at home for the first time in ages…this is the life! 👍 #weekend"
"🎉 check the out-of-office crew, we are officially on #vacation!! 😎"
"who needs a #vacation when the weather is this good ☀ 😎"
Visualize the extracted text data in a word cloud.
figure
wordcloud(textdatasunglasses);
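The same filtering idea extends to several emojis at once, and to counting how often an emoji occurs in each update. The following is a minimal sketch using the base MATLAB functions contains and count; the sample array is a small hypothetical stand-in for the status updates, and the variable names are illustrative.

```matlab
% Compose the emojis from their UTF-16 code units.
sunglasses = compose("\xD83D\xDE0E");   % U+1F60E
crying     = compose("\xD83D\xDE22");   % U+1F622

% Small hypothetical sample standing in for the status updates.
sample = [
    sunglasses + " chilling " + sunglasses + " at home #weekend"
    "my last #weekend before the exam " + crying
    "say it with me - i need a #vacation!!!"];

% True where an update contains either emoji.
hasEither = contains(sample,[sunglasses crying])

% Number of times the sunglasses emoji occurs in each update.
numSunglasses = count(sample,sunglasses)
```

Passing a string array of patterns to contains matches an update if any one of the patterns is present, so a single call covers a whole group of related emojis.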
Extract and Visualize Emojis
Visualize all the emojis in the text data using a word cloud.
Extract the emojis. First, tokenize the text using tokenizedDocument, and then view the first few documents.
documents = tokenizedDocument(textdata);
documents(1:5)
ans = 5×1 tokenizedDocument:
11 tokens: happy anniversary ! ❤ next stop : paris ! ✈ #vacation
16 tokens: haha , bbq on the beach , engage smug mode ! 😍 😎 ❤ 🎉 #vacation
9 tokens: getting ready for saturday night 🍕 #yum #weekend 😎
13 tokens: say it with me - i need a #vacation ! ! ! ☹
19 tokens: 😎 chilling 😎 at home for the first time in ages … this is the life ! 👍 #weekend
The tokenizedDocument function automatically detects emojis and assigns them the token type "emoji". View the first few token details of the documents using the tokenDetails function.
tdetails = tokenDetails(documents);
head(tdetails)
ans=8×5 table
Token DocumentNumber LineNumber Type Language
_____________ ______________ __________ ___________ ________
"happy" 1 1 letters en
"anniversary" 1 1 letters en
"!" 1 1 punctuation en
"❤" 1 1 emoji en
"next" 1 1 letters en
"stop" 1 1 letters en
":" 1 1 punctuation en
"paris" 1 1 letters en
Visualize the emojis in a word cloud by extracting the tokens with token type "emoji" and inputting them into the wordcloud function.
idx = tdetails.Type == "emoji";
tokens = tdetails.Token(idx);
figure
wordcloud(tokens);
title("emojis")
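A word cloud conveys relative frequency visually, so it can help to tabulate the underlying counts as well. The following is a minimal sketch using the base MATLAB functions unique, accumarray, and sortrows; the tokens array here is a small hypothetical stand-in for the extracted emoji tokens, and the names emojiList, groupidx, and emojicounts are illustrative.

```matlab
% Hypothetical emoji tokens standing in for the extracted emoji tokens.
heart      = compose("\x2764");          % U+2764
sunglasses = compose("\xD83D\xDE0E");    % U+1F60E
tokens = [sunglasses; heart; sunglasses; sunglasses; heart];

% Map each token to a group index, then count the members of each group.
[emojiList,~,groupidx] = unique(tokens);
counts = accumarray(groupidx,1);

% Most frequent emojis first.
emojicounts = sortrows(table(emojiList,counts),'counts','descend')
```

Applied to the real tokens, a table like this gives exact counts for the emojis that appear in the word cloud.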