Analyze Japanese Text Data

This example shows how to import, prepare, and analyze Japanese text data using a topic model.

Japanese text data can be large and can contain a lot of noise that negatively affects statistical analysis. For example, the text data can contain the following:

Variations in word forms. For example, "難しい" ("is difficult") and "難しかった" ("was difficult").
Words that add noise. For example, stop words such as "あそこ" ("over there"), "あたり" ("around"), and "あちら" ("there").
Punctuation and special characters.

These word clouds illustrate word frequency analysis applied to raw text data from "吾輩は猫である" ("I Am a Cat") by 夏目漱石 (Natsume Sōseki) and to a preprocessed version of the same text data.

This example first shows how to import and prepare Japanese text data, and then shows how to analyze the text data using a latent Dirichlet allocation (LDA) model. An LDA model is a topic model that discovers underlying topics in a collection of documents and infers the word probabilities in topics.

Import Data

Load the example data "factoryReportsJP.csv". The data contains factory reports in Japanese, including a text description and categorical labels for each event. Read the table using the readtable function and extract the text as strings. Because the ReadVariableNames option is set to false, readtable assigns the default names "Var1", "Var2", ..., "Var5" to the table variables.

filename = "factoryReportsJP.csv";
data = readtable(filename,"TextType","string","ReadVariableNames",false);

The table contains these variables:

Var1: Description
Var2: Category
Var3: Urgency
Var4: Resolution
Var5: Cost

Extract the text data from the variable Var1 and view the first few reports.

textdata = data.Var1;
textdata(1:10)

ans = 10×1 string
    "スキャナーのスプールにアイテムが詰まることがある。"
    "アセンブラのピストンからガタガタと大きな音がします。"
    "工場起動時に電源が切れる。"
    "アセンブラのコンデンサが飛ぶ。"
    "ミキサーでヒューズが切れる。"
    "コンストラクション・エージェントのパイプが破裂して冷却水を噴射している。"
    "ミキサーでヒューズが飛んだ。"
    "ベルトから物が続々と落ちてきます。"
    "ベルトから物が落下する。"
    "スキャナーのリールが割れている、すぐにカーブし始める。"

Visualize the text data in a word cloud.

figure
wordcloud(textdata);

Tokenize Documents

Tokenize the text using tokenizedDocument and view the first few documents.

documents = tokenizedDocument(textdata);
documents(1:10)

ans = 
  10×1 tokenizedDocument:
    11 tokens: スキャナー の スプール に アイテム が 詰まる こと が ある 。
    12 tokens: アセンブラ の ピストン から ガタガタ と 大きな 音 が し ます 。
     8 tokens: 工場 起動 時 に 電源 が 切れる 。
     6 tokens: アセンブラ の コンデンサ が 飛ぶ 。
     6 tokens: ミキサー で ヒューズ が 切れる 。
    17 tokens: コンストラクション ・ エージェント の パイプ が 破裂 し て 冷却 水 を 噴射 し て いる 。
     7 tokens: ミキサー で ヒューズ が 飛ん だ 。
    11 tokens: ベルト から 物 が 続々 と 落ち て き ます 。
     7 tokens: ベルト から 物 が 落下 する 。
    14 tokens: スキャナー の リール が 割れ て いる 、 すぐ に カーブ し 始める 。

Get Part-of-Speech Tags

Get the token details and then view the details of the first few tokens.

tdetails = tokenDetails(documents);
head(tdetails)

ans=8×8 table
       Token       DocumentNumber    LineNumber     Type      Language    PartOfSpeech       Lemma         Entity  
    ___________    ______________    __________    _______    ________    ____________    ___________    __________
    "スキャナー"          1               1         letters       ja        noun           "スキャナー"     non-entity
    "の"                 1               1         letters       ja        adposition     "の"           non-entity
    "スプール"            1               1         letters       ja        noun           "スプール"       non-entity
    "に"                 1               1         letters       ja        adposition     "に"           non-entity
    "アイテム"            1               1         letters       ja        noun           "アイテム"       non-entity
    "が"                 1               1         letters       ja        adposition     "が"           non-entity
    "詰まる"              1               1         letters       ja        verb           "詰まる"        non-entity
    "こと"               1               1         letters       ja        noun           "こと"          non-entity

The PartOfSpeech variable in the table contains the part-of-speech tags of the tokens. Create one word cloud of all the nouns and another of all the adjectives.

figure
idx = tdetails.PartOfSpeech == "noun";
tokens = tdetails.Token(idx);
subplot(1,2,1)
wordcloud(tokens);
title("Nouns")
idx = tdetails.PartOfSpeech == "adjective";
tokens = tdetails.Token(idx);
subplot(1,2,2)
wc = wordcloud(tokens);
title("Adjectives")

Prepare Text Data for Analysis

Remove the stop words.

documents = removeStopWords(documents);
documents(1:10)

ans = 
  10×1 tokenizedDocument:
    5 tokens: スキャナー スプール アイテム 詰まる 。
    6 tokens: アセンブラ ピストン ガタガタ 大きな 音 。
    5 tokens: 工場 起動 電源 切れる 。
    4 tokens: アセンブラ コンデンサ 飛ぶ 。
    4 tokens: ミキサー ヒューズ 切れる 。
    8 tokens: コンストラクション ・ エージェント パイプ 破裂 冷却 噴射 。
    4 tokens: ミキサー ヒューズ 飛ん 。
    6 tokens: ベルト 物 続々 落ち き 。
    4 tokens: ベルト 物 落下 。
    8 tokens: スキャナー リール 割れ 、 すぐ カーブ 始める 。

Erase the punctuation.

documents = erasePunctuation(documents);
documents(1:10)

ans = 
  10×1 tokenizedDocument:
    4 tokens: スキャナー スプール アイテム 詰まる
    5 tokens: アセンブラ ピストン ガタガタ 大きな 音
    4 tokens: 工場 起動 電源 切れる
    3 tokens: アセンブラ コンデンサ 飛ぶ
    3 tokens: ミキサー ヒューズ 切れる
    6 tokens: コンストラクション エージェント パイプ 破裂 冷却 噴射
    3 tokens: ミキサー ヒューズ 飛ん
    5 tokens: ベルト 物 続々 落ち き
    3 tokens: ベルト 物 落下
    6 tokens: スキャナー リール 割れ すぐ カーブ 始める

Lemmatize the text using normalizeWords. For example, the inflected forms "飛ん" and "落ち" become the dictionary forms "飛ぶ" and "落ちる".

documents = normalizeWords(documents);
documents(1:10)

ans = 
  10×1 tokenizedDocument:
    4 tokens: スキャナー スプール アイテム 詰まる
    5 tokens: アセンブラ ピストン ガタガタ 大きな 音
    4 tokens: 工場 起動 電源 切れる
    3 tokens: アセンブラ コンデンサ 飛ぶ
    3 tokens: ミキサー ヒューズ 切れる
    6 tokens: コンストラクション エージェント パイプ 破裂 冷却 噴射
    3 tokens: ミキサー ヒューズ 飛ぶ
    5 tokens: ベルト 物 続々 落ちる くる
    3 tokens: ベルト 物 落下
    6 tokens: スキャナー リール 割れる すぐ カーブ 始める

Some preprocessing steps, such as removing stop words and erasing punctuation, can leave documents with no tokens. Remove the empty documents using the removeEmptyDocuments function.

documents = removeEmptyDocuments(documents);

Create Preprocessing Function

Creating a function that performs preprocessing can be useful for preparing different collections of text data in the same way. For example, you can use a function to preprocess new data using the same steps as the training data.

Create a function that tokenizes and preprocesses the text data for analysis. The function preprocesstext performs these steps:

Tokenize the text using tokenizedDocument.
Erase punctuation using erasePunctuation.
Remove a list of stop words (such as "あそこ", "あたり", and "あちら") using removeStopWords.
Lemmatize the words using normalizeWords.

Remove the empty documents after preprocessing using the removeEmptyDocuments function. Removing empty documents outside the preprocessing function, rather than inside it, makes it easier to remove the corresponding data, such as labels, from other sources.
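The following is a minimal sketch of how documents and labels can be kept aligned. It assumes a hypothetical toy data set (the strings and the labels variable are made up for illustration and are not part of the factory reports file) and uses the second output of removeEmptyDocuments, which returns the indices of the removed documents.

```matlab
% Hypothetical toy data: three reports with one label each.
% The second report contains only punctuation, so it has no tokens
% left after the punctuation is erased.
toytext = ["ミキサーでヒューズが切れる。"; "。。。"; "ベルトから物が落下する。"];
toylabels = ["電気"; "不明"; "機械"];   % illustrative labels only

toydocs = tokenizedDocument(toytext,'Language','ja');
toydocs = erasePunctuation(toydocs);

% The second output lists the indices of the documents that were removed.
[toydocs,removedidx] = removeEmptyDocuments(toydocs);

% Drop the matching labels so that documents and labels stay aligned.
toylabels(removedidx) = [];
```

After this step, the i-th element of toylabels still describes the i-th document in toydocs.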

In this example, use the preprocessing function preprocesstext, listed at the end of the example, to prepare the text data.

documents = preprocesstext(textdata);
documents(1:5)

ans = 
  5×1 tokenizedDocument:
    4 tokens: スキャナー スプール アイテム 詰まる
    5 tokens: アセンブラ ピストン ガタガタ 大きな 音
    4 tokens: 工場 起動 電源 切れる
    3 tokens: アセンブラ コンデンサ 飛ぶ
    3 tokens: ミキサー ヒューズ 切れる

Remove the empty documents.

documents = removeEmptyDocuments(documents);

Fit Topic Model

Fit a latent Dirichlet allocation (LDA) topic model to the data. An LDA model discovers underlying topics in a collection of documents and infers word probabilities in topics.

To fit an LDA model to the data, you must first create a bag-of-words model. A bag-of-words model (also known as a term-frequency counter) records the number of times that each word appears in each document of a collection. Create a bag-of-words model using bagOfWords.

bag = bagOfWords(documents);
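To make the term-frequency idea concrete, here is a small sketch on two hypothetical, already tokenized documents (the toy words are chosen for illustration and are not read from the example file). It passes pre-split words to tokenizedDocument with the 'TokenizeMethod' option set to 'none', and then inspects the Vocabulary and Counts properties of the resulting bag-of-words model.

```matlab
% Two hypothetical documents that are already split into words.
toywords = {["ベルト" "物" "落下"]; ["ベルト" "詰まる"]};
toydocs = tokenizedDocument(toywords,'TokenizeMethod','none');

% Count how often each word appears in each document.
toybag = bagOfWords(toydocs);

% Vocabulary holds the unique words. Counts is a sparse matrix with
% one row per document and one column per vocabulary word.
vocabulary = toybag.Vocabulary
counts = full(toybag.Counts)

% "ベルト" is the only word that appears in both documents.
beltcolumn = counts(:,vocabulary == "ベルト")
```

The same properties are available on the bag variable created from the factory reports, where topkwords(bag,k) lists the k most frequent words.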

Remove the empty documents from the bag-of-words model.

bag = removeEmptyDocuments(bag);

Fit an LDA model with seven topics using fitlda. To suppress the verbose output, set the Verbose option to 0.

numtopics = 7;
mdl = fitlda(bag,numtopics,"Verbose",0);
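As a rough illustration of what "word probabilities in topics" means, the sketch below fits a two-topic model to a tiny hypothetical corpus (the toy documents and the choice of two topics are assumptions made for this illustration only). It then reads the TopicWordProbabilities property, which has one row per vocabulary word and one column per topic, and lists the top words of the first topic with topkwords. With so little data the topics themselves are not meaningful; the point is only the shape of the result.

```matlab
% Hypothetical, already tokenized toy corpus.
toywords = { ...
    ["ミキサー" "ヒューズ" "切れる"]; ...
    ["ミキサー" "ヒューズ" "飛ぶ"]; ...
    ["ベルト" "物" "落下"]; ...
    ["ベルト" "物" "落ちる"]};
toydocs = tokenizedDocument(toywords,'TokenizeMethod','none');
toybag = bagOfWords(toydocs);

% Fix the random seed so that repeated runs give the same fit.
rng("default")
toymdl = fitlda(toybag,2,'Verbose',0);

% One row per vocabulary word, one column per topic.
p = toymdl.TopicWordProbabilities;

% Each column is a probability distribution over the vocabulary.
columnsums = sum(p,1)

% The three most probable words of topic 1, with their scores.
topwords = topkwords(toymdl,3,1)
```

The same property and function can be applied to mdl to inspect the seven topics fitted to the factory reports.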

Visualize the first four topics using word clouds.

figure
for i = 1:4
    subplot(2,2,i)
    wordcloud(mdl,i);
    title("Topic " + i)
end

Visualize multiple topic mixtures using stacked bar charts. Select five input documents at random and visualize the corresponding topic mixtures.

numdocuments = numel(documents);
idx = randperm(numdocuments,5);
documents(idx)

ans = 
  5×1 tokenizedDocument:
    4 tokens: ミキサー 激しい 揺れる 音
    3 tokens: ベルト 物 落下
    3 tokens: コンベア ベルト 詰まる
    4 tokens: ミキサー 冷却 あちこち こぼれる
    3 tokens: トランスポート ライン 動く

topicmixtures = transform(mdl,documents(idx));
figure
barh(topicmixtures(1:5,:),"stacked")
xlim([0 1])
title("Topic Mixtures")
xlabel("Topic Probability")
ylabel("Document")
legend("Topic " + string(1:numtopics),"Location","northeastoutside")
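Each row of a topic mixtures matrix is a probability distribution over the topics for one document. As a sketch of how such a matrix can be summarized numerically, the code below picks the most probable topic for each document. The matrix values here are hypothetical numbers chosen for illustration, not output of the fitted model.

```matlab
% Hypothetical topic mixtures for three documents and four topics.
% Each row sums to 1.
toymixtures = [0.70 0.10 0.10 0.10
               0.05 0.05 0.80 0.10
               0.25 0.25 0.40 0.10];

% The most probable topic of each document is the column index of
% the largest value in its row.
[topprob,toptopic] = max(toymixtures,[],2);

% Collect the result in a table with one row per document.
summary = table((1:3)',toptopic,topprob, ...
    'VariableNames',["Document" "TopTopic" "Probability"])
```

Applying the same max call to topicmixtures gives the dominant topic for each of the five sampled reports.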

Example Preprocessing Function

The function preprocesstext performs these steps:

Tokenize the text using tokenizedDocument.
Erase punctuation using erasePunctuation.
Remove a list of stop words (such as "あそこ", "あたり", and "あちら") using removeStopWords.
Lemmatize the words using normalizeWords.

function documents = preprocesstext(textdata)
% Tokenize the text.
documents = tokenizedDocument(textdata);
% Erase the punctuation.
documents = erasePunctuation(documents);
% Remove a list of stop words.
documents = removeStopWords(documents);
% Lemmatize the words.
documents = normalizeWords(documents,"Style","lemma");
end
