
tokenizedDocument

Array of tokenized documents for text analysis

Description

A tokenized document is a document represented as a collection of words (also known as tokens) which is used for text analysis.

Use tokenized documents to:

  • Detect complex tokens in text, such as web addresses, emoticons, emoji, and hashtags.

  • Remove words such as stop words using the removeWords or removeStopWords functions.

  • Perform word-level preprocessing tasks such as stemming or lemmatization using the normalizeWords function.

  • Analyze word and n-gram frequencies using bagOfWords and bagOfNgrams objects.

  • Add sentence and part-of-speech details using the addSentenceDetails and addPartOfSpeechDetails functions.

  • Add entity tags using the addEntityDetails function.

  • Add grammatical dependency details using the addDependencyDetails function.

  • View details about the tokens using the tokenDetails function.

The tokenizedDocument function supports English, Japanese, German, and Korean text. To learn how to use tokenizedDocument for other languages, see Language Considerations.
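
These steps compose into a simple preprocessing pipeline. The following is a minimal sketch (the input strings are illustrative): tokenize the text, remove stop words, stem the remaining words, and count word frequencies with a bag-of-words model.

str = [
    "an example of a short sentence"
    "a second short sentence"];
documents = tokenizedDocument(str);
documents = removeStopWords(documents);   % remove words such as "an", "of", "a"
documents = normalizeWords(documents);    % stem words using the Porter stemmer
bag = bagOfWords(documents);              % count word frequencies
topkwords(bag,3)                          % view the three most frequent words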

Creation

Description

documents = tokenizedDocument creates a scalar tokenized document with no tokens.


documents = tokenizedDocument(str) tokenizes the elements of a string array and returns a tokenized document array.


documents = tokenizedDocument(str,Name,Value) specifies additional options using one or more name-value pair arguments.

Input Arguments

str – Input text, specified as a string array, character vector, cell array of character vectors, or cell array of string arrays.

If the input text has not already been split into words, then str must be a string array, character vector, cell array of character vectors, or a cell array of string scalars.

example: ["an example of a short document";"a second short document"]

example: 'an example of a single document'

example: {'an example of a short document';'a second short document'}

If the input text has already been split into words, then specify 'TokenizeMethod' to be 'none'. If str contains a single document, then it must be a string vector of words, a row cell array of character vectors, or a cell array containing a single string vector of words. If str contains multiple documents, then it must be a cell array of string arrays.

example: ["an" "example" "document"]

example: {'an','example','document'}

example: {["an" "example" "of" "a" "short" "document"]}

example: {["an" "example" "of" "a" "short" "document"];["a" "second" "short" "document"]}

Data Types: string | char | cell

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'DetectPatterns',{'email-address','web-address'} detects email addresses and web addresses.
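
For example, since R2021a the following two calls are equivalent (the input string and option value are illustrative):

documents = tokenizedDocument("contact @user today",'DetectPatterns','at-mention');
documents = tokenizedDocument("contact @user today",DetectPatterns="at-mention");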

Method to tokenize documents, specified as the comma-separated pair consisting of 'TokenizeMethod' and one of the following:

  • 'unicode' – Tokenize input text using rules based on Unicode® Standard Annex #29 [1] and the ICU tokenizer [2]. If str is a cell array, then the elements of str must be string scalars or character vectors. If 'Language' is 'en' or 'de', then 'unicode' is the default.

  • 'mecab' – Tokenize Japanese and Korean text using the MeCab tokenizer [3]. If 'Language' is 'ja' or 'ko', then 'mecab' is the default.

  • mecabOptions object – Tokenize Japanese and Korean text using the MeCab options specified by a mecabOptions object.

  • 'none' – Do not tokenize the input text.

If the input text has already been split into words, then specify 'TokenizeMethod' to be 'none'. If str contains a single document, then it must be a string vector of words, a row cell array of character vectors, or a cell array containing a single string vector of words. If str contains multiple documents, then it must be a cell array of string arrays.
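
As a minimal sketch (the words are illustrative), pass text that is already split into words:

words = {["an" "example" "document"];["a" "second" "document"]};
documents = tokenizedDocument(words,'TokenizeMethod','none')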

Patterns of complex tokens to detect, specified as the comma-separated pair consisting of 'DetectPatterns' and 'none', 'all', or a string or cell array containing one or more of the following.

  • 'email-address' – Detect email addresses. For example, treat "user@domain.com" as a single token.

  • 'web-address' – Detect web addresses. For example, treat "https://www.mathworks.com" as a single token.

  • 'hashtag' – Detect hashtags. For example, treat "#MATLAB" as a single token.

  • 'at-mention' – Detect at-mentions. For example, treat "@MathWorks" as a single token.

  • 'emoticon' – Detect emoticons. For example, treat ":-D" as a single token.

If 'DetectPatterns' is 'none', then the function does not detect any complex token patterns. If 'DetectPatterns' is 'all', then the function detects all the listed complex token patterns.

Example: 'DetectPatterns','hashtag'

Example: 'DetectPatterns',{'email-address','web-address'}

Data Types: char | string | cell

Custom tokens to detect, specified as the comma-separated pair consisting of 'CustomTokens' and one of the following.

  • A string array, character vector, or cell array of character vectors containing the custom tokens.

  • A table containing the custom tokens in a column named Token and the corresponding token types in a column named Type.

If you specify the custom tokens as a string array, character vector, or cell array of character vectors, then the function assigns token type "custom". To specify a custom token type, use table input. To view the token types, use the tokenDetails function.

Example: 'CustomTokens',["C++" "C#"]

Data Types: char | string | table | cell

Regular expressions to detect, specified as the comma-separated pair consisting of 'RegularExpressions' and one of the following.

  • A string array, character vector, or cell array of character vectors containing regular expressions.

  • A table containing regular expressions in a column named Pattern and the corresponding token types in a column named Type.

If you specify the regular expressions as a string array, character vector, or cell array of character vectors, then the function assigns token type "custom". To specify a custom token type, use table input. To view the token types, use the tokenDetails function.

Example: 'RegularExpressions',["ver:\d+" "rev:\d+"]

Data Types: char | string | table | cell

Top-level domains to use for web address detection, specified as the comma-separated pair consisting of 'TopLevelDomains' and a character vector, string array, or cell array of character vectors. By default, the function uses the output of topLevelDomains.

This option only applies if 'DetectPatterns' is 'all' or contains 'web-address'.

Example: 'TopLevelDomains',["com" "net" "org"]

Data Types: char | string | cell
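
For example, to also detect web addresses under a top-level domain that is not in the default list, append it to the output of topLevelDomains. A minimal sketch (the domain "internal" here is hypothetical):

str = "see www.docs.internal for details";
tlds = [topLevelDomains "internal"];   % default list plus a custom domain
documents = tokenizedDocument(str,'TopLevelDomains',tlds)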

Language, specified as the comma-separated pair consisting of 'Language' and one of the following.

  • 'en' – English. This option also sets the default value for 'TokenizeMethod' to 'unicode'.

  • 'ja' – Japanese. This option also sets the default value for 'TokenizeMethod' to 'mecab'.

  • 'de' – German. This option also sets the default value for 'TokenizeMethod' to 'unicode'.

  • 'ko' – Korean. This option also sets the default value for 'TokenizeMethod' to 'mecab'.

If you do not specify a value, then the function detects the language from the input text using the corpusLanguage function.

This option specifies the language details of the tokens. To view the language details of the tokens, use tokenDetails. These language details determine the behavior of the removeStopWords, addPartOfSpeechDetails, normalizeWords, addSentenceDetails, and addEntityDetails functions on the tokens.

For more information about language support in Text Analytics Toolbox™, see Language Considerations.

Example: 'Language','ja'
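
Automatic detection can be ambiguous for very short text, so setting the language explicitly can help. A minimal sketch (the input is illustrative; "die" is a word in both English and German):

documents = tokenizedDocument("die",'Language','de');
tdetails = tokenDetails(documents);
tdetails.Language   % language identifier of each token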

Properties

Vocabulary – Unique words in the documents, specified as a string array. The words do not appear in any particular order.

Data Types: string
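
For example (illustrative input), query the property on a document:

document = tokenizedDocument("a short example of a short document");
document.Vocabulary   % unique words, in no particular order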

Object Functions

erasePunctuation – Erase punctuation from text and documents
removeStopWords – Remove stop words from documents
removeWords – Remove selected words from documents or bag-of-words model
normalizeWords – Stem or lemmatize words
correctSpelling – Correct spelling of words
replaceWords – Replace words in documents
replaceNgrams – Replace n-grams in documents
removeEmptyDocuments – Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
lower – Convert documents to lowercase
upper – Convert documents to uppercase
tokenDetails – Details of tokens in tokenized document array
addSentenceDetails – Add sentence numbers to documents
addPartOfSpeechDetails – Add part-of-speech tags to documents
addLanguageDetails – Add language identifiers to documents
addTypeDetails – Add token type details to documents
addLemmaDetails – Add lemma forms of tokens to documents
addEntityDetails – Add entity tags to documents
addDependencyDetails – Add grammatical dependency details to documents
writeTextDocument – Write documents to text file
doclength – Length of documents in document array
context – Search documents for word or n-gram occurrences in context
contains – Check if pattern is substring in documents
containsWords – Check if word is member of documents
containsNgrams – Check if n-gram is member of documents
splitSentences – Split text into sentences
joinWords – Convert documents to string by joining words
doc2cell – Convert documents to cell array of string vectors
string – Convert scalar document to string vector
plus – Append documents
replace – Replace substrings in documents
docfun – Apply function to words in documents
regexprep – Replace text in words of documents using regular expression
wordcloud – Create word cloud chart from text, bag-of-words model, bag-of-n-grams model, or LDA model
sentenceChart – Plot grammatical dependency parse tree of sentence

Examples

Create tokenized documents from a string array.

str = [
    "An example of a short sentence"
    "A second short sentence"]
str = 2x1 string
    "An example of a short sentence"
    "A second short sentence"
documents = tokenizedDocument(str)
documents = 
  2x1 tokenizedDocument:
    6 tokens: An example of a short sentence
    4 tokens: A second short sentence

Create a tokenized document from the string str. By default, the function treats the hashtag "#MATLAB", the emoticon ":-D", and the web address "https://www.mathworks.com/help" as single tokens.

str = "learn how to analyze text in #matlab! :-d see https://www.mathworks.com/help/";
document = tokenizeddocument(str)
document = 
  tokenizeddocument:
   11 tokens: learn how to analyze text in #matlab ! :-d see https://www.mathworks.com/help/

To detect only hashtags as complex tokens, specify the 'DetectPatterns' option to be 'hashtag' only. The function then tokenizes the emoticon ":-D" and the web address "https://www.mathworks.com/help" into multiple tokens.

document = tokenizedDocument(str,'DetectPatterns','hashtag')
document = 
  tokenizedDocument:
   24 tokens: Learn how to analyze text in #MATLAB ! : - D see https : / / www . mathworks . com / help /

Remove the stop words from an array of documents using removeStopWords. The tokenizedDocument function detects that the documents are in English, so removeStopWords removes English stop words.

documents = tokenizedDocument([
    "An example of a short sentence" 
    "A second short sentence"]);
newDocuments = removeStopWords(documents)
newDocuments = 
  2x1 tokenizedDocument:
    3 tokens: example short sentence
    3 tokens: second short sentence

Stem the words in a document array using the Porter stemmer.

documents = tokenizedDocument([
    "A strongly worded collection of words"
    "Another collection of words"]);
newDocuments = normalizeWords(documents)
newDocuments = 
  2x1 tokenizedDocument:
    6 tokens: a strongli word collect of word
    4 tokens: anoth collect of word

The tokenizedDocument function, by default, splits words and tokens that contain symbols. For example, the function splits "C++" and "C#" into multiple tokens.

str = "i am experienced in matlab, c  , and c#.";
documents = tokenizeddocument(str)
documents = 
  tokenizeddocument:
   14 tokens: i am experienced in matlab , c     , and c # .

To prevent the function from splitting tokens that contain symbols, specify custom tokens using the 'CustomTokens' option.

documents = tokenizedDocument(str,'CustomTokens',["C++" "C#"])
documents = 
  tokenizedDocument:
   11 tokens: I am experienced in MATLAB , C++ , and C# .

The custom tokens have token type "custom". View the token details. The column Type contains the token types.

tdetails = tokenDetails(documents)
tdetails=11×5 table
        Token        DocumentNumber    LineNumber       Type        Language
    _____________    ______________    __________    ___________    ________
    "I"                    1               1         letters           en   
    "am"                   1               1         letters           en   
    "experienced"          1               1         letters           en   
    "in"                   1               1         letters           en   
    "MATLAB"               1               1         letters           en   
    ","                    1               1         punctuation       en   
    "C++"                  1               1         custom            en   
    ","                    1               1         punctuation       en   
    "and"                  1               1         letters           en   
    "C#"                   1               1         custom            en   
    "."                    1               1         punctuation       en   

To specify your own token types, input the custom tokens as a table with the tokens in a column named Token, and the types in a column named Type. To assign a custom type to a token that doesn't include symbols, include it in the table too. For example, create a table that assigns "MATLAB", "C++", and "C#" to the "programming-language" token type.

t = table;
t.Token = ["MATLAB" "C++" "C#"]';
t.Type = ["programming-language" "programming-language" "programming-language"]'
t=3×2 table
     Token               Type         
    ________    ______________________
    "MATLAB"    "programming-language"
    "C++"       "programming-language"
    "C#"        "programming-language"

Tokenize the text using the table of custom tokens and view the token details.

documents = tokenizedDocument(str,'CustomTokens',t);
tdetails = tokenDetails(documents)
tdetails=11×5 table
        Token        DocumentNumber    LineNumber            Type            Language
    _____________    ______________    __________    ____________________    ________
    "I"                    1               1         letters                    en   
    "am"                   1               1         letters                    en   
    "experienced"          1               1         letters                    en   
    "in"                   1               1         letters                    en   
    "MATLAB"               1               1         programming-language       en   
    ","                    1               1         punctuation                en   
    "C++"                  1               1         programming-language       en   
    ","                    1               1         punctuation                en   
    "and"                  1               1         letters                    en   
    "C#"                   1               1         programming-language       en   
    "."                    1               1         punctuation                en   

The tokenizedDocument function, by default, splits words and tokens containing symbols. For example, the function splits the text "ver:2" into multiple tokens.

str = "upgraded to ver:2 rev:3.";
documents = tokenizeddocument(str)
documents = 
  tokenizeddocument:
   9 tokens: upgraded to ver : 2 rev : 3 .

To prevent the function from splitting tokens that have particular patterns, specify those patterns using the 'RegularExpressions' option.

Specify regular expressions to detect tokens denoting version and revision numbers: strings of digits appearing after "ver:" and "rev:", respectively.

documents = tokenizedDocument(str,'RegularExpressions',["ver:\d+" "rev:\d+"])
documents = 
  tokenizedDocument:
   5 tokens: Upgraded to ver:2 rev:3 .

Custom tokens, by default, have token type "custom". View the token details. The column Type contains the token types.

tdetails = tokenDetails(documents)
tdetails=5×5 table
      Token       DocumentNumber    LineNumber       Type        Language
    __________    ______________    __________    ___________    ________
    "Upgraded"          1               1         letters           en   
    "to"                1               1         letters           en   
    "ver:2"             1               1         custom            en   
    "rev:3"             1               1         custom            en   
    "."                 1               1         punctuation       en   

To specify your own token types, input the regular expressions as a table with the regular expressions in a column named Pattern and the token types in a column named Type.

t = table;
t.Pattern = ["ver:\d+" "rev:\d+"]';
t.Type = ["version" "revision"]'
t=2×2 table
     Pattern        Type   
    _________    __________
    "ver:\d+"    "version" 
    "rev:\d+"    "revision"

Tokenize the text using the table of regular expressions and view the token details.

documents = tokenizedDocument(str,'RegularExpressions',t);
tdetails = tokenDetails(documents)
tdetails=5×5 table
      Token       DocumentNumber    LineNumber       Type        Language
    __________    ______________    __________    ___________    ________
    "Upgraded"          1               1         letters           en   
    "to"                1               1         letters           en   
    "ver:2"             1               1         version           en   
    "rev:3"             1               1         revision          en   
    "."                 1               1         punctuation       en   

Load the example data. The file sonnetsPreprocessed.txt contains preprocessed versions of Shakespeare's sonnets. The file contains one sonnet per line, with words separated by a space. Extract the text from sonnetsPreprocessed.txt, split the text into documents at newline characters, and then tokenize the documents.

filename = "sonnetspreprocessed.txt";
str = extractfiletext(filename);
textdata = split(str,newline);
documents = tokenizeddocument(textdata);

Search for the word "life".

tbl = context(documents,"life");
head(tbl)
                            Context                             Document    Word
    ________________________________________________________    ________    ____
    "consumst thy self single life ah thou issueless shalt "        9        10 
    "ainted counterfeit lines life life repair times pencil"       16        35 
    "d counterfeit lines life life repair times pencil pupi"       16        36 
    " heaven knows tomb hides life shows half parts write b"       17        14 
    "he eyes long lives gives life thee                    "       18        69 
    "tender embassy love thee life made four two alone sink"       45        23 
    "ves beauty though lovers life beauty shall black lines"       63        50 
    "s shorn away live second life second head ere beautys "       68        27 

View the occurrences in a string array.

tbl.Context
ans = 23x1 string
    "consumst thy self single life ah thou issueless shalt "
    "ainted counterfeit lines life life repair times pencil"
    "d counterfeit lines life life repair times pencil pupi"
    " heaven knows tomb hides life shows half parts write b"
    "he eyes long lives gives life thee                    "
    "tender embassy love thee life made four two alone sink"
    "ves beauty though lovers life beauty shall black lines"
    "s shorn away live second life second head ere beautys "
    "e rehearse let love even life decay lest wise world lo"
    "st bail shall carry away life hath line interest memor"
    "art thou hast lost dregs life prey worms body dead cow"
    "           thoughts food life sweetseasond showers gro"
    "tten name hence immortal life shall though once gone w"
    " beauty mute others give life bring tomb lives life fa"
    "ve life bring tomb lives life fair eyes poets praise d"
    " steal thyself away term life thou art assured mine li"
    "fe thou art assured mine life longer thy love stay dep"
    " fear worst wrongs least life hath end better state be"
    "anst vex inconstant mind life thy revolt doth lie o ha"
    " fame faster time wastes life thou preventst scythe cr"
    "ess harmful deeds better life provide public means pub"
    "ate hate away threw savd life saying                  "
    " many nymphs vowd chaste life keep came tripping maide"

Tokenize Japanese text using tokenizedDocument. The function automatically detects Japanese text.

str = [
    "恋に悩み、苦しむ。"
    "恋の悩みで苦しむ。"
    "空に星が輝き、瞬いている。"
    "空の星が輝きを増している。"];
documents = tokenizedDocument(str)
documents = 
  4x1 tokenizedDocument:
     6 tokens: 恋 に 悩み 、 苦しむ 。
     6 tokens: 恋 の 悩み で 苦しむ 。
    10 tokens: 空 に 星 が 輝き 、 瞬い て いる 。
    10 tokens: 空 の 星 が 輝き を 増し て いる 。

Tokenize German text using tokenizedDocument. The function automatically detects German text.

str = [
    "Guten Morgen. Wie geht es dir?"
    "Heute wird ein guter Tag."];
documents = tokenizedDocument(str)
documents = 
  2x1 tokenizedDocument:
    8 tokens: Guten Morgen . Wie geht es dir ?
    6 tokens: Heute wird ein guter Tag .


References

[1] Unicode Text Segmentation. Unicode Standard Annex #29. https://www.unicode.org/reports/tr29/

[2] Boundary Analysis. ICU User Guide.

[3] MeCab: Yet Another Part-of-Speech and Morphological Analyzer.

Version History

Introduced in R2017b