Correct Spelling in Documents
This example shows how to correct spelling in documents using Hunspell.
Load Text Data
Create an array of tokenized documents.
str = [
    "use matlab to correct spelling of words."
    "correctly spelled worrds are important for lemmatization."
    "text analytics toolbox providesfunctions for spelling correction."];
documents = tokenizedDocument(str)
documents = 
  3x1 tokenizedDocument:

    8 tokens: use matlab to correct spelling of words .
    8 tokens: correctly spelled worrds are important for lemmatization .
    8 tokens: text analytics toolbox providesfunctions for spelling correction .
Correct Spelling
Correct the spelling of the documents using the correctSpelling function.
updatedDocuments = correctSpelling(documents)
updatedDocuments = 
  3x1 tokenizedDocument:

    9 tokens: use mat lab to correct spelling of words .
    8 tokens: correctly spelled words are important for solemnization .
    9 tokens: text analytic toolbox provides functions for spelling correction .
Notice that:
The input word "matlab" has been split into the two words "mat" and "lab".
The input word "worrds" has been changed to "words".
The input word "lemmatization" has been changed to "solemnization".
The input word "analytics" has been changed to "analytic".
The input word "providesfunctions" has been split into the two words "provides" and "functions".
Only the change to "worrds" and the split of "providesfunctions" fix genuine errors. The other three changes alter words that were already correct.
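To list the affected words programmatically rather than by reading the output, one option is to compare the vocabularies of the original and corrected documents. The following is a minimal sketch, not part of the original example. It assumes the Vocabulary property of tokenizedDocument and the standard setdiff function, and the variable names removedWords and introducedWords are chosen here for illustration.

```matlab
% Recreate the documents from the example so that this snippet runs on its own.
str = [
    "use matlab to correct spelling of words."
    "correctly spelled worrds are important for lemmatization."
    "text analytics toolbox providesfunctions for spelling correction."];
documents = tokenizedDocument(str);
updatedDocuments = correctSpelling(documents);

% Words that occur only in the original documents
% (changed or split by the spelling correction).
removedWords = setdiff(documents.Vocabulary,updatedDocuments.Vocabulary)

% Words that occur only in the corrected documents
% (introduced by the spelling correction).
introducedWords = setdiff(updatedDocuments.Vocabulary,documents.Vocabulary)
```

For the corrected output shown above, removedWords should contain "analytics", "lemmatization", "matlab", "providesfunctions", and "worrds". The exact result may differ if a different dictionary is used.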
Specify Custom Words
To prevent the software from updating particular words, you can provide a list of known words using the 'KnownWords' option of the correctSpelling function.
Correct the spelling of the documents again and specify the words "matlab", "analytics", and "lemmatization" as known words.
updatedDocuments = correctSpelling(documents,'KnownWords',["matlab" "analytics" "lemmatization"])
updatedDocuments = 
  3x1 tokenizedDocument:

    8 tokens: use matlab to correct spelling of words .
    8 tokens: correctly spelled words are important for lemmatization .
    9 tokens: text analytics toolbox provides functions for spelling correction .
Notice here that the words "matlab", "analytics", and "lemmatization" remain unchanged, while "worrds" and "providesfunctions" are still corrected.
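When the list of known words grows, it can help to keep it in a variable and confirm that every entry survives the correction. The following is a minimal sketch under the same assumptions as the example. The variable names knownWords and allPreserved are illustrative, and the check relies on the Vocabulary property of tokenizedDocument.

```matlab
% Recreate the documents from the example so that this snippet runs on its own.
str = [
    "use matlab to correct spelling of words."
    "correctly spelled worrds are important for lemmatization."
    "text analytics toolbox providesfunctions for spelling correction."];
documents = tokenizedDocument(str);

% Keep the known words in one place so that the list can be reused.
knownWords = ["matlab" "analytics" "lemmatization"];
updatedDocuments = correctSpelling(documents,'KnownWords',knownWords);

% Check that every known word still appears in the corrected documents.
allPreserved = all(ismember(knownWords,updatedDocuments.Vocabulary))
```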