Text Data Preparation

Import text data into MATLAB® and preprocess it for analysis

Text Analytics Toolbox™ includes tools for processing raw text from sources such as equipment logs, news feeds, surveys, operator reports, and social media. Use these tools to extract text from popular file formats, preprocess raw text, extract individual words or multiword phrases (n-grams), convert text into numerical representations, and build statistical models. For an example showing how to get started, see Prepare Text Data for Analysis.

Text Analytics Toolbox supports the languages English, Japanese, German, and Korean. Most Text Analytics Toolbox functions also work with text from other languages. For more information, see the topics under Language Support.

Live Editor Tasks

Preprocess Text Data

Preprocess and clean up text data for analysis

Functions

Import and Export

	Read text from PDF, Microsoft Word, HTML, and plain text files
	Extract text from HTML
	Read data from PDF forms
	PDF file information
	Write documents to text file

HTML Parsing

	Parsed HTML tree
	Find elements in HTML tree
	Read HTML attribute of root node of HTML tree
	Find HTML trees without values
	Convert parsed HTML tree to string

Document Preprocessing

`tokenizedDocument`	Array of tokenized documents for text analysis
`erasePunctuation`	Erase punctuation from text and documents
	Erase HTML and XML tags from text
`eraseURLs`	Erase HTTP and HTTPS URLs from text
`removeStopWords`	Remove stop words from documents
	Remove short words from documents or bag-of-words model
	Remove long words from documents or bag-of-words model
	Remove selected words from documents or bag-of-words model
`normalizeWords`	Stem or lemmatize words
	Replace words in documents
	Replace n-grams in documents
	Split text into sentences
	Split text into paragraphs
	List of stop words
	Convert HTML and XML entities into characters
	Convert documents to lowercase
	Convert documents to uppercase
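
As a rough illustration of how the functions named above fit together, the following minimal sketch cleans two short strings. The sample text is made up for this example, the order of the steps is one reasonable choice rather than a required one, and the explicit `'Style','stem'` option is an assumption about the normalization you want.

```matlab
% Sample raw text, made up for illustration only.
str = [
    "The pump FAILED to start; see https://example.com/log for details."
    "Operators reported a loud noise in the mixer."];

% Remove URLs while the text is still a plain string array.
str = eraseURLs(str);

% Split each string into tokens.
documents = tokenizedDocument(str);

% Remove punctuation, then stop words such as "the" and "a".
documents = erasePunctuation(documents);
documents = removeStopWords(documents);

% Reduce words to their stems (assumed choice; lemmatizing is the alternative).
documents = normalizeWords(documents,'Style','stem');

% Display the cleaned documents.
disp(documents)
```

Removing URLs before tokenizing keeps the rest of the pipeline working on ordinary words only.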

Token Details

	Search documents for word or n-gram occurrences in context
	Details of tokens in tokenized document array
	Add sentence numbers to documents
`addPartOfSpeechDetails`	Add part-of-speech tags to documents
	Add lemma forms of tokens to documents
	Add language identifiers to documents
`addEntityDetails`	Add entity tags to documents
	Add grammatical dependency details to documents
	Add token type details to documents
	Split text into sentences
	Split text into paragraphs
`corpusLanguage`	Detect language of text
	Table of common abbreviations
	List of top-level domains
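
A short sketch of inspecting token details follows. It assumes the toolbox function `tokenDetails`, which this listing describes ("details of tokens in tokenized document array") but does not spell out, and uses a made-up sentence.

```matlab
% One sample sentence, made up for illustration only.
documents = tokenizedDocument("The quick brown fox jumps over the lazy dog.");

% Attach part-of-speech tags to every token.
documents = addPartOfSpeechDetails(documents);

% Retrieve the per-token details as a table, one row per token.
tdetails = tokenDetails(documents);

% Show the first few rows, including the PartOfSpeech variable.
head(tdetails)
```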

Word and N-Gram Counting

	Bag-of-words model
	Bag-of-n-grams model
	Add documents to bag-of-words or bag-of-n-grams model
	Remove documents from bag-of-words or bag-of-n-grams model
	Remove words with low counts from bag-of-words model
	Remove infrequently seen n-grams from bag-of-n-grams model
	Remove n-grams from bag-of-n-grams model
	Remove empty documents from tokenized document array, bag-of-words model, or bag-of-n-grams model
	Most important words in bag-of-words model or LDA topic
	Most frequent n-grams
	Encode documents as matrix of word or n-gram counts
`tfidf`	Term frequency–inverse document frequency (tf-idf) matrix
	Combine multiple bag-of-words or bag-of-n-grams models
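
To show how word counting turns documents into a numerical representation, here is a minimal sketch. It assumes the toolbox functions `bagOfWords` and `topkwords`, which correspond to entries in this listing whose names are not shown, and uses two made-up sentences.

```matlab
% Two sample documents, made up for illustration only.
documents = tokenizedDocument([
    "an example of a short sentence"
    "a second short sentence"]);

% Count how often each word occurs in each document.
bag = bagOfWords(documents);

% List the three most frequent words across the model.
tbl = topkwords(bag,3);

% Compute the tf-idf matrix: one row per document, one column per word.
M = tfidf(bag);
disp(size(M))
```

The resulting matrix `M` is sparse, so it stays compact even when the vocabulary is large.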

Spelling Correction and Edit Distance

	Correct spelling of words
	Find edit distance between two strings or documents
	Edit distance nearest neighbor searcher
	Find nearest neighbors by edit distance
	Find nearest neighbors by edit distance range
	Split string into graphemes

Document Manipulation and Conversion

	Apply function to words in documents
	Check if word is member of documents
	Check if n-gram is member of documents
	Check if pattern is substring in documents
	Append documents
	Replace substrings in documents
	Replace text in words of documents using regular expression
	Length of documents in document array
	Convert documents to cell array of string vectors
	Convert documents to string by joining words
	Convert scalar document to string vector

Unicode

	Unicode composed normalized form (NFC)
	Unicode decomposed normalized form (NFD)
	Unicode compatibility composed normalized form (NFKC)
	Unicode compatibility decomposed normalized form (NFKD)
	Unicode UTF-32 string representation
	Unicode character categories
	Convert UTF-32 representation to hexadecimal values
	Convert UTF-32 representation to string

Topics

Import

Extract Text Data from Files
This example shows how to extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.
Parse HTML and Extract Text Content
This example shows how to parse HTML code and extract the text content from particular elements.
Discover data sets for various text analytics tasks.

Preprocessing

Explore text preprocessing techniques using the Preprocess Text Data Live Editor task.
Prepare Text Data for Analysis
This example shows how to create a function that cleans and preprocesses text data for analysis.
Analyze Text Data Containing Emojis
This example shows how to analyze text data containing emojis.
Correct Spelling in Documents
This example shows how to correct spelling in documents using Hunspell.
This example shows how to create a Hunspell extension dictionary for spelling correction.
This example shows how to correct spelling using edit distance searchers and a vocabulary of known words.
This example shows how to extract information from a sentence using grammatical dependency parsing.

Language Support

Information on using Text Analytics Toolbox features for other languages.
Information on Japanese support in Text Analytics Toolbox.
Analyze Japanese Text Data
This example shows how to import, prepare, and analyze Japanese text data using a topic model.
Information on German support in Text Analytics Toolbox.
Analyze German Text Data
This example shows how to import, prepare, and analyze German text data using a topic model.

Featured Examples

Extract Text Data from Files

Extract the text data from text, HTML, Microsoft® Word, PDF, CSV, and Microsoft Excel® files and import it into MATLAB® for analysis.

Prepare Text Data for Analysis

Create a function that cleans and preprocesses text data for analysis.

Analyze Text Data Containing Emojis

Analyze text data containing emojis.