Parse HTML and Extract Text Content
This example shows how to parse HTML code and extract the text content from particular elements.
Parse HTML Code
Read HTML code from the URL https://www.mathworks.com/help/textanalytics using webread.
url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
Parse the HTML code using htmlTree.
tree = htmlTree(code);
View the HTML element name of the tree.
tree.Name
ans = "HTML"
View the child elements of the tree. The children are subtrees of tree.
tree.Children
ans =
  4×1 htmlTree:
    " "
    Text Analytics Toolbox Documentation
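The same parsing steps can be tried without network access by passing a literal string to htmlTree. The following is a minimal sketch: the page content is hypothetical and written only for illustration, and the variable names `children` and `bodyText` are not part of the original example.

```matlab
% Hypothetical page content, written inline so no network access is needed.
code = "<html><head><title>Demo page</title></head>" + ...
    "<body><h1>Heading</h1><p>First paragraph.</p></body></html>";

tree = htmlTree(code);      % parse the string into an htmlTree

disp(tree.Name)             % element name of the root node

% List the element name of each direct child of the root.
children = tree.Children;
for k = 1:numel(children)
    disp(children(k).Name)
end

% extractHTMLText accepts a parsed tree as well as raw HTML code.
bodyText = extractHTMLText(tree);
```

Working from a literal string keeps the result independent of changes to the live documentation page, which is convenient when experimenting with the tree structure.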
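Find the hyperlink elements in the tree using findElement, then create a word cloud from the text of the hyperlinks.
% Hyperlinks are elements with the name "A".
subtrees = findElement(tree,"A");
str = extractHTMLText(subtrees);
figure
wordcloud(str);
title("Hyperlinks")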
Get HTML Attributes
Get the class attributes from the paragraph elements in the HTML tree.
subtrees = findElement(tree,'p');
attr = "class";
str = getAttribute(subtrees,attr)
str = 21×1 string array
    "add_margin_5"
    "category_desc"
    "category_desc"
    "category_desc"
    "category_desc"
    "text-center"
    "copyright"
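The same pattern applies to other attributes. As a sketch, the snippet below collects the href attribute of each hyperlink together with its link text. The HTML string is hypothetical, and the names `links`, `urls`, and `labels` are introduced here for illustration only.

```matlab
% Hypothetical HTML fragment with two hyperlinks (illustration only).
code = "<html><body>" + ...
    "<p>See <a href='https://www.mathworks.com/help/textanalytics'>the docs</a> " + ...
    "and <a href='https://www.mathworks.com'>the home page</a>.</p>" + ...
    "</body></html>";

tree = htmlTree(code);

% Select the hyperlink elements, then read one attribute from each.
links  = findElement(tree,"A");
urls   = getAttribute(links,"href");    % link targets
labels = extractHTMLText(links);        % visible link text
```

Reading the attribute and the text from the same set of subtrees keeps each URL paired with its label.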
Create a word cloud from the text contained in paragraph elements with class "category_desc".
subtrees = findElement(tree,'p.category_desc');
str = extractHTMLText(subtrees);
figure
wordcloud(str);
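To see what the element.class selector keeps and what it skips, it can help to run it on a small input. This is a minimal sketch under stated assumptions: the HTML string and the class name "note" are hypothetical and do not come from the page used above.

```matlab
% Hypothetical fragment: two paragraphs carry class "note", one does not.
code = "<html><body>" + ...
    "<p class='note'>Keep this sentence.</p>" + ...
    "<p>Skip this one.</p>" + ...
    "<p class='note'>Keep this one too.</p>" + ...
    "</body></html>";

tree = htmlTree(code);

% "p.note" selects paragraph elements whose class attribute is "note".
notes = findElement(tree,"p.note");
str = extractHTMLText(notes);
```

Only the two paragraphs with the matching class are selected, so the unclassed paragraph contributes no text to `str`.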
See Also
tokenizedDocument