Parse HTML and Extract Text Content

This example shows how to parse HTML code and extract the text content from particular elements.

Parse HTML Code

Read HTML code from the URL https://www.mathworks.com/help/textanalytics using webread.

url = "https://www.mathworks.com/help/textanalytics";
code = webread(url);
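
Network requests can fail or stall, so it can help to set a timeout and handle errors. The following is a minimal sketch and not part of the original example: the 15-second timeout and the fallback to an empty string are illustrative assumptions.

```matlab
url = "https://www.mathworks.com/help/textanalytics";

% Assumed settings: 15-second timeout, return the response body as text.
options = weboptions("Timeout",15,"ContentType","text");

try
    code = webread(url,options);
catch err
    % Fall back to an empty string so later steps can detect the failure.
    warning("Could not read %s: %s",url,err.message);
    code = "";
end
```

If code is empty after this step, skip the parsing steps rather than passing an empty string on.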

Parse the HTML code using htmlTree.

tree = htmlTree(code);
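
To experiment without a network connection, the same parsing step can be tried on a small HTML string. This is a sketch under stated assumptions: the HTML snippet and the variable names demoCode, demoTree, and demoText are hypothetical and do not come from the MathWorks page.

```matlab
% Hypothetical inline HTML snippet for offline experimentation.
demoCode = "<html><head><title>Demo</title></head>" + ...
    "<body><p class='intro'>Hello <a href='https://example.com'>world</a></p>" + ...
    "</body></html>";

% Parse the string into an htmlTree object (requires Text Analytics Toolbox).
demoTree = htmlTree(demoCode);

% Extract the text content of the whole tree.
demoText = extractHTMLText(demoTree);
disp(demoText)
```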

View the HTML element name of the tree.

tree.Name
ans = 
"HTML"

View the child elements of the tree. The children are subtrees of tree.

tree.Children
ans = 
  4×1 htmlTree:
    " "
    Text Analytics Toolbox Documentation
    [remaining entries not recoverable]

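Find the hyperlink elements in the tree using findElement, extract their text using extractHTMLText, and create a word cloud from the text of the hyperlinks.

subtrees = findElement(tree,"a");   % hyperlinks use the "a" tag
str = extractHTMLText(subtrees);
figure
wordcloud(str);
title("Hyperlinks")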
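
The hyperlink step can also be tried on a small, self-contained input. The sketch below uses a hypothetical HTML snippet with two links (the URLs and variable names are assumptions for illustration) and reads both the link text and the href attribute of each link.

```matlab
% Hypothetical HTML snippet containing two hyperlinks.
demoCode = "<html><body>" + ...
    "<a href='https://example.com/a'>First link</a>" + ...
    "<a href='https://example.com/b'>Second link</a>" + ...
    "</body></html>";
demoTree = htmlTree(demoCode);

% Select every "a" element in the tree.
links = findElement(demoTree,"a");

% Text shown for each link, and the target each link points to.
linkText = extractHTMLText(links);
linkTarget = getAttribute(links,"href");

disp([linkText linkTarget])
```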

Get HTML Attributes

Get the class attributes from the paragraph elements in the HTML tree.

subtrees = findElement(tree,'p');
attr = "class";
str = getAttribute(subtrees,attr)
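str = 21×1 string array
    <missing>
    <missing>
    "add_margin_5"
    <missing>
    <missing>
    <missing>
    <missing>
    <missing>
    "category_desc"
    "category_desc"
    "category_desc"
    "category_desc"
    <missing>
    <missing>
    <missing>
    "text-center"
    <missing>
    <missing>
    <missing>
    "copyright"
    <missing>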
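
Not every paragraph element has a class attribute, so some entries carry no value. A minimal sketch of filtering them out follows. It assumes that getAttribute returns a missing (or empty) string for elements without the attribute, and it uses a hypothetical HTML snippet and hypothetical variable names.

```matlab
% Hypothetical HTML snippet: one paragraph with a class, one without.
demoCode = "<html><body>" + ...
    "<p class='category_desc'>Described</p>" + ...
    "<p>Plain</p>" + ...
    "</body></html>";
demoTree = htmlTree(demoCode);

paragraphs = findElement(demoTree,"p");
classes = getAttribute(paragraphs,"class");

% Keep only entries that are neither missing nor empty.
hasClass = ~ismissing(classes) & classes ~= "";
definedClasses = classes(hasClass)
```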

Create a word cloud from the text contained in paragraph elements with class "category_desc".

subtrees = findElement(tree,'p.category_desc');
str = extractHTMLText(subtrees);
figure
wordcloud(str);
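
The selector 'p.category_desc' combines a tag name with a class name, so it matches only paragraphs of that class. The sketch below illustrates the difference from the plain 'p' selector on a hypothetical HTML snippet. The paragraph contents and variable names are assumptions for illustration.

```matlab
% Hypothetical HTML snippet with three paragraphs, one of class category_desc.
demoCode = "<html><body>" + ...
    "<p class='category_desc'>Analyze text data</p>" + ...
    "<p class='text-center'>Footer note</p>" + ...
    "<p>Unclassified paragraph</p>" + ...
    "</body></html>";
demoTree = htmlTree(demoCode);

% The tag selector matches every paragraph element.
allParagraphs = findElement(demoTree,"p");

% The tag.class selector matches only paragraphs with class category_desc.
described = findElement(demoTree,"p.category_desc");
describedText = extractHTMLText(described);

fprintf("%d paragraphs, %d with class category_desc\n", ...
    numel(allParagraphs),numel(described));
```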
