extractSummary
Extract summary from documents
Since R2020a
Syntax
summary = extractSummary(documents)
[summary,scores] = extractSummary(documents,Name,Value)
Description
[summary,scores] = extractSummary(documents,Name,Value) specifies additional options using one or more name-value pair arguments.
Examples
Summarize Documents
Create an array of tokenized documents.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
Extract a summary of the documents using the extractSummary function. By default, the function chooses 1/10 of the input documents, rounding up.
summary = extractSummary(documents)
summary = 
  tokenizedDocument:
   10 tokens: The quick brown fox jumped over the lazy dog .
To specify a larger summary, use the 'SummarySize' option. Extract a three-document summary.
summary = extractSummary(documents,'SummarySize',3)
summary = 
  3x1 tokenizedDocument:
   10 tokens: The quick brown fox jumped over the lazy dog .
    7 tokens: The fox jumped over the dog .
    9 tokens: There seem to be animals jumping other animals .
Evaluate Document Importance
Create an array of tokenized documents.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);
Extract a three-document summary. The second output, scores, contains the importance scores of the summary documents.
[summary,scores] = extractSummary(documents,'SummarySize',3)
summary = 
  3x1 tokenizedDocument:
   10 tokens: The quick brown fox jumped over the lazy dog .
   10 tokens: There seem to be animals jumping over other animals .
    7 tokens: The fox jumped over the dog .
scores = 3×1
    0.2426
    0.2174
    0.1911
Visualize the scores in a bar chart.
figure
bar(scores)
xlabel("Summary Document")
ylabel("Score")
title("Summary Document Importance")
Sentence Level Summarization
To summarize a single document, split the document into an array of sentences, and use the extractSummary function.
Create a string scalar containing the document.
str = ...
    "There is a quick fox. The fox is brown. There is a dog which " + ...
    "is lazy. The dog is very lazy. The fox jumped over the dog. " + ...
    "The quick brown fox jumped over the lazy dog.";
Split the string into sentences using the splitSentences function.
str = splitSentences(str)
str = 6x1 string
    "There is a quick fox."
    "The fox is brown."
    "There is a dog which is lazy."
    "The dog is very lazy."
    "The fox jumped over the dog."
    "The quick brown fox jumped over the lazy dog."
Create a tokenized document array containing the sentences.
documents = tokenizedDocument(str)
documents = 
  6x1 tokenizedDocument:
    6 tokens: There is a quick fox .
    5 tokens: The fox is brown .
    8 tokens: There is a dog which is lazy .
    6 tokens: The dog is very lazy .
    7 tokens: The fox jumped over the dog .
   10 tokens: The quick brown fox jumped over the lazy dog .
Extract a summary from the sentences using the extractSummary function. To return a summary with three documents, set the 'SummarySize' option to 3. To ensure the summary documents appear in the same order as the input documents, set the 'OrderBy' option to 'position'.
summary = extractSummary(documents,'SummarySize',3,'OrderBy','position')
summary = 
  3x1 tokenizedDocument:
    6 tokens: There is a quick fox .
    7 tokens: The fox jumped over the dog .
   10 tokens: The quick brown fox jumped over the lazy dog .
To reconstruct the sentences into a single document, convert the documents to strings using the joinWords function and join the sentences using the join function.
sentences = joinWords(summary);
summaryStr = join(sentences)
summaryStr = "There is a quick fox . The fox jumped over the dog . The quick brown fox jumped over the lazy dog ."
To remove the spaces surrounding the punctuation characters, use the replace function.
punctuationRight = ["." "," "’" ")" ":" "?" "!"];
summaryStr = replace(summaryStr," " + punctuationRight,punctuationRight);
punctuationLeft = ["(" "‘"];
summaryStr = replace(summaryStr,punctuationLeft + " ",punctuationLeft)
summaryStr = "There is a quick fox. The fox jumped over the dog. The quick brown fox jumped over the lazy dog."
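The steps of this example can be collected into one reusable function. The following is a minimal sketch, assuming Text Analytics Toolbox is installed; summarizeText is a hypothetical helper name introduced here for illustration, not a toolbox function, and it only reattaches right-hand punctuation.

```matlab
% Requires Text Analytics Toolbox.
% summarizeText is a hypothetical helper written for this sketch.
str = "There is a quick fox. The fox is brown. There is a dog which " + ...
    "is lazy. The dog is very lazy. The fox jumped over the dog. " + ...
    "The quick brown fox jumped over the lazy dog.";
summaryStr = summarizeText(str,3);
disp(summaryStr)

function summaryStr = summarizeText(str,numSentences)
    % Split the text into sentences and tokenize each sentence.
    documents = tokenizedDocument(splitSentences(str));

    % Select the highest-scoring sentences, keeping the input order.
    summary = extractSummary(documents, ...
        'SummarySize',numSentences,'OrderBy','position');

    % Join tokens into sentences, then sentences into one string.
    summaryStr = join(joinWords(summary));

    % Remove the space that tokenization leaves before punctuation.
    punctuationRight = ["." "," ")" ":" "?" "!"];
    summaryStr = replace(summaryStr," " + punctuationRight,punctuationRight);
end
```

Because the function is defined at the end of the file, this runs as a script in releases that support local functions in scripts.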
Input Arguments
documents — Input documents
tokenizedDocument array
Input documents, specified as a tokenizedDocument array.
Name-Value Arguments
Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.
Before R2021a, use commas to separate each name and value, and enclose Name in quotes.
Example: extractSummary(documents,'ScoringMethod','lexrank') extracts a summary from documents and sets the scoring method option to 'lexrank'.
ScoringMethod — Scoring method
'textrank' (default) | 'lexrank' | 'mmr'
Scoring method used for extractive summarization, specified as the comma-separated pair consisting of 'ScoringMethod' and one of the following:
'textrank' – Use the TextRank algorithm.
'lexrank' – Use the LexRank algorithm.
'mmr' – Use the MMR algorithm.
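To see how the choice of scoring method affects the result, you can run the same documents through more than one method. The following is a minimal sketch, assuming Text Analytics Toolbox is installed. It compares only 'textrank' and 'lexrank', because MMR scoring is tied to the separate 'Query' option. The variable names are illustrative.

```matlab
% Requires Text Analytics Toolbox.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% Extract a two-document summary with each scoring method.
scoringMethods = {'textrank','lexrank'};
for k = 1:numel(scoringMethods)
    method = scoringMethods{k};
    [summary,scores] = extractSummary(documents, ...
        'ScoringMethod',method,'SummarySize',2);

    fprintf("Scoring method: %s\n",method);
    disp(joinWords(summary))   % summary documents as strings
    disp(scores)               % one score per summary document
end
```

The methods may select different documents or assign different scores, so the printed results are not expected to match between iterations.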
Query — Query document for MMR scoring
tokenizedDocument scalar | string array | cell array of character vectors
Query document for MMR scoring, specified as the comma-separated pair consisting of 'Query' and a tokenizedDocument scalar, a string array of words, or a cell array of character vectors. If 'Query' is not a tokenizedDocument scalar, then it must be a row vector representing a single document, where each element is a word.
This option only has an effect when 'ScoringMethod' is 'mmr'.
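The two accepted query forms can be sketched as follows, assuming Text Analytics Toolbox is installed. The query words "lazy" and "dog" are an arbitrary choice for illustration. This page does not state how 'mmr' behaves when no query is given, so the sketch always passes one.

```matlab
% Requires Text Analytics Toolbox.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% Query given as a row vector of words (string array form).
queryWords = ["lazy" "dog"];
[summary,scores] = extractSummary(documents, ...
    'ScoringMethod','mmr','Query',queryWords,'SummarySize',2);

% The same two words given as a tokenizedDocument scalar.
queryDoc = tokenizedDocument("lazy dog");
summaryFromDoc = extractSummary(documents, ...
    'ScoringMethod','mmr','Query',queryDoc,'SummarySize',2);

disp(joinWords(summary))
disp(joinWords(summaryFromDoc))
```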
SummarySize — Size of summary
0.1 (default) | scalar in the range (0,1) | positive integer | Inf
Size of summary, specified as the comma-separated pair consisting of 'SummarySize' and one of the following:
Scalar in the range (0,1) – Extract the specified proportion of input documents, rounding up. In this case, the number of summary documents is ceil(SummarySize*numDocuments), where numDocuments is the number of input documents.
Positive integer – Extract a summary with the specified number of documents. If SummarySize is greater than or equal to the number of input documents, then the function returns the input documents sorted according to the 'OrderBy' option.
Inf – Return the input documents sorted according to the 'OrderBy' option.
Data Types: double
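The three cases above reduce to simple arithmetic. The following sketch uses only base MATLAB and does not call extractSummary. It mirrors the rules as described here, and the variable names are illustrative.

```matlab
% Number of summary documents implied by several SummarySize values,
% for a collection of five input documents.
numDocuments = 5;
sizes = [0.1 0.5 3 10 Inf];

numSummary = zeros(size(sizes));
for k = 1:numel(sizes)
    s = sizes(k);
    if s < 1
        % Proportion in (0,1): take that fraction, rounding up.
        numSummary(k) = ceil(s*numDocuments);
    else
        % Positive integer or Inf: capped at the number of documents.
        numSummary(k) = min(s,numDocuments);
    end
end

disp(numSummary)   % numSummary is [1 3 3 5 5]
```

The first entry matches the default behavior: with five documents and a SummarySize of 0.1, ceil(0.5) gives a one-document summary.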
OrderBy — Order of documents in summary
'score' (default) | 'position'
Order of documents in summary, specified as the comma-separated pair consisting of 'OrderBy' and one of the following:
'score' – Order documents by their score according to the 'ScoringMethod' option.
'position' – Maintain the document order from the input.
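The two orderings can be compared side by side. The following is a minimal sketch, assuming Text Analytics Toolbox is installed. It also assumes that 'OrderBy' changes only the order of the summary and not which documents are selected, which the description above suggests but does not state outright.

```matlab
% Requires Text Analytics Toolbox.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% Same summary size, two different orderings.
byScore    = extractSummary(documents,'SummarySize',3,'OrderBy','score');
byPosition = extractSummary(documents,'SummarySize',3,'OrderBy','position');

disp(joinWords(byScore))
disp(joinWords(byPosition))

% Check whether both summaries contain the same documents once the
% order is ignored (expected under the assumption stated above).
sameSet = isequal(sort(joinWords(byScore)),sort(joinWords(byPosition)));
disp(sameSet)
```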
Output Arguments
summary — Extracted summary
tokenizedDocument array
Extracted summary, returned as a tokenizedDocument array. The summary is a subset of documents, and is sorted according to the 'OrderBy' option.
scores — Summary document scores
vector
Summary document scores, returned as a vector, where scores(i) is the score of the ith summary document according to the 'ScoringMethod' option. The scores are sorted according to the 'OrderBy' option.
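Because scores(i) lines up with the ith summary document, the scores can be used to filter the summary. The following is a minimal sketch, assuming Text Analytics Toolbox is installed. The mean-score threshold is an arbitrary choice for illustration, not a rule from this page.

```matlab
% Requires Text Analytics Toolbox.
str = [
    "The quick brown fox jumped over the lazy dog."
    "The fox jumped over the dog."
    "The lazy dog saw a fox jumping."
    "There seem to be animals jumping over other animals."
    "There are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% Return every input document together with its score.
[summary,scores] = extractSummary(documents,'SummarySize',Inf);

% Keep only the documents scoring at least the mean score
% (illustrative threshold).
keep = scores >= mean(scores);
topDocuments = summary(keep);

disp(joinWords(topDocuments))
disp(scores(keep))
```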
Version History
Introduced in R2020a
See Also
tokenizedDocument | bm25Similarity | textrankScores