extract summary from documents

since r2020a

syntax

summary = extractsummary(documents)

[summary,scores] = extractsummary(documents)

[summary,scores] = extractsummary(documents,name,value)

description

summary = extractsummary(documents) chooses a subset of the input documents to serve as a summary, and returns them as a tokenizeddocument array.

[summary,scores] = extractsummary(documents) also returns the importance scores used for selecting the summary documents. in this case, scores(i) represents the score for summary(i).

[summary,scores] = extractsummary(documents,name,value) specifies additional options using one or more name-value pair arguments.

examples

summarize documents

create an array of tokenized documents.

str = [
    "the quick brown fox jumped over the lazy dog."
    "the fox jumped over the dog."
    "the lazy dog saw a fox jumping."
    "there seem to be animals jumping other animals."
    "there are quick animals and lazy animals"];
documents = tokenizeddocument(str);

extract a summary of the documents using the extractsummary function. the function, by default, chooses 1/10 of the input documents, rounding up.

summary = extractsummary(documents)

summary = 
  tokenizeddocument:
   10 tokens: the quick brown fox jumped over the lazy dog .

to specify a larger summary, use the 'summarysize' option. extract a three-document summary.

summary = extractsummary(documents,'summarysize',3)

summary = 
  3x1 tokenizeddocument:
    10 tokens: the quick brown fox jumped over the lazy dog .
     7 tokens: the fox jumped over the dog .
     9 tokens: there seem to be animals jumping other animals .

evaluate document importance

create an array of tokenized documents.

str = [
    "the quick brown fox jumped over the lazy dog."
    "the fox jumped over the dog."
    "the lazy dog saw a fox jumping."
    "there seem to be animals jumping over other animals."
    "there are quick animals and lazy animals"];
documents = tokenizeddocument(str);

extract a three-document summary. the second output scores contains the summary document importance scores.

[summary,scores] = extractsummary(documents,'summarysize',3)

summary = 
  3x1 tokenizeddocument:
    10 tokens: the quick brown fox jumped over the lazy dog .
    10 tokens: there seem to be animals jumping over other animals .
     7 tokens: the fox jumped over the dog .

scores = 3×1
    0.2426
    0.2174
    0.1911

visualize the scores in a bar chart.

figure
bar(scores)
xlabel("summary document")
ylabel("score")
title("summary document importance")

[figure: bar chart titled "summary document importance", with x-axis "summary document" and y-axis "score".]

sentence level summarization

to summarize a single document, split the document into an array of sentences, and use the extractsummary function.

create a string scalar containing the document.

str = ...
    "there is a quick fox. the fox is brown. there is a dog which " + ...
    "is lazy. the dog is very lazy. the fox jumped over the dog. " + ...
    "the quick brown fox jumped over the lazy dog.";

split the string into sentences using the splitsentences function.

str = splitsentences(str)

str = 6x1 string
    "there is a quick fox."
    "the fox is brown."
    "there is a dog which is lazy."
    "the dog is very lazy."
    "the fox jumped over the dog."
    "the quick brown fox jumped over the lazy dog."

create a tokenized document array containing the sentences.

documents = tokenizeddocument(str)

documents = 
  6x1 tokenizeddocument:
     6 tokens: there is a quick fox .
     5 tokens: the fox is brown .
     8 tokens: there is a dog which is lazy .
     6 tokens: the dog is very lazy .
     7 tokens: the fox jumped over the dog .
    10 tokens: the quick brown fox jumped over the lazy dog .

extract a summary from the sentences using the extractsummary function. to return a summary with three documents, set the 'summarysize' option to 3. to ensure the summary documents appear in the same order as the input documents, set the 'orderby' option to 'position'.

summary = extractsummary(documents,'summarysize',3,'orderby','position')

summary = 
  3x1 tokenizeddocument:
     6 tokens: there is a quick fox .
     7 tokens: the fox jumped over the dog .
    10 tokens: the quick brown fox jumped over the lazy dog .

to reconstruct the sentences into a single document, convert the documents to string using the joinwords function and join the sentences using the join function.

sentences = joinwords(summary);
summarystr = join(sentences)

summarystr = 
"there is a quick fox . the fox jumped over the dog . the quick brown fox jumped over the lazy dog ."

to remove the spaces around punctuation characters, use the replace function.

punctuationright = ["." "," "’" ")" ":" "?" "!"];
summarystr = replace(summarystr," " + punctuationright,punctuationright);
punctuationleft = ["(" "‘"];
summarystr = replace(summarystr,punctuationleft + " ",punctuationleft)

summarystr = 
"there is a quick fox. the fox jumped over the dog. the quick brown fox jumped over the lazy dog."

input arguments

`documents` — input documents
`tokenizeddocument` array

input documents, specified as a tokenizeddocument array.

name-value arguments

specify optional pairs of arguments as name1=value1,...,namen=valuen, where name is the argument name and value is the corresponding value. name-value arguments must appear after other arguments, but the order of the pairs does not matter.

before r2021a, use commas to separate each name and value, and enclose name in quotes.

example: extractsummary(documents,'scoringmethod','lexrank') extracts a summary from documents and sets the scoring method option to 'lexrank'.

`scoringmethod` — scoring method
`'textrank'` (default) | `'lexrank'` | `'mmr'`

scoring method used for extractive summarization, specified as the comma-separated pair consisting of 'scoringmethod' and one of the following:

'textrank' – use the textrank algorithm.
'lexrank' – use the lexrank algorithm.
'mmr' – use the mmr algorithm.

`query` — query document for mmr scoring
`tokenizeddocument` scalar | string array | cell array of character vectors

query document for mmr scoring, specified as the comma-separated pair consisting of 'query' and a tokenizeddocument scalar, a string array of words, or a cell array of character vectors. if 'query' is not a tokenizeddocument scalar, then it must be a row vector representing a single document, where each element is a word.

this option only has an effect when 'scoringmethod' is 'mmr'.
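
as a minimal sketch of the two options above, the following extracts a two-document summary with mmr scoring and a query given as a row vector of words. the query words and variable names are illustrative assumptions. function and option names are written with the toolbox capitalization (`extractSummary`, `tokenizedDocument`) because matlab function names are case-sensitive. which documents are selected depends on the query, so no output is shown.

```matlab
% documents to summarize (toy data in the style of the examples above)
str = [
    "the quick brown fox jumped over the lazy dog."
    "the fox jumped over the dog."
    "the lazy dog saw a fox jumping."
    "there seem to be animals jumping over other animals."
    "there are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% query as a 1-by-2 string array, one word per element (illustrative choice)
queryWords = ["lazy" "dog"];

% 'Query' takes effect here because 'ScoringMethod' is 'MMR'
[summary,scores] = extractSummary(documents, ...
    'ScoringMethod','MMR', ...
    'Query',queryWords, ...
    'SummarySize',2);
```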

`summarysize` — size of summary
0.1 (default) | scalar in the range (0,1) | positive integer | `inf`

size of summary, specified as the comma-separated pair consisting of 'summarysize' and one of the following:

scalar in the range (0,1) – extract the specified proportion of input documents, rounding up. in this case, the number of summary documents is ceil(summarysize*numdocuments), where numdocuments is the number of input documents.
positive integer – extract a summary with the specified number of documents. if summarysize is greater than or equal to the number of input documents, then the function returns the input documents sorted according to the 'orderby' option.
inf – return the input documents sorted according to the 'orderby' option.
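
the rounding rule for a proportional summary size can be checked with plain arithmetic. this sketch only mirrors the ceil(summarysize*numdocuments) expression given in the list above and does not call the toolbox; the variable names are illustrative.

```matlab
numDocuments = 5;      % number of input documents
summarySize  = 0.1;    % default proportion

% proportion in (0,1): round up to a whole number of documents
numSummary = ceil(summarySize*numDocuments);    % ceil(0.5) = 1

% a larger proportion on the same input
numSummaryHalf = ceil(0.5*numDocuments);        % ceil(2.5) = 3
```

this agrees with the first example, where five input documents yield a one-document summary by default.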

data types: double

`orderby` — order of documents in summary
`'score'` (default) | `'position'`

order of documents in summary, specified as the comma-separated pair consisting of 'orderby' and one of the following:

'score' – order documents by their score according to the 'scoringmethod' option.
'position' – maintain the document order from the input.
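
a short sketch of how 'orderby' interacts with a 'summarysize' of inf: every input document is returned, either ranked by score or kept in input order. it assumes toy documents in the style of the examples, and the variable names are illustrative. function and option names use the toolbox capitalization because matlab function names are case-sensitive.

```matlab
str = [
    "the quick brown fox jumped over the lazy dog."
    "the fox jumped over the dog."
    "the lazy dog saw a fox jumping."
    "there seem to be animals jumping over other animals."
    "there are quick animals and lazy animals"];
documents = tokenizedDocument(str);

% all documents, ordered by score (the default 'OrderBy' value)
[ranked,rankedScores] = extractSummary(documents,'SummarySize',Inf);

% all documents, kept in their input order; the scores follow that order
[inOrder,inOrderScores] = extractSummary(documents, ...
    'SummarySize',Inf,'OrderBy','position');
```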

output arguments

`summary` — extracted summary
`tokenizeddocument` array

extracted summary, returned as a tokenizeddocument array. the summary is a subset of documents, and is sorted according to the 'orderby' option.

`scores` — summary document scores
vector

summary document scores, returned as a vector, where scores(i) is the score of the ith summary document according to the 'scoringmethod' option. the scores are sorted according to the 'orderby' option.
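
because scores(i) belongs to summary(i), the two outputs can be paired row by row. the sketch below collects them in a table; the table and its variable names are illustrative assumptions, not part of the function's output. function names use the toolbox capitalization because matlab function names are case-sensitive.

```matlab
str = [
    "the quick brown fox jumped over the lazy dog."
    "the fox jumped over the dog."
    "the lazy dog saw a fox jumping."
    "there seem to be animals jumping over other animals."
    "there are quick animals and lazy animals"];
documents = tokenizedDocument(str);

[summary,scores] = extractSummary(documents,'SummarySize',3);

% reshape both to columns so that row i holds summary(i) and scores(i)
documentText = reshape(joinWords(summary),[],1);
score = reshape(scores,[],1);
results = table(documentText,score);
```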

version history

introduced in r2020a
