Document similarities with BM25 algorithm

Since R2020a

Description

Use bm25Similarity to calculate document similarities.

By default, this function calculates BM25 similarities. To calculate BM11, BM15, or BM25+ similarities, use the 'DocumentLengthScaling' and 'DocumentLengthCorrection' arguments.
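The four variants named above can be selected as in the following minimal sketch. It assumes Text Analytics Toolbox is installed; the two short documents are made up for illustration, and the value 1 for 'DocumentLengthCorrection' is only an example of a nonzero setting.

```matlab
% Two illustrative documents (hypothetical data).
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the lazy dog sat there and did nothing"]);

% BM25: default options.
simBM25 = bm25Similarity(documents);

% BM11: document length scaling factor b = 1.
simBM11 = bm25Similarity(documents,'DocumentLengthScaling',1);

% BM15: document length scaling factor b = 0.
simBM15 = bm25Similarity(documents,'DocumentLengthScaling',0);

% BM25+: any nonzero document length correction factor.
simBM25Plus = bm25Similarity(documents,'DocumentLengthCorrection',1);
```

Each call returns a 2-by-2 sparse matrix because two documents are compared pairwise.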

similarities = bm25Similarity(documents) returns the pairwise BM25 similarities between the specified documents. The score in similarities(i,j) represents the similarity between documents(i) and documents(j).

similarities = bm25Similarity(documents,queries) returns similarities between documents and queries. The score in similarities(i,j) represents the similarity between documents(i) and queries(j).

similarities = bm25Similarity(bag) returns similarities between the documents encoded by the specified bag-of-words or bag-of-n-grams model. The score in similarities(i,j) represents the similarity between the ith and jth documents encoded by bag.

similarities = bm25Similarity(bag,queries) returns similarities between the documents encoded by the bag-of-words or bag-of-n-grams model bag and the documents specified by queries. The score in similarities(i,j) represents the similarity between the ith document encoded by bag and queries(j).

similarities = bm25Similarity(___,Name,Value) specifies additional options using one or more name-value pair arguments. For instance, to use the BM25+ algorithm, set the 'DocumentLengthCorrection' option to a nonzero value.

Examples

Create an array of tokenized documents.

textdata = [
    "the quick brown fox jumped over the lazy dog"
    "the fast brown fox jumped over the lazy dog"
    "the lazy dog sat there and did nothing"
    "the other animals sat there watching"];
documents = tokenizedDocument(textdata)
documents = 
  4x1 tokenizedDocument:
    9 tokens: the quick brown fox jumped over the lazy dog
    9 tokens: the fast brown fox jumped over the lazy dog
    8 tokens: the lazy dog sat there and did nothing
    6 tokens: the other animals sat there watching

Calculate the similarities between them using the bm25Similarity function. The output is a sparse matrix.

similarities = bm25Similarity(documents);

Visualize the similarities of the documents in a heat map.

figure
heatmap(similarities);
xlabel("Document")
ylabel("Document")
title("BM25 Similarities")

[Figure: heat map titled "BM25 Similarities".]

The first three documents have the highest pairwise similarities, which indicates that these documents are most similar. The last document has comparatively low pairwise similarities with the other documents, which indicates that it is less like them.

Create an array of input documents.

str = [
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"];
documents = tokenizedDocument(str)
documents = 
  4x1 tokenizedDocument:
    9 tokens: the quick brown fox jumped over the lazy dog
    8 tokens: the fast fox jumped over the lazy dog
    7 tokens: the dog sat there and did nothing
    6 tokens: the other animals sat there watching

Create an array of query documents.

str = [
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"];
queries = tokenizedDocument(str)
queries = 
  2x1 tokenizedDocument:
    8 tokens: a brown fox leaped over the lazy dog
    6 tokens: another fox leaped over the dog

Calculate the similarities between input documents and query documents using the bm25Similarity function. The output is a sparse matrix. The score in similarities(i,j) represents the similarity between documents(i) and queries(j).

similarities = bm25Similarity(documents,queries);

Visualize the similarities of the documents in a heat map.

figure
heatmap(similarities);
xlabel("Query Document")
ylabel("Input Document")
title("BM25 Similarities")

[Figure: heat map titled "BM25 Similarities".]

In this case, the first input document is most like the first query document.
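Reading the best match off the heat map can also be done programmatically. The following sketch, which assumes Text Analytics Toolbox and repeats the example data so that it runs on its own, takes the column-wise maximum of the similarity matrix to find the highest-scoring input document for each query. The variable names bestScore and bestIdx are illustrative.

```matlab
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"]);
queries = tokenizedDocument([
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"]);

% Rows correspond to input documents, columns to query documents.
similarities = bm25Similarity(documents,queries);

% Convert the sparse result to a full matrix and take the maximum
% down each column, that is, over the input documents.
[bestScore,bestIdx] = max(full(similarities),[],1);

for j = 1:numel(bestIdx)
    fprintf("Query %d: best input document is %d (score %.3f)\n", ...
        j,bestIdx(j),bestScore(j));
end
```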

Create a bag-of-words model from the text data in sonnets.csv.

filename = "sonnets.csv";
tbl = readtable(filename,'TextType','string');
textdata = tbl.Sonnet;
documents = tokenizedDocument(textdata);
bag = bagOfWords(documents)
bag = 
  bagOfWords with properties:
          Counts: [154x3527 double]
      Vocabulary: ["from"    "fairest"    "creatures"    "we"    "desire"    "increase"    ","    "that"    "thereby"    "beauty's"    "rose"    "might"    "never"    "die"    "but"    "as"    "the"    "riper"    "should"    "by"    "time"    ...    ]
        NumWords: 3527
    NumDocuments: 154

Calculate similarities between the sonnets using the bm25Similarity function. The output is a sparse matrix.

similarities = bm25Similarity(bag);

Visualize the similarities between the first five documents in a heat map.

figure
heatmap(similarities(1:5,1:5));
xlabel("Document")
ylabel("Document")
title("BM25 Similarities")

The BM25+ algorithm addresses a limitation of the BM25 algorithm: the component of the term-frequency normalization by document length is not properly lower bounded. As a result of this limitation, long documents which do match the query term can often be scored unfairly by BM25 as having a similar relevance to shorter documents that do not contain the query term.

BM25+ addresses this limitation by using a document length correction factor (the value of the 'DocumentLengthCorrection' name-value pair). This factor prevents the algorithm from over-penalizing long documents.

Create two arrays of tokenized documents.

textdata1 = [
    "the quick brown fox jumped over the lazy dog"
    "the fast fox jumped over the lazy dog"
    "the dog sat there and did nothing"
    "the other animals sat there watching"];
documents1 = tokenizedDocument(textdata1)
documents1 = 
  4x1 tokenizedDocument:
    9 tokens: the quick brown fox jumped over the lazy dog
    8 tokens: the fast fox jumped over the lazy dog
    7 tokens: the dog sat there and did nothing
    6 tokens: the other animals sat there watching

textdata2 = [
    "a brown fox leaped over the lazy dog"
    "another fox leaped over the dog"];
documents2 = tokenizedDocument(textdata2)
documents2 = 
  2x1 tokenizedDocument:
    8 tokens: a brown fox leaped over the lazy dog
    6 tokens: another fox leaped over the dog

To calculate the BM25+ document similarities, use the bm25Similarity function and set the 'DocumentLengthCorrection' option to a nonzero value. In this case, set the 'DocumentLengthCorrection' option to 1.

similarities = bm25Similarity(documents1,documents2,'DocumentLengthCorrection',1);

Visualize the similarities of the documents in a heat map.

figure
heatmap(similarities);
xlabel("Query")
ylabel("Document")
title("BM25+ Similarities")

[Figure: heat map titled "BM25+ Similarities".]

Here, when compared with the earlier example of similarity between documents, the scores show more similarity between the input documents and the first query document.

Input Arguments

Input documents, specified as a tokenizedDocument array, a string array of words, or a cell array of character vectors. If documents is not a tokenizedDocument array, then it must be a row vector representing a single document, where each element is a word. To specify multiple documents, use a tokenizedDocument array.

Input bag-of-words or bag-of-n-grams model, specified as a bagOfWords object or a bagOfNgrams object. If bag is a bagOfNgrams object, then the function treats each n-gram as a single word.

Set of query documents, specified as one of the following:

  • A tokenizedDocument array

  • A bagOfWords or bagOfNgrams object

  • A 1-by-N string array representing a single document, where each element is a word

  • A 1-by-N cell array of character vectors representing a single document, where each element is a word

To compute term frequency and inverse document frequency statistics, the function encodes queries using a bag-of-words model. The model it uses depends on the syntax you call it with. If your syntax specifies the input argument documents, then it uses bagOfWords(documents). If your syntax specifies bag, then it uses bag.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: bm25Similarity(documents,'TFScaling',1.5) returns the pairwise similarities for the specified documents and sets the token frequency scaling factor to 1.5.

Method to compute inverse document frequency factor, specified as the comma-separated pair consisting of 'IDFWeight' and one of the following:

  • 'textrank' – Use TextRank IDF weighting [2]. For each term, set the IDF factor to

    • log((N-NT+0.5)/(NT+0.5)) if the term occurs in more than half of the documents, where N is the number of documents in the input data and NT is the number of documents in the input data containing each term.

    • IDFCorrection*avgIDF if the term occurs in half of the documents or fewer, where avgIDF is the average IDF of all tokens.

  • 'classic-bm25' – For each term, set the IDF factor to log((N-NT+0.5)/(NT+0.5)).

  • 'normal' – For each term, set the IDF factor to log(N/NT).

  • 'unary' – For each term, set the IDF factor to 1.

  • 'smooth' – For each term, set the IDF factor to log(1+N/NT).

  • 'max' – For each term, set the IDF factor to log(1+max(NT)/NT).

  • 'probabilistic' – For each term, set the IDF factor to log((N-NT)/NT).

where N is the number of documents in the input data and NT is the number of documents in the input data containing each term.
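To get a feel for how the closed-form weightings differ, the formulas can be evaluated directly in base MATLAB. The sketch below uses hypothetical values for the number of documents and the per-term document counts; it does not call bm25Similarity and leaves out the 'textrank' option.

```matlab
N  = 4;         % hypothetical number of documents
NT = [1 2 3];   % hypothetical number of documents containing each of three terms

idfClassic       = log((N - NT + 0.5) ./ (NT + 0.5));  % 'classic-bm25'
idfNormal        = log(N ./ NT);                       % 'normal'
idfUnary         = ones(size(NT));                     % 'unary'
idfSmooth        = log(1 + N ./ NT);                   % 'smooth'
idfMax           = log(1 + max(NT) ./ NT);             % 'max'
idfProbabilistic = log((N - NT) ./ NT);                % 'probabilistic'

% One row per weighting, one column per term.
disp([idfClassic; idfNormal; idfUnary; idfSmooth; idfMax; idfProbabilistic])
```

Note that the 'classic-bm25' expression is negative whenever a term occurs in more than half of the documents (here, the third term), because the numerator N-NT+0.5 is then smaller than the denominator NT+0.5.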

Term frequency scaling factor, specified as the comma-separated pair consisting of 'TFScaling' and a nonnegative scalar.

This option corresponds to the value k in the BM25 algorithm. For more information, see BM25.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Document length scaling factor, specified as the comma-separated pair consisting of 'DocumentLengthScaling' and a scalar in the range [0,1].

This option corresponds to the value b in the BM25 algorithm. When b=1, the BM25 algorithm is equivalent to BM11. When b=0, the BM25 algorithm is equivalent to BM15. For more information, see BM11, BM15, or BM25.

Data Types: double

Inverse document frequency correction factor, specified as the comma-separated pair consisting of 'IDFCorrection' and a nonnegative scalar.

This option only applies when 'IDFWeight' is 'textrank'.

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Document length correction factor, specified as the comma-separated pair consisting of 'DocumentLengthCorrection' and a nonnegative scalar.

This option corresponds to the value δ in the BM25+ algorithm. If the document length correction factor is nonzero, then the bm25Similarity function uses the BM25+ algorithm. Otherwise, the function uses the BM25 algorithm. For more information, see BM25+.

Data Types: double

Output Arguments

BM25 similarity scores, returned as a sparse matrix:

  • Given a single array of tokenized documents, similarities is an N-by-N nonsymmetric matrix, where similarities(i,j) represents the similarity between documents(i) and documents(j), and N is the number of input documents.

  • Given an array of tokenized documents and a set of query documents, similarities is an N1-by-N2 matrix, where similarities(i,j) represents the similarity between documents(i) and the jth query document, and N1 and N2 represent the number of documents in documents and queries, respectively.

  • Given a single bag-of-words or bag-of-n-grams model, similarities is a bag.NumDocuments-by-bag.NumDocuments nonsymmetric matrix, where similarities(i,j) represents the similarity between the ith and jth documents encoded by bag.

  • Given a bag-of-words or bag-of-n-grams model and a set of query documents, similarities is a bag.NumDocuments-by-N2 matrix, where similarities(i,j) represents the similarity between the ith document encoded by bag and the jth document in queries, and N2 corresponds to the number of documents in queries.

Tips

  • The BM25 algorithm aggregates and uses information from all the documents in the input data via the term frequency (TF) and inverse document frequency (IDF) based options. This behavior means that the same pair of documents can yield different BM25 similarity scores when the function is given different collections of documents.

  • The BM25 algorithm can output different scores when comparing documents to themselves. This behavior is due to the use of the IDF weights and the document length in the BM25 algorithm.
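Both tips can be checked on a small collection. The sketch below assumes Text Analytics Toolbox and uses made-up documents; it inspects the diagonal of the similarity matrix (each document compared with itself) and tests whether the matrix is symmetric. No particular output is claimed here, since the scores depend on the collection and on the option values.

```matlab
documents = tokenizedDocument([
    "the quick brown fox jumped over the lazy dog"
    "the lazy dog sat there and did nothing"
    "the other animals sat there watching"]);

% Convert the sparse result to a full matrix for inspection.
similarities = full(bm25Similarity(documents));

% Self-similarity scores: these need not be equal to each other.
selfScores = diag(similarities)

% Pairwise scores need not satisfy similarities(i,j) == similarities(j,i).
isSym = issymmetric(similarities)
```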

Algorithms

BM25

Given a document from a collection of documents D, and a query document, the BM25 score is given by

$$\text{BM25}(\text{document},\text{query};D)=\sum_{\text{word}\in\text{query}}\left(\text{IDF}(\text{word};D)\,\frac{\text{Count}(\text{word},\text{document})\,(k+1)}{\text{Count}(\text{word},\text{document})+k\left(1-b+b\,\frac{|\text{document}|}{\bar{n}}\right)}\right),$$

where

  • Count(word,document) denotes the frequency of word in document.

  • n̄ denotes the average document length in D.

  • k denotes the term frequency scaling factor (the value of the 'TFScaling' name-value pair argument). This factor dampens the influence of frequently appearing terms on the BM25 score.

  • b denotes the document length scaling factor (the value of the 'DocumentLengthScaling' name-value pair argument). This factor controls how the length of a document influences the BM25 score. When b=1, the BM25 algorithm is equivalent to BM11. When b=0, the BM25 algorithm is equivalent to BM15.

  • IDF(word;D) is the inverse document frequency of the specified word given the collection of documents D.
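The BM25 formula can be evaluated by hand in base MATLAB, which makes the role of each symbol concrete. The following is a minimal sketch, not the implementation used by bm25Similarity: the count matrix, the query, and the values of k and b are hypothetical, the IDF uses the 'normal' weighting log(N/NT), and each query word is assumed to be counted once in the sum.

```matlab
% Hypothetical word counts: rows are documents, columns are vocabulary words.
counts = [2 1 0;    % document 1
          0 1 3;    % document 2
          1 0 0];   % document 3
queryWords = [1 2]; % vocabulary indices of the words in the query

k = 1.2;            % term frequency scaling factor (assumed value)
b = 0.75;           % document length scaling factor (assumed value)

N      = size(counts,1);        % number of documents
NT     = sum(counts > 0,1);     % number of documents containing each word
idf    = log(N ./ NT);          % 'normal' IDF weighting
docLen = sum(counts,2);         % |document| for each document
avgLen = mean(docLen);          % average document length

scores = zeros(N,1);
for d = 1:N
    c = counts(d,queryWords);                   % Count(word,document)
    lengthNorm = 1 - b + b*docLen(d)/avgLen;    % document length normalization
    scores(d) = sum(idf(queryWords) .* c*(k+1) ./ (c + k*lengthNorm));
end
disp(scores)
```

Setting b to 1 or 0 in this sketch gives the BM11 and BM15 scores, respectively.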

BM25+

The BM25+ algorithm addresses a limitation of the BM25 algorithm: the component of the term-frequency normalization by document length is not properly lower bounded. As a result of this limitation, long documents which do match the query term can often be scored unfairly by BM25 as having a similar relevance to shorter documents that do not contain the query term.

The BM25+ algorithm is the same as the BM25 algorithm with one extra parameter. Given a document from a collection of documents D and a query document, the BM25+ score is given by

$$\text{BM25+}(\text{document},\text{query};D)=\sum_{\text{word}\in\text{query}}\left(\text{IDF}(\text{word};D)\left(\frac{\text{Count}(\text{word},\text{document})\,(k+1)}{\text{Count}(\text{word},\text{document})+k\left(1-b+b\,\frac{|\text{document}|}{\bar{n}}\right)}+\delta\right)\right),$$

where the extra parameter δ denotes the document length correction factor (the value of the 'DocumentLengthCorrection' name-value pair). This factor prevents the algorithm from over-penalizing long documents.
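The lower-bound issue can be seen numerically. The sketch below, in base MATLAB with hypothetical values for k, b, and δ, evaluates the term-frequency component for a word that occurs once in documents of increasing relative length. Without the correction factor the component shrinks toward zero as the document grows; with it, the component stays above δ.

```matlab
k = 1.2;  b = 0.75;  delta = 1;   % assumed values for illustration
c = 1;                            % the word occurs once in the document

% Document length relative to the average length, |document| / avgLen.
relLen = [1 10 100 1000];

% Term-frequency component of BM25 (the fraction in the formula).
bm25Term = c*(k+1) ./ (c + k*(1 - b + b*relLen));

% BM25+ adds the correction factor to the same component.
bm25PlusTerm = bm25Term + delta;

% Rows: relative length, BM25 component, BM25+ component.
disp([relLen; bm25Term; bm25PlusTerm])
```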

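BM11

BM11 is a special case of BM25 when b=1.

Given a document from a collection of documents D, and a query document, the BM11 score is given by

$$\text{BM11}(\text{document},\text{query};D)=\sum_{\text{word}\in\text{query}}\left(\text{IDF}(\text{word};D)\,\frac{\text{Count}(\text{word},\text{document})\,(k+1)}{\text{Count}(\text{word},\text{document})+k\,\frac{|\text{document}|}{\bar{n}}}\right).$$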

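BM15

BM15 is a special case of BM25 when b=0.

Given a document from a collection of documents D, and a query document, the BM15 score is given by

$$\text{BM15}(\text{document},\text{query};D)=\sum_{\text{word}\in\text{query}}\left(\text{IDF}(\text{word};D)\,\frac{\text{Count}(\text{word},\text{document})\,(k+1)}{\text{Count}(\text{word},\text{document})+k}\right).$$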
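Both special cases follow from substituting the value of b into the document length normalization term of the BM25 denominator. A short derivation, writing c as shorthand (introduced here) for the count of the word in the document:

```latex
\begin{aligned}
b=1:&\quad 1-b+b\,\frac{|\text{document}|}{\bar{n}}
      = \frac{|\text{document}|}{\bar{n}}
  &&\Longrightarrow\quad
  \frac{c\,(k+1)}{c+k\,\frac{|\text{document}|}{\bar{n}}}
  \quad\text{(BM11)},\\[4pt]
b=0:&\quad 1-b+b\,\frac{|\text{document}|}{\bar{n}}
      = 1
  &&\Longrightarrow\quad
  \frac{c\,(k+1)}{c+k}
  \quad\text{(BM15)}.
\end{aligned}
```

With b=1 the document length is fully normalized by the average length, and with b=0 the document length does not enter the score at all.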

References

[1] Robertson, Stephen, and Hugo Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends® in Information Retrieval 3, no. 4 (2009): 333-389.

[2] Barrios, Federico, Federico López, Luis Argerich, and Rosa Wachenchauzer. "Variations of the Similarity Function of TextRank for Automated Summarization." arXiv preprint arXiv:1602.03606 (2016).

Version History

Introduced in R2020a
