Document similarities with BM25 algorithm
Since R2020a
Syntax
Description
Use bm25Similarity to calculate document similarities.
By default, this function calculates BM25 similarities. To calculate BM11, BM15, or BM25 similarities, use the 'DocumentLengthScaling' and 'DocumentLengthCorrection' arguments.
similarities = bm25Similarity(documents) returns the pairwise BM25 similarities between the specified documents. The score in similarities(i,j) represents the similarity between documents(i) and documents(j).
similarities = bm25Similarity(documents,queries) returns similarities between documents and queries. The score in similarities(i,j) represents the similarity between documents(i) and queries(j).
similarities = bm25Similarity(bag) returns similarities between the documents encoded by the specified bag-of-words or bag-of-n-grams model. The score in similarities(i,j) represents the similarity between the ith and jth documents encoded by bag.
similarities = bm25Similarity(bag,queries) returns similarities between the documents encoded by the bag-of-words or bag-of-n-grams model bag and the documents specified by queries. The score in similarities(i,j) represents the similarity between the ith document encoded by bag and queries(j).
similarities = bm25Similarity(___,Name,Value) specifies additional options using one or more name-value pair arguments. For instance, to use the BM25 algorithm, set the 'DocumentLengthCorrection' option to a nonzero value.
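The role of the length-correction setting can be illustrated with a small, language-neutral sketch in Python. This is not the toolbox implementation: the function name bm25_score, the smoothed IDF variant, and the defaults k = 1.2 and b = 0.75 are assumptions chosen for illustration. It also assumes that the parameter b plays the role of 'DocumentLengthCorrection' and k that of 'DocumentLengthScaling'. In the standard formulation, b = 0 gives BM15 and b = 1 gives BM11.

```python
import math
from collections import Counter


def bm25_score(doc, query, corpus, k=1.2, b=0.75):
    """Score the token list `doc` against the token list `query`.

    Corpus statistics (document count, document frequencies, average
    length) come from `corpus`, a list of token lists.
    """
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    term_counts = Counter(doc)
    # Length normalization: b = 0 ignores length, b = 1 applies it fully.
    norm = 1.0 - b + b * len(doc) / avg_len
    score = 0.0
    for term in query:
        f = term_counts[term]
        if f == 0:
            continue  # terms absent from doc contribute nothing
        df = sum(1 for d in corpus if term in d)
        # Assumed IDF variant (smoothed, always positive); others exist.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        score += idf * f * (k + 1.0) / (f + k * norm)
    return score


# Hypothetical toy data: both documents contain "fox" exactly once.
short_doc = ["fox"]
long_doc = ["fox", "jumps", "over", "the", "lazy", "dog"]
corpus = [short_doc, long_doc, ["cat", "sleeps"]]
query = ["fox"]

for name, b in [("BM15", 0.0), ("BM25", 0.75), ("BM11", 1.0)]:
    s_short = bm25_score(short_doc, query, corpus, b=b)
    s_long = bm25_score(long_doc, query, corpus, b=b)
    print(f"{name}: short={s_short:.3f} long={s_long:.3f}")
```

With b = 0 the two documents score the same because document length is ignored. As b grows toward 1, the longer document is penalized more heavily for the same term count.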
Examples
Input Arguments
Output Arguments
Tips
The BM25 algorithm aggregates information from all the documents in the input data through its term frequency (TF) and inverse document frequency (IDF) terms. As a result, the same pair of documents can yield different BM25 similarity scores when the function is given different collections of documents.
The BM25 algorithm can output different scores when comparing documents to themselves, because the scores depend on the IDF weights and on the document lengths.
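Both effects can be reproduced with a minimal Python sketch of a textbook BM25 score. The function bm25, the smoothed IDF formula, the parameter defaults, and the toy documents are all assumptions made for illustration. The sketch does not claim to match the values the function itself returns.

```python
import math
from collections import Counter


def bm25(doc, query, corpus, k=1.2, b=0.75):
    """Textbook-style BM25 score of token list `doc` for token list `query`."""
    n_docs = len(corpus)
    avg_len = sum(len(d) for d in corpus) / n_docs
    counts = Counter(doc)
    norm = 1.0 - b + b * len(doc) / avg_len  # document length correction
    total = 0.0
    for term in query:
        f = counts[term]
        if f == 0:
            continue
        df = sum(1 for d in corpus if term in d)  # document frequency
        # Assumed smoothed IDF; the collection size enters here.
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
        total += idf * f * (k + 1.0) / (f + k * norm)
    return total


doc_a = ["fox", "jumps"]
doc_b = ["fox", "sleeps", "all", "day"]
small = [doc_a, doc_b]
large = small + [["fox", "runs"], ["dog", "barks"]]

# Same pair of documents, two different collections.
pair_small = bm25(doc_a, doc_b, small)
pair_large = bm25(doc_a, doc_b, large)

# Each document compared to itself within the same collection.
self_a = bm25(doc_a, doc_a, small)
self_b = bm25(doc_b, doc_b, small)

print(pair_small, pair_large)
print(self_a, self_b)
```

The pair score changes between the two collections because the IDF weights and the average document length change. The two self-scores differ because the documents have different lengths and contain terms with different IDF weights.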
Algorithms
References
[1] Robertson, Stephen, and Hugo Zaragoza. "The Probabilistic Relevance Framework: BM25 and Beyond." Foundations and Trends® in Information Retrieval 3, no. 4 (2009): 333-389.
[2] Barrios, Federico, Federico López, Luis Argerich, and Rosa Wachenchauzer. "Variations of the Similarity Function of TextRank for Automated Summarization." arXiv preprint arXiv:1602.03606 (2016).
Version History
Introduced in R2020a