bag-of-words

extract features from text to build models for natural language processing (nlp) applications

the bag-of-words (bow) model is one of the simplest feature extraction techniques, used in many nlp applications such as text classification, sentiment analysis, and topic modeling. a bag-of-words model is built by counting how many times each unique feature, such as a word or symbol, occurs in a document.
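
the counting step can be sketched in a few lines of python. this is a minimal illustration under stated assumptions, not the matlab implementation: the `tokenize` and `bag_of_words` helpers, the tokenization rule, and the two sample documents are all hypothetical and exist only for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into tokens of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(documents):
    """Return the sorted vocabulary and one count vector per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the union of every token seen in any document.
    vocabulary = sorted(set().union(*counts))
    # Each row holds the counts for one document, in vocabulary order.
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix


# Hypothetical sample documents (assumption, not from the source).
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
]

vocabulary, matrix = bag_of_words(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

each row of `matrix` is the bag-of-words vector for one document; the column order follows `vocabulary`, and a zero marks a word that the document does not contain.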

example

in this example, the matlab® function bagOfWords creates a bag-of-words model from a collection of abstracts of math papers published on arxiv. one of the easiest ways to visualize the model is to plot a word cloud with the matlab function wordcloud(bag). words displayed in larger fonts and in orange are the most frequent in the bag-of-words model.

word cloud from a bag-of-words model.

when to use bag-of-words models

bag-of-words is easy to understand and implement. as a result, it is often the first method used to build models with text data. however, bag-of-words has several limitations, including:

  • lack of context: bag-of-words models do not preserve the order in which features appear in a document, which can discard important information. for example, the question “is this a good day” and the statement “this is a good day” contain the same words with the same counts, so they produce identical bag-of-words representations.
  • unpredictable model quality: including all features from a document in a bag-of-words model can increase the model size, resulting in sparsity and numerical instabilities. careful preprocessing of the document text is often required to build a useful bag-of-words model.
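
both limitations can be reproduced with a short python sketch. it is illustrative only: the `bag` and `preprocess` helpers, the tiny stop-word list, and the sample text are assumptions introduced for this example, and real preprocessing pipelines usually do more (for instance stemming or lemmatization).

```python
from collections import Counter


def bag(text):
    """Count whitespace-separated tokens after lowercasing."""
    return Counter(text.lower().split())


# Limitation 1: word order is lost, so a question and a statement collide.
question = "is this a good day"
statement = "this is a good day"
same_bag = bag(question) == bag(statement)
print(same_bag)

# Limitation 2: without preprocessing, case and punctuation inflate the vocabulary.
# Hypothetical stop-word list (assumption, chosen only for this example).
STOP_WORDS = {"a", "is", "this", "the"}


def preprocess(text):
    """Lowercase, strip surrounding punctuation, and drop stop words."""
    tokens = [token.strip(".,!?") for token in text.lower().split()]
    return [token for token in tokens if token and token not in STOP_WORDS]


raw_text = "This is a good day. The day is good!"
raw_vocabulary = set(raw_text.split())
clean_vocabulary = set(preprocess(raw_text))
print(len(raw_vocabulary), len(clean_vocabulary))
```

the raw vocabulary treats "day." and "day" as different features, while the preprocessed one keeps only the two content words, which is why careful preprocessing tends to give a smaller and less sparse model.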

alternatives to bag-of-words models

several alternative models avoid some of the inherent limitations of bag-of-words:

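  • n-gram: uses sequences of multiple adjacent features instead of single ones, which preserves some local word order
  • term frequency–inverse document frequency (tf-idf): weights each feature by how important it is to a document relative to the rest of the collection
  • word embedding: creates distributed representations of features as numerical vectors, using models such as word2vec and glove
  • pretrained deep learning models: used for transfer learning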
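
the first two alternatives are simple enough to sketch in python. this is a hedged illustration, not a reference implementation: the `ngrams` and `tf_idf` helpers and the sample documents are hypothetical, and the weighting used here (term count divided by document length, times the natural log of the number of documents over the document frequency) is only one of several common tf-idf variants.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Bigrams keep local order, so the question and the statement no longer collide.
question = "is this a good day".split()
statement = "this is a good day".split()
bigrams_differ = Counter(ngrams(question, 2)) != Counter(ngrams(statement, 2))
print(bigrams_differ)


def tf_idf(documents):
    """Return one {word: weight} dict per document (assumed tf-idf variant)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


# Hypothetical sample documents (assumption, not from the source).
documents = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(documents)
print(weights[0])
```

a word that appears in every document, such as "the" here, receives a weight of zero, while a word unique to one document receives the highest weight in that document.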

even so, bag-of-words remains sufficient for many use cases. to learn more about bag-of-words and other modeling techniques for text data, see text analytics toolbox™ for use with matlab.

see also: natural language processing, sentiment analysis, word2vec, text mining with matlab, lemmatization, stemming, n-gram, data science, deep learning
