extract features from text to build models for natural language processing (nlp) applications
the bag-of-words (bow) model is one of the simplest feature extraction techniques, used in many nlp applications such as text classification, sentiment analysis, and topic modeling. a bag-of-words model is built by counting the number of occurrences of each unique feature, such as a word or symbol, in a document.
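the counting step can be sketched in a few lines of matlab. this is a minimal illustration, assuming text analytics toolbox is installed; the two toy sentences and the variable names are made up for the example, and the toolbox functions are spelled tokenizedDocument and bagOfWords in code.

```matlab
% two toy documents (made up for illustration, not from the original page)
str = [
    "the cat sat on the mat"
    "the dog sat"];

% split each string into word tokens
documents = tokenizedDocument(str);

% count how often each unique word occurs in each document
bag = bagOfWords(documents);

vocabulary = bag.Vocabulary;   % unique words, one per column
counts = full(bag.Counts);     % documents-by-words matrix of counts

% occurrences of the word "the" in each document
theCounts = counts(:, vocabulary == "the");
disp(theCounts)
```

each row of counts describes one document and each column one word, so the model is a plain numeric matrix that can be passed to a classifier or a topic model.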
example
in this example, the matlab® function bagofwords creates a bag-of-words model from a collection of abstracts of math papers published on arxiv. one of the easiest ways to visualize the model is by plotting a word cloud using the matlab function wordcloud(bag). words displayed in bigger fonts and in orange are the most dominant (frequent) in the bag-of-words model.
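a rough version of this workflow is sketched below. the arxiv abstracts are not reproduced here, so three placeholder sentences stand in for them; the variable names are assumptions, and the sketch assumes text analytics toolbox is available.

```matlab
% placeholder text standing in for the arxiv abstracts (not the original data)
str = [
    "we prove a convergence theorem for a class of iterative methods"
    "a new proof of the convergence theorem is given"
    "we study iterative methods for sparse linear systems"];

documents = tokenizedDocument(str);
bag = bagOfWords(documents);

% table of the five most frequent words and their total counts
top = topkwords(bag, 5);
disp(top)

% word cloud of the model; font size grows with word frequency
figure
wordcloud(bag);
title("most frequent words")
```

topkwords gives the same information as the word cloud in tabular form, which is handy for checking which words dominate before any preprocessing.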
when to use bag-of-words models
bag-of-words is easy to understand and implement. as a result, it is often the first method used to build models with text data. however, bag-of-words has several limitations, including:
- lack of context: bag-of-words models do not preserve the order of appearance of features in a document, which can remove important information in some cases. for example, “is this a good day” and “this is a good day” would be considered equivalent if context is not taken into account while analyzing the text data.
- unpredictable model quality: including all features from a document in a bag-of-words model can increase the model size, resulting in sparsity and numerical instabilities. careful preprocessing of the document text is often required to build a useful bag-of-words model.
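both limitations can be seen on small inputs. the sketch below first compares the two sentences from the list above, then applies a few common preprocessing steps to two made-up sentences. the choice and order of steps is only one reasonable option, not a prescribed recipe, and all variable names are illustrative.

```matlab
% lack of context: the two sentences differ only in word order
a = bagOfWords(tokenizedDocument("is this a good day"));
b = bagOfWords(tokenizedDocument("this is a good day"));

% sort the vocabularies so the counts can be compared word by word
[vocabA, ia] = sort(a.Vocabulary);
[vocabB, ib] = sort(b.Vocabulary);
countsA = full(a.Counts(:, ia));
countsB = full(b.Counts(:, ib));

% true: the model cannot tell the question from the statement
sameBag = isequal(vocabA, vocabB) && isequal(countsA, countsB);

% preprocessing: shrink the vocabulary before building the model
str = [
    "The model, in short, is a simple model."
    "A simple baseline is often the best baseline!"];   % made-up sentences

rawBag = bagOfWords(tokenizedDocument(str));

documents = tokenizedDocument(str);
documents = lower(documents);              % merge "The" and "the"
documents = erasePunctuation(documents);   % drop commas, periods, and so on
documents = removeStopWords(documents);    % drop words such as "is" and "a"

cleanBag = bagOfWords(documents);
% drop words that occur at most once in the whole collection
cleanBag = removeInfrequentWords(cleanBag, 1);

fprintf("vocabulary size: %d before, %d after\n", ...
    rawBag.NumWords, cleanBag.NumWords);
```

a smaller vocabulary means fewer columns in the count matrix, which reduces the sparsity mentioned above.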
alternatives to bag-of-words models
several alternative models avoid some of the inherent limitations of bag-of-words:
- n-gram: uses sequences of multiple adjacent features instead of single ones
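- term frequency–inverse document frequency (tf-idf): weights each feature by how often it occurs in a document and how rare it is across the collection, so the weight reflects the feature's importance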
- word embedding: maps features to numerical vectors that form distributed representations, using techniques such as word2vec and glove
- pretrained deep learning models: applied to text data through transfer learning
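the first two alternatives in this list can be tried with the same tokenized documents. the sketch below is illustrative only: the sentences and variable names are made up, it assumes text analytics toolbox, and it relies on the default weighting of tfidf, which may not suit every data set.

```matlab
% toy documents; the first two differ only in word order
str = [
    "this is a good day"
    "is this a good day"
    "a good day for a long walk"];
documents = tokenizedDocument(str);

% bag-of-n-grams: count pairs of adjacent words (bigrams)
ngramBag = bagOfNgrams(documents, 'NgramLengths', 2);
bigramCounts = full(ngramBag.Counts);

% unlike single-word counts, bigram counts separate documents 1 and 2
orderMatters = ~isequal(bigramCounts(1, :), bigramCounts(2, :));

% tf-idf: reweight single-word counts by how rare each word is
bag = bagOfWords(documents);
weights = full(tfidf(bag));
vocabulary = bag.Vocabulary;

% in document 3, "walk" (one document) outweighs "good" (every document)
walkWeight = weights(3, vocabulary == "walk");
goodWeight = weights(3, vocabulary == "good");
```

bigrams recover some of the word order that single-word counts discard, at the cost of a larger vocabulary; tf-idf keeps the vocabulary unchanged and only rescales the counts.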
however, bag-of-words is easy to understand and implement and is sufficient for many use cases. to learn more about bag-of-words and other modeling techniques for text data, see text analytics toolbox™ for use with matlab.
see also: natural language processing, sentiment analysis, word2vec, text mining with matlab, lemmatization, stemming, n-gram, data science, deep learning