3. Bag of Words in NLP

3 minute read

Published:

The Bag of Words (BoW) model is a fundamental technique in Natural Language Processing (NLP) used to extract features from text data. It helps in representing text in a numerical form, which is essential for many machine learning algorithms. In this post, we’ll explore how the Bag of Words model works, how to implement it, and some of its limitations.

Example Sentences

Let’s start with three simple sentences:

  • Sentence 1: He is a good boy
  • Sentence 2: She is a good girl
  • Sentence 3: Boy & girl are good

Text Preprocessing

Before applying the Bag of Words model, we perform some basic text preprocessing steps:

  1. Lowercasing: Convert all text to lowercase.
  2. Stemming and Lemmatization: Reduce words to their root forms.
  3. Stopwords Removal: Remove common words that add little meaning (e.g., “is”, “a”, “the”).

After applying these preprocessing steps, the sentences are transformed as follows:

  • Sentence 1: good boy
  • Sentence 2: good girl
  • Sentence 3: boy girl good

Creating the Bag of Words

To create the Bag of Words model, we’ll represent the words in the sentences and count their frequency. The result is a histogram where the x-axis represents the words and the y-axis represents the frequency of each word.

Word Frequency Table

WordCount/Frequency
good3
boy2
girl2

Binary Bag of Words

In the binary version of Bag of Words, we mark a word as 1 if it is present in a sentence and 0 if it is not. Here’s how the sentences look in this format:

 goodboygirl
Sentence 1110
Sentence 2101
Sentence 3111

Frequency Bag of Words

If we want to track the actual frequency of words in the sentences, the table would look like this:

 goodboygirl
Sentence 1120
Sentence 2101
Sentence 3111

These matrices are the feature vectors for our sentences, which can be used as input to machine learning models.

Advantages of Bag of Words

  • Simplicity: BoW is easy to implement and understand.
  • Versatility: It can be used in various text classification tasks, such as sentiment analysis and spam detection.

Disadvantages of Bag of Words

  1. Loss of Semantic Meaning: BoW does not capture the meaning or context of the words. For example, it treats “good” in “good boy” and “good girl” as the same, without considering the different meanings.
  2. High Dimensionality: The size of the feature vectors grows with the vocabulary, leading to sparse matrices that can be computationally expensive to process.

Conclusion

The Bag of Words model is a simple yet powerful technique for transforming text data into numerical feature vectors. While it has some limitations, particularly in preserving semantic meaning, it is widely used in NLP tasks for its simplicity and effectiveness. Understanding how to implement and use BoW is a fundamental skill for anyone working with text data.