3. Bag of Words in NLP

Published: August 28, 2024

The Bag of Words (BoW) model is a fundamental technique in Natural Language Processing (NLP) used to extract features from text data. It helps in representing text in a numerical form, which is essential for many machine learning algorithms. In this post, we’ll explore how the Bag of Words model works, how to implement it, and some of its limitations.

Example Sentences

Let’s start with three simple sentences:

Sentence 1: He is a good boy
Sentence 2: She is a good girl
Sentence 3: Boy & girl are good

Text Preprocessing

Before applying the Bag of Words model, we perform some basic text preprocessing steps:

Lowercasing: Convert all text to lowercase.
Stemming and Lemmatization: Reduce words to their root forms.
Stopwords Removal: Remove common words that add little meaning (e.g., “is”, “a”, “the”).

After applying these preprocessing steps, which here also drop the pronouns “he” and “she”, the verb “are”, and the “&” symbol, the sentences are transformed as follows:

Sentence 1: good boy
Sentence 2: good girl
Sentence 3: boy girl good
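
The preprocessing above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full pipeline: the STOPWORDS set is a hypothetical list just large enough for these three sentences, the preprocess function is a name chosen for this example, and stemming/lemmatization is left out because none of the example words need it.

```python
import re

# Hypothetical stopword list, only large enough for the three example sentences.
STOPWORDS = {"he", "she", "is", "a", "are", "the"}

def preprocess(sentence):
    """Lowercase the text, keep alphabetic tokens only, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())  # also discards "&"
    return [token for token in tokens if token not in STOPWORDS]

sentences = [
    "He is a good boy",
    "She is a good girl",
    "Boy & girl are good",
]

cleaned = [preprocess(sentence) for sentence in sentences]
for number, tokens in enumerate(cleaned, start=1):
    print(f"Sentence {number}: {' '.join(tokens)}")
```

In practice a library stopword list and a stemmer or lemmatizer would usually replace the hand-written pieces here.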

Creating the Bag of Words

To create the Bag of Words model, we first collect the distinct words that remain across all sentences (the vocabulary) and count how often each one occurs. The result can be viewed as a histogram where the x-axis lists the words and the y-axis gives the frequency of each word.

Word Frequency Table

Word	Count/Frequency
good	3
boy	2
girl	2
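
The frequency table can be reproduced with collections.Counter from the Python standard library. This sketch assumes the sentences have already been preprocessed into token lists; the variable names are illustrative.

```python
from collections import Counter

# Preprocessed sentences from the example, as lists of tokens.
cleaned = [
    ["good", "boy"],
    ["good", "girl"],
    ["boy", "girl", "good"],
]

# Count every token across all sentences.
word_counts = Counter(token for tokens in cleaned for token in tokens)

# Order the vocabulary by descending count, breaking ties alphabetically.
vocabulary = sorted(word_counts, key=lambda word: (-word_counts[word], word))

for word in vocabulary:
    print(f"{word}\t{word_counts[word]}")
```

Sorting by frequency is a design choice that keeps the most common words first; any fixed ordering works as long as it is used consistently when building the vectors.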

Binary Bag of Words

In the binary version of Bag of Words, we mark a word as 1 if it is present in a sentence and 0 if it is not. Here’s how the sentences look in this format:

	good	boy	girl
Sentence 1	1	1	0
Sentence 2	1	0	1
Sentence 3	1	1	1
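
A minimal sketch of the binary version, assuming the vocabulary order good, boy, girl used in the table. The function name binary_vector is chosen for this example only.

```python
vocabulary = ["good", "boy", "girl"]

# Preprocessed sentences from the example, as lists of tokens.
cleaned = [
    ["good", "boy"],
    ["good", "girl"],
    ["boy", "girl", "good"],
]

def binary_vector(tokens, vocabulary):
    """Return 1 for each vocabulary word present in tokens, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]

binary_matrix = [binary_vector(tokens, vocabulary) for tokens in cleaned]
for row in binary_matrix:
    print(row)
```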

Frequency Bag of Words

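If we instead record how many times each word occurs in each sentence, the table looks like this:

	good	boy	girl
Sentence 1	1	1	0
Sentence 2	1	0	1
Sentence 3	1	1	1

Because no word repeats within any of these sentences, the counts match the binary table. The two representations differ only when a word occurs more than once in a sentence, in which case the frequency version stores that count instead of 1.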
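
A minimal sketch of the frequency version in plain Python. The extra token list for “good good boy” is a hypothetical sentence, not one of the three examples; it is included only so that a count greater than 1 appears. The function name count_vector is likewise illustrative.

```python
from collections import Counter

vocabulary = ["good", "boy", "girl"]

# Preprocessed sentences from the example, as lists of tokens.
cleaned = [
    ["good", "boy"],
    ["good", "girl"],
    ["boy", "girl", "good"],
]

def count_vector(tokens, vocabulary):
    """Return how many times each vocabulary word occurs in tokens."""
    counts = Counter(tokens)  # missing words count as 0
    return [counts[word] for word in vocabulary]

count_matrix = [count_vector(tokens, vocabulary) for tokens in cleaned]
for row in count_matrix:
    print(row)

# Hypothetical sentence with a repeated word, to show a count above 1.
repeated = count_vector(["good", "good", "boy"], vocabulary)
print(repeated)
```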

Each row of these matrices is the feature vector for one sentence, and these vectors can be used as input to machine learning models.

Advantages of Bag of Words

Simplicity: BoW is easy to implement and understand.
Versatility: It can be used in various text classification tasks, such as sentiment analysis and spam detection.

Disadvantages of Bag of Words

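Loss of Semantic Meaning: BoW discards word order and context, so it does not capture what a sentence means. For example, “good” is counted the same way in “good boy” and “good girl”, regardless of what it describes, and two sentences containing the same words in a different order produce identical vectors.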
High Dimensionality: The size of the feature vectors grows with the vocabulary, leading to sparse matrices that can be computationally expensive to process.
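
Both limitations can be seen in a short sketch. The two sentences below are hypothetical and chosen only because they use the same words in a different order; bow_vector is an illustrative helper, not a library function.

```python
from collections import Counter

def bow_vector(tokens, vocabulary):
    """Return the count of each vocabulary word in tokens."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical sentences with opposite meanings but identical words.
first = "dog bites man".split()
second = "man bites dog".split()

vocabulary = sorted(set(first) | set(second))
vec_first = bow_vector(first, vocabulary)
vec_second = bow_vector(second, vocabulary)

# Word order is lost: both sentences map to the same vector.
print(vec_first == vec_second)

# Every vector has one entry per vocabulary word, so its length
# grows with the vocabulary even when most entries are 0.
print(len(vec_first) == len(vocabulary))
```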

Conclusion

The Bag of Words model is a simple yet powerful technique for transforming text data into numerical feature vectors. While it has some limitations, particularly in preserving semantic meaning, it is widely used in NLP tasks for its simplicity and effectiveness. Understanding how to implement and use BoW is a fundamental skill for anyone working with text data.


Niloy Kumar Kundu
