3. Bag of Words in NLP
Published:
The Bag of Words (BoW) model is a fundamental technique in Natural Language Processing (NLP) used to extract features from text data. It helps in representing text in a numerical form, which is essential for many machine learning algorithms. In this post, we’ll explore how the Bag of Words model works, how to implement it, and some of its limitations.
Example Sentences
Let’s start with three simple sentences:
- Sentence 1: He is a good boy
- Sentence 2: She is a good girl
- Sentence 3: Boy & girl are good
Text Preprocessing
Before applying the Bag of Words model, we perform some basic text preprocessing steps:
- Lowercasing: Convert all text to lowercase.
- Stemming and Lemmatization: Reduce words to their root forms.
- Stopwords Removal: Remove common words that add little meaning (e.g., “is”, “a”, “the”).
After applying these preprocessing steps, the sentences are transformed as follows:
- Sentence 1: good boy
- Sentence 2: good girl
- Sentence 3: boy girl good
Creating the Bag of Words
To create the Bag of Words model, we’ll represent the words in the sentences and count their frequency. The result is a histogram where the x-axis represents the words and the y-axis represents the frequency of each word.
Word Frequency Table
Word | Count/Frequency |
---|---|
good | 3 |
boy | 2 |
girl | 2 |
Binary Bag of Words
In the binary version of Bag of Words, we mark a word as 1 if it is present in a sentence and 0 if it is not. Here’s how the sentences look in this format:
good | boy | girl | |
---|---|---|---|
Sentence 1 | 1 | 1 | 0 |
Sentence 2 | 1 | 0 | 1 |
Sentence 3 | 1 | 1 | 1 |
Frequency Bag of Words
If we want to track the actual frequency of words in the sentences, the table would look like this:
good | boy | girl | |
---|---|---|---|
Sentence 1 | 1 | 2 | 0 |
Sentence 2 | 1 | 0 | 1 |
Sentence 3 | 1 | 1 | 1 |
These matrices are the feature vectors for our sentences, which can be used as input to machine learning models.
Advantages of Bag of Words
- Simplicity: BoW is easy to implement and understand.
- Versatility: It can be used in various text classification tasks, such as sentiment analysis and spam detection.
Disadvantages of Bag of Words
- Loss of Semantic Meaning: BoW does not capture the meaning or context of the words. For example, it treats “good” in “good boy” and “good girl” as the same, without considering the different meanings.
- High Dimensionality: The size of the feature vectors grows with the vocabulary, leading to sparse matrices that can be computationally expensive to process.
Conclusion
The Bag of Words model is a simple yet powerful technique for transforming text data into numerical feature vectors. While it has some limitations, particularly in preserving semantic meaning, it is widely used in NLP tasks for its simplicity and effectiveness. Understanding how to implement and use BoW is a fundamental skill for anyone working with text data.