4. Understanding TF-IDF
Published:
TF (Term Frequency)
Term Frequency (TF) is a measure of how frequently a word appears in a sentence, normalized by the total number of words in that sentence. It is calculated as:
\[ \text{TF} = \frac{\text{Number of times a word appears in a sentence}}{\text{Total number of words in the sentence}} \]
IDF (Inverse Document Frequency)
Inverse Document Frequency (IDF) is a measure of how important a word is across multiple sentences. It is calculated as:
\[ \text{IDF} = \log\left(\frac{\text{Total number of sentences}}{\text{Number of sentences containing the word}}\right) \]
TF-IDF
TF-IDF is a product of TF and IDF, which gives a score that represents the importance of a word in a particular sentence while considering its frequency across all sentences:
\[ \text{TF-IDF} = \text{TF} \times \text{IDF} \]
Example Sentences
After preprocessing, we have the following sentences:
- Sentence 1: good boy
- Sentence 2: good girl
- Sentence 3: boy girl good
Word Frequencies
The frequency of each word across all sentences is:
- good: 3 occurrences
- boy: 2 occurrences
- girl: 2 occurrences
Calculating Term Frequency (TF)
The TF for each word in each sentence is as follows:
Word | Sentence 1 (good boy) | Sentence 2 (good girl) | Sentence 3 (boy girl good) |
---|---|---|---|
good | \(\frac{1}{2}\) | \(\frac{1}{2}\) | \(\frac{1}{3}\) |
boy | \(\frac{1}{2}\) | 0 | \(\frac{1}{3}\) |
girl | 0 | \(\frac{1}{2}\) | \(\frac{1}{3}\) |
Calculating Inverse Document Frequency (IDF)
The IDF for each word is calculated as:
- good: \(\log\left(\frac{3}{3}\right) = \log(1) = 0\)
- boy: \(\log\left(\frac{3}{2}\right)\)
- girl: \(\log\left(\frac{3}{2}\right)\)
Since “good” appears in all three sentences, its IDF is zero, indicating that it is less informative.
Calculating TF-IDF
Now, we multiply the TF values by the IDF values to obtain the TF-IDF scores:
good | boy | girl | |
---|---|---|---|
Sentence 1 | 0 | \(\frac{1}{2} \times \log\left(\frac{3}{2}\right)\) | 0 |
Sentence 2 | 0 | 0 | \(\frac{1}{2} \times \log\left(\frac{3}{2}\right)\) |
Sentence 3 | 0 | \(\frac{1}{3} \times \log\left(\frac{3}{2}\right)\) | \(\frac{1}{3} \times \log\left(\frac{3}{2}\right)\) |
Interpretation
In Sentence 1, the word “boy” has the highest TF-IDF value because it is less frequent across all sentences, making it more informative within that sentence. The word “good,” on the other hand, has a TF-IDF of 0 in all sentences, as it is too common and not particularly informative.
Conclusion
TF-IDF helps in identifying words that are significant in a specific context while filtering out commonly occurring words that may not contribute much to the meaning of the text. This approach retains some level of semantic information, making it useful for various NLP tasks such as text classification and information retrieval.