4. Understanding TF-IDF

2 minute read

Published:

TF (Term Frequency)

Term Frequency (TF) is a measure of how frequently a word appears in a sentence, normalized by the total number of words in that sentence. It is calculated as:

\[ \text{TF} = \frac{\text{Number of times a word appears in a sentence}}{\text{Total number of words in the sentence}} \]

IDF (Inverse Document Frequency)

Inverse Document Frequency (IDF) is a measure of how important a word is across multiple sentences. It is calculated as:

\[ \text{IDF} = \log\left(\frac{\text{Total number of sentences}}{\text{Number of sentences containing the word}}\right) \]

TF-IDF

TF-IDF is a product of TF and IDF, which gives a score that represents the importance of a word in a particular sentence while considering its frequency across all sentences:

\[ \text{TF-IDF} = \text{TF} \times \text{IDF} \]

Example Sentences

After preprocessing, we have the following sentences:

  • Sentence 1: good boy
  • Sentence 2: good girl
  • Sentence 3: boy girl good

Word Frequencies

The frequency of each word across all sentences is:

  • good: 3 occurrences
  • boy: 2 occurrences
  • girl: 2 occurrences

Calculating Term Frequency (TF)

The TF for each word in each sentence is as follows:

WordSentence 1 (good boy)Sentence 2 (good girl)Sentence 3 (boy girl good)
good\(\frac{1}{2}\)\(\frac{1}{2}\)\(\frac{1}{3}\)
boy\(\frac{1}{2}\)0\(\frac{1}{3}\)
girl0\(\frac{1}{2}\)\(\frac{1}{3}\)

Calculating Inverse Document Frequency (IDF)

The IDF for each word is calculated as:

  • good: \(\log\left(\frac{3}{3}\right) = \log(1) = 0\)
  • boy: \(\log\left(\frac{3}{2}\right)\)
  • girl: \(\log\left(\frac{3}{2}\right)\)

Since “good” appears in all three sentences, its IDF is zero, indicating that it is less informative.

Calculating TF-IDF

Now, we multiply the TF values by the IDF values to obtain the TF-IDF scores:

 goodboygirl
Sentence 10\(\frac{1}{2} \times \log\left(\frac{3}{2}\right)\)0
Sentence 200\(\frac{1}{2} \times \log\left(\frac{3}{2}\right)\)
Sentence 30\(\frac{1}{3} \times \log\left(\frac{3}{2}\right)\)\(\frac{1}{3} \times \log\left(\frac{3}{2}\right)\)

Interpretation

In Sentence 1, the word “boy” has the highest TF-IDF value because it is less frequent across all sentences, making it more informative within that sentence. The word “good,” on the other hand, has a TF-IDF of 0 in all sentences, as it is too common and not particularly informative.

Conclusion

TF-IDF helps in identifying words that are significant in a specific context while filtering out commonly occurring words that may not contribute much to the meaning of the text. This approach retains some level of semantic information, making it useful for various NLP tasks such as text classification and information retrieval.