5. Word2Vec

1. Lack of Semantic Information

Semantic information refers to the meaning of words and the relationships between them in a sentence or document. Traditional methods like Bag of Words (BoW) and TF-IDF rely solely on word frequencies and ignore the context in which words appear, so they capture neither the meaning of individual words nor how those words relate to one another.

Example: Consider the sentences:

  • “He is a good boy.”
  • “He is not a bad boy.”

The words “good” and “bad” carry opposite meanings, yet a Bag of Words or TF-IDF model treats them as independent, unrelated features. The model has no way of knowing they are opposites, so it may weight them as equally relevant and, for example, misjudge the sentiment of these two sentences even though both describe the boy positively.
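A quick way to see this limitation is to vectorize the two sentences with a plain bag-of-words model. The sketch below uses scikit-learn’s CountVectorizer purely for illustration (the choice of library is an assumption, not something this post prescribes): “good” and “bad” simply become two unrelated columns, with nothing encoding that they are opposites.

    from sklearn.feature_extraction.text import CountVectorizer

    # The two example sentences from above
    corpus = ["He is a good boy.", "He is not a bad boy."]

    # Plain bag of words: every unique word becomes its own independent column
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    # ['bad' 'boy' 'good' 'he' 'is' 'not']
    print(X.toarray())
    # "good" and "bad" are just two separate counts; nothing in these
    # vectors says that the words have opposite meanings.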

2. Risk of Overfitting

Overfitting occurs when a model learns not only the underlying pattern in the training data but also the noise, making it perform well on training data but poorly on new, unseen data.

In Bag of Words and TF-IDF:

  • High Dimensionality: These methods generate very high-dimensional feature vectors, especially when the vocabulary is large. Each unique word is treated as a separate feature, which increases the risk of overfitting.
  • Sparse Vectors: The resulting vectors are often sparse, meaning they contain many zeros. This sparsity can make the model sensitive to noise and less generalizable to new data.
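A small illustration of both points, again using scikit-learn on a made-up corpus (an assumed tool choice and data): even four short sentences already produce one column per unique word, and most entries in the matrix are zero. With a realistic corpus the vocabulary, and therefore the dimensionality, grows into the tens of thousands while the vectors stay almost entirely sparse.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Tiny made-up corpus; real corpora contain thousands of documents
    docs = [
        "He is a good boy",
        "He is not a bad boy",
        "She is a good girl",
        "The weather is good today",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    # One column per unique word: dimensionality grows with the vocabulary
    print("matrix shape:", X.shape)

    # Most entries are zero because each document uses only a few words
    sparsity = 1.0 - X.nnz / (X.shape[0] * X.shape[1])
    print(f"sparsity: {sparsity:.0%}")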

Word2Vec: Addressing the Issues

Word2Vec is a technique in NLP for obtaining vector representations of words. These vectors capture information about a word's meaning based on the words that surround it in text. The Word2Vec algorithm learns these representations by modeling a large corpus; once trained, the model can supply a vector for any word in its vocabulary when processing new text.

Key Features of Word2Vec

  • Vector Representation: Each word is represented as a dense vector of fixed dimensionality (e.g., 32, 100, or even 300 dimensions). Unlike the single count or weight a word receives in BoW or TF-IDF, these vectors capture rich information about the word.

  • Semantic Information: Word2Vec preserves the semantic relationships between words. Words that are similar in meaning are placed closer together in the vector space. For example, “king” and “queen” will have similar vectors, and the relationship between “man” and “woman” can be represented as a vector operation: \[ \text{king} - \text{man} + \text{woman} \approx \text{queen} \]

  • Contextual Awareness: Word2Vec considers the context in which a word appears, helping to capture the meaning and relationship between words.
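These properties are easy to check with gensim’s pre-trained Google News Word2Vec vectors, loaded through gensim.downloader as in the sketch below. Note that this model is a large download (roughly 1.6 GB); a smaller set such as "glove-wiki-gigaword-100" can be substituted for quick experiments, though it is trained with GloVe rather than Word2Vec.

    import gensim.downloader as api

    # Pre-trained 300-dimensional Word2Vec vectors (large download)
    wv = api.load("word2vec-google-news-300")

    # Semantically related words sit close together in the vector space
    print(wv.similarity("king", "queen"))    # relatively high
    print(wv.similarity("king", "banana"))   # much lower

    # Vector arithmetic: king - man + woman lands near queen
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))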

Steps to Create a Word2Vec Model

  1. Tokenization:
    • Split the sentences into individual words (tokens).

    Example:

       Sentence: "He is a good boy."
       Tokens: ["He", "is", "a", "good", "boy"]
    
  2. Create Histograms:

    • Count the occurrences of each word across the corpus to identify the most frequent words.

    Example:

       Word Frequencies: {"He": 1, "is": 1, "a": 1, "good": 1, "boy": 1}
    
  3. Take the Most Frequent Words:

    • Select the most frequent words from the histogram to focus on the most relevant vocabulary.

     Example (assuming a larger corpus that also contains sentences such as “She is a good girl.”):

        Frequent Words: ["good", "boy", "girl"]
    
  4. Create a Co-occurrence Matrix:

    • Construct a matrix where each row and column represent a unique word, and the values represent the co-occurrence of these words within a certain context window.
       Words: ["good", "boy", "girl"]
       Matrix:
            | good | boy | girl |
       good |  0   |  1  |  1   |
       boy  |  1   |  0  |  1   |
       girl |  1   |  1  |  0   |
    
  5. Generate Word Vectors:

    • Train the Word2Vec model on the tokenized corpus. Under the hood, a shallow neural network (CBOW or skip-gram) learns dense, fixed-size vectors for each word from these co-occurrence patterns, capturing their semantic meanings and relationships; see the sketch after this list for an end-to-end example.

    Example:

       Word2Vec Vectors:
       "good" -> [0.15, 0.45, -0.23, ..., 0.10]
       "boy"  -> [0.20, 0.30, -0.45, ..., 0.05]
    
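The sketch below runs these steps end to end on a made-up toy corpus: tokenization, a word-frequency histogram, a small co-occurrence count, and finally training vectors with gensim’s Word2Vec class. All parameter values here (window size, vector size, number of epochs) are arbitrary choices for illustration. One caveat: gensim learns the vectors by training a shallow CBOW or skip-gram network directly on the tokenized sentences, not by consuming an explicit co-occurrence matrix, so the matrix is included only to make the intuition behind step 4 concrete.

    from collections import Counter
    from itertools import combinations

    from gensim.models import Word2Vec

    # Toy corpus, made up for illustration
    corpus = [
        "He is a good boy",
        "She is a good girl",
        "Boy and girl are good",
    ]

    # Step 1: tokenization
    sentences = [s.lower().split() for s in corpus]

    # Step 2: word-frequency histogram
    freqs = Counter(word for sent in sentences for word in sent)
    print(freqs.most_common())

    # Step 3: focus on a few frequent content words
    # (hand-picked here to mirror the example above)
    vocab = ["good", "boy", "girl"]

    # Step 4: co-occurrence counts within each sentence, for intuition only
    cooc = Counter()
    for sent in sentences:
        present = [w for w in sent if w in vocab]
        for w1, w2 in combinations(sorted(set(present)), 2):
            cooc[(w1, w2)] += 1
    print(dict(cooc))

    # Step 5: train Word2Vec on the tokenized sentences
    model = Word2Vec(
        sentences,
        vector_size=50,   # dimensionality of the word vectors
        window=2,         # context window size
        min_count=1,      # keep every word in this tiny corpus
        sg=1,             # 1 = skip-gram, 0 = CBOW
        epochs=100,       # extra epochs help on such a small corpus
    )

    # Dense, fixed-size vector for a word
    print(model.wv["good"][:5])

    # Words the model places closest to "good"
    print(model.wv.most_similar("good", topn=3))

On such a tiny corpus the resulting similarities will be noisy; the point of the sketch is the shape of the pipeline, not the quality of the vectors.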

By using Word2Vec, we create word vectors that are more informative and better at capturing the semantic relationships between words, leading to more accurate and generalizable models.