10. Understanding Word Embeddings

In Natural Language Processing (NLP), a word embedding is a representation of a word in a continuous vector space. This representation is typically a real-valued vector that encodes the meaning of the word, such that words closer together in the vector space are expected to have similar meanings.

Example of a Word Embedding

Suppose the words “king” and “queen” are embedded in a vector space. Their vectors would lie close to each other, reflecting their semantic similarity:

king: [0.7, 0.3, 0.2, 0.9]
queen: [0.68, 0.32, 0.25, 0.87]

In this example, the vectors are close in the vector space, reflecting the related meanings of “king” and “queen.”
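To make “closeness” concrete, the sketch below computes the cosine similarity between these two illustrative vectors with NumPy (the numbers are toy values, not the output of any real model):

```python
import numpy as np

king = np.array([0.7, 0.3, 0.2, 0.9])
queen = np.array([0.68, 0.32, 0.25, 0.87])

# Cosine similarity: dot product divided by the product of the vector norms.
cos_sim = np.dot(king, queen) / (np.linalg.norm(king) * np.linalg.norm(queen))
print(cos_sim)  # close to 1.0, i.e. the vectors point in almost the same direction
```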

Converting Words and Sentences into Vectors

We can convert both words and sentences into vectors using word embeddings. These vectors capture semantic meaning, enabling machines to perform tasks such as text classification, translation, and sentiment analysis.
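One simple and common way to turn a whole sentence into a single vector is to average the embeddings of its words. The sketch below uses a small hypothetical embeddings dictionary for illustration; in practice the vectors would come from a trained model such as Word2Vec.

```python
import numpy as np

# Hypothetical word embeddings (in practice these come from a trained model).
embeddings = {
    "king":  np.array([0.7, 0.3, 0.2, 0.9]),
    "queen": np.array([0.68, 0.32, 0.25, 0.87]),
    "rules": np.array([0.1, 0.8, 0.4, 0.3]),
}

def sentence_vector(sentence):
    # Average the vectors of the words we have embeddings for.
    vectors = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    return np.mean(vectors, axis=0)

print(sentence_vector("The queen rules"))
```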

Types of Word Embeddings

Word embeddings can be generated using various techniques, categorized as follows:

1. Count or Frequency-Based Methods

  • One-Hot Encoding: Represents each word as a binary vector where only one element is 1, and all others are 0.
  • Bag of Words (BoW): Represents text by counting the occurrence of each word in the text.
  • TF-IDF: Reflects how important a word is to a document in a collection or corpus, using term frequency and inverse document frequency.
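As a quick illustration of these count-based methods, the sketch below applies scikit-learn's CountVectorizer (Bag of Words) and TfidfVectorizer to a tiny corpus; it assumes scikit-learn is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the king rules the kingdom", "the queen rules the castle"]

# Bag of Words: raw count of each vocabulary word per document.
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())
print(bow.get_feature_names_out())

# TF-IDF: counts re-weighted by how rare each word is across the corpus.
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray())
```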

2. Deep Learning Trained Models

  • Word2Vec:
    • Continuous Bag of Words (CBOW)
    • Skip-Gram

Understanding Word2Vec

The basic idea of the Word2Vec model has already been introduced in a previous post; you can explore it further in the Word2Vec post.

Continuous Bag of Words (CBOW)

CBOW is a Word2Vec model that predicts a target word based on its surrounding context words.

Dataset Example

Consider the sentence:

"Niloy name is related to data science"

Vocabulary: [Niloy, name, is, related, to, data, science]

Window Size: 5 (a window of 5 consecutive words: the center word is the target, and the 4 surrounding words are the context used to predict it.)

Dataset Preparation:

| Input Data                   | Output Data |
|------------------------------|-------------|
| [Niloy, name, related, to]   | is          |
| [name, is, to, data]         | related     |
| [is, related, data, science] | to          |

Vocabulary Size: 7 (Total number of unique words)
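A minimal sketch of how these (context, target) pairs can be generated with a sliding window, using a window of 5 words (2 context words on each side of the target):

```python
sentence = "Niloy name is related to data science".split()
half_window = 2  # window size 5 = the target word + 2 context words on each side

pairs = []
for i in range(half_window, len(sentence) - half_window):
    context = sentence[i - half_window:i] + sentence[i + 1:i + 1 + half_window]
    target = sentence[i]
    pairs.append((context, target))

for context, target in pairs:
    print(context, "->", target)
# ['Niloy', 'name', 'related', 'to'] -> is
# ['name', 'is', 'to', 'data'] -> related
# ['is', 'related', 'data', 'science'] -> to
```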

One-Hot Encoding:

  • Niloy: [1, 0, 0, 0, 0, 0, 0]
  • name: [0, 1, 0, 0, 0, 0, 0]
  • related: [0, 0, 1, 0, 0, 0, 0]
  • to: [0, 0, 0, 1, 0, 0, 0]
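These one-hot vectors can be produced directly from the vocabulary, for example with a plain NumPy sketch:

```python
import numpy as np

vocab = ["Niloy", "name", "is", "related", "to", "data", "science"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Vector of length vocab_size with a single 1 at the word's index.
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("Niloy"))  # [1 0 0 0 0 0 0]
print(one_hot("name"))   # [0 1 0 0 0 0 0]
```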

These vectors serve as input for the deep learning model. In CBOW, we use an Artificial Neural Network (ANN), specifically a fully connected neural network.

Model Architecture:

  • Input Layer: Size 7 (since vocab_size = 7)
  • Hidden Layer: Size 5 (the embedding dimension; here it is taken equal to the window size)
  • Output Layer: Size 7 (since vocab_size = 7)

Forward and Backward Propagation:

Through forward and backward propagation, the network's weights are learned. The hidden-layer representation of each word, a vector of size 5, is then used as its embedding, so the hidden-layer size is the feature (embedding) dimension.
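A minimal sketch of this 7 → 5 → 7 network in Keras, assuming TensorFlow is available (an illustration of the idea, not the exact implementation used by Word2Vec): the averaged one-hot context vectors go in, a softmax over the vocabulary comes out, and the trained hidden-layer weights are the word embeddings.

```python
import tensorflow as tf

vocab_size, embedding_dim = 7, 5

model = tf.keras.Sequential([
    tf.keras.Input(shape=(vocab_size,)),                      # averaged one-hot vectors of the context words
    tf.keras.layers.Dense(embedding_dim, use_bias=False),     # hidden layer: rows of its weight matrix become the embeddings
    tf.keras.layers.Dense(vocab_size, activation="softmax"),  # probability of each vocabulary word being the target
])
model.compile(optimizer="adam", loss="categorical_crossentropy")

# After model.fit(...), the embedding for word i is the i-th row of this matrix:
embedding_matrix = model.layers[0].get_weights()[0]  # shape (7, 5)
```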

Skip-Gram Model

The Skip-Gram model is essentially the reverse of CBOW. Instead of predicting a target word based on context words, it predicts context words based on a given target word.

Reversed Dataset Preparation:

| Output Data                  | Input Data |
|------------------------------|------------|
| [Niloy, name, related, to]   | is         |
| [name, is, to, data]         | related    |
| [is, related, data, science] | to         |

In the Skip-Gram model, we train the neural network using this reversed dataset. The process is similar to CBOW, but the inputs and outputs are swapped.
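Continuing the earlier sketch, the Skip-Gram training pairs can be obtained by simply swapping each (context, target) pair and emitting one (target, context word) pair per context word:

```python
# 'pairs' is the list of (context_words, target) pairs built in the CBOW sketch above.
skipgram_pairs = [(target, context_word)
                  for context_words, target in pairs
                  for context_word in context_words]

print(skipgram_pairs[:4])
# [('is', 'Niloy'), ('is', 'name'), ('is', 'related'), ('is', 'to')]
```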

CBOW vs. Skip-Gram

  • Small Dataset: Use CBOW
  • Large Dataset: Use Skip-Gram
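In practice, both variants are available through libraries such as Gensim, where the sg flag selects the architecture. A sketch assuming Gensim 4.x is installed (the toy corpus is only for illustration):

```python
from gensim.models import Word2Vec

sentences = [["niloy", "name", "is", "related", "to", "data", "science"],
             ["data", "science", "is", "related", "to", "machine", "learning"]]

# sg=0 -> CBOW, sg=1 -> Skip-Gram
cbow_model = Word2Vec(sentences, vector_size=5, window=2, min_count=1, sg=0)
skipgram_model = Word2Vec(sentences, vector_size=5, window=2, min_count=1, sg=1)

print(cbow_model.wv["data"])                   # dense 5-dimensional embedding
print(skipgram_model.wv.most_similar("data"))  # nearest words in the vector space
```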

Advantages of Word2Vec

  • Sparse to Dense: Converts sparse representations (such as one-hot or Bag of Words vectors) into dense, low-dimensional vectors.
  • Semantic Meaning Capture: Captures the semantic meaning of words, so related words lie close together in the vector space.
  • Fixed Dimensionality: Every word is represented by a vector of the same fixed length, regardless of vocabulary size.