9. Understanding LSTM Networks


In Recurrent Neural Networks (RNNs), one of the major challenges is the vanishing gradient problem. To address this, we use more advanced architectures like Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). Today, we’ll focus on LSTM networks.

For a more detailed explanation, you can refer to colah’s blog.

Introduction to LSTM

LSTMs are specifically designed to avoid the long-term dependency problem often encountered in traditional RNNs. They achieve this by introducing a memory cell that can maintain information over extended periods. The LSTM architecture consists of the following components:

  • Memory Cell
  • Forget Gate
  • Input Gate
  • Output Gate

Memory Cell

The memory cell is the core of the LSTM network, responsible for remembering information over time and selectively forgetting irrelevant information. The memory cell’s state is updated based on the context provided by the input.

Example: How LSTM Remembers Context

Suppose the LSTM is used to generate text. If the sentence begins with “The cat,” the memory cell will store information about “cat” to ensure that the correct noun is used later in the sentence. As the sentence progresses to “sat on the mat,” the LSTM updates its memory to include this new context while retaining the knowledge about “cat.”

Pointwise Operations

Pointwise operations refer to element-wise mathematical operations applied across vectors. In LSTMs, pointwise multiplication and addition are commonly used to update the memory cell and gates.

Example: Pointwise Multiplication

Let’s say we have two vectors representing the memory cell state and the candidate values:

\[ x = [0.2, 0.8, 0.5] \] \[ y = [0.3, 0.7, 0.4] \]

The pointwise multiplication \( z = x \times y \) would be:

\[ z = [0.2 \times 0.3, 0.8 \times 0.7, 0.5 \times 0.4] = [0.06, 0.56, 0.20] \]

In the context of LSTMs, this operation is used to update the memory cell by scaling the values, ensuring that only relevant information is retained.
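As a quick sanity check, the same element-wise product can be reproduced in NumPy (a minimal sketch using the illustrative vectors from above):

```python
import numpy as np

# Illustrative vectors from the example above
x = np.array([0.2, 0.8, 0.5])
y = np.array([0.3, 0.7, 0.4])

# Pointwise (element-wise) multiplication, as used to scale the cell state
z = x * y
print(z)  # [0.06 0.56 0.2 ]
```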

Addition Operation

When the context changes, the LSTM must forget some old information and incorporate new information. New information is written into the memory cell using the pointwise addition operation.

Example: Context Update with Addition

Suppose the current memory cell state is:

\[ C_{t-1} = [0.5, 0.1, 0.3] \]

And the new information is:

\[ \tilde{C}_t = [0.4, 0.7, 0.2] \]

The updated memory cell state \( C_t \) would be:

\[ C_t = C_{t-1} + \tilde{C}_t = [0.5 + 0.4, 0.1 + 0.7, 0.3 + 0.2] = [0.9, 0.8, 0.5] \]

This update ensures that the memory cell reflects the most recent and relevant context, such as “sat on the mat.” In the full LSTM update, each term is first scaled by the forget and input gates, \( C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t \); the simplified example above effectively assumes both gates are fully open.
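A minimal NumPy sketch of this combined cell-state update; the gate values here are made-up placeholders (fully open gates) rather than outputs of learned sigmoid layers:

```python
import numpy as np

# Previous cell state and candidate values from the example above
C_prev  = np.array([0.5, 0.1, 0.3])
C_tilde = np.array([0.4, 0.7, 0.2])

# Made-up gate activations for illustration; in a real LSTM these come from
# sigmoid layers over the previous hidden state and the current input
f_t = np.array([1.0, 1.0, 1.0])  # forget gate fully open: keep all old information
i_t = np.array([1.0, 1.0, 1.0])  # input gate fully open: accept all new information

# Full cell-state update: C_t = f_t * C_{t-1} + i_t * C~_t
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)  # [0.9 0.8 0.5]
```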

Forget Gate

The forget gate controls which information should be discarded from the memory cell. It takes the previous hidden state and the current input, and determines how much of the old cell state should be retained.

Example: Forget Gate with Sigmoid

Let’s say the previous hidden state is:

\[ h_{t-1} = [0.8, 0.3, 0.5] \]

And the current input is:

\[ x_t = [0.9, 0.4, 0.7] \]

The forget gate \( f_t \) computes:

\[ f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \]

Assume \( W_f \) and \( b_f \) are weights and biases learned during training. After applying the sigmoid function, \( f_t \) might look like:

\[ f_t = [0.7, 0.2, 0.6] \]

This means 70% of the first element, 20% of the second element, and 60% of the third element of the previous cell state \( C_{t-1} \) are retained after the pointwise multiplication \( f_t \times C_{t-1} \).
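A minimal sketch of the forget-gate computation in NumPy. The weights \( W_f \) and bias \( b_f \) below are random placeholders standing in for learned parameters, so the printed gate values will not match the hand-worked numbers:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values from the example above
h_prev = np.array([0.8, 0.3, 0.5])   # previous hidden state
x_t    = np.array([0.9, 0.4, 0.7])   # current input

# Placeholder parameters; in a trained LSTM these are learned
rng = np.random.default_rng(0)
W_f = rng.normal(size=(3, 6))        # maps the concatenated [h_prev, x_t] to 3 gate values
b_f = np.zeros(3)

# Forget gate: f_t = sigmoid(W_f . [h_prev, x_t] + b_f), each entry in (0, 1)
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
print(f_t)  # each entry says how much of the corresponding cell-state entry to keep
```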

Input Gate

The input gate determines what new information should be added to the memory cell. A sigmoid layer over the previous hidden state and the current input produces gate values between 0 and 1, while a parallel tanh layer produces candidate values in the range \([-1, 1]\).

Example: Input Gate Operation

Given an input \( x_t = [0.8, 0.6, 0.9] \) and the previous hidden state \( h_{t-1} \), the input gate computes:

\[ i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) = [0.5, 0.7, 0.8] \]

And:

\[ \tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) = [0.3, -0.2, 0.5] \]

The pointwise multiplication gives the new information to be written to the memory cell:

\[ i_t \times \tilde{C}_t = [0.5 \times 0.3, 0.7 \times (-0.2), 0.8 \times 0.5] = [0.15, -0.14, 0.4] \]

This contribution is then added to the gated previous cell state to form the new memory cell state, \( C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t \).
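Continuing the sketch, the input gate and candidate values follow the same pattern; the weights are again random placeholders, and the last line shows how the forget and input contributions combine into the new cell state:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative previous hidden state, current input, and previous cell state
h_prev = np.array([0.8, 0.3, 0.5])
x_t    = np.array([0.8, 0.6, 0.9])
C_prev = np.array([0.5, 0.1, 0.3])

# Placeholder parameters standing in for learned weights and biases
rng = np.random.default_rng(1)
W_f, W_i, W_C = (rng.normal(size=(3, 6)) for _ in range(3))
b_f = b_i = b_C = np.zeros(3)

hx = np.concatenate([h_prev, x_t])

f_t     = sigmoid(W_f @ hx + b_f)   # forget gate, values in (0, 1)
i_t     = sigmoid(W_i @ hx + b_i)   # input gate, values in (0, 1)
C_tilde = np.tanh(W_C @ hx + b_C)   # candidate values, values in (-1, 1)

# New cell state: keep part of the old state, add the gated candidate
C_t = f_t * C_prev + i_t * C_tilde
print(C_t)
```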

Output Gate

The output gate decides what information should be output from the LSTM. It computes a gate from the previous hidden state and the current input, and applies it to a tanh-squashed version of the current memory cell state.

Example: Output Gate Operation

Let’s say the current memory cell state is:

\[ C_t = [0.7, 0.1, 0.5] \]

And the input is:

\[ x_t = [0.8, 0.6, 0.9] \]

The output gate computes:

\[ o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) = [0.6, 0.4, 0.9] \]

The memory cell state passes through a tanh function:

\[ \tanh(C_t) \approx [0.60, 0.10, 0.46] \]

Finally, the output is:

\[ h_t = o_t \times \tanh(C_t) = [0.6 \times 0.6, 0.4 \times 0.1, 0.9 \times 0.46] = [0.36, 0.04, 0.414] \]

This output is passed to the next time step (and to the next layer of the network), where it helps generate the next word or prediction.
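Finally, here is the output-gate step of the same sketch. The weights are again placeholders, so the printed hidden state will differ from the hand-worked numbers above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative current cell state, previous hidden state, and current input
C_t    = np.array([0.7, 0.1, 0.5])
h_prev = np.array([0.8, 0.3, 0.5])
x_t    = np.array([0.8, 0.6, 0.9])

# Placeholder parameters standing in for learned weights
rng = np.random.default_rng(2)
W_o = rng.normal(size=(3, 6))
b_o = np.zeros(3)

# Output gate: decides how much of the (squashed) cell state is exposed
o_t = sigmoid(W_o @ np.concatenate([h_prev, x_t]) + b_o)

# Hidden state: h_t = o_t * tanh(C_t), passed on to the next time step and layer
h_t = o_t * np.tanh(C_t)
print(h_t)
```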

Key Takeaways

  • If there is a change in the context (e.g., from “The cat” to “sat on the mat”), the Forget Gate will determine how much of the previous context should be forgotten.
  • If there is no significant change in context, the Forget Gate outputs values close to 1, allowing the memory cell to retain its information.

LSTM networks excel at preserving important information over long sequences, allowing them to effectively model long-term dependencies in sequential data. This is achieved through the combined operations of the forget, input, and output gates, which carefully regulate the flow of information through the network.