8. Problems in Simple RNNs

In a Recurrent Neural Network (RNN), the process starts with forward propagation, followed by backward propagation (Backpropagation Through Time, or BPTT). During backward propagation, the error gradient is propagated backward through the time steps and the network’s weights are updated to minimize the loss function. However, simple RNNs encounter two significant challenges known as the vanishing gradient problem and the exploding gradient problem.
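To make the mechanics concrete, here is a minimal numpy sketch of a forward pass followed by BPTT in a simple RNN. The sizes, the weight names (`W_xh`, `W_hh`), and the tanh activation are assumptions made for illustration, not code from this post.

```python
import numpy as np

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 10, 3, 4      # toy sizes chosen only for illustration

W_xh = rng.normal(scale=0.5, size=(hidden_dim, input_dim))   # input -> hidden weights
W_hh = rng.normal(scale=0.5, size=(hidden_dim, hidden_dim))  # hidden -> hidden (recurrent) weights
xs = rng.normal(size=(T, input_dim))                         # a toy input sequence

# Forward propagation: keep every hidden state so BPTT can reuse it.
hs = [np.zeros(hidden_dim)]
for t in range(T):
    hs.append(np.tanh(W_xh @ xs[t] + W_hh @ hs[-1]))

# Backpropagation Through Time: start from a dummy loss gradient at the
# last time step and apply the chain rule backward through every step.
grad_h = np.ones(hidden_dim)
for t in reversed(range(1, T + 1)):
    local_deriv = 1.0 - hs[t] ** 2            # derivative of tanh at step t
    grad_h = W_hh.T @ (grad_h * local_deriv)  # one chain-rule step back in time
```

Each pass through the backward loop multiplies the gradient by another local derivative and by the recurrent weights; the two problems below are about what happens when that repeated product keeps shrinking or keeps growing.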

Vanishing Gradient Problem

The vanishing gradient problem occurs when the gradients used to update the weights in the network become extremely small as they are propagated back through time. This issue is especially pronounced in deep networks with many layers or long sequences.

Example:

  • In the previous example, consider that each weight update involves calculating derivatives using the chain rule.
  • If the activation function used is the sigmoid function, its derivative lies between 0 and 0.25.
  • As you propagate backward from later layers (e.g., \(w_4\)) to earlier layers (e.g., \(w_1\)), the derivative values are continuously multiplied.
  • Since the derivative of the sigmoid function is a small value (at most 0.25), this repeated multiplication results in a very small gradient value.
  • As the gradient becomes smaller and smaller, the weight updates become negligible, meaning the network fails to learn effectively because the weights are barely adjusted. This prevents the model from converging to the global minimum during gradient descent, as the short sketch after this list illustrates.
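
As a rough sketch of the repeated multiplication described above: the sigmoid derivative never exceeds 0.25, so even with a generous recurrent weight the gradient collapses after a handful of steps. The weight value 0.9 and the 20-step horizon are made-up numbers for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # maximum value is 0.25, reached at z = 0

w = 0.9          # a hypothetical recurrent weight, modestly below 1
grad = 1.0       # gradient arriving at the last time step
for t in range(1, 21):
    # Chain rule through one time step: multiply by the local sigmoid
    # derivative (here taken at its most optimistic value, 0.25) and by
    # the recurrent weight.
    grad *= w * sigmoid_deriv(0.0)
    if t % 5 == 0:
        print(f"after {t:2d} steps: gradient = {grad:.2e}")
```

After 20 steps the gradient is on the order of \(10^{-13}\), far too small to produce a meaningful weight update.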

Consequence:

  • The vanishing gradient problem makes it difficult for the network to learn long-range dependencies in sequential data, as the earlier layers (those closer to the input) receive very little gradient information during training.

Exploding Gradient Problem

The exploding gradient problem is the opposite of the vanishing gradient problem. It occurs when the gradients become excessively large during backpropagation, leading to significant updates in the weights.

Example:

  • If you use the ReLU (Rectified Linear Unit) activation function, the derivative is exactly 1 for positive inputs, so the activation no longer shrinks the gradient; what gets multiplied at each backward step is essentially the recurrent weight itself.
  • As you propagate backward, if the recurrent weights are greater than 1, this repeated multiplication across many layers or time steps makes the gradient grow exponentially (see the sketch after this list).
  • This can result in extremely large gradient values, causing the weights to be updated by excessively large amounts.
  • These large updates can cause the network to overshoot the optimal solution, making it difficult for the model to converge to the global minimum.
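
The same scalar sketch, but with ReLU: its derivative is exactly 1 for positive pre-activations, so each backward step effectively multiplies the gradient by the recurrent weight alone. The weight value 1.5 is an arbitrary illustrative choice.

```python
def relu_deriv(z):
    return 1.0 if z > 0 else 0.0   # ReLU's derivative is either 0 or 1

w = 1.5          # a hypothetical recurrent weight larger than 1
grad = 1.0       # gradient arriving at the last time step
for t in range(1, 21):
    # With a positive pre-activation the local derivative is 1, so each
    # step simply multiplies the gradient by the recurrent weight.
    grad *= w * relu_deriv(1.0)
    if t % 5 == 0:
        print(f"after {t:2d} steps: gradient = {grad:.2e}")
```

After 20 steps the gradient has grown by a factor of roughly 3,300, and weight updates of that size overshoot rather than descend.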

Consequence:

  • The exploding gradient problem can lead to numerical instability and makes the training process highly erratic. The network may either fail to converge or diverge altogether, resulting in poor performance.

Addressing These Problems: LSTM

To address the vanishing and exploding gradient problems, a more sophisticated architecture called Long Short-Term Memory (LSTM) is often used. LSTM networks are designed to maintain a more stable gradient flow through time by introducing a memory cell that selectively updates and forgets information. This helps in preserving important gradients, preventing them from vanishing or exploding, and thus allowing the model to learn long-range dependencies effectively.

LSTM’s architecture includes mechanisms like input gates, forget gates, and output gates, which control the flow of information and help mitigate the issues of vanishing and exploding gradients, making them more effective for tasks involving long sequences.
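
As a minimal sketch of those gates, here is one LSTM cell step in numpy, using the standard input/forget/output gate equations; the variable names, sizes, and the stacked weight layout are assumptions for illustration rather than any particular library’s API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W stacks the four gate weight matrices; b the biases."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    H = h_prev.shape[0]
    i = sigmoid(z[0:H])          # input gate: how much new information to write
    f = sigmoid(z[H:2*H])        # forget gate: how much of the old cell state to keep
    o = sigmoid(z[2*H:3*H])      # output gate: how much of the cell state to expose
    g = np.tanh(z[3*H:4*H])      # candidate values to add to the cell state
    c_t = f * c_prev + i * g     # additive cell-state update
    h_t = o * np.tanh(c_t)       # hidden state passed to the next time step
    return h_t, c_t

# Toy usage with made-up sizes.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 3, 4
W = rng.normal(scale=0.1, size=(4 * hidden_dim, input_dim + hidden_dim))
b = np.zeros(4 * hidden_dim)
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_cell(x_t, h, c, W, b)
```

The key design choice is the additive cell-state update `c_t = f * c_prev + i * g`: when the forget gate stays close to 1, gradients flow through the cell state largely unchanged, which is far less prone to vanishing or exploding than the repeated matrix multiplications of a simple RNN.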