How LSTMs Solve the Vanishing Gradient Problem

Learn how LSTMs solve the vanishing gradient problem.


Even though RNNs are theoretically sound, in practice they suffer from a serious drawback: when backpropagation through time (BPTT) is used, the gradient diminishes quickly, so information can be propagated across only a few time steps. Consequently, the network retains only short-term memory, which limits the usefulness of RNNs in real-world sequential tasks.
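We can observe this shrinkage numerically. The following is a minimal sketch (not the lesson's code) that accumulates the Jacobian $\partial h_t/\partial h_0$ of a toy tanh RNN; the weight scale and inputs are illustrative assumptions:

```python
import numpy as np

# Toy tanh RNN: watch the norm of dh_t/dh_0 shrink as t grows.
# All sizes and scales here are made-up illustrative choices.
rng = np.random.default_rng(0)
dim = 4
W_hh = rng.normal(size=(dim, dim))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)  # spectral norm 0.9 (< 1)

h = np.zeros(dim)
jacobian = np.eye(dim)  # accumulates dh_t / dh_0
norms = []
for t in range(30):
    x = rng.normal(size=dim)  # random input at step t
    h = np.tanh(W_hh @ h + x)
    # One-step Jacobian: dh_t/dh_{t-1} = diag(1 - tanh^2) @ W_hh
    step_jac = np.diag(1 - h**2) @ W_hh
    jacobian = step_jac @ jacobian
    norms.append(np.linalg.norm(jacobian))

print(f"||dh_1/dh_0||  = {norms[0]:.3e}")
print(f"||dh_30/dh_0|| = {norms[-1]:.3e}")  # much smaller than the first
```

Because each step multiplies by `W_hh` (spectral norm 0.9) and a tanh derivative that is at most 1, the gradient norm is bounded by $0.9^t$ and decays geometrically with the number of time steps.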

Often, useful and interesting sequential tasks (such as stock market predictions or language modeling) require the ability to learn and store long-term dependencies. Think of the following example for predicting the next word:

John is a talented student. He is an A-grade student and plays rugby and cricket. All the other students envy ________.

For us, this is a very easy task; the answer would be “John.” For an RNN, however, it is difficult: the answer lies at the very beginning of the text, so solving the task requires a way to store long-term dependencies in the state of the RNN. This is exactly the type of task LSTMs are designed to solve.

We already discussed how a vanishing/exploding gradient can arise even without any nonlinear functions present. We'll now see that it can still happen with the nonlinearity present. For this, we will derive the term $\partial h_t/\partial h_{t-k}$ for a standard RNN and $\partial c_t/\partial c_{t-k}$ for an LSTM network to understand the differences. As we learned before, this is the crucial term that causes the vanishing gradient.
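As a preview, here is a sketch of both terms under a common formulation (the notation $W_{hh}$, $z_i$, and the approximation of holding the gate activations fixed are assumptions of this sketch, not the exact derivation):

```latex
% Standard RNN: h_t = \tanh(z_t), \quad z_t = W_{hh} h_{t-1} + W_{xh} x_t
\frac{\partial h_t}{\partial h_{t-k}}
  = \prod_{i=t-k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}
  = \prod_{i=t-k+1}^{t} \operatorname{diag}\!\bigl(\tanh'(z_i)\bigr)\, W_{hh}

% LSTM cell state: c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
% (treating the gate activations as constants with respect to c_{t-1})
\frac{\partial c_t}{\partial c_{t-k}}
  \approx \prod_{i=t-k+1}^{t} \operatorname{diag}(f_i)
```

The RNN term multiplies by $W_{hh}$ (and a derivative at most 1) at every step, so it vanishes or explodes as $k$ grows. The LSTM cell-state term contains no repeated weight-matrix multiplication; by keeping the forget gates $f_i$ close to 1, the network can let the gradient survive across many time steps.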
