# How LSTMs Solve the Vanishing Gradient Problem

Learn how LSTMs solve the vanishing gradient problem.


Even though RNNs are theoretically sound, in practice they suffer from a serious drawback: when backpropagation through time (BPTT) is used, the gradient diminishes quickly, so error information can be propagated back through only a few time steps. Consequently, the network retains only short-term memory, which limits the usefulness of RNNs in real-world sequential tasks.
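To make this concrete, here is a minimal numerical sketch of what happens during BPTT when the same recurrent weight matrix enters the gradient at every step. It assumes a simplified recurrence $h_t = \tanh(U h_{t-1})$ with no input term; the hidden size and the scale of $U$ are illustrative choices, not values from this lesson.

```python
import numpy as np

# Illustrative sketch: simplified recurrence h_t = tanh(U h_{t-1}) with no input term.
# The hidden size and weight scale are arbitrary choices for demonstration only.
rng = np.random.default_rng(0)
hidden_size = 16
U = rng.normal(scale=0.5 / np.sqrt(hidden_size), size=(hidden_size, hidden_size))

h = np.tanh(rng.normal(size=hidden_size))
grad = np.eye(hidden_size)  # dh_t/dh_t starts as the identity matrix

for k in range(1, 51):
    h = np.tanh(U @ h)
    # One-step Jacobian: dh_j/dh_{j-1} = diag(tanh'(a_j)) @ U, where tanh'(a) = 1 - tanh(a)^2
    jacobian = np.diag(1.0 - h**2) @ U
    grad = jacobian @ grad  # product of k one-step Jacobians, i.e., dh_t/dh_{t-k}
    if k % 10 == 0:
        print(f"k = {k:2d}   ||dh_t/dh_(t-k)|| = {np.linalg.norm(grad):.3e}")
```

With these values, the printed norm typically drops by many orders of magnitude as $k$ grows, which is the vanishing gradient in action; scaling $U$ up instead makes the product explode.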

Often, useful and interesting sequential tasks (such as stock market predictions or language modeling) require the ability to learn and store long-term dependencies. Think of the following example for predicting the next word:

John is a talented student. He is an A-grade student and plays rugby and cricket. All the other students envy ________.

For us, this is a very easy task; the answer is “John.” For an RNN, however, it is a difficult one, because the answer depends on information that appears at the very beginning of the text. To solve this task, we need a way to store long-term dependencies in the state of the network. This is exactly the type of task LSTMs are designed to solve.

We have already discussed how a vanishing/exploding gradient can appear even without any nonlinear functions present. We'll now see that it can still happen with the nonlinear term present. To understand the difference, we will derive the term $\partial h_t/\partial h_{t-k}$ for a standard RNN and $\partial c_t/\partial c_{t-k}$ for an LSTM network. As we learned before, this is the crucial term that causes the vanishing gradient.
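As a rough preview, assume a standard RNN cell $h_t = \tanh(W x_t + U h_{t-1})$ and the usual LSTM cell-state update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$; the exact notation in the derivation may differ slightly. Applying the chain rule step by step gives

$$\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} \operatorname{diag}\bigl(\tanh'(W x_j + U h_{j-1})\bigr)\, U,$$

which contains $k$ copies of $U$ multiplied by derivatives that are at most $1$, so its norm tends to shrink (or blow up) roughly like $\|U\|^k$. For the LSTM cell state, if we treat the gate activations as constants with respect to $c_{j-1}$,

$$\frac{\partial c_t}{\partial c_{t-k}} \approx \prod_{j=t-k+1}^{t} \frac{\partial c_j}{\partial c_{j-1}} = \prod_{j=t-k+1}^{t} \operatorname{diag}(f_j),$$

so as long as the forget gates $f_j$ stay close to $1$, this product does not vanish.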
