Summary: Understanding Long Short-Term Memory Networks

Review what we've learned in this chapter.

In this chapter, we learned about LSTM networks. First, we discussed what an LSTM is and its high-level architecture. We then delved into the detailed computations that take place inside an LSTM cell and walked through them with an example.

Composition of LSTM

We saw that an LSTM is composed of five main components:

  • Cell state: This is the internal cell state of an LSTM cell.

  • Hidden state: The external hidden state is used to calculate predictions.

  • Input gate: This determines how much of the current input is read into the cell state.

  • Forget gate: This determines how much of the previous cell state is sent into the current cell state.

  • Output gate: This determines how much of the cell state is output into the hidden state.

Having such a complex structure allows LSTMs to capture both short-term and long-term dependencies quite well.
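To make these components concrete, here is a minimal NumPy sketch of the computations in a single LSTM time step. The weight matrices W, recurrent matrices U, biases b, and their shapes are assumptions for illustration, not the exact implementation from the chapter.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, and b are dicts keyed by 'i', 'f', 'o', 'c'
    holding (assumed) input weights, recurrent weights, and biases."""
    # Input gate: how much of the current input is read into the cell state
    i_t = sigmoid(W['i'] @ x_t + U['i'] @ h_prev + b['i'])
    # Forget gate: how much of the previous cell state is carried forward
    f_t = sigmoid(W['f'] @ x_t + U['f'] @ h_prev + b['f'])
    # Output gate: how much of the cell state is output into the hidden state
    o_t = sigmoid(W['o'] @ x_t + U['o'] @ h_prev + b['o'])
    # Candidate values for updating the cell state
    c_tilde = np.tanh(W['c'] @ x_t + U['c'] @ h_prev + b['c'])
    # New cell state: forget part of the old state, add part of the candidate
    c_t = f_t * c_prev + i_t * c_tilde
    # New hidden state: a gated view of the cell state, used for predictions
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

Note how the cell state is updated additively: the forget gate scales the old state, the input gate scales the new candidate, and the output gate controls what is exposed through the hidden state.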

LSTMs and RNNs

We compared LSTMs to vanilla RNNs and saw that LSTMs are capable of learning long-term dependencies as an inherent part of their structure, whereas vanilla RNNs can fail to learn them. Afterward, we discussed how LSTMs solve the vanishing gradient problem with their more complex structure.
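As a toy numerical illustration of why this helps (the per-step factors below are assumed values, not measured gradients), compare how a gradient signal decays over many time steps in a vanilla RNN with how it survives along the LSTM's cell-state path, where the relevant factor is the forget gate, which the network can learn to keep close to 1:

import numpy as np

T = 100  # number of time steps to backpropagate through

# Vanilla RNN: the gradient of h_T w.r.t. h_0 is a product of T Jacobians;
# when each factor is smaller than 1, the product shrinks exponentially.
rnn_factor = 0.9                       # assumed typical per-step shrinkage
rnn_gradient_scale = rnn_factor ** T

# LSTM: along the cell-state path the per-step factor is the forget gate,
# which can stay near 1, so the gradient signal survives many steps.
forget_gates = np.full(T, 0.99)        # assumed learned forget-gate values
lstm_gradient_scale = np.prod(forget_gates)

print(f"RNN gradient scale after {T} steps:  {rnn_gradient_scale:.2e}")
print(f"LSTM gradient scale after {T} steps: {lstm_gradient_scale:.2e}")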

Improving the performance of LSTMs

Then, we discussed several extensions that improve the performance of LSTMs. First, we looked at a very simple technique we called greedy sampling, in which, instead of always outputting the best candidate, we randomly sample a prediction from a small set of the best candidates. We saw that this improves the diversity of the generated text. After that, we looked at a more complex search technique called beam search. With this, instead of making a prediction for a single time step into the future, we predict several time steps into the future and pick the candidate sequence that produces the best joint probability.
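The sketch below shows both ideas in simplified form. The prediction interface step_fn(state, word_id) -> (probs, new_state) is an assumption made for illustration; it stands in for a trained LSTM producing a probability distribution over the vocabulary at each step.

import numpy as np

def sample_from_best_candidates(probs, n_best=3, rng=np.random):
    """Greedy sampling: instead of always taking the argmax, randomly sample
    one word ID from the n_best most probable candidates (renormalized)."""
    top_ids = np.argsort(probs)[-n_best:]            # indices of the n best words
    top_probs = probs[top_ids] / probs[top_ids].sum()
    return rng.choice(top_ids, p=top_probs)

def beam_search(step_fn, start_state, beam_width=3, depth=5):
    """Beam search: keep the beam_width best partial sequences, expand each one
    per step, and rank candidates by their joint (log-)probability."""
    beams = [([], 0.0, start_state)]                 # (words, log-prob, state)
    for _ in range(depth):
        candidates = []
        for words, logp, state in beams:
            last_word = words[-1] if words else None
            probs, new_state = step_fn(state, last_word)
            for wid in np.argsort(probs)[-beam_width:]:
                candidates.append((words + [wid], logp + np.log(probs[wid]), new_state))
        # Keep only the beam_width sequences with the highest joint probability
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams[0][0]                               # best word sequence found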

Another improvement involved seeing how word vectors can help improve the quality of an LSTM's predictions. Using word vectors, an LSTM can learn to substitute semantically similar words during prediction (for example, instead of outputting “dog,” the LSTM might output “cat”), leading to more realistic and correct generated text. The final extension we considered was BiLSTMs, or bidirectional LSTMs. A popular application of BiLSTMs is filling in missing words in a phrase. BiLSTMs read the text in both directions: from the beginning to the end and from the end to the beginning. This gives more context because we’re looking at both the past and the future before making a prediction.
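A minimal sketch of the bidirectional idea follows. The forward_step and backward_step arguments are assumed to be LSTM step functions with the signature (x_t, h_prev, c_prev) -> (h_t, c_t); this is an illustration of the two-directional read, not the chapter's exact implementation.

import numpy as np

def bidirectional_read(sequence, forward_step, backward_step, state_size):
    """Run one LSTM over the sequence from beginning to end and another from
    end to beginning, then concatenate the two hidden states at each position
    so every prediction sees both past and future context."""
    h_f = np.zeros(state_size); c_f = np.zeros(state_size)
    h_b = np.zeros(state_size); c_b = np.zeros(state_size)
    forward_states, backward_states = [], []
    for x_t in sequence:                       # beginning -> end
        h_f, c_f = forward_step(x_t, h_f, c_f)
        forward_states.append(h_f)
    for x_t in reversed(sequence):             # end -> beginning
        h_b, c_b = backward_step(x_t, h_b, c_b)
        backward_states.append(h_b)
    backward_states.reverse()
    # The combined state at position t reflects words both before and after t
    return [np.concatenate([f, b]) for f, b in zip(forward_states, backward_states)]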
