Summary: Recurrent Neural Networks

In this chapter, we looked at RNNs, which differ from conventional feed-forward neural networks and are better suited to solving temporal tasks. Specifically, we discussed how to arrive at an RNN from a feed-forward neural network-type structure. We assumed a sequence of inputs and outputs and designed a computational graph that can represent that sequence of inputs and outputs.

This computational graph resulted in a series of copies of the same function, applied to each input-output tuple in the sequence. Then, by generalizing this model to any given time step t in the sequence, we were able to arrive at the basic computational graph of an RNN. We discussed the exact equations and update rules used to calculate the hidden state and the output.
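For quick reference, a common formulation of these update rules (the chapter's exact notation may differ slightly) is:

$$
h_t = \tanh\!\left(W_{xh}\, x_t + W_{hh}\, h_{t-1} + b_h\right), \qquad
y_t = \mathrm{softmax}\!\left(W_{hy}\, h_t + b_y\right)
$$

Here, $x_t$ is the input at time step $t$, $h_t$ is the hidden state carried forward from step to step, and $y_t$ is the output.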

Training RNNs using BPTT

Next, we discussed how RNNs are trained with data using BPTT. We examined how we can arrive at BPTT from standard backpropagation, as well as why standard backpropagation cannot be used directly for RNNs. We also discussed two important practical issues that arise with BPTT—vanishing gradients and exploding gradients—and how they can be mitigated at a surface level.
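As a concrete illustration of one such surface-level fix for exploding gradients, the sketch below clips gradient norms during training. This is a minimal sketch assuming TensorFlow/Keras; the model, data shapes, and hyperparameters are illustrative placeholders, not the chapter's exact setup.

```python
import tensorflow as tf

# A surface-level fix for exploding gradients: clip gradient norms while
# training an RNN with BPTT. Model and data shapes here are placeholders.
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(64, input_shape=(None, 8)),  # variable-length sequences of 8 features
    tf.keras.layers.Dense(1)
])

# clipnorm rescales each gradient so its L2 norm never exceeds 1.0
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3, clipnorm=1.0)
model.compile(optimizer=optimizer, loss="mse")

# Dummy data: 32 sequences of 10 time steps with 8 features each
x = tf.random.normal((32, 10, 8))
y = tf.random.normal((32, 1))
model.fit(x, y, epochs=1, verbose=0)
```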

Different kinds of RNNs and their applications

Then, we moved on to the practical applications of RNNs. We discussed four main categories of RNNs. One-to-one architectures are used for tasks such as text generation, scene classification, and video frame labeling. Many-to-one architectures are used for sentiment analysis, where we process the sentences/phrases word by word (compared to processing a full sentence in a single go, as we saw before). One-to-many architectures are common in image captioning tasks, where we map a single image to an arbitrarily long phrase describing the image. Many-to-many architectures are leveraged for machine translation tasks.
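To make the architectural distinction concrete, the sketch below contrasts a many-to-one model (one prediction per sequence) with a many-to-many model (one prediction per time step). It assumes TensorFlow/Keras; layer sizes, feature dimensions, and label counts are illustrative.

```python
import tensorflow as tf

# Many-to-one: a single output for the whole sequence (e.g., sentiment analysis)
many_to_one = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32),  # return_sequences=False by default
    tf.keras.layers.Dense(1, activation="sigmoid")
])

# Many-to-many: one output per time step (e.g., sequence labeling)
many_to_many = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(32, return_sequences=True),
    tf.keras.layers.Dense(5, activation="softmax")
])

x = tf.zeros((2, 10, 8))        # batch of 2, 10 time steps, 8 features
print(many_to_one(x).shape)     # (2, 1)   -> one prediction per sequence
print(many_to_many(x).shape)    # (2, 10, 5) -> one prediction per time step
```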

Named entity recognition with RNNs

We then solved the task of named entity recognition (NER) with RNNs. In NER, the problem is to predict a label for each token, given a sequence of tokens. The label represents an entity (e.g., organization, location, or person). For this, we used embeddings and an RNN to process each token while treating the sequence of tokens as a time-series input. We also used a text vectorization layer to convert tokens into word IDs. A key benefit of the text vectorization layer is that it is built into the model itself, unlike the tokenizer we used before.
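A minimal sketch of such a token-labeling pipeline is shown below, assuming TensorFlow/Keras. The vocabulary size, sequence length, layer sizes, and label set are illustrative placeholders rather than the chapter's exact configuration.

```python
import tensorflow as tf

# Illustrative raw sentences used only to adapt the vectorizer
sentences = tf.constant(["john works at acme in london", "mary visited paris"])

# The TextVectorization layer maps raw strings to word IDs and lives
# inside the model, unlike an external tokenizer.
vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=1000, output_sequence_length=10)
vectorizer.adapt(sentences)

num_labels = 5  # hypothetical tag set, e.g., O, PER, ORG, LOC, MISC

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1,), dtype=tf.string),
    vectorizer,                                               # strings -> word IDs
    tf.keras.layers.Embedding(1000, 64, mask_zero=True),      # word IDs -> embeddings
    tf.keras.layers.SimpleRNN(64, return_sequences=True),     # one hidden state per token
    tf.keras.layers.Dense(num_labels, activation="softmax")   # per-token entity label
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

print(model(tf.constant([["john works at acme in london"]])).shape)  # (1, 10, 5)
```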

Generating token embeddings

Finally, we looked at how we can use character embeddings and the convolution operation to generate token embeddings. We used these new token embeddings along with standard word embeddings to improve model accuracy.
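The sketch below illustrates one way to derive a token embedding from its characters with a 1D convolution, assuming TensorFlow/Keras; the character vocabulary, token length, and filter sizes are hypothetical. The resulting vector can then be concatenated with the token's word embedding before being fed to the RNN.

```python
import tensorflow as tf

# Hypothetical settings: each token is padded/truncated to 12 character IDs
max_chars = 12
alphabet_size = 40  # illustrative character vocabulary size

char_ids = tf.keras.Input(shape=(max_chars,), dtype=tf.int32)        # one token's characters
char_emb = tf.keras.layers.Embedding(alphabet_size, 16)(char_ids)    # (batch, 12, 16)

# Convolve over the character dimension and pool to a fixed-size token vector
conv = tf.keras.layers.Conv1D(filters=32, kernel_size=3, activation="relu")(char_emb)
token_emb = tf.keras.layers.GlobalMaxPooling1D()(conv)               # (batch, 32)

char_cnn = tf.keras.Model(char_ids, token_emb)
print(char_cnn(tf.zeros((2, max_chars), dtype=tf.int32)).shape)      # (2, 32)
```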
