CNNs for Sentence Classification: Transformation of Data

Learn about data transformations for sentence classification using a CNN.

Though CNNs have mostly been used for computer vision tasks, nothing stops them from being used in NLP applications. But as we highlighted earlier, CNNs were originally designed for visual content. Therefore, using CNNs for NLP tasks requires somewhat more effort. This is why we started out learning about CNNs with a simple computer vision problem. CNNs are an attractive choice for machine learning problems due to the low parameter count of convolution layers. One such NLP application for which CNNs have been used effectively is sentence classification.

In sentence classification, a given sentence must be assigned a class. We’ll use a question dataset where each question is labeled according to what it asks about. For example, for the question “Who was Abraham Lincoln?”, the label will be “Person.” The dataset we’re using contains around 5,500 training questions with their respective labels and 500 test questions.

We’ll use the CNN introduced in Yoon Kim’s paper, “Convolutional Neural Networks for Sentence Classification,” to help us understand the value of CNNs for NLP tasks. However, using CNNs for sentence classification differs somewhat from the Fashion-MNIST example we discussed because operations such as convolution and pooling now happen in one dimension (length) rather than two (height and width). Furthermore, the pooling operation will have a different flavor from the standard one, as we’ll see soon. As the first step, we’ll understand the data.
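To see this dimensionality difference concretely, here is a minimal sketch (assuming TensorFlow/Keras, with random tensors standing in for real data) that contrasts 2D convolution over images with 1D convolution over sentences:

```python
import tensorflow as tf

# Images: convolution slides over two dimensions (height and width).
image_batch = tf.random.normal([32, 28, 28, 1])   # [batch, height, width, channels]
conv2d = tf.keras.layers.Conv2D(filters=16, kernel_size=3)
print(conv2d(image_batch).shape)                  # (32, 26, 26, 16)

# Sentences: convolution slides over one dimension (length); the
# word-vector dimension k plays the role of channels.
sentence_batch = tf.random.normal([32, 7, 13])    # [batch, n words, k]
conv1d = tf.keras.layers.Conv1D(filters=16, kernel_size=3)
print(conv1d(sentence_batch).shape)               # (32, 5, 16)
```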

How data is transformed for sentence classification

Let’s assume a sentence of $p$ words. First, we’ll pad the sentence with some special words (if its length is less than $n$) to bring the sentence length to $n$ words, where $n \geq p$. Next, we’ll represent each word in the sentence by a vector of size $k$, where this vector can either be a one-hot encoded representation or a word vector learned using skip-gram, CBOW, or GloVe. Then, a batch of $b$ sentences can be represented by a $b \times n \times k$ matrix.
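The following is a minimal sketch of this transformation, assuming NumPy; the function name `sentences_to_tensor` and the `"<pad>"` token are illustrative, and pad positions are simply left as all-zero vectors rather than given their own one-hot index:

```python
import numpy as np

def sentences_to_tensor(sentences, n, vocab):
    """Pad each tokenized sentence to n words and one-hot encode it,
    producing a [b, n, k] tensor where k is the vocabulary size."""
    k = len(vocab)
    batch = np.zeros((len(sentences), n, k), dtype=np.float32)
    for i, words in enumerate(sentences):
        # Truncate to n words, then pad short sentences up to length n.
        padded = words[:n] + ["<pad>"] * max(0, n - len(words))
        for j, w in enumerate(padded):
            if w in vocab:  # "<pad>" is not in vocab, so it stays all-zero
                batch[i, j, vocab[w]] = 1.0
    return batch
```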

Let’s walk through an example. Let’s consider the following three sentences:

  • Bob and Mary are friends.

  • Bob plays soccer.

  • Mary likes to sing in the choir.

In this example, the third sentence has the most words, so let’s set $n = 7$, the number of words in the third sentence. Next, let’s look at the one-hot encoded representation of each word. In this case, there are 13 distinct words. Therefore, we get this:

  • Bob: $[1,0,0,0,0,0,0,0,0,0,0,0,0]$

  • and: $[0,1,0,0,0,0,0,0,0,0,0,0,0]$

  • Mary: $[0,0,1,0,0,0,0,0,0,0,0,0,0]$

Also, $k = 13$ for the same reason. With this representation, we can represent the three sentences as a 3D matrix of size $3 \times 7 \times 13$, as shown in the figure below.
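In code, the helper sketched earlier could produce this $3 \times 7 \times 13$ matrix for the three example sentences (again a sketch; the vocabulary ordering is arbitrary):

```python
sentences = [
    ["Bob", "and", "Mary", "are", "friends"],
    ["Bob", "plays", "soccer"],
    ["Mary", "likes", "to", "sing", "in", "the", "choir"],
]
# The 13 distinct words, each mapped to its one-hot index.
vocab = {w: i for i, w in enumerate(
    ["Bob", "and", "Mary", "are", "friends", "plays", "soccer",
     "likes", "to", "sing", "in", "the", "choir"])}

batch = sentences_to_tensor(sentences, n=7, vocab=vocab)
print(batch.shape)   # (3, 7, 13)
print(batch[0, 0])   # one-hot vector for "Bob": [1. 0. 0. ... 0.]
```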
