Introducing CNNs

Learn about the fundamentals of CNNs.

In this lesson, we’ll learn more about CNNs. Specifically, we’ll first get an understanding of the sort of operations present in a CNN, such as convolution layers, pooling layers, and fully connected layers. Next, we’ll briefly see how all of these are connected to form an end-to-end model.

It’s important to note that the first use case we’ll be solving with CNNs is an image classification task. CNNs were originally used to solve computer vision tasks and were adopted for NLP much later. Furthermore, CNNs have a stronger presence in the computer vision domain than the NLP domain, making it easier to explain the underlying concepts in a vision context. For this reason, we’ll first learn how CNNs are used in computer vision and then move on to NLP.

CNN fundamentals

Now, let’s explore the fundamental ideas behind a CNN without delving into too much technical detail. A CNN is a stack of layers, such as convolution layers, pooling layers, and fully connected layers. We’ll discuss each of these to understand their role in the CNN.

Convolution layer

Initially, the input is connected to a set of convolution layers. These convolution layers slide a patch of weights (sometimes called the convolution window or filter) over the input and produce an output by means of the convolution operation. Convolution layers use a small number of weights, organized to cover only a small patch of input in each layer, unlike fully connected neural networks, and these weights are shared across certain dimensions (for example, the width and height dimensions of an image). Also, CNNs use convolution operations to share the weights from the output by sliding this small set of weights along the desired dimension.

What we ultimately get from this convolution operation is illustrated in the figure below. If the pattern present in a convolution filter is present in a patch of the image, the convolution will output a high value for that location; if not, it will output a low value. Also, by convolving the full image, we get a matrix indicating whether a pattern was present in a given location. Finally, we’ll get a matrix as the convolution output:

Get hands-on with 1200+ tech skills courses.