Architecture of Convolutional Networks
Explore the structure of convolutional networks, from grid-like inputs to the dense layer output.
Structure
This section walks through the structure of a convolutional network. The illustration below shows an elementary convolutional network.
The components of the network are as follows:
- Grid-like input: Convolutional layers take grid-like inputs. The input in the illustration is image-like, that is, it has two spatial axes and three channels, one each for red, green, and blue.
- Convolutional layer: A layer comprises filters, and a filter is made up of kernels, one kernel for each input channel. The size of a layer is essentially the number of filters in it, which is a network configuration choice. Here, five illustrative filters, such as diagonal stripes, horizontal stripes, diamond grid, shingles, and waves, are shown in the convolutional layer. Each of them has red, green, and blue channels to match the input.
- Convolutional output: A filter sweeps the input to detect the presence or absence of a pattern and its location in the input. The output corresponding to each filter is shown with a black square of the same pattern in the figure. Note that the colored channels of the input are absent in the layer's output. This is because the information across the channels is aggregated during the convolution operation. Consequently, the original input channels are relinquished; instead, the output of each filter becomes a channel for the next layer.
- Pooling layer: A convolutional layer is typically paired with a pooling layer. The pooling layer summarizes the spatial features, which are the horizontal and vertical axes in the illustration.
- Pooling output: Pooling reduces the size of the spatial axes by summarizing the data along them. This makes the network invariant to minor translations and robust to noisy inputs. Note that pooling occurs only along the spatial axes, so the number of channels remains intact.
- Flatten: The feature map so far is still in a grid-like structure. The flatten operation vectorizes the grid feature map, which is necessary before passing it to a dense output layer.
- Dense (output) layer: Finally, a dense layer maps the convolution-derived features to the response.
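The components above can be assembled into a minimal Keras model. This is only a sketch: the input size (28×28 with 3 channels), the five 3×3 filters, the 2×2 pool, and the ten-class output are illustrative assumptions chosen to mirror the figure, not values prescribed by the text.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 3)),                  # grid-like input: 2 spatial axes, 3 channels
    layers.Conv2D(5, kernel_size=(3, 3),
                  activation="relu"),                   # 5 filters, as in the illustration
    layers.MaxPooling2D(pool_size=(2, 2)),              # pooling summarizes the spatial axes
    layers.Flatten(),                                   # vectorize the grid feature map
    layers.Dense(10, activation="softmax"),             # dense output layer
])

model.summary()
```

Tracing the shapes, the 28×28×3 input becomes a 26×26×5 convolutional output (one channel per filter), pooling halves the spatial axes to 13×13×5, and flattening produces an 845-element vector for the dense layer.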
A convolutional network’s purpose is to automatically learn predictive filters from data. Multiple layers are often stacked to learn from low- to high-level features. For instance, a face recognition network could learn the edges of a face in the lower layers and the shape of eyes in the higher layers.
Note: The purpose of a convolutional network is to learn the filters automatically.
Conv1D, Conv2D, and Conv3D
In TensorFlow, the convolutional layer can be chosen from Conv1D, Conv2D, and Conv3D. The three types of layers are designed for inputs with one, two, or three spatial axes, respectively. Let's look at when each of them is applicable and their interchangeability.
Convolutional networks work with grid-like inputs. Such inputs are categorized based on their axes and channels. The table below summarizes these for a few kinds of grid-like data: time series, images, and videos.
Axes and Channels in Grid-Like Inputs to Convolutional Networks

| | Time Series | Image | Video |
| --- | --- | --- | --- |
| Axis-1 (Spatial dim1) | Time | Height | Height |
| Axis-2 (Spatial dim2) | - | Width | Width |
| Axis-3 (Spatial dim3) | - | - | Time |
| Channels | Features (one in a univariate time series) | Colors | Colors |
| Conv'x'D | Conv1D | Conv2D | Conv3D |
| Input Shape | (samples, time, features) | (samples, height, width, colors) | (samples, height, width, time, colors) |
| Kernel Size | An integer, t, specifying the time window | An integer tuple, (h, w), specifying the height and width window | An integer tuple, (h, w, t), specifying the height, width, and time window |
| Kernel Shape | (t, features) | (h, w, colors) | (h, w, t, colors) |
A univariate time series has a single spatial axis corresponding to time. If it is multivariate, the features make up the channels. Irrespective of the number of channels, a time series is modeled with Conv1D because it has only one spatial axis.
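A short sketch of this point: the same Conv1D configuration applies to a univariate series (one channel) and a multivariate series (here, an assumed seven features), and in both cases the number of output channels equals the number of filters. The batch size, series length, and feature counts are illustrative assumptions.

```python
import tensorflow as tf

# One Conv1D layer per input, since a layer's kernel depth is
# fixed to its input's channel count once built.
conv_u = tf.keras.layers.Conv1D(filters=8, kernel_size=5)
conv_m = tf.keras.layers.Conv1D(filters=8, kernel_size=5)

univariate = tf.random.normal([4, 100, 1])    # (samples, time_steps, features=1)
multivariate = tf.random.normal([4, 100, 7])  # (samples, time_steps, features=7)

y_u = conv_u(univariate)
y_m = conv_m(multivariate)

print(y_u.shape)  # (4, 96, 8): 8 filters, time shrunk by kernel_size - 1
print(y_m.shape)  # (4, 96, 8): same output shape despite 7 input channels
```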
Images, on the other hand, have two spatial axes, along their height and width. Videos have an additional spatial axis, perhaps counterintuitively, along time. Conv2D and Conv3D are, therefore, applicable to them, respectively. The channels in both are the palette colors: red, green, and blue.
Note: Conv1D, Conv2D, and Conv3D are used to model inputs with one, two, and three spatial axes, respectively.
The Conv'x'D selection is independent of the channels: there can be any number of channels holding arbitrary features. The Conv'x'D is chosen based on the number of spatial axes only.
Inputs to Conv1D, Conv2D, and Conv3D are structured as 3-D, 4-D, and 5-D tensors of the following shapes, respectively:

- (samples, time_steps, features)
- (samples, height, width, channels)
- (samples, height, width, time_steps, channels)
The first axis is reserved for samples for almost every layer in TensorFlow.
The shape of a sample is defined by the rest of the axes (shown in the illustrations above and below).
Among them, the last axis corresponds to the channels (by default) in any of the Conv'x'D layers.
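The three input shapes can be sketched with illustrative tensors (all sizes here are assumptions) to confirm that each layer accepts the corresponding rank and leaves the channels to the last axis.

```python
import tensorflow as tf

series = tf.random.normal([2, 50, 3])         # (samples, time_steps, features)
images = tf.random.normal([2, 32, 32, 3])     # (samples, height, width, channels)
videos = tf.random.normal([2, 16, 16, 8, 3])  # (samples, height, width, time_steps, channels)

y1 = tf.keras.layers.Conv1D(4, kernel_size=3)(series)
y2 = tf.keras.layers.Conv2D(4, kernel_size=(3, 3))(images)
y3 = tf.keras.layers.Conv3D(4, kernel_size=(3, 3, 3))(videos)

# Each spatial axis shrinks by kernel_size - 1; the last axis becomes
# the number of filters (4), replacing the original channels.
print(y1.shape)  # (2, 48, 4)
print(y2.shape)  # (2, 30, 30, 4)
print(y3.shape)  # (2, 14, 14, 6, 4)
```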
The kernel_size argument in Conv'x'D determines the spatial dimensions of the convolution kernel. The argument is a tuple of integers, and each element corresponds to the kernel's size along the respective spatial axis. The depth of the kernel is fixed and equal to the number of channels, and is therefore not included in the argument.
Note: Conv layers are agnostic to the number of channels. They differ only by the shape of the input’s spatial axes.
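A minimal sketch of the kernel depth being inferred rather than specified: the 28×28 input size and three channels below are illustrative assumptions.

```python
import tensorflow as tf

# kernel_size names only the spatial window (3, 3); no depth is given.
layer = tf.keras.layers.Conv2D(filters=5, kernel_size=(3, 3))
layer.build(input_shape=(None, 28, 28, 3))  # 3 input channels

# The kernel weights have shape (h, w, input_channels, filters):
# the depth of 3 was taken from the input's channel axis.
print(layer.weights[0].shape)  # (3, 3, 3, 5)
```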
Besides, we might observe that a Conv2D can be used to model the inputs of Conv1D by appropriately reshaping the samples. For example, a time series can be reshaped as ...