
InceptionV1 (GoogLeNet, 2014)

Understand the InceptionV1 (GoogLeNet) architecture that won the 2014 ILSVRC, featuring its network-in-network approach, multiple classifier heads, and efficient convolution techniques. Learn how this model balances depth and parameter efficiency for improved image classification performance.

General structure

InceptionV1 is the image classification architecture that won the ILSVRC competition in 2014.

  • It is a 22-layer architecture that applies the network-in-network approach in special layers called Inception modules.

  • Its training strategy is similar to that of other architectures: SGD with a momentum of 0.9, a fixed learning rate schedule decreasing by 4% every 8 epochs, dropout with a rate of 0.4 at the fully connected layers, ReLU activations in the Inception modules, and softmax at the end.

  • Average pooling is applied between the final convolution layer and fully connected ones.

  • Instead of having one fully connected head, they have three. The two additional fully connected extensions are called auxiliary classifiers. The interesting part is that all three heads are used during training: the auxiliary losses are added to the main loss with a discount weight of 0.3, which pushes useful gradients into the middle layers of the network. At inference time, the auxiliary classifiers are discarded and only the main head is used.
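The training-time bookkeeping above can be sketched in plain Python. The 0.3 auxiliary-loss weight and the 4%-every-8-epochs schedule come from the text; the base learning rate of 0.01 is an illustrative assumption, not a value from the source:

```python
def total_loss(main_loss, aux_losses, aux_weight=0.3):
    # During training, each auxiliary classifier's loss is added to the
    # main loss with a discount weight (0.3). At inference, only the
    # main head is used, so aux_losses would simply be empty.
    return main_loss + aux_weight * sum(aux_losses)

def learning_rate(epoch, base_lr=0.01):
    # Fixed schedule: multiply the learning rate by 0.96 (a 4% decrease)
    # once every 8 epochs. base_lr is an assumed illustrative value.
    return base_lr * 0.96 ** (epoch // 8)

print(total_loss(1.0, [0.8, 0.9]))  # main + 0.3 * (aux1 + aux2)
print(learning_rate(16))            # two decay steps: 0.01 * 0.96**2
```

With two auxiliary heads contributing losses of 0.8 and 0.9, the combined objective is 1.0 + 0.3 × 1.7 ≈ 1.51.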

Network-in-network

The main idea of network-in-network layers is to apply convolutions of different sizes to the same input and concatenate the resulting feature maps into a single output. This gives the layer feature maps at multiple scales from one input, increasing the variety of information extracted from the image and thus widening the learning capacity of the model.

Following this logic, a network-in-network layer can be built with any combination of convolution filter sizes. In this model, the layers that use the network-in-network approach are called Inception modules. The structure is as follows:
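The channel arithmetic of a naive Inception module can be sketched in plain Python. With "same" padding every branch keeps the spatial size, so the branches concatenate along the channel axis; the filter counts below are illustrative assumptions, not values from the paper's tables:

```python
def naive_inception_out_channels(in_ch, n1x1, n3x3, n5x5):
    # Naive Inception module: four parallel branches over the same input.
    # The 1x1, 3x3, and 5x5 convolution branches each contribute their
    # own filter count; the 3x3 max-pool branch keeps the input depth
    # unchanged. Concatenating along the channel axis sums the depths.
    return n1x1 + n3x3 + n5x5 + in_ch

# Illustrative example: a 192-channel input and assumed branch widths.
print(naive_inception_out_channels(192, 64, 128, 32))  # 64+128+32+192 = 416
```

Note how the pooling branch passes the full input depth through, which is one reason the naive version's output depth grows quickly as modules are stacked.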

Inception module: naive version

Auxiliary classifiers

Apart from the main classifier head at the end of the model, they create two extensions to make predictions from different scales and call these additional parts auxiliary classifiers. An auxiliary classifier’s structure is as follows:

  • An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the first auxiliary extension and 4×4×528 for the second one.

  • A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation (ReLU).

  • A fully connected layer with 1024 units and rectified linear activation (ReLU), followed by dropout with a 70% ratio (40% in the main classifier head).

  • A fully connected layer with softmax activation function as the classifier, predicting the same 1000 classes as the primary classifier.
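The steps above can be traced as a shape walkthrough in plain Python. The 14×14 input size for the first auxiliary classifier follows from the 4×4×512 pooled output stated above; shapes are (H, W, C) with the batch dimension omitted:

```python
def aux_classifier_shapes(in_hw, in_ch):
    # Trace tensor shapes through one auxiliary classifier.
    h, w = in_hw
    # 5x5 average pooling with stride 3 (no padding):
    # out = (in - 5) // 3 + 1, e.g. 14 -> 4
    h, w = (h - 5) // 3 + 1, (w - 5) // 3 + 1
    return [
        ("avg_pool_5x5_s3", (h, w, in_ch)),
        ("conv_1x1_128",    (h, w, 128)),   # 1x1 conv for dimension reduction
        ("fc_1024",         (1024,)),       # flatten, then 1024-unit FC + ReLU
        ("fc_softmax_1000", (1000,)),       # 1000-way softmax classifier
    ]

# First auxiliary classifier: 14x14x512 input -> 4x4x512 after pooling.
for name, shape in aux_classifier_shapes((14, 14), 512):
    print(name, shape)
```

Running the same trace with a 528-channel input reproduces the 4×4×528 pooled output quoted for the second auxiliary classifier.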

Inception architecture

The above visualization shows the two auxiliary classifiers and the main one at the top. We also see that the architecture starts with basic operations such as regular convolution and max pooling blocks, and then consists mostly of Inception blocks. Note that in each box drawn in the above diagram ...