Understanding Multi-Layer Perceptrons

Unveil the structure and operational flow of MLPs, from input processing to backpropagation in deep learning.


Multi-layer perceptrons (MLPs) are possibly the most frequently illustrated neural networks, yet most of those illustrations leave out a few fundamental explanations. Since MLPs are the foundation of deep learning, this section aims to provide a clearer perspective.

MLP architecture

A high-level representation of a multi-layer perceptron

A typical visual representation of an MLP is shown in the illustration above where:

  • $X_1$ to $X_3$ on the left represent the inputs.
  • The middle nodes represent the hidden layers.
  • The layer on the right is the output.

This high-level representation shows the feed-forward nature of the network. In a feed-forward network, information flows between layers in the forward direction only; the information (features) learned at a layer is not shared with any prior layer.
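To make the feed-forward flow concrete, here is a minimal sketch of such a network with TensorFlow's Keras API. The layer sizes (three inputs to match $X_1$ to $X_3$, two hidden layers of 8 and 4 nodes, and a single output node) are illustrative assumptions rather than values taken from the illustration.

```python
# A minimal feed-forward MLP sketch using TensorFlow's Keras API.
# All layer sizes and activations here are illustrative assumptions.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(3,)),                      # input layer: X1, X2, X3
    tf.keras.layers.Dense(8, activation="relu"),     # first hidden layer
    tf.keras.layers.Dense(4, activation="relu"),     # second hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer
])

model.summary()
```

Calling this model on a batch pushes information strictly forward through the layer stack; no layer feeds its output back to an earlier one.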

The abstracted network shown in the illustration above is unwrapped to its elements in the illustration below.

An unwrapped visual of a multi-layer perceptron

MLP workflow

The journey through an MLP involves a series of meticulously orchestrated steps, each contributing to the network’s ability to learn and make predictions. Each element, its interactions, and its implementation in the context of TensorFlow are explained step by step as follows:

  1. Data introduction: The process starts with a dataset, shown at the top left as $X_{n \times p}$, with $n$ samples and $p$ features.
  2. Batch selection: The model ingests a randomly selected batch during training. The batch contains random samples (rows) from $X$ unless otherwise mentioned. The batch size is denoted as $n_b$ here.
  3. Independent processing: By default, the samples in a batch are processed independently, so their ordering does not matter.
  4. Entering the input layer: The input batch enters the network through an input layer. Each node in the input layer corresponds to a sample feature. Explicitly defining the input layer is optional, but it is done here for clarity.
  5. Navigating hidden layers: The input layer is followed by a stack of hidden layers, up to the last (output) layer. These layers perform the “complex” interconnected nonlinear operations. Although perceived as complex, the underlying operations are rather simple arithmetic computations.

  6. Node functionality: A hidden layer is a stack of computing nodes. Each node extracts a feature from the input. For example, in the sheet-break problem, a node at a hidden layer might determine whether the rotations between two specific rollers are out of sync. A node can, therefore, be imagined as solving one arbitrary subproblem.

  7. Feature mapping: The stack of outputs coming from a layer’s nodes is called a feature map or representation. The size of the feature map, which is also the number of nodes, is called the layer size.

  8. Layer-to-layer transmission: Intuitively, this feature map holds the results of the various subproblems solved at each node. These results carry predictive information forward, layer by layer, until the output layer uses them to predict the response.

  9. Perceptron basics: Mathematically, a node is a perceptron made of weight and bias parameters. The weights at a node are denoted by a vector $w$ and the bias by $b$. A minimal sketch of a node’s computation follows this list.

  10. Processing layer inputs: All the input sample features go to a node. The input to the first hidden layer is the input data features $x = \{x_1, \ldots, x_p\}$. For any intermediate layer, it’s the output (feature map) of the previous layer, denoted as $z = \{z_1, \ldots, z_m\}$, where $m$ is the size of the prior layer.

  11. Feature extraction logic: Consider a hidden layer $l$ of size $m_l$ in the illustration. A node $j$ in the layer $l$ performs a feature extraction with a dot product between the input feature map $z^{(l-1)}$ ...
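
To ground the arithmetic described in the steps above, the following is a minimal NumPy sketch of what a single hidden-layer node computes: a dot product between the incoming feature map and the node’s weight vector $w$, plus the bias $b$, followed by a nonlinearity. The batch size, prior-layer size, and the choice of a ReLU activation are illustrative assumptions.

```python
# A minimal sketch of one hidden-layer node's computation.
# The sizes and the ReLU activation are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_b, m = 2, 4                        # assumed batch size and prior-layer size
z_prev = rng.normal(size=(n_b, m))   # feature map from layer l-1, shape (n_b, m)
w = rng.normal(size=(m,))            # weight vector of one node in layer l
b = 0.1                              # bias of that node

pre_activation = z_prev @ w + b            # dot product per sample, plus bias
z_node = np.maximum(0.0, pre_activation)   # ReLU nonlinearity

print(z_node.shape)  # (2,) -- one extracted feature per sample in the batch
```

Stacking this computation across all $m_l$ nodes of layer $l$ yields the layer’s feature map of shape $(n_b, m_l)$, which becomes the input to layer $l+1$.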