Sigmoid vs ReLU

An activation function transforms the weighted sum input of a neuron or node. Sigmoid and ReLU are two highly popular activation functions, and we'll cover them in detail in our Answer today. To understand activation functions, we must first establish what a node is in a neural network.

Nodes in neural networks

A node is a processing unit connected to many other nodes to make up a neural network. Nodes convert the input they receive into an output, which serves as the input to the next layer (if any), and so on. Nodes are connected by edges that carry weights, i.e., connection strengths.

A multi-layer neural network

Purpose of activation functions

The purpose of an activation function is to decide whether a neuron should fire or not. Through activation functions, we determine how much a particular neuron contributes to the prediction process and whether it should be counted at all.

The second crucial reason they exist is to bring non-linearity to the network.

Activation functions

The non-linearity concept

In simple words, non-linearity allows neural networks to learn complex and nonlinear relationships in the data. If a neural network used only linear activation functions, it would be limited to learning linear relationships between input features. This would restrict its ability to capture and then predict patterns accurately.

By introducing non-linearity through activation functions, neural networks gain the ability to capture complex features in the data. We need this ability for advanced tasks such as image recognition or natural language processing.
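To see why linearity alone is limiting, here is a minimal NumPy sketch (the shapes and random weights are illustrative assumptions): stacking two linear layers without an activation collapses into a single linear map, while inserting a non-linearity does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # a small batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))     # weights of the first "layer"
W2 = rng.normal(size=(5, 2))     # weights of the second "layer"

# Two stacked linear layers...
two_linear = x @ W1 @ W2
# ...are equivalent to one linear layer with combined weights W1 @ W2.
one_linear = x @ (W1 @ W2)
print(np.allclose(two_linear, one_linear))   # True: no extra expressive power

# Adding a non-linearity (here ReLU) between the layers breaks this equivalence.
nonlinear = np.maximum(0, x @ W1) @ W2
print(np.allclose(nonlinear, one_linear))    # False in general
```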

Sigmoid functions and ReLU are two such activation functions that have been used extensively in neural networks.

ReLU and sigmoid

Sigmoid functions

Sigmoid functions map the weighted sum of a neuron's inputs to a value between 0 and 1 (for the logistic sigmoid) or between -1 and 1 (for tanh).

Mathematical representation

The sigmoid function has an S-shaped curve, and its output approaches 0 as the input x becomes very negative and approaches 1 as x becomes very positive. The weighted sum is mapped into this range using the following formula: f(x) = 1 / (1 + exp(-x))

Sigmoid function formula
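As a quick sanity check, here is a minimal NumPy sketch of this formula (the function name is just illustrative):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [~0.000045, 0.5, ~0.999955]: large negative inputs approach 0,
# large positive inputs approach 1.
```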

Advantages of using sigmoid functions

Sigmoid functions are usually implemented in binary classification tasks because of their ability to map real-valued numbers to probabilities.

The output of sigmoid functions is bounded, which prevents large or unstable activations during training.

Disadvantages of using sigmoid functions

A major disadvantage of sigmoid functions is the gradient vanishing problem.

Note: The gradient vanishing problem occurs during training when the gradients of the loss function become very small as they backpropagate through different layers. This causes slow learning because the weights of the early layers receive only tiny updates.

Sigmoid functions are prone to this because their gradients saturate to very small values for extreme inputs, limiting the flow of gradients backward through many layers.
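The saturation is easy to see numerically: the derivative of the logistic sigmoid is sigmoid(x) * (1 - sigmoid(x)), which peaks at 0.25 and shrinks toward 0 for large inputs. A small sketch with illustrative values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the logistic sigmoid: s(x) * (1 - s(x)), at most 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
# Gradients: 0.25, ~0.105, ~0.0066, ~0.000045.
# Multiplying many such small factors across layers makes early-layer gradients vanish.
```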

The computation of the exponential function in the sigmoid formula can be computationally expensive, especially when dealing with large datasets.

Sigmoid function graph depiction

This is how the sigmoid function looks when plotted as a graph.

Sigmoid graph depiction

ReLU

The ReLU (rectified linear unit) function acts as the identity on the weighted sum of a neuron's inputs if that sum is positive, and returns zero otherwise. It lets positive values pass through unchanged and doesn't activate the neuron when the sum is negative. In simple words, a neuron is activated only if its input exceeds the threshold of 0.

Mathematical representation

ReLU is a function that takes an input x and returns the number itself if it is greater than zero, or zero otherwise. It takes the maximum of the number x and 0. It's given by: f(x) = max(0, x)

ReLU function formula
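A minimal NumPy sketch of the same formula (illustrative only):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: passes positive values through, zeros out the rest."""
    return np.maximum(0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0, 7.0])))
# [0. 0. 0. 2. 7.]
```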

Advantages of using ReLU

ReLU can overcome both of the disadvantages found in sigmoid functions.

It avoids the vanishing gradient problem since it has a constant gradient of 1 for all the positive inputs. The flow of gradients backward during backpropagation becomes easier, and the training becomes more effective.

Second, it is computationally efficient: values below zero are simply set to zero, which is much cheaper than evaluating an exponential function.

Disadvantages of using ReLU

ReLU is also not devoid of concerns, the biggest one being the dying ReLU problem. If all of the weights lead to negative inputs for a neuron, the ReLU function always returns zero. This causes the neuron to become "dead". One possible solution is to use a variant of ReLU called "Leaky ReLU".

Note: You can learn more about the dying ReLU problem here.
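For reference, here is a hedged sketch of the Leaky ReLU variant mentioned above; the slope value 0.01 is a common but illustrative choice, not a fixed standard.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Like ReLU, but negative inputs keep a small, non-zero slope,
    # so a neuron stuck in the negative region can still recover.
    return np.where(x > 0, x, negative_slope * x)

print(leaky_relu(np.array([-5.0, -1.0, 0.0, 3.0])))
# [-0.05 -0.01  0.    3.  ]
```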

ReLU graph depiction

This is how the ReLU function looks when plotted as a graph.

ReLU graph depiction

End notes

Sigmoid functions have nowadays been overshadowed by newer activation functions such as ReLU. This is largely due to the simpler yet efficient mechanism behind ReLU. Even though ReLU has its own concerns, its advantages, including avoidance of the vanishing gradient problem and sparsity, are far more prominent and are the reason it is so widely used.

Sigmoid functions in a nutshell

Sigmoid functions are more useful in binary classification problems where the output needs to be interpreted as probabilities. They are also suitable for applications that require bounded outputs between 0 and 1. Lastly, some applications need a smooth and continuous activation function like sigmoid functions.

ReLU in a nutshell

ReLU functions are more useful for most hidden layers in deep learning architectures. They excel at overcoming the vanishing gradient problem, leading to faster convergence during training, even on larger datasets. However, the dying ReLU problem should be mitigated in situations where it is likely to occur.
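To tie the two summaries together, here is a minimal PyTorch sketch of the common pattern they describe: ReLU in the hidden layers and a sigmoid on the output of a binary classifier. The layer sizes are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

# A small binary classifier: ReLU activations in the hidden layers,
# sigmoid on the single output unit to produce a probability in (0, 1).
model = nn.Sequential(
    nn.Linear(20, 64),   # 20 input features (illustrative)
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),
)

x = torch.randn(8, 20)   # a batch of 8 examples
probs = model(x)         # shape (8, 1), values between 0 and 1
print(probs.shape, float(probs.min()), float(probs.max()))
```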

Condensed table of differences

Let's go through this table to grasp the major differences between the two activation functions quickly.

| Sigmoid activation | ReLU activation |
| --- | --- |
| Outputs probabilities (0 to 1) | Non-linear and fast |
| Prone to vanishing gradient | Avoids vanishing gradient |
| Used in binary classification | Widely used in deep networks |
| Smooth and continuous | Introduces sparsity and efficiency |
| Computationally expensive | Simple and computationally efficient |
| Saturates for extreme inputs | Can suffer from the "dying ReLU" problem |
| Not zero-centered | Does not bound the output |
| Historically used in older networks | Prevalent in modern deep learning |
| Suitable for small networks | Suitable for large networks |
| Moderate performance in deep networks | Better performance in deep networks |
| Formula: f(x) = 1 / (1 + exp(-x)) | Formula: f(x) = max(0, x) |

How well do you know ReLU and sigmoid functions?

Try these simple match-the-answer questions below and get a recap.

Match the answer

Match each problem on the left with its effect on the right.

Problems:
The vanishing gradient problem
The dying ReLU problem

Effects:
leads to fully deactivated neurons
leads to slow learning through training data
