An activation function transforms the weighted sum input of a neuron or node. Sigmoid and ReLU are two highly popular activation functions, and we'll cover them in detail in our Answer today. To understand activation functions, we first must establish what a node is in neural networks.
A node is a processing unit connected to many other nodes to make up a neural network. Nodes convert the input they obtain into an output, which serves as the input to the next layer (if any), and so on. The nodes are connected using edges that hold various weights, i.e., connection strengths.
The purpose of activation functions is to decide whether a neuron should fire or not. Through activation functions, we decide how important a particular neuron is in the prediction process and whether it should be counted at all.
The second crucial reason they exist is to bring non-linearity to the network.
In simple words, non-linearity allows neural networks to learn complex and nonlinear relationships in the data. If a neural network used only linear activation functions, it would be limited to learning linear relationships between input features. This would restrict its ability to capture and then predict patterns accurately.
By introducing non-linearity through activation functions, neural networks gain the ability to capture complex features in the data. We need this ability for advanced tasks such as image recognition or natural language processing.
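To make this concrete, here is a minimal NumPy sketch (the matrices, shapes, and values are arbitrary, chosen purely for illustration) showing that stacking linear layers without an activation function collapses into a single linear map, while inserting a non-linearity such as ReLU does not:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation function between them.
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))
x = rng.normal(size=(3,))

two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x  # the same mapping, collapsed into a single matrix

print(np.allclose(two_linear_layers, one_linear_layer))  # True

# Inserting a non-linearity (here ReLU) between the layers breaks the collapse,
# so the network can represent more than a single linear transformation.
with_relu = W2 @ np.maximum(0, W1 @ x)
print(np.allclose(two_linear_layers, with_relu))  # generally False
```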
Sigmoid functions and ReLU are two such activation functions and have been used extensively in neural networks.
Sigmoid functions squash the weighted sum of the neuron inputs to a value between 0 and 1 (for the logistic sigmoid) or between -1 and 1 (for tanh).
The sigmoid function has an S-shaped curve: its output approaches 0 as the input becomes very negative and approaches 1 as the input becomes very positive.
Sigmoid functions are usually used in binary classification tasks because of their ability to map real-valued numbers to probabilities.
The output of sigmoid functions is bounded, which prevents large or unstable activations during training.
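As a quick illustration, a minimal NumPy implementation of the logistic sigmoid might look like this:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# Large negative inputs map close to 0, large positive inputs map close to 1,
# and an input of 0 maps to exactly 0.5.
print(sigmoid(np.array([-10.0, -1.0, 0.0, 1.0, 10.0])))
```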
A major disadvantage of sigmoid functions is the vanishing gradient problem.
Note: The vanishing gradient problem occurs during training when the gradients of the loss function become extremely small as they backpropagate through the layers. This causes slow learning because the weights of the early layers receive only tiny updates.
Sigmoid functions are prone to this because their gradients saturate to very small values for extreme inputs, limiting the flow of gradients backward through many layers.
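This saturation is easy to see numerically. The gradient of the logistic sigmoid is f'(x) = f(x) * (1 - f(x)), which peaks at 0.25 at x = 0 and shrinks rapidly as the input moves away from zero. Here is a minimal sketch, reusing the sigmoid definition from above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the logistic sigmoid: f'(x) = f(x) * (1 - f(x)).
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   gradient = {sigmoid_grad(x):.6f}")
# x =   0.0   gradient = 0.250000
# x =   2.0   gradient = 0.104994
# x =   5.0   gradient = 0.006648
# x =  10.0   gradient = 0.000045
```

Multiplying many such small gradients together during backpropagation shrinks them even further, which is why the early layers learn so slowly.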
Computing the exponential function in the sigmoid formula is also relatively expensive, especially when dealing with large datasets.
The graph below shows the shape of the sigmoid function.
The ReLU (rectified linear unit) function acts as the identity function on the weighted sum of the neuron inputs if it is positive and outputs zero otherwise. It lets positive values pass through as they are and doesn't activate the neuron if the sum is negative. In simple words, a neuron is activated only if its input exceeds the threshold of 0.
ReLU is a function that takes an input x and returns x if it is positive and 0 otherwise, i.e., f(x) = max(0, x).
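A minimal NumPy sketch of ReLU and its gradient (using 0 as the gradient at x = 0, a common convention):

```python
import numpy as np

def relu(x):
    """ReLU: passes positive inputs through unchanged and outputs 0 otherwise."""
    return np.maximum(0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise."""
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # positive values pass through; negative values become 0
print(relu_grad(x))  # constant gradient of 1 for all positive inputs
```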
ReLU can overcome both of the disadvantages found in sigmoid functions.
It avoids the vanishing gradient problem since it has a constant gradient of 1 for all the positive inputs. The flow of gradients backward during backpropagation becomes easier, and the training becomes more effective.
Second, it is computationally cheaper: all values below zero are simply set to zero, with no expensive exponential function to evaluate.
ReLU is also not devoid of concerns, the biggest one being the dying ReLU problem. If all of the weights lead to negative inputs for a neuron, the ReLU function always returns zero. This causes the neuron to become "dead" and stop learning. One possible solution is to use a variant of ReLU called "Leaky ReLU".
Note: You can learn more about the dying ReLU problem here.
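As an illustration of that workaround, Leaky ReLU keeps a small slope (commonly a value such as 0.01) for negative inputs instead of zeroing them out, so the gradient never becomes exactly zero. A minimal sketch:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: behaves like ReLU for positive inputs, but scales
    negative inputs by a small slope alpha instead of setting them to 0."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(leaky_relu(x))  # negative inputs become small negative values, e.g., -5.0 -> -0.05
```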
The graph below shows the shape of the ReLU function.
Nowadays, sigmoid functions have been overshadowed by newer activation functions such as ReLU. This is perhaps due to the simpler yet efficient nature of the mechanism behind ReLU. Even though ReLU has its own concerns, its advantages, including avoidance of the vanishing gradient problem and sparsity, are more prominent and are the reason it is used so widely.
Sigmoid functions are more useful in binary classification problems where the output needs to be interpreted as probabilities. They are also suitable for applications that require bounded outputs between 0 and 1. Lastly, some applications need a smooth and continuous activation function like sigmoid functions.
ReLU functions are more useful in deep learning architectures for most hidden layers. They excel at overcoming the vanishing gradient problem, leading to faster convergence during training, even on larger datasets. However, the dying ReLU problem should be handled when its likelihood is high in a particular setup.
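For example, a common pattern is to use ReLU in the hidden layers and a sigmoid only on the output unit of a binary classifier. Here is a small PyTorch sketch of that idea (the layer sizes and batch shape are arbitrary, chosen only for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical binary classifier: ReLU for the hidden layers,
# sigmoid on the single output unit to produce a probability.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),
)

x = torch.randn(8, 20)      # a batch of 8 examples with 20 features each
probabilities = model(x)    # outputs lie between 0 and 1
print(probabilities.shape)  # torch.Size([8, 1])
```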
Let's go through this table to grasp the major differences between the two activation functions quickly.
| Sigmoid activation | ReLU activation |
| --- | --- |
| Outputs probabilities (0 to 1) | Non-linear and fast |
| Prone to the vanishing gradient problem | Avoids the vanishing gradient problem |
| Used in binary classification | Widely used in deep networks |
| Smooth and continuous | Introduces sparsity and efficiency |
| Computationally expensive | Simple and computationally efficient |
| Saturates for extreme inputs | Can suffer from the "dying ReLU" problem |
| Not zero-centered | Output is not bounded |
| Historically used in older networks | Prevalent in modern deep learning |
| Suitable for small networks | Suitable for large networks |
| Moderate performance in deep networks | Better performance in deep networks |
| Formula: f(x) = 1 / (1 + exp(-x)) | Formula: f(x) = max(0, x) |
Try these simple match-the-answer questions below for a quick recap.

1. The vanishing gradient problem
2. The dying ReLU problem

A. Leads to fully deactivated neurons
B. Leads to slow learning during training