Sigmoid vs ReLU

An activation function transforms the weighted sum input of a neuron or node. Sigmoid and ReLU are two highly popular activation functions, and we'll cover them in detail in our Answer today. To understand activation functions, we must first establish what a node is in a neural network.

Nodes in neural networks

A node is a processing unit connected to many other nodes to make up a neural network. Nodes convert the input they receive into an output, which serves as the input to the next layer (if any), and so on. Nodes are connected by edges that carry weights, i.e., connection strengths.

A multi-layer neural network

Purpose of activation functions

The purpose of an activation function is to decide whether a neuron should fire or not. Through activation functions, we determine how much a particular neuron contributes to the prediction process and whether it should be counted at all.

The second crucial reason they exist is to bring non-linearity to the network.

Activation functions

The non-linearity concept

In simple words, non-linearity allows neural networks to learn complex and nonlinear relationships in the data. If a neural network used only linear activation functions, it would be limited to learning linear relationships between input features. This would restrict its ability to capture and then predict patterns accurately.

By introducing non-linearity through activation functions, neural networks gain the ability to capture complex features in the data. We need this ability for advanced tasks such as image recognition or natural language processing.
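To see why linearity alone is limiting, here is a minimal NumPy sketch (the shapes and random weights are illustrative assumptions): stacking two linear layers without an activation collapses into a single linear map, while inserting a non-linearity does not.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))      # a small batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))     # weights of the first "layer"
W2 = rng.normal(size=(5, 2))     # weights of the second "layer"

# Two stacked linear layers...
two_linear = x @ W1 @ W2
# ...are equivalent to one linear layer with combined weights W1 @ W2.
one_linear = x @ (W1 @ W2)
print(np.allclose(two_linear, one_linear))   # True: no extra expressive power

# Adding a non-linearity (here ReLU) between the layers breaks this equivalence.
nonlinear = np.maximum(0, x @ W1) @ W2
print(np.allclose(nonlinear, one_linear))    # False in general
```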

Sigmoid functions and ReLU are two such activation functions that have been used extensively in neural networks.

ReLU and sigmoid

Sigmoid functions

Sigmoid functions map the weighted sum of a neuron's inputs to a value between 0 and 1 (for the logistic sigmoid) or between -1 and 1 (for tanh).

Mathematical representation

The sigmoid function has an S-shaped curve, and its output approaches 0 as the input x becomes very negative and approaches 1 as x becomes very positive. The weighted sum is mapped into this range using the following formula: f(x) = 1 / (1 + exp(-x))

Sigmoid function formula
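As a quick sanity check, here is a minimal NumPy sketch of this formula (the function name is just illustrative):

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [~0.000045, 0.5, ~0.999955]: large negative inputs approach 0,
# large positive inputs approach 1.
```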

Advantages of using sigmoid functions

Sigmoid functions are usually implemented in binary classification tasks because of their ability to map real-valued numbers to probabilities.

The output of sigmoid functions is bounded, which prevents large or unstable activations during training.

Disadvantages of using sigmoid functions

A major disadvantage of sigmoid functions is the gradient vanishing problem.

Note: The gradient vanishing problem occurs during training when the gradients of the loss function become very small as they backpropagate through different layers. This causes slow learning because the weights of the early layers receive only tiny updates.

Sigmoid functions are prone to this because their gradients saturate to very small values for extreme inputs, limiting the flow of gradients backward through many layers.
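The saturation is easy to see numerically: the derivative of the logistic sigmoid is sigmoid(x) * (1 - sigmoid(x)), which peaks at 0.25 and shrinks toward 0 for large inputs. A small sketch with illustrative values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the logistic sigmoid: s(x) * (1 - s(x)), at most 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

for x in [0.0, 2.0, 5.0, 10.0]:
    print(x, sigmoid_grad(x))
# Gradients: 0.25, ~0.105, ~0.0066, ~0.000045.
# Multiplying many such small factors across layers makes early-layer gradients vanish.
```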

The computation of the exponential function in the sigmoid formula can be computationally expensive, especially when dealing with large datasets.

Sigmoid function graph depiction

This is how the sigmoid function looks when plotted as a graph.

Sigmoid graph depiction

ReLU

The ReLU (rectified linear unit) function acts as the identity on the weighted sum of a neuron's inputs if that sum is positive, and returns zero otherwise. It lets positive values pass through unchanged and doesn't activate the neuron when the sum is negative. In simple words, a neuron is activated only if its input exceeds the threshold of 0.

Mathematical representation

ReLU is a function that takes an input x and returns the number itself if it is greater than zero, or zero otherwise. It takes the maximum of the number x and 0. It's given by: f(x) = max(0, x)

ReLU function formula
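A minimal NumPy sketch of the same formula (illustrative only):

```python
import numpy as np

def relu(x):
    """Rectified linear unit: passes positive values through, zeros out the rest."""
    return np.maximum(0, x)

print(relu(np.array([-3.0, -0.5, 0.0, 2.0, 7.0])))
# [0. 0. 0. 2. 7.]
```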

Advantages of using ReLU

ReLU can overcome both of the disadvantages found in sigmoid functions.

It avoids the vanishing gradient problem since it has a constant gradient of 1 for all the positive inputs. The flow of gradients backward during backpropagation becomes easier, and the training becomes more effective.

Second, it is computationally efficient: values below zero are simply set to zero, which is much cheaper than evaluating an exponential function.

Disadvantages of using ReLU

ReLU is also not devoid of concerns, the biggest one being the dying ReLU problem. If all of the weights lead to negative inputs for a neuron, the ReLU function always returns zero. This causes the neuron to become "dead". One possible solution is to use a variant of ReLU called "Leaky ReLU".

Note: You can learn more about the dying ReLU problem here.
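For reference, here is a hedged sketch of the Leaky ReLU variant mentioned above; the slope value 0.01 is a common but illustrative choice, not a fixed standard.

```python
import numpy as np

def leaky_relu(x, negative_slope=0.01):
    # Like ReLU, but negative inputs keep a small, non-zero slope,
    # so a neuron stuck in the negative region can still recover.
    return np.where(x > 0, x, negative_slope * x)

print(leaky_relu(np.array([-5.0, -1.0, 0.0, 3.0])))
# [-0.05 -0.01  0.    3.  ]
```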

ReLU graph depiction

This is how the ReLU function looks when plotted as a graph.

ReLU graph depiction

End notes

Sigmoid functions have nowadays been overshadowed by newer activation functions such as ReLU. This is largely due to the simpler yet efficient mechanism behind ReLU. Even though ReLU has its own concerns, its advantages, including avoidance of the vanishing gradient problem and sparsity, are far more prominent and are the reason it is so widely used.

Sigmoid functions in a nutshell

Sigmoid functions are more useful in binary classification problems where the output needs to be interpreted as probabilities. They are also suitable for applications that require bounded outputs between 0 and 1. Lastly, some applications need a smooth and continuous activation function like sigmoid functions.

ReLU in a nutshell

ReLU functions are more useful for most hidden layers in deep learning architectures. They excel at overcoming the vanishing gradient problem, leading to faster convergence during training, even on larger datasets. However, the dying ReLU problem should be mitigated in situations where it is likely to occur.
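To tie the two summaries together, here is a minimal PyTorch sketch of the common pattern they describe: ReLU in the hidden layers and a sigmoid on the output of a binary classifier. The layer sizes are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

# A small binary classifier: ReLU activations in the hidden layers,
# sigmoid on the single output unit to produce a probability in (0, 1).
model = nn.Sequential(
    nn.Linear(20, 64),   # 20 input features (illustrative)
    nn.ReLU(),
    nn.Linear(64, 32),
    nn.ReLU(),
    nn.Linear(32, 1),
    nn.Sigmoid(),
)

x = torch.randn(8, 20)   # a batch of 8 examples
probs = model(x)         # shape (8, 1), values between 0 and 1
print(probs.shape, float(probs.min()), float(probs.max()))
```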

Condensed table of differences

Let's go through this table to grasp the major differences between the two activation functions quickly.

| Sigmoid activation | ReLU activation |
| --- | --- |
| Outputs probabilities (0 to 1) | Non-linear and fast |
| Prone to vanishing gradient | Avoids vanishing gradient |
| Used in binary classification | Widely used in deep networks |
| Smooth and continuous | Introduces sparsity and efficiency |
| Computationally expensive | Simple and computationally efficient |
| Saturates for extreme inputs | Can suffer from the "dying ReLU" problem |
| Not zero-centered | Does not bound the output |
| Historically used in older networks | Prevalent in modern deep learning |
| Suitable for small networks | Suitable for large networks |
| Moderate performance in deep networks | Better performance in deep networks |
| Formula: f(x) = 1 / (1 + exp(-x)) | Formula: f(x) = max(0, x) |

How well do you know ReLU and sigmoid functions?

Try these simple match-the-answer questions below and get a recap.

Match the answer

Match each problem on the left with its effect on the right.

Problems:
The vanishing gradient problem
The dying ReLU problem

Effects:
leads to fully deactivated neurons
leads to slow learning through training data
