
Activating Neural Network Potential with Advanced Functions

Understand how advanced activation functions such as ReLU, leaky ReLU, ELU, and SELU influence neural network training by addressing vanishing and exploding gradient problems. Learn how these functions introduce nonlinearity, stabilize gradients, and enable deep networks to learn complex patterns effectively.

Activation functions are one of the primary drivers of neural networks. An “activation” introduces nonlinearity into a network.

Note: A network with linear activation is equivalent to a simple regression model.

The nonlinearity of the activations makes a neural network capable of learning nonlinear patterns in complex problems.
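To see why this matters, here is a minimal NumPy sketch (the matrices, shapes, and random values are illustrative, not from this article) showing that stacking layers without a nonlinear activation collapses into a single linear map, i.e., an ordinary linear regression model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between: y = W2 @ (W1 @ x + b1) + b2
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)

x = rng.normal(size=3)
two_layer_out = W2 @ (W1 @ x + b1) + b2

# The same mapping written as a single linear layer: y = W @ x + b
W, b = W2 @ W1, W2 @ b1 + b2
single_layer_out = W @ x + b

print(np.allclose(two_layer_out, single_layer_out))  # True: the stack is just one linear map
```

A nonlinear activation applied between the two layers breaks this collapse, which is what lets the network represent nonlinear functions.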

But there are a variety of activations, like tanh (hyperbolic tangent), ELU (exponential linear unit), ReLU (rectified linear unit), and many more. Does choosing one over the other improve a model?

Yes, if appropriately chosen, an activation can significantly improve a model. An appropriate activation doesn’t have vanishing and/or exploding gradient issues.
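For reference, the standard formulas for the activations discussed here can be sketched directly in NumPy. This is an illustrative sketch, not library code; the leaky-ReLU slope and the SELU constants below are the commonly used default values, assumed rather than taken from this article:

```python
import numpy as np

def relu(x):
    # ReLU: zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Leaky ReLU: a small nonzero slope (alpha) on the negative side,
    # so the gradient there is alpha instead of exactly zero.
    return np.where(x > 0, x, alpha * x)

def elu(x, alpha=1.0):
    # ELU: smooth exponential saturation toward -alpha for negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x, scale=1.0507, alpha=1.6733):
    # SELU: a scaled ELU whose constants are chosen to be self-normalizing.
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x), elu(x), selu(x), sep="\n")
```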

Vanishing and exploding gradients

Deep learning networks are trained with backpropagation. Backpropagation methods are gradient-based. Gradient-based parameter learning can be generalized as:

$\theta_{n+1} \leftarrow \theta_n - \eta \nabla_\theta$

where

  • $n$ is the learning iteration.
  • $\eta$ is the learning rate.
  • $\nabla_\theta$ is the gradient of the loss $\mathcal{L}(\theta)$ with respect to the model parameters $\theta$.

The equation shows that gradient-based learning iteratively estimates $\theta$. In each iteration, the parameter $\theta$ is moved “closer” to its optimal value $\theta^*$.
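As a concrete illustration of this update rule, here is a minimal sketch that applies $\theta_{n+1} \leftarrow \theta_n - \eta \nabla_\theta$ to a simple one-dimensional loss. The quadratic loss, learning rate, and iteration count are assumptions chosen for demonstration only:

```python
def loss(theta):
    # Illustrative 1-D loss with its optimum at theta* = 2.0.
    return (theta - 2.0) ** 2

def grad(theta):
    # Analytic gradient of the loss above: dL/dtheta = 2 * (theta - 2).
    return 2.0 * (theta - 2.0)

theta = -5.0   # initial parameter theta_0
eta = 0.1      # learning rate

for n in range(50):
    theta = theta - eta * grad(theta)   # theta_{n+1} <- theta_n - eta * gradient

print(theta, loss(theta))  # theta approaches the optimum theta* = 2.0
```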

However, whether the gradient will truly bring $\theta$ closer to $\theta^*$ depends on the gradient itself. This is visually demonstrated in the illustrations below. In these illustrations, the horizontal axis is the model parameter $\theta$, the vertical axis is the loss $\mathcal{L}(\theta)$, and $\theta^*$ indicates the optimal parameter at the lowest point of loss. ...