Trusted answers to developer questions

What is gradient clipping?

Saifullah Shakeel

Gradient clipping and why it is needed

When we train a model, we iterate over the training samples, make predictions for them, and estimate the error between the predicted label and the real label. Next, we update the weights using the gradient of the error with respect to the weights. In deep models, calculating this gradient usually requires multiplying many terms together. Two problems arise from this repeated multiplication.
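The training loop described above can be sketched in a few lines. This is a minimal illustration, not a real deep model: the toy data, the single weight `w`, and the learning rate are all assumptions chosen so the example stays small.

```python
# Hypothetical toy data following y = 2x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0               # single weight to learn (assumed model: y_pred = w * x)
learning_rate = 0.05  # assumed step size

for epoch in range(100):
    for x, y_true in samples:
        y_pred = w * x                      # make a prediction
        error = (y_pred - y_true) ** 2      # squared error for this sample
        grad = 2 * (y_pred - y_true) * x    # d(error)/dw
        w -= learning_rate * grad           # update the weight with the gradient

print(round(w, 3))
```

With only one weight there is a single gradient term per update. In a deep model, the same gradient is a product of many such terms, which is where the problems below come from.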

Suppose we have two vectors in which every value is greater than $1$. If we multiply them element-wise, every element of the result will also be greater than $1$. If we multiply the resultant by another vector whose values are all greater than $1$, the new resultant vector will again have all values greater than $1$. If we repeat this operation many times, we eventually reach a point where the values of the resultant vector are too large. This problem is called the exploding gradient problem.

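# Every value in both vectors is greater than 1
vector1 = [1.3, 2.1, 1.73, 1.42, 1.25]
vector2 = [1.26, 1.35, 2.58, 2.81, 1.32]
# Copy vector1 so the original list is not modified
resultant = list(vector1)
# Multiply element-wise by vector2 five times
for _ in range(5):
    for i in range(len(vector1)):
        resultant[i] = resultant[i] * vector2[i]
print("Final product of vector1*(vector2)^5")
print(resultant)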

Now consider the opposite case, where two vectors have all values between $0$ and $1$. When we repeatedly multiply such vectors element-wise, the values of the resultant vector shrink toward zero. This problem is called the vanishing gradient problem.

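# Every value in both vectors is less than 1
vector1 = [0.3, 0.1, 0.73, 0.42, 0.25]
vector2 = [0.26, 0.35, 0.58, 0.81, 0.32]
# Copy vector1 so the original list is not modified
resultant = list(vector1)
# Multiply element-wise by vector2 five times
for _ in range(5):
    for i in range(len(vector1)):
        resultant[i] = resultant[i] * vector2[i]
print("Final product of vector1*(vector2)^5")
print(resultant)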

Gradient clipping targets the first of these problems. Every time we multiply two vectors, we check whether the norm of the resultant vector exceeds a threshold parameter. If it does, we scale the values down using the norm of the vector so that its magnitude stays bounded. This prevents the resultant from exploding in the next multiplication and keeps training stable. The technique mostly solves the exploding gradient problem, but it does not address vanishing gradients.

[Figure: A simple visualization of gradient clipping]

Logically:

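if ||resultant|| > threshold:
    resultant = resultant * threshold / ||resultant||

Where ||resultant|| represents the norm of the vector, which can be the L1 norm, the L2 norm, or any other norm. After rescaling, the vector keeps its direction, but its norm equals the threshold.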
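The rule above can be sketched in plain Python. This is a minimal illustration under some assumptions: it uses the L2 norm, it rescales a clipped vector so that its norm equals the threshold (the variant common libraries use), and the helper names `l2_norm` and `clip_by_norm`, the vectors, and the threshold of `5.0` are all hypothetical.

```python
import math

def l2_norm(vec):
    """Euclidean (L2) norm of a list of numbers."""
    return math.sqrt(sum(v * v for v in vec))

def clip_by_norm(vec, threshold):
    """Return a copy of vec, rescaled so its L2 norm is at most threshold."""
    norm = l2_norm(vec)
    if norm > threshold:
        # Keep the direction, shrink the magnitude to the threshold.
        return [v * threshold / norm for v in vec]
    return list(vec)

# Hypothetical vectors whose values are all greater than 1.
gradient = [1.3, 2.1, 1.73]
factor = [1.26, 1.35, 2.58]
threshold = 5.0  # assumed clipping threshold

resultant = list(gradient)
for _ in range(5):
    # Element-wise multiplication, then clip before the next step.
    resultant = [r * f for r, f in zip(resultant, factor)]
    resultant = clip_by_norm(resultant, threshold)

print(l2_norm(resultant))
```

Without the call to `clip_by_norm`, the values would grow with every multiplication. With it, the norm of `resultant` never exceeds the threshold after any step.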

### TensorFlow syntax ###
tf.clip_by_global_norm(
    t_list, clip_norm, use_norm=None, name=None
)

### PyTorch syntax ###
torch.nn.utils.clip_grad_norm_(
    parameters, max_norm, norm_type=2.0,
    error_if_nonfinite=False
)
TensorFlow and PyTorch syntax for gradient clipping
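To make the idea of a global norm concrete, here is a pure-Python sketch of the computation that TensorFlow documents for `tf.clip_by_global_norm`: every vector in the list is multiplied by `clip_norm / max(global_norm, clip_norm)`. It is not the library's implementation. The function name `clip_by_global_norm_sketch` and the example gradients are hypothetical, and plain lists stand in for tensors.

```python
import math

def clip_by_global_norm_sketch(t_list, clip_norm):
    """Clip a list of vectors by their combined (global) L2 norm."""
    # Global norm: L2 norm over all values of all vectors together.
    global_norm = math.sqrt(sum(v * v for t in t_list for v in t))
    # The scale is 1 when global_norm <= clip_norm, so nothing changes.
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[v * scale for v in t] for t in t_list]
    return clipped, global_norm

# Hypothetical gradients for two parameter groups.
grads = [[3.0, 4.0], [12.0]]
clipped, global_norm = clip_by_global_norm_sketch(grads, clip_norm=6.5)

print(global_norm)  # 13.0
print(clipped)      # [[1.5, 2.0], [6.0]]
```

Because all vectors share one scale factor, their relative sizes are preserved. Clipping each vector separately would not guarantee that.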
