# Techniques to Improve Neural Networks

Learn techniques to improve neural network performance such as batch normalization, advanced regularization, upgrading gradient descent, and better weight initializations.

It is crucial to pick the right activation functions, but when we design a neural network, we face plenty more decisions. We decide how to initialize the weights, which GD algorithm to use, what kind of regularization to apply, and so forth. We have a wide range of techniques to choose from, and new ones come up all the time.

It would be pointless to go into too much detail about all the popular techniques available today. We could fill entire volumes with them. Besides, some of them might be old-fashioned and quaint by the time we complete this course.

For those reasons, this lesson is not comprehensive. We’ll learn about a handful of techniques that generally work well: a starter kit for our journey to ML mastery. At the end of this chapter, we’ll also get a chance to test these techniques firsthand.

## Better weight initialization

In Initializing the Weights, we learned that to avoid squandering a neural network’s power, we should initialize its weights with values that are random and small.

However, that “random and small” principle does not give us concrete numbers. For that, we can use a formula such as Xavier initialization, also known as Glorot initialization. (Both names come from Xavier Glorot, the man who proposed it.)

Xavier initialization comes in a few variants. They all give us an approximate range to initialize the weights, based on the number of nodes connected to them. One common variant gives us this range:

$\large{ |w|\le \sqrt{\frac{2}{\text{nodes in layer}}} }$

The core concept of Xavier initialization is that the more nodes we have in a layer, the smaller the weights. Intuitively, that means the weighted sum of the nodes stays about the same size no matter how many nodes the layer has. Without Xavier initialization, a layer with many nodes would generate a large weighted sum, and that large number could cause problems like dead neurons and vanishing or exploding gradients.
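To make that intuition concrete, here is a minimal sketch in plain Python. It assumes the variant of the formula shown above and draws the weights uniformly within that range. The function names (xavier_limit, init_weights, typical_weighted_sum) and the all-ones inputs are illustrative assumptions, not part of Keras or of this course's code.

```python
import math
import random


def xavier_limit(nodes_in_layer):
    """Upper bound on |w| from the variant above: sqrt(2 / nodes in layer)."""
    return math.sqrt(2.0 / nodes_in_layer)


def init_weights(nodes_in_layer, rng):
    """Draw one weight per node, uniformly within [-limit, +limit]."""
    limit = xavier_limit(nodes_in_layer)
    return [rng.uniform(-limit, limit) for _ in range(nodes_in_layer)]


def typical_weighted_sum(nodes_in_layer, trials, rng):
    """Average magnitude of the weighted sum over several random initializations."""
    total = 0.0
    for _ in range(trials):
        weights = init_weights(nodes_in_layer, rng)
        # Assumption: every input is 1.0, so the weighted sum is the sum of the weights.
        total += abs(sum(weights))
    return total / trials


rng = random.Random(0)  # fixed seed so repeated runs give the same numbers
for nodes in (10, 100, 1000):
    print(nodes, round(xavier_limit(nodes), 4),
          round(typical_weighted_sum(nodes, 200, rng), 4))
```

The per-weight limit shrinks as the layer grows, while the typical weighted sum should stay in the same ballpark for all three layer sizes.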

Even though we have not mentioned Xavier initialization so far, we have already used it, because it is the default initializer in Keras. To replace it with one of the other initialization methods that Keras offers, we can use the kernel_initializer argument. For example, here is a layer that uses an alternative weight initialization method called he_normal:

    model.add(Dense(100, kernel_initializer='he_normal'))


## Changing the gradient descent

If one thing has stayed unchanged throughout this course, it’s the gradient descent algorithm. We changed the way we compute the gradient, from simple derivatives to backpropagation, but so far, the “descent” part is the same as we introduced in the first chapters: multiply the gradient by the learning rate and take a step in the opposite direction.
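As a reminder, the plain descent step can be sketched in a few lines of Python. The function gradient_descent_step and the toy loss used here are hypothetical stand-ins chosen for illustration. In a real network, the gradient would come from backpropagation.

```python
def gradient_descent_step(weights, gradient, lr):
    """One plain GD step: move each weight against its gradient, scaled by lr."""
    return [w - lr * g for w, g in zip(weights, gradient)]


def toy_gradient(weights):
    """Gradient of the hypothetical toy loss L(w) = w0^2 + w1^2."""
    return [2 * w for w in weights]


weights = [3.0, -4.0]
for _ in range(100):
    weights = gradient_descent_step(weights, toy_gradient(weights), lr=0.1)

print(weights)  # both weights end up very close to 0, the minimum of the toy loss
```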

However, modern GD can be subtler than that. In Keras, we can pass additional parameters to the SGD algorithm:

    model.compile(loss='categorical_crossentropy',
                  optimizer=SGD(lr=0.1, decay=1e-6, momentum=0.9),
                  metrics=['accuracy'])


This code includes two new hyperparameters that tweak SGD. To understand decay, remember that the learning rate is a trade-off: the smaller it is, the smaller each step of GD. A small learning rate makes the algorithm more precise, but also slower. When we use decay, the learning rate decreases a bit at each step. A well-configured decay causes GD to take big leaps at the beginning of training, when we usually need speed, and baby steps near the end, when we would rather have precision. This twist on GD is called learning rate decay.
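The following sketch shows how a decaying learning rate might evolve. It assumes an inverse-time schedule, where the rate at a given step is the initial rate divided by (1 + decay × step). That formula is an assumption made for illustration. The text above only says that the rate decreases a bit at each step, and the helper decayed_learning_rate is hypothetical.

```python
def decayed_learning_rate(initial_lr, decay, step):
    """Inverse-time decay (assumed schedule): the rate shrinks as steps accumulate."""
    return initial_lr / (1.0 + decay * step)


initial_lr = 0.1
decay = 1e-6

# Sample the schedule at the start, after a thousand steps, and after a million steps.
schedule = [decayed_learning_rate(initial_lr, decay, step)
            for step in (0, 1_000, 1_000_000)]

for lr in schedule:
    print(lr)
```

Under this assumed schedule, a decay as small as 1e-6 barely changes the rate over the first thousand steps, and roughly halves it after a million.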

The momentum hyperparameter is even subtler. When we introduced GD, we learned that this algorithm has trouble with certain surfaces. For example, it might get stuck in local minima, that is, holes in the loss surface. Another troublesome situation can happen around canyons like the one shown in the following diagram:
