Prepare Data: Random Initial Weights

Learn how to initialize the weights randomly, and how to address the problems in initial weight selection.

Random initialization of weights

The same argument applies here as with the inputs and outputs. We should avoid large initial weights because they feed large signals into the activation function, leading to the saturation we just talked about and a reduced ability to learn better weights.

We could choose initial weights randomly and uniformly from the range $-1.0$ to $+1.0$. That would be a much better idea than using a very large range, say $-1000$ to $+1000$.
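As a minimal sketch of that simple approach (the layer sizes and the numpy usage here are illustrative assumptions, not part of the lesson), a uniform initialization in that range might look like this:

```python
import numpy

# Hypothetical layer sizes, purely for illustration
input_nodes = 3
hidden_nodes = 5

# One weight for every link between an input node and a hidden node,
# sampled uniformly from -1.0 to +1.0
weights_input_hidden = numpy.random.uniform(-1.0, +1.0, (hidden_nodes, input_nodes))
print(weights_input_hidden)
```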

Mathematicians and computer scientists have done the math to work out a rule of thumb for setting the random initial weights given specific shapes of networks and with specific activation functions.

We won’t go into the details of that, but the core idea is this: we have many signals coming into a node, which is exactly what happens in a neural network, and if those signals are already well behaved, neither too large nor oddly distributed, then the weights should help keep them well behaved as they are combined and the activation function is applied. In other words, we don’t want the weights to undermine the effort we put into carefully scaling the input signals. The rule of thumb these mathematicians arrive at is that the initial weights are sampled randomly from a range that is roughly the inverse of the square root of the number of links into a node. So if each node has 3 incoming links, the initial weights should be in the range $\pm 1/\sqrt{3} = \pm 0.577$. If each node has 100 incoming links, the weights should be in the range $\pm 1/\sqrt{100} = \pm 0.1$.
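Here is a minimal sketch of that rule of thumb, assuming a layer whose nodes each receive 100 incoming links (the sizes and variable names are hypothetical):

```python
import numpy

# Hypothetical layer sizes, purely for illustration
incoming_links = 100   # links into each node of this layer
nodes_in_layer = 10

# Rule of thumb: sample uniformly from -1/sqrt(n) to +1/sqrt(n),
# where n is the number of links into a node
bound = 1.0 / numpy.sqrt(incoming_links)
weights = numpy.random.uniform(-bound, +bound, (nodes_in_layer, incoming_links))

print(bound)          # 0.1 for 100 incoming links
print(weights.shape)  # (10, 100)
```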

Bias and saturation

Intuitively, this makes sense. Overly large initial weights would push the activation function in a skewed direction, and very large weights would saturate it. And the more links we have into a node, the more signals are being added together. So the rule of thumb reduces the weight range as the number of incoming links grows.

If we’re already familiar with the idea of sampling from probability distributions, this rule of thumb is really about sampling from a normal distribution with a mean of zero and a standard deviation equal to the inverse of the square root of the number of links into a node. But let’s not worry too much about getting this precisely right, because the rule of thumb assumes quite a few things which may not be true, such as an alternative activation function like $\tanh()$ and a specific distribution of the input signals.
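If we prefer the normal-distribution version, one way to sketch it (again with assumed layer sizes) is to set the standard deviation to the inverse square root of the number of incoming links:

```python
import numpy

# Hypothetical layer sizes, purely for illustration
incoming_links = 100
nodes_in_layer = 10

# Normal distribution with mean 0.0 and
# standard deviation 1/sqrt(number of incoming links)
std_dev = 1.0 / numpy.sqrt(incoming_links)   # equivalently pow(incoming_links, -0.5)
weights = numpy.random.normal(0.0, std_dev, (nodes_in_layer, incoming_links))

print(round(weights.mean(), 3), round(weights.std(), 3))  # roughly 0.0 and 0.1
```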

The following diagram visually summarizes both the simple approach and the more sophisticated approach with a normal distribution.
