
Invasion of the Sigmoids

Explore how sigmoid functions adapt linear regression for binary classification by constraining outputs between 0 and 1 using logistic regression. Understand the implementation of the sigmoid in Python, the concept of forward propagation, and the importance of choosing log loss to optimize gradient descent. This lesson equips you with foundational knowledge to build and train binary classifiers effectively.

Overview of sigmoids

Even though linear regression is not a natural fit for binary classification, that does not mean that we have to scrap our linear regression code and start from scratch. Instead, we can adapt our existing algorithm to this new problem using a technique that statisticians call logistic regression.

Let’s start by looking back at ŷ, the weighted sum of the inputs that we introduced in the lesson: Adding More Dimensions.

\hat{y} = x_1 \cdot w_1 + x_2 \cdot w_2 + x_3 \cdot w_3 + \ldots

In linear regression, ŷ could take any value. Binary classification, however, imposes a tight constraint: ŷ must not drop below 0 nor rise above 1. Here’s an idea: maybe we can find a function that wraps around the weighted sum and constrains it to the range from 0 to 1:

\hat{y} = wrapper\_function(x_1 \cdot w_1 + x_2 \cdot w_2 + x_3 \cdot w_3 + \ldots)

To recap what wrapper_function() should do: it takes the weighted sum, which can be any number, and squashes it into the range from 0 to 1.

The other requirement is that the function we’re looking for should work well with gradient descent. Think about the following:

  • We use this function to calculate y^\hat{y}.
  • Then we use y^\hat{y} to calculate the loss.
  • Finally, we descend the loss with gradient descent.

For gradient descent, the wrapper function should be smooth, without flat areas (where the gradient drops to zero) or gaps (where the gradient is not even defined).

Note: To wrap it up, we want a function that changes smoothly across the range from 0 to 1, without ever jumping or flatlining. Something like this:

As it happens, such a function is well known. It’s called the logistic function, and it belongs to a family of S-shaped functions called sigmoids. Since “logistic function” is a mouthful, people usually just call it the sigmoid for short. Here is the sigmoid’s formula:

\sigma(z) = \frac{1}{1 + e^{-z}}

The Greek letter sigma (σ) stands for “sigmoid.” We’ll use the letter z for the sigmoid’s input, to avoid confusion with the system’s inputs x.

The sigmoid formula is hard to grok intuitively, but its picture tells us everything we need to know. When its input is 0, the sigmoid returns 0.5. From there, it quickly and smoothly falls toward 0 for negative inputs and rises toward 1 for positive inputs, but it never quite reaches those two extremes. In other words, the sigmoid squeezes any value into a narrow band ranging from 0 to 1. It doesn’t have any steep cliffs, and it never goes completely flat. That’s the function we need!

Let’s return to the code and apply this newfound knowledge.

Confidence and doubt

First, we take the formula of the sigmoid and convert it to Python, using NumPy’s exp() function to implement the exponential. The result is a one-line function:

Python 3.5
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

As usual with NumPy-based functions, sigmoid() takes advantage of broadcasting. The z argument can be a single number or a multidimensional array. In the second case, the function returns an array that contains the sigmoids of all the elements of z.
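For instance (a quick check, with made-up inputs), we can confirm both the squashing behavior and the broadcasting:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A single number: the sigmoid of 0 sits exactly at the midpoint
print(sigmoid(0))  # 0.5

# An array: broadcasting applies the sigmoid to every element
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # every value squashed into the range from 0 to 1
```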

Then we go back to the prediction code, or the point where we calculate the weighted sum. The original function looks like this:

Python 3.5
def predict(X, w):
    return np.matmul(X, w)

We modify that function to pass the result through the sigmoid() function:

Python 3.5
def forward(X, w):
    weighted_sum = np.matmul(X, w)
    return sigmoid(weighted_sum)

Later in this course, we’ll see that this process of moving data through the system is also called forward propagation, so we renamed the predict() function to forward().

The result of forward() is our prediction ŷ, a matrix with the same dimensions as the weighted sum: one row per example, and one column. Each element of the matrix is now constrained to the range from 0 to 1.
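To make the dimensions concrete, here is a quick check (the X and w matrices below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    weighted_sum = np.matmul(X, w)
    return sigmoid(weighted_sum)

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])    # three examples, two input variables
w = np.array([[0.5],
              [-0.5]])        # one weight per input, in a single column

y_hat = forward(X, w)
print(y_hat.shape)  # (3, 1): one row per example, one column
print(y_hat)        # every element between 0 and 1
```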

Intuitively, we can read the values of ŷ as forecasts that can be more or less certain. If a value is close to the extremes, like 0.01 or 0.98, that’s a highly confident forecast. If it’s close to the middle, like 0.51, that’s a very uncertain forecast.

During the training phase, that gradual variation in confidence is just what we need: we want the loss to change smoothly, so that we can slide over it with gradient descent. Once we switch from the training phase to the classification phase, however, we want the system to get straight to the point. The labels that we use to train the classifier are either 0 or 1, so the classification should also be a straight 0 or 1. To get that unambiguous answer, during the classification phase we can round the result to the nearest integer, like this:

Python 3.5
def classify(X, w):
    return np.round(forward(X, w))

This function could be named either predict() or classify(); in the case of a classifier, the two words are pretty much synonyms. We opted for classify() to highlight the fact that we’re not doing linear regression anymore: we’re now forecasting a binary value.
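A quick sanity check with made-up numbers shows the rounding at work:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    return sigmoid(np.matmul(X, w))

def classify(X, w):
    return np.round(forward(X, w))

X = np.array([[2.0], [-2.0]])  # two examples, one input variable
w = np.array([[1.0]])

print(forward(X, w))   # roughly [[0.88], [0.12]]: confident forecasts
print(classify(X, w))  # [[1.], [0.]]: unambiguous answers
```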

It seems that we are making great progress toward a classification program, except for a minor difficulty that we are about to face.

Smoothing it out

By adding the sigmoid to our program, we introduced a subtle problem: we made gradient descent less reliable. The problem appears when we update the loss() function in our system to use the new forward() function:

Python 3.5
def mse_loss(X, Y, w):
    return np.average((forward(X, w) - Y) ** 2)

At first glance, this function is almost identical to the previous loss(): the mean squared error of the predictions compared with the actual labels. The only difference is that the function that calculates the predicted labels ŷ has changed from predict() to forward().

That change, however, has far-reaching consequences. The forward() function involves the calculation of a sigmoid, and because of that sigmoid, this loss is not the same loss that we had before. Here is what the new loss function looks like:

It looks like we have a problem here. See those deep canyons leading straight into holes? We mentioned those holes when we introduced gradient descent: they are the dreaded local minima. Remember that the goal of gradient descent is to move downhill? Now consider what happens if gradient descent enters a local minimum: since there is no “downhill” at the bottom of a hole, the algorithm stops there, falsely convinced that it has reached the global minimum it was aiming for.

By looking at this diagram, we can conclude that if we use the mean squared error and the sigmoid together, the resulting loss has an uneven surface littered with local minima. Such a surface is hard to navigate with gradient descent. We’d better look for a different loss function with a smoother, more GD-friendly surface.
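A related symptom is easy to check numerically. In this toy example (a single made-up point with a confidently wrong prediction), the sigmoid saturates, so the mean squared error barely changes as the weight moves, and the gradient that gradient descent relies on all but vanishes:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    return sigmoid(np.matmul(X, w))

def mse_loss(X, Y, w):
    return np.average((forward(X, w) - Y) ** 2)

# One example with label 1, but a large negative weight:
# the prediction is confidently wrong (close to 0)
X = np.array([[1.0]])
Y = np.array([[1.0]])
w = np.array([[-20.0]])

# Central-difference approximation of the loss gradient
eps = 1e-3
gradient = (mse_loss(X, Y, w + eps) - mse_loss(X, Y, w - eps)) / (2 * eps)

print(mse_loss(X, Y, w))  # close to 1: the prediction is badly wrong
print(gradient)           # almost 0: the surface is nearly flat anyway
```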

We can find one such function in statistics textbooks. It’s called the log loss because it’s based on logarithms:

L = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right)

The log loss formula might look daunting, but don’t let it intimidate you. We just need to know that it behaves like a good loss function: the closer the prediction ŷ is to the ground truth y, the lower the loss. Also, the formula looks friendlier once we translate it into code:

Python 3.5
def loss(X, Y, w):
    y_hat = forward(X, w)
    first_term = Y * np.log(y_hat)
    second_term = (1 - Y) * np.log(1 - y_hat)
    return -np.average(first_term + second_term)

If we give it a try, we’ll find that the log loss is simpler than it looks. Remember that each label in the matrix Y is either 0 or 1. For labels that are 0, first_term is multiplied by 0, so it disappears. For labels that are 1, second_term disappears, because it’s multiplied by (1 − Y). So each element of Y contributes only one of the two terms.
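We can watch that term selection happen with a tiny made-up example:

```python
import numpy as np

Y = np.array([1.0, 0.0])      # one positive and one negative label
y_hat = np.array([0.9, 0.2])  # fairly confident predictions

first_term = Y * np.log(y_hat)             # zeroed where the label is 0
second_term = (1 - Y) * np.log(1 - y_hat)  # zeroed where the label is 1

print(first_term)   # [log(0.9), 0]
print(second_term)  # [0, log(0.8)]
print(-np.average(first_term + second_term))  # a small positive loss
```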

Let’s plot the log loss and see what it looks like:

It is nice and smooth! There are no canyons, flat areas, or holes. From now on, this will be our loss function.
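To see all the pieces working together, here is a minimal training sketch. The dataset is made up, and because this lesson hasn’t derived the analytic gradient of the log loss, the sketch approximates the gradient numerically with a central difference:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    return sigmoid(np.matmul(X, w))

def loss(X, Y, w):
    y_hat = forward(X, w)
    return -np.average(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))

# Tiny made-up dataset: one input variable, binary labels
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
Y = np.array([[0.0], [0.0], [1.0], [1.0]])

w = np.zeros((1, 1))
lr, eps = 0.1, 1e-6
for _ in range(100):
    # Central-difference approximation of the gradient with respect
    # to the single weight (good enough for this one-weight sketch)
    gradient = (loss(X, Y, w + eps) - loss(X, Y, w - eps)) / (2 * eps)
    w -= lr * gradient

print(loss(X, Y, w))  # much lower than the starting loss of ~0.693
```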