Invasion of the Sigmoids
Explore how logistic regression adapts linear regression to binary classification by using the sigmoid function to constrain outputs between 0 and 1. Understand how to implement the sigmoid in Python, the concept of forward propagation, and why choosing the log loss matters for gradient descent. This lesson equips you with the foundational knowledge to build and train binary classifiers effectively.
Overview of sigmoids
Even though linear regression is not a natural fit for binary classification, that does not mean that we have to scrap our linear regression code and start from scratch. Instead, we can adapt our existing algorithm to this new problem using a technique that statisticians call logistic regression.
Let’s start by looking back at ŷ, the weighted sum of the inputs that we introduced in the lesson Adding More Dimensions.
In linear regression, ŷ could take any value. Binary classification, however, imposes a tight constraint: ŷ must not drop below 0 nor rise above 1. Here’s an idea: maybe we can find a function that wraps around the weighted sum and constrains it to the range from 0 to 1:
Let’s be clear about what the wrapper_function() does: it takes any number coming out of the weighted sum and squashes it into the range from 0 to 1.
The other requirement is that the function we’re looking for should work well with gradient descent. Think about the following:
- We use this function to calculate ŷ.
- Then we use ŷ to calculate the loss.
- Finally, we descend the loss with gradient descent.
For the sake of gradient descent, the wrapper function should be smooth, without flat areas (where the gradient drops to zero) or gaps (where the gradient is not even defined).
Note: To wrap it up, we want a function that smoothly changes across the range from 0 to 1 without ever jumping or flatlining. Something like this:
As it happens, such a well-known function already exists. It’s called the logistic function, and it belongs to a family of S-shaped functions called sigmoids. Since “logistic function” is a mouthful, people usually just call it the sigmoid, for short. Here is the sigmoid’s formula:

σ(z) = 1 / (1 + e^(-z))
The Greek letter sigma (σ) stands for sigmoid. We use the letter z for the sigmoid’s input to avoid confusion with the system’s inputs x.
The sigmoid formula is hard to grok intuitively, but its picture tells us everything that we need to know. When its input is 0, the sigmoid returns 0.5. Then, it quickly and smoothly falls toward 0 for negative inputs and rises toward 1 for positive inputs, but it never quite reaches those two extremes. In other words, the sigmoid squeezes any value into a narrow band ranging from 0 to 1. It doesn’t have any steep cliffs, and it never goes completely flat. That’s the function we need!
Let’s return to the code and apply this newfound knowledge.
Confidence and doubt
First, we take the formula of the sigmoid and convert it to Python, using NumPy’s exp() function to implement the exponential. The result is a one-line function:
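Here is one way that one-liner might look (a minimal sketch, assuming NumPy is imported as np):

```python
import numpy as np

def sigmoid(z):
    # np.exp() computes e raised to a power element-wise, so this
    # same line works for scalars and arrays alike
    return 1 / (1 + np.exp(-z))
```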
As usual with NumPy-based functions, sigmoid() takes advantage of broadcasting: the argument z can be a single number or a multidimensional array. In the second case, the function returns an array that contains the sigmoids of all the elements of z.
Then we go back to the prediction code, that is, the point where we calculate the weighted sum. The original function looks like this:
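For reference, here is a sketch of that linear-regression function, assuming the conventions from the earlier lessons: X is the matrix of examples (one row per example) and w is the matrix of weights:

```python
import numpy as np

def predict(X, w):
    # Weighted sum of the inputs: one row per example, one column
    return np.matmul(X, w)
```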
We modify that function to pass the result through the sigmoid() function:
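A minimal sketch of the modified function, with sigmoid() repeated so the snippet runs on its own:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    # Pass the weighted sum through the sigmoid, constraining
    # each prediction to the open range (0, 1)
    weighted_sum = np.matmul(X, w)
    return sigmoid(weighted_sum)
```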
Later in this course, we’ll see that this process of moving data through the system is also called forward propagation, so we renamed the predict() function to forward().
The result of forward() is our prediction ŷ, which is a matrix with the same dimensions as the weighted sum, that is, one row per example and one column. Each element in the matrix is now constrained to the range between 0 and 1.
Intuitively, we can see the values of ŷ as forecasts that can be more or less certain. If a value is close to the extremes, like 0.999 or 0.001, that’s a highly confident forecast. If it’s close to the middle, like 0.5, that’s a very uncertain forecast.
During the training phase, that gradual variation in confidence is just what we need. We want the loss to change smoothly so that we can slide over it with gradient descent. However, we want the system to get straight to the point once we switch from the training phase to the classification phase. The labels that we use to train the classifier are either 0 or 1, so the classification should also be a straight 0 or 1. To get that unambiguous answer, during the classification phase, we can round the result to the nearest integer, like this:
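Here’s a sketch of that classification function, with forward() and sigmoid() repeated so the snippet is self-contained:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    return sigmoid(np.matmul(X, w))

def classify(X, w):
    # Round each prediction to the nearest integer: 0 or 1
    return np.round(forward(X, w))
```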
Like the functions we wrote earlier, this one could be named predict(), or classify(). In the case of a classifier, the two words are pretty much synonyms. We opted for classify() to highlight the fact that we are not doing linear regression anymore: we are now forecasting a binary value.
It seems that we are making great progress toward a classification program, except for a minor difficulty that we are about to face.
Smoothing it out
By adding the sigmoid to our program, we introduced a subtle problem: we made gradient descent less reliable. The problem shows up when we update the loss() function in our system to use the new classification code:
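That updated function might look like this (a sketch; forward() and sigmoid() are included so the snippet runs on its own):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    return sigmoid(np.matmul(X, w))

def loss(X, Y, w):
    # Mean squared error, but now computed on sigmoid-squashed
    # predictions instead of the raw weighted sum
    return np.average((forward(X, w) - Y) ** 2)
```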
At first glance, this function is almost identical to the loss() function that we had before: the mean squared error of the predictions compared with the actual labels. The only difference is that the function that calculates the predicted labels has changed from predict() to forward().
That change, however, has far-reaching consequences. The forward() function involves the calculation of a sigmoid, and because of that sigmoid, this loss is not the same loss that we had before. Here is what the new loss function looks like:
It looks like we have a problem here. See those deep canyons leading straight into holes? We mentioned those holes when we introduced gradient descent: they are the dreaded local minima. Remember that the goal of gradient descent is to move downhill? Now consider what happens if gradient descent enters a local minimum: since there is no “downhill” at the bottom of a hole, the algorithm stops there, falsely convinced that it has reached the global minimum it was aiming for.
By looking at this diagram, we can conclude that if we use the mean squared error and the sigmoid together, the resulting loss has an uneven surface littered with local minima. Such a surface is hard to navigate with gradient descent. We’d better look for a different loss function with a smoother, more GD-friendly surface.
We can find one such function in statistics textbooks. It’s called the log loss because it’s based on logarithms:

L = -(1/m) * Σᵢ (yᵢ * log(ŷᵢ) + (1 - yᵢ) * log(1 - ŷᵢ))
The log loss formula might look daunting, but we shouldn’t let it intimidate us. We just need to know that it behaves like a good loss function: the closer the prediction ŷ is to the ground truth y, the lower the loss. Also, the formula looks friendlier once we write it into code:
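In code, the log loss might look like this (with forward() and sigmoid() repeated so the snippet is self-contained):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    return sigmoid(np.matmul(X, w))

def loss(X, Y, w):
    y_hat = forward(X, w)
    # One term per possible label: first_term matters when y is 1,
    # second_term matters when y is 0
    first_term = Y * np.log(y_hat)
    second_term = (1 - Y) * np.log(1 - y_hat)
    return -np.average(first_term + second_term)
```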
If we give it a try, we’ll find that the log loss is simpler than it looks. Remember that each label in the Y matrix is either 0 or 1. For labels that are 0, first_term is multiplied by 0, so it disappears. For labels that are 1, second_term disappears because it’s multiplied by 0. So each element of Y contributes only one of the two terms.
Let’s plot the log loss and see what it looks like:
It is nice and smooth! There are no canyons, flat areas, or holes. From now on, this will be our loss function.