
Invasion of the Sigmoids

Explore how sigmoid functions adapt linear regression for binary classification by constraining outputs between 0 and 1 using logistic regression. Understand the implementation of the sigmoid in Python, the concept of forward propagation, and the importance of choosing log loss to optimize gradient descent. This lesson equips you with foundational knowledge to build and train binary classifiers effectively.

Overview of sigmoids

Even though linear regression is not a natural fit for binary classification, that does not mean that we have to scrap our linear regression code and start from scratch. Instead, we can adapt our existing algorithm to this new problem using a technique that statisticians call logistic regression.

Let’s start by looking back at ŷ, the weighted sum of the inputs that we introduced in the lesson: Adding More Dimensions.

\hat{y} = x_1 \cdot w_1 + x_2 \cdot w_2 + x_3 \cdot w_3 + \ldots

In linear regression, ŷ could take any value. Binary classification, however, imposes a tight constraint: ŷ must not drop below 0 nor rise above 1. Here’s an idea: maybe we can find a function that wraps around the weighted sum and constrains it to the range from 0 to 1:

\hat{y} = wrapper\_function(x_1 \cdot w_1 + x_2 \cdot w_2 + x_3 \cdot w_3 + \ldots)

To recap what wrapper_function() should do: it takes the weighted sum, which can be any number, and squashes it into the range from 0 to 1.

The other requirement is that the function we’re looking for should work well with gradient descent. Think about the following:

  • We use this function to calculate y^\hat{y}.
  • Then we use y^\hat{y} to calculate the loss.
  • Finally, we descend the loss with gradient descent.

For gradient descent, the wrapper function should be smooth, without flat areas (where the gradient drops to zero) or gaps (where the gradient is not even defined).

Note: To wrap it up, we want a function that changes smoothly across the range from 0 to 1, without ever jumping or flatlining. Something like this:

As it happens, such a function is well known. It’s called the logistic function, and it belongs to a family of S-shaped functions called sigmoids. Since “logistic function” is a mouthful, people usually just call it the sigmoid for short. Here is the sigmoid’s formula:

\sigma(z) = \frac{1}{1 + e^{-z}}

The Greek letter sigma (σ) stands for “sigmoid.” We’ll use the letter z for the sigmoid’s input, to avoid confusion with the system’s inputs x.

The sigmoid formula is hard to grok intuitively, but its picture tells us everything we need to know. When its input is 0, the sigmoid returns 0.5. From there, it quickly and smoothly falls toward 0 for negative inputs and rises toward 1 for positive inputs, but it never quite reaches those two extremes. In other words, the sigmoid squeezes any value into a narrow band ranging from 0 to 1. It doesn’t have any steep cliffs, and it never goes completely flat. That’s the function we need!

Let’s return to the code and apply this newfound knowledge.

Confidence and doubt

First, we take the formula of the sigmoid and convert it to Python, using NumPy’s exp() function to implement the exponential. The result is a one-line function:

Python 3.5
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

As usual with NumPy-based functions, sigmoid() takes advantage of broadcasting. The z argument can be a single number or a multidimensional array. In the second case, the function returns an array that contains the sigmoids of all the elements of z.
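For instance (a quick check, with made-up inputs), we can confirm both the squashing behavior and the broadcasting:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# A single number: the sigmoid of 0 sits exactly at the midpoint
print(sigmoid(0))  # 0.5

# An array: broadcasting applies the sigmoid to every element
z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # every value squashed into the range from 0 to 1
```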

Then we go back to the prediction code, or the point where we calculate the weighted sum. The original function looks like this:

Python 3.5
def predict(X, w):
    return np.matmul(X, w)

We modify that function to pass the result through the sigmoid() function:

Python 3.5
def forward(X, w):
    weighted_sum = np.matmul(X, w)
    return sigmoid(weighted_sum)

Later in this course, we’ll see that this process of moving data through the system is also called forward propagation, so we renamed the predict() function to forward().

The result of forward() is our prediction ŷ, a matrix with the same dimensions as the weighted sum: one row per example, and one column. Each element of the matrix is now constrained to the range from 0 to 1.
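To make the dimensions concrete, here is a quick check (the X and w matrices below are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    weighted_sum = np.matmul(X, w)
    return sigmoid(weighted_sum)

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])    # three examples, two input variables
w = np.array([[0.5],
              [-0.5]])        # one weight per input, in a single column

y_hat = forward(X, w)
print(y_hat.shape)  # (3, 1): one row per example, one column
print(y_hat)        # every element between 0 and 1
```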

Intuitively, we can read the values of ŷ as forecasts that can be more or less certain. If a value is close to the extremes, like 0.01 or 0.98, that’s a highly confident forecast. If it’s close to the middle, like 0.51, that’s a very uncertain forecast.

During the training phase, that gradual variation in confidence is just what we need: we want the loss to change smoothly, so that we can slide over it with gradient descent. Once we switch from the training phase to the classification phase, however, we want the system to get straight to the point. The labels that we use to train the classifier are either 0 or 1, so the classification should also be a straight 0 or 1. To get that unambiguous answer, during the classification phase we can round the result to the nearest integer, like this:

Python 3.5
def classify(X, w):
    return np.round(forward(X, w))

This function could be named either predict() or classify(); in the case of a classifier, the two words are pretty much synonyms. We opted for classify() to highlight the fact that we’re not doing linear regression anymore: we’re now forecasting a binary value.
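A quick sanity check with made-up numbers shows the rounding at work:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    return sigmoid(np.matmul(X, w))

def classify(X, w):
    return np.round(forward(X, w))

X = np.array([[2.0], [-2.0]])  # two examples, one input variable
w = np.array([[1.0]])

print(forward(X, w))   # roughly [[0.88], [0.12]]: confident forecasts
print(classify(X, w))  # [[1.], [0.]]: unambiguous answers
```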

It seems that we are making great progress toward a classification program, except for a minor difficulty that we are about to face.

Smoothing it out

By adding the sigmoid to our program, we introduced a subtle problem: we made gradient descent less reliable. The problem appears when we update the loss() function in our system to use the new forward() function:

Python 3.5
def mse_loss(X, Y, w):
    return np.average((forward(X, w) - Y) ** 2)

At first glance, this function is almost identical to the previous loss(): the mean squared error of the predictions compared with the actual labels. The only difference is that the function that calculates the predicted labels ŷ has changed from predict() to forward().

That change, however, has far-reaching consequences. The forward() function involves the calculation of a sigmoid, and because of that sigmoid, this loss is not the same loss that we had before. Here is what the new loss function looks like:

It looks like we have a problem here. See those deep canyons leading straight into holes? We mentioned those holes when we introduced gradient descent: they are the dreaded local minima. Remember that the goal of gradient descent is to move downhill? Now consider what happens if gradient descent enters a local minimum: since there is no “downhill” at the bottom of a hole, the algorithm stops there, falsely convinced that it has reached the global minimum it was aiming for.

By looking at this diagram, we can conclude that if we use the mean squared error and the sigmoid together, the resulting loss has an uneven surface littered with local minima. Such a surface is hard to navigate with gradient descent. We’d better look for a different loss function with a smoother, more GD-friendly surface.
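A related symptom is easy to check numerically. In this toy example (a single made-up point with a confidently wrong prediction), the sigmoid saturates, so the mean squared error barely changes as the weight moves, and the gradient that gradient descent relies on all but vanishes:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    return sigmoid(np.matmul(X, w))

def mse_loss(X, Y, w):
    return np.average((forward(X, w) - Y) ** 2)

# One example with label 1, but a large negative weight:
# the prediction is confidently wrong (close to 0)
X = np.array([[1.0]])
Y = np.array([[1.0]])
w = np.array([[-20.0]])

# Central-difference approximation of the loss gradient
eps = 1e-3
gradient = (mse_loss(X, Y, w + eps) - mse_loss(X, Y, w - eps)) / (2 * eps)

print(mse_loss(X, Y, w))  # close to 1: the prediction is badly wrong
print(gradient)           # almost 0: the surface is nearly flat anyway
```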

We can find one such function in statistics textbooks. It’s called the log loss because it’s based on logarithms:

L = -\frac{1}{m} \sum_{i=1}^{m} \left( y_i \cdot \log(\hat{y}_i) + (1 - y_i) \cdot \log(1 - \hat{y}_i) \right)

The log loss formula might look daunting, but don’t let it intimidate you. We just need to know that it behaves like a good loss function: the closer the prediction ŷ is to the ground truth y, the lower the loss. Also, the formula looks friendlier once we translate it into code:

Python 3.5
def loss(X, Y, w):
    y_hat = forward(X, w)
    first_term = Y * np.log(y_hat)
    second_term = (1 - Y) * np.log(1 - y_hat)
    return -np.average(first_term + second_term)

If we give it a try, we’ll find that the log loss is simpler than it looks. Remember that each label in the matrix Y is either 0 or 1. For labels that are 0, first_term is multiplied by 0, so it disappears. For labels that are 1, second_term disappears, because it’s multiplied by (1 − Y). So each element of Y contributes only one of the two terms.
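We can watch that term selection happen with a tiny made-up example:

```python
import numpy as np

Y = np.array([1.0, 0.0])      # one positive and one negative label
y_hat = np.array([0.9, 0.2])  # fairly confident predictions

first_term = Y * np.log(y_hat)             # zeroed where the label is 0
second_term = (1 - Y) * np.log(1 - y_hat)  # zeroed where the label is 1

print(first_term)   # [log(0.9), 0]
print(second_term)  # [0, log(0.8)]
print(-np.average(first_term + second_term))  # a small positive loss
```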

Let’s plot the log loss and see what it looks like:

It is nice and smooth! There are no canyons, flat areas, or holes. From now on, this will be our loss function.
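To see all the pieces working together, here is a minimal training sketch. The dataset is made up, and because this lesson hasn’t derived the analytic gradient of the log loss, the sketch approximates the gradient numerically with a central difference:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, w):
    return sigmoid(np.matmul(X, w))

def loss(X, Y, w):
    y_hat = forward(X, w)
    return -np.average(Y * np.log(y_hat) + (1 - Y) * np.log(1 - y_hat))

# Tiny made-up dataset: one input variable, binary labels
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
Y = np.array([[0.0], [0.0], [1.0], [1.0]])

w = np.zeros((1, 1))
lr, eps = 0.1, 1e-6
for _ in range(100):
    # Central-difference approximation of the gradient with respect
    # to the single weight (good enough for this one-weight sketch)
    gradient = (loss(X, Y, w + eps) - loss(X, Y, w - eps)) / (2 * eps)
    w -= lr * gradient

print(loss(X, Y, w))  # much lower than the starting loss of ~0.693
```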