Optimizing BCE Loss

Learn how to minimize BCE loss using gradient descent.


Logistic regression aims to learn a parameter vector $\bold{w}$ by minimizing a chosen loss function. While the squared loss $L_s(\bold{w})=\sum_{i=1}^n\left(y_i-\frac{1}{1+e^{-\bold{w}^T\phi(\bold{x}_i)}}\right)^2$ might appear to be a natural choice, it's not convex. Fortunately, we have the flexibility to consider alternative loss functions that are convex. One such loss function is the binary cross-entropy (BCE) loss, denoted as $L_{BCE}$, which is convex. The BCE loss is defined as:

$$L_{BCE}(\bold{w})=-\sum_{i=1}^n\Big(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\Big)$$

where $\hat y_i=\frac{1}{1+e^{-\bold{w}^T\phi(\bold{x}_i)}}$ is the model's prediction for the $i$-th example.
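To make the definition concrete, here is a minimal NumPy sketch that computes $L_{BCE}$ for a parameter vector $\bold{w}$. The feature map $\phi$ is taken to be the identity, and the small dataset is invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, X, y, eps=1e-12):
    # y_hat_i = sigmoid(w^T phi(x_i)); phi is the identity here
    y_hat = sigmoid(X @ w)
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Invented toy data: 4 examples with 2 features each
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([0.5, 0.5])
print(bce_loss(w, X, y))
```

Because the loss is convex in $\bold{w}$, gradient descent on `bce_loss` is guaranteed to converge toward a global minimum.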

Explanation of BCE loss

Let’s delve into the explanation of the BCE loss. For a single example in a dataset with a target label $y_i$, if $y_i=1$ and the prediction $\hat{y}_i \approx 1$, the loss $-\log(\hat{y}_i) \approx 0$. Conversely, if $\hat{y}_i \approx 0$, the loss $-\log(\hat{y}_i)$ becomes significantly large. Similarly, we can evaluate the pairs $(y_i=0,\ \hat{y}_i \approx 0)$ and $(y_i=0,\ \hat{y}_i \approx 1)$. The code snippet provided below illustrates the computation of the BCE loss for a single example:
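The following minimal sketch (the probability values are invented for illustration) computes the BCE loss for a single example and mirrors the four cases discussed above:

```python
import math

def bce_single(y, y_hat, eps=1e-12):
    # Binary cross-entropy for one example; clip to avoid log(0)
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# y = 1: a confident correct prediction gives a small loss,
# while a confident wrong prediction gives a large loss
print(bce_single(1, 0.99))  # small loss
print(bce_single(1, 0.01))  # large loss

# y = 0: the same pattern holds
print(bce_single(0, 0.01))  # small loss
print(bce_single(0, 0.99))  # large loss
```

Note the clipping of `y_hat`: without it, a prediction of exactly 0 or 1 would make `math.log` diverge or raise an error.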
