# Optimizing BCE Loss

Learn how to minimize BCE loss using gradient descent.

## Optimization

Logistic regression learns a parameter vector $\bold{w}$ by minimizing a chosen loss function. The squared loss $L_s(\bold{w})=\sum_{i=1}^n\bigg(y_i-\frac{1}{1+e^{-\bold{w}^T\phi(\bold{x}_i)}}\bigg)^2$ might appear to be a natural choice, but it is not convex in $\bold{w}$, so gradient descent can stall in flat regions or poor local minima. Fortunately, we are free to choose a different loss function that is convex. One such loss is the binary cross-entropy (BCE) loss, denoted $L_{BCE}$, which is convex in $\bold{w}$. The BCE loss is defined as:

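$L_{BCE}(\bold{w})=-\sum_{i=1}^n\big(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\big)$

Here, $\hat y_i=\frac{1}{1+e^{-\bold{w}^T\phi(\bold{x}_i)}}$ is the predicted probability for the $i$-th example, and $y_i \in \{0, 1\}$ is its target label.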
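The convexity claim can be spot-checked numerically. The sketch below is only an illustration under simplifying assumptions: a single example with a scalar feature $\phi(x)=1$ and label $y=1$, so that $\bold{w}$ reduces to a scalar `w`. The helper names (`sigmoid`, `squared_loss`, `bce_loss`, `violates_midpoint_convexity`) are hypothetical and chosen for this example. A convex function never exceeds the average of its endpoint values at the midpoint, so one violation is enough to show non-convexity, while passing the check at one pair of points does not prove convexity.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def squared_loss(w, x=1.0, y=1.0):
    """Squared loss for one example with a scalar feature (assumed setup)."""
    return (y - sigmoid(w * x)) ** 2

def bce_loss(w, x=1.0, y=1.0):
    """BCE loss for one example with a scalar feature (assumed setup)."""
    y_hat = sigmoid(w * x)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def violates_midpoint_convexity(loss, a, b):
    """Return True if loss(midpoint) exceeds the average of loss(a) and loss(b)."""
    mid = 0.5 * (a + b)
    return loss(mid) > 0.5 * (loss(a) + loss(b))

# Compare the two losses on the same pair of parameter values.
print(violates_midpoint_convexity(squared_loss, -4.0, 0.0))  # True
print(violates_midpoint_convexity(bce_loss, -4.0, 0.0))      # False
```

On this slice, the squared loss violates the midpoint inequality between `w = -4` and `w = 0`, whereas the BCE loss satisfies it.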

### Explanation of BCE loss

To see why the BCE loss behaves sensibly, consider a single example with target label $y_i$. If $y_i=1$, only the first term is active and the loss is $-\log(\hat{y}_i)$: it is close to $0$ when $\hat{y}_i \approx 1$ and grows without bound as $\hat{y}_i \to 0$. If $y_i=0$, only the second term is active and the loss is $-\log(1-\hat{y}_i)$: it is close to $0$ when $\hat{y}_i \approx 0$ and grows without bound as $\hat{y}_i \to 1$. In both cases, confident correct predictions cost almost nothing, while confident wrong predictions are penalized heavily.
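As a minimal sketch of the computation for a single example, the snippet below evaluates the BCE loss on the four label and prediction pairs discussed above. The function name `bce_single` and the clipping constant `eps` are assumptions made for this illustration; the clipping only guards against taking $\log(0)$ and is not part of the definition.

```python
import math

def bce_single(y, y_hat, eps=1e-12):
    """BCE loss for one example with label y in {0, 1} and prediction y_hat."""
    # Clip the prediction away from 0 and 1 to avoid log(0) (assumed safeguard).
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Confident correct predictions give a small loss; confident wrong ones a large loss.
for y, y_hat in [(1, 0.99), (1, 0.01), (0, 0.01), (0, 0.99)]:
    print(f"y={y}, y_hat={y_hat:.2f}, loss={bce_single(y, y_hat):.4f}")
```

The two correct cases give a loss of about $0.01$, and the two incorrect cases give about $4.61$, which matches the behavior of $-\log(\cdot)$ near $1$ and near $0$.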
