Optimizing BCE Loss

Learn how to minimize BCE loss using gradient descent.


Logistic regression aims to learn a parameter vector $\bold{w}$ by minimizing a chosen loss function. While the squared loss $L_s(\bold{w})=\sum_{i=1}^n\left(y_i-\frac{1}{1+e^{-\bold{w}^T\phi(\bold{x}_i)}}\right)^2$ might appear to be a natural choice, it's not convex. Fortunately, we have the flexibility to consider alternative loss functions that are convex. One such loss function is the binary cross-entropy (BCE) loss, denoted as $L_{BCE}$, which is convex. The BCE loss is defined as:

$$L_{BCE}(\bold{w})=-\sum_{i=1}^n\Big(y_i\log(\hat y_i)+(1-y_i)\log(1-\hat y_i)\Big)$$

where $\hat y_i=\frac{1}{1+e^{-\bold{w}^T\phi(\bold{x}_i)}}$ is the model's prediction for the $i$-th example.
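To make the definition concrete, here is a minimal NumPy sketch that computes $L_{BCE}$ for a parameter vector $\bold{w}$. The feature map $\phi$ is taken to be the identity, and the small dataset is invented purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, X, y, eps=1e-12):
    # y_hat_i = sigmoid(w^T phi(x_i)); phi is the identity here
    y_hat = sigmoid(X @ w)
    # Clip predictions away from 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Invented toy data: 4 examples with 2 features each
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([0.5, 0.5])
print(bce_loss(w, X, y))
```

Because the loss is convex in $\bold{w}$, gradient descent on `bce_loss` is guaranteed to converge toward a global minimum.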

Explanation of BCE loss

Let’s delve into the explanation of the BCE loss. For a single example in a dataset with a target label $y_i$, if $y_i=1$ and the prediction $\hat{y}_i \approx 1$, the loss $-\log(\hat{y}_i) \approx 0$. Conversely, if $\hat{y}_i \approx 0$, the loss $-\log(\hat{y}_i)$ becomes significantly large. Similarly, we can evaluate the pairs $(y_i=0,\ \hat{y}_i \approx 0)$ and $(y_i=0,\ \hat{y}_i \approx 1)$. The code snippet provided below illustrates the computation of the BCE loss for a single example:
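The following minimal sketch (the probability values are invented for illustration) computes the BCE loss for a single example and mirrors the four cases discussed above:

```python
import math

def bce_single(y, y_hat, eps=1e-12):
    # Binary cross-entropy for one example; clip to avoid log(0)
    y_hat = min(max(y_hat, eps), 1 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# y = 1: a confident correct prediction gives a small loss,
# while a confident wrong prediction gives a large loss
print(bce_single(1, 0.99))  # small loss
print(bce_single(1, 0.01))  # large loss

# y = 0: the same pattern holds
print(bce_single(0, 0.01))  # small loss
print(bce_single(0, 0.99))  # large loss
```

Note the clipping of `y_hat`: without it, a prediction of exactly 0 or 1 would make `math.log` diverge or raise an error.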
