We can kernelize logistic regression just like other linear models by observing that the parameter vector w is a linear combination of the feature vectors Φ(X), that is:
$$
w = \Phi(X)\,a
$$
Here, a is the dual parameter vector, and the loss function now depends on a rather than on w.
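Substituting w = Φ(X)a into the model's pre-activation shows why this kernelizes: the prediction for a point xᵢ depends on the training data only through inner products in feature space, which a kernel function k can supply directly:

$$
z_i = w^T \phi(x_i) = a^T \Phi(X)^T \phi(x_i) = \sum_{j=1}^{n} a_j\, \phi(x_j)^T \phi(x_i) = \sum_{j=1}^{n} a_j\, k(x_j, x_i)
$$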
Minimizing BCE Loss
To minimize the BCE loss, we need to find the dual parameters a that yield the smallest value of the loss function.
The BCE loss is defined as:
$$
\begin{aligned}
L_{\text{BCE}}(a) &= \sum_{i=1}^{n} L_i \\
L_i &= -\left(y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right) \\
\hat{y}_i &= \sigma(z_i) = \frac{1}{1 + e^{-z_i}} \\
z_i &= a^T \Phi(X)^T \phi(x_i)
\end{aligned}
$$
To minimize the BCE loss, we need to compute its gradient with respect to the parameters a as follows:
$$
\begin{aligned}
\frac{\partial L_{\text{BCE}}}{\partial a}
&= \sum_{i=1}^{n} \frac{\partial L_i}{\partial \hat{y}_i}\,\frac{\partial \hat{y}_i}{\partial z_i}\,\frac{\partial z_i}{\partial a} \\
&= -\sum_{i=1}^{n} \left[\frac{y_i}{\hat{y}_i} - \frac{1 - y_i}{1 - \hat{y}_i}\right] \hat{y}_i (1 - \hat{y}_i)\, \Phi(X)^T \phi(x_i) \\
&= -\sum_{i=1}^{n} \left[y_i - \hat{y}_i\right] \Phi(X)^T \phi(x_i) \\
&= \sum_{i=1}^{n} \left[\hat{y}_i - y_i\right] \Phi(X)^T \phi(x_i) \\
&= K(\hat{y} - y)
\end{aligned}
$$
Here, K is the Gram matrix, the square matrix containing the dot products of all pairs of data points after they are mapped to a higher-dimensional feature space:
$$
K = \Phi(X)^T \Phi(X)
$$
and
$$
\Phi(X) = \begin{bmatrix} \phi(x_1) & \phi(x_2) & \dots & \phi(x_n) \end{bmatrix}
$$
Note: The Gram matrix lets us compute dot products in feature space through kernel functions, without ever constructing the feature vectors themselves, which makes the approach computationally efficient.
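To make the pieces above concrete, here is a minimal sketch of the Gram matrix and the gradient K(ŷ − y), assuming an RBF kernel; the helper names rbf_kernel, gram_matrix, and bce_gradient are illustrative assumptions, not part of the derivation.

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=1.0):
    # RBF (Gaussian) kernel: k(x1, x2) = exp(-gamma * ||x1 - x2||^2)
    # (an illustrative choice; any valid kernel works)
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def gram_matrix(X, kernel=rbf_kernel):
    # K[i, j] = k(x_i, x_j) for every pair of training points
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

def bce_gradient(K, a, y):
    # dL/da = K (y_hat - y); note z = K a, since K is symmetric
    z = K @ a
    y_hat = 1.0 / (1.0 + np.exp(-z))  # sigmoid
    return K @ (y_hat - y)
```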
Implementation
The code below implements kernel logistic regression for binary classification:
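One possible version, consistent with the derivation above, is sketched here; the class name KernelLogisticRegression, the RBF kernel choice, and the hyperparameter defaults (gamma, lr, n_iters) are assumptions for illustration rather than a definitive implementation.

```python
import numpy as np

class KernelLogisticRegression:
    """Kernel logistic regression trained by full-batch gradient descent
    on the dual parameters a (a sketch; names and defaults are illustrative)."""

    def __init__(self, gamma=1.0, lr=0.1, n_iters=1000):
        self.gamma = gamma      # RBF kernel width
        self.lr = lr            # learning rate
        self.n_iters = n_iters  # gradient-descent steps

    def _kernel(self, A, B):
        # Pairwise RBF kernel matrix: k(a, b) = exp(-gamma * ||a - b||^2)
        sq = (
            np.sum(A ** 2, axis=1)[:, None]
            + np.sum(B ** 2, axis=1)[None, :]
            - 2 * A @ B.T
        )
        return np.exp(-self.gamma * sq)

    def fit(self, X, y):
        self.X_train = X
        K = self._kernel(X, X)         # Gram matrix
        self.a = np.zeros(X.shape[0])  # dual parameters
        for _ in range(self.n_iters):
            z = K @ self.a             # z_i = a^T Phi(X)^T phi(x_i)
            y_hat = 1.0 / (1.0 + np.exp(-z))
            grad = K @ (y_hat - y)     # dL/da = K (y_hat - y)
            self.a -= self.lr * grad
        return self

    def predict_proba(self, X):
        # Kernel values between new points and the training points
        K_new = self._kernel(X, self.X_train)
        return 1.0 / (1.0 + np.exp(-(K_new @ self.a)))

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)
```

Usage would look like model = KernelLogisticRegression(gamma=0.5).fit(X_train, y_train) followed by model.predict(X_test). Note that each gradient step multiplies by the n×n Gram matrix, so training scales as O(n²) per iteration, which is the usual cost of kernel methods.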