
Generalized Linear Regression

Explore generalized linear regression by understanding basis function transformations and the use of regularization in ridge regression. This lesson covers vectorizing the loss function, deriving the closed-form solution, and implementing the model from scratch and with scikit-learn in Python, enabling you to apply these concepts to real datasets effectively.

We’ve previously learned that while standard linear models are powerful, many real-world relationships are non-linear. The generalized linear model (GLM) framework solves this by introducing a basis function $\phi(\mathbf{x})$ that transforms the input features into a higher-dimensional space, allowing a linear model to fit a complex, non-linear curve to the data. In this lesson, we move from conceptual understanding to practical implementation by exploring closed-form solutions for training generalized linear models.
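As a quick illustration, here is a minimal sketch of that idea in Python, assuming a simple polynomial basis on a scalar input (the name `phi` and the example weights are illustrative, not from this lesson's code):

```python
import numpy as np

def phi(x, degree=3):
    """Polynomial basis: map scalar x to [1, x, x^2, ..., x^degree].

    One illustrative choice of basis; any fixed non-linear
    mapping works in the GLM framework.
    """
    return np.array([x**j for j in range(degree + 1)])

# The model is linear in w, yet traces a cubic curve in x.
w = np.array([0.5, -1.0, 0.2, 0.03])  # example parameter vector
x = 2.0
prediction = w @ phi(x)               # f_w(x) = w^T phi(x)
```

Because the non-linearity lives entirely inside $\phi$, everything we know about fitting linear models still applies in the transformed feature space.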

Single target

The input features $\mathbf{x}_i \in \mathbb{R}^d$ are vectors where each data point has $d$ distinct, real-valued features (e.g., size, age). The target variable $y_i \in \mathbb{R}$ is a single, continuous, real-valued number (e.g., house price) that the model aims to predict, making this a single-target regression problem. The model $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})$ is a generalized linear model (GLM). It achieves non-linear modeling by first applying a basis function $\phi(\mathbf{x})$ (the mapping) to transform the input features, and then making the prediction via a linear dot product with the learned parameters $\mathbf{w}$; a small dimension-checking sketch follows the quiz below.

Try this quiz to review what you’ve learned so far.

1. In the context of the function $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})$, if $\mathbf{x} \in \mathbb{R}^d$, $\phi(\mathbf{x}) \in \mathbb{R}^m$, and $\mathbf{w} \in \mathbb{R}^k$, then what is $k$?

A. $k = d$

B. $k = m$

Answer: B. The dot product $\mathbf{w}^T\phi(\mathbf{x})$ is only defined when $\mathbf{w}$ and $\phi(\mathbf{x})$ have the same dimension, so $k = m$.
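This dimension bookkeeping can be verified directly in code. Below is a small sketch (the basis function here is an assumed, illustrative mapping from $d = 2$ inputs to $m = 5$ features, not one defined in this lesson):

```python
import numpy as np

def phi(x):
    """Illustrative basis: maps a d = 2 input to m = 5 features."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1**2])

x = np.array([3.0, 4.0])          # x in R^d, with d = 2
features = phi(x)                 # phi(x) in R^m, with m = 5
w = np.ones(features.shape[0])    # w in R^k; k must equal m
prediction = w @ features         # w^T phi(x): defined only when k = m
```

Constructing `w` with any length other than `features.shape[0]` would make the dot product raise a shape error, which is exactly the constraint the quiz tests.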

The function $f_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})$ successfully defines the structure of our generalized linear model (GLM) for any given input $\mathbf{x}$. However, this model structure is useless until we determine the ideal values for the parameter vector $\mathbf{w}$. These parameters must be chosen so that the model’s predictions best match the true target values in our training dataset $D$.

To quantify how well a given set of parameters $\mathbf{w}$ performs, we use a loss function $L(\mathbf{w})$. This function measures the total error between the model’s predictions and the actual observed values across all $n$ data points. The best-fitting parameters are then the $\mathbf{w}$ that minimizes this loss.

The optimal parameters $\mathbf{w}^*$ can be determined by minimizing a regularized squared loss as follows:

$$\mathbf{w}^* = \arg\min_{\mathbf{w}} \left\{ \sum_{i=1}^{n} \left( \mathbf{w}^T \phi(\mathbf{x}_i) - y_i \right)^2 + \lambda\, \mathbf{w}^T \mathbf{w} \right\}$$

Here, $\sum_{i=1}^{n} (\mathbf{w}^T\phi(\mathbf{x}_i) - y_i)^2$ is the squared error (or data loss) term, and $\lambda\, \mathbf{w}^T\mathbf{w}$ is the L2 regularization term. Their sum is the regularized loss $L(\mathbf{w})$ that we minimize; this objective is exactly ridge regression in the transformed feature space.
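To connect the formula to working code, here is a from-scratch sketch. Stacking the transformed inputs as rows of a design matrix $\Phi \in \mathbb{R}^{n \times m}$, setting the gradient of the loss above to zero gives the standard ridge solution $\mathbf{w}^* = (\Phi^T\Phi + \lambda I)^{-1}\Phi^T\mathbf{y}$. The data below is synthetic and the degree-2 polynomial basis is an assumed, illustrative choice:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Synthetic 1-D data with a non-linear trend plus noise.
x = rng.uniform(-3, 3, size=50)
y = 0.5 * x**2 - x + rng.normal(scale=0.5, size=50)

# Design matrix Phi: row i is phi(x_i) for a degree-2 polynomial basis.
Phi = np.column_stack([np.ones_like(x), x, x**2])  # shape (n, m) = (50, 3)
lam = 1.0  # regularization strength lambda

def loss(w):
    """Regularized squared loss L(w), matching the formula above."""
    residuals = Phi @ w - y
    return residuals @ residuals + lam * (w @ w)

# Closed-form ridge solution: w* = (Phi^T Phi + lam * I)^{-1} Phi^T y.
m = Phi.shape[1]
w_star = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)

# Cross-check against scikit-learn's Ridge (alpha plays the role of lambda;
# fit_intercept=False since the basis already includes the constant column).
ridge = Ridge(alpha=lam, fit_intercept=False).fit(Phi, y)
print(np.allclose(w_star, ridge.coef_))  # expect: True
```

Solving the linear system with `np.linalg.solve` rather than forming an explicit matrix inverse is both faster and more numerically stable, and the scikit-learn cross-check confirms the closed form on this data.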