Multivariate Linear Regression

Building on your understanding of Regression, this lesson introduces Multivariate Linear Regression.

In Multivariate Linear Regression, we have multiple input or independent features. Based on these features, we predict an output column. Again, let’s use the Tips Dataset.

We will use the following columns from the dataset for Multivariate Analysis.

  • Total_bill: The total bill for the meal.

  • Sex: The sex of the bill payer.

  • Size: The number of people in the dining party.

  • Smoker: Whether the bill payer is a smoker or not.

  • Tip: The tip given on the meal.
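As a sketch of how these columns might be selected with pandas (a small hand-made sample stands in for the full Tips data so the snippet is self-contained; with seaborn installed, `sns.load_dataset("tips")` returns the real dataset):

```python
import pandas as pd

# A few illustrative rows mimicking the Tips schema
# (sns.load_dataset("tips") would load the full dataset).
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68],
    "tip":        [1.01, 1.66, 3.50, 3.31],
    "sex":        ["Female", "Male", "Male", "Male"],
    "smoker":     ["No", "No", "No", "No"],
    "size":       [2, 3, 3, 2],
})

# Keep only the columns used in this lesson.
features = tips[["total_bill", "sex", "size", "smoker"]]
target = tips["tip"]
print(features.shape, target.shape)
```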

Goal of Multivariate Linear Regression: The goal is to predict the “tip”, given all the independent features above. The Regression model constructs an equation to do so.

  • We plot the Scatter plot between the numeric independent variables (total_bill) and numeric output variable (tip) to analyze the relationship.

  • We plot the BoxPlot between the categorical independent variables (sex, size and smoker) and the numeric output variable (tip) to analyze the relationship.

  • You can see that the points in the Scatter plot are mostly scattered along the diagonal.

  • This indicates that there might be some positive correlation between the total_bill and tip. This will be fruitful in modeling.

  • We can see that males tend to give more tips than females.

  • There are some outliers in males who have given exceptional tips as can be seen on the upper whisker above. There is an outlier in females too.

  • We can see that the tip tends to increase with the number of people. It is visible from the upward trend of Box-Plots. So, this will be fruitful in modeling.

  • There are some outliers in the size of two and three.

  • We can see that people who smoke tend to give slightly higher tips.

  • There are many outliers in the people who do not smoke.
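The visual patterns described above can also be checked numerically; here is a minimal pandas sketch (again using a small hand-made sample standing in for the Tips data):

```python
import pandas as pd

# Illustrative rows standing in for the Tips dataset.
tips = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip":        [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
    "sex":        ["Female", "Male", "Male", "Male", "Female", "Male"],
    "smoker":     ["No", "No", "Yes", "No", "No", "Yes"],
    "size":       [2, 3, 3, 2, 4, 4],
})

# Pearson correlation between total_bill and tip
# (the scatter-plot relationship).
corr = tips["total_bill"].corr(tips["tip"])

# Mean tip per category (the box-plot relationships).
by_sex = tips.groupby("sex")["tip"].mean()
by_size = tips.groupby("size")["tip"].mean()
by_smoker = tips.groupby("smoker")["tip"].mean()
print(round(corr, 2))
```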


Multivariate Linear Regression comes up with the following equation in higher dimensions:

$\hat{y} = w_0 x_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$

Here, $x_0 = 1$.

$x = \begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \quad w = \begin{bmatrix} w_0 \\ w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} \quad w^T = \begin{bmatrix} w_0 & w_1 & w_2 & \dots & w_n \end{bmatrix}$

$\hat{y} = w^T x$
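In NumPy, with the convention $x_0 = 1$, this prediction is a single dot product. A sketch with made-up weights and feature values:

```python
import numpy as np

# Hypothetical learned weights: w0 (bias) plus one weight per feature.
w = np.array([0.5, 0.1, 0.2, 0.3, 0.4])   # shape (n+1,)

# One example: x0 = 1 prepended to the n feature values.
x = np.array([1.0, 20.0, 1.0, 3.0, 0.0])  # shape (n+1,)

y_hat = w @ x  # equivalent to w^T x for 1-D arrays
print(y_hat)
```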

Goal: Find values of the parameters $w_0, w_1, w_2, \dots, w_n$ such that the predicted tip ($\hat{y}$) is as close to the actual tip ($y$) as possible. Mathematically, we have to minimize the following function.

$J(w) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^i - y^i)^2$

This time, $\hat{y}^i$ incorporates more than one parameter ($w_0, w_1, \dots$) and more than one feature ($x_0, x_1, \dots$), compared to Univariate Linear Regression. $w$ is a vector of dimensions $(n+1) \times 1$.
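The cost $J(w)$ can be computed for all $m$ examples at once. A minimal NumPy sketch, assuming a design matrix whose first column is all ones ($x_0 = 1$):

```python
import numpy as np

def cost(w, X, y):
    """Cost J(w) = (1/2m) * sum((y_hat - y)^2).

    X has shape (m, n+1) with the first column all ones (x0 = 1),
    w has shape (n+1,), y has shape (m,).
    """
    m = len(y)
    y_hat = X @ w
    return np.sum((y_hat - y) ** 2) / (2 * m)

# Tiny illustrative check: with perfect predictions the cost is zero.
X = np.array([[1.0, 2.0], [1.0, 3.0]])
y = np.array([5.0, 7.0])  # exactly y = 1 + 2 * x1
print(cost(np.array([1.0, 2.0]), X, y))  # 0.0
```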

Gradient Descent

For Multivariate Linear Regression, gradient descent changes as follows:

Repeat until convergence {

$w_j = w_j - \alpha \frac{\partial}{\partial w_j} J(w)$

}

  • Here, $j = 0, 1, 2, 3, \dots$
  • $\frac{\partial}{\partial w_j} J(w) = \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^i - y^i) \cdot x_j^i$

So, the above equation becomes

Repeat until convergence {

$w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m}(\hat{y}^i - y^i) \cdot x_j^i$

}


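Putting the update rule together, here is a minimal batch gradient-descent sketch in NumPy (the synthetic data, learning rate, and iteration count are illustrative choices):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.1, iters=2000):
    """Batch gradient descent for linear regression.

    X: (m, n+1) design matrix with x0 = 1 in the first column.
    Repeats w_j := w_j - alpha * (1/m) * sum((y_hat - y) * x_j) for all j.
    """
    m, n1 = X.shape
    w = np.zeros(n1)
    for _ in range(iters):
        y_hat = X @ w
        grad = (X.T @ (y_hat - y)) / m  # all partial derivatives at once
        w -= alpha * grad
    return w

# Synthetic data generated from known weights [1, 2].
X = np.column_stack([np.ones(50), np.linspace(0, 1, 50)])
y = X @ np.array([1.0, 2.0])
print(gradient_descent(X, y))  # approaches [1., 2.]
```

Note that the gradient for every $j$ is computed in one matrix product, rather than looping over each $w_j$ separately.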

I would like to thank Professor Andrew Ng from Stanford University for providing amazing resources to explain the mathematical foundations of models.
