Model Evaluation
Evaluate the quality of the model using a handful of evaluation metrics.
A residual value measures how much a regression line vertically misses a data point. Regression lines are the best fit for a set of data; we can think of them as averages. A few data points will fall exactly on the line, while others will miss it. A residual plot has the residual values on the vertical axis, and the horizontal axis typically displays the independent variable.
Residual scatterplot
The scatterplot below, which plots the residual values against the y_hat predictions, is helpful for checking whether the residuals are scattered randomly.
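A minimal sketch of such a plot, assuming the fitted model lm and the testing split (X_test, y_test) used later in this lesson, plus Matplotlib:

```python
import matplotlib.pyplot as plt

y_hat = lm.predict(X_test)   # model predictions
residuals = y_test - y_hat   # how much the line vertically misses each point

plt.scatter(y_hat, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")  # perfect predictions sit on this line
plt.xlabel("Predicted values (y_hat)")
plt.ylabel("Residuals")
plt.title("Residuals vs. predictions")
plt.show()
```

A random, even band of points around the zero line is what we hope to see; a funnel or curve suggests the problems described next.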
We are interested in the following:
Scedasticity: We want a consistent variance between our low and high predictions (homoscedasticity). If we spot the opposite (heteroscedasticity), our target is not normally distributed. The remedy is to run the target vector through a power transformation, such as Box-Cox or Yeo-Johnson, to make it more Gaussian-like (see the sketch after the note below).
Outliers: If the loss function involves squaring the residuals (for example, MSE, RMSE, or R2), then outliers will have a lot of leverage over the model. We recommend removing the worst offenders from the training data.
In statistics, scedasticity describes the distribution of the error terms: they can either be distributed randomly with constant variance (homoscedasticity) or with some pattern (heteroscedasticity).
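Below is a minimal sketch of both remedies, assuming X_train and y_train are NumPy arrays; PowerTransformer comes from scikit-learn, and the z-score threshold of 3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Make the target more Gaussian-like. Yeo-Johnson handles zero and
# negative values; Box-Cox requires a strictly positive target.
pt = PowerTransformer(method="yeo-johnson")
y_train_gauss = pt.fit_transform(y_train.reshape(-1, 1)).ravel()

# Simple z-score filter: drop training rows whose target is an extreme outlier.
z = np.abs((y_train - y_train.mean()) / y_train.std())
mask = z < 3  # keep points within 3 standard deviations of the mean
X_train_trimmed, y_train_trimmed = X_train[mask], y_train[mask]

# If the model is trained on the transformed target, invert after predicting:
# y_pred = pt.inverse_transform(lm.predict(X_test).reshape(-1, 1)).ravel()
```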
Residual histogram
The residual histogram is also helpful because it shows how much the predicted values differ from the actual values in y_test. We can do the subtraction y_test - predictions for this plot. Before we move on, let's try to understand skewness in distribution plots. The code below generates data and creates the plots for learning.
Note: The mean and median are the same for normal distributions. However, they become different numbers if the distribution is skewed. In a right-skewed (positive) distribution, the mean will be on the right side of the median. In a left-skewed (negative) distribution, the mean will be on the left of the median.
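Here is a minimal sketch of such code, assuming NumPy and Matplotlib; it draws a left-skewed, a normal, and a right-skewed sample, marking the mean and median of each so the note above is easy to verify:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
normal = rng.normal(loc=0, scale=1, size=10_000)       # symmetric
right_skewed = rng.exponential(scale=1, size=10_000)   # long right tail
left_skewed = -rng.exponential(scale=1, size=10_000)   # long left tail

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
samples = [("Left-skewed", left_skewed), ("Normal", normal),
           ("Right-skewed", right_skewed)]
for ax, (title, data) in zip(axes, samples):
    ax.hist(data, bins=50)
    ax.axvline(np.mean(data), color="red", label="mean")
    ax.axvline(np.median(data), color="black", linestyle="--", label="median")
    ax.set_title(title)
    ax.legend()
plt.tight_layout()
plt.show()
```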
For the best-fitted model, the mean and the sum of the residuals are always zero for the training data (X_train, y_train); this is a property of ordinary least squares with an intercept. Let's see how the residual histogram looks for the testing data (X_test, y_test). We don't expect the mean and the sum of the residuals to be zero in this plot.
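A minimal sketch of this histogram, again assuming the fitted model lm and the testing split; it also prints the mean and the sum of the residuals so we can confirm they are close to, but not exactly, zero:

```python
import matplotlib.pyplot as plt

residuals = y_test - lm.predict(X_test)

plt.hist(residuals, bins=30)
plt.xlabel("Residual (y_test - prediction)")
plt.ylabel("Frequency")
plt.title("Residual histogram (testing data)")
plt.show()

print("Mean of residuals:", residuals.mean())
print("Sum of residuals:", residuals.sum())
```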
We have trained our model lm, and the residual plot looks good.
R-squared and accuracy of the fit
Let's see what the accuracy of our model is. We can call the score function on our trained model for this purpose; for scikit-learn regressors, score returns R-squared rather than classification accuracy.
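A quick sketch on the testing split:

```python
# For regressors, score returns R-squared, not classification accuracy.
r_squared = lm.score(X_test, y_test)
print(f"R-squared: {r_squared:.4f}")
```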
Adjusted R2 is always preferred for multivariate (multiple) linear regression, where we have more than one predictor variable.
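scikit-learn does not expose adjusted R2 directly, but it is easy to compute from the regular R2 using the standard correction, where n is the number of samples and p is the number of predictors. A minimal sketch on the testing split:

```python
# Adjusted R2 penalizes R2 for each extra predictor in the model.
n, p = X_test.shape                       # samples, predictors
r2 = lm.score(X_test, y_test)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R-squared: {adj_r2:.4f}")
```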
Alternatively, we can use the r2_score function from sklearn.metrics.
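For example, a sketch assuming the same fitted model lm:

```python
from sklearn.metrics import r2_score

predictions = lm.predict(X_test)
print(f"R-squared: {r2_score(y_test, predictions):.4f}")  # matches lm.score(X_test, y_test)
```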
R-squared, the coefficient of determination, is a regression score function. It is a statistical measure representing the proportion of the variance of the dependent variable that is explained by the independent variable or variables in a regression model. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse).
Note: A constant model that always predicts the expected value of y, disregarding the input features, would get an R-squared score of 0.0. The lesson "R-square and Goodness of the Fit" could be a great read to get more details on R-squared vs. adjusted R-squared.