Model Evaluation
Evaluate the quality of the model using a handful of evaluation metrics.
A residual value measures how much a regression line vertically misses a data point. Regression lines are the best fit for a set of data; we can think of them as averages. A few data points will fall exactly on the line, while others will miss it. A residual plot has the residual values on the vertical axis, and the horizontal axis typically displays the independent variable.
Residual scatterplot
The scatterplot below, which plots the residual values against the y_hat predictions, is helpful for checking whether the residuals are scattered randomly.
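A minimal sketch of such a plot, assuming the fitted model lm and the testing split (X_test, y_test) used later in this lesson, plus Matplotlib:

```python
import matplotlib.pyplot as plt

y_hat = lm.predict(X_test)   # model predictions
residuals = y_test - y_hat   # how much the line vertically misses each point

plt.scatter(y_hat, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")  # perfect predictions sit on this line
plt.xlabel("Predicted values (y_hat)")
plt.ylabel("Residuals")
plt.title("Residuals vs. predictions")
plt.show()
```

A random, even band of points around the zero line is what we hope to see; a funnel or curve suggests the problems described next.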
We are interested in the following:
Scedasticity: We want a consistent variance between our low and high predictions (homoscedasticity). If we spot the opposite (heteroscedasticity), our target is not normally distributed. The remedy is to run the target vector through a power transformation, such as Box-Cox or Yeo-Johnson, to make it more Gaussian-like (see the sketch after the note below).
Outliers: If the loss function involves squaring the residuals (for example, MSE, RMSE, or R2), then outliers will have a lot of leverage over the model. We recommend removing the worst offenders from the training data.
In statistics, scedasticity describes the distribution of the error terms: they can either be distributed randomly with constant variance (homoscedasticity) or with some pattern (heteroscedasticity).
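Below is a minimal sketch of both remedies, assuming X_train and y_train are NumPy arrays; PowerTransformer comes from scikit-learn, and the z-score threshold of 3 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Make the target more Gaussian-like. Yeo-Johnson handles zero and
# negative values; Box-Cox requires a strictly positive target.
pt = PowerTransformer(method="yeo-johnson")
y_train_gauss = pt.fit_transform(y_train.reshape(-1, 1)).ravel()

# Simple z-score filter: drop training rows whose target is an extreme outlier.
z = np.abs((y_train - y_train.mean()) / y_train.std())
mask = z < 3  # keep points within 3 standard deviations of the mean
X_train_trimmed, y_train_trimmed = X_train[mask], y_train[mask]

# If the model is trained on the transformed target, invert after predicting:
# y_pred = pt.inverse_transform(lm.predict(X_test).reshape(-1, 1)).ravel()
```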
Residual histogram
The residual histogram is also helpful because it shows how much the predicted values differ from the actual values in y_test. We can do the subtraction y_test - predictions for this plot. Before we move on, let's try to understand skewness in distribution plots. The code below generates data and creates the plots for learning.
Note: The mean and median are the same for normal distributions. However, they become different numbers if the distribution is skewed. In a right-skewed (positive) distribution, the mean will be on the right side of the median. In a left-skewed (negative) distribution, the mean will be on the left of the median.
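Here is a minimal sketch of such code, assuming NumPy and Matplotlib; it draws a left-skewed, a normal, and a right-skewed sample, marking the mean and median of each so the note above is easy to verify:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
normal = rng.normal(loc=0, scale=1, size=10_000)       # symmetric
right_skewed = rng.exponential(scale=1, size=10_000)   # long right tail
left_skewed = -rng.exponential(scale=1, size=10_000)   # long left tail

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
samples = [("Left-skewed", left_skewed), ("Normal", normal),
           ("Right-skewed", right_skewed)]
for ax, (title, data) in zip(axes, samples):
    ax.hist(data, bins=50)
    ax.axvline(np.mean(data), color="red", label="mean")
    ax.axvline(np.median(data), color="black", linestyle="--", label="median")
    ax.set_title(title)
    ax.legend()
plt.tight_layout()
plt.show()
```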
For the best-fitted model, the mean and the sum of the residuals are always zero for the training data (X_train, y_train); this is a property of ordinary least squares with an intercept. Let's see how the residual histogram looks for the testing data (X_test, y_test). We don't expect the mean and the sum of the residuals to be zero in this plot.
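A minimal sketch of this histogram, again assuming the fitted model lm and the testing split; it also prints the mean and the sum of the residuals so we can confirm they are close to, but not exactly, zero:

```python
import matplotlib.pyplot as plt

residuals = y_test - lm.predict(X_test)

plt.hist(residuals, bins=30)
plt.xlabel("Residual (y_test - prediction)")
plt.ylabel("Frequency")
plt.title("Residual histogram (testing data)")
plt.show()

print("Mean of residuals:", residuals.mean())
print("Sum of residuals:", residuals.sum())
```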
We have trained our model lm, and the residual plot looks good.
R-squared and accuracy of the fit
Let's see what the accuracy of our model is. We can call the score function on our trained model for this purpose; for scikit-learn regressors, score returns R-squared rather than classification accuracy.
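A quick sketch on the testing split:

```python
# For regressors, score returns R-squared, not classification accuracy.
r_squared = lm.score(X_test, y_test)
print(f"R-squared: {r_squared:.4f}")
```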
Adjusted R2 is always preferred for multivariate (multiple) linear regression, where we have more than one predictor variable.
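scikit-learn does not expose adjusted R2 directly, but it is easy to compute from the regular R2 using the standard correction, where n is the number of samples and p is the number of predictors. A minimal sketch on the testing split:

```python
# Adjusted R2 penalizes R2 for each extra predictor in the model.
n, p = X_test.shape                       # samples, predictors
r2 = lm.score(X_test, y_test)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(f"Adjusted R-squared: {adj_r2:.4f}")
```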
Alternatively, we can use the r2_score function from sklearn.metrics.
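For example, a sketch assuming the same fitted model lm:

```python
from sklearn.metrics import r2_score

predictions = lm.predict(X_test)
print(f"R-squared: {r2_score(y_test, predictions):.4f}")  # matches lm.score(X_test, y_test)
```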
R-squared, the coefficient of determination, is a regression score function. It is a statistical measure representing the proportion of the variance of the dependent variable that is explained by the independent variable or variables in a regression model. The best possible score is 1.0, and it can be negative (because the model can be arbitrarily worse).
Note: A constant model that always predicts the expected value of y, disregarding the input features, would get an R-squared score of 0.0. The lesson "R-square and Goodness of the Fit" could be a great read to get more details on R-squared vs. adjusted R-squared.