LR Implementation Steps: 4 to 7

This lesson continues with implementation steps 4 to 7 of linear regression.

4) Remove or modify variables with missing values

Our exploratory data analysis shows that missing values pose a problem for this dataset, especially since a linear regression model generally cannot be fit on rows that contain missing values. Therefore, we need to estimate or remove these values from the data frame.

However, the data frame size will be greatly reduced if you choose to remove all missing values on a row-by-row basis. The variable BuildingArea, for instance, has 21,115 missing rows, which makes up two-thirds of the data frame! To preserve row depth, you can remove this variable entirely, especially as it’s not highly correlated with the dependent variable of Price (0.1).
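As a minimal sketch of this step, assuming the data is held in a pandas data frame named df, you could count the missing values per column and then drop BuildingArea. The toy values below are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data frame; the values are illustrative only.
df = pd.DataFrame({
    "Price": [850000.0, 1035000.0, 1465000.0, 600000.0],
    "Distance": [2.5, 2.5, 3.0, 4.2],
    "BuildingArea": [np.nan, 79.0, np.nan, np.nan],
})

# Count the missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Drop the whole column rather than every row where it is missing.
df = df.drop(columns=["BuildingArea"])
print(df.shape)
```

Dropping the column keeps all four rows of the toy data frame, whereas dropping rows with a missing BuildingArea would have kept only one.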

For the remaining variables, missing values can be removed on a row-by-row basis or filled with the mean value. Based on exploratory data analysis, you can:

  • Use the mean to fill variables with partial correlation to Price (e.g., Car).
  • Remove rows for variables with a small number of missing values (e.g., Distance).
  • Avoid filling values for variables with significant correlation to Price and, instead, remove those missing values row-by-row (e.g., Bathroom).
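The three options above can be sketched as follows, again assuming a pandas data frame named df. The toy values are hypothetical, and the column names are the only part taken from the lesson.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data frame; the values are illustrative only.
df = pd.DataFrame({
    "Price": [850000.0, 1035000.0, 1465000.0, 600000.0, 720000.0],
    "Car": [1.0, np.nan, 2.0, 3.0, 2.0],
    "Distance": [2.5, 2.5, np.nan, 4.2, 5.0],
    "Bathroom": [1.0, 2.0, 1.0, np.nan, 1.0],
})

# Fill missing Car values with the mean of the observed Car values.
df["Car"] = df["Car"].fillna(df["Car"].mean())

# Remove rows where Distance or Bathroom is missing instead of estimating them.
df = df.dropna(subset=["Distance", "Bathroom"])

# Confirm that no missing values remain and check how many rows survived.
print(df.isnull().sum().sum())
print(len(df))
```

In this sketch, the mean is computed before any rows are dropped, so it reflects every observed Car value. Computing it after the row removal is also reasonable and may give a slightly different fill value.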
