Welcome and Data Overview

Learn to load and explore data.

In this chapter, we'll learn the linear regression (supervised machine learning) model hands-on.


Reaching our first machine learning project takes real effort, so this is an exciting milestone. We'll go through the process step by step. In the lessons that follow, we'll apply these steps without explaining each one in detail, but we'll revisit them several times throughout the machine learning section.

Let’s start with a well-known, real-world dataset. Our task is to build a machine learning model that predicts housing prices in the Boston (USA) area. This housing dataset ships with scikit-learn (in releases before 1.2, where load_boston is still available), so we’ll load it from there. Along the way, we’ll learn how to load scikit-learn’s built-in datasets.

Relating the project to a context is always helpful. Let’s create a context.

Context building

Imagine that a real estate company hires us to achieve its business goals. The company wants to predict housing prices in the Boston area. Based on the community and other criteria, some areas are in high demand. The company is interested in an automated way of suggesting a house price based on its features. The given dataset contains features such as the age of the house, number of rooms, crime rate by town, the proportion of residential land, nitric oxide concentration, property tax, and so on.

The value we want to predict is a continuous number (a price), so linear regression is a reasonable first model for this problem. We have the data, so let’s start working on the model. The main columns are described below (MEDV is the target; the others are features):

  • CRIM: Per capita crime rate by town.

  • ZN: Proportion of residential land zoned for lots over 25,000 square feet.

  • INDUS: Proportion of non-retail business acres per town.

  • CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

  • NOX: Nitric oxides concentration (parts per 10 million).

  • RM: Average number of rooms per dwelling.

  • AGE: Proportion of owner-occupied units built before 1940.

  • DIS: Weighted distances to five Boston employment centers.

  • RAD: Index of accessibility to radial highways.

  • TAX: Full-value property tax rate per USD 10,000.

  • PTRATIO: Pupil-teacher ratio by town.

  • MEDV: Median value of owner-occupied homes in $1,000s.

Note: Because we’ll work with more than one feature, this is a multiple linear regression problem. We could also build a model with a single feature, for example, predicting the house price from the number of rooms only. That would be a simple linear regression problem.
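The two cases in the note differ only in the number of features. One standard way to write them is sketched below; the symbols (the predicted price, the coefficients, and the feature count p) are notation assumed here, not taken from the lesson.

```latex
% Simple linear regression: a single feature x (for example, RM)
\[ \hat{y} = \beta_0 + \beta_1 x \]

% Multiple linear regression: p features x_1, \dots, x_p
\[ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p \]
```

Here the hatted y is the predicted price, the first coefficient is the intercept, and each remaining coefficient is the weight the model learns for one feature.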

Let’s import datasets from scikit-learn, load the built-in housing price dataset boston into bh, and check its keys.

# importing the datasets module from sklearn
from sklearn import datasets
# loading the Boston data (load_boston was removed in scikit-learn 1.2,
# so this call needs an older release)
bh = datasets.load_boston()
# displaying the keys of bh
print(bh.keys())

So, bh is a dictionary-like object: data holds the feature values, target is the price, feature_names are the column names, and DESCR is the description of the dataset. We can start by exploring the description of the dataset.

# displaying bh['DESCR'], the description of the dataset
print(bh['DESCR'])
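The same key-based access works for other datasets bundled with scikit-learn. As a minimal sketch, assuming an installation where load_boston is not available, the bundled diabetes dataset can be inspected the same way. It is a stand-in chosen here for illustration and is not the dataset used in this lesson.

```python
from sklearn import datasets

# load_diabetes is a stand-in for illustration; it is not the lesson's dataset
ds = datasets.load_diabetes()

# the returned object is dictionary-like, so keys() lists what it holds
print(list(ds.keys()))

# data holds the feature matrix, target holds the values to predict
print(ds.data.shape, ds.target.shape)

# DESCR is a plain string; show only its first few lines
print("\n".join(ds.DESCR.splitlines()[:5]))
```

The keys data, target, feature_names, and DESCR play the same roles here as they do in bh.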

Let's create a pandas data frame with bh.feature_names as the column names so that each column of bh.data gets its label. We'll also add the target as another column named price.

# importing pandas
import pandas as pd
# creating a data frame using data and feature_names
df = pd.DataFrame(data=bh.data, columns=bh.feature_names)
# adding the price column
df['price'] = bh.target
# displaying the first two rows of data
print(df.head(2))

Let's get some information on the data using info().

# displaying the data info
df.info()
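The non-null counts that info() reports can also be checked directly. The following is a minimal sketch on a small made-up data frame (the name toy and all values are hypothetical, not taken from the Boston data) in which one entry is deliberately missing:

```python
import numpy as np
import pandas as pd

# made-up values for illustration only (not taken from the Boston data)
toy = pd.DataFrame({
    "RM": [6.5, 5.9, np.nan, 7.1],
    "AGE": [65.2, 78.9, 61.1, 45.8],
    "price": [24.0, 21.6, 34.7, 33.4],
})

# count missing entries per column; all zeros would mean nothing is missing
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only if the whole frame has no missing values
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```

Running the same two checks on df is a quick way to confirm what info() shows for the housing data.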

The info() output shows that no column has missing data. The price column (the dependent variable) is our target, and the remaining columns are the features (independent variables). We can use describe() on the data frame object df to get a quick view of basic statistics.

# displaying the basic statistics
print(df.describe())

The output of describe() includes the count, mean, std (standard deviation), min, max, and quartiles of each numeric column, which together give a rough picture of how each feature is distributed.
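One way to read these statistics is to compare the mean with the median (the 50% row). The sketch below uses a short made-up series (the values are hypothetical, with one deliberately large entry) to show how a single high value pulls the mean above the median:

```python
import pandas as pd

# made-up prices for illustration; the last value is a deliberate outlier
prices = pd.Series([20.0, 21.0, 22.0, 23.0, 50.0], name="price")

# describe() returns a Series indexed by the statistic names
stats = prices.describe()
print(stats)

# a mean well above the median (the 50% row) hints at a right-skewed column
mean_above_median = stats["mean"] > stats["50%"]
print(mean_above_median)
```

The same comparison can be made column by column on the output of df.describe().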