Welcome and Data Overview

Learn to load and explore data.

We'll cover the following...

Welcome
Data overview
- Context building

Context building

Imagine that a real estate company hires us to help it predict housing prices in the Boston area. Some areas are in higher demand than others, depending on the community and other criteria. The company wants an automated way to suggest a house price based on the house's features. The given dataset contains features such as the age of the house, the number of rooms, the crime rate by town, the proportion of residential land, the nitric oxides concentration, the property tax rate, and so on.

The value we want to predict, the median house price, is a continuous quantity, so linear regression is a reasonable first model for this problem. We have the data, so let’s start working on the model. The columns we'll work with are described below:

CRIM: Per capita crime rate by town.
ZN: Proportion of residential land zoned for lots over 25,000 square feet.
INDUS: Proportion of non-retail business acres per town.
CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
NOX: Nitric oxides concentration (parts per 10 million).
RM: Average number of rooms per dwelling.
AGE: Proportion of owner-occupied units built before 1940.
DIS: Weighted distances to five Boston employment centers.
RAD: Index of accessibility to radial highways.
TAX: Full-value property tax rate per USD 10,000.
PTRATIO: Pupil-teacher ratio by town.
MEDV: Median value of owner-occupied homes in USD 1,000s (the target we want to predict).

Note: Because we’ll work with more than one variable or feature, this is a multiple linear regression problem. We could instead build a model with a single feature, for example, predicting the house price from the number of rooms only. That would be a simple linear regression problem.
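To make the distinction concrete, here is a minimal sketch of a simple linear regression fitted with the closed-form least-squares formulas in plain Python. The room counts and prices are made-up toy values, not taken from the Boston dataset, and fit_simple_linear_regression is a hypothetical helper name. A multiple linear regression follows the same idea but learns one coefficient per feature.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-deviations and squared deviations (the 1/n factors cancel in the ratio).
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Toy data, assumed for illustration only: number of rooms and price in USD 1,000s.
rooms = [4, 5, 6, 7, 8]
prices = [10, 15, 20, 25, 30]

slope, intercept = fit_simple_linear_regression(rooms, prices)
predicted = slope * 6.5 + intercept  # suggested price for a house with 6.5 rooms
print(slope, intercept, predicted)
```

Because the toy prices lie exactly on a straight line, the fit recovers a slope of 5 and an intercept of -10. Real housing data is noisy, so the fitted line would only approximate the observed prices.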

Let’s import datasets from scikit-learn, load the built-in housing price dataset boston into bh, and check its keys.
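A minimal sketch of this step is shown below. It assumes a scikit-learn version that still ships load_boston; the loader was deprecated in scikit-learn 1.0 and removed in 1.2. The guard that sets bh to None when the loader is missing is an assumption added for this sketch, not part of the lesson.

```python
try:
    from sklearn import datasets

    # Load the built-in Boston housing dataset into bh (a dictionary-like Bunch).
    bh = datasets.load_boston()
except (ImportError, AttributeError):
    # Assumed fallback: newer scikit-learn releases no longer provide load_boston.
    bh = None

if bh is not None:
    # Inspect which pieces the dataset object exposes.
    print(bh.keys())
else:
    print("load_boston is not available in this scikit-learn installation")
```

When the dataset loads, bh.data holds the feature matrix, bh.target holds the MEDV values, and bh.feature_names lists the column names.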
