The Course Datasets

Get introduced to the course datasets and data basics.

The Adult Census Income dataset

This dataset is used throughout the course for lesson examples. It was extracted from the 1994 US Census Bureau database and was created to explore the following question:

  • What characteristics are associated with income levels?

    • Less than, or equal to, $50,000 USD/year?

    • More than $50,000 USD/year?

Each row of the Adult Census Income dataset represents a US resident, and the columns represent the characteristics of that US resident. Here are some example characteristics of the dataset:

  1. age: Age measured in years.

  2. education: The highest level of education attained.

  3. sex: Gender denoted as female or male.

  4. hours_per_week: The number of hours worked at a job each week.

  5. income: Yearly income denoted as <=50K or >50K.

Lessons throughout the course will use samples from the Adult Census Income dataset. In all cases, the goal is to use characteristics in the dataset (e.g., age) to predict income level (i.e., income).

More information on this dataset is available at the UCI Machine Learning Repository.

Get the Adult Census Income dataset from the UCI ML Repository

The Titanic dataset

This dataset will be used for the interactive coding aspects of the course. The dataset was created to explore the following question:

  • What Titanic passenger characteristics are associated with survival?

Each row of the Titanic dataset represents a passenger, and the columns represent the characteristics of the passenger. The following are the characteristics of the dataset:

  • Survived: Passenger survival. Values: 0 = no and 1 = yes.

  • Pclass: Class of a passenger ticket. Values: 1 = 1st class, 2 = 2nd class, and 3 = 3rd class.

  • Sex: Gender of the passenger. Values: female and male.

  • Age: Passenger age in years.

  • SibSp: Count of the passenger’s siblings/spouses aboard the Titanic.

  • Parch: Count of the passenger’s parents/children aboard the Titanic.

  • Ticket: Passenger’s ticket number.

  • Fare: The amount paid for the passenger’s ticket.

  • Cabin: Passenger’s cabin number.

  • Embarked: Passenger’s port of embarkation. Values: C = Cherbourg, Q = Queenstown, S = Southampton.
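Several of these columns store short codes rather than readable values. The following is a minimal sketch, assuming pandas is available, of how the codes listed above could be translated into their meanings. The three rows are made up for illustration and are not taken from the real dataset; the names titanic_sample and decoded are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only; not real passenger records.
titanic_sample = pd.DataFrame(
    {
        "Survived": [0, 1, 1],
        "Pclass": [3, 1, 2],
        "Embarked": ["S", "C", "Q"],
    }
)

# Code-to-meaning mappings, copied from the column descriptions above.
survived_labels = {0: "no", 1: "yes"}
pclass_labels = {1: "1st class", 2: "2nd class", 3: "3rd class"}
embarked_labels = {"C": "Cherbourg", "Q": "Queenstown", "S": "Southampton"}

# Build a readable copy; the original coded table is left unchanged.
decoded = titanic_sample.assign(
    Survived=titanic_sample["Survived"].map(survived_labels),
    Pclass=titanic_sample["Pclass"].map(pclass_labels),
    Embarked=titanic_sample["Embarked"].map(embarked_labels),
)

print(decoded)
```

A decoded copy like this is mainly useful for reading and plotting; the coded columns are usually what gets passed to a model.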

The Titanic dataset is used in this course for the following reasons:

  • The dataset is widely known.

  • The dataset is not 100 percent clean.

  • There are many opportunities to enrich the dataset (i.e., feature engineering).

  • Crafting a useful machine learning model (i.e., high prediction accuracy) is not easy.

  • For interested students, there is an opportunity to apply learning via the Kaggle Titanic machine learning competition.
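The list above does not say in what way the dataset is unclean. Missing values are one common form, so the sketch below assumes that case and shows how they could be counted per column with pandas. The rows, including the cabin value, are made up for illustration, and the names titanic_sample and missing_per_column are hypothetical.

```python
import pandas as pd

# Hypothetical rows for illustration only; None marks a missing value.
titanic_sample = pd.DataFrame(
    {
        "Age": [22.0, None, 35.0, None],
        "Cabin": [None, "C85", None, None],
        "Fare": [7.25, 71.28, 8.05, 8.46],
    }
)

# isna() flags each missing cell; sum() counts the flags in each column.
missing_per_column = titanic_sample.isna().sum()

print(missing_per_column)
```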

More information on this dataset is available via the Kaggle website:

The Titanic dataset on Kaggle's website

Data basics

The raw materials of machine learning are data; specifically, tables of data. For example, take the following sample of data from the Adult Census Income dataset:

Adult Census Income Sample Data

Age   Education   Sex      Hours Per Week   Income
39    Bachelors   Male     40               <=50K
50    Bachelors   Male     13               <=50K
38    HS-grad     Male     40               <=50K
53    11th        Male     40               <=50K
28    Bachelors   Female   40               <=50K
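In code, a table like this sample is often held in a data frame. The following is a minimal sketch, assuming pandas is available, that types in the five sample rows by hand. The variable name adult_sample is hypothetical, and the column names follow the characteristic names used earlier in this lesson.

```python
import pandas as pd

# The five sample rows from the Adult Census Income dataset, typed in by hand.
adult_sample = pd.DataFrame(
    {
        "age": [39, 50, 38, 53, 28],
        "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
        "sex": ["Male", "Male", "Male", "Male", "Female"],
        "hours_per_week": [40, 13, 40, 40, 40],
        "income": ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    }
)

# shape is (number of rows, number of columns).
print(adult_sample.shape)
print(adult_sample)
```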

When entering the world of machine learning, it’s common to find different names used for the same parts of a data table. Each of the following groups lists terms that are used as synonyms:

  • Table / dataset / data frame / matrix

  • Row / observation / example

  • Column / feature / predictor / independent variable / characteristic

  • Label / prediction / output / dependent variable
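These terms map directly onto code. The following is a minimal sketch, assuming pandas is available, that splits the Adult Census Income sample into features and a label. The names adult_sample, X, y, and first_observation are hypothetical; X and y are a common naming convention, not something this course prescribes.

```python
import pandas as pd

# Table / dataset / data frame: the five sample rows, typed in by hand.
adult_sample = pd.DataFrame(
    {
        "age": [39, 50, 38, 53, 28],
        "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
        "sex": ["Male", "Male", "Male", "Male", "Female"],
        "hours_per_week": [40, 13, 40, 40, 40],
        "income": ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    }
)

label_column = "income"

# Columns / features / predictors: everything except the label.
X = adult_sample.drop(columns=[label_column])

# Label / output / dependent variable: the value to be predicted.
y = adult_sample[label_column]

# Row / observation / example: a single US resident.
first_observation = X.iloc[0]

print(list(X.columns))
print(first_observation)
```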

Machine learning practitioners must also consider the types of data being used. This is similar to data formats and types in technologies like Microsoft Excel and relational databases. The following defines the types of data used in machine learning:

  • Numeric: Data that can be measured (e.g., height, weight, or price).

  • Categorical: Data that can be divided into distinct groups / classes (e.g., US states, Olympic medals, or brands of automobiles).

Numeric data can be further divided into interval and ratio data. Categorical data can be further divided into nominal and ordinal data.

The differences between interval and ratio data will be covered later in the course. The machine learning techniques used in this course do not differentiate between nominal and ordinal data.
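The numeric versus categorical distinction can be checked in code. The following is a minimal sketch, assuming pandas is available, that separates the columns of the Adult Census Income sample by type. The variable names are hypothetical, and treating every non-numeric column as categorical is a simplifying assumption that holds for this sample but not for every dataset.

```python
import pandas as pd

# The five sample rows from the Adult Census Income dataset, typed in by hand.
adult_sample = pd.DataFrame(
    {
        "age": [39, 50, 38, 53, 28],
        "education": ["Bachelors", "Bachelors", "HS-grad", "11th", "Bachelors"],
        "sex": ["Male", "Male", "Male", "Male", "Female"],
        "hours_per_week": [40, 13, 40, 40, 40],
        "income": ["<=50K", "<=50K", "<=50K", "<=50K", "<=50K"],
    }
)

# Numeric: columns holding measured quantities.
numeric_columns = adult_sample.select_dtypes(include="number").columns.tolist()

# Categorical (assumed): every remaining, non-numeric column.
categorical_columns = adult_sample.select_dtypes(exclude="number").columns.tolist()

# Converting a column to the category dtype exposes its distinct groups.
education_groups = adult_sample["education"].astype("category").cat.categories.tolist()

print(numeric_columns)
print(categorical_columns)
print(education_groups)
```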