Set Up Your Data for Learning

Learn to set up data by defining features and labels, splitting it correctly, and preventing leaks for trustworthy machine learning.

In real-world machine learning projects, up to 80% of our time is spent preparing data: defining what to predict, cleaning and splitting it to ensure fair evaluation, and preventing subtle leaks that can undermine our model’s trustworthiness. While the “Clean It Up!” chapter covered general data-cleaning techniques (e.g., handling missing values, outlier treatment, normalization), this lesson focuses on the ML-specific steps we’ll repeat on every project:

  1. Defining exactly what we’re predicting and which inputs will drive that prediction.

  2. Splitting data so our model is truly tested on unseen examples.

  3. Safeguarding against accidental peeks into test data (data leakage).


With this foundation in mind, let’s clarify the two core components of every supervised ML dataset—labels and features.

Label vs. features

In supervised machine learning, we aim to learn a function that maps features (the inputs) to a label (the output).

  • A label is the single variable we want to predict, sometimes called the target or response.

  • Features (also called predictors or covariates) are the inputs, the measurable properties or characteristics that help explain variation in the label.

If we choose the wrong label, our model won’t answer the question we care about. If we pick poor features, the model can’t learn the underlying patterns. Consider a real estate application that estimates home values. Here:

  • The label is the sale price of each property.

  • The features might include:

    • Total square footage

    • Number of bedrooms

    • Encoded indicator of neighborhood quality

Each feature contributes information: size suggests space, bedrooms hint at capacity, and neighborhood quality reflects demand. Together, these signals help predict price.

Example

Let’s now code the above example. Before training any model, we must explicitly define which columns serve as inputs and which column is the output. This clarity prevents confusion downstream and maintains a clean workflow.

import pandas as pd

# Build a small housing dataset
df = pd.DataFrame({
    'sqft': [1500, 2000, 1200, 1800],
    'bedrooms': [3, 4, 2, 3],
    'neighborhood_code': [1, 2, 1, 2],
    'price': [300000, 450000, 250000, 400000]
})

# Extract features and label
X = df[['sqft', 'bedrooms', 'neighborhood_code']]
y = df['price']

# Confirm dimensions
print(f"Features X: {X.shape[0]} rows × {X.shape[1]} columns")
print(f"Label y: {y.shape[0]} rows")
  • Lines 4–9: Construct a DataFrame with four homes.

  • Lines 12–13: Assign X to the three feature columns and y to the price column.

  • Lines 16–17: Verify that X has four examples with three predictors each, and y has four target values.

With our label and features clearly defined, the next crucial step is to divide the data into training and test sets so we can evaluate model performance on truly unseen examples.

Train/test split

To evaluate how well our model generalizes to new, unseen data, we partition our dataset into two subsets:

  • A training set, which the model learns from.

  • A test set, which we use only for final evaluation.
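
In code, this partition is usually a single call. Below is a minimal sketch, assuming scikit-learn is available and reusing the X and y defined earlier; the 80/20 split and fixed random_state are illustrative choices.

from sklearn.model_selection import train_test_split

# Reserve 20% of the rows as a held-out test set;
# random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The model should only ever see the training portion while fitting
print(f"Training examples: {X_train.shape[0]}")
print(f"Test examples: {X_test.shape[0]}")

Notice that we split the raw features and label together, before any fitting or scaling, so nothing about the test rows can influence what the model learns.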

Common ratios are 70/30 or 80/20, depending on how much data we can afford to reserve for testing. For example, continuing our housing‐price scenario: if we ...