Set Up Your Data for Learning
Learn to set up data by defining features and labels, splitting it correctly, and preventing leaks for trustworthy machine learning.
In real-world machine learning projects, up to 80% of our time is spent preparing data: defining what to predict, cleaning and splitting it to ensure fair evaluation, and preventing subtle leaks that can ruin our model’s trustworthiness. While the “Clean It Up!” chapter covered general data-cleaning techniques (e.g., handling missing values, outlier treatment, normalization), this lesson focuses on the ML-specific steps we’ll repeat on every project:
Defining exactly what we’re predicting and which inputs will drive that prediction.
Splitting data so our model is truly tested on unseen examples.
Safeguarding against accidental peeks into test data (data leakage).
With this foundation in mind, let’s clarify the two core components of every supervised ML dataset—labels and features.
Label vs. features
In supervised machine learning, we aim to learn a function that maps features (the inputs) to a label (the output).
A label is the single variable we want to predict, sometimes called the target or response.
Features (also called predictors or covariates) are the inputs, the measurable properties or characteristics that help explain variation in the label.
If we choose the wrong label, our model won’t answer the question we care about. If we pick poor features, the model can’t learn the underlying patterns. Consider a real estate application that estimates home values. Here:
The label is the sale price of each property.
The features might include:
Total square footage
Number of bedrooms
Encoded indicator of neighborhood quality
Each feature contributes information—size suggests space, bedrooms hint at capacity, and neighborhood reflects demand, which together help predict price.
Example
Let’s now code the above example. Before training any model, we must explicitly define which columns serve as inputs and which column is the output. This clarity prevents confusion downstream and maintains a clean workflow.
import pandas as pd

# Build a small housing dataset
df = pd.DataFrame({
    'sqft': [1500, 2000, 1200, 1800],
    'bedrooms': [3, 4, 2, 3],
    'neighborhood_code': [1, 2, 1, 2],
    'price': [300000, 450000, 250000, 400000]
})

# Extract features and label
X = df[['sqft', 'bedrooms', 'neighborhood_code']]
y = df['price']

# Confirm dimensions
print(f"Features X: {X.shape[0]} rows × {X.shape[1]} columns")
print(f"Label y: {y.shape[0]} rows")
Lines 4–9: Construct a DataFrame with four homes.
Lines 12–13: Assign X to the three feature columns and y to the price column.
Lines 16–17: Verify that X has four examples with three predictors each, and that y has four target values.
With our label and features clearly defined, the next crucial step is to divide these data into training and test sets so we can evaluate model performance on truly unseen examples.
Train/test split
To evaluate how well our model generalizes to new, unseen data, we partition our dataset into two subsets:
A training set, which the model learns from.
A test set, which we use only for final evaluation.
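As a minimal sketch of how this looks in practice (assuming scikit-learn, which this lesson hasn’t introduced yet), we can hold out 20% of our housing rows with train_test_split:

from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for final evaluation (an 80/20 split);
# random_state fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} rows")
print(f"Test set: {X_test.shape[0]} rows")

The key discipline is that X_test and y_test are set aside now and touched only once, for the final evaluation.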
Common ratios are 70/30 or 80/20, depending on how much data we can afford to reserve for testing. For example, continuing our housing‐price scenario: if we ...