Splitting the Data: Training and Test Sets

Understand the importance of splitting data into training and test sets to evaluate predictive models. Learn to use scikit-learn's train_test_split function to create these sets, maintain class balance, and simulate real-world model deployment conditions.

We'll cover the following...

Evaluating binary classification with a train/test split
Train/test split in scikit-learn
Try it yourself

In the lesson Introduction: Scikit-Learn and Model Evaluation, we introduced the idea of using a trained model to make predictions on new data that the model had never “seen” before. This turns out to be a foundational concept in predictive modeling. To build a model with predictive capabilities, we need a measure of how well it predicts on data that was not used to fit it. This is because fitting makes the model “specialized” in the relationship between features and response within the specific set of labeled data used for fitting. That is useful, but in the end we want the model to make accurate predictions on new, unseen data, for which we don’t know the true labels. ...
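As a minimal sketch of the idea, the snippet below holds out part of a labeled dataset with scikit-learn's train_test_split and compares accuracy on the training and test sets. The synthetic data from make_classification, the 80/20 split, the random_state value, and the choice of LogisticRegression are all assumptions made for illustration; they are not the course's dataset or settings. Passing stratify=y is one way to keep the class balance similar in both sets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary data (an assumption; stands in for real labeled data)
X, y = make_classification(
    n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=24
)

# Hold out 20% of the samples as a test set the model never sees during fitting.
# stratify=y keeps the positive-class fraction roughly equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=24
)

# Fit on the training set only
model = LogisticRegression()
model.fit(X_train, y_train)

# Accuracy on data used for fitting vs. data held out from fitting
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print("Positive class fraction (train):", y_train.mean())
print("Positive class fraction (test): ", y_test.mean())
print("Training accuracy:", train_accuracy)
print("Test accuracy:    ", test_accuracy)
```

The test accuracy is the number that approximates how the model might perform on new, unseen data; the training accuracy only shows how well the model fits the data it has already seen.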
