Splitting the Data: Training and Test Sets
Learn how to split the data for the model evaluation using scikit-learn.
We'll cover the following...
In the lesson Introduction: Scikit-Learn and Model Evaluation, we introduced the concept of using a trained model to make predictions on new data that the model had never “seen” before. It turns out this is a foundational concept in predictive modeling. In our quest to create a model that has predictive capabilities, we need some kind of measure of how well the model can make predictions on data that was not used to fit the model. This is because in fitting a model, the model becomes “specialized” at learning the relationship between features and response on the specific set of labeled data that were used for fitting. While this is nice, in the end we want to be able to use the model to make accurate predictions on new, unseen data, for which we don’t know the true value of the labels.
Evaluating binary classification with a train/test split
In our case study, once we deliver the trained model to our client, they will then generate a new dataset of features like those we have now, except instead of spanning the period from April to September, they will span from May to October. And our client will be using the model with these features, to predict whether accounts will default in November.
In order to know how well we can expect our model to predict which accounts will actually default in ...