Splitting the Data: Training and Test Sets
Understand the importance of splitting data into training and test sets to evaluate predictive models. Learn to use scikit-learn's train_test_split function to create these sets, maintain class balance, and simulate real-world model deployment conditions.
In the lesson Introduction: Scikit-Learn and Model Evaluation, we introduced the concept of using a trained model to make predictions on new data that the model had never “seen” before. This turns out to be a foundational concept in predictive modeling. To create a model with predictive capabilities, we need a measure of how well it can make predictions on data that was not used to fit it. This is because fitting makes a model “specialized” in the relationship between features and response within the specific set of labeled data used for fitting. While this is nice, in the end we want to use the model to make accurate predictions on new, unseen data, for which we don’t know the true value of the labels. ...
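As a rough illustration of the idea, here is a minimal sketch of holding out a test set with scikit-learn's train_test_split. The data are synthetic and invented purely for this example, and the 80/20 split, the random_state value, and the variable names are assumptions rather than choices taken from the lesson. Passing stratify=y asks the function to keep the class fractions approximately equal in both sets, which is one way to maintain class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only (not from the lesson):
# 100 samples, 2 features, and an imbalanced binary response (20% positive).
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(100, 2))
y = np.array([1] * 20 + [0] * 80)

# Hold out 20% of the samples as a test set the model never sees during fitting.
# stratify=y preserves the class fractions in both the training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=24, stratify=y
)

# Sizes of the two sets, and the positive-class fraction in each.
print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

A model would then be fit on X_train and y_train only, and its predictions on X_test compared against y_test to estimate how it might perform on new, unseen data.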