Data Preprocessing: Training and Testing

Learn how to separate the data into a training set and a testing set, and how to use the testing set to validate the algorithm’s performance.

Training and testing

We’ve talked about the goal of building an algorithm that not only performs well on data it already knows but also predicts the labels of data it has not yet seen. This is why it is essential to separate the data into a training set and a testing set. We use the training set to build our algorithm, and we use the testing set to validate its performance.

Even though Kaggle provides a testing set, we skipped it because it does not include the “Survived” column. Without those labels, we would need to ask Kaggle every time we wanted to validate our algorithm. To keep things simple and do the validation ourselves, it’s more convenient to set aside some rows from the Kaggle training set for testing.

Separating a test set is quite simple. Scikit-learn provides a useful function for that, too: train_test_split.
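As a minimal sketch of how train_test_split behaves, the example below splits ten made-up rows into a training part and a testing part. The toy array, the 20% test share, and the random_state value are assumptions chosen for illustration; they are not taken from the Titanic data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten toy rows with two columns each (made-up values, not the Titanic data)
rows = np.arange(20).reshape(10, 2)

# Hold out 20% of the rows for testing (an assumed share);
# random_state fixes the shuffle so the split is repeatable
train_rows, test_rows = train_test_split(rows, test_size=0.2, random_state=42)

print(train_rows.shape)  # (8, 2)
print(test_rows.shape)   # (2, 2)
```

Every row ends up in exactly one of the two parts, so the testing rows are never seen while building the algorithm.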

Furthermore, we need to separate the input data from the resulting label that we want to predict.
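One way this separation could look is sketched below, assuming the data lives in a pandas DataFrame. Only the “Survived” column name comes from the text; the other columns, all values, the variable names, and the 20% test share are hypothetical and serve only as an illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up table; only the "Survived" column name comes from the text
data = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1, 1, 0],
    "Age": [22, 38, 26, 35, 27, 54, 2, 14, 4, 58],
    "Fare": [7.25, 71.28, 7.92, 53.1, 11.13, 51.86, 21.07, 30.07, 16.7, 26.55],
})

# Input data: every column except the label
input_data = data.drop(columns="Survived")
# Label: the column we want to predict
labels = data["Survived"]

# Passing both objects splits them with the same shuffle,
# so each input row stays aligned with its label
train_input, test_input, train_labels, test_labels = train_test_split(
    input_data, labels, test_size=0.2, random_state=42
)

print(train_input.shape, test_input.shape)
```

Because the inputs and the labels are split in a single call, the row indices of train_input match those of train_labels, and the same holds for the testing parts.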
