Getting Data Ready and Building a Machine Learning Model
Explore how to prepare and preprocess datasets, split data, standardize features, and address multicollinearity issues. Understand logistic regression training using Titanic data, including regularization techniques to improve model generalization and avoid overfitting.
Typically, once we have the processed data, we split it into train and test parts using train_test_split().
```python
# Import the required method from sklearn
from sklearn.model_selection import train_test_split

# Hold out 33% of the data for testing and fix the random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
```
However, here we have separate files for the training and test datasets, which is often how data arrives in real-life projects. We must perform all the same preprocessing on the test data that we perform on the training data, as sketched below.
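As a minimal sketch, assuming the two datasets live in hypothetical files named train.csv and test.csv (the file and column names here are illustrative, not the lesson's exact ones), loading them separately and applying the same preprocessing to both might look like this:

```python
import pandas as pd

# Load the training and test datasets from separate files
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Any preprocessing applied to train must also be applied to test,
# e.g., filling missing ages using a statistic learned from the training data
age_median = train["age"].median()
train["age"] = train["age"].fillna(age_median)
test["age"] = test["age"].fillna(age_median)
```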
Train part
Let's separate the data features as X_train and the target as y_train. Our target column is survived, while all other columns in train (the entire training dataset) are features.
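A minimal sketch of this split, assuming the training data is loaded in a DataFrame named train with a survived column, could look like this:

```python
# Separate the features and the target from the training DataFrame
X_train = train.drop("survived", axis=1)  # all columns except the target
y_train = train["survived"]               # the target column
```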
We have separated the features as X_train, and it’s always good to standardize them. Another good ...
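To illustrate the standardization step just mentioned, here is a minimal sketch using scikit-learn's StandardScaler. Fitting the scaler on the training features only and reusing it for the test features is an assumption about the workflow, not necessarily the lesson's exact code:

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features only,
# then reuse the same fitted scaler to transform the test features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```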