
Getting Data Ready and Building a Machine Learning Model

Explore how to prepare and preprocess datasets, split data, standardize features, and address multicollinearity issues. Understand logistic regression training using Titanic data, including regularization techniques to improve model generalization and avoid overfitting.

Typically, once we have the processed data, we split it into train and test parts using train_test_split().

# Importing required method from sklearn
from sklearn.model_selection import train_test_split
# Hold out 33% of the rows for testing; random_state=42 fixes the shuffle so the split is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
Data split
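To see what the split produces, here is a minimal sketch on made-up data. The arrays X and y below are toy placeholders (100 rows, 2 feature columns), not the Titanic data; only the train_test_split() call mirrors the snippet above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy placeholders (assumption): 100 rows, 2 feature columns, binary target
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# Same call as above: about one third of the rows go to the test part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

# Features and targets stay aligned: each part has matching row counts
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Rerunning with the same random_state gives the same rows in each part, which makes results easier to compare between experiments.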

However, we have separate files for the training and test datasets, so we will work with two separate datasets here. This is common in real-life projects. Every preprocessing step we apply to the train part must also be applied, in the same way, to the test part.
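One way to keep the two parts consistent is to wrap the steps in a single function and call it on both. The sketch below is only an illustration: the preprocess() helper, the toy rows, and the choice of steps (filling missing Age values and encoding Sex) are assumptions, not the lesson's actual pipeline. Note that the fill value is computed from the train part only and then reused for the test part.

```python
import pandas as pd

# Toy stand-ins for the separate train and test files (values are made up)
train = pd.DataFrame({
    "Age": [22.0, None, 35.0],
    "Sex": ["male", "female", "female"],
    "Survived": [0, 1, 1],
})
test = pd.DataFrame({
    "Age": [None, 40.0],
    "Sex": ["female", "male"],
})

def preprocess(df, age_fill):
    """Hypothetical helper: apply identical steps to either part."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(age_fill)               # fill missing ages
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})  # encode as 0/1
    return out

# Learn the fill value from the train part only, then reuse it for test
age_fill = train["Age"].median()
train_clean = preprocess(train, age_fill)
test_clean = preprocess(test, age_fill)

print(test_clean)
```

Computing statistics such as the median on the train part alone avoids leaking information from the test part into the model.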

Train part

Let's separate the features as X_train and the target as y_train. Our target column is Survived; all other columns in train (the entire training dataset) are features.

X_train = train.drop('Survived', axis = 1) # features or variables
y_train = train['Survived'] # target, the values we need to predict
print(X_train.shape, y_train.shape)
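As a quick sanity check of this separation, the sketch below repeats it on a tiny made-up DataFrame. The rows and the Pclass and Age columns are assumptions chosen for illustration; only the Survived column name comes from the lesson.

```python
import pandas as pd

# Tiny made-up stand-in for the train DataFrame
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Survived": [0, 1, 1, 0],
})

X_train = train.drop("Survived", axis=1)  # every column except the target
y_train = train["Survived"]               # the target column only

print(X_train.shape, y_train.shape)  # (4, 2) (4,)

# The target must not leak into the features, and rows must stay aligned
assert "Survived" not in X_train.columns
assert X_train.index.equals(y_train.index)
```

X_train keeps the same number of rows as train but has one column fewer, while y_train is a one-dimensional Series.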

We have separated the features as X_train, and it’s always good to standardize them. Another good ...