Data Preprocessing

Explore essential data preprocessing steps for logistic regression on the Titanic dataset. Understand how to handle missing age values by class-based imputation, drop irrelevant columns, and convert categorical data into numeric formats using dummy variables. This lesson equips you with practical skills to prepare real-world data for accurate machine learning modeling.

So, we know from EDA that some data is missing in our dataset. Let's deal with that first.

Data cleaning

The Age column is missing ~19.9% of its data. A simple way to fix this is to fill the missing values with the mean age of all passengers. We can do better in this case because we know there are three passenger classes: each missing age can be filled with the average age of that passenger's class. Let's use a boxplot() to visually explore whether there is any relationship between class and passenger age.

Python 3.8
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(14, 7))  # figure size; it's subjective
sns.boxplot(x='Age', y='Pclass', data=train, palette='rainbow', orient='h');  # train: the DataFrame loaded earlier

Yes, Pclass and Age are related, and this makes sense: passengers in the higher classes tend to be older. Filling the missing Age values according to passenger class is therefore a better approach than using a single overall mean. We can write a function and use the apply() method from pandas for this task. However, before writing a function, we may want to know each ...
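To make the class-based fill concrete, here is a minimal sketch of one way it could look. It assumes the column names Pclass and Age used in this lesson. The small toy DataFrame, its values, and the function name impute_age are made up for illustration and are not the lesson's own code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training data (values are invented for illustration).
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "Age": [40.0, 36.0, np.nan, 30.0, np.nan, 28.0, 22.0, 26.0, np.nan],
})

# Percentage of rows with a missing age.
missing_pct = toy["Age"].isna().mean() * 100

# Mean age per passenger class; missing values are skipped by default.
class_means = toy.groupby("Pclass")["Age"].mean()

def impute_age(row):
    """Return the row's age, or the mean age of its class when the age is missing."""
    if pd.isna(row["Age"]):
        # Row values may be upcast to float, so cast the class label back to int.
        return class_means.loc[int(row["Pclass"])]
    return row["Age"]

# Apply the function row by row (axis=1) and overwrite the Age column.
toy["Age"] = toy.apply(impute_age, axis=1)

print(f"Missing before imputation: {missing_pct:.1f}%")
print(class_means)
print(toy)
```

On a large dataset, a vectorized form such as toy["Age"].fillna(toy.groupby("Pclass")["Age"].transform("mean")) should give the same result without a row-wise apply(); the function version is shown here because it mirrors the approach described above.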