
Outliers

Explore methods for identifying outliers in your dataset using box plots, interquartile ranges, and standard deviations. Understand when to keep or remove outliers by considering domain knowledge and data context to ensure accurate data cleaning.

What is an outlier?

Another part of data cleaning is dealing with outliers. First, how do you define an outlier? A good definition usually requires domain knowledge and context about how the data was collected, but a simple way to start is to look at a box plot:

Box Plot of Hours Per Week

The plot above was generated with this command:

bbox = train_df['hoursperweek'].plot(kind="box")

Detection of an outlier

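Here, anything outside the “whiskers” could be considered an outlier. As a refresher, the “whiskers” are the lines sticking out from the box. By default, they extend to the most extreme data points that lie within 1.5 times the interquartile range (IQR) of the box edges, that is, of the first and third quartiles. The ...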
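The whisker rule can also be applied numerically rather than read off the plot. The following is a minimal sketch, not the lesson's own code: the helper name `iqr_outlier_mask`, its parameter `k`, and the sample values are assumptions made up for illustration. It flags values that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the IQR fences.

    Hypothetical helper for illustration; k=1.5 mirrors the usual box plot default.
    """
    q1 = values.quantile(0.25)  # first quartile (bottom edge of the box)
    q3 = values.quantile(0.75)  # third quartile (top edge of the box)
    iqr = q3 - q1               # interquartile range (height of the box)
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return (values < lower_fence) | (values > upper_fence)


# Made-up sample values standing in for a column such as train_df['hoursperweek']
hours = pd.Series([40, 40, 38, 45, 50, 40, 35, 42, 99, 1], name="hoursperweek")

mask = iqr_outlier_mask(hours)
outliers = hours[mask]
print(outliers.tolist())  # [99, 1]
```

With a real dataset, the same mask could be computed on the column itself and then used either to inspect the flagged rows or to filter them out, depending on what domain knowledge says about those values.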