Outliers

Explore how to identify outliers in data sets using statistical techniques and visual tools in Python. Learn to apply the interquartile range method and use boxplots and line plots to detect and understand outliers. This lesson provides skills to handle data anomalies that can skew analysis, ensuring more accurate statistical results and model performance.

Defining outliers

An outlier is a record that deviates markedly from the rest of the data in a dataset. Outliers can result from mistakes in data entry or measurement, or they can be legitimate but unusual records. Because outliers can significantly affect the results of data analysis, it's common practice to identify and examine them before any further analysis is done.

An outlier in a dataset

Examples of outliers in a dataset include:

  • In a dataset of patient records, an outlier may be a patient with a much shorter or longer hospital stay than other patients.

  • In a dataset of student grades, an outlier may be a student with a much lower or higher grade than the rest of the class.
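One common way to flag records like these is the interquartile range (IQR) method that this lesson names. The sketch below is a minimal illustration rather than a definitive implementation: the hospital-stay values are made up, the helper name `iqr_outliers` is hypothetical, and the 1.5 × IQR multiplier is the conventional default, not a value taken from this lesson. It uses only Python's standard library.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR].

    `k=1.5` is the conventional multiplier; it is an assumption here.
    """
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical hospital stays in days; one stay is much longer than the rest.
stays = [3, 4, 4, 5, 5, 6, 6, 7, 8, 45]
print(iqr_outliers(stays))
```

A flagged value is only a candidate for review. Whether the 45-day stay is an error or a legitimate but unusual record still has to be judged from context.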

Why deal with outliers

Here are reasons why it's crucial to deal with outliers in a dataset:

  • Outliers in a dataset can distort summary statistics, especially the mean and measures of variability (e.g., standard deviation); the median is much less sensitive to them. As an example, consider a group of students' heights. If the dataset includes a student much taller or shorter than the rest of the class, the mean height of the class shifts noticeably, particularly in a small class, while the median barely moves.

  • Outliers can affect the fit of statistical models, such as linear regression. For example, if a model is trying to predict house prices based on square footage, and one house has a much higher price than its size would suggest, that single record can pull the fitted line toward itself and cause the model to fit the other houses poorly. ...
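The two effects described above can be sketched with a few lines of standard-library Python. All numbers are invented for illustration, and `fit_line` is a hypothetical helper that computes an ordinary least-squares line for a single predictor; a real analysis would more likely use a statistics or machine learning library.

```python
import statistics

# Hypothetical class heights in centimetres (illustrative values only).
heights = [160, 162, 165, 167, 170]
heights_with_outlier = heights + [210]  # one unusually tall student

for label, data in [("without outlier", heights),
                    ("with outlier", heights_with_outlier)]:
    print(
        label,
        "| mean:", round(statistics.mean(data), 1),
        "| median:", statistics.median(data),
        "| stdev:", round(statistics.stdev(data), 1),
    )


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical houses: size in square feet, price in thousands.
sizes = [1000, 1500, 2000, 2500, 3000]
prices = [200, 300, 400, 500, 600]

clean_slope, _ = fit_line(sizes, prices)
# Add one small house with a price far above what its size suggests.
skewed_slope, _ = fit_line(sizes + [1200], prices + [900])

print("slope without outlier:", round(clean_slope, 3))
print("slope with outlier:", round(skewed_slope, 3))
```

In this toy data, a single extra record moves the mean height by several centimetres while the median changes by one, and it flattens the fitted price-per-square-foot slope considerably.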