Visualizing Outliers

Explore techniques to identify outliers in datasets through histograms, box plots, and scatter plots. Understand how to analyze their trends and relationships within data, and learn strategies for deciding whether to remove, transform, or keep outliers to create accurate and insightful data narratives.

Outliers are data points that differ notably from the main body of samples in a dataset, and they appear in many real-world datasets. The plot below shows an example: the outliers are the data points in the 1000–1055 range, far from the other data points in the 0–500 range.

Python 3.10.4
import numpy as np
from matplotlib import pyplot as plt

# Seed NumPy's random number generator for reproducibility
np.random.seed(42)
# Generate 100 random values uniformly distributed between 0 and 500
data = np.random.uniform(0, 500, 100)
# Append four unusually large values to act as outliers
data = np.append(data, [1000, 1025, 1030, 1055])
plt.hist(data, bins=5)
plt.title('Random Data')
plt.xlabel('Sample Variable')
plt.ylabel('Frequency')
plt.savefig('output/to.png')
# Close the current figure to free memory
plt.close()
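
A histogram tends to expose outliers because they land in bins that are separated from the main body of the data by empty bins. The sketch below mimics equal-width binning with five bins (as with `bins=5`) using only the standard library. It regenerates similar data with Python's `random` module and an arbitrary seed, so the exact counts are an assumption and will not match the NumPy example.

```python
import random

# Assumed stand-in for the data above: 100 uniform values in [0, 500]
# plus the same four large values. The seed is arbitrary.
rng = random.Random(42)
data = [rng.uniform(0, 500) for _ in range(100)] + [1000, 1025, 1030, 1055]

n_bins = 5
lo, hi = min(data), max(data)
width = (hi - lo) / n_bins

# Count how many values fall into each equal-width bin.
counts = [0] * n_bins
for x in data:
    # The maximum value sits on the top edge, so clamp it into the last bin.
    idx = min(int((x - lo) / width), n_bins - 1)
    counts[idx] += 1

# Print a small text histogram: one row per bin.
for i, count in enumerate(counts):
    left = lo + i * width
    right = left + width
    print(f"[{left:7.1f}, {right:7.1f}]  {'#' * count}  ({count})")
```

The empty bin between the main body and the last bin is the visual gap that marks the four large values as outliers.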

Understanding the context around outliers can add interesting insights to a narrative and help data scientists decide how to handle them.

Let's explore three steps for handling outliers in data storytelling:

  1. Identifying and visualizing outliers

  2. Identifying trends and relationships of outliers and other data points

  3. Resolving or keeping outliers
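
As a rough illustration of steps 1 and 3, the sketch below flags outliers in a small hypothetical sample with the 1.5 × IQR rule (a common convention, assumed here rather than taken from this lesson). It then compares summary statistics when the outliers are kept, removed, or log-transformed. The `values` list and all variable names are made up for illustration.

```python
import math
import statistics

# Hypothetical sample: mostly small values plus two unusually large ones.
values = [12, 15, 14, 18, 20, 16, 13, 17, 19, 15, 95, 110]

# Step 1: flag outliers with the 1.5 * IQR rule (an assumed convention).
q1, _, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Step 3: three ways to handle them.
kept = values
removed = [v for v in values if lower_fence <= v <= upper_fence]
transformed = [math.log10(v) for v in values]  # compresses large values

for name, sample in [("kept", kept), ("removed", removed), ("log10", transformed)]:
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    print(f"{name:8s} mean={mean:7.2f} median={median:7.2f}")
print("outliers:", outliers)
```

In this sample the median barely moves between the kept and removed versions while the mean shifts a lot, which is one reason to check both before deciding how to treat outliers.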

Context of the data

We will use the Tips dataset, which contains information that one waiter collected about the tips they received while working in a restaurant over a few months.

Python 3.10.4
import plotly.data

# Load the Tips dataset as a pandas DataFrame
tips_data = plotly.data.tips()
# Print the feature names and the first 10 rows of the DataFrame
print(tips_data.columns.tolist())
print(tips_data.head(10))

The variables in the dataset include:

  • total_bill: The total bill in dollars

  • tip: The total tip ...