Python Bokeh box plot
Bokeh is a Python library used for creating interactive visualizations in a web browser. It provides powerful tools that offer flexibility, interactivity, and scalability for exploring various data insights.
What is a box plot?
A box plot is a visual summary of a dataset built from five statistical measures. Together, these measures show the range and central tendency of the data, which gives a quick picture of how the data is distributed.
Upper extreme: The maximum value in the dataset, i.e., the highest point the data reaches.
Upper quartile: The third quartile, i.e., the value below which 75% of the data falls.
Median: The middle value that divides the dataset into two halves, i.e., 50% of the data is above it and 50% is below it.
Lower quartile: The first quartile, i.e., the value below which 25% of the data falls.
Lower extreme: The minimum value in the dataset, i.e., the lowest point the data reaches.
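As a minimal sketch of how these five measures can be computed, the snippet below uses only Python's standard library. The function name five_number_summary and the sample values are made up for illustration, and the "inclusive" quantile method is just one of several conventions, so other tools may report slightly different quartiles for the same data.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return the five box plot measures for a list of numbers."""
    data = sorted(values)
    # n=4 splits the data into quarters and returns the three cut points.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    return {
        "lower_extreme": data[0],
        "lower_quartile": q1,
        "median": median,
        "upper_quartile": q3,
        "upper_extreme": data[-1],
    }


# Hypothetical sample: city MPG readings for nine cars.
sample = [12, 14, 14, 15, 17, 18, 19, 21, 22]
summary = five_number_summary(sample)
print(summary)
```

The box of a box plot spans the lower to upper quartile, with a line at the median.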
Real-life application
Box plots are widely used in industry and research to analyze outputs and results across many domains, especially when the distributions of several groups need to be compared side by side.
Required imports
import pandas as pd
from bokeh.io import output_file, save
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg2 import autompg2
from bokeh.transform import factor_cmap
pandas: To manipulate data.
bokeh.io: To control the output and display of the plots. We specifically import the output_file and save functions from it.
bokeh.models: To create highly customized visualizations in Bokeh. We specifically import the ColumnDataSource and Whisker classes from it.
bokeh.plotting: To create and customize plots without working directly with the lower-level Bokeh models. We specifically import the figure and show functions from it.
bokeh.sampledata: To import and access the sample datasets available for Python Bokeh and use them to test your code. autompg2 is one of these datasets; it contains information about various car models, including city and highway MPG, engine displacement, number of cylinders, and vehicle class.
bokeh.transform: To transform the data by adding visual properties such as colors, sizes, and positions. We specifically import the factor_cmap function from it.
Example code
import pandas as pd
from bokeh.io import output_file, save
from bokeh.models import ColumnDataSource, Whisker
from bokeh.plotting import figure, show
from bokeh.sampledata.autompg2 import autompg2
from bokeh.transform import factor_cmap

dataFrame = autompg2[["class", "cty"]].rename(columns={"class": "kind"})

kinds = dataFrame.kind.unique()

# compute quartiles
quartilesDF = dataFrame.groupby("kind").cty.quantile([0.25, 0.5, 0.75])
quartilesDF = quartilesDF.unstack().reset_index()
quartilesDF.columns = ["kind", "q1", "q2", "q3"]
dataFrame = pd.merge(dataFrame, quartilesDF, on="kind", how="left")

# compute IQR outlier bounds
iqr = dataFrame.q3 - dataFrame.q1
dataFrame["upper"] = dataFrame.q3 + 1.5*iqr
dataFrame["lower"] = dataFrame.q1 - 1.5*iqr

source = ColumnDataSource(dataFrame)

# create plot
myPlot = figure(x_range=kinds, tools="", toolbar_location=None,
                title="City driving MPG distribution by vehicle class",
                background_fill_color="#bbbfbf", y_axis_label="Fuel efficiency")

# outlier range
whisker = Whisker(base="kind", upper="upper", lower="lower", source=source)
whisker.upper_head.size = whisker.lower_head.size = 20
myPlot.add_layout(whisker)

# color palette
cmap = factor_cmap("kind", "TolRainbow7", kinds)

# quartile boxes
myPlot.vbar("kind", 0.7, "q2", "q3", source=source, color=cmap, line_color="black")
myPlot.vbar("kind", 0.7, "q1", "q2", source=source, color=cmap, line_color="black")

# outliers
outliers = dataFrame[~dataFrame.cty.between(dataFrame.lower, dataFrame.upper)]
myPlot.scatter("kind", "cty", source=outliers, size=6, color="black", alpha=0.3)

output_file("output.html")
show(myPlot)
Code explanation
Lines 1–6: Import all the necessary libraries and modules.
Line 8: Select the class and cty columns from the autompg2 dataset to create a new dataFrame, and use rename() to rename the class column to kind. Renaming is not strictly necessary, but class is a reserved word in Python, so the new name makes the column easier to refer to in the code (for example, as dataFrame.kind).
Line 10: Extract all the unique values from the kind column and assign them to the kinds variable.
Line 13: Use groupby() to group the rows by the kind column and calculate the quartiles of the cty column for each group. The resulting pandas Series is assigned to quartilesDF.
Lines 14–15: Use unstack() to create a separate column for each quartile, and assign a name to each column.
Line 16: Merge the data frames dataFrame and quartilesDF on the kind column using a left join.
Lines 19–21: Calculate the interquartile range, i.e., the difference between the 75th and 25th percentiles, and assign it to the iqr variable. Then save the upper and lower bounds in new dataFrame columns.
Note: We multiply iqr by 1.5 because it is a widely accepted convention: values that lie more than 1.5 times the interquartile range beyond the quartiles are treated as outliers.
Line 23: Create a ColumnDataSource object from dataFrame so the data can be provided to the plot.
Lines 26–28: Create myPlot using the figure() function and pass all the specifications as parameters. Set x_range to kinds and specify the title, y-axis label, and background color for the plot.
Line 31: Create a whisker object using Whisker() and pass the base, upper, lower, and source as parameters.
Line 32: Set the size of upper_head and lower_head, which controls how large the caps at the ends of the whisker appear in the plot.
Line 33: Add the whisker to the myPlot figure using the add_layout() method.
Line 36: Use the factor_cmap() function to map the values of the kind column to colors from the TolRainbow7 palette and assign the mapping to cmap.
Lines 39–40: Create the quartile boxes on myPlot using the vbar() function and pass the column name, bar width, quartiles, source, and color mapping as parameters. Call the function twice: once for the upper box (median to third quartile) and once for the lower box (first quartile to median).
Line 43: Identify the rows of dataFrame where the cty value is not between the lower and upper bounds and assign them to outliers.
Line 44: Draw the outliers as points using scatter() and pass the columns, source, size, color, and transparency as parameters.
Lines 46–47: Use output_file() to set output.html as the file the plot is written to, and use show() to display the created plot.
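To make the grouping and 1.5 × IQR steps easier to follow in isolation, here is a small sketch that mirrors them without pandas or Bokeh. The rows, the iqr_bounds helper, and the "inclusive" quantile method are assumptions chosen for illustration; they are not part of the example above, and the real autompg2 data will give different numbers.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical (kind, cty) rows standing in for the autompg2 columns.
rows = [
    ("compact", 18), ("compact", 20), ("compact", 21),
    ("compact", 22), ("compact", 24), ("compact", 35),
    ("suv", 11), ("suv", 12), ("suv", 13), ("suv", 14), ("suv", 15),
]

# Group the cty values by kind, similar in spirit to groupby("kind").
groups = defaultdict(list)
for kind, cty in rows:
    groups[kind].append(cty)


def iqr_bounds(values):
    """Return (lower, upper) outlier bounds using the 1.5 * IQR convention."""
    q1, _, q3 = quantiles(sorted(values), n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Keep only the values that fall outside the bounds of their own group.
outliers = {}
for kind, values in groups.items():
    lower, upper = iqr_bounds(values)
    outliers[kind] = [v for v in values if not lower <= v <= upper]

print(outliers)
```

Each group gets its own bounds, which is why the example merges the per-kind quartiles back onto every row before computing the upper and lower columns.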
Code output
Running the code writes the box plot to output.html and opens it in the browser. The boxes are colored with the TolRainbow7 palette on a #bbbfbf background, with whiskers and labels as specified in the code.
Common Query
Can we modify the visual appearance of the plot?
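Yes. As a rough sketch, the snippet below adjusts a few styling properties on a small stand-in figure. The data, colors, and font sizes are arbitrary choices for illustration and are not taken from the example above; the same properties could be set on myPlot.

```python
from bokeh.plotting import figure

# Hypothetical categories and box extents, used only to have something to style.
kinds = ["compact", "suv"]
p = figure(x_range=kinds, tools="", toolbar_location=None,
           title="Styled example")
p.vbar(x=kinds, width=0.7, bottom=[18, 12], top=[24, 16],
       fill_color="#4477aa", line_color="black")

# Visual tweaks: title size, background, grid lines, and axis text.
p.title.text_font_size = "16px"
p.background_fill_color = "#eaefef"
p.xgrid.grid_line_color = None  # hide the vertical grid lines
p.axis.major_label_text_font_size = "14px"
p.yaxis.axis_label = "Fuel efficiency"
```

Passing p to show() would then render the figure with these settings applied.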