How to make a boxplot in Polars using Matplotlib
A boxplot is a valuable tool for visualizing the distribution of data. The Polars library itself doesn’t offer direct support for creating boxplots, but we can easily generate them from Polars data by passing it to the Matplotlib library.
The boxplot() function
The boxplot() function in Matplotlib is used to create boxplots, a common way to visualize a dataset’s distribution and summary statistics.
Syntax
plt.boxplot(x, notch=None, sym=None, vert=None, whis=None, positions=None, widths=None, patch_artist=None)
Parameters
Here are the main parameters of the boxplot() function and their explanations:
x: This is the data we want to plot. It can be a single array or a list of arrays (one array per box in the boxplot).
notch: If set to True, this creates a notched boxplot that displays a confidence interval around the median.
sym: This is the symbol used to indicate outliers, given as a format string such as 'r+'. An empty string hides the outliers. If it is left as None, Matplotlib's default outlier marker is used.
vert: This creates vertical boxplots if set to True (the default). If set to False, it creates horizontal boxplots.
whis: This is the whisker reach as a multiple of the interquartile range (IQR). The default is 1.5, which is the standard definition. The lower whisker is drawn from the box to the smallest data point that is not below Q1 - 1.5 * IQR, and the upper whisker is drawn from the box to the largest data point that is not above Q3 + 1.5 * IQR. Any data points that fall outside this range are treated as outliers and are displayed as individual points beyond the ends of the whiskers.
positions: This specifies the positions of the boxes along the axis (the x-axis for vertical boxplots). It is an array-like object with one position per box.
widths: This specifies the width of the boxes. We can provide a single scalar or an array-like object with one width per box.
patch_artist: If set to True, the boxes are drawn as patch objects rather than as plain lines, which allows us to customize their appearance, for example by setting a fill color.
These parameters allow us to customize various aspects of the boxplot to suit our visualization needs. Depending on the data and the specific insights we want to convey, we can adjust these parameters accordingly when calling plt.boxplot().
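As a rough illustration of how several of these parameters combine, the sketch below draws two boxes from made-up numbers. The sample values, colors, and tick labels are assumptions chosen only for demonstration, and vert is left at its default.

```python
import matplotlib.pyplot as plt

# Made-up sample data: one list per box
group_a = [5, 6, 7, 8, 9, 10, 11, 30]   # 30 lies far from the other values
group_b = [12, 13, 14, 15, 16, 17, 18]

fig, ax = plt.subplots(figsize=(8, 4))
result = ax.boxplot(
    [group_a, group_b],
    notch=True,          # notches mark a confidence interval around the median
    sym="rx",            # draw outliers as red crosses
    whis=1.5,            # whiskers reach at most 1.5 * IQR beyond the box
    positions=[1, 2],    # where each box sits on the x-axis
    widths=[0.4, 0.6],   # one width per box
    patch_artist=True,   # boxes become patch objects that can be filled
)

# With patch_artist=True, each box is a patch whose fill color can be set
for box, color in zip(result["boxes"], ["lightblue", "lightgreen"]):
    box.set_facecolor(color)

ax.set_xticks([1, 2])
ax.set_xticklabels(["A", "B"])
ax.set_ylabel("Value")
# Call plt.show() to display the figure in an interactive session
```

The call returns a dictionary of the drawn artists (for example "boxes", "whiskers", "medians", and "fliers"), which is what makes the color loop possible.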
Code example
Here is example code that demonstrates how to create a boxplot using Matplotlib with data from a Polars DataFrame:
# Import required libraries
import polars as pl
import matplotlib.pyplot as plt

# Create a sample Polars DataFrame
data = pl.DataFrame({
    'Category': ['X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z', 'X', 'Y', 'Z'],
    'Value': [5, 8, 12, 6, 9, 14, 7, 10, 16, 20, 4, 11]
})

# Extract the data we want to visualize
categories = data['Category'].to_list()
values = data['Value'].to_list()

# Create an empty list to hold the data for each category
category_data = []

# Extract and organize data by category (sorted so the box order is stable)
for category in sorted(set(categories)):
    category_values = [values[i] for i in range(len(categories)) if categories[i] == category]
    category_data.append(category_values)

# Create a boxplot using Matplotlib
plt.figure(figsize=(8, 6))
plt.boxplot(category_data, labels=sorted(set(categories)))
plt.xlabel('Category')
plt.ylabel('Value')
plt.title('Boxplot')
plt.show()
Explanation
In the above code:
Lines 6–9: We create a Polars DataFrame called data with two columns: Category and Value.
Lines 12–13: We extract the Category and Value columns into Python lists using to_list() so that we can organize and plot the data with Matplotlib.
Line 16: We create an empty list called category_data to store the data for each category.
Lines 19–21: We iterate through the unique categories in the Category column and extract the corresponding Value data for each category.
Lines 24–28: We use Matplotlib to create a boxplot, passing the category_data list and labels as arguments. We set the title and axis labels.
Line 29: Finally, we display the boxplot using plt.show().
The code generates a boxplot that visualizes the distribution of Value data for each unique Category in the sample dataset. The x-axis represents the categories ('X', 'Y', 'Z'), and the y-axis represents the values. Each box in the plot represents a category. Within each box, a horizontal line indicates the median value, the box itself spans the IQR (from the first quartile to the third quartile), and the whiskers extend to the most extreme data points that lie within 1.5 times the IQR of the box edges. Any values beyond the whiskers are drawn as individual outlier points.
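To make the whisker rule concrete, the quartiles and whisker limits for category 'X' can be worked out by hand. This is a sketch using NumPy's default (linear interpolation) percentiles, which is an assumption about how the quartiles are computed; the variable names are illustrative only.

```python
import numpy as np

# The Value entries that belong to category 'X' in the sample data
values = np.array([5, 6, 7, 20], dtype=float)

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Fences: the furthest a whisker may reach when whis is 1.5
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences,
# but never inside the box itself
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low = min(inside.min(), q1)
whisker_high = max(inside.max(), q3)

# Everything outside the fences is drawn as an individual outlier point
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, median, q3, iqr)       # 5.75 6.5 10.25 4.5
print(lower_fence, upper_fence)  # -1.0 17.0
print(outliers)                  # [20.]
```

Here the value 20 lies above the upper fence of 17, so it appears as a separate outlier point rather than as the end of a whisker.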