How to create heatmap clustering using seaborn
Heatmap clustering is a particularly useful visualization for exploring the relationship between variables and identifying patterns within large datasets. Seaborn, a Python visualization library based on matplotlib, offers an easy-to-use interface for creating sophisticated visualizations, including heatmaps with clustering.
Let's explore how to create heatmap clustering using seaborn, covering setup, key parameters, and code examples.
Understanding heatmap clustering
Heatmap clustering is a data visualization technique that organizes data into a grid where colors represent the magnitude of the values. Clustering adds another layer of insight by grouping similar rows and/or columns together based on a similarity metric. This is particularly useful in bioinformatics, market research, social sciences, and other fields where spotting patterns in complex datasets is crucial.
Seaborn's clustermap function is specifically designed for this purpose. It not only generates a heatmap but also applies hierarchical clustering to group similar rows and columns together, making patterns more evident.
Setup: libraries installation
First, ensure that Python is installed on the system. Then, we will install numpy for numerical operations, pandas for data manipulation, matplotlib for plotting, seaborn for data visualization, and scipy to support the clustermap function in seaborn. We can install all these libraries using pip:
pip install numpy pandas matplotlib scipy seaborn
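As a quick check that the installation worked, the following minimal sketch imports each library and prints its version. The `versions` dictionary is only an illustrative name, and the exact version numbers depend on your environment.

```python
import importlib

# Libraries installed by the pip command
names = ["numpy", "pandas", "matplotlib", "scipy", "seaborn"]

# Import each library and record its version string
versions = {}
for name in names:
    module = importlib.import_module(name)
    versions[name] = module.__version__

for name, version in versions.items():
    print(f"{name}: {version}")
```

If any import fails with a ModuleNotFoundError, the corresponding package was not installed into the active Python environment.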
With the environment set, we can start using Seaborn to create heatmap clustering. Here are the details about the syntax, parameters, and output of the clustermap function.
Syntax
The data argument is the only required parameter. All other options are passed as keyword arguments.
seaborn.clustermap(data, **kwargs)
Parameters
The clustermap function in seaborn is highly customizable, with several parameters allowing you to tailor the visualization to your needs. Key parameters include:
data: The dataset to visualize, typically a pandas DataFrame.
method: The linkage method to use for clustering (e.g., 'single', 'complete', 'average'). This affects how the distance between clusters is calculated.
metric: The distance metric to use for clustering (e.g., 'euclidean', 'cityblock'). This determines how similarity is measured.
z_score: Whether to standardize the data by row (0) or column (1) before plotting. Each row or column is shifted to a mean of 0 and scaled to a variance of 1, which can make patterns more apparent by normalizing the data range.
standard_scale: Similar to z_score, but rescales each row (0) or column (1) to have a minimum of 0 and a maximum of 1.
cmap: The colormap to use for the heatmap. Seaborn has many built-in colormaps, or you can use matplotlib colormaps.
row_cluster and col_cluster: Booleans to specify whether to cluster rows and/or columns. Both default to True.
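To make the difference between the two scaling options concrete, here is a minimal sketch that reproduces them by hand with pandas on a small made-up DataFrame. The data and variable names are hypothetical, and the arithmetic follows the documented definitions of standard_scale and z_score applied per column (value 1). It is not a copy of seaborn's internal code.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: 3 observations (rows) x 2 variables (columns)
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 40.0]})

# Like standard_scale=1: per column, subtract the minimum,
# then divide by the maximum of the shifted values (range 0 to 1)
shifted = df - df.min()
min_max_scaled = shifted / shifted.max()

# Like z_score=1: per column, subtract the mean,
# then divide by the standard deviation (mean 0, variance 1)
z_scored = (df - df.mean()) / df.std()

print(min_max_scaled)
print(z_scored)
```

Min-max scaling keeps every value between 0 and 1, while z-scores are centered on 0 and can be negative, so the two options can call for different colormaps.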
Output
The function returns a ClusterGrid object. This object provides access to the underlying figure and axes objects and allows further customization.
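As a rough illustration of working with the returned object, the sketch below reads the clustered row and column order and relabels the heatmap axes. The sample data and axis labels are made up, and the Agg backend is an assumption chosen only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed

import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical sample data: 6 observations x 4 variables
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((6, 4)), columns=["A", "B", "C", "D"])

grid = sns.clustermap(data)

# Order of rows and columns after clustering,
# given as positions into the original DataFrame
row_order = grid.dendrogram_row.reordered_ind
col_order = grid.dendrogram_col.reordered_ind
print(row_order, col_order)

# The heatmap axes behave like any matplotlib Axes
grid.ax_heatmap.set_xlabel("Variables")
grid.ax_heatmap.set_ylabel("Observations")

# Reorder the original data the same way the heatmap displays it
reordered = data.iloc[row_order, col_order]
```

Reading the reordered indices this way is useful when you want to export the clustered table or label clusters outside the plot.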
Code
The following script demonstrates how to generate sample data (using NumPy's random module), create a heatmap with clustering using seaborn, and display the plot.
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Generate sample data
np.random.seed(0)
data = pd.DataFrame(np.random.rand(10, 12), columns=[f'Var{i+1}' for i in range(12)])

# Create a heatmap with clustering
sns.clustermap(
    data,
    cmap='viridis',
    standard_scale=1,
    method='average',
    metric='euclidean',
    row_cluster=True,
)

# Show the plot
plt.show()
Explanation
The code above is explained in detail below:
Imports: Import seaborn, pandas, numpy, and matplotlib's pyplot module.
Sample data: Set a seed so the results are reproducible, then create a pandas DataFrame containing a 10 × 12 array of random numbers between 0 and 1 using numpy. The columns parameter names the columns 'Var1', 'Var2', ..., 'Var12' using a list comprehension and formatted strings.
clustermap call: Uses seaborn's clustermap function to create a heatmap with hierarchical clustering applied to both rows and columns. The cmap='viridis' parameter sets the color map to 'viridis', which is a color scheme in matplotlib. The standard_scale=1 parameter rescales each column to the range 0 to 1 by subtracting its minimum and dividing by its maximum, which makes variables with different ranges easier to compare. The method='average' parameter specifies the linkage method, and metric='euclidean' sets the distance metric for the clustering. Finally, row_cluster=True explicitly enables row clustering. Columns are clustered as well because col_cluster defaults to True.
plt.show(): Displays the generated plot for the heatmap clustering.
Heatmap and clustering
Heatmap:
The heatmap displays values in the data matrix, where each cell's color represents the value of the corresponding variable (column) for a particular observation (row).
The colors in the heatmap range from dark purple (low values) to bright yellow (high values), according to the viridis color map.
The color bar in the upper-left corner indicates the scale of values from 0.0 (dark purple) to 1.0 (bright yellow).
Clustering:
The heatmap is accompanied by dendrograms, which are tree-like diagrams that show the arrangement of the clusters produced by hierarchical clustering.
Row clustering: The dendrogram on the left shows how the rows (observations) are clustered together based on the similarity of their patterns across the variables.
Column clustering: The dendrogram at the top shows how the columns (variables) are clustered together based on the similarity of their values across the observations.
Patterns in clustering
The dendrograms provide insight into the structure of the data:
Row clusters: Rows that are close to each other in the dendrogram are more similar to each other in terms of the values across all variables. For example, rows that are clustered together at the bottom or top of the heatmap share similar value patterns across the variables.
Column clusters: Similarly, variables (columns) that are close to each other in the dendrogram are more similar to each other in terms of their values across all observations.
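The same kind of hierarchical clustering can be sketched directly with scipy, which seaborn relies on for clustermap. The example below uses a tiny made-up matrix in which rows 0 and 2 are similar and rows 1 and 3 are similar. The data and variable names are hypothetical, and the method and metric match those used in the clustermap call earlier.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, leaves_list, linkage

# Hypothetical data: rows 0 and 2 are close, rows 1 and 3 are close
values = np.array([
    [0.10, 0.20, 0.10],
    [0.90, 0.80, 0.90],
    [0.15, 0.25, 0.10],
    [0.85, 0.80, 0.95],
])

# Build the hierarchy over rows with average linkage and Euclidean distance
row_linkage = linkage(values, method="average", metric="euclidean")

# Leaf order of the dendrogram: similar rows end up next to each other
order = leaves_list(row_linkage)

# Cut the tree into two flat clusters and get a label per row
labels = fcluster(row_linkage, t=2, criterion="maxclust")

print(order)
print(labels)
```

Rows that receive the same label here are the ones that would sit under a shared branch of the row dendrogram. A precomputed linkage like this can also be passed to clustermap through its row_linkage and col_linkage parameters.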
Color information
Dark purple (low values): These cells indicate that the value for a particular observation and variable is low (close to 0).
Bright yellow (high values): These cells indicate that the value for a particular observation and variable is high (close to 1).
Other colors (intermediate values): The shades of blue, teal, and green represent intermediate values between the extremes of 0 and 1.
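To see how values map to colors, this small sketch samples the viridis colormap at 0.0, 0.5, and 1.0 and prints the resulting RGBA tuples. The variable names are illustrative only.

```python
import matplotlib.pyplot as plt

# Look up the viridis colormap by name
cmap = plt.get_cmap("viridis")

# Sample the colormap at the low end, the midpoint, and the high end.
# Each result is an (red, green, blue, alpha) tuple with components in 0 to 1.
low = cmap(0.0)   # dark purple
mid = cmap(0.5)   # teal/green
high = cmap(1.0)  # bright yellow

print("low: ", low)
print("mid: ", mid)
print("high:", high)
```

The low end has more blue than green, and the high end is dominated by red and green, which together appear yellow.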
Conclusion
Heatmap clustering is a useful tool for analyzing data. It helps to find patterns and connections in large sets of data. Seaborn makes it easier to create heatmap clusters, offering many options to customize the results for different purposes.