How to plot a parallel coordinates chart in Pandas
Pandas is one of the most popular Python libraries used for data manipulation and visualization. Real-world problems are often multivariate in nature because they require several features or variables to predict an outcome accurately.
Therefore, in this answer, we will be examining parallel coordinate charts and how they can be used to analyze multivariate data.
Parallel coordinate charts
Parallel coordinate charts display each variable on its own axis and scale, as opposed to the traditional x and y axes used in other plots. Each data point is drawn as a line that connects its values across the axes. If the lines between two adjacent axes run roughly parallel, the two variables tend to have a positive relationship. If the lines cross each other, the relationship tends to be negative.
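To make the parallel-versus-crossing idea concrete, here is a minimal sketch on made-up data. The column names `a`, `b`, `c`, and the dummy class column `group` are assumptions for illustration only: `b` is built to move with `a`, and `c` to move against it.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
a = rng.normal(size=30)

demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=30),   # rises and falls with a
    "c": -a + rng.normal(scale=0.1, size=30),  # moves in the opposite direction
    "group": "all",  # parallel_coordinates always needs a class column
})

# Between axes a and b the lines run roughly parallel;
# between axes b and c they cross over each other.
ax = pd.plotting.parallel_coordinates(demo, "group")
```

Calling `plt.show()` afterwards displays the figure. The segment between `a` and `b` should look like a bundle of near-parallel lines, while the segment between `b` and `c` should form an X-like crossing.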
Cons
The downside of these charts is that too many variables and data points crowd the chart, hiding any meaningful patterns. It's advisable to show only a few variables at a time, or to use a technique called brushing, which emphasizes selected data points and fades the rest into the background.
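Pandas has no built-in brushing, but the effect can be approximated by drawing every row in a muted color and then redrawing the selected rows on the same axes. The sketch below assumes random demo data and a hypothetical selection rule (rows whose `X1` is above its median); both are placeholders.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(40, 4)), columns=["X1", "X2", "X3", "X4"])
df["y"] = rng.integers(0, 2, size=40)

# Hypothetical "brush": keep only the rows where X1 is above its median
brushed = df[df["X1"] > df["X1"].median()]

fig, ax = plt.subplots(figsize=(10, 6))

# Background layer: every row in light gray under a single dummy class
pd.plotting.parallel_coordinates(
    df.assign(y="all rows"), "y", color=["lightgray"], ax=ax
)
# Foreground layer: only the brushed rows, drawn on top in a strong color
pd.plotting.parallel_coordinates(
    brushed.assign(y="brushed"), "y", color=["tab:red"], ax=ax
)
```

Layering two calls on one `ax` keeps the full dataset visible for context while only the selected rows stand out. Call `plt.show()` to display the result.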
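The other disadvantage is that variables measured on very different scales are hard to compare. Note that pandas draws all axes against one shared y scale, so a variable with a large range can flatten the others. In cases where the variation is high, data normalization is recommended for better results.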
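One simple way to normalize is min-max scaling, which maps every feature to the range 0 to 1 before plotting. The data below is invented for illustration, and min-max scaling is only one of several reasonable choices (standardization is another).

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data with very different ranges per column
raw = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0],
    "income": [20000.0, 52000.0, 38000.0, 90000.0],
    "score": [0.2, 0.9, 0.4, 0.7],
    "label": ["a", "b", "a", "b"],
})
features = ["height_cm", "income", "score"]

# Min-max scaling: (value - column min) / (column max - column min)
scaled = raw.copy()
col_min = raw[features].min()
col_max = raw[features].max()
scaled[features] = (raw[features] - col_min) / (col_max - col_min)

ax = pd.plotting.parallel_coordinates(scaled, "label")
```

Without the scaling step, `income` would dominate the shared y scale and the other two variables would appear as nearly flat lines.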
These charts are recommended for data that has distinct classes in it. For regression problems, where there are no classes to color the lines by, the results might not be satisfactory.
Advantages
Parallel coordinates charts can be used to:
- Visualize different features at the same time without plotting univariate charts.
- Detect outliers that exist in a dataset.
- Compare characteristics across different classes in classification problems.
- Support feature selection and dimensionality reduction, as these charts can reveal variables that have similar characteristic patterns.
- Communicate complex patterns that exist in the data.
How to plot a parallel coordinates chart in Pandas
The syntax to plot a parallel coordinates chart is as below:
Syntax
pandas.plotting.parallel_coordinates(
frame, class_column, cols=None, ax=None, color=None, use_columns=False, xticks=None, colormap=None, axvlines=True, axvlines_kwds=None, sort_labels=False, **kwargs
)
Parameters
- frame [required]: the DataFrame to be used.
- class_column [required]: the name of the column containing the class names.
- cols [optional]: the list of column names to be used.
- ax [optional]: the Matplotlib axes object to be used.
- color [optional]: the colors to be used for the different classes.
- use_columns [optional]: defaults to False. If True, the columns will be used as the xticks.
- xticks [optional]: a list of values to use for the xticks.
- colormap [optional]: the colormap to use for the line colors.
- axvlines [optional]: defaults to True. This adds vertical lines at each xtick.
- axvlines_kwds [optional]: keyword options to pass to the vertical lines above.
- sort_labels [optional]: sorts the class labels, which is useful when assigning colors.
- kwargs: additional keyword arguments that can be passed to customize the chart.
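As a rough illustration of how several of these parameters combine, the sketch below uses random demo data with made-up class labels ("high" and "low"). The specific colors, line style, and `alpha` value are arbitrary choices, not recommendations.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(20, 4)), columns=["X1", "X2", "X3", "X4"])
df["y"] = np.where(df["X1"] > 0, "high", "low")  # made-up class labels

fig, ax = plt.subplots(figsize=(8, 5))
pd.plotting.parallel_coordinates(
    df,
    "y",
    cols=["X1", "X3", "X4"],             # plot only a subset of the columns
    color=["tab:blue", "tab:orange"],    # one color per class
    axvlines_kwds={"linestyle": "--", "color": "gray"},  # style the vertical lines
    sort_labels=True,                    # sort class labels before assigning colors
    ax=ax,                               # draw on an existing axes object
    alpha=0.6,                           # extra keyword forwarded to the line plot
)
```

Passing `ax` makes it easy to place the chart in a larger figure, and `cols` is a quick way to reduce clutter when the DataFrame has many features.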
Example code
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=50, n_features=4)
df = pd.DataFrame(X)
df.columns = ['X1', 'X2', 'X3', 'X4']
df['y'] = y

print(df.shape)
print(df.head())

plt.figure(figsize=(10, 6))
pd.plotting.parallel_coordinates(frame=df, class_column='y')
plt.legend(loc='upper right')
plt.show()
Note: We are using Python 3.10.4 in this answer.
Code explanation
Lines 1-3: We import the necessary libraries.
Lines 5-8: We create a DataFrame consisting of 5 columns, 50 rows, and 2 classification classes.
Lines 10-11: We print the shape and the first few rows of the DataFrame.
Lines 13-16: We use Pandas to plot a parallel coordinates chart, indicating “y” as the class_column for reference.
In the run shown here, the two classes have distinct variations: class 1, represented by blue, takes higher values in all variables, while class 0 sits on the lower side of the different axes. Because make_classification generates random data, the exact pattern will vary between runs.