Python offers several libraries for data manipulation and visualization, such as seaborn, Matplotlib, and pandas. Data visualization represents data through graphs, charts, and plots, which makes it easier to recognize patterns, trends, and relationships between variables in both simple and complex data. Visual displays let us take in information efficiently, which especially helps visual learners. For this Answer, we’ll plot our scatter matrix using pandas.plotting.scatter_matrix().
A scatter matrix plots all the variables in the data against each other. If the dataset has n variables, the scatter matrix is an n × n grid of plots, with each variable paired against every other one.
To get started, simply install Python and the packages numpy, pandas, and matplotlib. Then, we’ll import these packages in our Python file. The following command shows how to install these packages:
!pip install numpy pandas matplotlib
The following command shows how to import them:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Now, we’ll go over the parameters that control how pandas.plotting.scatter_matrix() plots a scatter matrix.
Note: All the parameters specified below are optional except for frame.
pandas.plotting.scatter_matrix(frame, alpha=0.5, figsize=None, ax=None, grid=False, diagonal="hist", marker=".", density_kwds=None, hist_kwds=None, range_padding=0.05, **kwargs)
frame: This is the pandas DataFrame object to plot.
alpha: This is the transparency of the markers, given as a floating-point value between 0 (fully transparent) and 1 (fully opaque).
figsize: This is a tuple containing the width and height of the figure in inches.
ax: This is an optional Matplotlib axes object to draw on.
grid: Passing it a value of True displays the grid.
marker: This defines the shape of the marker (the symbol drawn for each data value) on the scatter plot.
diagonal: The diagonal can display a hist (histogram) or a kde (kernel density estimate) plot. The kde option relies on SciPy being installed.
hist_kwds: This is a dictionary of keyword arguments passed to the histogram function.
density_kwds: Similar to hist_kwds, this is a dictionary of keyword arguments passed to the kernel density estimate plot.
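range_padding: It sets the relative padding added to the x- and y-axis ranges, expressed as a fraction of each variable’s range (x_max - x_min or y_max - y_min).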
**kwargs: These are any additional keyword arguments, which are passed on to the scatter function.
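As a minimal sketch of how these parameters fit together, the snippet below builds a small DataFrame and checks that the function returns one Matplotlib axes object per pair of columns. The column names, the random seed, and the Agg backend line are assumptions made for illustration; they are not part of the original example.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display; drop this line in a notebook
import numpy as np
import pandas as pd

# Hypothetical demo data: 30 rows, 3 numeric columns
rng = np.random.default_rng(0)
demo_frame = pd.DataFrame(rng.normal(size=(30, 3)), columns=["a", "b", "c"])

# The DataFrame is passed positionally; every other parameter is optional
axes = pd.plotting.scatter_matrix(
    demo_frame,
    alpha=0.5,              # half-transparent markers
    figsize=(6, 6),         # width and height in inches
    diagonal="hist",        # histograms on the diagonal
    hist_kwds={"bins": 5},  # forwarded to the histogram call
    range_padding=0.1,      # extra room around the data on each axis
)

# The return value is a 2D NumPy array of axes, one per pair of columns
print(axes.shape)  # (3, 3)
```

With three columns, the result is a 3 × 3 grid, which matches the idea that every variable is plotted against every other one.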
The following example plots a scatter matrix twice: first with histograms and then with kernel density estimate plots along the diagonal. We can run the code in a Jupyter Notebook and tweak it as we like.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

l = ["column1", "column2", "column3", "column4"]
educatives_dataframe = pd.DataFrame(np.random.randint(0, 50, size=(50, 4)),
                                    columns=l)
print(educatives_dataframe.tail())

educatives_scatter_plot = pd.plotting.scatter_matrix(educatives_dataframe, alpha=0.9, figsize=(15, 15), grid=True, marker="*", diagonal="hist", hist_kwds={"bins": 5, "color": "pink"}, range_padding=0.1, color="red")
plt.suptitle("Scatter matrix", fontsize=50)
plt.show()

educatives_scatter_plot = pd.plotting.scatter_matrix(educatives_dataframe, alpha=0.9, figsize=(15, 15), grid=True, marker="D", diagonal="kde", density_kwds={"alpha": 0.3, "color": "red"}, range_padding=0.1, color="green")
plt.suptitle("Scatter matrix", fontsize=50)
plt.show()
In the code above:
Lines 1–3: We make the necessary imports as described above.
Lines 5–8: These lines initialize a pandas DataFrame object. They create a DataFrame with 50 rows and four columns filled with random integers from 0 to 49 (the upper bound of 50 is exclusive). Finally, we display the last five rows of the generated DataFrame.
Lines 10–12: Here, we create a scatter matrix. Notably, we’ll see a histogram with five bins along the diagonal. alpha has been set to 0.9, meaning the red markers will be mostly opaque. We have also used the suptitle function to give a title to the entire matrix, not just to a single plot.
Lines 14–16: This piece of code does the same, except that it displays kernel density estimate plots instead of histograms along the diagonal and uses green, diamond-shaped markers.
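To illustrate the difference between a title for the entire matrix and a title for a single plot, here is a hedged sketch that uses the array of axes returned by the function. The DataFrame contents, the seed, the panel title text, and the Agg backend line are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display; drop this line in a notebook
import numpy as np
import pandas as pd

# Hypothetical data shaped like the example above, with three columns
rng = np.random.default_rng(1)
sample_frame = pd.DataFrame(
    rng.integers(0, 50, size=(50, 3)),
    columns=["column1", "column2", "column3"],
)

axes = pd.plotting.scatter_matrix(sample_frame, figsize=(8, 8))

# All panels share one figure; suptitle labels the whole matrix
figure = axes[0, 0].figure
figure.suptitle("Scatter matrix", fontsize=20)

# A single panel can be titled through its own axes object.
# The row index selects the y variable and the column index the x variable.
axes[0, 1].set_title("column1 (y) vs. column2 (x)")
```

Calling plt.suptitle() as in the example above has the same effect on the current figure; going through the returned axes is simply more explicit about which figure and which panel are being changed.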
In conclusion, pandas.plotting.scatter_matrix() is a useful function for plotting scatter matrices to better visualize our data. Because every variable is plotted against every other one, the plot reveals pairwise relationships, such as correlations, between the variables.
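As a rough numerical companion to the visual impression, the sketch below plots a scatter matrix and computes pairwise correlation coefficients with DataFrame.corr() on the same data. The synthetic columns x, y, and z, the seed, and the Agg backend line are assumptions chosen for illustration: y is built to depend strongly on x, while z is independent noise.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so the sketch runs without a display; drop this line in a notebook
import numpy as np
import pandas as pd

# Hypothetical data: y depends linearly on x, z is unrelated noise
rng = np.random.default_rng(42)
x = rng.normal(size=200)
corr_frame = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=200),
    "z": rng.normal(size=200),
})

# Visual check: the x/y panels should form a tight line, the z panels a cloud
axes = pd.plotting.scatter_matrix(corr_frame, alpha=0.5, figsize=(6, 6))

# Numerical check: pairwise Pearson correlation coefficients
corr_table = corr_frame.corr()
print(corr_table.round(2))
```

The scatter matrix shows the shape of each relationship, while the correlation table condenses each pair into a single number, so the two are best read together.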