Related Tags

correlation matrix
python

# What is a correlation matrix?

Hassaan Waqar

A correlation matrix is a table that shows the strength and direction of the linear relationship between each pair of variables in a dataset. Each cell holds the correlation coefficient for one pair of variables.

The correlation coefficient shows how strongly or weakly two variables are related. Its value ranges from -1 to 1: a value of 1 indicates a perfect positive correlation, a value of -1 indicates a perfect negative correlation, and values close to 0 indicate a weak correlation.

## Understanding the correlation coefficient

Correlation refers to the degree to which two variables move together. The underlying relationship can be causal or non-causal: a correlation on its own does not show that one variable causes the other. We say that there is a positive correlation when an increase in variable $x$ tends to be accompanied by an increase in variable $y$. We say that there is a negative correlation when an increase in variable $x$ tends to be accompanied by a decrease in variable $y$.
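To make the coefficient concrete, here is a minimal sketch that computes the Pearson correlation coefficient (the most commonly used one) directly from its definition: the covariance of the two variables divided by the product of their standard deviations. The helper name `pearson_r` and the sample lists are illustrative assumptions, not part of any library.

```python
import math

def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences (hypothetical helper)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of products of deviations (proportional to the covariance)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations (proportional to the variances)
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Illustrative data: y_up tends to rise with x, y_down tends to fall
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 5, 4, 6]
y_down = [9, 7, 6, 4, 1]

r_up = pearson_r(x, y_up)      # positive coefficient
r_down = pearson_r(x, y_down)  # negative coefficient
print(r_up, r_down)
```

The sketch assumes neither variable is constant; a constant column has zero variance, which makes the coefficient undefined.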

The illustration below shows positive and negative correlations:

Positive and Negative Correlation

The table below summarizes correlation coefficients:

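| Coefficient | Meaning |
| --- | --- |
| 1 | Perfect positive correlation. The points lie exactly on a straight line with a positive slope, so $y$ increases by a fixed amount for every unit increase in $x$. |
| -1 | Perfect negative correlation. The points lie exactly on a straight line with a negative slope, so $y$ decreases by a fixed amount for every unit increase in $x$. |
| 0 | No linear correlation. The variables have no linear relationship, although they may still be related in a nonlinear way. |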
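As a quick check of these three cases, the sketch below uses NumPy's `np.corrcoef` on small made-up arrays. The data and variable names are assumptions chosen purely for illustration.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

y_pos = 3 * x + 5     # increasing straight line, slope 3
y_neg = -0.5 * x + 1  # decreasing straight line, slope -0.5
y_sq = x ** 2         # depends on x, but not linearly

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the coefficient of the pair
r_pos = np.corrcoef(x, y_pos)[0, 1]
r_neg = np.corrcoef(x, y_neg)[0, 1]
r_sq = np.corrcoef(x, y_sq)[0, 1]

print(round(r_pos, 6), round(r_neg, 6), round(r_sq, 6))
```

The first two coefficients come out as 1 and -1 (up to floating-point rounding) even though neither slope is 1, because the coefficient measures how closely the points follow a straight line, not how steep that line is. The third is 0 even though $y$ is fully determined by $x$: a coefficient of 0 rules out a linear relationship, not every relationship.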

A correlation matrix displays the correlation between every pair of numerical variables in the dataset. If a dataset has $n$ numerical features, the correlation matrix has $n \times n = n^2$ values and is symmetric about its main diagonal, because the correlation of $x$ with $y$ equals the correlation of $y$ with $x$. Therefore, it is sufficient to analyze only the upper or lower triangle of the matrix.
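The symmetry can be verified directly. The following sketch, which assumes a small made-up DataFrame with three numerical columns, checks that the matrix equals its transpose and then keeps only the entries above the main diagonal, leaving one value per unique pair of variables.

```python
import numpy as np
import pandas as pd

# Illustrative data with n = 3 numerical features
df = pd.DataFrame({'A': [45, 37, 42, 35, 39],
                   'B': [38, 31, 26, 28, 33],
                   'C': [10, 15, 17, 21, 12]})

corr = df.corr()
n = corr.shape[0]

# The matrix equals its transpose, so corr(A, B) == corr(B, A)
is_symmetric = np.allclose(corr.values, corr.values.T)

# Boolean mask selecting the entries strictly above the main diagonal
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(upper_mask)  # every other entry becomes NaN

# One coefficient per unique pair of variables
pairs = {(corr.index[i], corr.columns[j]): corr.iloc[i, j]
         for i in range(n) for j in range(i + 1, n)}

print(is_symmetric)
print(upper)
print(pairs)
```

Of the $n^2$ entries, only $n(n-1)/2$ are distinct off-diagonal values, which is 3 for the three columns used here.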

The illustration below shows a visual representation of a correlation matrix:

A Correlation Matrix

Every entry on the diagonal is 1.00, since it represents the correlation of a variable with itself.

A gradient color scheme helps to improve understanding of the coefficient scores.
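One way to set up such a color scheme is sketched below with Seaborn's `heatmap`. It assumes a diverging colormap (`coolwarm`) pinned to the full coefficient range, so that negative and positive coefficients get visibly different colors and 0 sits at the neutral midpoint. The data and the output file name are illustrative assumptions.

```python
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

# Illustrative data
df = pd.DataFrame({'A': [45, 37, 42, 35, 39],
                   'B': [38, 31, 26, 28, 33],
                   'C': [10, 15, 17, 21, 12]})
corr = df.corr()

fig, ax = plt.subplots()
# vmin/vmax pin the color scale to the full range of the coefficient;
# center=0 places the neutral color at "no linear correlation"
sn.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm",
           vmin=-1, vmax=1, center=0, square=True, ax=ax)
ax.set_title("Correlation matrix")
fig.savefig("correlation_heatmap.png")  # hypothetical output file name
```

Fixing the scale to the range from -1 to 1 keeps colors comparable across different datasets; without it, the colors are scaled to the smallest and largest values that happen to appear in the matrix.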

## Example

The code snippet below shows how we can create a correlation matrix in Python:

```python
import pandas as pd  # for creating a dataframe
import seaborn as sn  # for drawing the heatmap
import matplotlib.pyplot as plt  # for displaying the figure

# Data for matrix
data = {'A': [45, 37, 42, 35, 39],
        'B': [38, 31, 26, 28, 33],
        'C': [10, 15, 17, 21, 12]
        }

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
print("Original Matrix")
print(df)  # original matrix

print("\n")
corrMatrix = df.corr()  # finding correlations
print("Correlation Coefficients Matrix")
print(corrMatrix)  # printing correlations

# Visual representation of the correlation matrix
sn.heatmap(corrMatrix, annot=True, cmap='Blues')
plt.show()  # display the heatmap
```


Line 11 creates a pandas DataFrame from the dictionary. Here, the DataFrame acts as our data matrix: each column is a variable and each row is an observation.

Line 16 calls the `corr` method on our DataFrame to calculate the matrix of correlation coefficients. By default, `corr` computes the Pearson correlation coefficient for each pair of columns.
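To read a single coefficient out of the result, index it by row and column label. The sketch below rebuilds the same data so it can run on its own and cross-checks one entry against `Series.corr`, which computes the coefficient for just one pair of columns.

```python
import math
import pandas as pd

df = pd.DataFrame({'A': [45, 37, 42, 35, 39],
                   'B': [38, 31, 26, 28, 33],
                   'C': [10, 15, 17, 21, 12]})

corrMatrix = df.corr()  # method='pearson' is the default

# Look up one coefficient by row and column label
r_ab = corrMatrix.loc['A', 'B']

# The same value computed for just that pair of columns
r_ab_direct = df['A'].corr(df['B'])

print(r_ab, r_ab_direct)
```

Both values agree, and swapping the labels (`corrMatrix.loc['B', 'A']`) gives the same number because the matrix is symmetric.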

The final lines of the snippet create a visualization of the correlation matrix using Seaborn and Matplotlib. The `heatmap` function takes in the correlation coefficients, annotates each cell with its value (`annot = True`), and colors the cells with a blue gradient (`cmap = 'Blues'`).
