Correlation
Explore how correlation measures relationships between variables in data science. Understand the use of contingency tables for categorical data, scatterplots for continuous data, and how to calculate and interpret Pearson’s r coefficient to assess correlation strength and direction.
Correlation measures the relationship between variables. Variables are not always independent of each other: a change in one is often accompanied by a change in another. In this lesson, we will learn different ways to uncover these relationships.
Contingency table
A contingency table shows the relationship between two categorical variables by counting how many observations fall into each combination of categories.
Suppose we have data on the salaried employees of a company, with two variables: each employee's experience in years and their monthly salary. Here is a sample of the data:
| Years of Experience | Monthly Salary |
|---|---|
| 2 | $3,000 |
| 3 | $3,500 |
| 6 | $5,000 |
| 8 | $5,500 |
| 7 | $5,200 |
| 3 | $4,000 |
| 4 | $4,600 |
| 2 | $2,500 |
| 8 | $6,700 |
| 12 | $8,000 |
| 10 | $9,000 |
| 7 | $6,900 |
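Raw rows like these are hard to interpret directly, but we can group both variables into buckets and count the employees in each combination. Doing this for all 606 employees in the company gives the following contingency table: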
| Experience / Salary | <$2,000 | $2,000 - $5,000 | $5,000 - $8,000 | $8,000+ | Total |
|---|---|---|---|---|---|
| <2 Years | 56 | 25 | 16 | 5 | 102 |
| 2 - 5 Years | 41 | 78 | 58 | 16 | 193 |
| 5 - 10 Years | 21 | 51 | 125 | 69 | 266 |
| 10+ Years | 3 | 8 | 15 | 19 | 45 |
| Total | 121 | 162 | 214 | 109 | 606 |
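A table like this can be built by assigning each employee to a bucket and counting the combinations. The following is a minimal sketch using only the 12 sample rows listed earlier, so the counts are much smaller than in the full table. The `bucket` helper and the edge handling (a value equal to a boundary goes into the higher bucket) are assumptions made for illustration, since the lesson does not say which bucket a boundary value such as $5,000 belongs to.

```python
from bisect import bisect_right
from collections import Counter

# The 12 sample rows: (years of experience, monthly salary in dollars).
rows = [(2, 3000), (3, 3500), (6, 5000), (8, 5500), (7, 5200), (3, 4000),
        (4, 4600), (2, 2500), (8, 6700), (12, 8000), (10, 9000), (7, 6900)]

# Inner bucket edges.
# Assumption: a value equal to an edge is placed in the higher bucket.
exp_edges = [2, 5, 10]
exp_labels = ["<2 Years", "2 - 5 Years", "5 - 10 Years", "10+ Years"]
sal_edges = [2000, 5000, 8000]
sal_labels = ["<$2,000", "$2,000 - $5,000", "$5,000 - $8,000", "$8,000+"]


def bucket(value, edges, labels):
    """Return the label of the bucket that contains value (hypothetical helper)."""
    return labels[bisect_right(edges, value)]


# Count how many employees fall into each (experience, salary) combination.
counts = Counter(
    (bucket(exp, exp_edges, exp_labels), bucket(sal, sal_edges, sal_labels))
    for exp, sal in rows
)

# Print the contingency table with a total for each row.
print("Experience / Salary | " + " | ".join(sal_labels) + " | Total")
for exp_label in exp_labels:
    row = [counts[(exp_label, sal_label)] for sal_label in sal_labels]
    print(exp_label + " | " + " | ".join(str(c) for c in row) + " | " + str(sum(row)))
```

With a data frame library, the same two steps are usually done with `pandas.cut` for the bucketing and `pandas.crosstab` for the counting.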
Now we can see that 56 of the 121 employees who earn less than $2,000 have under 2 years of experience. More experience also corresponds to a higher salary: the most common salary bucket moves up with each experience bracket. So, a contingency table is a good way to understand the relationship between two variables. ...
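One way to check these readings is to convert the counts into shares. The sketch below copies the counts from the contingency table and computes, for each experience bracket, the fraction of employees in each salary bucket. The variable names are illustrative and not part of the lesson.

```python
# Counts copied from the contingency table (rows: experience, columns: salary).
salary_buckets = ["<$2,000", "$2,000 - $5,000", "$5,000 - $8,000", "$8,000+"]
table = {
    "<2 Years": [56, 25, 16, 5],
    "2 - 5 Years": [41, 78, 58, 16],
    "5 - 10 Years": [21, 51, 125, 69],
    "10+ Years": [3, 8, 15, 19],
}

# Share of each experience bracket that falls into each salary bucket.
row_shares = {
    exp: [count / sum(counts) for count in counts]
    for exp, counts in table.items()
}

# Salary bucket with the most employees in each experience bracket.
most_common = {
    exp: salary_buckets[counts.index(max(counts))]
    for exp, counts in table.items()
}

for exp, shares in row_shares.items():
    formatted = ", ".join(f"{share:.0%}" for share in shares)
    print(f"{exp}: {formatted} (most common: {most_common[exp]})")

# Of everyone earning under $2,000, how many have under 2 years of experience?
lowest_total = sum(counts[0] for counts in table.values())
lowest_share = table["<2 Years"][0] / lowest_total
print(f"{table['<2 Years'][0]} of {lowest_total} ({lowest_share:.0%})")
```

The most common salary bucket shifts one step higher with every experience bracket, which is the pattern the table suggests at a glance.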