Before we build a regression model, which means fitting a straight line to the data to predict future values, we first visualize the data to get an idea of how it looks and to spot patterns and relationships.
The data may appear to show a positive linear relationship, but we confirm it by calculating the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between two variables.
Let’s consider a simple Salary Dataset to understand the Pearson correlation coefficient.
The dataset consists of two columns:
- YearsExperience: the number of years a person has been working
- Salary (target): the corresponding annual salary in US dollars
Now we need to build a model that predicts salary based on years of experience.
A simple linear regression model fits this task because we have only one predictor and a continuous target variable.
But can we directly apply the simple linear regression algorithm just like that?
No.
Linear regression rests on several assumptions, and one of them is linearity.
To check linearity, we calculate the correlation coefficient.
But what is linearity?
Let’s understand this with an example.
Image by Author
From the table above, we can see that for every one-year increase in experience, there is a $5,000 increase in salary.
The change is constant, and when we plot these values, we get a straight line.
This type of relationship is called a linear relationship.
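Here is a minimal sketch of that idea in Python. Only the constant $5,000 step comes from the example; the starting salary of 30,000 and the five-year range are assumptions made for illustration, since the table itself is an image.

```python
# Hypothetical values: only the constant 5,000 step per year comes from the text.
years = [1, 2, 3, 4, 5]
salary = [30_000 + 5_000 * (y - 1) for y in years]  # assumed starting salary of 30,000

# Change in salary for each one-year increase in experience.
steps = [later - earlier for earlier, later in zip(salary, salary[1:])]

# With equally spaced years, equal steps mean the points lie on a straight line.
is_linear = len(set(steps)) == 1

print(steps)      # [5000, 5000, 5000, 5000]
print(is_linear)  # True
```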
In simple linear regression, we fit a regression line to the data to predict future values, and this works well only when the data has a linear relationship.
So, we need to check for linearity in our data.
For that, let’s calculate the correlation coefficient.
Before that, we first visualize the data using a scatter plot to get an idea of the relationship between the two variables.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Load the dataset
df = pd.read_csv("C:/Salary_dataset.csv")
# Set plot style
sns.set(style="whitegrid")
# Create scatter plot
plt.figure(figsize=(8, 5))
sns.scatterplot(x='YearsExperience', y='Salary', data=df, color='blue', s=60)
plt.title("Scatter Plot: Years of Experience vs Salary")
plt.xlabel("Years of Experience")
plt.ylabel("Salary (USD)")
plt.tight_layout()
plt.show()
Image by Author
We can observe from the scatter plot that as years of experience increase, salary also tends to increase.
Although the points do not form a perfect straight line, the relationship appears to be strong and linear.
To confirm this, let’s now calculate the Pearson correlation coefficient.
import pandas as pd
# Load the dataset
df = pd.read_csv("C:/Salary_dataset.csv")
# Calculate Pearson correlation
pearson_corr = df['YearsExperience'].corr(df['Salary'], method='pearson')
print(f"Pearson correlation coefficient: {pearson_corr:.4f}")
Pearson correlation coefficient is 0.9782.
The correlation coefficient always lies between -1 and +1.
If it is:
- close to 1: strong positive linear relationship
- close to 0: no linear relationship
- close to -1: strong negative linear relationship
Here, we got a correlation coefficient value of 0.9782, which means the data mostly follows a straight-line pattern, and there is a very strong positive relationship between the variables.
From this, we can conclude that simple linear regression is well suited for modeling this relationship.
But how do we calculate this Pearson correlation coefficient?
Let’s consider a 10-point sample from our dataset.
Image by Author
Now, let’s calculate the Pearson correlation coefficient.
When both X and Y increase together, the correlation is said to be positive. On the other hand, if one increases while the other decreases, the correlation is negative.
First, let’s calculate the variance for each variable.
Variance helps us understand how far the values are spread from the mean.
We’ll start by calculating the variance for X (Years of Experience). To do that, we first need to compute the mean of X.
[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i ]
[ = \frac{1.2 + 3.3 + 3.8 + 4.1 + 5.0 + 5.4 + 8.3 + 8.8 + 9.7 + 10.4}{10} ]
[ = \frac{60.0}{10} ]
[ = 6.0 ]
Next, we subtract the mean from each value and then square the result so that positive and negative deviations do not cancel out.
Image by Author
We’ve calculated the squared deviations of each value from the mean. Now, we can find the variance of X by summing those squared deviations and dividing by n - 1.
[ \text{Sample Variance of } X = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2 ]
[ = \frac{23.04 + 7.29 + 4.84 + 3.61 + 1.00 + 0.36 + 5.29 + 7.84 + 13.69 + 19.36}{10 - 1} ]
[ = \frac{86.32}{9} \approx 9.59 ]
Here we divide by ‘n-1’ because we are working with sample data, and using ‘n-1’ gives us an unbiased estimate of the population variance.
The sample variance of X is 9.59. It is measured in squared years, so it is hard to interpret directly.
Since variance is a squared value, we take the square root to express the spread in the same unit as the original data.
This is called the standard deviation.
[ s_X = \sqrt{\text{Sample Variance}} = \sqrt{9.59} \approx 3.10 ]
The standard deviation of X is 3.10, which means that the values of Years of Experience typically fall about 3.10 years above or below the mean.
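The steps for X can be reproduced with a short script. This is a minimal sketch using only the Python standard library and the ten Years of Experience values listed in the mean calculation; the variable names are illustrative.

```python
import math
import statistics

# The ten YearsExperience values from the 10-point sample.
x = [1.2, 3.3, 3.8, 4.1, 5.0, 5.4, 8.3, 8.8, 9.7, 10.4]

n = len(x)
mean_x = sum(x) / n                               # arithmetic mean
squared_devs = [(xi - mean_x) ** 2 for xi in x]   # squared deviations from the mean
var_x = sum(squared_devs) / (n - 1)               # sample variance (n - 1 denominator)
std_x = math.sqrt(var_x)                          # sample standard deviation

print(f"mean = {mean_x:.2f}, variance = {var_x:.2f}, std = {std_x:.2f}")

# Cross-check: statistics.variance and statistics.stdev also divide by n - 1.
print(math.isclose(var_x, statistics.variance(x)))  # True
print(math.isclose(std_x, statistics.stdev(x)))     # True
```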
In the same way we calculate the variance and standard deviation of ‘Y’.
[ \bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i ]
[ = \frac{39344 + 64446 + 57190 + 56958 + 67939 + 83089 + 113813 + 109432 + 112636 + 122392}{10} ]
[ = \frac{827239}{10} ]
[ = 82,\!723.90 ]
[ \text{Sample Variance of } Y = \frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2 ]
[ = \frac{7,\!898,\!632,\!198.90}{9} \approx 877,\!625,\!799.88 ]
[ \text{Standard Deviation of } Y \text{ is } s_Y = \sqrt{877,\!625,\!799.88} \approx 29,\!624.75 ]
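The figures for Y are large enough that hand arithmetic is error-prone, so here is a quick check. It is a sketch that relies on the standard library's `statistics` module, whose `variance` and `stdev` functions use the same n - 1 denominator.

```python
import statistics

# The ten Salary values from the 10-point sample.
y = [39344, 64446, 57190, 56958, 67939, 83089, 113813, 109432, 112636, 122392]

mean_y = statistics.mean(y)
var_y = statistics.variance(y)   # sample variance, n - 1 in the denominator
std_y = statistics.stdev(y)      # square root of the sample variance

print(f"mean = {mean_y:,.2f}")      # mean = 82,723.90
print(f"variance = {var_y:,.2f}")   # variance = 877,625,799.88
print(f"std = {std_y:,.2f}")        # std = 29,624.75
```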
We calculated the variance and standard deviation of ‘X’ and ‘Y’.
Now, the next step is to calculate the covariance between X and Y.
We already have the means of X and Y, as well as the deviations of each value from their respective means.
Now, we multiply these deviations to see how the two variables vary together.
Image by Author
By multiplying these deviations, we are trying to capture how X and Y move together.
If both X and Y are above their means, then the deviations are positive, which means the product is positive.
If both X and Y are below their means, then the deviations are negative, but since a negative times a negative is positive, the product is positive.
If one is above the mean and the other is below, the product is negative.
This product tells us whether the two variables tend to move in the same direction (both increasing or both decreasing) or in opposite directions.
Using the sum of the product of deviations, we now calculate the sample covariance.
[ \text{Sample Covariance} = \frac{1}{n - 1} \sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) ]
[ = \frac{808771.5}{10 - 1} ]
[ = \frac{808771.5}{9} = 89,\!863.5 ]
We got a sample covariance of 89863.5. This indicates that as experience increases, salary also tends to increase.
But the magnitude of covariance depends on the units of the variables (years × dollars), so it’s not directly interpretable.
On its own, this value reliably shows only the direction of the relationship, not its strength.
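The sign logic and the covariance formula can be sketched in plain Python using the ten sample values from the calculations above. The counters `same_direction` and `opposite_direction` are names introduced here for illustration.

```python
# The 10-point sample: years of experience and salary.
x = [1.2, 3.3, 3.8, 4.1, 5.0, 5.4, 8.3, 8.8, 9.7, 10.4]
y = [39344, 64446, 57190, 56958, 67939, 83089, 113813, 109432, 112636, 122392]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Product of deviations for each point.
products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]

# Positive product: both values on the same side of their means.
# Negative product: one value above its mean, the other below.
same_direction = sum(p > 0 for p in products)
opposite_direction = sum(p < 0 for p in products)

cov_xy = sum(products) / (n - 1)   # sample covariance

print(same_direction, opposite_direction)  # 9 1
print(f"{cov_xy:,.1f}")                    # 89,863.5
```

Nine of the ten products are positive, which is why the covariance comes out positive.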
Now we divide the covariance by the product of the standard deviations of X and Y.
This gives us the Pearson correlation coefficient, which can be thought of as a normalized version of covariance.
Since the standard deviation of X has units of years and Y has units of dollars, multiplying them gives us years times dollars.
These units cancel out when we divide, resulting in the Pearson correlation coefficient, which is unitless.
But the main reason we divide covariance by the standard deviations is to normalize it, so the result is easier to interpret and can be compared across different datasets.
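[ r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y} = \frac{89,\!863.5}{3.097 \times 29,\!624.75} = \frac{89,\!863.5}{91,\!747.85} \approx 0.9795 ]
Here s_X = √(86.32 / 9) ≈ 3.097 is the sample standard deviation of X about its mean of 6.0, kept to three decimals to limit rounding error.
So, the Pearson correlation coefficient (r) for this 10-point sample is 0.9795, close to the 0.9782 we obtained for the full dataset.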
This tells us there’s a very strong positive linear relationship between years of experience and salary.
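As a rough check of the normalization argument, the sketch below computes r from the covariance and the two standard deviations, then repeats the calculation with salary expressed in thousands of dollars. The rescaling and the helper names `sample_cov` and `pearson_r` are choices made here for illustration.

```python
import statistics

x = [1.2, 3.3, 3.8, 4.1, 5.0, 5.4, 8.3, 8.8, 9.7, 10.4]
y = [39344, 64446, 57190, 56958, 67939, 83089, 113813, 109432, 112636, 122392]

def sample_cov(a, b):
    """Sample covariance with n - 1 in the denominator."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def pearson_r(a, b):
    """Covariance divided by the product of the sample standard deviations."""
    return sample_cov(a, b) / (statistics.stdev(a) * statistics.stdev(b))

r = pearson_r(x, y)

# Express salary in thousands of dollars instead of dollars.
y_thousands = [yi / 1000 for yi in y]
cov_ratio = sample_cov(x, y) / sample_cov(x, y_thousands)
r_thousands = pearson_r(x, y_thousands)

print(f"r = {r:.4f}")
print(f"covariance ratio (dollars / thousands) = {cov_ratio:.1f}")  # 1000.0
print(f"r with salary in thousands = {r_thousands:.4f}")
```

The covariance shrinks by a factor of 1,000 when the unit changes, while r stays the same. That is what makes r comparable across datasets.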
This way we find the Pearson correlation coefficient.
The formula for Pearson correlation coefficient is:
[ r = \frac{\text{Cov}(X, Y)}{s_X \cdot s_Y} = \frac{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})} {\sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} ]
[ = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})} {\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2} \cdot \sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} ]
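Because the 1/(n - 1) factors cancel, r can be computed from the three raw sums alone. The function below is a minimal sketch of that simplified form; `pearson_r_from_sums` is a hypothetical name, and the two toy datasets are made up to show the extreme values.

```python
import math

def pearson_r_from_sums(xs, ys):
    """Pearson r from raw sums of deviations; no division by n - 1 is needed."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length, at least 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Toy data (made up): a perfect increasing line and a perfect decreasing line.
print(pearson_r_from_sums([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson_r_from_sums([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```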
We need to make sure certain conditions are met before calculating the Pearson correlation coefficient:
- The relationship between the variables should be linear.
- Both variables should be continuous and numeric.
- There should be no strong outliers.
- For significance tests on r, the data should be approximately normally distributed.
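To see why the outlier condition matters, here is a small sketch that adds one extreme point to the 10-point sample. The outlier (12 years of experience, 20,000 salary) is hypothetical and not part of the dataset; it only illustrates how sensitive r can be.

```python
import math

def pearson_r(xs, ys):
    """Pearson r computed from sums of deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = [1.2, 3.3, 3.8, 4.1, 5.0, 5.4, 8.3, 8.8, 9.7, 10.4]
y = [39344, 64446, 57190, 56958, 67939, 83089, 113813, 109432, 112636, 122392]

r_clean = pearson_r(x, y)

# Hypothetical outlier (not in the dataset): long experience, very low salary.
r_outlier = pearson_r(x + [12.0], y + [20000])

print(f"without outlier: {r_clean:.2f}")
print(f"with outlier:    {r_outlier:.2f}")
```

A single point far from the trend is enough to pull the coefficient down sharply, so it is worth checking the scatter plot for such points before trusting r.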
Dataset
The dataset used in this blog is the Salary dataset.
It is publicly available on Kaggle and is licensed under the Creative Commons Zero (CC0 Public Domain) license. This means it can be freely used, modified, and shared for both non-commercial and commercial purposes without restriction.
I hope this gave you a clear understanding of how the Pearson correlation coefficient is calculated and when it’s used.
Thanks for reading!