17.2. Correlation for Linear Associations#
In the last section, we saw several examples of trends between variables. In Section 10.1, we called these trends associations. There are many types of associations. We discussed earlier how some are causal and some are spurious. We will see in this section that some are strong while some are weak. It is also true that some are linear, and some are non-linear. What does this mean?
Think back to your algebra class where you learned about functions. Some functions take the form of a line when plotted and have the equation format: \(y = mx + b\). This equation tells us that \(x\) and \(y\) are associated and that they have a linear relationship. There are also non-linear associations, for example parabolas. Parabolas have the equation format: \(y = ax^2 + bx + c\), and tell us that \(x\) and \(y\) are related quadratically. In this section, we will focus on measuring the strength of linear associations.
Measuring Association#
What does it mean to measure the strength of an association? The strength of a linear relationship is the extent to which, when the relationship is plotted, the dots cluster around a line. These relationships can be either positive (the line has positive slope) or negative (the line has negative slope). Two variables have a positive relationship when changes in one variable correspond with changes in the same direction in the other variable, while they have a negative relationship when changes in one variable correspond with changes in the opposite direction in the other variable. We can measure both the strength and direction of a linear relationship using a correlation coefficient. A correlation coefficient is a single number with no units that lies between -1 and 1. A correlation coefficient of 1 indicates a perfect positive association while a correlation coefficient of -1 indicates a perfect negative association. The closer the correlation coefficient lies to 0, the weaker the relationship. A correlation coefficient of 0 indicates the variables are uncorrelated.
| Correlation Coefficient Value | Interpretation |
|---|---|
| 1 | Perfect Positive Correlation |
| > 0 | Positively Correlated |
| 0 | Uncorrelated |
| < 0 | Negatively Correlated |
| -1 | Perfect Negative Correlation |
The following figure gives a visual depiction of strong or weak linear relationships compared to nonlinear relationships.

Calculating the Correlation Coefficient#
So, now that we know how to interpret a correlation, how do we calculate a correlation coefficient?
The correlation coefficient needs to show strength and direction and also needs to be unit agnostic. That is, height and weight should have the same correlation with each other regardless of whether height is measured in inches or centimeters. Because of this, the first step in calculating a correlation coefficient is to standardize the data. Recall our discussion of the standard normal distribution in Section 12.3. We transformed data into a standard normal distribution by subtracting the mean and dividing by the standard deviation. This process is called standardization and effectively ensures that all data is measured on the same scale, called the standard unit. A standardized data point is often called a z-score. Standardization makes the data, and as a result the correlation coefficient, unit-less. Below, we’ve written a function that takes in an array of data and returns the standardized version of that data.
def standard_units(my_data):
    '''Takes in an array of data and returns that data standardized by subtracting the mean
    and dividing by the standard deviation'''
    my_mean = np.mean(my_data)
    my_stddev = np.std(my_data, ddof=1)
    standardized_data = (my_data - my_mean) / my_stddev
    return standardized_data
About numpy’s std() function
By default, np.std() calculates the standard deviation using the equation: \(\sigma={\sqrt {\frac {\sum(x_{i}-{\mu})^{2}}{n}}}\). However, statisticians often prefer the sample version: \(s={\sqrt {\frac {\sum(x_{i}-{\bar{x}})^{2}}{n-1}}}\) because it is an ‘unbiased estimator’. The explanation for this is beyond the scope of this book, but to use the preferred formula for standard deviation we need to set the parameter ddof (which stands for delta degrees of freedom) to 1.
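As a quick check of this behavior (using a small, made-up array), we can compare the two versions of the standard deviation and confirm that standardized data ends up with mean 0 and standard deviation 1.
toy = np.array([2, 4, 4, 4, 5, 5, 7, 9])   # a small, made-up array of data

print(np.std(toy))                          # population formula (divides by n): 2.0
print(np.std(toy, ddof=1))                  # sample formula (divides by n - 1): about 2.14

standardized = standard_units(toy)
print(np.mean(standardized))                # approximately 0
print(np.std(standardized, ddof=1))         # 1 (up to floating point)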
Knowing how and when to standardize your data is an important skill that we will discuss in more detail later in this book. For now, let’s see how standardization is used to calculate a correlation coefficient.
A common formula for a correlation coefficient is Pearson’s r, which is calculated for two variables x and y as follows:
\(r=\frac{1}{n-1} \sum_{i=1}^n \underbrace{\left(\frac{x_i - \bar{x}}{s_x}\right)}_{\begin{array}{c} z \text{-score} \\ \text{of } x_i \end{array}} \underbrace{\left(\frac{y_i - \bar{y}}{s_y}\right)}_{\begin{array}{c} z \text{-score} \\ \text{of } y_i \end{array}}\)
Recall that \(\bar{x}\) and \(\bar{y}\) are the sample means and \(s_x\) and \(s_y\) are the sample standard deviations of x and y respectively.
Note
Swapping which variable is on which axis (that is, which variable we call x and which we call y) does not change the correlation! Looking back at the formula for r, multiplication gives the same result regardless of order, so the choice of which variable is x and which is y is irrelevant.
We can use our standard_units function to write a function for calculating r.
def correlation(x, y):
    '''Takes in two arrays x and y and returns
    the correlation coefficient r of x and y.
    Raises an error if x and y are not the same length.'''
    if len(x) != len(y):  # check if the lengths of x and y are the same
        raise ValueError("Length of x and y must be the same.")  # if not, raise an error
    else:  # if they are the same, calculate the correlation
        x_st = standard_units(x)
        y_st = standard_units(y)
        corr = np.sum(x_st * y_st) / (len(x) - 1)
        return corr
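Before moving on, here is a quick sanity check of the function on made-up numbers: a perfect linear relationship should give r = 1, and a variable that decreases as the other increases should give a strongly negative r.
a = np.array([1, 2, 3, 4, 5])       # made-up data
b = 2 * a + 3                       # a perfect linear function of a
c = np.array([10, 9, 7, 4, 0])      # decreases as a increases

print(correlation(a, b))            # 1.0: perfect positive correlation
print(correlation(a, c))            # close to -1: strong negative correlation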
Writing the function is a useful exercise, but as correlation is a commonly used concept in statistics and data science, there are several pre-built functions you can use to calculate r. Here are a few:
- In pandas you can use x.corr(y) to get the r value
- In numpy you can use the function np.corrcoef(x, y), which produces a correlation matrix (more on this later)
Let’s try an example with the pandas .corr() method first. Recall the classical height dataset we used in Section 12.3. This dataset contains heights of parents and their children. We might expect there to be a relationship between a parent’s height and that of their child. Let’s plot this and see.
heights_df = pd.read_csv("../../data/height.csv")
heights_df.head()
| | family | father | mother | midparentHeight | children | childNum | gender | childHeight |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 78.5 | 67.0 | 75.43 | 4 | 1 | male | 73.2 |
| 1 | 1 | 78.5 | 67.0 | 75.43 | 4 | 2 | female | 69.2 |
| 2 | 1 | 78.5 | 67.0 | 75.43 | 4 | 3 | female | 69.0 |
| 3 | 1 | 78.5 | 67.0 | 75.43 | 4 | 4 | female | 69.0 |
| 4 | 2 | 75.5 | 66.5 | 73.66 | 4 | 1 | male | 73.5 |
plt.scatter(heights_df['father'],heights_df['childHeight'])
plt.xlabel("Father's Height")
plt.ylabel("Child's Height")
plt.title("Relationship Between Father and Child Heights")
plt.show()

plt.scatter(heights_df['mother'],heights_df['childHeight'])
plt.xlabel("Mother's Height")
plt.ylabel("Child's Height")
plt.title("Relationship Between Mother and Child Heights")
plt.show()

Both of these relationships appear roughly linear, though by visual inspection the father’s height seems to have the stronger relationship with the child’s height. Next, let’s calculate the correlation coefficients to measure the strength and direction of each relationship.
r_father = heights_df['father'].corr(heights_df['childHeight'])
r_mother = heights_df['mother'].corr(heights_df['childHeight'])
print("Correlation coefficient for relationship between \
\n\tfather's height and child's height:",np.round(r_father,4))
print()
print("Correlation coefficient for relationship between \
\n\tmother's height and child's height:",np.round(r_mother,4))
Correlation coefficient for relationship between
father's height and child's height: 0.266
Correlation coefficient for relationship between
mother's height and child's height: 0.2013
Both correlation coefficients are positive, indicating that taller parents tend to have taller children (and shorter parents tend to have shorter children). This matches what we know about heredity. We can also see that the father’s height has a stronger relationship with the child’s height than the mother’s height does, which matches our visual inspection of the graphs. However, both of these correlation coefficients are fairly small (not close to 1), indicating weak relationships.
A commonly used metric when predicting how tall a child will be is something called ‘midparent height’. Midparent height is the average of the mother’s and father’s heights. Let’s see if midparent height has a stronger correlation with child height.
plt.scatter(heights_df['midparentHeight'],heights_df['childHeight'])
plt.xlabel("Midparent Height")
plt.ylabel("Child's Height")
plt.title("Relationship Between Midparent and Child Heights")
plt.show()

r_midparent = heights_df['midparentHeight'].corr(heights_df['childHeight'])
print("Correlation coefficient for relationship between \
\n\tmidparent height and child's height:",np.round(r_midparent,4))
Correlation coefficient for relationship between
midparent height and child's height: 0.3209
It does! This also matches what we know about heredity. Children get a combination of genes from their mother and father that together decide their physical characteristics, including height.
Correlation Matrices#
Let’s try using the corrcoef() function from numpy to calculate the correlation coefficient.
np.corrcoef(heights_df['midparentHeight'],heights_df['childHeight'])
array([[1. , 0.3209499],
[0.3209499, 1. ]])
The numpy code results in a 2-D array, or matrix. This matrix is called a correlation matrix and gives correlation coefficients for all combinations of variables passed to the function: the entry in row i and column j is the correlation between the i-th and j-th input variables.
In this case, the r value for midparentHeight with itself is 1, as it is perfectly correlated with itself - the same for childHeight and itself. The r value of x = midparentHeight and y = childHeight is 0.32, and this is the same value for y = midparentHeight and x = childHeight. Correlation matrices are more useful when there are more than 2 variables. Let’s try it with both mother’s and father’s heights as well.
np.corrcoef([heights_df['father'],heights_df['mother'],heights_df['midparentHeight'],heights_df['childHeight']])
array([[1. , 0.06036612, 0.72843929, 0.26603854],
[0.06036612, 1. , 0.72783397, 0.20132195],
[0.72843929, 0.72783397, 1. , 0.3209499 ],
[0.26603854, 0.20132195, 0.3209499 , 1. ]])
The 1’s on the diagonal of this matrix represent the correlations between each variable and itself. The off-diagonal values are the correlation coefficients between variables. For example, the first row is all correlation coefficients with father. The last value in this row is the r value for father and childHeight and matches the value we calculated earlier.
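If you want the same matrix with labeled rows and columns, one option (not shown in the original text) is to select the relevant columns of the DataFrame and use the pandas corr() method, which computes all pairwise correlations at once.
# Compute the full correlation matrix with pandas; rows and columns are labeled by column name
heights_df[['father', 'mother', 'midparentHeight', 'childHeight']].corr()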
Note
All correlation matrices are symmetric because changing which variable is x and which is y does not change the value of r.
Keep in Mind#
Recall in Section 10.1 we said that association is not the same as causation. The same is true for correlation. Correlation measures associations between variables; two variables may be associated but not have a causal relationship. In fact, there are many spurious (purely coincidental) correlations between unrelated variables.
It is also important to remember that correlation only works for linear relationships. A correlation coefficient cannot measure the strength of a nonlinear relationship between variables. For this reason, it is very important to make sure you understand your data before making calculations. It is useful to always graph your variables before calculating a correlation coefficient to make sure the relationship isn’t nonlinear.
Correlation coefficients are also susceptible to outliers in your data. Depending on the data and the placement of the outlier, a single outlier can make a strong correlation appear weak or a weak correlation appear strong. It is best practice to remove or otherwise account for outliers in your data to avoid these potential issues.
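To see how much a single point can move r, here is a small sketch using simulated (made-up) data: two unrelated variables have a correlation near 0, but adding one extreme point makes the correlation look strong.
rng = np.random.default_rng(0)           # seeded for reproducibility
x_sim = rng.uniform(0, 10, 30)
y_sim = rng.uniform(0, 10, 30)           # x_sim and y_sim are generated independently

print(np.corrcoef(x_sim, y_sim)[0, 1])   # near 0: no real relationship

# Add a single extreme point far away from the rest of the data
x_out = np.append(x_sim, 100)
y_out = np.append(y_sim, 100)

print(np.corrcoef(x_out, y_out)[0, 1])   # much closer to 1, driven by the one outlier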
To explain this further, let’s use another classical dataset known as Anscombe’s quartet[1].
anscombe_df = pd.read_csv("../../data/anscombe.csv")
anscombe_df
| | x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 10 | 10 | 8 | 8.04 | 9.14 | 7.46 | 6.58 |
| 1 | 8 | 8 | 8 | 8 | 6.95 | 8.14 | 6.77 | 5.76 |
| 2 | 13 | 13 | 13 | 8 | 7.58 | 8.74 | 12.74 | 7.71 |
| 3 | 9 | 9 | 9 | 8 | 8.81 | 8.77 | 7.11 | 8.84 |
| 4 | 11 | 11 | 11 | 8 | 8.33 | 9.26 | 7.81 | 8.47 |
| 5 | 14 | 14 | 14 | 8 | 9.96 | 8.10 | 8.84 | 7.04 |
| 6 | 6 | 6 | 6 | 8 | 7.24 | 6.13 | 6.08 | 5.25 |
| 7 | 4 | 4 | 4 | 19 | 4.26 | 3.10 | 5.39 | 12.50 |
| 8 | 12 | 12 | 12 | 8 | 10.84 | 9.13 | 8.15 | 5.56 |
| 9 | 7 | 7 | 7 | 8 | 4.82 | 7.26 | 6.42 | 7.91 |
| 10 | 5 | 5 | 5 | 8 | 5.68 | 4.74 | 5.73 | 6.89 |
The dataset contains four pairs of variables, \((x_i, y_i)\) for \(i = 1, \dots, 4\).
anscombe_df.describe().iloc[1:3]
| | x1 | x2 | x3 | x4 | y1 | y2 | y3 | y4 |
|---|---|---|---|---|---|---|---|---|
| mean | 9.000000 | 9.000000 | 9.000000 | 9.000000 | 7.500909 | 7.500909 | 7.500000 | 7.500909 |
| std | 3.316625 | 3.316625 | 3.316625 | 3.316625 | 2.031568 | 2.031657 | 2.030424 | 2.030579 |
The four \(x_i\) variables have identical means and standard deviations, and the four \(y_i\) variables have nearly identical means and standard deviations.
[anscombe_df.x1.corr(anscombe_df.y1),
anscombe_df.x2.corr(anscombe_df.y2),
anscombe_df.x3.corr(anscombe_df.y3),
anscombe_df.x4.corr(anscombe_df.y4)]
[0.81642051634484, 0.8162365060002428, 0.8162867394895984, 0.8165214368885028]
They also have almost identical correlations.
However, when you plot them, you see that the patterns are quite different!
plt.figure(figsize=(8, 10))
plt.subplot(2,2,1)
plt.scatter(anscombe_df.x1,anscombe_df.y1,c="black")
plt.title("Dataset 1")
plt.subplot(2,2,2)
plt.scatter(anscombe_df.x2,anscombe_df.y2, c="black")
plt.title("Dataset 2")
plt.subplot(2,2,3)
plt.scatter(anscombe_df.x3,anscombe_df.y3,c="black")
plt.title("Dataset 3")
plt.subplot(2,2,4)
plt.scatter(anscombe_df.x4,anscombe_df.y4,c="black")
plt.title("Dataset 4")
plt.show()

The first dataset shows a strong linear association, while the second shows a clearly non-linear (parabolic) association. If we only looked at descriptive statistics, we would not realize the patterns are different.
Additionally, datasets 3 and 4 both contain outliers. Dataset 3 would have a correlation of 1 if not for its outlier. In dataset 4, every point has the same x value except for the one outlier; without that outlier, x would not vary at all and the correlation would be undefined (the formula would divide by a standard deviation of zero).
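We can check the claim about dataset 3 directly. The sketch below assumes the outlier is the row with the largest y3 value; dropping that row leaves points that fall on a single line, so the correlation becomes 1 (up to the rounding in the published data).
# Drop the row containing the largest y3 value (the outlier in dataset 3)
no_outlier = anscombe_df.drop(anscombe_df.y3.idxmax())
print(no_outlier.x3.corr(no_outlier.y3))   # approximately 1.0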
Anscombe’s quartet shows the importance of visualizing and otherwise inspecting your data before reporting correlation coefficients.
In the next few chapters we will continue our discussion of linear relationships and learn to use patterns like those seen in this chapter to construct predictors.