7 Minute Read

Published 2022


If it snows, how likely is there to be ice? You might say it’s very likely. If so, you have just commented on the correlation between snow and ice. Pearson’s correlation coefficient, r, puts a number on that intuition. It is calculated from paired observations, such as records of when it snowed and what the temperature was, and it ranges from -1 to 1. Essentially, given an independent and a dependent variable (e.g. snowfall and ice), how closely are they correlated?
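To make the idea concrete, here is a minimal sketch of how r could be computed from paired observations, using only the Python standard library. The function name `pearson_r` and the snowfall and ice measurements are made up for illustration; they are not data from this post.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Deviation of every observation from its variable's mean.
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    ss_x = sum(a * a for a in dx)
    ss_y = sum(b * b for b in dy)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("r is undefined when a variable is constant")
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical measurements, for illustration only.
snowfall_cm = [0, 2, 5, 8, 12, 15]
ice_mm = [0, 1, 3, 4, 7, 8]

r_snow_ice = pearson_r(snowfall_cm, ice_mm)
print(round(r_snow_ice, 3))
```

A value close to 1 means the two variables rise together along something close to a straight line, a value close to -1 means one falls as the other rises, and a value near 0 means there is little linear relationship.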

Dangers of correlation

It is important to note that the correlation coefficient is useless on its own. Without further quantitative analysis, such as outlier detection, and qualitative analysis, such as common sense, you would conclude that eating cheese and dying tangled in your bedsheets are related.

r=0.947. More interesting correlations on https://www.tylervigen.com/spurious-correlations

Outliers

The correlation coefficient can be wildly skewed by outliers in your data, so when calculating r, data scientists typically remove them. In practice, however, there is no such thing as an outlier: when describing real-world phenomena, outliers are part of the equation. In machine learning, collecting, cleaning and understanding the data is arguably more important than the machine learning problem itself, and you can typically learn a lot by asking questions about outliers.

In a machine learning problem you never remove outliers; instead, you treat them as a different regression problem and describe them with another function. But why is r sensitive to outliers?
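The sensitivity is easy to see numerically. The sketch below computes r for a small set of made-up points lying close to the line y = 2x, then again after adding one made-up observation far from the rest. All names and values here are assumptions chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r computed from deviations about each mean."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical points lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2]

r_clean = pearson_r(xs, ys)

# The same points plus one hypothetical observation far from the rest.
r_with_outlier = pearson_r(xs + [20], ys + [0.0])

print(f"without outlier: r = {r_clean:.3f}")
print(f"with outlier:    r = {r_with_outlier:.3f}")
```

With these numbers, a single distant point is enough to pull r from close to 1 down to below zero. Each point contributes the product of its deviations from the two means, so a point far from the others adds a term large enough to outweigh the rest.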

The equation

Let’s start with a set of points that we can use to calculate the equation.

The equation below is a simplified representation of the coefficient.