Thursday, February 16, 2023

Correlation is not Causation

 Taking a Chance on Love

Here I go again
I hear those trumpets blow again
All aglow again
Taking a chance on love

Are you considering chance, i.e. randomness, when you are doing regressions?

Statistician George Box famously said that all models are wrong, but some are useful. However, if a model is very wrong but its regression shows a good correlation, it may also be misleading.

Models are typically developed from regressions of observed data. That regression is generally linear but can also be non-linear. However, regression is only the process of fitting coefficients to an assumed pattern in the data. Often embedded in that regression is the assumption that the data comes from a deterministic event that is being observed. This can lead to a regression of the data that is completely wrong even though it appears highly correlated.

For example, a random event will produce a bell-shaped distribution of data. One such distribution is the logistic distribution, also known as the hyperbolic secant squared distribution. If its scale parameter (the range), s, is 0.5, which it should be in an on/off, yes/no, heads/tails distribution, then it should have average odds of 50%, i.e. 0.5. This means that, no matter what the value of the mean is, s should be 0.5. This requires that the Cumulative Distribution Function equal 0.5 at the mean and vary between zero and 1 without repeating. The chart of this distribution, with a mean of zero, is shown in Figure 1.
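As a rough numerical check of the paragraph above, the sketch below evaluates the standard logistic density and CDF with s = 0.5 and compares the density with ½·sech²(x). The function names (`logistic_pdf`, `logistic_cdf`, `half_sech2`) are illustrative assumptions and do not come from the post; only the Python standard library is used.

```python
import math

def logistic_pdf(x, mu=0.0, s=0.5):
    """Logistic density: sech^2((x - mu) / (2 s)) / (4 s)."""
    z = (x - mu) / (2.0 * s)
    return 1.0 / (4.0 * s * math.cosh(z) ** 2)

def logistic_cdf(x, mu=0.0, s=0.5):
    """Logistic CDF: 1 / (1 + exp(-(x - mu) / s))."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def half_sech2(x):
    """The curve 1/2 * sech^2(x) that the post plots in Figure 1."""
    return 0.5 / math.cosh(x) ** 2

# With s = 0.5 and mean 0 the density and 1/2 * sech^2(x) should coincide,
# and the CDF should pass through 0.5 at the mean.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:5.1f}  pdf={logistic_pdf(x):.6f}  "
          f"half_sech2={half_sech2(x):.6f}  cdf={logistic_cdf(x):.6f}")
```

Shifting the mean, for example `logistic_cdf(3.0, mu=3.0)`, should leave the value at the mean at 0.5, which is the point the paragraph makes about s.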

Figure 1

As a random event, it does not repeat and is a plot of ½·sech²(x). However, if the data was erroneously thought not to be random, and expected to repeat, substituting the traditional trigonometric cosine for the hyperbolic secant, i.e. ½·cos²(x), gives almost the same shape as the logistic distribution function around a mean of 0, although it does repeat, as shown in Figure 2.
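One way to see why the two curves look alike near the mean is that both sech²(x) and cos²(x) start as 1 − x² in their series expansions and only differ from the x⁴ term onward. The sketch below tabulates both curves; the helper names are assumptions made for illustration.

```python
import math

def half_sech2(x):
    """Non-repeating curve: 1/2 * sech^2(x)."""
    return 0.5 / math.cosh(x) ** 2

def half_cos2(x):
    """Repeating curve: 1/2 * cos^2(x), period pi."""
    return 0.5 * math.cos(x) ** 2

# Near zero the curves nearly coincide; by x = pi the cosine version has
# returned to its peak while the hyperbolic version has decayed toward zero.
for x in (0.0, 0.25, 0.5, 1.0, math.pi / 2, math.pi):
    print(f"x={x:6.3f}  half_sech2={half_sech2(x):.5f}  "
          f"half_cos2={half_cos2(x):.5f}")
```

The last row is the important one: at x = π the cosine curve is back at ½, which a one-off random event would not do.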

Figure 2

If the data from the logistic distribution between -4.0 and 4.0, which is equivalent to random nonzero x data with a mean of 4, was fitted to the equation a·cos²(b·x) using a non-linear regression, the amplitude of the cosine would be a = 0.33491 and the coefficient b, where b/2π is the inverse of the period of the cosine, would be b = 0.479546, as shown in Figure 3. This is a smaller amplitude and a longer period than the theoretical values of the repeating event (a = ½, b = 1). However, the regression with the random data would be quite good, with a coefficient of determination of 0.784252 and a correlation coefficient of 0.88558.
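A minimal sketch of such a fit, using only the Python standard library, is given below. It is not the procedure used for Figure 3: the evenly spaced, noise-free samples, the 0.1 spacing, and the scan range for b are all assumptions, so the coefficients it prints may land somewhat away from the values quoted above. For a fixed b the best amplitude has a closed form, a = Σ(y·g)/Σ(g²) with g = cos²(b·x), so only b needs to be searched.

```python
import math

# Noise-free samples of 1/2 * sech^2(x) on [-4, 4].
# The 0.1 spacing is an assumption; the post does not say how it sampled.
xs = [i / 10.0 for i in range(-40, 41)]
ys = [0.5 / math.cosh(x) ** 2 for x in xs]

def fit_amplitude(b):
    """Least-squares amplitude of a*cos^2(b*x) for a fixed b, plus its SSE."""
    g = [math.cos(b * x) ** 2 for x in xs]
    a = sum(y * gi for y, gi in zip(ys, g)) / sum(gi * gi for gi in g)
    sse = sum((y - a * gi) ** 2 for y, gi in zip(ys, g))
    return a, sse

# Scan b from 0.05 to 1.5 (assumed range) and keep the smallest SSE.
best_a, best_b, best_sse = None, None, float("inf")
for i in range(50, 1501):
    b = i / 1000.0
    a, sse = fit_amplitude(b)
    if sse < best_sse:
        best_a, best_b, best_sse = a, b, sse

# Coefficient of determination: 1 - SS_residual / SS_total.
mean_y = sum(ys) / len(ys)
sst = sum((y - mean_y) ** 2 for y in ys)
r2 = 1.0 - best_sse / sst

print(f"a = {best_a:.4f}, b = {best_b:.4f}, "
      f"R^2 = {r2:.4f}, sqrt(R^2) = {math.sqrt(r2):.4f}")
```

A library optimizer would normally replace the grid scan; the scan is used here only to keep the sketch dependency-free and easy to follow.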

Figure 3

However, like the trigonometric deterministic function, the regression repeats, while the observed random data does not. Care should be taken to examine the original premise of whether the data is random or deterministic. If the Cumulative Distribution Function, CDF, of the regression were plotted, as it is in Figure 4, it would erroneously show its value continuing to increase as the observation increases. It would also erroneously put the CDF at the mean, in this case zero, at 0%, not 50% as it should be. The regression of data only suggests the correlation within the range of the data. Caution should be used when making assumptions outside of that range.
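The behaviour described above can be sketched by integrating the fitted curve in closed form, since the integral of cos²(B·t) from 0 to x is x/2 + sin(2·B·x)/(4·B). Starting the running integral at x = 0 is an assumption made here to mirror the 0% value at the mean; the post does not state how Figure 4 was computed. The coefficients are the ones quoted in the post.

```python
import math

A, B = 0.33491, 0.479546  # regression coefficients quoted in the post

def fitted_running_integral(x):
    """Integral of A*cos^2(B*t) from t = 0 to x (start point is an assumption)."""
    return A * (x / 2.0 + math.sin(2.0 * B * x) / (4.0 * B))

def logistic_cdf(x, s=0.5):
    """CDF of the logistic distribution with mean 0 and scale s."""
    return 1.0 / (1.0 + math.exp(-x / s))

# The logistic CDF levels off at 1; the integral of the fitted cosine
# keeps growing because the fitted curve repeats forever.
for x in (0.0, 2.0, 4.0, 8.0, 16.0, 32.0):
    print(f"x={x:5.1f}  fitted={fitted_running_integral(x):8.4f}  "
          f"logistic={logistic_cdf(x):.6f}")
```

Any value above 1 in the `fitted` column is outside what a probability can be, which is one way to spot that the repeating model has been pushed beyond the range of the data.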

Figure 4

If the regression had been to the hyperbolic secant squared, a·sech²(b·x), then the amplitude, a, would be ½ and b would be 1, which is consistent with a period of 2πi that only repeats in the imaginary plane. The Coefficient of Determination would be 1 and the Correlation Coefficient would also be 1. A good fit of the data to a deterministic equation, in this case a coefficient of determination of almost 0.78, could still come from data that is actually random and will not repeat, even if the regression assumes that it will repeat.
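The claim that the hyperbolic form fits exactly can be checked with a short sketch. As before, the noise-free samples on [-4, 4], the 0.1 spacing, and the scan range for b are assumptions made for illustration, not details from the post.

```python
import math

# Noise-free samples of 1/2 * sech^2(x) on [-4, 4] (spacing is an assumption).
xs = [i / 10.0 for i in range(-40, 41)]
ys = [0.5 / math.cosh(x) ** 2 for x in xs]

def fit_sech2_amplitude(b):
    """Least-squares amplitude of a*sech^2(b*x) for a fixed b, plus its SSE."""
    g = [1.0 / math.cosh(b * x) ** 2 for x in xs]
    a = sum(y * gi for y, gi in zip(ys, g)) / sum(gi * gi for gi in g)
    sse = sum((y - a * gi) ** 2 for y, gi in zip(ys, g))
    return a, sse

# Scan b from 0.05 to 1.5 (assumed range) and keep the smallest SSE.
best_a, best_b, best_sse = None, None, float("inf")
for i in range(50, 1501):
    b = i / 1000.0
    a, sse = fit_sech2_amplitude(b)
    if sse < best_sse:
        best_a, best_b, best_sse = a, b, sse

# Coefficient of determination: 1 - SS_residual / SS_total.
mean_y = sum(ys) / len(ys)
sst = sum((y - mean_y) ** 2 for y in ys)
r2 = 1.0 - best_sse / sst

print(f"a = {best_a:.6f}, b = {best_b:.3f}, R^2 = {r2:.6f}")
```

Because the samples here carry no noise, the fit is exact up to rounding; with real observations the Coefficient of Determination would fall below 1 even for the correct model.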

 

