Taking a Chance on Love
Are you considering chance, that is, randomness, when you run regressions?
Statistician George Box famously said that all models are
wrong, but some are useful. However, a model that is very wrong can also be
misleading if its regression shows a good correlation.
Models are typically developed from regressions of observed
data. That regression is generally linear but can also be non-linear. However,
regression is only the process of developing coefficients that are validated
against an assumed pattern in the data. A fundamental assumption often
embedded in that regression is that the data comes from a deterministic
event which is being observed. This can lead to a regression of the data that
is completely wrong although it appears highly correlated.
For example, a random event will produce a bell-shaped distribution of data. One such distribution is the logistic distribution, also known as the hyperbolic secant squared distribution. If its scale parameter, s (called the range here), is 0.5, which it should be in an on/off, yes/no, heads/tails distribution, then it should have average odds of 50%, i.e. 0.5. This means that, no matter what the value of the mean is, the range, s, should be 0.5. With s = 0.5 the density is ½ sech²(x − μ), where μ is the mean. This requires that the Cumulative Distribution Function equal 0.5 at the mean and vary between 0 and 1 without repeating. The chart of this distribution, with a mean of zero, is shown in Figure 1.
Figure 1
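The claims above can be checked numerically. The following is a minimal sketch, assuming the standard textbook form of the logistic density and CDF; the function names are made up for illustration and are not from the original post.

```python
import math

def logistic_pdf(x, mu=0.0, s=0.5):
    """Logistic density: sech^2((x - mu) / (2 s)) / (4 s)."""
    z = (x - mu) / (2.0 * s)
    return 1.0 / (4.0 * s * math.cosh(z) ** 2)

def logistic_cdf(x, mu=0.0, s=0.5):
    """Logistic CDF: 1 / (1 + exp(-(x - mu) / s))."""
    return 1.0 / (1.0 + math.exp(-(x - mu) / s))

def half_sech2(x):
    """The curve plotted in Figure 1: 1/2 * sech^2(x)."""
    return 0.5 / math.cosh(x) ** 2

# With s = 0.5 the density and 1/2 sech^2(x) coincide, and the CDF
# passes through 0.5 at the mean while staying between 0 and 1.
for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:5.1f}  pdf={logistic_pdf(x):.6f}  "
          f"half_sech2={half_sech2(x):.6f}  cdf={logistic_cdf(x):.6f}")
```

The CDF rises monotonically from near 0 to near 1 over this range, which is the non-repeating behavior the text describes.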
As a random event, it does not repeat, and it is a plot of ½ sech²(x). However, if the data was erroneously thought not to be random, and therefore expected to repeat, the hyperbolic secant could be replaced with a traditional trigonometric cosine, i.e. ½ cos²(x). That curve has almost the same shape as the logistic distribution function around a mean of 0, although it does repeat, as shown in Figure 2.
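To see how close the two curves are near the mean, and how they part ways further out, here is a small sketch that tabulates both. The sample points are arbitrary choices made for illustration.

```python
import math

def half_sech2(x):
    """Non-repeating curve: 1/2 * sech^2(x)."""
    return 0.5 / math.cosh(x) ** 2

def half_cos2(x):
    """Repeating look-alike: 1/2 * cos^2(x)."""
    return 0.5 * math.cos(x) ** 2

# Near x = 0 the two agree closely; at x = pi the cosine curve is back
# at its peak while the hyperbolic secant curve has nearly vanished.
for x in (0.0, 0.25, 0.5, 1.0, math.pi):
    print(f"x={x:6.3f}  half_sech2={half_sech2(x):.5f}  half_cos2={half_cos2(x):.5f}")
```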
If the data from the logistic distribution between -4.0 and 4.0, which is equivalent to random x data between 0 and 8 with a mean of 4, was fitted to the equation a cos²(bx) using a non-linear regression, the amplitude of the cosine would be a = 0.33491 and the parameter b, which is proportional to the inverse of the period, would be b = 0.479546, as shown in Figure 3. This is a smaller amplitude and a longer period than the theoretical values, a = 0.5 and b = 1, of the repeating function ½ cos²(x). However, the regression against the random data would appear quite good, with a coefficient of determination of 0.784252 and a correlation coefficient of 0.88558.
Figure 3
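A regression of this kind can be sketched in plain Python. The post does not say how the x values were sampled or which solver was used, so the evenly spaced grid and the simple scan over b below are assumptions. The fitted numbers therefore need not match the quoted coefficients digit for digit, though they should show the same pattern of a smaller amplitude and a longer period.

```python
import math

def half_sech2(x):
    """Data-generating curve: 1/2 * sech^2(x)."""
    return 0.5 / math.cosh(x) ** 2

# Assumed sampling: 81 evenly spaced points on [-4, 4].
xs = [-4.0 + 0.1 * i for i in range(81)]
ys = [half_sech2(x) for x in xs]

def best_amplitude(b):
    """For a fixed b, the least-squares amplitude a has a closed form."""
    g = [math.cos(b * x) ** 2 for x in xs]
    return sum(y * gi for y, gi in zip(ys, g)) / sum(gi * gi for gi in g)

def sse(a, b):
    """Sum of squared errors of a*cos^2(b*x) against the data."""
    return sum((y - a * math.cos(b * x) ** 2) ** 2 for x, y in zip(xs, ys))

def fit_cos2(b_lo=0.05, b_hi=1.5, steps=1451):
    """Scan b on a fine grid, solving for a exactly at each step."""
    best = None
    for i in range(steps):
        b = b_lo + (b_hi - b_lo) * i / (steps - 1)
        a = best_amplitude(b)
        err = sse(a, b)
        if best is None or err < best[2]:
            best = (a, b, err)
    return best

a_fit, b_fit, sse_fit = fit_cos2()

# Coefficient of determination relative to the mean of the data.
mean_y = sum(ys) / len(ys)
sst = sum((y - mean_y) ** 2 for y in ys)
r_squared = 1.0 - sse_fit / sst
r = math.sqrt(max(r_squared, 0.0))

print(f"a={a_fit:.5f}  b={b_fit:.5f}  R^2={r_squared:.5f}  r={r:.5f}")
```

Splitting the problem this way works because the model is linear in a, so only b needs a search. A general-purpose non-linear least-squares routine would serve equally well.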
Like the deterministic trigonometric function, however, the regression repeats, while the observed random data does not. Care should be taken to examine the original premise of whether the data is random or deterministic. If the Cumulative Distribution Function, CDF, of the regression were shown, as it is in Figure 4, it would erroneously show a value that keeps increasing, without leveling off at 1, as the observation increases. It would also erroneously imply that the CDF at the mean, in this case zero, is 0%, not 50% as it should be. The regression of data only suggests the correlation within the range of the data. Caution should be used when making assumptions outside of the range of that data.
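The contrast between the two cumulative curves can be sketched as follows. The coefficients are the ones quoted in the text. Integrating the fitted curve from zero is an assumption, chosen because it reproduces the 0% value at the mean described above.

```python
import math

# Coefficients quoted in the text for the fitted a*cos^2(b*x).
A_FIT = 0.33491
B_FIT = 0.479546

def logistic_cdf(x):
    """CDF of 1/2 sech^2(x): 1/2 * (1 + tanh(x))."""
    return 0.5 * (1.0 + math.tanh(x))

def regression_cumulative(x, a=A_FIT, b=B_FIT):
    """Integral of a*cos^2(b*t) from 0 to x (assumed lower limit)."""
    return a * (x / 2.0 + math.sin(2.0 * b * x) / (4.0 * b))

# The true CDF is 0.5 at the mean and levels off at 1; the integrated
# regression starts at 0 and grows past 1 as x increases.
for x in (0.0, 2.0, 4.0, 10.0, 20.0):
    print(f"x={x:5.1f}  logistic_cdf={logistic_cdf(x):.5f}  "
          f"regression={regression_cumulative(x):.5f}")
```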
Figure 4
If the regression had been to the hyperbolic secant squared, a sech²(bx),
then the amplitude, a, would be ½ and b would be 1, which is consistent
with a period of 2πi that only repeats in the imaginary plane. The
Coefficient of Determination would be 1 and the Correlation Coefficient would
also be 1. A good correlation of the data with a deterministic equation, in
this case a coefficient of determination of about 0.78, does not rule out that
the data is actually random and will not repeat, even if the regression
assumes that it will repeat.
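The remark about the imaginary period can be checked with Python's complex math module. This is a small sketch; the test point x = 0.7 is an arbitrary choice for illustration.

```python
import cmath
import math

def half_sech2(z):
    """1/2 * sech^2(z), evaluated for real or complex z."""
    return 0.5 / cmath.cosh(z) ** 2

x = 0.7  # arbitrary test point
base = half_sech2(x)
real_shift = half_sech2(x + 2.0 * math.pi)   # shift along the real axis
imag_shift = half_sech2(x + 2.0j * math.pi)  # shift along the imaginary axis

# The value returns only after the imaginary shift; along the real axis,
# where observations live, the curve has decayed and does not come back.
print(f"base       = {base.real:.6f}")
print(f"real shift = {real_shift.real:.6f}")
print(f"imag shift = {imag_shift.real:.6f}")
```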