## Monday, May 12, 2008

### Statistical Analysis

Even if you only casually read through news websites (such as those of CNN or FOXNews), several times per month you will notice headlines such as the following:

TOO MUCH, TOO LITTLE SLEEP TIED TO ILL HEALTH IN CDC STUDY

Study: Long-Term Breast-Feeding Will Raise Child's IQ

WOMEN, WANT A HEALTHY MARRIAGE? MARRY MAN UGLIER THAN YOU, STUDY SAYS

STUDY: FOOD IN MCDONALD'S WRAPPER TASTES BETTER TO KIDS

Study: 1 in 50 U.S. babies abused, neglected in 2006

And naturally we’re all aware of the competing studies that exist too. One study shows that eggs are bad for you; another that they’re good for you. One study shows how margarine is a healthier alternative to butter; another that butter is better for you. With so many competing studies, you can find a scientific backing for just about any position you want to take (especially in health matters).

The existence of so many studies helps to emphasize a point regarding statistical analysis. Despite being a powerful tool, if you do not set up the guidelines and restrictions for your samples properly any statistics you observe won’t amount to a hill of beans. And we’re not even talking about the inherent fluctuations that require the existence of error bars (that’s the line that says +/- 3%, for example). Nor are we even addressing political manipulation of statistics in the form of pollaganda. Instead, I’m talking about something at the heart of statistics itself—it’s a universal.

To demonstrate what it is, let us first ask a simple question. When we do a statistical analysis of some observation, for what reason are we doing it? As you can see in the above headline examples, most of the time studies are done to find a causal link between some object and/or action and some result. Thus, the first headline above says that too much or too little sleep (the cause) is “tied” to “ill health” (the effect). We also see that women should marry uglier men for a healthy marriage (in a study obviously written by an ugly man).

Now let us assume that there is a correlation that all these studies found. Let us assume that it is the case that people who sleep less than six hours a night weigh more than those who sleep eight hours a night, and that women who married uglier men (however that is defined) are in healthier (however that is defined) marriages. The fact of the matter is that when you compare any subset of a group, however you wish to define that subset, with the rest of the group as a whole, you will find things that the small group has in common at a statistically higher rate than the group as a whole. This happens automatically and does not mean that it is relevant in a causative sense!

To give a simple example, let’s examine hockey (since I like hockey). There are 30 teams in the NHL. Of those 30 teams, 7 are named after animals (the Penguins, Bruins, Thrashers, Panthers, Ducks, Coyotes, and Sharks) and 7 are named after people-groups (the Islanders, Rangers, Canadiens, Senators, Blackhawks, Oilers, and Kings). Each group of 7 constitutes 23% of the teams in the League.

There have been 80 Stanley Cups awarded since 1926. During that time, teams named after animals have won 8 Stanley Cups, which means that they won 10%. However, teams named after people-groups have won 39 Stanley Cups during that time, which means they won 49% of them. Clearly, having a team named after a people-group instead of after an animal provides a statistical advantage to a hockey team…

Perhaps someone could argue that the statistical data isn’t fair. After all, the Thrashers (1999), Panthers (1993), Ducks (1993), Coyotes (1996), and Sharks (1991) are all teams that did not exist before the 1990s! On the other hand, the Rangers, Canadiens, Senators, and Blackhawks all existed in 1926 (the start of this survey). Furthermore, the Kings were founded in 1967, the Oilers in 1971 and the Islanders in 1972. Of the animal teams, only the Bruins were around in 1926 (the Penguins were founded in 1967). Thus, using 1926 as the baseline (since before that there were other teams besides just NHL teams that could play for the Cup), the average year of founding for animal teams is 1981 and for people-group teams it’s 1945.

However, we can adjust for that. Dividing each group’s average age (about 26 and 62 years, counting from the average founding years above through 2007) by its Cup total, animal teams have won a Cup for every 3.25 years they’ve existed, while people-group teams have won a Cup for every 1.59 years they’ve existed. Clearly, it still remains better to have a team named after a people-group than an animal. (And I’m not biased since I cheer for the Avalanche, which is neither a people-group nor an animal…)
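For anyone who wants to check the arithmetic, here is a minimal sketch of the adjustment. The founding years and Cup totals are the ones given in this post; the reference year of 2007 is an assumption, chosen because it reproduces the 3.25 and 1.59 figures, and the function name `summarize` is just an illustrative helper.

```python
# Founding years as used in the post (1926 is the baseline for older teams).
ANIMAL_FOUNDED = {
    "Bruins": 1926, "Penguins": 1967, "Thrashers": 1999, "Panthers": 1993,
    "Ducks": 1993, "Coyotes": 1996, "Sharks": 1991,
}
PEOPLE_FOUNDED = {
    "Rangers": 1926, "Canadiens": 1926, "Senators": 1926, "Blackhawks": 1926,
    "Kings": 1967, "Oilers": 1971, "Islanders": 1972,
}

TOTAL_CUPS = 80        # Cups awarded since 1926, per the post
ANIMAL_CUPS = 8
PEOPLE_CUPS = 39
REFERENCE_YEAR = 2007  # assumption: the year that reproduces the post's figures


def summarize(founded, cups):
    """Return share of Cups, rounded average founding year, and years per Cup."""
    avg_founded = round(sum(founded.values()) / len(founded))
    avg_age = REFERENCE_YEAR - avg_founded
    return {
        "share_of_cups": cups / TOTAL_CUPS,
        "avg_founded": avg_founded,
        "years_per_cup": avg_age / cups,
    }


if __name__ == "__main__":
    print("animal teams:", summarize(ANIMAL_FOUNDED, ANIMAL_CUPS))
    print("people-group teams:", summarize(PEOPLE_FOUNDED, PEOPLE_CUPS))
```

Note that this divides an average team age by a group's Cup total, which is a rough adjustment at best. It does not account for how many teams were competing in any given season.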

Now here’s the thing. The statistical data that I’ve given here is all correct (assuming I didn’t make any typos or anything of that nature), but every rational person would immediately recognize that the type of name a sports team has, has no bearing on the performance of that team. This is an attribute that is linked statistically, but the statistical linkage is accidental rather than causative.

Every time that we do these surveys and examine the numbers, we have to realize that some number of the things discovered in common will be accidental correlations. The problem is that we ignore most of these connections. And when I say we ignore them, I don’t mean that we test the data and then say, “This isn’t relevant,” but that we do not even look for them in the first place. After all, were it not for the fact that I was looking for an example for this blog entry I would never have cared what percentage of teams named after animals won the Stanley Cup. This correlation would have been excluded a priori as being irrelevant.

But these irrelevant correlations are important to statistical analysis! Why? Because a certain percentage of linkages are accidental, and we have to account for them in our conclusions. In other words, we have to have some way of determining if the link we discover is causative or if it is merely the kind of statistical fluke you get when examining hockey mascots. And that means that we would need to examine all possible connections and discard those that are accidental in order to find out whether that percentage of accidental links has been accounted for.

That, however, is impractical to the point of impossibility. After all, it is relatively easy to come up with statistical correlations between things. For instance, with my hockey example it took me all of 15 minutes to come up with that correlation. The longest part was pulling up the Wiki sheets on the number of Stanley Cup wins various teams had had. Indeed, based on my experience I would argue that it is so easy to come up with meaningless links between data that it will always remain more likely that a correlation is accidental than causative. That is, for every one true causative link between a subset of a group and the average of the entire group, I would argue there are several accidental links. And these accidental links are not always as obviously accidental as the examples I’ve given. (For a less obvious example, think of the correlation between diabetes and obesity. Does one cause the other? Or is it just a statistical fluke, similar to the names of hockey teams?)
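How easy is it to stumble onto a meaningless link? Here is a rough simulation sketch, not a rigorous analysis. It invents a league of 30 teams, picks 7 "winners" at random, and then hands out 200 coin-flip traits that by construction cannot cause anything. Every name and number in it (`count_spurious_traits`, the 200 traits, the 30-point gap) is a made-up assumption for illustration.

```python
import random


def count_spurious_traits(n_teams=30, n_winners=7, n_traits=200, gap=0.3, seed=0):
    """Count random yes/no traits whose frequency among the winners differs
    from the frequency among the other teams by at least `gap`."""
    rng = random.Random(seed)
    winners = set(rng.sample(range(n_teams), n_winners))
    others = [i for i in range(n_teams) if i not in winners]

    spurious = 0
    for _ in range(n_traits):
        # Each trait is an independent coin flip per team, so it is
        # unrelated to winning by construction.
        has_trait = [rng.random() < 0.5 for _ in range(n_teams)]
        winner_rate = sum(has_trait[i] for i in winners) / n_winners
        other_rate = sum(has_trait[i] for i in others) / len(others)
        if abs(winner_rate - other_rate) >= gap:
            spurious += 1
    return spurious


if __name__ == "__main__":
    found = count_spurious_traits()
    print(f"{found} of 200 purely random traits look 'linked' to winning")
```

Changing the seed changes the count, but with a subset this small some traits will clear the bar on chance alone, and those are the ones a headline writer would notice.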

So there are some ways to salvage statistics. But it requires that we be able to conduct further tests with our predictions in place in order to sort out whether we have a meaningful causative link or a meaningless accidental link. If we cannot conduct those further tests, then any causative links will be lost in the noise of the countless accidental links. They may be true, but it is impossible to verify them.

1. Nice post, Peter.

The fact of the matter is that when you compare any subset of a group, however you wish to define that subset, with the rest of the group as a whole, you will find things that the small group has in common at a statistically higher rate than the group as a whole.

What do you mean by "statistically higher rate"?

Clearly, having a team named after a people-group instead of after an animal provides a statistical advantage to a hockey team...

What do you mean by "statistical advantage"? Do you know that 39% versus 10% of cups won since 1926 is a statistically significant difference?

This is an attribute that is linked statistically, but the statistical linkage is accidental rather than causative.

I'm not sure what you mean by "linked statistically"? Are you saying that there is a statistically significant correlation between team name and cups won? If so, I'd be interested in how strong the association is.

(For a less obvious example, think of the correlation between diabetes and obesity. Does one cause the other? Or is it just a statistical fluke, similar to the names of hockey teams?)

If we assume the pathologies of diabetes and obesity to be correct (theory), I think it's quite clear that the correlation is not coincidental. And, of course, theory and research design are what will allow us to make strong inferences from the data, not the statistical analyses.

On the other hand, if it is accidental then it is a random linkage, and random linkages will break down through further testing.

Are you sure about that? Is the association between standing height and shoe size random?

the fact that people-group teams have won more Stanley Cups than animal teams does not help us predict who will win the Stanley Cup this year or next year or the year after that; therefore, it is an accidental link rather than a causative link.

Actually, it could. If there is a strong association, which is statistically significant, between team name and cups won, then it will predict a good portion of the unexplained variance.

Sound theory and research designs are what will allow for sound inferences from the data, not statistical analyses.

2. Barnzilla said:
---
What do you mean by "statistically higher rate"?
---

The ratio yields a larger (higher) number. Example: 2/3 = 0.667. 2/4 = 0.500. 2/3 is a larger number. If something occurs 2 out of 3 times it will happen more often than when it occurs 2 out of 4 times.

If you take any subset of a group, you can find some factor that they have in common at a higher rate than the group as a whole. I gave examples of this already, but perhaps a contrived example will show it clearly.

Suppose that 10% of basketball players wear glasses. That's 1 in 10. But suppose that four of the starters for The World Champion Team wear glasses. That's 4 in 5, which is 8 in 10. 8/10 > 1/10. Therefore, one could argue that the reason The World Champion Team is the World Champion Team is because they wear glasses at a higher rate than average.

Barnzilla said:
---
What do you mean by "statistical advantage"?
---

I mean that those with a certain factor statistically succeed more often than those without the factor. Again, I gave examples of this.

Barnzilla said:
---
Do you know that 39% versus 10% of cups won since 1926 is a statistically significant difference?
---

Vegas must love you if you can't tell the answer to that :-)

Would you be better off putting a \$100 bet for a chance to win the \$1,000,000 jackpot in a slot machine that gave you a 39/100 shot of winning or in a slot machine that gave you a 10/100 shot of winning?

Barnzilla said:
---
I'm not sure what you mean by "linked statistically"?
---

On this one I didn't spell out the implied link, but it is "linked statistically to winning the cup." And I've demonstrated how in the post.

Barnzilla said:
---
Are you saying that there is a statistically significant correlation between team name and cups won?
---

No, you're adding the word "significant" there. That's a value judgment. As I said, it's an accidental linkage, so it's not statistically significant at all. However, you cannot tell that from the math alone. Which was sort of the point of my post.

Barnzilla said:
---
If we assume the pathologies of diabetes and obesity to be correct (theory), I think it's quite clear that the correlation is not coincidental.
---

That's the same thing as saying, "If we assume the link isn't coincidental then I think it's quite clear that the correlation is not coincidental."

Barnzilla said:
---
And, of course, theory and research design are what will allow us to make strong inferences from the data, not the statistical analyses.
---

Isn't that basically what I've said in my post? I.e., just because you have a mathematical correlation doesn't mean it's relevant. I'm pretty sure that's why I said: "Despite being a powerful tool, if you do not set up the guidelines and restrictions for your samples properly any statistics you observe won’t amount to a hill of beans."

Barnzilla said:
---
Is the association between standing height and shoe size random?
---

I've never bothered to study the correlation.

I originally said:
---
the fact that people-group teams have won more Stanley Cups than animal teams does not help us predict who will win the Stanley Cup this year or next year or the year after that; therefore, it is an accidental link rather than a causative link.
---

Barnzilla said:
---
Actually, it could.
---

Again, you're Vegas's best friend.

Barnzilla said:
---
If there is a strong association, which is statistically significant, between team name and cups won, then it will predict a good portion of the unexplained variance.
---

Except there isn't a "strong association" because it's not statistically relevant. That's why I used this as an example. It's one that's obviously not correlated, yet you can easily figure out statistics to make it appear to be correlated. And, as I've argued, you can do that with any sample because it's an inherent flaw in the system of statistics.

That's why the only statistics that are relevant are ones that provide us with a predictive theory that is then confirmed. If it can't do this, then there is no way to differentiate between a legitimate linkage and a random linkage that comes about as a result of the way humans organize data.

3. Thanks for the response, Peter.

The ratio yields a larger (higher) number.

Just because a number is larger does not mean that it is, ipso facto, *statistically* larger.

If you take any subset of a group, you can find some factor that they have in common at a higher rate than the group as a whole.

What if the subset is representative of the group from whence it is taken on all factors?

As an aside, I'd use "prevalence" or "frequency" instead of "rate." ;)

I gave examples of this already.

The examples you gave (hours of sleep and body weight, ugly men and healthy marriages) don't prove the point.

Suppose that 10% of basketball players wear glasses. That's 1 in 10. But suppose that four of the starters for The World Champion Team wear glasses. That's 4 in 5, which is 8 in 10. 8/10 > 1/10.

But suppose that 0/5 starters wear glasses. What then? Is "the fact of the matter that when you compare any subset of a group with the rest of the group as a whole...you will find things that the small group has in common at a statistically higher rate than the group as a whole"? No.

Further, it's not *statistically* higher or lower by the mere presence of a discrepancy between the prevalences in the sample and population.

I mean that those with a certain factor statistically succeed more often than those without the factor.

You're still failing to explain what you mean by "statistically." For example, are you suggesting that given the fact that there is a difference in the number of cups won by people-group versus animal teams that in a regression analysis, team name type will be a *statistically significant* predictor of cups won?

I agree with you if by "statistically succeed" you simply mean that we have observed that people-group teams have won more cups than animal teams. But why you're using "statistical" and "statistically" is not clear to me. For example, have you analyzed the difference in cups won between the subsets (team name type) and found it to be a true difference at an alpha level of 0.05?

Vegas must love you if you can't tell the answer to that :)

So you're suggesting that 39% versus 10% is statistically significant by the very fact that the two numbers are not equal?

Would you be better off putting a \$100 bet for a chance to win the \$1,000,000 jackpot in a slot machine that gave you a 39/100 shot of winning or in a slot machine that gave you a 10/100 shot of winning?

Well, if the 39/100 and 10/100 are statistically significant, then obviously it would be better to go with the former. But, unfortunately, in all the examples you have given in your original post, you haven't established statistical significance, let alone defined your terms.

On this one I didn't spell out the implied link, but it is "linked statistically to winning the cup." And I've demonstrated how in the post.

I understand that it's linked to winning the cup. I'm asking what you mean by "linked statistically." Please clarify. For example, are you suggesting that this is a statistically significant correlation?

No, you're adding the word "significant" there.

In my use of the term "statistical," that's what it means. An observation that is statistical is one that is significant at some alpha level (e.g., 0.05).

As I said, it's an accidental linkage, so it's not statistically significant at all.

Peter, I find you very confusing. In your original post, you said that team name type is "linked statistically" although being accidental not causative. Here you're telling me that it's not statistically significant. What, then, does it mean to you for something to be linked statistically?

However, you cannot tell that from the math alone. Which was sort of the point of my post.

You're confusing significance with meaningfulness. A non-causative link or correlation can most certainly be statistically significant. But it does not follow that it is meaningful or a causative explanation. Meaningfulness can be got at, partially, through statistical calculations like effect sizes. But at the end of the day, meaningfulness and causation are based on theory and logic.

That's the same thing as saying, "If we assume the link isn't coincidental then I think it's quite clear that the correlation is not coincidental."

Yup, that's exactly right. Theory is what drives the inference we make, not statistical analyses.

Isn't that basically what I've said in my post?

You tell me. :)

I find it an interesting post, Peter. I just think your terminology is confusing and I also think you've made generalizations and given examples that don't actually lead to the conclusion you're making.

I've never bothered to study the correlation.

It's not necessary. The point is, a non-causative correlation is not necessarily random and one which will "break down" upon further testing.

Again, you're Vegas's best friend.

Not at all. I'm calling you on a statement that you have no rational basis to make. Have you analyzed the data statistically and determined that team name type is not a significant predictor of cups won? A non-causative correlation is not, ipso facto, a random and non-significant predictor. Take age and IQ during the formative years, as another example.

Except there isn't a "strong association" because it's not statistically relevant.

I don't even know what "statistically relevant" means. Do you?

If you mean that a non-causative correlation can't be strong because it's non-causative, you're mistaken. Look at the relationship between age and IQ or age and shoe size during the formative years. You'll find a strong association there.

It's one that's obviously not correlated, yet you can easily figure out statistics to make it appear to be correlated. And, as I've argued, you can do that with any sample because it's an inherent flaw in the system of statistics.

Okay, I don't know what your background is, but I think you have a basic misunderstanding of statistics, and I don't mean to be nasty in saying that. Many variables that are not in a causative relationship correlate strongly. That's not a weakness or flaw to statistical analyses. Nor is it a mere "appearance." Statistically, they correlate strongly. Plain and simple. But as the adage goes, "Correlation does not imply causation." Statistical analyses are not intended to spit out a causative link. That conclusion requires sound theory, a sound research design, and the laws of logic.

Nice exchange.
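The question raised in this exchange, whether the Cup gap would pass a test at an alpha level of 0.05, can be sketched with a deliberately crude one-sided binomial test. It uses only the aggregate totals from the post (39 and 8 Cups) and rests on strong assumptions: each of the 47 Cups is treated as an independent 50/50 draw between the two groups, and the differing lifespans of the teams are ignored. The helper name `binomial_upper_tail` is illustrative.

```python
from math import comb


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


PEOPLE_CUPS = 39  # totals taken from the post
ANIMAL_CUPS = 8
ALPHA = 0.05

# Assumption: under the null, each Cup is equally likely to go to either group.
p_value = binomial_upper_tail(PEOPLE_CUPS, PEOPLE_CUPS + ANIMAL_CUPS)

if __name__ == "__main__":
    print(f"one-sided p-value: {p_value:.2e}")
    print("significant at alpha = 0.05:", p_value < ALPHA)
```

Under these assumptions the p-value comes out well below 0.05, which fits the point made above: a link can be statistically significant without being causal. The team-age confound alone is enough to make this naive test unreliable as evidence of anything.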

4. Barnzilla,

I use the term "statistic" in its most simplistic form: a ratio. Namely, it's the ratio of a factor's presence in a population either to the factor's absence in a population or to the population as a whole. Thus, if 1 in 7 people have brown hair (I made this up, but it's just an example) then 1/7 is a statistic. It shows the presence of a factor (brown hair) compared to the population as a whole.

But since you have said you include "significance" within your definition of "statistic" then that probably explains most of the misunderstanding there. Although I do have to say that I think there are times you don't apply your definition throughout too. For example, you said:

---
Many variables that are not in a causative relationship correlate strongly. That's not a weakness or flaw to statistical analyses. Nor is it a mere "appearance." Statistically, they correlate strongly.
---

This makes it seem as if you're using the term "statistic" in the same way that I did, which is to say that the ratios do exist. We observe them. They're there. (But of course their existence cannot imply causality.)

However, elsewhere you say:
---
I agree with you if by "statistically succeed" you simply mean that we have observed that people-group teams have won more cups than animal teams. But why you're using "statistical" and "statistically" is not clear to me. For example, have you analyzed the difference in cups won between the subsets (team name type) and found it to be a true difference at an alpha level of 0.05?
---

At this point, you're no longer speaking of the simple ratio of a factor to its absence or to the population as a whole. Instead, you're now employing processes within the science of statistics and equating those to the term "statistics."

I think ultimately this is where the confusion is arising, because we come to the same conclusions. We both agree that the presence of a correlation does not imply causation.

Now my ultimate point in the post was to demonstrate that not all statistical occurrences demonstrate causality. Thus, when I say that something is not "statistically relevant" it means that the statistic (that is, the ratio of a certain factor in a subset to its absence or the population as a whole) is not relevant to determining whether that factor caused the subset to become the subset. There are accidental correlations (by accidental I mean non-causal). Ultimately, boil it all down, this is what my post was demonstrating.

Now I'm fully aware that statisticians already know that correlation doesn't imply causality. However, this is often lost on people. That's why the media hypes all these studies without (generally) bothering to say, "Oh, this may just be a random correlation instead of a cause."

I agree that these don't harm the science of statistics, but only because the science of statistics already tries to neutralize this thinking. However, most people who use statistics are not statisticians, and therefore fall prey to the fallacy.

Finally, I'll leave you with the words of David Raup speaking on extinctions, which says basically what I was trying to say too:
---
Size is not the only trait that suggests a proneness to extinction. It is commonly held, for example, that tropical organisms are more likely to go extinct than their relatives in cooler climates. Planktonic organisms are said to be at greater risk than bottom-dwelling aquatics, and marine reef communities more vulnerable than nonreef communities.

My own feeling is that most of these claims are not worth a damn! Sadly, to test such claims is nearly impossible. Let me explain. Suppose we are studying one particular extinction event and have a list of victims and survivors. Such lists tend to be rather short, especially if we are working at a high taxonomic level (order, class, family). …Small numbers make statistical testing tricky.

Once we have the lists, we must search for common denominators: characteristics shared by most victims but not survivors, or vice versa. This is straightforward, and we have seen the results in the case of mammalian body size. The problem is that organisms have a virtually unlimited number of characteristics that might be important: anatomical, behavioral, physiological, geographical, ecological, and even genealogical. We can compare lists of victims and survivors with as many different traits as we have energy. If the lists are not long, it becomes virtually inevitable that we will find one or more traits that match the lists closely enough for us to make a case.

If we find an interesting correlation by this procedure, we can apply standard statistical tests to evaluate the possibility that the correlation is due to chance alone. Each such test asks, in one way or another, “What is the probability that the random sprinkling of a particular trait among species would, by chance, yield a correlation as good as the one we observe?” If that probability turns out to be very low—say, 5 percent or less—we feel comfortable in rejecting random sprinkling and concluding that the observed correlation is true cause and effect.

The fatal flaw in this logic is that testing cannot be adjusted for the fact that we tried many traits before finding a promising one. Remember that one out of every twenty completely random sprinklings will, on average, pass our test if odds of twenty to one are considered acceptable—as is common in scientific research. Because it is virtually impossible to keep track of the number of traits we have considered—many were discarded at a glance—we cannot evaluate the test results for any one trait.

This problem is not unique to paleontology, or to science either. If you have difficulty accepting my reasoning, try some experiments yourself. Take some baseball statistics or election results or anything that will provide a list of winners and losers. Fifty or a hundred results should be adequate. Then inspect the list to see what characteristics the winners or the losers have in common. The pattern does not have to be perfectly consistent—a statistical tendency is enough—and you are free to change the ground rules as you go along. You can even redefine winner and loser if this will help. Pay special attention to the smaller category of outcomes. For example, you may wish to compare characteristics of first-place baseball teams with those of all other teams. The shorter list (first-place teams) is more likely to have things in common than the longer list. If so, you may be able to venture conclusions like “Most managers (or all, if you are lucky) of first-place teams are firstborns, whereas managers of other teams follow the national average.”

(Extinction: Bad Luck or Bad Genes? 1991. New York: W. W. Norton & Company, Inc. pp. 96-97).
---

Raup gave a tongue-in-cheek example using the 1984 edition of the Reader’s Digest World Wide Atlas to demonstrate that the most populous cities begin with letters in the last half of the alphabet, and that people therefore tend to flock towards cities that have this attribute. The data is simple. The seven most populous cities (in 1984) were: Tokyo-Yokohama, New York City, Mexico City, Osaka-Kobe-Kyoto, Sao Paulo, Seoul, and Moscow. All of them start with letters in the M-Z range of the alphabet. The next seven cities, however, were: Calcutta, Buenos Aires, London, Bombay, Los Angeles, Cairo, and Rio de Janeiro. Of these, only Rio de Janeiro begins with a letter in the M-Z range and so does not fit the pattern. Thus, Raup states (again, tongue-in-cheek): “The statistical likelihood that this was caused by chance alone is so small that rejection of a hypothesis of randomness is routine. Cause and effect is clearly indicated (p. 99).”
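Raup's "one out of every twenty" remark can be put into numbers with a small sketch. It assumes the traits are independent and purely random, which real traits rarely are, so treat the output as a rough illustration. The function name `chance_of_false_positive` is made up for this example.

```python
def chance_of_false_positive(n_traits, alpha=0.05):
    """Probability that at least one of n independent, purely random traits
    passes a significance test at the given alpha level."""
    return 1 - (1 - alpha) ** n_traits


if __name__ == "__main__":
    for n in (1, 5, 20, 50, 100):
        print(f"{n:>3} traits tried: {chance_of_false_positive(n):.0%} "
              "chance that at least one looks significant")
```

With twenty traits tried, the chance that at least one passes is already about 64 percent. Since nobody counts the traits they discarded at a glance, the true `n_traits` is unknown, which is exactly Raup's complaint.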

5. But since you have said you include "significance" within your definition of "statistic" then that probably explains most of the misunderstanding there.

Agreed.

Although I do have to say that I think there are times you don't apply your definition throughout too....This makes it seem as if you're using the term "statistic" in the same way that I did, which is to say that the ratios do exist. We observe them. They're there. (But of course their existence cannot imply causality.)

No, my usage is consistent, Peter. When I say that non-causative variables can correlate strongly, I mean that their correlation is statistically significant (e.g., IQ and shoe size during the formative years).

Keep in mind that statistical significance is not tantamount to causation. Statistical analyses cannot get at causation. Causal inferences require *interpretation* of the statistical analyses which, of course, involves sound theory, research design, etc.

I think ultimately this is where the confusion is arising, because we come to the same conclusions. We both agree that the presence of a correlation does not imply causation.

I hope we agree. :) It's one of the most basic ideas within statistics.

You have, however, created a lot of confusion as I've noted already. For example, you said earlier, "Except there isn't a 'strong association' because it's not statistically relevant." This is not only confusing, it's plain wrong. IQ and shoe size have a very strong association during the formative years, even a statistically significant association. Statistical "relevance," however you define that, is completely irrelevant, ironically enough :), as to whether the association is strong.

There are accidental correlations (by accidental I mean non-causal). Ultimately, boil it all down, this is what my post was demonstrating.

But I think you used some examples and made some generalizations that are incorrect (as already mentioned in my second comment).

Now I'm fully aware that statisticians already know that correlation doesn't imply causality. However, this is often lost on people. That's why the media hypes all these studies without (generally) bothering to say, "Oh, this may just be a random correlation instead of a cause."

I wouldn't blame it all on the media necessarily. Scientists get attached to their research and often overstep the data with their conclusions.

Also, there are fields of study that make causal conclusions from descriptive (as opposed to experimental) study designs simply because they aren't able to use an experimental design for ethical reasons, etc. (e.g. whether smoking causes cancer). The field of epidemiology quickly comes to mind.

Thanks for the quote.