A Data Science Central Community
Every day, several articles are published and cross-posted in major news outlets, featuring a new statistical discovery, often based on hard facts but taken out of context.
Here's an example: Could A Daily Dose Of Red Wine Reduce One's Risk Of Depression? This article was published in Forbes on 8/31.
This article is actually not that bad, because it tries to answer some of my questions, and the author mentions a few caveats, as well as how the survey was done.
Sometimes, statistics and advice like this come from the government. This is worse as some people (not me) tend to believe and comply with government recommendations, no matter how wrong they might be (e.g. due to poor design of experiment, or coming up with a solution that is worse than the problem because side effects are ignored, or wrong on purpose to serve political interests).
What is the worst advice, based on pseudo-analytic studies, that you have ever heard?
Another statistic that should be taken with a grain of salt is how much sea levels will rise with global warming, as ice sheets in Antarctica and Greenland melt.
These stats might be very true, but alarmists forget three important facts:
Ooops...global warming proponents may have to re-adjust their biases:
"Arctic sea ice up 60 percent in 2013"
Read more: http://www.foxnews.com/science/2013/09/09/arctic-sea-ice-up-60-perc...
A scientist recently quoted a study which correlated 'lack of orange juice drinking' with 'propensity to die in a road accident'. Apparently the correlation was sound, and had to do with motorcyclists being statistically averse to OJ. However, it is a good example of correlation, not causation. Clearly some deeper analysis needs to be done (e.g. is there an inverse correlation between healthy eating habits and risky sports?).
The great danger is that corporations and governments simply start enacting policy based on first-order correlations and not even telling us (e.g. you don't drink orange juice, therefore higher auto insurance rates). There is a 'dark side' to analytics when lazy bureaucracies and computer-generated correlations masquerading as causation collide...
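To make the orange juice point concrete, here is a minimal simulation sketch. Every number in it (the share of motorcyclists, their OJ habits, the fatality risks) is invented for illustration and does not come from the study mentioned above. Fatality risk depends only on whether someone rides a motorcycle, yet OJ drinking still looks "protective" until you split the data by rider status.

```python
import random

def simulate(n=400_000, seed=0):
    """Return (rider, drinks_oj, fatal) tuples for n hypothetical drivers."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Assumption: 10% of people ride motorcycles.
        rider = rng.random() < 0.10
        # Assumption: riders drink OJ less often (20% vs 60%).
        drinks_oj = rng.random() < (0.20 if rider else 0.60)
        # Fatality risk depends ONLY on riding, never on OJ.
        fatal = rng.random() < (0.020 if rider else 0.002)
        rows.append((rider, drinks_oj, fatal))
    return rows

def fatality_rate(rows, drinks_oj, rider=None):
    """Fatality rate among people with the given OJ habit, optionally within one rider group."""
    subset = [r for r in rows
              if r[1] == drinks_oj and (rider is None or r[0] == rider)]
    return sum(r[2] for r in subset) / len(subset)

rows = simulate()
print("all, no OJ:      ", fatality_rate(rows, False))
print("all, OJ:         ", fatality_rate(rows, True))
print("riders, no OJ:   ", fatality_rate(rows, False, rider=True))
print("riders, OJ:      ", fatality_rate(rows, True, rider=True))
print("non-riders, no OJ:", fatality_rate(rows, False, rider=False))
print("non-riders, OJ:  ", fatality_rate(rows, True, rider=False))
```

In the pooled data the non-OJ group has a clearly higher fatality rate, but within riders and within non-riders the two rates are roughly equal. An insurer pricing on the pooled correlation would be charging for OJ habits when the real driver is the motorcycle.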
I hate the manipulation done on gender equality statistics.
Sociological studies show that men and women are equally aggressive verbally and in minor physical attacks, but men injure their partners more often because they are usually bigger.
Sometimes the message is that the man is always the offender.
About the wine: for a long time I have seen studies pointing out that small quantities of wine lead to a longer life. I think it is not the wine itself that has the effect, but the socializing that goes with wine drinking, which leads to benefits when done on a small scale.
I was a government statistician from 1987 to 1996 and still stay in touch with many of my colleagues. If you've never worked for the government, you might not know that there is a strong sense of pride among us that we present data carefully and objectively and free from political pressures. That may sound like a biased perspective, but there are many outsiders who have great admiration for the quality of work done by the Census Bureau, the Bureau of Labor Statistics, the Centers for Disease Control and Prevention (my former organization), and many others. That does not mean that we're perfect, but only that we try hard. Also, for every person who "tends to believe and comply with government recommendations" there are ten who will reflexively mistrust any statistics produced by the government. Both blind acceptance and cynical disregard are dangerous extremes.
You do raise a lot of interesting questions about dietary studies in general, and until we are able to incarcerate people and forcibly feed them according to an experimental protocol, the evidence that we gather about dietary habits (including the consumption of wine) will be limited and weak. That doesn't mean that you disregard studies of diet. It just means that you demand a persuasive biological mechanism. You also look for credible replications. A dose-response pattern is also helpful in establishing the credibility of a study. Animal studies, though not persuasive by themselves, can often strengthen an observational finding in humans.
The big problem, which you don't state directly, is that the media is often not sufficiently skeptical of new research and does not include the appropriate cautions and caveats. It's hard to write a readable news article if you're always hemming and hawing about limitations, but good journalists are able to write well and still be fair about the limitations of the study. Sadly, the good journalists are in short supply.
Steve Simon, www.pmean.com
There are two issues that come to mind with government stats:
1) Clinical trials for FDA drug approval: the issue is in the way some of the participants are recruited, related to financial incentives, introducing biases. It's not the statistician's fault, but it certainly has an impact on the validity of the conclusions drawn from such clinical trials.
2) Census data: Why do we need to spend 2 billion dollars on this, every x years? Why each year do hundreds of thousands of people receive a leaflet (survey) with tons of invasive questions that they must answer? It is as if the Census Bureau does not know that there is a science called statistical inference, with design of experiments and sampling. I am sure the proportion of people lying on these forms is very significant. Knowing how personal, private data is exploited by NSA and other third parties in this country, indeed everyone should lie or not fill in these forms, and claim they don't speak English or any other known language when they get a visit from a census inspector. And what about those speaking neither English nor Spanish, not understanding the questions, and providing wrong answers due to language barriers? Even a question such as "how many people live in this house" is subject to multiple interpretations, depending on how you define "live", "people" and "this house" (does the basement count, what if the person has been in this house for only one month, or what if it's an illegal alien and you do not count them as a person, etc.?)
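As a rough sketch of the sampling argument in point 2, the textbook margin of error for an estimated proportion can be computed as below. This assumes simple random sampling from a large population and a 95% confidence level (z = 1.96); both are assumptions for illustration, real surveys use more complex designs, and the function names are made up for this example.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion.

    p: estimated proportion, n: sample size, z: critical value (1.96 ~ 95%).
    Assumes simple random sampling from a population much larger than n.
    """
    return z * math.sqrt(p * (1 - p) / n)

def required_sample_size(moe, p=0.5, z=1.96):
    """Smallest n whose margin of error is at most moe (p=0.5 is the worst case)."""
    return math.ceil(z ** 2 * p * (1 - p) / moe ** 2)

# Margin of error shrinks with the square root of the sample size.
for n in (1_000, 10_000, 100_000):
    print(n, round(margin_of_error(0.5, n), 4))

# Sample size needed for a given precision on a single national estimate.
print(required_sample_size(0.01))
```

Note that this only covers one estimate for the whole population. Estimates for small areas or small subgroups each need their own adequate sample, which is where the cost of a sample design grows.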
Clearly, you are not familiar with the American Community Survey, which uses a very sophisticated sampling scheme. A nice introduction is at http://www.amstat.org/sections/srms/proceedings/y2007/Files/JSM2007.... Furthermore, the Census Bureau provides critical resources for big data folks like you. I'm guessing that you don't do a lot of GIS work. It's hard to imagine how that sort of work would proceed without the US Census. Finally, the comment about the NSA is rather bizarre. If you are worried about the NSA, then the places you should stay away from are private Internet companies like Google and Facebook and private phone companies like AT&T and Verizon. There have been no reports of the NSA stealing data from the Census Bureau. Quite honestly, the type of data that the Census Bureau collects is not current enough to be useful for anything the NSA wants to do.
You correctly note that the problem with trials submitted to FDA is not the fault of FDA. The problem is the statisticians at the pharmaceutical companies, who design trials that are often biased. The FDA tries its best to prevent this, but they are understaffed and outgunned. One proposed solution to this is forcing pharmaceutical companies to disclose their research protocols and share their data openly. This would allow you and me to serve as an additional check against abuses like the Cox-2 inhibitors.
Most of my PhD research and post doc work involved spatio-temporal modeling, which is sometimes called GIS, especially by engineers. And you can get the same predictive power and information that the Census Bureau gets, for a tiny fraction of the cost. How do you think Zillow builds statistical models to estimate the value of any single home in the US? Do you think they spend $2 billion per year to gather the data? No, it's based on GIS models, sampling, data gathering and harvesting both internally and using external sources. And their stats are updated almost in real time, not every 10 years.
So Zillow is "harvesting both internally and using external sources"? Would one of those external sources be the American Community Survey, made available for free by the U.S. Census Bureau? Maybe they are using the map information provided in the TIGER database, another U.S. Census product? I don't know Zillow. But I do know that there are many small and large businesses who are doing great things with the products of the U.S. Census Bureau.
Also, if you have a gripe about the 10 year gap between censuses, take it up with John Rutledge, who wrote that requirement into the U.S. Constitution in 1787.
Sure the government is full of bureaucracy and paperwork. But someone who chooses to work in that environment can still produce good quality work. I'm proud of the contributions I made from 1987 to 1996. Many of my colleagues who are still with the government and who are far better than I are doing even greater things.
Steve Simon, www.pmean.com
Much of the data that Zillow uses is historical home sales. Most homes were once sold, sometimes multiple times, providing a wealth of information. Home "for sale" listings usually have an ID, and each sale has a transaction ID. Indeed, the Census Bureau could certainly leverage the data source in question.
Another comment about government statisticians: they are very different from other statisticians. Not that they are better or worse, just very different. In my case, as an independent data scientist and highly creative statistician, I would never be able to land a government job: the bureaucracy (just the paperwork to submit your application) is so overwhelming that I could never get through it even if I tried hard (I feel the same about getting health insurance under Obamacare). So government is eliminating statisticians like me from the pool of applicants. Maybe that's a good thing, maybe not, but the implication is that statistical analyses produced by the government have a very special flavor. Also, working for the Census Bureau requires US citizenship. This further reduces the pool of talented statisticians from which they can hire.