A Data Science Central Community
The U.S. Census Bureau recently held a contest to see who could develop the best statistical model to predict census mail return rates. Contestants could use a set of block group-level data provided by the Census plus whatever else they could get their hands on (provided it was freely available and they shared their sources). I’d never taken part in this kind of predictive analytics competition before, but as a sometime SAS programmer and a demographer by training, I found that this one caught my attention. The contest was run by Kaggle, and 291 contestants participated (I came in 47th…). It was interesting not just from a data-mining-geek standpoint, but also as a chance to see how crowdsourcing might be harnessed to provide new approaches and solutions to decades-old problems.
Asked to comment, Roderick Little, the Census Bureau’s associate director for research and methodology, told SurveyPost: “I’d say there is a value to the Census Bureau in introducing new modeling ideas, and also advertising to the outside statistical community that there are interesting problems in the federal government, making it a good place to work.” I have to agree: the challenge got me to think outside my typical approaches to problem solving, and the prospect of being a winner amongst data miners was alluring (as was the $14,000 top prize!).
Bill Bame, a software developer, was the winner of the Census Return Rate Challenge. I chatted with Bill to get his perspective on the Census challenge and his experience with crowdsourced contests in general. Some of our chat follows:
Joe: I saw that you consider yourself an “algorithmist” rather than a mathematician or statistician by training. How does your background influence your approach toward problem-solving?
Bill: I think my background makes me a pragmatist rather than a theoretician. Don’t get me wrong, I have tremendous respect for people who can do theoretical work, but in my endeavors the “just make it work” approach almost always seems to be the goal. I also have many interests outside of this sort of thing, and I think that helps greatly.
Joe: I know you participated in the Netflix Prize and other challenges before the Census contest. What motivated you to first take part in challenges of this kind?
Bill: Honestly? The Netflix challenge was just a lark. Just a fun bit of entertainment and something to encourage me to learn some new things. That applies, more or less, to other similar things I’ve attempted. I never go into these things with an eye toward winning – and really never expected to do as well as I have.
Joe: How about the Census challenge? Was there something in particular that caught your interest with this one and did you have experience working with Census data prior to the challenge?
Bill: I had never worked with Census data before, but I had worked with researchers who used TIGER files and other things in their work. At the beginning of the challenge I thought it might be fun just to draw some maps showing trends in various ways, maybe take a stab at the visualization competition. After all, maps are fun in general, and I like thinking about how to create visualizations that will be meaningful to people who aren’t experts in the data being represented. But then my analysis of the data became interesting in itself, and before I knew what was happening I had thrown together some predictive models… that happens sometimes.
Joe: Have your thoughts about the decennial Census changed as a result of working on this challenge?
Bill: The only real surprise in the data, for me, was the renter/owner dichotomy. I suspected that there would be some predictive value there, but it turned out to be the single most important predictor. To me this implies that the Census Bureau would be well served by embarking on an education campaign, targeted at renters, prior to the next Census – but it’s probably not as simple as that. And there were some interactions between that and other variables that might complicate that plan significantly.
Joe: What do you think of the challenge as a model for solving particularly “challenging” research questions? What limitations do you see for the challenge arrangement for solving problems? Are there certain problems it may not be right for?
Bill: I think the whole competitive crowd-sourcing model is great for solving many kinds of problems, and can bring truly new ideas to otherwise stagnant problem areas, but like anything else it can be abused – and it’s not always the best way to get the job done. What I mean is, the sponsors can get cheap and hopefully valuable insights into their data, and the participants (and audience? is there an audience?) get entertainment, education (same thing for me) and possibly prizes and/or prestige. But that can turn ugly if the amount of work involved doesn’t match the “purse.” As for what it’s not very good at, I don’t think it’s particularly good at the “just give me some ideas” sorts of questions – it’s much better suited to the analytical and/or “visualizable” – currently at least.
Joe: That reminds me of the contest we ran last year to generate some items to include in a survey. We were astounded by the range of topics submitted and people who entered. We tried to keep the work required low as we knew we could only select a handful of winners. So, if you had to design your own challenge, what would it be about?
Bill: Design my own challenge? Wow, I’ve never actually thought about that… I’ve done a fair amount of investigation into what’s usually termed synthetic or artificial creativity – computer-generated art of various types. I think it would be a blast to hold a competition along those lines. On the other hand I don’t think there would be a way to score such a competition quantitatively, and I’m not generally a fan of “judgment by panel”-style competitions. Still, it might be fun, and would certainly yield fascinating results.
Joe: So, any big plans for the Census challenge prize money?
Bill: I’ve got one daughter in college, and another entering in the fall… need I say more? But I did skim a chunk off the top to build a new number-cruncher. Its predecessor was built for the Netflix Prize – which should tell you how old it was!