A Data Science Central Community
Kaggle is currently hosting a bioinformatics contest, which requires participants to pick markers in a series of HIV genetic sequences that correlate with a change in viral load (a measure of the severity of infection). Within a week and a half, the best submission had already outdone the best methods in the scientific literature.
This result neatly illustrates the strength of data modeling competitions. Whereas scientific literature tends to evolve slowly (somebody writes a paper, somebody else tweaks that paper and so on), a competition inspires rapid innovation by introducing the problem to a wide audience. Countless approaches can be applied to any modeling task, and it is impossible to know at the outset which technique will be most effective. By putting a problem in front of many participants, competitions subject it to a range of different techniques. This maximises the chances of finding a solution and gets the most out of any particular dataset, given its inherent noise and richness.
Competitions can do more than generate optimal results for specific problems. They can also help to correct a coordination problem in the wider research community. It need hardly be observed that data is being collected in greater volumes and at greater speeds than ever before. Innovations such as the Human Genome Project, high-resolution camera-clad telescopes and other advanced data collection instruments mean that researchers in many fields are inundated with data. But those collecting the data do not necessarily have the best means to analyse it. A single researcher is unlikely to have access to the most advanced machine learning, statistical and other techniques that would allow them to get the most out of their datasets. At the same time, many data mining and statistics researchers find it difficult to access real-world datasets, and develop their techniques on whatever data they have access to.
Kaggle aims to address this coordination problem. Data-rich researchers can post their datasets and have them scrutinised by analytics-rich researchers. This gives data-rich researchers access to cutting-edge techniques, and analytics-rich researchers access to new datasets and current problems.
Data modeling competitions are particularly powerful because they facilitate real-time science. Consider this week’s announcement about the discovery of genetic markers that correlate with extreme longevity. Work on the study began in 1995, with results published in 2010. Had the study been run as a data modeling competition, the results would have been generated in real time and insights would have been available much sooner (and with a higher level of precision).