
Kaggle aims to help companies and researchers make predictions more precise by providing a platform for data prediction competitions. Competitions turn out to be a great way to get the most out of a dataset. This is because there are infinitely many approaches to any data modeling problem. By opening up a data prediction problem to a wide audience, a competition makes it possible to get to the frontier of what is possible given a dataset's inherent noise and richness.

Data modeling competitions can facilitate real-time science. Consider the recent announcement about the discovery of genetic markers that correlate with extreme longevity. Work on the study began in 1995, with results published in 2010. Had the study been run as a data modeling competition, the results would have been generated in real time and insights available much sooner (and with a higher level of precision).

Data modeling competitions also benchmark, in real time, new techniques against old. A technique that performs well in competitions can prove its mettle long before any paper can be published, helping the science to progress more quickly.

Competitions also help to avoid situations where valuable techniques are overlooked by the scientific establishment. This aspect of the case for competitions is neatly illustrated by Ruslan Salakhutdinov, now a postdoctoral fellow at the Massachusetts Institute of Technology, who had a new algorithm rejected by the NIPS conference. According to Ruslan, the reviewer ‘basically said “it’s junk and I am very confident it’s junk”’. It later turned out that his algorithm was good enough to make him an early leader in the Netflix Prize (he called his Netflix Prize team NIPS_reject).

Companies can use Kaggle to gain an advantage over their competitors. Consider a bank that wants to improve the algorithms that vet loan applicants. If a bank can develop a more effective algorithm, it will suffer fewer defaults and can charge lower interest rates than its competitors. Kaggle has proven to be an effective way to improve existing models very quickly.

Competitions are also useful to companies that want to develop new products and capabilities. Consider a hedge fund that wants to generate long-range weather forecasts in key agricultural regions. It can try to hire a weather-forecasting expert, or it can use Kaggle to throw the problem open to a wide audience and draw on many competing approaches at once, surfacing strong results quickly.

**How is the best model selected?**

The competition host typically splits their dataset into two parts: a training dataset and a test dataset. The training dataset includes all explanatory variables as well as the dependent variable (the answer). The test dataset also includes all the explanatory variables, but the dependent variable is withheld.

Participants train their models on the training dataset. They then apply their models to generate predictions on the test dataset. Those predictions are scored on the fly against the actual answers (using one of several evaluation methods). Once the competition deadline passes, the team that generated the most accurate predictions hands over the winning methodology to the competition host in exchange for the prize money.
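The workflow above can be sketched in a few lines of Python. This is a hypothetical, self-contained illustration (not Kaggle's actual infrastructure): a toy labeled dataset is split into train and test portions, the test answers are withheld, a deliberately simple participant model is fitted on the training data, and the "submission" is scored with RMSE, one common evaluation metric.

```python
import math
import random

random.seed(0)

# Toy labeled dataset: (explanatory variable, dependent variable) pairs,
# generated as y = 2x plus Gaussian noise.
data = [(x, 2.0 * x + random.gauss(0.0, 1.0)) for x in range(100)]
random.shuffle(data)

# The host's split: participants see `train` in full, but only the
# explanatory variables of `test`; the test answers are withheld.
train, test = data[:80], data[80:]
x_test = [x for x, _ in test]
y_test = [y for _, y in test]  # withheld, used only for scoring

# A participant's (deliberately simple) model: a least-squares
# slope through the origin, fitted on the training data only.
slope = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# The participant submits predictions for the test set.
predictions = [slope * x for x in x_test]

# The host scores the submission against the withheld answers, e.g. RMSE.
rmse = math.sqrt(
    sum((p - y) ** 2 for p, y in zip(predictions, y_test)) / len(y_test)
)
print(f"leaderboard score (RMSE): {rmse:.3f}")
```

Since the noise has unit standard deviation, a well-fitted model should score an RMSE near 1; the leaderboard simply ranks teams by this score, which is why the metric chosen by the host shapes the whole competition.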

© 2020 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
