By Arkadiusz Paterek, finalist in the Netflix competition.
- introduction to predictive modeling,
- a comprehensive summary of the Netflix Prize,
- detailed description of my top-50 Netflix Prize solution predicting movie ratings,
- summary of methods published by others: RMSEs from different papers listed and grouped in one place,
- detailed analysis of matrix factorizations / regularized SVD,
- how to interpret the factorization results - new, most informative movie genres (see how I use it here and here),
- how to adapt the algorithms developed for the Netflix Prize to calculate good quality personalized recommendations,
- dealing with the cold-start: simple content-based augmentation,
- description of two rating-based recommender systems that I built (see one of them in action),
- commentary on everything: novel and unique insights, know-how from >9 years of practicing and analysing predictive modeling.
See the sample part (pdf) containing the abstract, the table of contents and a sample section, or see the abstract (html) alone.
- people interested in a comprehensive summary of the developments around the Netflix Prize contest,
- for people developing recommender systems based on ratings: the publication can potentially save you hundreds of hours of work, and may give you a technical edge over the competition.
Can be useful for:
- people interested in machine learning and prediction: the Netflix Prize task is a rare case of a prediction task that has been analysed well,
- for people competing in prediction contests, to better understand the time-efficient way to obtain maximally accurate predictions,
- for software developers trying to write their own recommender system or wanting to understand the know-how behind recommender systems,
- for students of physics and other natural sciences, to better understand how to make the best use of gathered data, how to properly take into account the different kinds of uncertainty that are always present in inference, no matter how much data is gathered, and to learn how to perform model identification in a time-efficient way by maintaining ensembles of methods,
- for applied mathematicians who want to see the surprising but necessary complexity behind a simply formulated real-life prediction task,
- for traders, risk specialists, gamblers and bookmakers, who need very accurate predictions, up to the last one percent of accuracy possible,
- for data analysts, to learn tricks from another experienced data analyst, to learn how to develop simpler and more accurate methods with less effort, and to better master the data analysis process: choosing the right task, gathering the right data, identifying the underlying probabilistic model, and finding the best methods for solving the task,
- for everyone who was taught the maximum likelihood method and other methods of classical statistics, to learn about the more accurate approximate Bayesian approaches,
- for anyone planning a career in one of the top 5 professions (according to this study): software engineer, mathematician, actuary, statistician, computer systems analyst, to see what practical, modern data analysis is about, a subject rarely taught properly at universities,
- for people interested in film theory and genre theory, to see how the automatically learned movie genres relate to the traditional movie genre taxonomy.
This monograph describes the author's large experimental work on one machine learning task: prediction of movie ratings in the Netflix Prize dataset. The main objective of the experiments was to obtain maximally accurate predictions, as evaluated by hold-out RMSE, but the prospect of applying the developed methods in recommender systems was also important. The publication has two goals: summarizing the understanding of the subject that emerged from the published work of many people on the same task, and presenting some novel insights. Reaching a good understanding of one task and one dataset gives hope of generalizing to other prediction tasks, as similar challenges recur in the analysis of any dataset.
The idea of collaborative filtering is to make use of relations between tasks (users in our data), and between task attributes (items in our data). Collaborative filtering methods are used in recommender systems to calculate personalized recommendations, or in other words, to identify items preferred by a particular user. To realize that goal, a good intermediate task is prediction of user ratings. The most accurate models for this task are based on dimensionality reduction: each item is described by a small number of variables, which can be seen as automatically learned analogues of movie genres, and each user's taste is likewise described by a small number of variables. One of the most accurate models, regularized SVD, was analyzed more closely, and the assumptions of that model, such as the single-variable output, combining hidden variables by multiplication, and using Gaussian priors, were critically examined. In addition, an interpretation of the learned features by naming new movie genres has been proposed.
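As a rough illustration of the dimensionality-reduction idea, the sketch below fits a small regularized SVD by stochastic gradient descent. It is a generic textbook-style version written under assumptions, not the monograph's implementation: the function names, the toy ratings, the learning rate, the regularization constant and the use of a global mean offset are all hypothetical choices made for this example. Each rating is predicted as the global mean plus the product of a user feature vector and an item feature vector, and the L2 penalty plays the role of the Gaussian priors mentioned above.

```python
import numpy as np

# Toy (user, item, rating) triples; invented purely for illustration.
TOY_RATINGS = [
    (0, 0, 5), (0, 1, 5), (0, 2, 1),
    (1, 0, 5), (1, 1, 4), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 5),
    (3, 1, 1), (3, 2, 4), (3, 3, 5),
]


def train_regularized_svd(ratings, n_users, n_items, n_features=2,
                          lrate=0.05, reg=0.02, n_epochs=200, seed=0):
    """Fit user and item feature vectors by SGD with an L2 penalty.

    All hyperparameter defaults are assumptions chosen for the toy data.
    """
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, n_features))   # user features
    V = 0.1 * rng.standard_normal((n_items, n_features))   # item features
    mean = sum(r for _, _, r in ratings) / len(ratings)     # global offset
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mean + U[u] @ V[i])
            u_old = U[u].copy()  # use the pre-update user vector for V's step
            U[u] += lrate * (err * V[i] - reg * U[u])
            V[i] += lrate * (err * u_old - reg * V[i])
    return mean, U, V


def rmse(ratings, mean, U, V):
    """Root mean squared error of the model on a list of rating triples."""
    sq = [(r - (mean + U[u] @ V[i])) ** 2 for u, i, r in ratings]
    return float(np.sqrt(np.mean(sq)))


if __name__ == "__main__":
    mean, U, V = train_regularized_svd(TOY_RATINGS, n_users=4, n_items=4)
    print("training RMSE:", rmse(TOY_RATINGS, mean, U, V))
```

In a real experiment the RMSE would be measured on ratings held out from training, as in the hold-out evaluation described above; the toy example reports training error only to stay short.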
When learning the parameters of the developed models, the best predictive accuracy was obtained by using different degrees of approximation of the Bayesian approach, from MCMC and Variational Bayes to neural-network-like simplifications. When identifying the model, that is, while approaching the unknown probabilistic model that generated the data, a good engineering practice was to maintain a blend of an ensemble of many accurate but varied methods. Blends of large ensembles also gave the best accuracy reached, indicating that, despite the large combined effort of many people, the process of model identification for the analyzed data remained largely unfinished, which is probably unavoidable in the analysis of real-life datasets.
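One simple way to blend an ensemble is a linear combination of the individual predictions, with weights fitted by ridge regression on ratings that were held out from training. The sketch below assumes that approach; the text above does not say which blending method was used, and the function names and the ridge constant are hypothetical.

```python
import numpy as np


def fit_blend_weights(preds, targets, ridge=1e-3):
    """Fit linear blending weights (plus an intercept) by ridge regression.

    preds   : array of shape (n_ratings, n_methods), one column per method
    targets : array of shape (n_ratings,), the true held-out ratings
    ridge   : small L2 penalty; the value is an assumption for this sketch
    """
    X = np.column_stack([preds, np.ones(len(targets))])  # append intercept
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)


def blend(preds, weights):
    """Apply fitted weights to a matrix of per-method predictions."""
    X = np.column_stack([preds, np.ones(preds.shape[0])])
    return X @ weights


def rmse(a, b):
    """Root mean squared error between two equally sized arrays."""
    return float(np.sqrt(np.mean((np.asarray(a) - np.asarray(b)) ** 2)))


if __name__ == "__main__":
    # Synthetic data: two imperfect "methods" with independent errors.
    rng = np.random.default_rng(0)
    truth = rng.uniform(1.0, 5.0, size=500)
    preds = np.column_stack([
        truth + rng.normal(0.0, 1.0, size=500),
        truth + rng.normal(0.0, 1.0, size=500),
    ])
    w = fit_blend_weights(preds, truth)
    print("method RMSEs:", rmse(preds[:, 0], truth), rmse(preds[:, 1], truth))
    print("blend RMSE:  ", rmse(blend(preds, w), truth))
```

The blend helps most when the methods make different errors, which is why an ensemble of accurate but varied methods is worth maintaining.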
The work is complemented by heuristics that adapt rating prediction to generating lists of recommendations, heuristics for cold-start situations, and descriptions of two SVD-based recommender systems.