Subscribe to DSC Newsletter

[Course] Data Mining: Learning from Large Data Sets (ETH, Zurich, Switzerland)

Data Mining: Learning from Large Data Sets


Many scientific and commercial applications require us to obtain insights from massive, high-dimensional data sets. In this graduate-level course, we will study principled, state-of-the-art techniques from statistics, algorithms and discrete and convex optimization for learning from such large data sets. The course will both cover theoretical foundations and practical applications.

Topics covered

  • Dealing with large data (Data centers; Map-Reduce/Hadoop; Amazon Mechanical Turk)
  • Fast nearest neighbor methods (Shingling, locality sensitive hashing)
  • Online learning (Online optimization and regret minimization, online convex programming, applications to large-scale Support Vector Machines)
  • Multi-armed bandits (exploration-exploitation tradeoffs, applications to online advertising and relevance feedback)
  • Active learning (uncertainty sampling, pool-based methods, label complexity)
  • Dimension reduction (random projections, nonlinear methods)
  • Data streams (Sketches, coresets, applications to online clustering)
  • Recommender systems


  • Homework 3 is out.
  • No class on April 14.
  • Friday recitation is cancelled. Please go to the Thursday recitation after the lecture.


  • eDoz Information: See here.
  • Recitations:
    • Thu 15-16 in NO C 6. Last names starting with A-M
    • Fri 14-15 in HG E 33.3. Last names starting with N-Z
  • Textbook: A. Rajaraman, J. Ullman. Mining of Massive Data Sets. Available here.

Lecture Notes

  • Feb 24: Introduction
  • Mar 3: Nearest neighbor search; Min-hashing
  • Mar 10: Locality sensitive hashing
  • Mar 17: Locality sensitive hashing; PageRank
  • Mar 24: PageRank and variants
  • Mar 31: SVM, online convex programming
  • Apr 7: Parallel online learning; other loss functions and regularizers
  • Apr 21: Active learning
  • May 5: Unsupervised learning on large data sets

Relevant Readings

  • Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December, 2004.
  • James Hays, Alexei A. Efros. Scene Completion Using Millions of Photographs. ACM Transactions on Graphics (SIGGRAPH 2007). August 2007, vol. 26, No. 3.
  • Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan and Uri Shaft. When is "Nearest Neighbor" Meaningful?. Proc. 7th International Conference on Database Theory (ICDT 1999) LNCS 1540: 217-235.
  • Aristides Gionis, Piotr Indyk, Rajeev Motwani. Similarity Search in High Dimensions via Hashing [ps]. 25th International Conference on Very Large Databases (VLDB) , 1999
  • Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener. Graph structure in the web [html]. Proceedings of the 9th international World Wide Web conference on Computer networks 2000
  • Zoltan Gyöngyi, Hector Garcia-Molina, Jan Pedersen. Combating Web Spam with TrustRank. VLBD 2004 [pdf]
  • Martin Zinkevich. Online Convex Programming and Generalized Infinitesimal Gradient Ascent. Proc. International Conference on Machine Learning 2003 [pdf]
  • Martin Zinkevich, Markus Weimer, Alex Smola, Lihong Li. Parallelized Stochastic Gradient Descent. Advances in Neural Information Processing Systems 23, 2010 [pdf]
  • Ji Zhu, Saharon Rosset, Trevor Hastie, Rob Tibshirani. L1 norm support vector machines. Advances in Neural Information Processing Systems 2003 [html]
  • Prateek Jain, Sudheendra Vijayanarasimhan, Kristen Grauman. Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning. Advances in Neural Information Processing Systems 2003 [pdf]
  • Simon Tong, Daphne Koller. Support Vector Machine Active Learning with Applications to Text Classification. Proceedings of the Seventeenth International Conference on Machine Learning 2000 [ps]

Related Courses

Views: 891

On Data Science Central

© 2021   TechTarget, Inc.   Powered by

Badges  |  Report an Issue  |  Privacy Policy  |  Terms of Service