Randomized algorithms for very large matrix problems (such as matrix multiplication, least-squares regression, the Singular Value Decomposition, etc.) have received a great deal of attention in recent years. Much of this work was motivated by problems in large-scale data analysis; this approach provides a novel paradigm and complementary perspective to traditional numerical linear algebra approaches to matrix computations; and the success of this line of work opens the possibility of performing matrix-based computations with truly massive data sets. Originating within theoretical computer science, this work was subsequently extended and applied in important ways by researchers from numerical linear algebra, statistics, applied mathematics, data analysis, and machine learning, as well as domain scientists.

In this talk, we will provide an overview of this approach, with an emphasis on a few simple core ideas that underlie not only recent theoretical advances but also the usefulness of these tools in large-scale data analysis applications. Crucial in this context is the connection with the concept of statistical leverage. Historically, this notion, and in particular the diagonal elements of the so-called hat matrix, has been used in regression diagnostics to identify errors and outliers. Recently, however, the connection with statistical leverage has proved crucial in the development of improved matrix algorithms that come with worst-case guarantees, that are amenable to high-quality numerical implementation, and that are also useful to domain scientists. These developments, how to approximate the statistical leverage scores very precisely in time qualitatively faster than the usual naive method, and an example of how these ideas can be applied in large-scale distributed and parallel computational environments will all be described.
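To make the central concept concrete, here is a minimal NumPy sketch of statistical leverage scores: the exact scores are the diagonal entries of the hat matrix H = X(XᵀX)⁻¹Xᵀ, and the fast approximation illustrated below uses a simple Gaussian sketch as a stand-in for the structured random projections used in the actual algorithms discussed in the talk. The function names and sketch size are illustrative choices, not part of the talk itself.

```python
import numpy as np

def exact_leverage_scores(X):
    # Leverage scores are the diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
    # Numerically, they equal the squared row norms of Q from a thin QR of X.
    Q, _ = np.linalg.qr(X)
    return np.sum(Q**2, axis=1)

def approx_leverage_scores(X, sketch_size, rng):
    # Approximate leverage scores via a random sketch: compress the n rows of X
    # down to `sketch_size` rows, recover the column geometry from a small QR,
    # then read off approximate scores as squared row norms of X R^{-1}.
    # (A plain Gaussian sketch is used here for clarity; fast structured
    # transforms give the qualitatively faster running times.)
    n, d = X.shape
    S = rng.standard_normal((sketch_size, n)) / np.sqrt(sketch_size)
    _, R = np.linalg.qr(S @ X)               # R approximates the column geometry of X
    Omega = np.linalg.solve(R.T, X.T).T      # X R^{-1} without forming an inverse
    return np.sum(Omega**2, axis=1)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
lev = exact_leverage_scores(X)
# Exact scores lie in [0, 1] and sum to rank(X) = 5; rows with large scores
# exert the most "leverage" on a least-squares fit, which is why the hat
# matrix diagonal is used in regression diagnostics.
lev_tilde = approx_leverage_scores(X, sketch_size=400, rng=rng)
```

The point of the approximation is that the QR factorization is performed on a `sketch_size × d` matrix rather than the full `n × d` matrix, which is where the speedup over the naive method comes from when n is very large.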
Michael Mahoney is at Stanford University. His research interests center around algorithms for very large-scale statistical data analysis, including both theoretical and applied aspects of problems in scientific and Internet domains. His current research interests include geometric network analysis; developing approximate computation and regularization methods for large informatics graphs; and applications to community detection, clustering, and information dynamics in large social and information networks. He has also worked on randomized matrix algorithms and their applications to genetics, medical imaging, and Internet problems. He has been a faculty member at Yale University and a researcher at Yahoo, and his PhD was in computational statistical mechanics at Yale University.