Bootstraps, Permutation Tests, and Sampling Orders of Magnitude Faster Using SAS. Computational Statistics-WIREs, Vol. 5, Issue 5, pp. 391-405. Download at http://www.datamineit.com/DMI_publications.htm
Permutation tests and bootstraps have very wide-ranging application, but they share a potential drawback: as data-intensive resampling methods, both can have prohibitive runtimes when applied to large or even medium-sized samples drawn from large datasets. The data explosion of the past few decades has made this a common occurrence, and it highlights the growing need for faster, more efficient, and more scalable permutation test and bootstrap algorithms.
Seven bootstrap and six permutation test algorithms coded in SAS are compared (SAS Institute is the largest privately owned software firm globally). The fastest algorithms, “OPDY” for the bootstrap and “OPDN” for permutation tests, are new, use no modules beyond Base SAS, and run orders of magnitude faster than the relevant “built-in” SAS procedures. OPDY is over 200x faster than Proc SurveySelect. OPDN is over 240x faster than Proc SurveySelect, over 350x faster than Proc NPAR1WAY (which crashes on datasets less than a tenth the size OPDN can handle), and over 720x faster than Proc Multtest. OPDY also is much faster than hashing, which crashes on datasets smaller than those OPDY can handle, sometimes by orders of magnitude. OPDY is easily generalizable to multivariate regression models, and OPDN, which uses an extremely efficient draw-by-draw algorithm for random sampling without replacement, can use virtually any permutation statistic, so both have a very wide range of application. Finally, the time complexity of both OPDY and OPDN is sub-linear, making them not only the fastest, but also the only truly scalable bootstrap and permutation test algorithms, respectively, in SAS.
Keywords: Bootstrap, Permutation, SAS, Scalable, Hashing, With Replacement, Without Replacement, Sampling
JEL Classifications: C12, C13, C14, C15, C63, C88
Mathematics Subject Classifications: 62F40, 62G09, 62G10
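For readers unfamiliar with the two resampling schemes named in the abstract, the following minimal Python sketch may help. It is not SAS and it is not OPDY or OPDN, whose details are given in the paper. It only illustrates, under stated assumptions, what a bootstrap (sampling with replacement) and a two-sample permutation test (sampling without replacement) compute. The without-replacement step uses a textbook one-pass selection-sampling scheme (Knuth's Algorithm S), chosen here purely for illustration. All function names, the choice of the mean as the statistic, and the "+1" p-value convention are assumptions of this sketch.

```python
import random
from statistics import mean


def bootstrap_means(data, n_boot, rng):
    """Bootstrap: draw len(data) values WITH replacement, n_boot times,
    and return the mean of each resample."""
    n = len(data)
    return [mean(data[rng.randrange(n)] for _ in range(n)) for _ in range(n_boot)]


def select_without_replacement(n_total, n_pick, rng):
    """One-pass selection sampling (illustrative, not OPDN): visit each
    index once and keep it with probability (still needed) / (still remaining).
    Returns exactly n_pick distinct indices in increasing order."""
    picked = []
    needed = n_pick
    for i in range(n_total):
        remaining = n_total - i
        if rng.random() * remaining < needed:
            picked.append(i)
            needed -= 1
            if needed == 0:
                break
    return picked


def permutation_pvalue(a, b, n_perm, rng):
    """Two-sided permutation test on the absolute difference in means.
    Each permutation reassigns len(a) pooled values to group A WITHOUT
    replacement; the remainder forms group B."""
    pooled = list(a) + list(b)
    total = sum(pooled)
    n_a, n_b = len(a), len(b)
    observed = abs(mean(a) - mean(b))
    extreme = 0
    for _ in range(n_perm):
        idx = select_without_replacement(len(pooled), n_a, rng)
        sum_a = sum(pooled[i] for i in idx)
        # Group B's sum follows from the pooled total, so only A is sampled.
        stat = abs(sum_a / n_a - (total - sum_a) / n_b)
        if stat >= observed:
            extreme += 1
    # Assumed convention: count the observed arrangement as one permutation.
    return (extreme + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so repeated runs agree
    sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3]
    print(sorted(bootstrap_means(sample, 5, rng)))
    print(permutation_pvalue([1.0, 2.0, 3.0], [2.5, 3.5, 4.5], 199, rng))
```

Note that the permutation loop samples only one group and derives the other from the pooled total, which is one reason without-replacement sampling can be cheaper than shuffling the full dataset. Whether OPDN exploits the same idea is not stated in the abstract.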
* J.D. Opdyke is Senior Managing Director, DataMineit, LLC, a consultancy specializing in advanced statistical and econometric modeling, risk analytics, and algorithm development for the banking, finance, and consulting sectors. J.D. has been a SAS user for over 20 years and routinely writes SAS code that runs faster (often orders of magnitude faster) than SAS Procs (including but not limited to Proc Logistic, Proc MultTest, Proc Summary, Proc NPAR1WAY, Proc Freq, Proc Plan, and Proc SurveySelect). He earned his undergraduate degree with honors from Yale University and his graduate degree from Harvard University, where he was both a Kennedy Fellow and a Social Policy Research Fellow, and he has completed post-graduate work as an ASP Fellow in the graduate mathematics department at MIT. His peer-reviewed publications span number theory/combinatorics, robust statistics and high-convexity VaR modeling for regulatory and economic capital estimation, statistical finance, statistical computation, applied econometrics, and hypothesis testing for statistical quality control. Most are available upon request from J.D. at [email protected].
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. Other brand and product names are trademarks of their respective companies.