# Vincent Granville's Blog – May 2011 Archive (11)

### O(n) clustering algorithm for very large, unstructured data

Let's say that you have a large number n of elements a, b, c, etc. and you want to group them into clusters. Each cluster is supposed to contain few elements, say O(1).

You have one similarity metric d(a,b) to compare any two elements a, b. Also, you have a list of all pairs where d(a,b) > threshold, or in other words, all pairs (a,b) where a and b belong to the same cluster. The n x…
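The excerpt above is cut off before the algorithm itself is described, so the following is only a minimal sketch of one way to turn a list of similar pairs into clusters, not necessarily the author's method. It uses a union-find (disjoint-set) structure; the names `cluster_from_pairs`, `elements`, and `similar_pairs` are hypothetical. When every cluster holds O(1) elements, the run time is roughly proportional to n plus the number of pairs.

```python
from collections import defaultdict


def cluster_from_pairs(elements, similar_pairs):
    """Group elements so that any two linked by a chain of similar pairs share a cluster.

    `similar_pairs` is assumed to be the precomputed list of pairs (a, b)
    with d(a, b) > threshold mentioned in the post.
    """
    # Each element starts as the root of its own one-element cluster.
    parent = {e: e for e in elements}

    def find(x):
        # Walk up to the root, shortening the path along the way (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the two clusters of every similar pair.
    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each cluster under its root.
    clusters = defaultdict(list)
    for e in elements:
        clusters[find(e)].append(e)
    return list(clusters.values())


elements = ["a", "b", "c", "d", "e"]
pairs = [("a", "b"), ("b", "c"), ("d", "e")]
clusters = cluster_from_pairs(elements, pairs)
print(clusters)
```

Note that this sketch never builds the full n x n similarity matrix; it only touches the pairs that are already known to exceed the threshold.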

Added by Vincent Granville on May 30, 2011 at 11:00pm — 1 Comment

### The AnalyticBridge Theorem (AKA the Fundamental Business Analytics Theorem)

See attached document, including the theorem, its proof and applications to business analytics (e.g. to produce model-free, data-driven confidence intervals for predictive scores). More explanations coming soon, in particular about how to leverage this deep statistical result when computing metrics against very large data sets.

The AnalyticBridge Theorem

Added by Vincent Granville on May 29, 2011 at 7:00pm — 1 Comment

### What causes predictive models to fail - and how to fix it?

Here are potential issues:

• Over-fitting. If you perform a regression with 200 predictors (with strong cross-correlations among predictors), use meta regression coefficients: that is, use coefficients of the form f[Corr(Var, Response), a, b, c], where a, b, c are three meta-parameters (e.g. priors in a Bayesian framework). This reduces the number of parameters from 200 to 3 and eliminates most of the over-fitting.
• Perform the right type of…
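
The post does not say what the function f looks like, so the following is only an illustrative sketch of the meta-coefficient idea under assumed choices. Here each predictor's coefficient is derived from its correlation with the response using three shared meta-parameters: a hypothetical scale `a`, exponent `b`, and cutoff `c`. The functional form and all names are assumptions, not the author's formula.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    sxx = sum((u - mean_x) ** 2 for u in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


def meta_coefficients(columns, response, a, b, c):
    """Derive one coefficient per predictor from its correlation with the response.

    Assumed (hypothetical) form: coef = a * sign(r) * |r|**b if |r| > c, else 0.
    Only a, b, c need tuning, however many predictors there are.
    """
    coefs = []
    for col in columns:
        r = pearson(col, response)
        if abs(r) > c:
            coefs.append(a * math.copysign(abs(r) ** b, r))
        else:
            coefs.append(0.0)
    return coefs


# Three toy predictors: perfectly correlated, perfectly anti-correlated, weakly correlated.
columns = [[1, 2, 3, 4], [4, 3, 2, 1], [1, -1, 1, -1]]
response = [2, 4, 6, 8]
coefs = meta_coefficients(columns, response, a=0.5, b=2, c=0.5)
print(coefs)
```

In practice the three meta-parameters would be chosen on held-out data; the point of the sketch is only that the number of free parameters stays at three.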
Added by Vincent Granville on May 28, 2011 at 8:00pm — 8 Comments

If you have more than 100 friends on Facebook, you've probably noticed that Facebook always shows the same 20 friends on your profile page, day after day. FB actually shows up to 10 friends at a time, but they rotate from a list of 20 friends that, according to FB data mining algorithms, are deemed to be your best friends.

What makes a connection one of your FB best friends is how frequently she visits your profile. You can influence this list to some extent, by posting comments…

Added by Vincent Granville on May 28, 2011 at 6:30pm — No Comments

### IBM Commits \$100 Million to Massive Scale Analytics Research

ARMONK, N.Y., May 20, 2011 /PRNewswire/ -- As companies seek to gain real-time insight from diverse types of data, IBM (NYSE: IBM) today unveiled new software and services to help clients more effectively gain competitive insight, optimize infrastructure and better manage resources to address Internet-scale data. For the first time, organizations can…

Added by Vincent Granville on May 28, 2011 at 10:58am — No Comments

### RapidMiner voted most popular data mining / analytic software on KDNuggets

The poll had a record participation (over 1,100 voters). Among them, 43% used only commercial software, 32% only free software, and 25% both. The average number of tools per user was 2.2.

RapidMiner, R, and Excel were again the most popular tools, with SAS remaining the top commercial tool. Pie chart shows the breakdown of voters by region. We also note that W. European data miners had the highest % of free tool use (due to popularity of tools like RapidMiner and KNIME…

Added by Vincent Granville on May 24, 2011 at 6:15pm — No Comments

### ASA and CHANCE Magazine Sponsor Blog to Foster Discussions of Probability, Statistics

The American Statistical Association and CHANCE magazine have debuted The Statistics Forum, a blog to provide everyone the opportunity to participate in discussions about probability and statistics and their role in important and interesting topics. The blog, which is located on the CHANCE web site at chance.amstat.org, is edited by Andrew Gelman. Everyone is invited to read and comment on the…

Added by Vincent Granville on May 19, 2011 at 5:43pm — No Comments

### American Statistical Association Urges Support of Statistical Literacy Bill

The American Statistical Association (ASA), the nation's preeminent statistical society, urges members of the House of Representatives to support the Statistics Teaching, Aptitude and Training Act of 2011 (STAT Act of 2011), which was introduced today by Congressman Dave Loebsack (D-Iowa). A copy of the bill may be viewed at…

Added by Vincent Granville on May 19, 2011 at 5:41pm — No Comments

### New Ways to Exploit Raw Data May Bring Surge of Innovation | New York Times

Math majors, rejoice. Businesses are going to need tens of thousands of you in the coming years as companies grapple with a growing mountain of data.

Data is a vital raw material of the information economy, much as coal and iron ore were in the Industrial Revolution. But the business world is just beginning to learn how to process it all.

The current data surge is coming from sophisticated computer tracking of shipments, sales, suppliers and customers, as well as e-mail, Web…

Added by Vincent Granville on May 14, 2011 at 9:41am — No Comments

### About 200,000 data miners needed according to McKinsey

Analyzing large data sets (so-called big data) will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus as long as the right policies and enablers are in place.

Research by MGI and McKinsey's Business Technology Office examines the state of digital data and documents the significant value that can potentially be unlocked.…

Added by Vincent Granville on May 13, 2011 at 5:28pm — 2 Comments

Qaeda suspect killed in Abbottabad (Daily Times)…
Added by Vincent Granville on May 7, 2011 at 11:00am — 1 Comment

