# AnalyticBridge

A Data Science Central Community

# From chaos to clusters - statistical modeling without models

Here I provide the mathematics, explanations, and source code used to produce the data and moving clusters in the *From chaos to clusters* video series.

A little bit of history on how the project started:

1. Interest in astronomy, visualization and how physics models apply to business problems
2. Interest in systems that produce clusters (as well as birth and death processes) and in visualizing cluster formation with videos rather than charts
3. Creating art: videos with sound and images synchronized, both generated using data (coming soon). Maybe I'll be able to turn your business data into a movie (artistic, insightful, or both)! I'm already at the point where I can produce the video frames faster than they are delivered on the streaming device; I call this FRT, for "faster than real time."

## What is a statistical model without a model?

There's actually a generic mathematical model behind the algorithm. But nobody cares about the model; the algorithm was created first, without a mathematical model in mind. Initially, I had a gravitational model in mind, but I eventually abandoned it as it was not producing what I expected.

This illustrates a new trend in data science: we care less and less about modeling, but more and more about results. My algorithm has a bunch of parameters and features that can be fine-tuned to produce anything you want - be it a simulation of a Neyman-Scott cluster process, or a simulation of some no-name stochastic process.

It's a bit similar to how modern climbing has evolved: from focusing on big names such as Everest in the past, to exploring deeper wilderness and climbing no-name peaks today (with their own challenges), to climbing on Mars in the future.

You can fine-tune the parameters to:

1. Achieve best fit between simulated data and real business (or other data), using traditional goodness-of-fit testing and sensitivity analysis. Note that the simulated data represents a realization (an instance for object-oriented people) of a spatio-temporal stochastic process.
2. Once the parameters are calibrated, perform predictions (if you speak statistician language) or extrapolations (if you speak mathematician language).
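As a rough illustration of this calibrate-then-extrapolate idea, here is a minimal Python sketch. It is not the article's actual code: the simulator, the summary statistic, and the single "decay" parameter are stand-ins. The idea is to pick the parameter value whose simulated statistic best matches the statistic computed on real data, then reuse that value to generate "future" realizations.

```python
# Hypothetical calibration sketch (not the article's code): grid-search one
# parameter by minimizing the discrepancy between a summary statistic
# computed on simulated data and on observed data.
import random

random.seed(0)

def simulate(decay, n=200):
    """Stand-in simulator: one realization of a 1-D point process."""
    return [random.gauss(0, 1.0 / decay) for _ in range(n)]

def statistic(points):
    """Summary statistic characterizing the process (here, the spread)."""
    mu = sum(points) / len(points)
    return sum((p - mu) ** 2 for p in points) / len(points)

real_data = simulate(decay=2.0)   # pretend this is the observed data
target = statistic(real_data)

# Goodness-of-fit by grid search: keep the decay value whose simulated
# statistic is closest to the observed one.
best = min((abs(statistic(simulate(d)) - target), d)
           for d in [0.5, 1.0, 2.0, 4.0])[1]
```

Once `best` is found, running the simulator forward with that value produces realizations of the process "into the future," and estimates computed on those realizations serve as predictions.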

## So how does the algorithm work?

It starts with a random distribution of m mobile points in the [0,1] x [0,1] square window. The points get attracted to each other (attraction is stronger to the closest neighbors), and thus, over time, they group into clusters.

The algorithm has the following components:

1. Creation of n random fixed points (n=100) on [-0.5, 1.5] x [-0.5, 1.5]. This window is four times bigger than the one containing the mobile points, to prevent edge effects from impacting the mobile points. These fixed points (they never move) also act as a sort of dark matter: they are invisible and not represented in the video, but they are the glue that prevents the whole system from collapsing onto itself and converging to a single point.
2. Creation of m random mobile points (m=500) on [0,1] x [0,1].
3. Main loop (200 iterations). At each iteration, we compute the distance d between each mobile point (x,y) and each of its m-1 mobile neighbors and n fixed neighbors. A weight w is computed as a function of d, with a special weight for the point (x,y) itself. The updated (x,y) is then the weighted sum aggregated over all points, and we do this for each point (x,y) at each iteration. The weights are such that they always sum to 1 over all points. In other words, we replace each point with a convex linear combination of all points.
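The three components above can be sketched in Python as follows. This is a scaled-down illustration, not the article's Perl source: the exponential-decay weight function and the self-weight constant are assumptions, and the sizes are reduced (the article uses n=100, m=500, and 200 iterations) so the sketch runs quickly.

```python
# Scaled-down sketch of the cluster-formation loop. The weight function
# (exponential decay in distance) and SELF_WEIGHT are illustrative
# assumptions; the article's Perl source may differ.
import math
import random

random.seed(1)
n, m, iterations = 20, 50, 20   # article uses 100, 500, 200

# 1. Fixed "dark matter" points on [-0.5, 1.5] x [-0.5, 1.5]; they never move.
fixed = [(random.uniform(-0.5, 1.5), random.uniform(-0.5, 1.5))
         for _ in range(n)]

# 2. Mobile points on [0, 1] x [0, 1].
mobile = [(random.random(), random.random()) for _ in range(m)]

def weight(x, y, p, q):
    """Un-normalized weight: attraction is stronger to closer neighbors."""
    return math.exp(-8.0 * math.hypot(x - p, y - q))

SELF_WEIGHT = 2.0  # special (higher) weight for the point being updated

# 3. Main loop: replace each mobile point with a convex combination
#    (weights normalized to sum to 1) of itself and all other points.
for _ in range(iterations):
    updated = []
    for k, (x, y) in enumerate(mobile):
        others = [pt for j, pt in enumerate(mobile) if j != k] + fixed
        ws = [weight(x, y, p, q) for p, q in others]
        total = SELF_WEIGHT + sum(ws)
        new_x = (SELF_WEIGHT * x + sum(w * p for w, (p, _q) in zip(ws, others))) / total
        new_y = (SELF_WEIGHT * y + sum(w * q for w, (_p, q) in zip(ws, others))) / total
        updated.append((new_x, new_y))
    mobile = updated
```

Because all weights here are positive and normalized, every update is a convex combination, so the mobile points can never leave the [-0.5, 1.5] x [-0.5, 1.5] window; experimenting with a negative self-weight, as the article discusses next, relaxes that guarantee.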

## Special features

• If the weight for (x,y) [the point being updated] is very high at a given iteration, then (x,y) will barely move.
• We have tested negative weights (especially for the point being updated) and liked the results better. A delicate amount of negative weight also further prevents the system from collapsing and introduces a bit of chaos.
• Occasionally, one point is replaced by a brand new, random point, rather than updated using the weighted sum of neighbors. We call this event a "birth". It happens for less than 1% of all point updates, and it happens more frequently at the beginning. Of course, you can play with these parameters.

In the source code, the birth process (for point $k) is simply encoded as:

```perl
if (rand() < 0.1/(1+$iteration)) { # birth and death
  $tmp_x[$k] = rand();
  $tmp_y[$k] = rand();
  $rebirth[$k] = 1;
}
```

In the source code, in the inner loop over $k, the point ($x, $y) to be updated is referenced as point $k, that is, ($x, $y) = ($moving_x[$k], $moving_y[$k]). In a loop over $l, one level deeper, ($p, $q), referenced as point $l, represents a neighboring point used in the weighted average formula that updates ($x, $y). The distance d is computed by the function distance, which accepts four arguments ($x, $y, $p, $q) and returns $weight, the weight w.

Comments

Comment by Mathieu Landry on April 30, 2015 at 8:35pm

Isn't this just a sort of strange-attractor formula?

Comment by Vincent Granville on May 12, 2013 at 7:41pm

A lot can be done with black-box pattern detection, where patterns are found but not understood. For many applications (e.g. high frequency trading) it's fine as long as your algorithm works. But in other contexts (e.g. root cause analysis for cancer eradication), deeper investigation is needed for higher success. And in all contexts, identifying and weighting the true factors that explain the cause usually allows for better forecasts, especially if good model selection, model fitting and cross-validation are performed. But if advanced modeling requires paying a high salary to a statistician for 12 months, maybe the ROI becomes negative and black-box brute force performs better, ROI-wise. In both cases, whether caring about cause or not, it is still science. Indeed it is actually data science, and it includes an analysis to figure out when and whether deeper statistical science is or is not required. And in all cases, it always involves cross-validation and design of experiments. Only the theoretical statistical modeling aspect can be ignored. Other aspects, such as scalability and speed, must be considered, and this is science too: data and computer science.

Comment by Vincent Granville on May 3, 2013 at 4:08pm

Hi James,

There are a few potential applications. Example: visually seeing how insurance segments evolve over time (how they grow and shrink, and how new segments appear), how members move from one segment to another, showing the most active or popular paths (e.g. from young to old as the population ages, or from client to non-client), as well as the velocity of all these changes. Particularly useful in an unsupervised clustering context.

-Vincent

Comment by jamesxli2007 on May 3, 2013 at 3:05pm

I am just wondering how you apply the mentioned algorithm to business problems. Can you use it as an unsupervised clustering algorithm for large data sets?

James

Comment by Vincent Granville on April 26, 2013 at 3:15am

I have added two videos; you could call them "shooting stars":

Comment by David J Corliss on April 25, 2013 at 2:32pm

Thanks! Now I understand what you mean. My problem was that, in my areas (statistical astrophysics and statistical analysis for business), this is still called a model. An example of the first is the inflationary model of the evolution of the universe. An astrophysicist will likely use the word "model" in relation to any concise description of a proposed mechanism underlying a physical process.

As a mathematician, I would be inclined to describe your "statistics without statistics" as statistics without correspondence, that is, without the one-to-one correspondence between predictor variables and an outcome described by an equation or set of equations. For me, a model isn't the equation: it's the explanation. The explaining might be done using equations, but it might not...

Comment by Vincent Granville on April 25, 2013 at 2:17pm

Yes David, I did not write any mathematical equation for the final version. I did in the very early stages, when I considered attraction forces inversely proportional to the square of the distance. Then I switched to exponential decay, then to ignoring neighbors that are not very close (thus ignoring long-range interactions). After that, I entirely dropped the idea of using models.

I switched to just writing code (an algorithm, more precisely) and changed the code to see what it produces, keeping pieces of code that produce desirable effects or have desirable properties (stability, scalability, replicability, etc.). In a nutshell, this is code-driven or visually-driven (as opposed to model-driven) statistical research.

Now of course there's a model behind my final results. I just don't know what the model is. Yet you can still perform goodness-of-fit tests and predictions. Just estimate the parameters in the code (parameters that minimize some error between statistics computed on both simulated and real data, statistics that characterize the underlying unknown process), then run the code: it will create realizations of the (unknown) process well into the future. Compute estimates on these "future" realizations, and you get your predictions. You can even provide confidence intervals (model-free) using the AnalyticBridge Theorem.

Comment by David J Corliss on April 25, 2013 at 1:58pm

Tell me again how this isn't a model? What is your definition of a model? Do you mean that you didn't write a mathematical equation relating predictor variables to an outcome? If not, then what?

Thanks!!
