A Data Science Central Community
Here I provide the mathematics, explanations and source code to produce the data and moving clusters in the "From chaos to clusters" video series.
A little bit of history on how the project started:
What is a statistical model without a model?
There's actually a generic mathematical model behind the algorithm. But nobody cares about the model: the algorithm was created first, without having a mathematical model in mind. Initially, I had a gravitational model in mind, but I eventually abandoned it as it was not producing what I expected.
This illustrates a new trend in data science: we care less and less about modeling, but more and more about results. My algorithm has a bunch of parameters and features that can be fine-tuned to produce anything you want - be it a simulation of a Neyman-Scott cluster process, or a simulation of some no-name stochastic process.
It's a bit similar to how modern rock climbing has evolved: from focusing on big names such as Everest in the past, to exploring deeper wilderness and climbing no-name peaks today (with their own challenges), to rock climbing on Mars in the future.
You can fine tune the parameters to
So how does the algorithm work?
It starts with a random distribution of m mobile points in the [0,1] x [0,1] square window. The points get attracted to each other (attraction is stronger to closest neighbors) and thus over time, they group into clusters.
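The starting configuration described above can be sketched as follows. This is only a minimal illustration, not the original source: the value of $m and the fixed seed are assumptions chosen for the example, while the array names $moving_x and $moving_y follow the ones used later in this article.

```perl
use strict;
use warnings;

# ASSUMPTION: both values below are illustrative, not taken from the original code.
my $m = 100;   # number of mobile points
srand(42);     # fixed seed so that a run can be repeated

my (@moving_x, @moving_y);
for my $k (0 .. $m - 1) {
    # rand() returns a uniform value in [0, 1), so every point
    # starts inside the [0,1] x [0,1] square window.
    $moving_x[$k] = rand();
    $moving_y[$k] = rand();
}

printf "%d points, first one at (%.3f, %.3f)\n", $m, $moving_x[0], $moving_y[0];
```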
The algorithm has the following components:
Special features
In the source code, the birth process (for point $k) is simply encoded as:
if (rand()<0.1/(1+$iteration)) { # birth and death
$tmp_x[$k]=rand();
$tmp_y[$k]=rand();
$rebirth[$k]=1;
}
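To see what the condition rand()<0.1/(1+$iteration) implies, the short sketch below tabulates the per-point birth probability at a few iterations. The helper name birth_probability is hypothetical, introduced only for this illustration; the formula itself is read directly off the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: probability that a given point is reborn
# at a given iteration, i.e. the threshold compared against rand().
sub birth_probability {
    my ($iteration) = @_;
    return 0.1 / (1 + $iteration);
}

for my $iteration (0, 1, 9, 99) {
    printf "iteration %3d: p = %.4f\n", $iteration, birth_probability($iteration);
}
```

The probability starts at 0.1 and falls to 0.05, 0.01 and 0.001 at iterations 1, 9 and 99, so rebirths are frequent early on and become rare as the clusters settle.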
In the source code, in the inner loop over $k, the point ($x, $y) to be updated is referenced as point $k, that is, ($x, $y) = ($moving_x[$k], $moving_y[$k]). Also, in a loop over $l, one level deeper, ($p, $q), referenced as point $l, represents a neighboring point when computing the weighted average formula used to update ($x, $y). The distance d is computed using the function distance, which accepts four arguments ($x, $y, $p, $q) and returns $weight, the weight w.
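The two nested loops can be sketched as below. This is a hedged illustration, not the original code: the body of distance is an assumption (exponential decay with a cutoff radius), as are the constants $decay and $cutoff and the wrapper update_points. Only the calling convention of distance and the variable names come from the description above.

```perl
use strict;
use warnings;

# ASSUMPTION: the weight formula and both constants are illustrative only.
my $decay  = 20;    # hypothetical decay rate of the attraction
my $cutoff = 0.3;   # hypothetical radius beyond which neighbors are ignored

# Takes ($x, $y, $p, $q) and returns the weight w derived from the distance d.
sub distance {
    my ($x, $y, $p, $q) = @_;
    my $d = sqrt(($x - $p)**2 + ($y - $q)**2);
    return 0 if $d > $cutoff;     # no long-range interaction
    return exp(-$decay * $d);     # closer neighbors get larger weights
}

# Hypothetical wrapper: moves every point to the weighted average of all points.
sub update_points {
    my ($moving_x, $moving_y) = @_;   # array references
    my $m = @$moving_x;
    my (@tmp_x, @tmp_y);
    for my $k (0 .. $m - 1) {
        my ($x, $y) = ($moving_x->[$k], $moving_y->[$k]);
        my ($sum_w, $sum_x, $sum_y) = (0, 0, 0);
        for my $l (0 .. $m - 1) {
            my ($p, $q) = ($moving_x->[$l], $moving_y->[$l]);
            my $weight = distance($x, $y, $p, $q);
            $sum_w += $weight;
            $sum_x += $weight * $p;
            $sum_y += $weight * $q;
        }
        # The term $l == $k has weight 1, so $sum_w is never zero.
        $tmp_x[$k] = $sum_x / $sum_w;
        $tmp_y[$k] = $sum_y / $sum_w;
    }
    return (\@tmp_x, \@tmp_y);
}

# Two close points and one distant point.
my ($new_x, $new_y) = update_points([0.10, 0.12, 0.90], [0.10, 0.10, 0.90]);
printf "(%.4f, %.4f)\n", $new_x->[$_], $new_y->[$_] for 0 .. 2;
```

In this small example the first two points pull toward each other, while the third one, lying beyond the cutoff, stays where it is.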
Click here to view source code.
Comment
Isn't this just a sort of strange-attractor formula?
Rebuttal to The End of Theory: The Data Deluge Makes the Scientific Method Obso...:
A lot can be done with black-box pattern detection, where patterns are found but not understood. For many applications (e.g. high-frequency trading) it's fine as long as your algorithm works. But in other contexts (e.g. root cause analysis for cancer eradication), deeper investigation is needed for higher success. And in all contexts, identifying and weighting true factors that explain the cause usually allows for better forecasts, especially if good model selection, model fitting and cross-validation are performed. But if advanced modeling requires paying a high salary to a statistician for 12 months, maybe the ROI becomes negative and black-box brute force performs better, ROI-wise. In both cases, whether caring about cause or not, it is still science. Indeed it is actually data science - and it includes an analysis to figure out when/whether deeper statistical science is or is not required. And in all cases, it always involves cross-validation and design of experiment. Only the statistical theoretical modeling aspect can be ignored. Other aspects, such as scalability and speed, must be considered, and this is science too: data and computer science.
Hi James,
There are a few potential applications. Example: visually see how insurance segments evolve over time (how they grow and shrink, how new segments appear), and how members move from one segment to another, showing the most active or popular paths (e.g. from young to old as the population ages, or from client to non-client), as well as the velocity of all these changes. Particularly useful in an unsupervised clustering context.
-Vincent
I am just wondering how you apply the mentioned algorithm to business problems. Can you use it as an unsupervised clustering algorithm for large data sets?
I have added two videos, you could call them "shooting stars":
Click here to get source code with explanations.
Thanks! Now I understand what you mean. My problem was that, in my areas (statistical astrophysics and statistical analysis for business), this is still called a model. An example of the first is the inflationary model of the evolution of the universe. An astrophysicist will likely use the word "model" in relation to any concise description of a proposed mechanism underlying a physical process.
As a mathematician, I would be inclined to describe your "statistics without statistics" as statistics without correspondence, that is, without the one-to-one correspondence between predictor variables and an outcome described by an equation or set of equations. For me, a model isn't the equation: it's the explanation. The explaining might be done using equations but it might not...
Yes David, I did not write any mathematical equation: I did in the very early stages, when I considered attraction forces that were inversely proportional to the square of the distance. Then I switched to exponential decay, then to ignoring neighbors that are not very close (thus ignoring long-range interactions). After that, I entirely dropped the idea of using models.
I switched to just writing code (an algorithm, more precisely) and changed the code to see what it produces, keeping pieces of code that produce desirable effects or have desirable properties (stability, scalability, replicability etc.). In a nutshell, this is code-driven or visually-driven (as opposed to model-driven) statistical research.
Now of course there's a model behind my final results. I just don't know what the model is. Yet you can still perform goodness-of-fit tests and predictions. Just estimate the parameters in the code (parameters that minimize some error between statistics computed on both simulated and real data - statistics that characterize the underlying unknown process), then run the code: it will create realizations of the (unknown) process well into the future. Compute estimates on these "future" realizations, and you get your predictions. You can even provide (model-free) confidence intervals using the AnalyticBridge Theorem.
Tell me again how this isn't a model? What is your definition of a model? Do you mean that you didn't write a mathematical equation relating predictor variables to an outcome? If not, then what?
Thanks!!