I describe here the ultimate number guessing game, played with real money. It is a new trading and gaming system, based on state-of-the-art mathematical engineering, robust architecture, and patent-pending technology. It offers an alternative to the stock market and traditional gaming. This system is also far more transparent than the stock market, and cannot be manipulated, as the formulas to win the biggest returns (with real money) are made public. It also simulates a neutral, efficient stock market. In short, nothing is random: everything is deterministic, fixed in advance, and known to all users. Yet it behaves in a way that looks perfectly random, and the public algorithms offered to win the biggest gains require so much computing power that, for all practical purposes, they are useless -- except to comply with gaming laws and to establish trustworthiness.

We use private algorithms to determine the winning numbers. While they produce the exact same results as the public algorithms (we tested this extensively), they are more efficient by many orders of magnitude. It can also be mathematically proved that the public and private algorithms are equivalent, and we actually proved it. We go through this verification process for any new algorithm introduced in our system.

In the last section, we offer a competition: can you use the public algorithm to identify the winning numbers computed with the private (secret) algorithm? If yes, the system is breakable, and a more sophisticated approach is needed to make it work. I don't think anyone can find the winning numbers (you are welcome to prove me wrong), so the award will be offered to the contestant providing the best insights on how to improve the robustness of this system. And if by chance you manage to identify those winning numbers, great, you'll get a bonus! But it is not a requirement to win the award.

Read the full article.

Content

1. Description, Main Features and Advantages
2. How it Works: the Secret Sauce
   - Public Algorithm
   - The Winning Numbers
   - Using Seeds to Find the Winning Numbers
   - ROI Tables
3. Business Model and Applications
   - Managing the Money Flow
4. Challenge and Statistical Results
   - Data Science / Math Competition
   - Controlling the Variance of the Portfolio Value
   - Probability of Cracking the System
5. Designing 16-bit and 32-bit Systems
   - Layered ROI Tables
   - Smooth ROI Tables
   - Systems with Winning Numbers in [0, 1]
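The public-versus-private algorithm idea can be illustrated with a toy analogy of my own (it is not the article's actual system, whose algorithms are in the full text): two functions that provably return the same "winning number", one by slow brute force, one via a fast shortcut.

```python
def winning_number_public(seed, n, p=2_147_483_647):
    """'Public' algorithm of the toy analogy: compute seed**n mod p
    by brute-force iteration. Correct, but takes n multiplications,
    so it becomes impractical when n is astronomically large."""
    x = 1
    for _ in range(n):
        x = (x * seed) % p
    return x

def winning_number_private(seed, n, p=2_147_483_647):
    """'Private' algorithm: fast modular exponentiation returns the
    exact same result in about log2(n) steps -- provably equivalent,
    yet faster by many orders of magnitude."""
    return pow(seed, n, p)
```

The two functions agree on every input; proving their equivalence (here, a standard property of modular exponentiation) mirrors the verification process described above for each new algorithm.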

Summary: A new business model strategy, built around intermediary platforms powered by AI/ML, promises the most direct path to fast growth, profitability, and competitive success. Adopting this approach requires a deep change in mindset, and is quite different from merely adopting AI/ML to optimize your current operations.

As a data scientist, you may be wondering why you need to be concerned about strategy and business models. It’s simple: different types of AI/ML are most appropriate for different business objectives. So whether you’re a data scientist being asked to plan and present the most appropriate portfolio of projects, or a CXO looking to support your new digital business model, you need to understand the relationship between data science and strategy.

In our last article we laid out the four major AI/ML-powered business models. We set up a structure to help you think about “AI Inside”, essentially pasted on and used to optimize an existing old-style business model, versus “AI-First”, business models that can lead to real digital transformation.

AI-First models are typically associated with startups, so they are not necessarily the first place a mature business would look for a strategy in its digital journey. But hidden in plain sight within AI-First is a business model strategy so bold that mature companies that have embraced it have outpaced their competitors by a wide margin: adopting a “Platform Strategy”.

Read the full article, by Bill Vorhies, here. For more articles by the same author, follow this link. For more about AI applications, click here.

This discussion has been recovered from our archives.

Question: I'm new to predictive modeling, and I am currently developing a model of student churn for an educational institution where I work. I'm using logistic regression for this problem. Which technique should I use to detect outliers in my training set?

Answers:

- The way we take care of outliers in logistic regression is by creating dummy variables based on EDA (exploratory data analysis).
- Regression analysis, with the available "DRS" software.
- You brought up a good question for discussion. We use a half-normal probability plot of the deviance residuals, with a simulated envelope, to detect outliers in binary logistic regression. The plot helps to identify the deviance residuals. A good reference is the book by Cook, R. D. and S. Weisberg, Applied Regression Including Computing and Graphics (1999). For how to produce a half-normal plot with an envelope, see https://cran.r-project.org/web/packages/auditor/vignettes/model_fit_audit.html
- We normally screen out the most extreme 2 percentiles of any variable (4 percent in total), and remove the records containing those extreme values. You can reduce the cutoff to 1 percentile if your sample size is small.
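The percentile-screening rule from the last answer is straightforward to implement. A minimal sketch in NumPy (the 2-percent cutoff is that answer's suggestion, not a universal rule):

```python
import numpy as np

def trim_extremes(X, pct=2.0):
    """Drop every record in which any variable falls in the extreme
    `pct` percent of either tail (2 * pct percent per variable in total).
    Returns the trimmed array and the boolean mask of kept rows."""
    X = np.asarray(X, dtype=float)
    lo = np.percentile(X, pct, axis=0)        # lower cutoff, per column
    hi = np.percentile(X, 100 - pct, axis=0)  # upper cutoff, per column
    keep = np.all((X >= lo) & (X <= hi), axis=1)
    return X[keep], keep
```

With a small sample, lower `pct` to 1.0 as the answer suggests, so that fewer records are discarded.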

We investigate a large class of auto-correlated, stationary time series, proposing a new statistical test to measure departure from the base model, known as Brownian motion. We also discuss a methodology to deconstruct these time series, in order to identify the root mechanism that generates the observations. The time series studied here can be discrete or continuous in time; they can have various degrees of smoothness (typically measured using the Hurst exponent) as well as long-range or short-range correlations between successive values. Applications are numerous, and we focus here on a case study arising from an interesting number theory problem. In particular, we show that one of the time series investigated in my article on randomness theory (see here, section 4.1.(c)) is not Brownian despite the appearance. This has important implications for the problem in question. Applied to finance or economics, it makes the difference between an efficient market and one that can be gamed.

This article is accessible to a large audience, thanks to its tutorial style, illustrations, and easily replicable simulations. Nevertheless, we discuss modern, advanced, state-of-the-art concepts. This is an area of active research.

Content

1. Introduction and time series deconstruction
   - Example
   - Deconstructing time series
   - Correlations, Fractional Brownian motions
2. Smoothness, Hurst exponent, and Brownian test
   - Our Brownian tests of hypothesis
   - Data
3. Results and conclusions
   - Charts and interpretation
   - Conclusions

Read the full article here.
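To make the smoothness discussion concrete, here is a minimal sketch (my own, not the article's test) of a standard Hurst exponent estimator, based on how the standard deviation of lagged differences scales with the lag:

```python
import numpy as np

def hurst_exponent(series, max_lag=20):
    """Estimate the Hurst exponent H from the scaling law
    std(x[t+k] - x[t]) ~ k**H. For Brownian motion H is close to 0.5;
    H > 0.5 indicates persistence (long-range positive correlations),
    H < 0.5 anti-persistence."""
    x = np.asarray(series, dtype=float)
    lags = np.arange(2, max_lag)
    tau = [np.std(x[lag:] - x[:-lag]) for lag in lags]
    # the slope of the log-log regression is the estimate of H
    return np.polyfit(np.log(lags), np.log(tau), 1)[0]
```

A crude test of Brownian behavior can then check whether the estimate departs markedly from 0.5; the article's actual hypothesis test is described in its section 2.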

I present here some innovative results from my most recent research on stochastic processes, chaos modeling, and dynamical systems, with applications to Fintech, cryptography, number theory, and random number generators. While covering advanced topics, this article is accessible to professionals with limited knowledge of statistical or mathematical theory. It introduces new material not covered in my recent book (available here) on applied stochastic processes. You don't need to read my book to understand this article, but the book is a nice complement and introduction to the concepts discussed here.

None of the material presented here is covered in standard textbooks on stochastic processes or dynamical systems. In particular, it has nothing to do with the classical logistic map or Brownian motions, though the systems investigated here exhibit very similar behaviors and are related to the classical models. This cross-disciplinary article is targeted at professionals with interests in statistics, probability, mathematics, machine learning, simulations, signal processing, operations research, computer science, pattern recognition, and physics. Because of its tutorial style, it should also appeal to beginners learning about Markov processes, time series, and data science techniques in general, offering fresh, off-the-beaten-path content not found anywhere else, contrasting with the material covered again and again in countless, identical books, websites, and classes catering to students and researchers alike. Some problems discussed here could be used by college professors in the classroom, or as original exam questions, while others are extremely challenging questions that could be the subject of a PhD thesis or even well beyond that level.

This article constitutes (along with my book) a stepping stone in my endeavor to solve one of the biggest mysteries in the universe: are the digits of mathematical constants such as Pi evenly distributed? To this day, no one knows if these digits even have a distribution to start with, let alone whether that distribution is uniform or not. Part of the discussion is about statistical properties of numeration systems in a non-integer base (such as the golden ratio base) and their applications. All systems investigated here, whether deterministic or not, are treated as stochastic processes, including the digits in question. They all exhibit strong chaos, albeit easily manageable due to their ergodicity. Interesting connections to the golden ratio, Fibonacci numbers, Pisano periods, special polynomials, Brownian motions, and other special mathematical constants are discussed throughout the article. All the analyses were done in Excel. You can download my spreadsheets from this article; all the results are replicable. Numerous illustrations are also provided.

Content of this article

1. General framework, notations and terminology
   - Finding the equilibrium distribution
   - Auto-correlation and spectral analysis
   - Ergodicity, convergence, and attractors
   - Space state, time state, and Markov chain approximations
   - Examples
2. Case study
   - First fundamental theorem
   - Second fundamental theorem
   - Convergence to equilibrium: illustration
3. Applications
   - Potential application domains
   - Example: the golden ratio process
   - Finding other useful b-processes
4. Additional research topics
   - Perfect stochastic processes
   - Characterization of equilibrium distributions (the attractors)
   - Probabilistic calculus and number theory, special integrals
5. Appendix
   - Computing the auto-correlation at equilibrium
   - Proof of the first fundamental theorem
   - How to find the exact equilibrium distribution
6. Additional Resources

Read the full article here.
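As a flavor of the "b-processes" mentioned in the table of contents, here is a minimal sketch (an illustration under my own assumptions, not code from the article's spreadsheets) of generating the digits of a seed in an arbitrary, possibly non-integer base b, such as the golden ratio:

```python
PHI = (1 + 5 ** 0.5) / 2  # golden ratio, an example of a non-integer base

def digits_in_base_b(seed, b=PHI, n=10):
    """Iterate x -> b*x mod 1, recording the digit floor(b*x) at each
    step. For 0 <= seed < 1 and b = 2, this recovers the binary digits
    of the seed; for a non-integer b it yields its base-b digits."""
    x = seed
    out = []
    for _ in range(n):
        d = int(b * x)
        out.append(d)
        x = b * x - d  # fractional part, i.e. b*x mod 1
    return out
```

For 1 < b < 2 (as with the golden ratio), every digit is 0 or 1, yet the successive digits form a strongly auto-correlated, chaotic sequence -- exactly the kind of system treated as a stochastic process in the article.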


Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don't exhibit well-separated clusters, and two human beings asked to visually tell the number of clusters by looking at a chart are likely to provide two different answers. Sometimes clusters overlap with each other, and large clusters contain sub-clusters, making a decision difficult.

For instance, how many clusters do you see in the picture below? What is the optimum number of clusters? No one can tell with certainty: not AI, not a human being, not an algorithm.

[Figure: How many clusters here? (source: see here)]

In the above picture, the underlying data suggests that there are three main clusters. But an answer such as 6 or 7 seems equally valid. A number of empirical approaches have been used to determine the number of clusters in a data set. They usually fit into two categories:

- Model fitting techniques: an example is using a mixture model to fit your data, and determining the optimum number of components; or using density estimation techniques and testing for the number of modes (see here). Sometimes the fit is compared with that of a model where observations are uniformly distributed over the entire support domain, thus with no cluster; you may have to estimate the support domain in question, and assume that it is not made of disjoint sub-domains; in many cases, the convex hull of your data set, as an estimate of the support domain, is good enough.
- Visual techniques: for instance, the silhouette or elbow rule (very popular).

In both cases, you need a criterion to determine the optimum number of clusters. In the case of the elbow rule, one typically uses the percentage of unexplained variance. This number is 100% with zero clusters, and it decreases (initially sharply, then more modestly) as you increase the number of clusters in your model. When each point constitutes its own cluster, this number drops to 0. Somewhere in between, the curve displaying your criterion exhibits an elbow (see picture below), and that elbow determines the number of clusters. For instance, in the chart below, the optimum number of clusters is 4.

[Figure: The elbow rule tells you that here, your data set has 4 clusters (elbow strength in red)]

Good references on the topic are available. Some R functions are available too, for instance fviz_nbclust. However, I could not find in the literature how the elbow point is explicitly computed. Most references mention that it is mostly hand-picked by visual inspection, or based on some predetermined but arbitrary threshold. In the next section, we solve this problem.

Read full article here.
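One simple heuristic for locating the elbow automatically (my own sketch, not necessarily the method derived in the full article) is to pick the point of maximum curvature of the criterion curve, approximated by the largest second difference:

```python
import numpy as np

def elbow_point(unexplained):
    """Pick the elbow of a decreasing criterion curve as the point of
    maximum discrete curvature (largest second difference).
    `unexplained[i]` is the percentage of unexplained variance with
    i + 1 clusters; the return value is the suggested cluster count."""
    v = np.asarray(unexplained, dtype=float)
    curvature = v[:-2] - 2 * v[1:-1] + v[2:]  # second difference
    return int(np.argmax(curvature)) + 2      # offset: interior points start at k = 2
```

For instance, on a curve that drops sharply up to 4 clusters and flattens afterwards, such as [100, 60, 30, 10, 8, 6, 5], this returns 4, in line with the chart described above.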

Many times, complex models are not enough (or too heavy), or not necessary, to extract great, robust, sustainable insights out of data. Deep analytical thinking may prove more useful, and can be done by people not necessarily trained in data science, even by people with limited coding experience. Here we explore what we mean by deep analytical thinking, using a case study, and show how it works: combining craftsmanship, business acumen, and the use and creation of tricks and rules of thumb, to provide sound answers to business problems. These skills are usually acquired by experience more than by training, and data science generalists (see here how to become one) usually possess them.

This article is targeted at data science managers and decision makers, as well as junior professionals who want to become one at some point in their career. Deep thinking, unlike deep learning, is also more difficult to automate, so it provides better job security. Those automating deep learning are actually the new data science wizards, who can think out of the box. Much of what is described in this article is also data science wizardry, not taught in standard textbooks or in the classroom. By reading this tutorial, you will learn and be able to use these data science secrets, and possibly change your perspective on data science. Data science is like an iceberg: everyone knows and can see the tip of the iceberg (regression models, neural nets, cross-validation, clustering, Python, and so on, as presented in textbooks). Here I focus on the unseen bottom, using a statistical level almost accessible to the layman, avoiding jargon and complicated math formulas, yet discussing a few advanced concepts.

Read full article here.

Content

1. Case Study: The Problem
2. Deep Analytical Thinking
   - Answering hidden questions
   - Business questions
   - Data questions
   - Metrics questions
3. Data Science Wizardry
   - Generic algorithm
   - Illustration with three different models
   - Results
4. A few data science hacks

In this data science article, emphasis is placed on science, not just on data. State-of-the-art material is presented in simple English, from multiple perspectives: applications, theoretical research asking more questions than it answers, scientific computing, machine learning, and algorithms. I attempt here to lay the foundations of a new statistical technology, hoping that it will plant the seeds for further research on a topic with a broad range of potential applications. It is based on mixture models. Mixtures have been studied and used in applications for a long time, and they are still a subject of active research. Yet you will find here plenty of new material.

Introduction and Context

In a previous article (see here) I attempted to approximate a random variable representing real data by a weighted sum of simple kernels, such as independently and identically distributed uniform random variables. The purpose was to build Taylor-like series approximations to more complex models (each term in the series being a random variable), to:

- avoid over-fitting,
- approximate any empirical distribution (the inverse of the percentiles function) attached to real data,
- easily compute data-driven confidence intervals regardless of the underlying distribution,
- derive simple tests of hypothesis,
- perform model reduction, optimize data binning to facilitate feature selection, and improve visualizations of histograms,
- create perfect histograms,
- build simple density estimators,
- perform interpolations, extrapolations, or predictive analytics,
- perform clustering and detect the number of clusters,
- create deep learning Bayesian systems.

While I found very interesting properties of stable distributions during this research project, I could not come up with a solution to all these problems. The fact is that these weighted sums would usually converge (in distribution) to a normal distribution if the weights did not decay too fast -- a consequence of the central limit theorem. And even when using uniform kernels (as opposed to Gaussian ones) with fast-decaying weights, they would converge to an almost symmetrical, Gaussian-like distribution. In short, very few real-life data sets could be approximated by this type of model.

Now, in this article, I offer a full solution, using mixtures rather than sums. The possibilities are endless.

Content of this article

1. Introduction and Context
2. Approximations Using Mixture Models
   - The error term
   - Kernels and model parameters
   - Algorithms to find the optimum parameters
   - Convergence and uniqueness of solution
   - Find near-optimum with fast, black-box step-wise algorithm
3. Example
   - Data and source code
   - Results
4. Applications
   - Optimal binning
   - Predictive analytics
   - Test of hypothesis and confidence intervals
   - Deep learning: Bayesian decision trees
   - Clustering
5. Interesting problems
   - Gaussian mixtures uniquely characterize a broad class of distributions
   - Weighted sums fail to achieve what mixture models do
   - Stable mixtures
   - Nested mixtures and Hierarchical Bayesian Systems
   - Correlations

Read full article here.
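To make the contrast concrete, here is a minimal EM sketch (my own illustration, not the optimization algorithm of the article) that fits a two-component one-dimensional Gaussian mixture -- the kind of model that captures bimodal or skewed data, which weighted sums of kernels cannot, for the central-limit reasons explained above:

```python
import numpy as np

def em_two_gaussians(x, n_iter=200):
    """Fit a 2-component 1-D Gaussian mixture by EM.
    Returns (weights, means, standard deviations)."""
    x = np.asarray(x, dtype=float)
    # crude but serviceable initialization from the quartiles
    mu = np.array([np.percentile(x, 25), np.percentile(x, 75)])
    sigma = np.array([x.std(), x.std()])
    w = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: posterior probability that each point belongs to each
        # component (the 1/sqrt(2*pi) factor cancels in the normalization)
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / sigma
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return w, mu, sigma
```

On data drawn from two well-separated components, this recovers both modes, something no weighted sum of i.i.d. kernels can do once the central limit theorem forces near-Gaussian behavior.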