Featured Blog Posts - AnalyticBridge2017-11-17T18:37:28Zhttps://www.analyticbridge.datasciencecentral.com/profiles/blog/feed?promoted=1&xn_auth=noHigh Precision Computing in Python or Rtag:www.analyticbridge.datasciencecentral.com,2017-11-14:2004291:BlogPost:3739902017-11-14T02:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Here we discuss an application of HPC (not high performance computing, but rather high precision computing, a special case of HPC) to dynamical systems such as the logistic map of chaos theory, defined by X(k) = 4 X(k-1) (1 - X(k-1)). </p>
<p>For all these systems, the loss of precision propagates exponentially, to the point that after 50 iterations, all generated values are completely wrong. Tons of articles have been written on this subject, yet none of them acknowledges the faulty numbers being used, as round-off errors propagate as fast as the chaos itself. This is an active research area with applications in population dynamics, physics, and engineering. The issue does not invalidate the published results, as most of them are theoretical in nature, and it does not impact the limiting distribution either: the faulty sequences behave like instances of processes that are re-seeded every 40 iterations or so due to round-off errors, and such processes behave the same way regardless of the seed. </p>
<p>The core of the discussion here is about how to write code that produces far more accurate numbers, whether in R, Python or other languages, using super precision. In short, which libraries should you use to handle such problems?</p>
<p>You can check out the context, Perl code, Python code, and an Excel spreadsheet that illustrates the issue, in this discussion. </p>
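<p>To make the precision loss concrete, here is a minimal Python sketch (ours, not the code from the linked discussion) using the standard decimal module; libraries such as mpmath in Python or Rmpfr in R provide similar arbitrary-precision arithmetic:</p>

```python
# Iterate the logistic map x -> 4 x (1 - x) at several precisions,
# using Python's standard decimal module for high precision computing.
from decimal import Decimal, getcontext

def logistic_orbit(x0, iterations, digits):
    """Iterate the logistic map with `digits` significant digits."""
    getcontext().prec = digits
    x = Decimal(x0)
    for _ in range(iterations):
        x = 4 * x * (1 - x)
    return x

n = 80
lo = logistic_orbit("0.1", n, 150)   # 150-digit run
hi = logistic_orbit("0.1", n, 300)   # 300-digit run

# The map's derivative is bounded by 4, so each step multiplies the
# relative error by at most 4: the 150-digit run loses at most
# n * log10(4), about 48 digits, and still agrees with the 300-digit
# run to roughly 100 digits.
print(abs(lo - hi))

# A plain float run (~16 significant digits) has lost all accuracy
# long before iteration 80:
x = 0.1
for _ in range(n):
    x = 4 * x * (1 - x)
print(x, float(hi))
```

<p>Printing the two high-precision runs side by side shows them agreeing to about a hundred digits, while the float run produces an unrelated value, which is exactly the 50-iteration breakdown described above.</p>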
<p><a href="https://www.datasciencecentral.com/forum/topics/question-how-precision-computing-in-python" target="_blank">Click here to read the full article</a>. </p>
<p></p>
<p><a href="http://api.ning.com:80/files/qUnr-syCGlU7aRNPCmJnSXXDfprErYzcP8ruveC1oDWDr*l8WweWVPB-Hwojnphq2Nn6lDp3LFOJ6jCYnwATNNA3-Wd7RkIg/fract.PNG" target="_self"><img src="http://api.ning.com:80/files/qUnr-syCGlU7aRNPCmJnSXXDfprErYzcP8ruveC1oDWDr*l8WweWVPB-Hwojnphq2Nn6lDp3LFOJ6jCYnwATNNA3-Wd7RkIg/fract.PNG" width="350" class="align-center"/></a></p>
<p style="text-align: center;"><em>This broccoli is an example of the self-replicating processes that could benefit from HPC</em></p>
<p style="text-align: left;"><b>DSC Resources</b></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>
<p style="text-align: left;"><b>Popular Articles</b></p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">What is Data Science? 24 Fundamental Articles Answering This Question</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python">Hitchhiker's Guide to Data Science, Machine Learning, R, Python</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
</ul>Linear Models Don’t have to Fit Exactly for P-Values To Be Accurate, Right, and Usefultag:www.analyticbridge.datasciencecentral.com,2017-11-03:2004291:BlogPost:3740452017-11-03T05:30:00.000ZChirag Shivalkerhttps://www.analyticbridge.datasciencecentral.com/profile/ChiragShivalker
<p>There is no need to get confused between multiple linear regression, the generalized linear model, and general linear methods. The general linear model, or multivariate regression model, is a statistical linear model written as <strong>Y = XB + U</strong>.</p>
<p><br/> <img src="http://api.ning.com:80/files/IQ0ncUB6iv46gmInhSMUmqksefRP62Vf8KPE3lkuGhZiemAfzakEd9AmEseSNzn-xnrbA0vdegJVmimvfisz2q3pZqCIboFh/linearmodelsdonthavetofitexactlyforpvaluestobeaccuraterightanduseful.jpg?width=750" width="750"/></p>
<p><br/> The term "linear model" covers a number of different statistical models, such as ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, the t-test, and the F-test. The GLM is a generalization of multiple linear regression to the case of more than one dependent variable. If Y, B, and U are column vectors, the matrix equation above reduces to a multiple linear regression.<br/> <br/> <span class="font-size-4">What are the key assumptions made in a multiple linear regression analysis?</span><br/> <br/> The independent variables and the outcome variable should have a linear relationship; scatterplots can be used to find out whether the relationship is linear or curvilinear.<br/></p>
<ul>
<li><strong>Multivariate Normality:</strong> Residuals are normally distributed, as is assumed in multiple regressions.</li>
<li><strong>No Multicollinearity:</strong> Independent variables are not correlated among themselves, as is assumed in multiple regressions. To test this assumption, the Variance Inflation Factor (VIF) is used.</li>
<li><strong>Homoscedasticity:</strong> Error terms have similar variance across the values of the independent variables. A plot of predicted values vs. standardized residuals shows whether the points are equally distributed across all values of the independent variables.</li>
</ul>
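<p>The multicollinearity check above can be illustrated with a short Python sketch. The <code>vif</code> helper below is our own minimal implementation with numpy (a library such as statsmodels offers a ready-made <code>variance_inflation_factor</code> as an alternative):</p>

```python
# Variance Inflation Factor: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes
# from regressing predictor j on all the other predictors.
import numpy as np

def vif(X):
    """Return the VIF of each column of the predictor matrix X."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        # regress column j on the other columns plus an intercept
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly collinear with x1
x3 = rng.normal(size=200)              # independent predictor
X = np.column_stack([x1, x2, x3])
print(vif(X))   # large VIFs for x1 and x2, near 1 for x3
```

<p>A common rule of thumb flags VIF values above 5 or 10 as evidence of problematic multicollinearity.</p>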
<p><br/> The best <a href="http://www.hitechbpo.com/market-research-and-data-analytics.php" target="_blank">data analytic solutions</a> automatically include assumption tests and plots when conducting a regression. Multiple regression requires at least two independent variables, which can be nominal, ordinal, or interval/ratio level variables. A rule of thumb for the sample size is at least 20 cases per independent variable in the analysis.<br/> <br/> <span class="font-size-4">Assumptions in your regression or ANOVA model</span><br/> <br/> We all know how important these assumptions are: if they are not met adequately, all the p-values become inaccurate, wrong, and useless. Yet linear models don't have to fit exactly for p-values to be accurate, right, and useful; they are fairly robust to departures from these assumptions. Statistics classes and coaching materials keep teaching both statements, contradictory as they may sound, which is enough to drive analysts crazy.<br/> <br/> <em>It is debatable whether statisticians cooked this stuff up to torture researchers (pun intended), or to satisfy their egos.</em><br/> <br/> Well, the reality is that they don't. Learning how far the assumptions can be pushed is not that hard a task when done with professional training and guidance, backed by some practice. <em>Listed below are a few of the mistakes researchers make because of one, or both, of the claims above.</em><br/> <br/> <strong><span class="font-size-3">1. The p-value as a feel-good factor</span></strong><br/> <br/> One way out is to avoid over-testing the assumptions. Statistical tests can help determine whether the assumptions are met adequately, and having a p-value feels reassuring: it avoids further complications, thanks to the golden rule of p&lt;.05.<br/> <br/> In no case should these tests ignore robustness. Assuming that every distribution is non-normal and heteroskedastic would be a mistake. Tools may prove helpful, but they are built to treat every data set as if it were a nail. The right thing to do is to use the hammer only when it is really required, not to hammer everything.<br/> <br/> <strong><span class="font-size-3">2. The GLM is robust, but not to all the assumptions</span></strong><br/> <br/> Here the researcher assumes that everything is robust and skips the tests. It is a common practice that succeeds most of the time, but there are instances when it does not work. The GLM is robust to deviations from some of the assumptions, but not all the way, and not for all of them. So check all of them without fail.<br/> <br/> <strong><span class="font-size-3">3. Testing the wrong assumptions</span></strong><br/> <br/> Testing the wrong assumptions is another mistake researchers make. Look at any two regression books and you will find two different sets of assumptions.<br/> <br/> <span class="font-size-4">Testing the related, but wrong, thing</span><br/> <br/> Several of these "assumptions" do need to be checked, but they are not model assumptions; they are data challenges. Reference guides try to make the checks more logical but, in the attempt, lead you to test something related, but wrong. That works out most of the time, but not always.</p>Information Retrieval Document Search Engine in Rtag:www.analyticbridge.datasciencecentral.com,2017-11-07:2004291:BlogPost:3739452017-11-07T13:30:00.000Zsuresh kumar Gorakalahttps://www.analyticbridge.datasciencecentral.com/profile/sureshkumarGorakala
<h3>Introduction:</h3>
<p><span>In this post, we learn how to build a basic search engine, or document retrieval system, using the vector space model, which is widely used in information retrieval systems. Given a set of documents and a search query, we need to retrieve the documents that are most similar to the query. </span></p>
<h3>Problem statement:</h3>
<p><span>The problem statement explained above is illustrated in the image below. </span></p>
<div style="text-align: center;"><a href="http://3.bp.blogspot.com/-w96aVPMy198/WfgQTqkmI7I/AAAAAAAAGAQ/zd3iZfVh0_E5Yy2d_1WSs5N8RB8rPJRbwCK4BGAYYCw/s1600/information%2Bretrieval_1.PNG" target="_blank"><img src="https://3.bp.blogspot.com/-w96aVPMy198/WfgQTqkmI7I/AAAAAAAAGAQ/zd3iZfVh0_E5Yy2d_1WSs5N8RB8rPJRbwCK4BGAYYCw/s320/information%2Bretrieval_1.PNG?width=320" width="320" class="align-center"/></a><em>Document retrieval system</em></div>
<p></p>
<p>The high-level design of the document search system is shown below:</p>
<div class="slate-resizable-image-embed slate-image-embed__resize-middle"><a href="https://media.licdn.com/mpr/mpr/AAIA_wDGAAAAAQAAAAAAAAt5AAAAJDViMmE0Y2NhLTg4ZmYtNDNiOC1hYWQxLTM2ZTkwOWM0NGU0Mg.png" target="_blank"><img src="https://media.licdn.com/mpr/mpr/AAIA_wDGAAAAAQAAAAAAAAt5AAAAJDViMmE0Y2NhLTg4ZmYtNDNiOC1hYWQxLTM2ZTkwOWM0NGU0Mg.png" class="align-center"/></a></div>
<p></p>
<p>The content of the post is as follows:</p>
<ul>
<li>Explaining various techniques used in information retrieval, such as vector space models, the term-document matrix, and similarity score calculation</li>
<li>Data description </li>
<li>High level design of the document search system</li>
<li>Code implementation in R</li>
</ul>
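<p>The post's implementation is in R; as a language-neutral sketch of the same pipeline (a term-document matrix with TF-IDF weights, then cosine similarity between the query vector and each document vector), here is a toy Python version, which is our own illustration rather than the post's code:</p>

```python
# Toy document search: TF-IDF term-document matrix + cosine similarity.
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "dogs and cats living together",
    "the stock market crashed today",
]

def make_vectorizer(docs):
    """Build a TF-IDF vectorizer over the corpus vocabulary."""
    vocab = sorted({w for d in docs for w in d.split()})
    df = {w: sum(w in d.split() for d in docs) for w in vocab}
    n = len(docs)
    def vec(text):
        tf = Counter(text.split())
        # smoothed idf; words outside the vocabulary are ignored
        return [tf[w] * math.log((1 + n) / (1 + df[w])) for w in vocab]
    return vec

vec = make_vectorizer(docs)
doc_vecs = [vec(d) for d in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def search(query):
    """Return the index of the document most similar to the query."""
    q = vec(query)
    scores = [cosine(q, d) for d in doc_vecs]
    return max(range(len(docs)), key=scores.__getitem__)

print(docs[search("cat on a mat")])   # -> "the cat sat on the mat"
```

<p>In R, the same term-document matrix and similarity scores are typically produced with the tm and text-mining tooling described in the linked post.</p>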
<p>Please go through the complete blog post at the location below:</p>
<p><a href="http://www.dataperspective.info/2017/11/information-retrieval-document-search-using-vector-space-model-in-r.html">http://www.dataperspective.info/2017/11/information-retrieval-document-search-using-vector-space-model-in-r.html</a></p>Fascinating Time Series with Cool Applicationstag:www.analyticbridge.datasciencecentral.com,2017-11-07:2004291:BlogPost:3741632017-11-07T03:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Here we describe well-known chaotic sequences, including new generalizations, with applications to random number generation, highly non-linear auto-regressive models for time series, simulation, random permutations, and the use of big numbers (libraries available in programming languages to work with numbers with hundreds of decimals), as standard computer precision almost always produces completely erroneous results after a few iterations -- a fact rarely if ever mentioned in the scientific literature, but illustrated here, together with a solution. It is possible that all scientists who have published on chaotic processes used faulty numbers because of this issue.</p>
<p>This article is accessible to non-experts, even though we solve a special stochastic equation for the first time, providing an unexpected exact solution, for a new chaotic process that generalizes the logistic map. We also describe a general framework for continuous random number generators, and investigate the interesting auto-correlation structure associated with some of these sequences. References are provided, as well as fast source code to process big numbers accurately, and even an elegant mathematical proof in the last section.</p>
<p><a href="https://api.ning.com/files/q03Iqwj-1ZoSi1l9qr9Uib38qslBZxgjww5CHJGfTiguEkEspF0ev5*PHK8F6jozOMo4*a0xybbVPIq6sfb5y2ywOHOAjRsP/Capture.PNG" target="_self"><img src="https://api.ning.com/files/q03Iqwj-1ZoSi1l9qr9Uib38qslBZxgjww5CHJGfTiguEkEspF0ev5*PHK8F6jozOMo4*a0xybbVPIq6sfb5y2ywOHOAjRsP/Capture.PNG" class="align-center"/></a>This article is also a useful read for participants in our <a href="https://www.analyticbridge.datasciencecentral.com/profiles/blogs/interesting-probability-problem-for-serious-geeks" target="_blank">upcoming competition</a> (to be announced soon) as it addresses a similar stochastic integral equation problem, also with exact solution, in the related context of self-correcting random walks - another kind of memory-less process. </p>
<p>The approach used here starts with traditional data science and simulations for exploratory analysis, with empirical results confirmed later by mathematical arguments in the last section. </p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/amazing-random-sequences-with-cool-applications" target="_blank">Read the full article here</a>. </p>Interesting Problem: Self-correcting Random Walkstag:www.analyticbridge.datasciencecentral.com,2017-10-04:2004291:BlogPost:3721092017-10-04T20:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><em>Section 3 was added on October 11. Section 4 was added on October 19. A $2,000 award is offered to solve any of the open questions, <a href="https://www.datasciencecentral.com/profiles/blogs/dsc-competition-for-data-scientists-and-quantitative-experts-quan" target="_blank">click here for details</a>. </em></p>
<p>This is another off-the-beaten-path problem, one that you won't find in textbooks. You can solve it using data science methods (my approach) but a mathematician with some spare time could find an elegant solution. Share it with your colleagues to see how math-savvy they are, or with your students. I was able to make substantial progress in 1-2 hours of work using Excel alone, though I haven't found a final solution yet (maybe you will). My Excel spreadsheet with all computations is accessible from this article. You don't need a deep statistical background to quickly discover some fun and interesting results playing with this stuff. Computer scientists, software engineers, quants, BI and analytic professionals, from beginners to veterans, will all be able to enjoy it!</p>
<p><a href="https://i.imgur.com/nvHjav6.png" target="_blank"><img src="https://i.imgur.com/nvHjav6.png?width=391" width="391" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source for picture: <a href="http://demonstrations.wolfram.com/ConstrainedRandomWalk/" target="_blank">2-D constrained random walk</a> (snapshot - video available <a href="https://www.youtube.com/watch?v=W9jktqV3_Mc" target="_blank">here</a>)</em></p>
<p><span class="font-size-4"><strong>1. The problem</strong></span></p>
<p>We are dealing with a stochastic process barely more complicated than a random walk. Random walks are also called <em>drunken walks</em>, as they represent the path of a drunken guy moving left and right seemingly randomly, and getting lost over time. Here the process is called <em>self-correcting random walk</em> or also <em>reflective random walk</em>, and is related to <em><a href="https://arxiv.org/abs/1303.3655" target="_blank">controlled random walks</a></em>, and <em><a href="http://demonstrations.wolfram.com/ConstrainedRandomWalk/" target="_blank">constrained random walks</a></em> (see also <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2648134" target="_blank">here</a>)<em> </em>in the sense that the walker, less drunk than in a random walk, is able to correct any departure from a straight path, more and more over time, by either slightly over- or under-correcting at each step. One of the two model parameters (the positive parameter <em>a</em>) represents how drunk the walker is, with <em>a</em> = 0 being the worst. Unless <em>a</em> = 0, the amplitude of the corrections decreases over time to the point that eventually (after many steps) the walker walks almost straight and arrives at his destination. This model represents many physical processes, for instance the behavior of a stock market somewhat controlled by a government to avoid bubbles and implosions, or when it hits a symbolic threshold and has a hard time breaking through. It is defined as follows:</p>
<p>Let's start with <em>X</em>(1) = 0, and define <em>X</em>(<em>k</em>) recursively as follows, for <em>k</em> > 1:</p>
<p><a href="https://i.imgur.com/ozpssO4.png" target="_blank"><img src="https://i.imgur.com/ozpssO4.png?width=327" width="327" class="align-center"/></a></p>
<p><a href="https://i.imgur.com/0UkIUnK.png" target="_blank"><img src="https://i.imgur.com/0UkIUnK.png?width=324" width="324" class="align-center"/></a></p>
<p>and let's define <em>U</em>(<em>k</em>), <em>Z</em>(<em>k</em>), and <em>Z</em> as follows:</p>
<p><a href="https://i.imgur.com/ZtV2AJl.png" target="_blank"><img src="https://i.imgur.com/ZtV2AJl.png?width=120" width="120" class="align-center"/></a></p>
<p><a href="https://i.imgur.com/H1iOwc3.png" target="_blank"><img src="https://i.imgur.com/H1iOwc3.png?width=116" width="116" class="align-center"/></a></p>
<p><a href="https://i.imgur.com/jT3Y5xF.png" target="_blank"><img src="https://i.imgur.com/jT3Y5xF.png?width=113" width="113" class="align-center"/></a></p>
<p>where the <em>V</em>(<em>k</em>)'s are deviates from <em>independent</em> uniform variables on [0, 1], obtained for instance using the function RAND in Excel. So there are two <em>positive</em> parameters in this problem, <em>a</em> and <em>b</em>, and <em>U</em>(<em>k</em>) is always between 0 and 1. When <em>b</em> = 1, the <em>U</em>(<em>k</em>)'s are just standard uniform deviates, and if <em>b</em> = 0, then <em>U</em>(<em>k</em>) = 1. The case <em>a</em> = <em>b</em> = 0 is degenerate and should be ignored. The case <em>a</em> > 0 and <em>b</em> = 0 is of special interest, and it is a number theory problem in itself, <a href="http://www.datasciencecentral.com/profiles/blogs/new-representation-of-numbers-with-very-fast-converging-fractions" target="_blank">related to this problem</a> when <em>a</em> = 1. Also, just like in random walks or Markov chains, the <em>X</em>(<em>k</em>)'s are not independent; they are indeed highly auto-correlated.</p>
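<p>The exact recursion is given in the formulas displayed above; as an illustration only, here is a simulation sketch for the case <em>a</em> = 0 and <em>b</em> = 1. The update rule used below (the walker steps a uniform amount toward 0 at each iteration) is our own reading of the definition, so treat it as a sketch to be checked against the formulas above rather than as the article's exact process:</p>

```python
# Simulate the self-correcting (reflective) random walk for a = 0, b = 1.
# NOTE: the update rule below is one reading of the definition in the
# formulas above: at each step the walker moves a uniform amount U(k)
# back toward 0. It is an illustration, not the article's exact code.
import random

random.seed(42)
x, samples = 0.0, []
for k in range(200_000):
    u = random.random()              # U(k): standard uniform, since b = 1
    x = x - u if x > 0 else x + u    # correct the departure toward 0
    samples.append(x)                # with a = 0, Z(k) = X(k)

burn = samples[1000:]                # discard the transient
mean = sum(burn) / len(burn)
var = sum(z * z for z in burn) / len(burn) - mean ** 2
print(min(burn), max(burn))          # stays within [-1, 1]
print(mean, var)                     # near 0 and 1/6 (triangular limit)
```

<p>Under this rule the walk provably stays in [-1, 1], and the empirical mean and variance of the samples match the symmetric triangular limit (mean 0, variance 1/6) discussed in the hints below.</p>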
<p>Prove that if <em>a</em> < 1, then <em>X</em>(<em>k</em>) converges to 0 as <em>k</em> increases. Under the same condition, prove that the limiting distribution <em>Z</em></p>
<ul>
<li>always exists (note: if <em>a</em> > 1, <em>X</em>(<em>k</em>) may not converge to zero, causing a drift and asymmetry),</li>
<li>always takes values between -1 and +1, with min(<em>Z</em>) = -1 and max(<em>Z</em>) = +1,</li>
<li>is symmetric, with mean and median equal to 0</li>
<li>and does not depend on <em>a,</em> but only on <em>b.</em></li>
</ul>
<p>For instance, for <em>b</em> = 1, <em>a</em> = 0 yields the same triangular distribution for <em>Z</em> as any <em>a</em> > 0.</p>
<p>If <em>a</em> < 1 and <em>b</em> = 0, (the non-stochastic case) prove that </p>
<ul>
<li><em>Z</em> can only take 3 values: -1 with probability 0.25, +1 with probability 0.25, and 0 with probability 0.50</li>
<li>If <em>U</em>(<em>k</em>) and <em>U</em>(<em>k</em>+1) have the same sign, then <em>U</em>(<em>k</em>+2) is of opposite sign </li>
</ul>
<p>And here is a more challenging question: In general, what is the limiting distribution of <em>Z</em>? Also, what happens if you replace the <em>U</em>(<em>k</em>)'s with (say) Gaussian deviates? Or with <em>U</em>(<em>k</em>) = | sin (<em>k</em>*<em>k</em>) | which has a somewhat random behavior?</p>
<p><span class="font-size-4"><strong>2. Hints to solve this problem</strong></span></p>
<p>It is necessary to use a decent random number generator to perform simulations. Even with Excel, plotting the empirical distribution of <em>Z</em>(<em>k</em>) for large values of <em>k</em>, and matching the kurtosis, variance and empirical percentiles with those of known statistical distributions, one quickly notices that when <em>b</em> = 1 (and even if <em>a</em> = 0) the limiting distribution <em>Z</em> is well approximated by a symmetric triangular distribution on [-1, 1], and thus centered on 0, with a kurtosis of -3/5 and a variance of 1/6. In short, this is the distribution of the difference of two uniform random variables on [0, 1]. In other words, it is the distribution of <em>U</em>(3) - <em>U</em>(2). Of course, this needs to be proved rigorously. Note that the limiting distribution <em>Z</em> can be estimated by computing the values <em>Z</em>(<em>n</em>+1), ..., <em>Z</em>(<em>n</em>+<em>m</em>) for large values of <em>n</em> and <em>m</em>, using just one instance of this simulated stochastic process.</p>
<p>Does it generalize to other values of <em>b</em>? That is, does <em>Z</em> always have the distribution of <em>U</em>(3) - <em>U</em>(2)? Obviously not for the case <em>b</em> = 0. But it could be a function, combination and/or mixture of <em>U</em>(3), -<em>U</em>(2), and <em>U</em>(3) - <em>U</em>(2). This works both for <em>b</em> = 0 and <em>b</em> = 1.</p>
<p><a href="https://i.imgur.com/EXQSt5r.png" target="_blank"><img src="https://i.imgur.com/EXQSt5r.png?width=516" width="516" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 1</strong>: Mixture-like distribution of Z (estimated) when b = 0.01 and a = 0.8</em></p>
<p>Interestingly, for small values of <em>b</em>, the limiting distribution <em>Z</em> looks like a mixture of (barely overlapping) simple distributions. So it could be used as a statistical model in clustering problems, each component of the mixture representing a cluster. See Figure 1.</p>
<p><a href="https://i.imgur.com/5IgW1Ue.png" target="_blank"><img src="https://i.imgur.com/5IgW1Ue.png?width=513" width="513" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 2</strong>: Triangular distribution of Z (estimated) when b = 1 and a = 0.8</em></p>
<p>The spreadsheet with all computations and model fitting <a href="http://api.ning.com:80/files/Dwi1glrkjL4qs53cOt86PQxvltEtxn1XmbA2FwwACaxvwM1H9YnK2J7MVSKaLvRpmYxd9JAWsHFqdckrys8YvoYCf0tJLS-c/ControlledRandomWalk.xlsx" target="_self">can be downloaded here</a>. </p>
<p><span class="font-size-4"><strong>3. Deeper dive</strong></span></p>
<p>So far, my approach has been data science oriented: it looks more like guesswork. Here I switch to mathematics, to try to derive the distribution of <em>Z</em>. Since it does not depend on the parameter <em>a</em>, let us assume here that <em>a</em> = 0. Note that when <em>a</em> = 0, <em>X</em>(<em>k</em>) does not converge to zero; instead <em>X</em>(<em>k</em>) = <em>Z</em>(<em>k</em>) and both converge in distribution to <em>Z</em>. It is obvious that <em>X</em>(<em>k</em>) is a mixture of distributions, namely <em>X</em>(<em>k</em>-1) + <em>U</em>(<em>k</em>) and <em>X</em>(<em>k</em>-1) - <em>U</em>(<em>k</em>). Since <em>X</em>(<em>k</em>-1) is in turn a mixture, <em>X</em>(<em>k</em>) is actually a mixture of mixtures, and so on. In short, it has the distribution of some nested mixtures.</p>
<p>As a starting point, it would be interesting to study the variance of <em>Z</em> (the expectation of Z is equal to 0.) The following formula is incredibly accurate for any value of <em>b</em> between 0 and 1, and even beyond. It is probably an exact formula, not an approximation. It was derived using the tentative density function obtained at the bottom of this section, for Z:</p>
<p><a href="https://i.imgur.com/LMSP0Py.png" target="_blank"><img src="https://i.imgur.com/LMSP0Py.png?width=174" width="174" class="align-center"/></a></p>
<p>It is possible to obtain a functional equation for the distribution <em>P</em>(<em>Z</em> &lt; <em>z</em>), using the equations that define <em>X</em>(<em>k</em>) in section 1, with <em>a</em> = 0, and letting <em>k</em> tend to infinity. It starts with </p>
<p><a href="https://i.imgur.com/W9rNj3W.png" target="_blank"><img src="https://i.imgur.com/W9rNj3W.png?width=460" width="460" class="align-center"/></a></p>
<p>Let's introduce <em>U</em> as a random variable with the same distribution as <em>U</em>(<em>k</em>) or <em>U</em>(2). As <em>k</em> tends to infinity, and separating the two cases <em>x</em> negative and <em>x</em> positive, we get</p>
<p><a href="https://i.imgur.com/JfMsTMQ.png" target="_blank"><img src="https://i.imgur.com/JfMsTMQ.png?width=519" width="519" class="align-center"/></a></p>
<p>Taking advantage of symmetries, this can be further simplified to </p>
<p><a href="https://i.imgur.com/fKjVw3Z.png" target="_blank"><img src="https://i.imgur.com/fKjVw3Z.png?width=394" width="394" class="align-center"/></a></p>
<p>where <em>F</em> represents the distribution function, <em>f</em> represents the density function, and <em>U</em> has the same distribution as <em>U</em>(2), that is</p>
<p><a href="https://i.imgur.com/0VQLhmv.png" target="_blank"><img src="https://i.imgur.com/0VQLhmv.png?width=216" width="216" class="align-center"/></a></p>
<p>Taking the derivative with respect to <em>z</em>, the functional equation becomes the following <a href="https://en.wikipedia.org/wiki/Fredholm_integral_equation" target="_blank">Fredholm integral equation</a>, the unknown being <em>Z</em>'s density function:</p>
<p><a href="https://i.imgur.com/Js26YkK.png" target="_blank"><img src="https://i.imgur.com/Js26YkK.png?width=352" width="352" class="align-center"/></a></p>
<p>We have the following particular cases:</p>
<ul>
<li>When <em>b</em> tends to zero, the distribution of <em>Z</em> converges to a uniform law on [-1, 1] thus with a variance equal to 1/3. </li>
<li>When <em>b</em> = 1/2, <em>Z</em> has a parabolic distribution on [-1, +1], defined by P(<em>Z</em> < <em>z</em>) = (2 + 3<em>z</em> - <em>z</em>^3)/4. This needs to be proved, for instance by plugging this parabolic distribution in the functional equation, and checking that the functional equation is verified if <em>b</em> = 1/2. However, a constructive proof would be far more interesting.</li>
<li>When <em>b</em> = 1, <em>Z</em> has the triangular distribution discussed earlier. The density function for <em>Z</em>, defined as the derivative of P(<em>Z</em> < <em>z</em>) with respect to <em>z</em>, is equal to 1 - |<em>z</em>| when <em>b</em> = 1, and 3 (1 - <em>z</em>^2) / 4 when <em>b</em> = 1/2.</li>
</ul>
<p>So for <em>b</em> = 1, <em>b</em> = 1/2, or the limiting case <em>b</em> = 0, we have the following density for <em>Z</em>, defined on [-1, 1]:</p>
<p><a href="https://i.imgur.com/dTxllJB.png" target="_blank"><img src="https://i.imgur.com/dTxllJB.png?width=211" width="211" class="align-center"/></a></p>
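<p>These three special cases can be checked numerically. The sketch below (our own sanity check, not from the article) verifies that each candidate density integrates to 1 on [-1, 1] and reproduces the variances implied by the text: 1/6 for the triangular case, 1/3 for the uniform limit, and, by direct integration of 3(1 - z^2)/4, 1/5 for the parabolic case:</p>

```python
# Numerically check the candidate densities of Z for the three special
# cases: each must integrate to 1 on [-1, 1], and the variances should
# be 1/6 (b = 1), 1/5 (b = 1/2), and 1/3 (b -> 0).
import numpy as np

def trapezoid(y, x):
    """Plain trapezoidal rule (avoids numpy version differences)."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2)

z = np.linspace(-1, 1, 200_001)          # fine grid; includes the kink at 0
densities = {
    "b = 1 (triangular)":  (1 - np.abs(z),        1 / 6),
    "b = 1/2 (parabolic)": (3 * (1 - z ** 2) / 4, 1 / 5),
    "b -> 0 (uniform)":    (np.full_like(z, 0.5), 1 / 3),
}
results = {}
for name, (f, expected_var) in densities.items():
    total = trapezoid(f, z)              # total probability mass
    var = trapezoid(z ** 2 * f, z)       # variance, since E[Z] = 0 by symmetry
    results[name] = (total, var, expected_var)
    print(name, round(total, 6), round(var, 6))
```

<p>All three totals come out as 1 and the variances match, consistent with the b = 0 and b = 1 limits discussed above.</p>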
<p>Is this formula valid for any <em>b</em> between 0 and 1? This is still an open question. The functional equation applies regardless of <em>U</em>'s distribution though, even if exponential or Gaussian. The complexity in the cases discussed here arises from the fact that <em>U</em>'s density is not smooth enough, due to its bounded support domain [0, 1] (outside the support domain, the density is equal to 0.) A potential more generic version of the previous formula would be:</p>
<p><a href="https://i.imgur.com/8AoRsHl.png" target="_blank"><img src="https://i.imgur.com/8AoRsHl.png?width=345" width="345" class="align-center"/></a></p>
<p>where <em>E</em> denotes the expectation. However, I haven't checked whether and under which conditions this formula is correct or not, except for the particular cases of <em>U</em> discussed here. One of the requirements is that the support domain for <em>U</em> is [0, 1]. If this formula is not exact in general, it might still be a good approximation in some cases. </p>
<p><span class="font-size-4"><strong>4. Potential Areas of Research</strong></span></p>
<p>Here are a few interesting topics for research:</p>
<ul>
<li>Develop a 2-D or 3-D version of this process, and investigate potential applications in thermodynamics or statistical mechanics, for instance modeling the movements of gas molecules in a cube as the temperature goes down (<em>a</em> > 0) or stays constant (<em>a</em> = 0), with a comparison to other stochastic processes used in similar contexts.</li>
<li>Continuous version of the discrete reflective random walk investigated here, with <em>a</em> = 0, and increments <em>X</em>(<em>k</em>) - <em>X</em>(<em>k</em>-1) being infinitesimally small, following a Gaussian rather than uniform distribution. The limiting un-constrained case is known as a <em><a href="https://en.wikipedia.org/wiki/Wiener_process" target="_blank">Wiener process</a></em> or <em>Brownian motion.</em> What happens if this process is also constrained to lie between -1 and +1 on the Y-axis? This would define a reflected Wiener process, <a href="https://en.wikipedia.org/wiki/Reflected_Brownian_motion" target="_blank">see also here</a> for a similar process, and also <a href="https://en.wikipedia.org/wiki/Reflection_principle_(Wiener_process)" target="_blank">here</a>.</li>
<li>Another direction is to consider the one-dimensional process as time series (which economists do) and to study the multivariate case, with multiple cross-correlated time series.</li>
<li>For the data scientist, it would be worth checking, based on cross-validation, whether and when my process provides a better model fit than, say, a random walk (after removing any trend or drift) when applied to publicly available stock market data, leading to more accurate predictions and thus better stock trading ROI.</li>
</ul>
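<p>The reflected Wiener process suggested in the second bullet is easy to explore by simulation. Below is a minimal sketch, assuming a plain Euler scheme with Gaussian increments and reflection at the boundaries ±1; the step size, step count and seed are arbitrary choices, not values from this article:</p>

```python
import random

def reflected_brownian(n_steps=10000, dt=1e-4, seed=42):
    """Simulate a Wiener process reflected at -1 and +1 (Euler scheme)."""
    rng = random.Random(seed)
    x, path = 0.0, [0.0]
    for _ in range(n_steps):
        x += rng.gauss(0.0, dt ** 0.5)   # Gaussian increment, variance dt
        # Fold any overshoot back into [-1, 1] (mirror reflection)
        while x > 1 or x < -1:
            if x > 1:
                x = 2 - x
            else:
                x = -2 - x
        path.append(x)
    return path

path = reflected_brownian()
assert all(-1 <= v <= 1 for v in path)   # the constraint holds everywhere
```

<p>Histogramming many such paths at a fixed large time would give an empirical view of the limiting distribution, to compare against the unconstrained Brownian case.</p>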
<p>This is the kind of mathematics used by Wall Street quants and in operations research. Hopefully my presentation here is much less arcane than the traditional literature on the subject, and accessible to a much broader audience, even though it features the complex equations characterizing such a process (and even hints at a mathematical proof that is not as difficult as it might seem at first glance, supported by simulations). Note that my reflective random walk is not a true random walk in the classical sense of the term: a better name might be appropriate. </p>
<p><span class="font-size-4"><strong>5. Solution for the (non-stochastic) case <em>b</em> = 0</strong></span></p>
<p><a href="https://www.linkedin.com/in/andrei-chtcheprov-5750731/" target="_blank">Andrei Chtcheprov</a> submitted the following statements, with proof:</p>
<ul>
<li>If <em>a</em> ≤ 1, then the sequence {<em>X</em>(<em>k</em>)} converges to zero.</li>
<li>If <em>a</em> = 3, {<em>X</em>(<em>k</em>)} converges to <a href="https://en.wikipedia.org/wiki/Ap%C3%A9ry%27s_constant" target="_blank">Zeta(3)</a> - 5/4 ≈ -0.048.</li>
<li>If <em>a</em> = 4, {<em>X</em>(<em>k</em>)} converges to (Pi^4 / 90) - 9/8 ≈ -0.043.</li>
</ul>
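<p>The two constants above are easy to confirm numerically. The sketch below uses standard double precision, approximating Zeta(3) by direct summation; for pushing this further, a high-precision library such as mpmath would be the natural tool, in the spirit of high precision computing:</p>

```python
import math

# Zeta(3) via direct summation; the truncation error is about 1/(2N^2)
N = 10**6
zeta3 = sum(1.0 / k**3 for k in range(1, N + 1))

limit_a3 = zeta3 - 5/4              # case a = 3
limit_a4 = math.pi**4 / 90 - 9/8    # case a = 4, using Zeta(4) = Pi^4/90

print(round(limit_a3, 3))  # -0.048
print(round(limit_a4, 3))  # -0.043
```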
<p>You can read his proof <a href="http://api.ning.com:80/files/PfhqkPs7YPvl4V1MivGJKU3tNLp0vl95LdWaYCuNLSaKMo1u518lZfDwbb*rweKP2Ls7MSm353kpMU5KHJbnvqD9fsPfBKZl/proof.txt" target="_self">here</a>. Much more can be explored regarding the case <em>b</em> = 0. For instance, when <em>a</em> = 1 and <em>b</em> = 0, the problem is similar <a href="http://www.datasciencecentral.com/profiles/blogs/new-representation-of-numbers-with-very-fast-converging-fractions" target="_blank">to this one</a>, where we try to approximate the number 2 by converging sums of elementary positive fractions without ever crossing the boundary Y = 2, staying below it at all times. Here, by contrast, we try to approximate 0, also by converging sums of the same elementary fractions, but allowing each term to be either positive or negative, thus crossing the boundary Y = 0 very regularly. The alternation of the signs of <em>X</em>(<em>k</em>) is a problem of interest in its own right: it shows strong patterns. </p>
<p><em>To include mathematical formulas in this article, I used <a href="http://www.hostmath.com/" target="_blank">this app</a>. Those interested in winning the award by offering a theoretical solution should read <a href="https://www.datasciencecentral.com/profiles/blogs/amazing-random-sequences-with-cool-applications" target="_blank">this article</a>, where I solved another stochastic integral equation of similar complexity (with mathematical proof), in a related context (chaotic systems).</em></p>
<p><em>For related articles from the same author, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank">click here</a> or visit <a href="http://www.vincentgranville.com/" target="_blank">www.VincentGranville.com</a>. Follow me on Twitter at <a href="https://twitter.com/granvilleDSC" target="_blank">@GranvilleDSC</a> or <a href="https://www.linkedin.com/in/vincentg/" target="_blank">on LinkedIn</a>.</em></p>
<p><span class="font-size-4"><b>DSC Resources</b></span></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>
<p><b>Popular Articles</b></p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">What is Data Science? 24 Fundamental Articles Answering This Question</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python">Hitchhiker's Guide to Data Science, Machine Learning, R, Python</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
</ul>Book on Computer Programmingtag:www.analyticbridge.datasciencecentral.com,2017-10-20:2004291:BlogPost:3720152017-10-20T02:00:00.000ZMark McIlroyhttps://www.analyticbridge.datasciencecentral.com/profile/MarkMcIlroy
<p>Data scientists use a range of tools in their work and some of these eventually require programming. This book, titled The Art and Craft of Computer Programming, is a guide to computer programming. It does not focus on a specific programming language, but instead contains the essential material from a first year Computer Science course. The book is available from Amazon.com.</p>
<p><a href="http://api.ning.com:80/files/tUZQu0G2R7cPuxOeM6pkCe1U1PLDfdvyPkbejs6cdUvfHuiVD5XnQWaSI8Wuz5DQxBNTES4yowNSFzVEbtwCLUNAngaCeTB7/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/tUZQu0G2R7cPuxOeM6pkCe1U1PLDfdvyPkbejs6cdUvfHuiVD5XnQWaSI8Wuz5DQxBNTES4yowNSFzVEbtwCLUNAngaCeTB7/Capture.PNG" width="554" class="align-center"/></a></p>
<p><strong>Contents</strong></p>
<p>1. Prelude 5</p>
<p>2. Program Structure 6</p>
<p>2.1. Procedural Languages 6<br/> 2.2. Declarative Languages 17<br/> 2.3. Other Languages 20</p>
<p>3. Topics from Computer Science 21</p>
<p>3.1. Execution Platforms 21<br/> 3.2. Code Execution Models 26<br/> 3.3. Data structures 29<br/> 3.4. Algorithms 44<br/> 3.5. Techniques 66<br/> 3.6. Code Models 85<br/> 3.7. Data Storage 100<br/> 3.8. Numeric Calculations 116<br/> 3.9. System Security 132<br/> 3.10. Speed & Efficiency 136</p>
<p>4. The Craft of Programming 157</p>
<p>4.1. Programming Languages 157<br/> 4.2. Development Environments 167<br/> 4.3. System Design 170<br/> 4.4. Software Component Models 180<br/> 4.5. System Interfaces 184<br/> 4.6. System Development 191<br/> 4.7. System evolution 210<br/> 4.8. Code Design 217<br/> 4.9. Coding 233<br/> 4.10. Testing 262<br/> 4.11. Debugging 275<br/> 4.12. Documentation 285</p>
<p>5. Appendix A - Summary of operators 286</p>What Kind of OLAP Do We Really Need?tag:www.analyticbridge.datasciencecentral.com,2017-10-09:2004291:BlogPost:3724382017-10-09T10:00:00.000ZJIANG Buxinghttps://www.analyticbridge.datasciencecentral.com/profile/HANLijun
<p><b>The narrow-sensed OLAP</b></p>
<p>OLAP is part and parcel of a BI application. As the name suggests, the term is an acronym for online analytical processing. Users, frontline employees to be precise, are responsible for performing various types of data processing online. </p>
<p><b>But the concept of OLAP tends to be used in a very narrow sense</b>. It has almost become a synonym for multidimensional analysis. Based on a prebuilt data cube, the analysis performs summarization according to specified dimensions/levels and presents the aggregate values as a table or a diagram. It uses drilldown, aggregation, rotation, and slicing to change the dimensions/levels and the summarization range. The idea behind multidimensional analysis is this: broad, high-level aggregate results are too coarse to give good insight into an issue; instead, data needs to be sliced into smaller parts and drilled down to more detailed, deeper levels to serve a more valuable analytical purpose.</p>
<p><b>The broad-sensed OLAP</b></p>
<p>Is online analytical processing all about multidimensional analysis?</p>
<p>There are some data analysis scenarios where a person who has a lot of experience in a field makes some predictions about their businesses. For example:</p>
<ul>
<li>An equity analyst predicts that stocks meeting certain conditions are most likely to rise;</li>
<li>A sales manager knows which types of sales representatives are better at dealing with difficult customers;</li>
<li>A tutor knows what the results of students who have very strong subjects and very weak subjects look like.</li>
</ul>
<p>These guesses provide the basis for predictions. After operating for some time, a business system will have generated a huge amount of data that can be used to verify them. Verified guesses become principles that guide future decisions; if a guess is proved wrong, a new guess is made.</p>
<p>It is this guess verification that OLAP should focus on. The guess-and-verify work aims to find principles or facts that support a conclusion based on historical data. An OLAP tool helps verify guesses via data manipulation.</p>
<p>Of course, <b>guesses are made by experienced people in a certain field</b>, not by the software. The online analysis is necessary because, most of the time, guesses are made on the spot based on some intermediate result. It is impossible, and unnecessary, to pre-design a complete end-to-end path, which means pre-modeling is unfeasible. Because such analyses are improvised, IT resources are usually unavailable to help verify them.</p>
<p>To counter the issue technologically, frontline workers must be equipped with the capability of querying and computing data in a flexible and interactive way. In the previously mentioned scenarios, the possible computations are as follows:</p>
<ul>
<li>For a stock that has been rising for 3 consecutive days in a month, find the probability that it continues rising on the 4<sup>th</sup> day;</li>
<li>Find the customers whose last orders were half a year ago but who placed an order after their sales representatives were changed;</li>
<li>Get the rankings of the English scores of the students whose scores of both Chinese and Math are in top 10;</li>
<li>…</li>
</ul>
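<p>To illustrate how naturally such ad-hoc queries map to a general-purpose language, here is a sketch of the third computation (students in the top 10 for both Chinese and Math, ranked by English score) on a made-up score table; the data and field names are hypothetical:</p>

```python
# Hypothetical student score records (name, chinese, math, english)
scores = [
    {"name": f"s{i}", "chinese": 60 + (i * 7) % 40,
     "math": 55 + (i * 11) % 45, "english": 50 + (i * 13) % 50}
    for i in range(30)
]

def top10(field):
    """Names of the 10 students with the highest score in `field`."""
    ranked = sorted(scores, key=lambda r: r[field], reverse=True)
    return {r["name"] for r in ranked[:10]}

# Students in the top 10 for both Chinese and Math...
both = top10("chinese") & top10("math")

# ...ranked by their English score
result = sorted((r for r in scores if r["name"] in both),
                key=lambda r: r["english"], reverse=True)
for rank, r in enumerate(result, start=1):
    print(rank, r["name"], r["english"])
```

<p>Each step produces an intermediate result that the analyst can inspect before deciding the next move, which is exactly the interactive style of computation the article argues for.</p>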
<p><b>Limitations of multidimensional analysis</b></p>
<p>Obviously these computations can be handled based on historical data. But is a multidimensional analysis method helpful?</p>
<p>I’m afraid not!</p>
<p>Multidimensional analysis has two drawbacks. One is that the data cube must be pre-built, giving users no opportunity to remodel it on the fly and requiring a re-build for each new analysis. The other is that the analytic operations over a data cube are limited to drilldown, aggregation, slicing and rotation, making it difficult to cope with complex multi-step computations. Though the popular agile BI products of recent years that perform multidimensional analysis have much better operational fluency and far more attractive interfaces than the early OLAP products, their essential functionality remains unchanged and none of these limitations has been addressed.</p>
<p>Yet multidimensional analysis has value, such as locating the exact source of a high cost. But it cannot derive, from data, a principle that is crucial for predicting and guiding a future move. In this sense, online analytical processing should be more than multidimensional analysis.</p>
<p><b>What kind of OLAP do we need?</b></p>
<p>What functionalities should OLAP software for verifying a speculation have?</p>
<p>As mentioned previously, verifying a speculation is a process of data query and computation. <b>It is vital that the query and computation can be defined by frontline workers without the help of IT specialists</b>. In the current application context, an OLAP platform needs to have the following two functionalities:</p>
<p>1. Associated query</p>
<p>The first thing for performing an analysis is acquiring data. Many organizations have their own data warehouses that non-IT employees can access and query. An important issue is that most OLAP software doesn't provide convenient associated-query functionality for frontline employees. Instead, IT specialists must first create a model to support the associated query (similar to creating a data cube for multidimensional analysis). Usually not all real-life demands can be handled with this single model, and IT rescue is still needed. This makes online analytical processing no longer truly online.</p>
<p>2. Interactive computation</p>
<p>After data is collected, computation begins. The distinguishing characteristic of the speculation-verifying computation is that, instead of following a ready-made program, the next move is determined by the result of the previous one. The process is highly interactive, similar to computing with a calculator, except that what needs to be processed is structured data in batches rather than individual numbers. The OLAP tool thus becomes a <b>data calculator</b>. Excel is interactive to some degree, making it the most popular desktop BI tool. But Excel doesn't give sufficient support for dealing with multi-level data and regular operations, so it is unable to handle the speculation-verifying computations in the previous scenarios.</p>
<p>In later articles, we’ll analyze the current popular computing techniques to locate problems of handling the two types of computation, and suggest solutions to them. </p>
<p><a href="https://www.linkedin.com/pulse/mr-jiangs-datatalk-room-what-kind-olap-do-we-really-han-lijun/" target="_blank">https://www.linkedin.com/pulse/mr-jiangs-datatalk-room-what-kind-olap-do-we-really-han-lijun/</a></p>9 Off-the-beaten-path Statistical Science Topics with Interesting Applicationstag:www.analyticbridge.datasciencecentral.com,2017-10-02:2004291:BlogPost:3719192017-10-02T19:43:45.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><span>You will find here nine interesting topics that you won't learn in college classes. Most have interesting applications in business and elsewhere. They are not especially difficult, and I explain them in simple English. Yet they are not part of the traditional statistical curriculum, and even many data scientists with a PhD degree have not heard about some of these concepts.</span></p>
<p><span><a href="http://api.ning.com/files/hIzuEPIKkM6cOZ73lNYnehvJ6jmXmMAGlRiI8MXMIo9pF*QRTE5wAxpO559lTW-Ynz-xxulWwguEJmCyBrwPZJosP2n8gF4N/Capture.PNG" target="_self"><img src="http://api.ning.com/files/hIzuEPIKkM6cOZ73lNYnehvJ6jmXmMAGlRiI8MXMIo9pF*QRTE5wAxpO559lTW-Ynz-xxulWwguEJmCyBrwPZJosP2n8gF4N/Capture.PNG" class="align-center"/></a></span></p>
<p><span>The topics discussed in this article include:</span></p>
<ul>
<li>Random walks in one, two and three dimensions - With Video</li>
<li>Estimation of the convex hull of a set of points - Application to clustering and oil industry</li>
<li>Constrained linear regression on unusual domains - Application to food industry</li>
<li>Robust and scale-invariant variances</li>
<li>Distribution of arrival times of extreme events - Application to flood predictions</li>
<li>The Tweedie distributions - Numerous applications</li>
<li>The arithmetic-geometric mean - Fast computations of decimals of Pi</li>
<li>Weighted version of the K-NN clustering algorithm</li>
<li>Multivariate exponential distribution and storm modeling</li>
</ul>
<p><a href="http://www.datasciencecentral.com/profiles/blogs/9-off-th-beaten-path-statistical-science-topics" target="_blank">Click here to read the article</a>.</p>
<p></p>Audience is The Future Business Model, Data Analytics Can Improves Ittag:www.analyticbridge.datasciencecentral.com,2017-09-26:2004291:BlogPost:3710142017-09-26T18:30:00.000ZChirag Shivalkerhttps://www.analyticbridge.datasciencecentral.com/profile/ChiragShivalker
<p><img src="http://api.ning.com:80/files/Zk032S8hxsW5EhV6AH5bPFP6D1ncYHrrK2-OrQXYHYOzX*zwYnMcn5Zf7NMGDoe51DEu2ivPqJ16NW0QJGaNTJ4*efe6a9ch/audienceisthefuturebusinessmodeldataanalyticsimprovesit3.jpg" width="750"/></p>
<p></p>
<p>Companies and enterprises face a daily grind: they must ensure that their customers are happy and satisfied, operations are efficient, and employees are content; all of this makes running the business a real challenge. “<em>Audience is the new business model</em>”, and if an organization is struggling to communicate with its customers or audience, there is certainly a negative impact across the business plan, from finances to product development.<br/></p>
<p>A majority of organizations have, in one way or another, turned to data analytics to identify and understand their audience, focusing on optimizing customer communication, revitalizing their business models, and achieving profitable growth. Even Fortune 500 and Fortune 1000 companies have realized that data analytics draws a clear line from marketing campaigns to profits. Here are three tried-and-tested ways to capitalize on data analytics to improve the overall business plan.<br/></p>
<p><span class="font-size-5">1. Understanding target markets before expansion</span></p>
<p><br/> Some organizations have grown smart enough to start viewing their business models as a constant work in progress. Exploratory data analytics has helped them stay relevant. Leveraging data analytics offerings equips them with real-time information on consumer activity. It also points them to potential target markets that might have gone unnoticed in past years. <em>Factors they consider when reaching out to new demographics:</em><br/></p>
<ul>
<li>Identify which aspects of products/services are best received, and focus on those when reaching out to potential customers.</li>
<li>Assess and strategize how potential customers can be served better than the direct competition serves them.</li>
<li>Visualize insights derived through data analytics, and conclude with patterns of age, gender, location, etc., to fine-tune offers around these factors.</li>
</ul>
<p><br/> Understanding target markets before implementing expansion plans is imperative. Using data to glean insights, and to identify the strengths and weaknesses of the current business model backed by user activity, ultimately enhances the ability to engage potential customers in new landscapes.</p>
<p><br/> <span class="font-size-5">2. Understanding consumer behavior</span></p>
<p><br/> Businesses used to operate with limited abilities when it came to reaching out to customers, and a lot of decisions were made on guesswork. This included the rudimentary marketing and advertising tactics used to attract customers. And then predictive data analytics walked into the picture.<br/></p>
<p>Understanding audience or customer behavior, the critical element of preparing a profitable business plan, is now addressed adequately. A consumer's online activity generates data that can be analyzed in real time and used as an opportunity to target that consumer directly. <em>This data also helps businesses organize targeted advertisements to encourage more sales and better manage finances in the following ways:</em></p>
<p></p>
<ul>
<li>Tracking when a customer made purchases, with the help of sales forecasting, tells you when they will most probably make their next purchase. Based on this, businesses can organize targeted advertisements to enhance sales.</li>
<li>Consumer purchasing patterns, as highlighted by predictive analytics, help in strategically allocating funds according to times of prosperity or stagnation.</li>
<li>Easy identification of trends and fluctuations in the market and across the industry, facilitated by historical data analytics, helps minimize risks and identify lucrative investment opportunities.</li>
</ul>
<p></p>
<p>The fine blend of sales forecasting and predictive analytics creates a sustainable financial model. Understanding consumer behavior delivers valuable insights about how purchases were made and what can be done to improve overall sales activity.<br/></p>
<p><span class="font-size-5">3. Understanding the power of word of mouth</span></p>
<p><br/> The tremendous rise of social networks has forced businesses to recognize and understand the power of word of mouth. Customer feedback and forums have proved that even the biggest corporations across the globe are subject to customer reviews. Businesses today are modeling their business plans on customer expectations by leveraging consumer data analytics.</p>
<p></p>
<ul>
<li>To build trust and grow a more loyal customer base, businesses can target consumers with personalized messages to show that they are attentive.</li>
<li>To engage consumers in more meaningful communication, they can leverage data to get in touch with consumers around seasonal events, holidays and significant occasions.</li>
<li>Being open to criticism is one of the most effective ways businesses have found to improve their products/services and even their business model, while making customers feel heard and encouraging them to become returning customers.</li>
</ul>
<p><br/> With the help of data analytics, it becomes much easier to track customer needs and adjust the business model accordingly. Whether in ecommerce/retail, real estate, consulting and professional services, transportation and logistics, BFSI or healthcare, data analytics has empowered professionals across industries to glean insights about consumer needs, reinforce confidence in their products/services, and impact everything from consumer awareness to sales and profit.</p>Why Analytics Projects Fail – And It’s Not The Analytics!tag:www.analyticbridge.datasciencecentral.com,2017-09-26:2004291:BlogPost:3712992017-09-26T21:00:00.000ZEd Crowleyhttps://www.analyticbridge.datasciencecentral.com/profile/EdCrowley
<p>Being in a highly technical, complex field, it is easy to sometimes lose the ‘human aspect’ of the solutions we are developing. We focus on applying edge computing concepts, or on whether a seasonality model works better for our predictive accuracy than some other approach. Don't get me wrong, these are all important activities. However, in working with many firms developing, deploying and supporting advanced analytics solutions, particularly in the Industrial IoT space, it’s often the people side that fails – not the technology.</p>
<p>How many times have you developed an amazing predictive analytics solution that your team is excited about, only for it to fail to get corporate funding? Or even worse, you develop a proof of concept, but the business unit responsible for deploying the solution never fully adopts or accepts it. What gives? The solution has been proven to work. It shows significant results. Why the hesitancy?</p>
<p>The reality is that most advanced analytics solutions have some impact on existing business processes. Whether it is replacing staff with an automated process, or gaining management's trust that the decision-making algorithms work, the bottom line is that it's all about making sure the organization, the people, not only accept but embrace the solution.</p>
<p>Here are some suggestions for helping to make this happen:</p>
<ol>
<li><b>KISS (Keep It Simple, Stupid).</b> Yeah, it’s cool to be the ‘wizard in the room’, but the reality is that most people are pretty intimidated by the statistics and buzzwords associated with advanced analytics. Keep it simple; they are already going to be impressed by your understanding of the subject area. Use as many examples as possible and avoid buzzwords or highly technical jargon.</li>
<li><b>Make It Relevant.</b> Clearly articulate why the solution has value. For the executive team, a clear use case with financial detail explaining the value of the solution is critical. For the management team implementing it, before you ever pitch ‘the solution’, make sure you have spent time to really understand how it would impact them and what their needs are, and explain how it will help them in their job or how the improvement in company performance will benefit them.</li>
<li><b>Get Buy In Early.</b> Give everyone involved in implementing, using, or making decisions about the solution a chance to buy in by having a say on how it is designed, and how it will operate. If you wait until you have ‘completed’ the design, then you have to overcome resistance to change and inertia. No matter how good the solution is, folks typically don’t want to change. However, if you let them help you design the solution, they will buy-in to it as part of the process.</li>
</ol>
<p>Clearly, there are a lot of things that can be done to help drive acceptance of your solution. So my strong advice is: don’t think of just designing and developing the solution as the ‘project’. Half of the project is getting buy-in to the solution from your organization. Be prepared for this, plan for it, and embrace it!</p>
<p>Ed Crowley is CEO of Virtulytix – the Industrial IoT solutions development firm that makes the Industrial IoT smart!</p>Can you solve these mathematical / statistical problems?tag:www.analyticbridge.datasciencecentral.com,2017-09-22:2004291:BlogPost:3712872017-09-22T03:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>I recently posted an article featuring a non-traditional approach to finding large prime numbers. The research section of that article offers interesting challenges, both for data scientists interested in mathematics, and for mathematicians interested in data science and big data. My approach is heavy on data, pattern recognition, and machine learning. Here is the introduction:</p>
<p>Large prime numbers have been a topic of considerable research, for its own mathematical beauty, as well as to develop more powerful cryptographic applications and random number generators. In this article, we show how big data, statistical science (more specifically, pattern recognition) and the use of new efficient, distributed algorithms, could lead to an original research path to discover large primes. Here we also discuss new mathematical conjectures related to our methodology.</p>
<p>Much of the focus so far has been on discovering raw large primes: Any time a new one, bigger than all predecessors, is found, it gets a lot of attention even beyond the mathematical community. Here we explore a different path: finding numbers (usually not primes) that have a very large prime factor. In short, we are looking for special integer-valued functions f(n) such that f(n) has a prime factor bigger than n, hopefully much bigger than n, for most values of n.</p>
<p><a href="http://api.ning.com:80/files/2ZTD6jtt0ki2RHB4NCP5vnQap1rzTO*xnaeSraW2vMSqpizMibvH7WNSzUe*dECAUDydUKL*TkB9yqMq-E3kaCTP-FAzYX0q/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/2ZTD6jtt0ki2RHB4NCP5vnQap1rzTO*xnaeSraW2vMSqpizMibvH7WNSzUe*dECAUDydUKL*TkB9yqMq-E3kaCTP-FAzYX0q/Capture.PNG" width="595" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source for picture: <a href="http://infosthetics.com/archives/2012/07/on_the_pattern_of_primes.html" target="_blank">click here</a></em></p>
<p>The distribution of the largest prime factor has been studied extensively. If we choose a function that grows fast enough, one would expect that the largest prime factor of f(n) will always be larger than n. However, this would lead to intractable factoring issues to find the large primes in question. So in practice, we are interested in functions f(n) that do not grow too fast. The problem is that many, if not most, very large integers are friable: their largest prime factor is a relatively small prime. I like to call them porous numbers. So the challenge is to find a function f(n) that is not growing too fast, and that somehow produces very few friable numbers as n becomes extremely large. <a href="http://www.datasciencecentral.com/profiles/blogs/data-science-method-to-discover-large-prime-numbers" target="_blank">Read the full article here</a>.</p>
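<p>To make the challenge concrete, here is a minimal Python sketch (the candidate function f(n) = n² + 1 is purely illustrative, not one proposed in the article) that counts how often f(n) has a prime factor larger than n:</p>

```python
def largest_prime_factor(m):
    """Return the largest prime factor of m >= 2, by trial division."""
    factor = 1
    d = 2
    while d * d <= m:
        while m % d == 0:
            factor = d
            m //= d
        d += 1
    if m > 1:          # whatever remains after division is itself prime
        factor = m
    return factor

def f(n):
    return n * n + 1   # illustrative candidate, not the article's f

# For how many n does f(n) have a prime factor bigger than n?
hits = sum(1 for n in range(2, 200) if largest_prime_factor(f(n)) > n)
print(f"{hits} of 198 values of n in [2, 200) qualify")
```

Of course, trial division is hopeless for genuinely large n; the article's point is precisely that pattern recognition and distributed algorithms are needed where brute-force factoring becomes intractable.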
<p><em>For another interesting challenge, read the section "Potential Areas of Research" in my article <a href="http://www.analyticbridge.datasciencecentral.com/profiles/blogs/mysterious-sequences-that-look-random-with-surprising-properties" target="_blank">How to detect if numbers are random or not</a>. For other articles featuring difficult mathematical problems, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank">click here</a>. For a statistical problem with several potential applications (clustering, data reduction), <a href="http://www.datasciencecentral.com/profiles/blogs/nice-generalization-of-the-k-nn-clustering-algorithm" target="_blank">click here</a> and read the last section. More challenges can be found <a href="http://www.datasciencecentral.com/group/resources/forum/topics/best-kept-secret-about-data-science-competitions" target="_blank">here</a>.</em></p>Building Convolutional Neural Networks with Tensorflowtag:www.analyticbridge.datasciencecentral.com,2017-09-07:2004291:BlogPost:3694982017-09-07T13:30:00.000Zahmet taspinarhttps://www.analyticbridge.datasciencecentral.com/profile/ahmettaspinar
<p>In the past year I have also worked with Deep Learning techniques, and I would like to share with you how to make and train a Convolutional Neural Network from scratch, using tensorflow. Later on we can use this knowledge as a building block to make interesting Deep Learning applications.</p>
<p>The pictures here are from the full article. Source code is also provided.</p>
<p><a href="http://api.ning.com:80/files/OtMcIKbpmwRWBDj9GQF-YGP9oRvaOCVcm7V69c-mMoJgT7BYOfP10tAyA-7CVNEeeT65kmN0MZazjB08c3rhMHL6-tv-mM5i/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/OtMcIKbpmwRWBDj9GQF-YGP9oRvaOCVcm7V69c-mMoJgT7BYOfP10tAyA-7CVNEeeT65kmN0MZazjB08c3rhMHL6-tv-mM5i/Capture.PNG" width="654" class="align-center"/></a></p>
<p>Before you continue, make sure you understand how a convolutional neural network works. For example,</p>
<ul>
<li>What is a convolutional layer, and what is the filter of this convolutional layer?</li>
<li>What is an activation layer (ReLu layer (most widely used), sigmoid activation or tanh)?</li>
<li>What is a pooling layer (max pooling / average pooling), dropout?</li>
<li>How does Stochastic Gradient Descent work?</li>
</ul>
<p><strong>The contents of this blog post are as follows</strong>:</p>
<p>1. Tensorflow basics:</p>
<ul>
<li>Constants and Variables</li>
<li>Tensorflow Graphs and Sessions</li>
<li>Placeholders and feed_dicts</li>
</ul>
<p>2. Neural Networks in Tensorflow</p>
<ul>
<li>Introduction</li>
<li>Loading in the data</li>
<li>Creating a (simple) 1-layer Neural Network</li>
<li>The many faces of Tensorflow</li>
<li>Creating the LeNet5 CNN</li>
<li>How the parameters affect the output size of a layer</li>
<li>Adjusting the LeNet5 architecture</li>
<li>Impact of Learning Rate and Optimizer</li>
</ul>
<p>3. Deep Neural Networks in Tensorflow</p>
<ul>
<li>AlexNet</li>
<li>VGG Net-16</li>
<li>AlexNet Performance</li>
</ul>
<p>4. Final words</p>
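<p>One outline item above, how the parameters affect the output size of a layer, boils down to the standard convolution arithmetic, sketched below. This is the usual formula, out = (in − filter + 2·padding) / stride + 1, not code taken from the article:</p>

```python
def conv_output_size(in_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (in_size - filter_size + 2 * padding) // stride + 1

# A 32x32 LeNet5 input through a 5x5 filter, stride 1, no padding -> 28x28
print(conv_output_size(32, 5))            # 28
# Followed by 2x2 pooling with stride 2 -> 14x14
print(conv_output_size(28, 2, stride=2))  # 14
```

The same formula explains why adjusting the LeNet5 architecture (larger filters, extra layers, different strides) changes the shapes of all downstream weight matrices.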
<p><a href="http://api.ning.com:80/files/OtMcIKbpmwQGE4JomGtJxEJJGt6KnYUy00r0-QEkm5ijh4CUew9kI4K1GfAaWEg373iECKBQR7Q91MCgpVvvCfEN7lohLKe*/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/OtMcIKbpmwQGE4JomGtJxEJJGt6KnYUy00r0-QEkm5ijh4CUew9kI4K1GfAaWEg373iECKBQR7Q91MCgpVvvCfEN7lohLKe*/Capture.PNG" width="506" class="align-center"/></a></p>
<p><em>To read this blog, <a href="http://ataspinar.com/2017/08/15/building-convolutional-neural-networks-with-tensorflow/" target="_blank">click here</a>. <span>The code is also available in my </span><a href="https://github.com/taspinar/sidl" target="_blank">GitHub repository</a><span>, so feel free to use it on your own dataset(s).</span></em></p>A Day in the life of an Analysttag:www.analyticbridge.datasciencecentral.com,2017-09-03:2004291:BlogPost:3706622017-09-03T20:00:00.000ZIvy Pro Schoolhttps://www.analyticbridge.datasciencecentral.com/profile/IvyProSchool
<p><a href="http://ivyproschool.com/blog/2016/11/29/a-day-in-the-life-of-an-analyst/" target="_blank"><img src="http://api.ning.com:80/files/n6oHIXSTHuV5RPPEztZ2VArorCdmCs1luq*GOApIsTbZs*-znhBpU-iDgKScC-BCZrwTFYKloOAcMgZea06SjHwLj3erdtbT/Infographic_dayinthelifeofanAnalyst.png?width=750" width="750" class="align-full"/></a></p>
<p><strong><u>A typical day in the life of an Analyst</u></strong></p>
<p>An Analyst works on varied projects, with multiple deliverables and duties that depend on the business objectives.</p>
<p>However, there are some tasks that can easily be classified as “common everyday duties” in a “typical work day of a business analyst”.</p>
<p><strong>Clarification and investigation of business goals and problems</strong></p>
<p>Analysts like to ask questions. This helps them identify the right question against which they would need to conduct analysis. The process of investigating involves conducting interviews, reading and observing people at work.</p>
<p><strong>Information analysis</strong></p>
<p>The analysis phase is when the Business Analyst spells out the elements in detail, stating clearly and unambiguously what the business needs to do in order to solve its issue. During this stage the BA will also interact with the development team and, if appropriate, an architect, to design the layout and define accurately what the solution should look like.</p>
<p> </p>
<p><strong>Meeting stakeholders</strong></p>
<p>Good Business Analysts spend countless hours actively communicating. More than just speaking, this means hearing and recognising verbal and non-verbal information, building an open conversation, verifying that you’ve understood what you heard, and communicating what you learn to those who will create the actual solution.</p>
<p> </p>
<p><strong>Document research</strong></p>
<p>After the investigation, all the facts that Analysts have collected need to be documented for future use. They use data visualization tools such as Excel charts and spreadsheets.</p>
<p><strong>Evaluate options</strong></p>
<p>Before drawing solutions for the given problems, Analysts like to work through various possible solutions, weighing which one best fits the business requirements. To draw conclusions, Analysts use analytical tools like SAS, R and Python for analyzing and predicting the best outcome.</p>
<p><strong>Take action</strong></p>
<p>Once the optimum solution is derived, an Analyst has to communicate with the client again to catch any missed objectives and to guide them on how to implement the recommended processes.</p>
<p> </p>
<p>A Business Analyst lights the way toward viable solutions to complex business problems. It’s no surprise that Business Analysts are in such great demand.</p>
<p><a rel="nofollow" href="http://ivyproschool.com/" target="_blank"> </a></p>Overpromising and Underperforming: Understanding and Evaluating Self-service BI Toolstag:www.analyticbridge.datasciencecentral.com,2017-08-31:2004291:BlogPost:3705892017-08-31T06:00:00.000ZJIANG Buxinghttps://www.analyticbridge.datasciencecentral.com/profile/HANLijun
<p>From the OLAP concept in earlier years to the agile BI over the last few years, BI vendors never stop advertising the self-service capability, claiming that business users will be able to perform analytics by themselves. Since there are strong self-service needs among users, the two really hit it off and it is very likely that a quick deal is made. The question is - does a BI product’s self-service functionality enable a truly flexible data analytics by business users?</p>
<p>There isn’t a standard definition of “data analytics” in the industry, so no one can say for sure whether the claim is objective or exaggerated. But for users who have little BI experience, the fact is that most of their self-service needs can’t be met with the so-called self-service technology. Industry experience suggests that the best products solve about 30% of these problems; most BI products lag far behind that figure, lingering around 10%.</p>
<p>We’ll look at the phenomenon from three aspects.</p>
<p><b> </b></p>
<p><b>Multidimensional analysis</b></p>
<p>Multidimensional analysis performs interactive operations over a pre-created data set (or a data cube). Today most BI products provide this type of analytic capability. Though a new generation of BI products has improved much on interface design and operational smoothness, their ability to implement computations hasn’t essentially improved.</p>
<p>The key aspect of multidimensional analysis is model creation, which is the pre-preparation of data sets. If the data to be analyzed is all held in a single data set, and if the operations to be performed are within those provided by a BI product (including rotation, drilldown, slicing, and so on), the analysis is well within the product’s capability. But in most real-life scenarios, analytic needs go beyond these pre-installed functionalities, such as adding a provisional data item or performing a join with another data set, forcing a re-creation of the model. The problem is that model creation requires technical professionals, rendering the tool non-self-service. </p>
<p>Multidimensional analysis can meet only 10% of the self-service needs, which reflects the average self-service ability of today’s BI products.</p>
<p><b> </b></p>
<p><b>Associative query</b></p>
<p>Some BI products provide associative query capability to make up for the limitations of multidimensional analysis. The strategy is to create a new data set by joining multiple data sets before performing the multidimensional analysis, or to implement certain joins between multiple data cubes during the multidimensional analysis. This means business users are to some degree allowed to create models.</p>
<p>It isn’t easy to implement an associative query well. Relational databases give a too-simple definition of the JOIN operation, making the association between data sets too complicated for many business users to understand. The issue can be partly addressed through a well-designed product interface, and a good BI product enables business users to appropriately handle non-circular joins. But to fully solve the issue, we need to change the data organization scheme on the database level. The reality is that nearly no BI products re-define the database model, thus the improvement of associative query ability is limited. We’ll discuss the related technologies in later articles.</p>
<p>Here’s a typical example for testing the associative query ability of a BI product: finding the male employees under the female managers. The simple query involves multiple self-joins, but most BI products are incapable of handling it (without first creating a model).</p>
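<p>To see why this query is awkward in tabular form, here is a toy version in Python (the table and column names are invented for illustration); the SQL equivalent joins the employee table to itself via the manager reference:</p>

```python
# Illustrative "employee" table; manager_id refers back to the same table,
# so the SQL version is a self-join:
#   SELECT e.name FROM employee e
#   JOIN employee m ON e.manager_id = m.id
#   WHERE e.gender = 'M' AND m.gender = 'F';
employees = [
    {"id": 1, "name": "Alice", "gender": "F", "manager_id": None},
    {"id": 2, "name": "Bob",   "gender": "M", "manager_id": 1},
    {"id": 3, "name": "Carol", "gender": "F", "manager_id": 1},
    {"id": 4, "name": "Dave",  "gender": "M", "manager_id": 3},
    {"id": 5, "name": "Erin",  "gender": "F", "manager_id": 2},
]
by_id = {e["id"]: e for e in employees}

# Male employees whose direct manager is female:
male_under_female = [
    e["name"] for e in employees
    if e["gender"] == "M"
    and e["manager_id"] is not None
    and by_id[e["manager_id"]]["gender"] == "F"
]
print(male_under_female)  # ['Bob', 'Dave']
```

Note that "under" is taken here as the direct manager; finding everyone below a female manager at any level requires walking the whole management chain, i.e. an unbounded number of self-joins.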
<p>BI products’ associative query capability can meet 20%-30% of the self-service needs, though the specific number depends on the different capabilities provided by different products.</p>
<p> </p>
<p><b>Procedure computing</b></p>
<p>About 70% or more of self-service demands involve multi-step procedural computations, which are completely beyond the design target of a BI product and can even be considered beyond data analytics, yet they are a pressing user problem. Users hope that frontline employees can get at data as flexibly as possible within their authority.</p>
<p>A simple solution is exporting data with the BI product and letting frontline workers handle it with desktop tools like Excel. Unfortunately, Excel is not good at handling multilevel joins (an issue we will discuss later), nor at dealing with large amounts of data, making it unsuitable in many computing scenarios.</p>
<p>Until more advanced interactive computing technology appears, technical specialists will be responsible for tackling those problems. In this context, instead of pursuing self-service procedure computing, BI products should focus on facilitating business users’ access to technical resources and streamlining the development process for developers.</p>
<p>There are two things we can do. One is establishing an algorithm library where the algorithms for already-handled scenarios are stored. Business users can call up an algorithm and change its parameters to reuse it in the same type of computing scenario. They can also point technical specialists to an existing algorithm as a reference when handling a new scenario, reducing the chance of misunderstanding between business users and the development team, which is a major source of delay. The other is providing efficient, manageable programming technology that facilitates coding and modification, and that supports storing an algorithm in the library for reuse. Corresponding technologies are rare in the industry. SQL has good manageability, but SQL code for procedural computations is too tedious. Stored procedures need recompilation, which is inconvenient for reuse. Java code needs recompilation too, and is nearly unmanageable. Other scripting languages are integration-unfriendly and thus difficult to store and manage in a database for reuse.</p>
<p> </p>
<p>At present, BI products are barely able to meet the most common self-service needs. Usually BI vendors are talking about multidimensional analysis while users are thinking of problems that require procedure computing. The misunderstanding invites high expectations as well as big disappointments. In view of this, it’s critical that users have a good understanding of their self-service needs: Is multidimensional analysis sufficient for dealing with the problems? How many associative queries will be needed? Will the frontline employees have a lot of problems that require procedure computing? Answering these questions is necessary for setting a reasonable expectation of a BI product and for knowing what it can do, thus avoiding being misled by a flowery interface and smooth operation into making a wrong purchase decision.</p>
<p></p>
<p><a href="http://www.linkedin.com/pulse/overpromising-underperforming-understanding-evaluating-han-lijun" target="_blank">http://www.linkedin.com/pulse/overpromising-underperforming-understanding-evaluating-han-lijun</a></p>8 Great Articles, Tutorials, and Infographicstag:www.analyticbridge.datasciencecentral.com,2017-08-22:2004291:BlogPost:3698382017-08-22T18:39:35.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Posted on DSC today and yesterday</p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/new-book-data-science-mindset-methodologies-and-misconceptions">New Book: Data Science: Mindset, Methodologies, and Misconceptions</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/robust-attacks-on-machine-learning-models">Robust Attacks on Machine Learning Models</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/about-quick-r">Quick Guide to R and Statistical Programming</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/book-r-for-data-science">Book: R for Data Science</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/predicting-the-next-eclipse">Predicting the next Eclipse</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/under-the-hood-with-reinforcement-learning-understanding-basic-rl">Understanding Basic Reinforcement Learning Models</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/evolution-of-machine-learning-infographics">Evolution of Machine Learning </a>- Infographics</li>
<li><a href="http://www.bigdatanews.datasciencecentral.com/profiles/blogs/how-marketers-use-data-science-to-increase-reach">How Marketers Use Data Science to Increase Reach </a>- Infographic</li>
</ul>
<p><span>Enjoy the reading!</span></p>Curious Mathematical Object: Hyperlogarithmstag:www.analyticbridge.datasciencecentral.com,2017-08-16:2004291:BlogPost:3696132017-08-16T18:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Logarithms turn a product of numbers into a sum of numbers: log(xy) = log(x) + log(y). Hyperlogarithms generalize the concept as follows: Hlog(XY) = Hlog(X) + Hlog(Y), where X and Y are any kind of objects, and the product and sum are replaced by operators in some arbitrary space. </p>
<p><a href="http://api.ning.com:80/files/*K0CPt1rhrQPLLGS8*ARoNUsaedloCkcp2WiyS-5HI7Xxf3xmSL5DVRW*noyClNPAuQJCPhh8EMhqW27*tNv*4Lv8nrmI9yA/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/*K0CPt1rhrQPLLGS8*ARoNUsaedloCkcp2WiyS-5HI7Xxf3xmSL5DVRW*noyClNPAuQJCPhh8EMhqW27*tNv*4Lv8nrmI9yA/Capture.PNG" width="252" class="align-center"/></a></p>
<p style="text-align: center;"><em>Source for picture: <a href="https://en.wikipedia.org/wiki/Super-logarithm" target="_blank">click here</a></em></p>
<p>Here we focus exclusively on operations on sets: XY becomes the intersection of the sets X and Y, and X + Y the union of X and Y. The question is: which functions satisfy Hlog(XY) = Hlog(X) + Hlog(Y)? We assume here that the argument for Hlog is a set X, and the returned value Hlog(X) = Y is another set Y from the same set of sets. Let E = {X, Y, ... } be the set of all potential arguments for Hlog. E must satisfy the following conditions:</p>
<ul>
<li>The intersection of two sets of E is also a set of E</li>
<li>The union of two sets of E is also a set of E</li>
</ul>
<p>The following additional condition can be added:</p>
<ul>
<li>E does not contain the empty set (thus any intersection of a finite number of sets in E, is non empty)</li>
</ul>
<p>Let's denote as U the union of all sets of E. It is easy to prove that Hlog(U) is the empty set, denoted as O. Also Hlog(O) does not exist, just like log(0) does not exist. It is also easy to prove that Hlog(XYZ) = Hlog(X) + Hlog(Y) + Hlog(Z), and this generalizes to any (finite) number of sets in E.</p>
<p>Two functions Hlog satisfy Hlog(XY) = Hlog(X) + Hlog(Y):</p>
<ul>
<li>Hlog(X) is equal to a constant set if X is different from U, and Hlog(U) = O is the empty set.</li>
<li>Hlog(X) = U - X is the complementary set of X (that is, U - X consists of all the elements that are in U but not in X)</li>
</ul>
<p>The question is: Besides these two functions (the first one being a degenerate solution), are there any other functions Hlog satisfying the same property? How do you proceed to solve such a weird functional equation in an unusual space? One could investigate this problem by first analyzing the case where E contains only 3 or 4 sets: in this case, look at all potential functions defined on E (there is only a finite number of them), and see which ones satisfy the equation Hlog(XY) = Hlog(X) + Hlog(Y). </p>
<p>Also, using object-oriented programming, how would you implement a generic function Hlog that works for real numbers, for sets, or in any other context?</p>
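<p>As one possible sketch of such a generic implementation (using Python's <code>functools.singledispatch</code> and a fixed illustrative universe U, both choices of ours rather than the article's), the complementary-set solution and the ordinary logarithm can share a single entry point:</p>

```python
import math
from functools import singledispatch

U = frozenset(range(10))   # illustrative universe: the union of all sets of E

@singledispatch
def hlog(x):
    raise TypeError(f"no Hlog defined for {type(x).__name__}")

@hlog.register
def _(x: float):           # ordinary logarithm: turns a product into a sum
    return math.log(x)

@hlog.register
def _(x: frozenset):       # complementary set: turns intersection into union
    return U - x

X = frozenset({1, 2, 3, 4})
Y = frozenset({3, 4, 5, 6})
# De Morgan's law gives Hlog(XY) = Hlog(X) + Hlog(Y) for this choice:
assert hlog(X & Y) == hlog(X) | hlog(Y)
assert hlog(U) == frozenset()                          # Hlog(U) = O
assert math.isclose(hlog(6.0), hlog(2.0) + hlog(3.0))  # log(2*3) = log 2 + log 3
```

The De Morgan check is exactly why Hlog(X) = U - X works: the complement of an intersection is the union of the complements.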
<p>To read more of my math-related articles, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank">click here</a>. </p>
<p><span class="font-size-4"><b>DSC Resources</b></span></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>
<p>Popular Articles</p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">What is Data Science? 24 Fundamental Articles Answering This Question</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python">Hitchhiker's Guide to Data Science, Machine Learning, R, Python</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
</ul>Fighting eCommerce fraud with graph technologytag:www.analyticbridge.datasciencecentral.com,2017-08-09:2004291:BlogPost:3693712017-08-09T15:30:00.000ZElise Devauxhttps://www.analyticbridge.datasciencecentral.com/profile/EliseDevaux
<p><span><br/> ECommerce fraud is growing quickly, creating new challenges in terms of prevention and detection. As merchants gather more and more information about customers and their behaviors, the key element in the fight against fraud is now to draw on the connections within the data collected to uncover fraudulent behaviors. In this post we explain why and how graph technologies are crucial in the detection of eCommerce fraud.</span></p>
<p></p>
<h2 style="text-align: center;"><span class="font-size-5">eCommerce: a fertile ground for fraud</span></h2>
<p></p>
<p><span>In recent years the eCommerce market has continually expanded, reaching $1.9 trillion in transaction value in 2016. Ecommerce sales are still growing rapidly and are </span><a href="https://www.emarketer.com/Article/Worldwide-Retail-Ecommerce-Sales-Will-Reach-1915-Trillion-This-Year/1014369">forecast to reach $4 trillion</a><span> by the end of 2020, notably with retailers pushing into new international markets.<br/> <br/></span></p>
<p><span>In the meantime, eCommerce fraud has become a multi-billion dollar industry. A </span><a href="http://www.experian.com/assets/decision-analytics/white-papers/juniper-research-online-payment-fraud-wp-2016.pdf">study conducted by Juniper</a><span> estimated the average cost for merchants at between 0.3% and 3% of revenues, depending on the vertical and the region. </span><span>Below are some examples of today’s most common fraud schemes.</span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/ecommerce_fraud_types_linkurious.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/ecommerce_fraud_types_linkurious.png?width=1230" width="1230" class="align-center"/></a></p>
<p><span>The adoption of new technologies, payment methods and data processing systems has benefited fraudsters, opening new doors to bypass existing security measures and cover their tracks. Professional fraudsters are organizing </span><span><a href="https://www.rte.ie/news/technology/2017/0621/884459-online-fraud/">themselves into networks</a>. They</span><span> exchange knowledge and technology across the globe and devise new schemes to stay one step ahead of the latest anti-fraud technology.<br/> <br/></span></p>
<h2 style="text-align: center;"><span><span class="font-size-5">A layered approach to fighting fraud raises technical challenges</span><br/> <br/></span></h2>
<p>Faced with increasing flows of money and evolving fraud schemes, but also with new technology disrupting traditional security measures, anti-fraud teams in eCommerce companies need to adapt.<br/></p>
<p><span>The traditional “silver bullet” approach of relying on one or two anti-fraud strategies is no longer enough. Best in class organizations combine multiple complementary approaches to maximize the accuracy of fraud detection and avoid false positives that negatively impact reputations. With numerous fraud prevention solutions available – from device authentication to proxy piercing or address verification service – the layered approach shows better results in detecting fraud attempts.<br/> <br/></span></p>
<p><span>In their </span><a href="https://www.gartner.com/doc/3472117/market-guide-online-fraud-detection">Market Guide for Online Fraud Detection</a><span>, Gartner’s fraud analysts outlined five critical layers to tackle today’s threats: end-point, navigation, channel and cross-channel centric layers and an additional entity link analysis layer.</span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/Gartner_layers_fraud.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/Gartner_layers_fraud.png?width=1858" width="1858" class="align-center"/></a></p>
<p style="text-align: center;"><i>Gartner’s conceptual model of a layered approach for fraud detection<br/> <br/></i></p>
<p><span>In order to carry out connected analysis, eCommerce vendors need technologies able to work with cross-channel data and perform relationship analysis at scale. However, most of today’s anti-fraud solutions (whether homemade or provided by a vendor) still rely on relational databases, designed to store data in a tabular format. Detecting connections between entities typically requires joining tables via foreign keys, which becomes computationally intractable after a few hops.<br/> <br/></span></p>
<p><span>As a result, eCommerce anti-fraud teams are still limited by product and channel silos that provide little or no cross-channel view of a subject’s behavior. Existing structures are too rigid to allow the easy adoption of new rules or data, making it hard to keep pace with new products and schemes.<br/> <br/></span></p>
<h2 style="text-align: center;"><span>Graph technology: the answer for connected analysis<br/> <br/></span></h2>
<p><span>To be able to perform connected analysis and reinforce their fraud detection system, many eCommerce merchants are choosing to leverage graph technology. This approach relies on a graph data model where all the data is stored as a graph. The entities are stored as nodes, connected to each other by edges. Popular graph database vendors include <a href="https://www.datastax.com/">DataStax</a>, <a href="https://neo4j.com/">Neo4j</a> and <a href="http://titan.thinkaurelius.com/">Titan</a>.<br/> <br/></span></p>
<p><span>Graph technology makes it possible to gather and connect customer, transaction, behavior or third-party data into a unique data model. This is essential to discover fraud attempts that are often hidden behind layers of deceit. For instance, instead of examining credit card transactions over a period of time, analysts can query the graph data to investigate how each transaction is connected to other entities such as IP addresses, customers, and devices.<br/> <br/></span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/linkurious_fraud_1.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/linkurious_fraud_1.png?width=2135" width="2135" class="align-center"/></a></p>
<p style="text-align: center;"><i>Visualization in Linkurious of cross-channel data stored as a graph in Neo4j<br/> <br/></i></p>
<p><span>The graph approach makes it easier and faster to query connections within the data. Anti-fraud teams can run queries traversing datasets of millions of records to unveil suspicious connections. This is critical to detect networks and suspicious patterns in real-time. But in order to speed up the analysis process, and therefore the response time, eCommerce merchants need intuitive access to this graph data.</span></p>
<p></p>
<h2 style="text-align: center;"><span>Detecting eCommerce fraudulent activity<br/> <br/></span></h2>
<p><span>To illustrate what fraud detection analysis with Linkurious Enterprise looks like, we created a small dataset with dummy eCommerce data and loaded it into a graph database. In the following sections, we explain how anti-fraud teams can leverage Linkurious Enterprise to detect and investigate fraud attempts.<br/> <br/></span></p>
<h3 style="text-align: left;"><span><span class="font-size-4">Setting up alerts to monitor fraud attempts</span><br/> <br/></span></h3>
<p><span>As new fraud schemes emerge, the ability to create detection rules on the fly is critical. Graph traversal languages such as <a href="https://neo4j.com/developer/cypher-query-language/">Cypher</a> or <a href="https://github.com/tinkerpop/gremlin/wiki">Gremlin</a> are simple yet expressive enough to let analysts write new queries that flag fraudulent behaviors. Linkurious Enterprise offers an alert dashboard to generate and monitor different alerts and assess the flagged cases.<br/> <br/></span></p>
<p><span>For instance, we want to set up an alert query that returns any transaction connected to an at-risk country. The list of these countries is integrated in the graph model and appears as a property on our country nodes. The data is processed in near real time, and you get an immediate response on whether such patterns exist in your data.<br/> <br/></span></p>
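To make the rule concrete, here is a minimal Python sketch of the same alert logic over a toy edge list. The relationship names (`USED_IP`, `SHIPPED_TO`, `LOCATED_IN`) and node identifiers are hypothetical, not a Linkurious or Neo4j API; in practice this check would be a Cypher or Gremlin query run against the graph database.

```python
# Flag transactions connected (via an IP or a delivery address) to a country
# marked "at risk". Schema and data are invented for illustration.

edges = [
    ("tx1", "USED_IP", "ip1"),      ("ip1", "LOCATED_IN", "countryA"),
    ("tx2", "SHIPPED_TO", "addr1"), ("addr1", "LOCATED_IN", "countryB"),
    ("tx3", "USED_IP", "ip2"),      ("ip2", "LOCATED_IN", "countryB"),
]
at_risk = {"countryA"}   # a property on country nodes in the real graph

def flagged_transactions(edges, at_risk):
    # country of each intermediate node (IP or delivery address)
    location = {s: t for s, rel, t in edges if rel == "LOCATED_IN"}
    return sorted(
        tx for tx, rel, mid in edges
        if rel in ("USED_IP", "SHIPPED_TO") and location.get(mid) in at_risk
    )

print(flagged_transactions(edges, at_risk))   # ['tx1']
```

The same two-hop pattern (transaction to IP/address to country) is exactly what a graph traversal expresses in a single query.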
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/alerts_linkurious.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/alerts_linkurious.png?width=400" width="400" class="align-center"/></a></p>
<p style="text-align: center;"><span class="font-size-2"><i>Setting up alerts in Linkurious using Cypher query language<br/> <br/></i></span></p>
<p><span>The above alert will flag transactions whose IP or delivery address is located in one of the countries on our “at-risk-countries” list. For every case reported, a team can visually investigate the data.<br/> <br/></span></p>
<h3 style="text-align: left;"><span><span class="font-size-4">Visually investigating graph data</span><br/> <br/></span></h3>
<p><span>Whether it’s to investigate specific alert cases or learn more about a particular entity, analysts can easily search and visualize their data in real time in the Linkurious Enterprise interface. For instance, we can generate a visualization compiling a subset of transactions and their related connections with a simple query. Below is the visualization of our subset of transactions (red nodes), the associated credit cards (blue nodes) and customer accounts (green nodes). We see that nodes are connected together, depicting the ownership relationships between credit cards and customers.<br/> <br/></span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/linkurious_visualization_2.jpg" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/linkurious_visualization_2.jpg?width=3194" width="3194" class="align-center"/></a></p>
<p style="text-align: center;"><i>Example of visualization of graph data nodes (customer, credit card and transactions) and their connections<br/> <br/></i></p>
<p><span>By using different graph layouts, we can easily reveal structural patterns in the data and identify differences at a glance. In the example below, we switched from a force-directed layout to a hierarchical layout in order to better understand the different cases we have.<br/> <br/></span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/Linkurious_visualization_3.jpg" rel="prettyPhoto[gallery]" title="" class="fancybox image"><img class="aligncenter wp-image-5737 size-full" src="https://linkurio.us/wp-content/uploads/2017/08/Linkurious_visualization_3.jpg" alt="Transactions visualization in Linkurious" width="3184" height="1504"/></a></p>
<p style="text-align: center;"><i>Visualization with a hierarchical layout a of subset of transactions and their related nodes<br/> <br/></i></p>
<p><span>We immediately notice a suspicious pattern in the graph. Customers (green nodes) are typically connected to a single credit card (blue node), which is connected to one or several transactions (red nodes). But in one case, two green nodes (two customers) are connected to the same credit card, which is suspicious.<br/> <br/></span></p>
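This "one card, several owners" pattern is easy to state as code. The sketch below, with invented data, groups ownership edges by card and reports any card attached to more than one customer; a graph query language would express the same idea as a pattern match.

```python
# Detect the suspicious pattern above: a credit card owned by more than one
# customer. "owns" edges and identifiers are illustrative.

owns = [  # (customer, credit_card)
    ("cust1", "card1"),
    ("cust2", "card2"),
    ("cust3", "card2"),   # card2 is shared by two customers -- suspicious
]

def shared_cards(owns):
    owners = {}
    for customer, card in owns:
        owners.setdefault(card, set()).add(customer)
    # keep only cards with more than one distinct owner
    return {card: sorted(o) for card, o in owners.items() if len(o) > 1}

print(shared_cards(owns))   # {'card2': ['cust2', 'cust3']}
```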
<p><span>It is easy to drill down on a suspicious case with the Linkurious Enterprise interface. Analysts simply expand the nodes around suspicious patterns to reveal other connections within the data and assess the situation. In our example, we expanded the nodes around our two customers and the transactions to unveil a sub-graph with additional information (customer IP addresses, contact information, addresses, goods bought and shipping addresses).<br/> <br/></span></p>
<p><a href="https://linkurio.us/wp-content/uploads/2017/08/Linkurious_visualization_4.png" target="_blank"><img src="https://linkurio.us/wp-content/uploads/2017/08/Linkurious_visualization_4.png?width=1916" width="1916" class="align-center"/></a></p>
<p style="text-align: center;"><i>Investigation of the neighboring nodes of our suspicious customers<br/> <br/></i></p>
<p><span>With this visualization, we understand that three users are actually involved. Two of them, from different countries, ordered goods and shipped them to a third customer, which could indicate a reshipping fraud scheme.<br/> <br/></span></p>
<p><span>Graph technology offers an additional layer of protection for eCommerce companies. It enhances discrete analysis methods by providing connected analysis capabilities over a single source of truth of customer data. Anti-fraud teams have an intuitive tool to detect and investigate fraud attempts and fraud rings that would otherwise stay undetected.</span></p>
<p></p>
<p><a href="https://linkurio.us/" target="_blank">Learn more</a></p>Type I and Type II Errors in One Picturetag:www.analyticbridge.datasciencecentral.com,2017-08-10:2004291:BlogPost:3695862017-08-10T23:17:32.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>This picture speaks more than words. It explains the concepts of false positives and false negatives, that is, what statisticians refer to as Type I and Type II errors.</p>
<p></p>
<p><a href="http://api.ning.com:80/files/4aD3G6lWhVnWOLeavIi6rI00scQEtKRCp8cBo5rnAIbckXiSk7jOXyuhyG4JyI-C76Hp*lymuaqTdW5D1RKdy1laH7x3S2Ph/fp.PNG" target="_self"><img src="http://api.ning.com:80/files/4aD3G6lWhVnWOLeavIi6rI00scQEtKRCp8cBo5rnAIbckXiSk7jOXyuhyG4JyI-C76Hp*lymuaqTdW5D1RKdy1laH7x3S2Ph/fp.PNG" width="472" class="align-center"/></a></p>
<p>Other great pictures summarizing data science and statistical concepts, can be found <a href="http://www.datasciencecentral.com/profiles/blogs/four-great-pictures-illustrating-machine-learning-concepts" target="_blank">here</a> and also <a href="http://www.datasciencecentral.com/profiles/blogs/17-amazing-infographics-and-other-visual-tutorials" target="_blank">here</a>. </p>
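In code, the two error types are simply the two off-diagonal cells of a confusion matrix: a Type I error is a false positive (rejecting a true null hypothesis), a Type II error is a false negative (failing to reject a false one). The labels below are made up for illustration.

```python
# Counting Type I (false positive) and Type II (false negative) errors
# from predicted vs. actual labels. Data is invented for illustration.

actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = condition actually present
predicted = [1, 1, 0, 1, 0, 0, 1, 1]   # 1 = test says "present"

type_i  = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
type_ii = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

print(type_i, type_ii)   # 2 1
```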
<p><b>DSC Resources</b></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>
<p><b>Popular Articles</b></p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">What is Data Science? 24 Fundamental Articles Answering This Question</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python">Hitchhiker's Guide to Data Science, Machine Learning, R, Python</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
</ul>Capturing Low-Probability, High-Impact Events 'Black Swans' in Economic and Financial Modelstag:www.analyticbridge.datasciencecentral.com,2017-07-31:2004291:BlogPost:3691962017-07-31T14:30:00.000ZJamilu Auwalu Adamuhttps://www.analyticbridge.datasciencecentral.com/profile/JamiluAuwaluAdamu
<p style="text-align: center;"><b>Capturing Low-Probability, High-Impact Events 'Black Swans' in Economic and Financial Models</b> <br></br> <b>Jamilu Auwalu Adamu , <i>Lecturer, Nigeria</i></b></p>
<p align="center"><b>Incorporation of Fat - Tailed Effects of the Underlying Assets Probability Distribution using Advanced Stressed Methods.</b></p>
<p><br></br> Capturing the effects of Low-Probability, High-Impact "Black Swans" in the existing stochastic and deterministic models is tremendously…</p>
<p style="text-align: center;"><b>Capturing Low-Probability, High-Impact Events 'Black Swans' in Economic and Financial Models</b> <br/> <b>Jamilu Auwalu Adamu , <i>Lecturer, Nigeria</i></b></p>
<p align="center"><b>Incorporation of Fat - Tailed Effects of the Underlying Assets Probability Distribution using Advanced Stressed Methods.</b></p>
<p><br/> Capturing the effects of Low-Probability, High-Impact "Black Swan" events in existing stochastic and deterministic models is tremendously important. On this page, I would like to share with members the open-access, peer-reviewed published research findings of my PhD thesis on how to capture the effects of Low-Probability, High-Impact Events in our existing economic and financial models.<br/> I shall begin with the incorporation of fat-tailed effects of the underlying assets' probability distribution in the popular LOGIT and PROBIT MODELS.<br/> <br/> INTRODUCTION<br/> <br/> The global financial markets have experienced a series of financial and economic crises from their inception, generation after generation. Banks, companies and the world economy have suffered catastrophic deterioration and serious corporate failures through systemic risk effects.<br/> Big banks and companies like Continental Illinois, City Federal Savings and Loan, Bank of New England Boston, Lehman Brothers, General Motors and WorldCom have all failed and been declared bankrupt in the history of the global financial markets. That is why many scholars, past and present, have attempted to come up with models that can precisely calculate the Probability of Default of a bank or company over a given time period. The Probability of Default of a given company or bank captures the probability that it will default within a certain period.<br/> The most popular models used by financial institutions to calculate probabilities of default are LOGIT (1980) and PROBIT (1981). Although Logit and Probit give good approximations, they do not seem to capture chaotic market behaviour to any great extent. Accurate determination of the probability of default plays a very important role in the entire world economy. 
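For concreteness, a logit probability-of-default model maps a linear score of financial ratios through the logistic function, PD = 1 / (1 + exp(-z)). The coefficients and ratios below are invented for illustration only; in practice they would be estimated from historical default data.

```python
import math

# Sketch of a logit PD model: PD = 1 / (1 + exp(-z)), where z is a linear
# score of financial ratios. Coefficients b0..b3 are made up, not estimated.

def probability_of_default(leverage, profitability, liquidity,
                           b0=-2.0, b1=3.0, b2=-4.0, b3=-1.5):
    z = b0 + b1 * leverage + b2 * profitability + b3 * liquidity
    return 1.0 / (1.0 + math.exp(-z))

healthy  = probability_of_default(leverage=0.3, profitability=0.15, liquidity=0.5)
stressed = probability_of_default(leverage=0.9, profitability=-0.05, liquidity=0.1)

print(round(healthy, 3), round(stressed, 3))  # stressed firm gets a much higher PD
```

The output is always a valid probability in (0, 1), which is the appeal of the logit form; the fat-tail critique in this article is about the distributional assumptions behind such scores, not the link function itself.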
Probability of Default is the major component when determining (i) Capital Requirements under Basel II (now Basel III), (ii) Expected Loss, and (iii) Risk-Weighted Assets.<br/> The probability of default (PD) is also a crucial parameter in risk management, used for loan applications, rating estimation, the pricing of credit derivatives and many other key financial tasks. False estimation of PD leads to unreasonable ratings and incorrect pricing of financial instruments, and was thereby one of the causes of the recent global financial crisis. <br/> The aim of this research work is to come up with new advanced stressed probability models that can capture chaotic market behaviour, or, put another way, that can work under financial market distress to some extent.<br/> <br/> LITERATURE REVIEW<br/> <br/> Initially, corporate distress was assessed based on qualitative information, which was very subjective. In particular, four references were mostly used, namely: (i) the capacity of the manager in charge of the project or company, (ii) the fact that the manager had an important financial involvement in the company as a financial guarantee, (iii) the project and the industry itself, and (iv) the fact that the firm possessed assets or collateral as back-up in case of a bad situation. <br/> <br/> John M. Moody (1909) was the first to publish credit rating grades for publicly traded bonds. In 1941, David Durand applied the discriminant analysis proposed by Fisher (1936) to classify prospective borrowers. Attempts were made in the 1950s to merge automated credit decision making with statistical techniques. Lacking sophisticated computing tools, these models had limitations. 
Myers and Forgy (1963) compared discriminant analysis with regression in a credit scoring application.<br/> <br/> Altman (1960) introduced variables into a multivariate discriminant analysis and obtained a function depending on some financial ratios. Beaver, in 1966, introduced a univariate approach of discriminant analysis to assess the individual relationships between predictive variables and subsequent failure events. In 1968, Altman expanded the work of Beaver (1966) to allow one to assess the relationship between failure and a set of financial characteristics. Martin (1977) presented a logistic regression model to predict probabilities of failure of banks using data obtained from the Federal Reserve System. Ohlson (1980) used Logit to predict bankruptcy. Zmijewski (1984) used Probit to estimate the probability of default and predict bankruptcy.<br/> <br/> In 1985, West used factor analysis and logit estimation to assign the probability of a bank being problematic. In 2001, Shumway introduced a dynamic logit, or hazard, model to predict bankruptcy. Chava & Jarrow (2004), Hillegeist, Keating, Cram, & Lundstedt (2004), and Beaver, McNichols & Rhie (2005) used Shumway’s approach. In 2004, Jones & Hensher introduced a mixed logit model for financial distress prediction and argued that it offers significant improvements compared to binary logit and multinomial logit models. Campbell, Hilscher, & Szilagyi (2008) introduced a dynamic logit model to predict corporate bankruptcies and failures at short and long horizons using accounting and market variables.<br/> <br/> In 2011, Altman, Fargler, & Kalotay used accounting-based measures, firm characteristics and industry-level expectations of distress conditions to estimate the likelihood of default inferred from equity prices. 
Li, Lee, Zhou, & Sun (2011) introduced a combined random subspace approach (RSB) with binary logit models to generate a so-called RSB-L model that takes into account different decision agents’ opinions as a means to enhance results.<br/> Sun & Li (2011) tested the feasibility and effectiveness of dynamic modelling for financial distress prediction (FDP) based on the Fisher discriminant analysis model.<br/> <br/> Stefan Van der Ploeg (2011) stated that, since the seminal work of Martin (1977), the Logit and Probit models have become among the most commonly applied parametric failure prediction models in the academic literature as well as in banking regulation and supervision. Jamilu (2015) introduced new methods entitled “Jameel’s Advanced Stressed Methods using Jameel’s Criterion” to stress economic and financial stochastic models, initially using the Logit and Probit models. <br/> <br/> JAMEEL'S ADVANCED STRESSED METHODS<br/> <br/> Before incorporating fat-tailed effects into our existing LOGIT and PROBIT models, we have to discuss how to obtain the BEST FITTED FAT-TAILED PROBABILITY DISTRIBUTIONS OF THE UNDERLYING STOCK RETURNS OF THE COMPANIES UNDER CONSIDERATION. <br/> <br/> The major aim of this post of the research findings is to consider eleven (11) out of the fifty (50) WORLD’S BIGGEST PUBLIC companies in the FORBES 2015 ranking, regardless of the platform on which they are listed, and to run goodness-of-fit tests on their time series from 2009 to 2014. These include:<br/> <br/> (1) China Construction Bank Corporation (CICHY), 2009 – 2014<br/> (2) Bank of China Limited (3988.K), 2009 – 2014 <br/> (3) Berkshire Hathaway Inc. (BRK – A), 2009 – 2014<br/> (4) Toyota Motor Corporation (TM), 2009 – 2014<br/> (5) Volkswagen Group AG (VLKAY), 2009 – 2014<br/> (6) Bank of America Corporation (BAC), 2009 – 2014<br/> (7) Nestle India Limited (NESTLEIND.NS), 2009 – 2014<br/> (8) International Business Machines Corporation (IBM), 2009 – 2014<br/> (9) Goldman Sachs Group Securities (GJS), 2009 – 2014<br/> (10) Google Inc. (GOOG), 5/18/2012 to 7/2/2015<br/> (11) Facebook Inc. (FB), 5/18/2012 to 7/2/2015<br/> <br/> This is regardless of the question of WHAT would happen to the stock returns probability distributions IF we considered:<br/> (a) daily, weekly, quarterly or annual stock returns<br/> (b) more than five (5) companies<br/> (c) companies listed on different stock exchanges, not only ONE STOCK EXCHANGE platform<br/> (d) simultaneously long-term and short-term time series of the stock returns<br/> (e) recently listed public companies like FACEBOOK, whose initial public offering was on 18th May 2012, when it began selling stock to the public and trading on the NASDAQ, and GOOGLE, on the March 9 2006.<br/> <br/> To achieve this, the author developed what is called JAMEEL'S CRITERION, using the ESSAYFITS SOFTWARE, as follows:<br/> <br/> JAMEEL'S CRITERION:<br/> <br/> (i) We accept if the average of the ranks of the Kolmogorov–Smirnov, Anderson–Darling and Chi-squared tests is less than or equal to three (3)<br/> (ii) We must choose the probability distribution followed by the data ITSELF, regardless of its rankings<br/> (iii) If there is a tie, we include both probability distributions in the selection<br/> (iv) At least two (2) probability distributions must be included in the selection <br/> (v) We select the most frequently occurring probability distribution as the qualifying candidate in each goodness-of-fit test of the stock returns<br/> <br/> In the next post, we shall continue from Jameel's Criterion.<br/> <br/> Thank you.</p>Introducing 
User Behavioral Analysis in the Risk Processtag:www.analyticbridge.datasciencecentral.com,2017-07-31:2004291:BlogPost:3690282017-07-31T17:30:00.000ZAndrew Maranehttps://www.analyticbridge.datasciencecentral.com/profile/AndrewMarane
<div>Many years ago, when I was entering the intelligence community, I attended a class in Virginia where the instructor opened the session with a test that I will never forget and that I have applied to almost every analytic task in my career. At the beginning of the class we were shown a ten-minute video of Grand Central Station at rush hour, with tens of thousands of people, and were asked if we could find a single pickpocket in the crowd by the end of the video. At the end of ten minutes, no one in the class was able to identify the individual.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>The purpose of the class was to stress the importance of looking at behavior as an indication that something is wrong: looking for a person or thing that is not doing or behaving the same way as everything else. By understanding what is normal in any given place or situation, identifying threats becomes much easier because outliers stand out. If you want to find a pickpocket in a crowd of people, you don’t attempt to look at every person; you understand what the crowd is doing and look for the person who is not doing what the crowd is.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>By the end of the class, the video was shown again and every person was able to find the pickpocket within the first three minutes. Everyone understood how this principle could be applied to everything from counter-terrorism to personal protection and, as I realized after entering the field, fraud. The person who is committing fraud does not behave the same way legitimate people do online. Even their attempts to look “normal” make them stand out, because their goal is completely different from that of everyone else on your site.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>Recently I conducted an experiment to see what percentage of users who had committed organized fraud transactions could have been identified by their behavior before they ever made a transaction. By looking at these individuals’ interactions with the platform from account inception through the transaction process, I examined whether this group did anything that a legitimate user would not, and whether that action was exclusive to individuals committing fraud. I used a year’s worth of fraud events, hundreds of thousands of fraud transactions, and began dissecting their activity down to the click. In the end, 93% of individuals who had engaged in organized fraud could have been detected before they ever made a transaction based on their behavior, with the bulk identifiable at account creation. Of that 93%, over half could have been identified by the first eight things they did on the site, based on how they entered, their settings, their network and the flow of interaction from entry to attempted activation. Even worse, and something I hadn’t planned on, 12% of transactions rejected for organized fraud were false positives and misclassified. If behavioral analysis had been utilized, it would have detected that these were legitimate users and applied a different decision.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>The impulse of every company engaged in online commerce is to develop rules and models aimed at identifying fraud in the transaction flow. We throw hundreds of rules that look at thousands of data points at a person who is buying something, writing a post or review, or conducting a financial transaction. Over the years, these rules continue to grow and weave their way through the commerce or submission flow of our operation, and they funnel every single person and every single transaction through them. We continue to scale as transaction volume grows, adding resources and bandwidth to the rules platform as the people and the rules themselves continue to multiply.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>Ultimately, our fraud platforms struggle with latency and conflict because we are aiming every big gun we have at every individual who walks through our door. Because we keep adding guns, they eventually start pointing at themselves, creating conflicts within the fraud infrastructure and friction for every user, regardless of who that user is, because none of these rules simply looks at the risk of the individual. In the end, more transactions are pushed to manual review, which adds cost, or worse, are falsely rejected due to a lack of intelligence about the users themselves.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>Fraud rules, fraud filters and fraud models are essential and play an important part in any fraud prevention platform, but they are reactionary, not strategic. They do not know the difference between an individual who entered your site three months ago shopping for a TV, returned to your site 20 times in the following weeks looking at the prices, reviews, specs and pictures of the TVs you are selling, and then finally made a purchasing decision on the best and most expensive TV you have, and the guy who appeared out of the blue, set up an account, went directly to the most expensive TV you have, threw it in the cart and hit the purchasing flow. The rules are going to treat both of these transactions exactly the same because they are “new” users buying a high-risk item.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>By starting the fraud identification process by analyzing the behavior of users, particularly new users who enter your site or establish an account, we become strategic and solve any number of issues in our fraud scoring process. Much in the same way we classify transactions by risk, our behavior model looks for attributes and actions that run contrary to what legitimate users do on the site, and it has to begin with an understanding of what that is (I recommend you cozy up with your user experience team; they have piles of information on that very thing). Once there is a good understanding of normal user activity and behavior, we start training models that look for the opposite, and since you know what good behavior is, finding the abnormal behavior becomes much easier (see paragraph one). This can be visualized by scatter plotting behavioral attributes across millions of activities and users.</div>
<div class="separator"><span style="color: #ffffff;">.</span></div>
<div>Our behavioral model is going to look at attributes and actions and begin scoring them so we can tell the difference between a risky new user and a good new user (new users on first transactions are the hardest to classify and the highest risk in the transaction flow). It will look for things such as the user’s network: did they enter on a hosted service, proxy or botnet, or are they in some other way trying to disguise where they are and who they are? Do they have Java, Flash or cookies disabled? Do they enter the site and go directly to the sign-up flow or account creation? What is the delta in the amount of time it takes them to complete the sign-up flow? Is it 20 seconds when every legitimate user takes 2 minutes? What type of operating system, browser and resolution (user agent attributes) are they on? Is it unique? What language localization are they using, and does it match where the other attributes and user-entered data say they are from? What are the click patterns? Are they too efficient in getting from entry to transaction for a normal user?</div>
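A first pass at such a scorer can be sketched as a weighted checklist over these signals. The field names, weights and thresholds below are invented purely for illustration; a real system would learn weights from labeled sessions rather than hand-code them.

```python
# Toy behavioral risk score over pre-transaction signals like those described
# above. All field names, weights and thresholds are hypothetical.

def behavior_risk_score(session):
    score = 0
    if session.get("via_proxy"):                 score += 3  # hides network origin
    if session.get("cookies_disabled"):          score += 1
    if session.get("straight_to_signup"):        score += 1  # no browsing first
    if session.get("signup_seconds", 120) < 30:  score += 2  # far faster than normal
    if session.get("language_mismatch"):         score += 2  # locale vs. claimed country
    return score

fraudster = {"via_proxy": True, "signup_seconds": 18,
             "straight_to_signup": True, "language_mismatch": True}
shopper   = {"signup_seconds": 140, "cookies_disabled": False}

print(behavior_risk_score(fraudster), behavior_risk_score(shopper))   # 8 0
```

The point is that every signal is available before the transaction flow: the score is computed on entry, sign-up and browsing behavior alone.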
<div><span style="color: #ffffff;">.</span></div>
<div>This could go on for some time: there are literally thousands of things to be gleaned from user and page logs that have enormous value in fraud detection and are hardly ever used anywhere. The behavioral model should be looking at everything you can feed into it from entry up to the transaction; this model doesn’t care about transactions, because its job is to make a decision on what to do with this user before they get there.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>A good example of strategic intervention over tactical or reactive intervention is the comparison of airline security in the United States with Israel. In the US, arriving at the airport starts the screening process, and everyone is lumped together and fed through the same security process to identify risk. Everyone goes through the same security line, the same x-ray and the same scanner, and even random selection does not take into account anything about you; it’s just a random number. A five-year-old is basically looked at the same way a military-age individual is when it comes to the overall process, which is why you always see the funny and yet disturbing videos of five-year-olds getting patted down, because their number was up.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>In Israel, airline security knows as much as it can about you before you even show up at the airport. By the time you arrive they know about you and your network: why you’re there, where you’re going, why you’re going and anything else they can possibly dig up about you. They have already risk-scored you from the minute you bought the ticket and decided what kind of screening you are going to have at the airport and, if you are still high risk, which security agent is going to be sitting in the seat behind you ready to blow your head off if you jump too quick. Strategic vs. tactical risk management, or in our case, proactive vs. reactive fraud detection.</div>
<div><span style="color: #ffffff;">.</span></div>
<div>Once the behavior model is built and beginning to score and surface risk, it can begin solving fraud and platform issues. The first step would be to begin cohorting users into different risk groups to determine which fraud rules and models apply based on behavior. A new user who exhibited normal or predicted user behavior before entering the transaction flow would not be subject to the same filters as a user who showed highly suspicious behavior.</div>
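The cohorting idea can be sketched as a simple routing function on top of a behavior score. Thresholds, cohort names and routing targets below are all hypothetical; the design point is that the routing decision is made before the transaction ever reaches the rules platform.

```python
# Routing sketch: assign each user to a risk cohort from a behavior score,
# then decide which part of the fraud platform their transactions hit.
# Thresholds and names are illustrative.

def cohort(score):
    if score >= 7:
        return "known_bad"      # bypass the models, straight to rejection
    if score >= 3:
        return "high_risk"      # full gauntlet of rules and manual review
    return "low_risk"           # exclusion rules only

def route(score):
    return {
        "known_bad": "reject",
        "high_risk": "full_rules_and_review",
        "low_risk":  "exclusion_rules_only",
    }[cohort(score)]

print(route(8), route(4), route(1))  # reject full_rules_and_review exclusion_rules_only
```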
<div><div class="separator"><span style="color: #ffffff;">.</span></div>
By cohorting users, we can begin more accurately “vetting” users in the transaction flow and also redirect traffic within our fraud platform to improve scalability and reduce friction. An established user with a good behavior profile and a predictable buying pattern could bypass all but exclusion rules, freeing up bandwidth for users who exhibit high-risk behavior and would run the gauntlet of our fraud platform.<br/>
<div class="separator"><span style="color: #ffffff;">.</span></div>
</div>
<div>Likewise, a user who exhibits known fraudulent behavior doesn’t need to be routed through the fraud platform at the time of transaction; they can bypass the models straight to rejection. We can tune our cohort groups to optimize manual review toward those users in specific risk groups who have a higher likelihood of transaction completion. By implementing behavior modeling, and by deep diving into and analyzing the data from the user’s interactions prior to the transaction flow, we get a better understanding of the user’s intent on our site, we gain efficiency and bandwidth in our transactional fraud process, and we gain greater accuracy when making risk-based decisions.</div>How to Detect if Numbers are Random or Nottag:www.analyticbridge.datasciencecentral.com,2017-07-10:2004291:BlogPost:3665472017-07-10T06:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>In this article, you will learn some modern techniques to detect whether a sequence appears as random or not, whether it satisfies the central limit theorem (CLT) or not -- and what the limiting distribution is if CLT does not apply -- as well as some tricks to detect abnormalities. Detecting lack of randomness is also referred to as signal versus noise detection, or pattern recognition.</p>
<p>It leads to the exploration of time series with massive, large-scale (long term) auto-correlation structure, as well as model-free, data-driven statistical testing. No statistical knowledge is required: we will discuss deep results that can be expressed in simple English. Most of the testing involved here uses big data (more than a billion computations) and data science, to the point that we reached the accuracy limits of our machines. So there is even a tiny piece of numerical analysis in this article.</p>
<p>Potential applications include testing randomness, Monte Carlo simulations for statistical testing, encryption, blurring, and <a href="http://www.datasciencecentral.com/profiles/blogs/interesting-data-science-application-steganography" target="_blank">steganography</a> (encoding secret messages into images) using pseudo-random numbers. A number of open questions are discussed here, offering the professional post-graduate statistician new research topics both in theoretical statistics and advanced number theory. The level here is state-of-the-art, but we avoid jargon and some technicalities to allow newbies and non-statisticians to understand and enjoy most of the content. An Excel spreadsheet, attached to this document, summarizes our computations and will help you further understand the methodology used here.</p>
<p>Interestingly, I started to research this topic by trying to apply the notorious <a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank">central limit theorem</a> (CLT) to non-random (static) variables -- that is, to fixed sequences of numbers that look chaotic enough to simulate randomness. Ironically, it turned out to be far more complicated than using CLT for regular random variables. So I start here by describing what the initial CLT problem was, before moving into other directions such as testing randomness, and the distribution of the largest gap in seemingly random sequences. As we will see, these problems are connected. </p>
<p><span class="font-size-3"><strong>1. Central Limit Theorem for Non-Random Variables</strong></span></p>
<p>Here we are interested in sequences generated by a periodic function f(<em>x</em>) that has an irrational period <em>T</em>, that is f(x+<em>T</em>) = f(x). Examples include f(<em>x</em>) = sin <em>x</em> with <em>T</em> = 2<span style="font-family: 'symbol', geneva;">p</span>, or f(x) = {<span style="font-family: 'symbol', geneva;">a</span>x} where <span style="font-family: 'symbol', geneva;">a</span> > 0 is an irrational number, { } represents the <a href="https://en.wikipedia.org/wiki/Fractional_part" target="_blank">fractional part</a> and <em>T</em> = 1/<span style="font-family: 'symbol', geneva;">a</span>. The <em>k</em>-th element in the infinite sequence (starting with <em>k</em> = 1) is f(<em>k)</em>. The central limit theorem can be stated as follows:</p>
<p>Under certain conditions to be investigated -- mostly the fact that the sequence seems to represent or simulate numbers generated by a well-behaved stochastic process -- we would have:</p>
<p><a href="http://api.ning.com:80/files/lFg-2*UqyS7c3VI1rJZ-56pQ-GyU5xXXkHlDEUSKvuwlpdSIRwDUfjr3xOgRJrcmHLIZsg5POMw3AEUtxkYMs0yg5m2eTCSz/Capture1.PNG" target="_self"><img src="http://api.ning.com:80/files/lFg-2*UqyS7c3VI1rJZ-56pQ-GyU5xXXkHlDEUSKvuwlpdSIRwDUfjr3xOgRJrcmHLIZsg5POMw3AEUtxkYMs0yg5m2eTCSz/Capture1.PNG?width=703" width="703" class="align-center"/></a></p>
<p>In short, <em>U</em>(<em>n</em>) tends to a normal distribution of mean 0 and variance 1 as <em>n</em> tends to infinity; that is, as both <em>n</em> and <em>m</em> tend to infinity, the values <em>U</em>(<em>n</em>+1), <em>U</em>(<em>n</em>+2), ..., <em>U</em>(<em>n</em>+<em>m</em>) have a distribution that converges to the standard bell curve.</p>
<p>From now on, we are dealing exclusively with sequences that are <a href="https://en.wikipedia.org/wiki/Equidistributed_sequence" target="_blank">equidistributed</a> over [0, 1], thus <span style="font-family: 'symbol', geneva;">m</span> = 1/2 and <span style="font-family: 'symbol', geneva;">s</span> = SQRT(1/12). In particular, we investigate f(<em>x</em>) = {<span style="font-family: 'symbol', geneva;">a</span><em>x</em>} where <span style="font-family: 'symbol', geneva;">a</span> > 0 is an irrational number and { } the fractional part. While this function produces a sequence of numbers that seems fairly random, there are major differences from truly random numbers, to the point that CLT is no longer valid. The main difference is the fact that these numbers, while somewhat random and chaotic, are much more evenly spread than random numbers. True random numbers tend to create some clustering as well as empty spaces. Another difference is that these sequences produce highly auto-correlated numbers.</p>
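<p>As a quick numerical check, here is a stdlib-only Python sketch (alpha = sqrt(2) and n = 100,000 are illustrative choices, not the spreadsheet's exact settings) confirming that both kinds of sequences share the uniform moments:</p>

```python
import math
import random

def alpha_sequence(alpha, n):
    """First n terms of f(k) = {alpha * k}, the fractional part of alpha * k."""
    return [(alpha * k) % 1.0 for k in range(1, n + 1)]

n = 100_000
seq = alpha_sequence(math.sqrt(2), n)      # alpha = sqrt(2), an arbitrary irrational
rnd = [random.random() for _ in range(n)]  # truly (pseudo-)random uniforms

for name, xs in (("alpha-sequence", seq), ("random numbers", rnd)):
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    print(f"{name}: mean = {mean:.4f}, variance = {var:.4f}")
```

<p>Both print values close to 1/2 and 1/12 ≈ 0.0833; the differences only show up in finer statistics such as the maximum gap and the auto-correlations discussed further down.</p>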
<p>As a result, we propose a more general version of CLT, redefining <em>U</em>(<em>n</em>) by adding two parameters <em>a</em> and <em>b</em>: </p>
<p><a href="http://api.ning.com:80/files/lFg-2*UqyS56ZvvpU4DRshu2WGmCT-Qy7KhoDkR33qA-NoS0R3KKs8pyNdgUK7tmwvNckoY0upDSVU2kpnieorvFjugeKavG/Capture2b.PNG" target="_self"><img src="http://api.ning.com:80/files/lFg-2*UqyS56ZvvpU4DRshu2WGmCT-Qy7KhoDkR33qA-NoS0R3KKs8pyNdgUK7tmwvNckoY0upDSVU2kpnieorvFjugeKavG/Capture2b.PNG" width="298" class="align-center"/></a></p>
<p>This more general version of CLT can handle cases like our sequences. Note that the classic CLT corresponds to <em>a</em> = 1/2 and <em>b</em> = 0. In our case, we suspect that <em>a</em> = 1 and <em>b</em> is between -1 and 0. This is discussed in the next section. </p>
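<p>The exact normalization is given in the image above; one plausible reading -- chosen here only because it reduces to the classic CLT when <em>a</em> = 1/2 and <em>b</em> = 0, and it is an assumption, not the article's verified formula -- is U(n) = n^a (log n)^b (M(n) - mu) / sigma, where M(n) is the mean of the first n terms. Under that assumption, a stdlib Python sketch:</p>

```python
import math

MU, SIGMA = 0.5, math.sqrt(1 / 12)  # mean and standard deviation of uniform [0, 1]

def U(n, a, b, alpha=math.sqrt(2)):
    """Normalized deviation of the running mean of the alpha-sequence f(k) = {alpha k}.

    Assumes U(n) = n^a (log n)^b (M(n) - MU) / SIGMA, which reduces to the
    classic CLT normalization when a = 1/2 and b = 0.
    """
    m = sum((alpha * k) % 1.0 for k in range(1, n + 1)) / n
    return n ** a * math.log(n) ** b * (m - MU) / SIGMA

# For alpha-sequences the error M(n) - 1/2 is only of order (log n) / n,
# so the classic a = 1/2 normalization decays toward 0 while a = 1 does not:
for n in (1_000, 10_000, 100_000):
    print(n, U(n, 0.5, 0), U(n, 1, 0), U(n, 1, -1))
```

<p>Playing with <em>a</em> and <em>b</em> here mimics what the spreadsheet does interactively in its figures.</p>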
<p>Note that if instead of f(<em>k</em>), the <em>k</em>-th element of the sequence is replaced by f(<em>k^2</em>) then the numbers generated behave more like random numbers: they are less evenly distributed and less auto-correlated, and thus the CLT might apply. We haven't tested it yet. </p>
<p>You will also find an application of CLT to non-random variables, as well as to correlated variables, <a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank">in my previous article on this topic</a>.</p>
<p><span class="font-size-4"><strong>2. Testing Randomness: Max Gap, Auto-Correlations and More</strong></span></p>
<p>Let's call the sequence f(1), f(2), f(3) ... f(<em>n</em>) generated by our function f(<em>x</em>) an <span style="font-family: 'symbol', geneva;">a</span>-sequence. Here we compare properties of <span style="font-family: 'symbol', geneva;">a</span>-sequences with those of random numbers on [0, 1] and we highlight the striking differences. Both sequences, when <em>n</em> tends to infinity, have a mean value converging to 1/2, a variance converging to 1/12 (just like any uniform distribution on [0, 1]), and they both look quite random at first glance. But the similarities almost stop here. </p>
<p><strong>Maximum gap</strong></p>
<p>The maximum gap among <em>n</em> points scattered between 0 and 1 is another way to test for randomness. If the points are truly randomly distributed, the expected length of the maximum gap (also called the longest segment) is known and is equal to</p>
<p><a href="http://api.ning.com:80/files/KtY2yaJFKB6Voi*GralKMEEv*VZ4x3sYQ1rrR46fXestPxL*UBN-2KCB0r7MTCxVzmrWO3LBNkUB8DmmuV8TDBm370qFI7nv/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/KtY2yaJFKB6Voi*GralKMEEv*VZ4x3sYQ1rrR46fXestPxL*UBN-2KCB0r7MTCxVzmrWO3LBNkUB8DmmuV8TDBm370qFI7nv/Capture.PNG" width="150" class="align-center"/></a></p>
<p>See <a href="https://math.stackexchange.com/questions/13959/if-a-1-meter-rope-is-cut-at-two-uniformly-randomly-chosen-points-what-is-the-av" target="_blank">this article</a> for details, or the book <a href="https://www.amazon.com/Order-Statistics-Herbert-David/dp/0471389269" target="_blank">Order Statistics</a> published by Wiley, page 135. The max gap values have been computed in the spreadsheet (see section below to download the spreadsheet) both for random numbers and for <span style="font-family: 'symbol', geneva;">a</span>-sequences. It is pretty clear from the Excel spreadsheet computations that the average maximum gaps have the following expected values, as <em>n</em> becomes very large:</p>
<ul>
<li>Maximum gap for random numbers: log(<em>n</em>) / <em>n</em> as expected from the above theoretical formula</li>
<li>Maximum gap for <span style="font-family: 'symbol', geneva;">a</span>-sequences: <em>c</em> / <em>n</em> (<em>c</em> is a constant close to 1.5; the result needs to be formally proved)</li>
</ul>
<p>So <span style="font-family: 'symbol', geneva;">a</span>-sequences have points that are far more evenly distributed than random numbers, by an order of magnitude, not just by a constant factor! This is true for the 8 <span style="font-family: 'symbol', geneva;">a</span>-sequences (8 different values of <span style="font-family: 'symbol', geneva;">a</span>) investigated in the spreadsheet, corresponding to 8 "nice" irrational numbers (more on this in the research section below, about what a "nice" irrational number might be in this context.) </p>
<p><strong>Auto-correlations</strong></p>
<p>Unlike random numbers, values of f(<em>k</em>) exhibit strong, large-scale auto-correlations: f(<em>k</em>) is strongly correlated with f(<em>k</em>+<em>p</em>) for some values of <em>p</em> as large as 100. The successive lag-<em>p</em> auto-correlations do not seem to decay with increasing values of <em>p</em>. On the contrary, the maximum lag-<em>p</em> auto-correlation (in absolute value) appears to increase with <em>p</em>, possibly getting very close to 1 eventually. This is in stark contrast with random numbers, which do not show auto-correlations significantly different from zero -- as confirmed in the spreadsheet. Also, the vast majority of time series have auto-correlations that quickly decay to 0; this surprising lack of decay could be the subject of some interesting number-theoretic research. These auto-correlations are computed and illustrated in the Excel spreadsheet (see section below) and are worth checking out. </p>
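<p>The structure is visible directly in code. Since f(<em>k</em>+<em>p</em>) = (f(<em>k</em>) + {<span style="font-family: 'symbol', geneva;">a</span><em>p</em>}) mod 1, the lag-<em>p</em> auto-correlation depends only on c = {<span style="font-family: 'symbol', geneva;">a</span><em>p</em>} -- in the limit it works out to 1 - 6c(1 - c), large whenever {<span style="font-family: 'symbol', geneva;">a</span><em>p</em>} is close to 0 or 1. A stdlib sketch (alpha = sqrt(2) is an illustrative choice):</p>

```python
import math

def lag_autocorrelation(xs, p):
    """Sample lag-p auto-correlation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    cov = sum((xs[k] - mean) * (xs[k + p] - mean) for k in range(n - p)) / (n - p)
    return cov / var

alpha = math.sqrt(2)
n = 100_000
seq = [(alpha * k) % 1.0 for k in range(1, n + 1)]

# Compare the empirical lag-p auto-correlation with the limit 1 - 6c(1 - c):
for p in (1, 10, 50, 100):
    c = (alpha * p) % 1.0
    print(p, lag_autocorrelation(seq, p), 1 - 6 * c * (1 - c))
```

<p>For i.i.d. uniform noise, by contrast, the sample auto-correlations are all of order 1/sqrt(n).</p>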
<p><strong>Convergence of <em>U</em>(<em>n</em>) to a non-degenerate distribution</strong></p>
<p>Figures 2 and 3 in the next section (extracts from our spreadsheet) illustrate why the classic central limit theorem (that is, <em>a</em> = 1/2, <em>b</em> = 0 for the <em>U</em>(<em>n</em>) formula) does not apply to <span style="font-family: 'symbol', geneva;">a</span>-sequences, and why <em>a</em> = 1 and <em>b</em> = 0 might be the correct parameters to use instead. However, with the data gathered so far, we can't tell whether <em>a</em> = 1 and <em>b</em> = 0 is correct, or whether <em>a</em> = 1 and <em>b</em> = -1 is correct: both exhibit similar asymptotic behavior, and the data collected is not accurate enough to make a final decision. The answer could come from theoretical considerations rather than from big data analysis. Note that the correct parameters should produce a somewhat horizontal band for <em>U</em>(<em>n</em>) in figure 2, with values mostly concentrated between -2 and +2 due to the normalization of <em>U</em>(<em>n</em>) by design. Both <em>a</em> = 1, <em>b</em> = 0 and <em>a</em> = 1, <em>b</em> = -1 do just that, while it is clear that <em>a</em> = 1/2, <em>b</em> = 0 (the classic CLT) fails, as illustrated in figure 3. You can play with the parameters <em>a</em> and <em>b</em> in the spreadsheet, and see how they change figures 2 and 3, interactively. </p>
<p>One issue is that we computed <em>U</em>(<em>n</em>) for <em>n</em> up to 100,000,000 using a formula that is ill-conditioned: multiplying a large quantity <em>n</em> by a value close to zero (for large <em>n</em>) to compute <em>U</em>(<em>n</em>), when the precision available is probably less than 12 digits. This might explain the large, unexpected oscillations found in figure 2. Note that oscillations are expected (after all, <em>U</em>(<em>n</em>) is supposed to converge to a statistical distribution, possibly the bell curve, even though we are dealing with non-random sequences) but such large-scale, smooth oscillations, are suspicious. </p>
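<p>This is where high precision computing helps. A hedged stdlib sketch using Python's decimal module (the article's own computations used Perl and Excel; the 50-digit setting and the recurrence below are illustrative choices): the sequence is built incrementally, so the sum is never contaminated by the low-precision fractional part of a huge product n * alpha.</p>

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50          # 50 significant digits (illustrative choice)

alpha = Decimal(2).sqrt()       # sqrt(2) to 50 digits
step = alpha % 1                # {alpha}

# f(k) = {alpha k} via the recurrence f(k) = (f(k-1) + {alpha}) mod 1;
# round-off accumulates like n * 10^-50, i.e. negligibly.
n = 100_000
frac, total = Decimal(0), Decimal(0)
for _ in range(n):
    frac = (frac + step) % 1
    total += frac

# The ill-conditioned quantity S(n) - n/2 now carries dozens of correct digits,
# instead of the dozen or so available in double precision.
print(total - Decimal(n) / 2)
```

<p>The mpmath package in Python, or Rmpfr in R, provide similar arbitrary-precision arithmetic with more convenience (transcendental functions, interoperability with floats).</p>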
<p><strong><span class="font-size-4">3. Excel Spreadsheet with Computations</span></strong></p>
<p><a href="http://datashaping.com/tctl.xlsx" target="_blank">Click here</a> to download the spreadsheet. The spreadsheet has 3 tabs: One for <span style="font-family: 'symbol', geneva;">a</span>-sequences, one for random numbers -- each providing auto-correlation, max gap, and some computations related to estimating <em>a</em> and <em>b</em> for <em>U</em>(<em>n</em>) -- and a tab summarizing <em>n</em> = 100,000,000 values of <em>U</em>(<em>n</em>) for <span style="font-family: 'symbol', geneva;">a</span>-sequences, as shown in figures 2 and 3. That tab, based on data computed using a Perl script, also features moving maxima and moving minima, a concept similar to moving averages, to better identify the correct parameters <em>a</em> and <em>b</em> to use in <em>U</em>(<em>n</em>). </p>
<p>Confidence intervals (CI) can be empirically derived to test a number of assumptions, as illustrated in figure 1: in this example, based on 8 measurements, it is clear that maximum gap CI's for <span style="font-family: 'symbol', geneva;">a</span>-sequences are very different from those for random numbers, meaning that <span style="font-family: 'symbol', geneva;">a</span>-sequences do not behave like random numbers.</p>
<p><a href="http://api.ning.com:80/files/KtY2yaJFKB6KsHYx84NGKv2T24dHWq4PXxXtg7d**3gwNuzqLMLQUR2jrHDa3mMAiyU80buVj7F1Z39Of4ro7UpF1I2UN83c/Capture.PNG" target="_self"><img src="http://api.ning.com:80/files/KtY2yaJFKB6KsHYx84NGKv2T24dHWq4PXxXtg7d**3gwNuzqLMLQUR2jrHDa3mMAiyU80buVj7F1Z39Of4ro7UpF1I2UN83c/Capture.PNG" width="515" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 1</strong>: max gap times n (n = 10,000), for 8 <span style="font-family: 'symbol', geneva;">a</span>-sequences (top) and 8 sequences of random numbers (bottom)</em></p>
<p><a href="http://api.ning.com:80/files/KtY2yaJFKB6nBIDr4NLvNg3f*p-xVEt9yMsro5RPphWM*x34ZiC5dcJ1JViH6ZyVQjrTqqO*g2hn80uBmyYTEcLykRWkI3NQ/CaptureA.PNG" target="_self"><img src="http://api.ning.com:80/files/KtY2yaJFKB6nBIDr4NLvNg3f*p-xVEt9yMsro5RPphWM*x34ZiC5dcJ1JViH6ZyVQjrTqqO*g2hn80uBmyYTEcLykRWkI3NQ/CaptureA.PNG?width=503" width="503" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 2</strong>: U(n) with a = 1, b = 0 (top) and U(n) moving max / min (bottom) for <span style="font-family: 'symbol', geneva;">a</span>-sequences</em></p>
<p><a href="http://api.ning.com:80/files/KtY2yaJFKB5il*-FhIyyJS7ibpjMnMP-64KqHhPHEfWrgQlJKkEHnkRG5CJeeLCt88F-b3AAOpwl1lDF0J3wvTkYjcKc3764/Captureb.PNG" target="_self"><img src="http://api.ning.com:80/files/KtY2yaJFKB5il*-FhIyyJS7ibpjMnMP-64KqHhPHEfWrgQlJKkEHnkRG5CJeeLCt88F-b3AAOpwl1lDF0J3wvTkYjcKc3764/Captureb.PNG?width=504" width="504" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 3</strong>: U(n) with a = 0.5, b = 0 (top) and U(n) moving max / min (bottom) for <span style="font-family: 'symbol', geneva;">a-</span>sequences</em></p>
<p><span class="font-size-4"><strong>4. Potential Research Areas</strong></span></p>
<p>Here we mention some interesting areas for future research. By sequence, we mean <span style="font-family: 'symbol', geneva;">a</span>-sequence as defined in section 2, unless otherwise specified. </p>
<ul>
<li>Using f(<em>k^c</em>) as the <em>k</em>-th element of the sequence, instead of f(<em>k</em>). Which values of <em>c</em> > 0 lead to equidistribution over [0, 1], as well as yielding the classic version of CLT with <em>a</em> = 1/2 and <em>b</em> = 0 for <em>U</em>(<em>n</em>)? Also, what happens if f(<em>k</em>) = {<span style="font-family: 'symbol', geneva;">a </span>p(<em>k</em>)}, where p(<em>k</em>) is the <em>k</em>-th prime number and { } represents the fractional part? This sequence was proved to be equidistributed on [0, 1] (this by itself is a famous result of analytic number theory, published by Vinogradov in 1948) and behaves much more like random numbers, so maybe the classic CLT applies to this sequence? Nobody knows. </li>
</ul>
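<p>One can at least probe the prime variant numerically. The sketch below (stdlib Python; alpha = sqrt(2) and the 10^6 sieve bound are arbitrary illustrative choices) checks that {<span style="font-family: 'symbol', geneva;">a </span>p(<em>k</em>)} has the equidistribution moments, consistent with Vinogradov's result:</p>

```python
import math

def primes_up_to(limit):
    """Sieve of Eratosthenes."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [i for i, is_prime in enumerate(sieve) if is_prime]

alpha = math.sqrt(2)
seq = [(alpha * p) % 1.0 for p in primes_up_to(1_000_000)]  # 78,498 terms

n = len(seq)
mean = sum(seq) / n
var = sum((x - mean) ** 2 for x in seq) / n
print(f"n = {n}, mean = {mean:.4f}, variance = {var:.4f}")  # ~0.5 and ~1/12
```

<p>Whether its U(n) also satisfies the classic CLT is exactly the open question raised above; the moments alone cannot settle it.</p>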
<ul>
<li>What is the asymptotic distribution of the moments and distribution of the maximum gap among the <em>n</em> first terms of the sequence, both for random numbers on [0, 1] and for the sequences investigated in this article? Does it depend on the parameter <span style="font-family: 'symbol', geneva;">a</span>? Same question for minimum gap and other metrics used to test randomness, such as point concentration, defined for instance in the article <a href="http://www.sciencedirect.com/science/article/pii/S0022314X99924204" target="_blank">On Uniformly Distributed Dilates of Finite Integer Sequences</a>?</li>
</ul>
<ul>
<li>Does <em>U</em>(<em>n</em>) depend on <span style="font-family: 'symbol', geneva;">a</span>? What are the best choices for <span style="font-family: 'symbol', geneva;">a</span>, to get as much randomness as possible? In a similar context, sqrt(2)-1 and (sqrt(5)-1)/2 are found to be good candidates: see <a href="https://en.wikipedia.org/wiki/Low-discrepancy_sequence" target="_blank">this Wikipedia article</a> (read the section on additive recurrence.) Also, what are the values of the coefficients <em>a</em> and <em>b</em> in <em>U</em>(<em>n</em>), for <span style="font-family: 'symbol', geneva;">a</span>-sequences? It seems that <em>a</em> must be equal to 1 to guarantee convergence to a non-degenerate distribution. Is the limiting distribution for <em>U</em>(<em>n</em>) also normal for <span style="font-family: 'symbol', geneva;">a</span>-sequences, when using the correct <em>a</em> and <em>b</em>?</li>
</ul>
<ul>
<li>What happens if <span style="font-family: 'symbol', geneva;">a</span> is very close to a simple rational number, for instance if the first 500 digits of <span style="font-family: 'symbol', geneva;">a</span> are identical to those of 3/2?</li>
</ul>
<p><strong>Generalization to higher dimensions</strong></p>
<p>So far we worked in dimension 1, the support domain being the interval [0, 1]. In dimension 2, f(<em>x</em>) = {<span style="font-family: 'symbol', geneva;">a</span><em>x</em>} becomes f(<em>x</em>, <em>y</em>) = ({<span style="font-family: 'symbol', geneva;">a</span><em>x</em>}, {<span style="font-family: 'symbol', geneva;">b</span><em>y</em>}) with <span style="font-family: 'symbol', geneva;">a</span>, <span style="font-family: 'symbol', geneva;">b</span>, and <span style="font-family: 'symbol', geneva;">a</span>/<span style="font-family: 'symbol', geneva;">b</span> irrational; f(<em>k</em>) becomes f(<em>k</em>,<em>k</em>). Just like the interval [0, 1] can be replaced by a circle to avoid boundary effects when deriving theoretical results, the square [0, 1] x [0, 1] can be replaced by the surface of the torus. The maximum gap becomes the maximum circle (on the torus) with no point inside it. The range statistic (maximum minus minimum) becomes the area of the convex hull of the <em>n</em> points. For a famous result regarding the asymptotic behavior of the area of the convex hull of a set of <em>n</em> points, <a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank">read this article</a> and check out the sub-section entitled "Other interesting stuff related to the Central Limit Theorem." Note that as the dimension increases, boundary effects become more important. </p>
<p><a href="http://api.ning.com:80/files/5yCEGsYwHQr1dk1G9i2cH1fchLFoDRoH4ONMOJ-6vFsDU1vxyjxKyeEior6HtwcVBGmeNsvaMaUnCpUO6Ogz2CBD72UOYKTy/biv.PNG" target="_self"><img src="http://api.ning.com:80/files/5yCEGsYwHQr1dk1G9i2cH1fchLFoDRoH4ONMOJ-6vFsDU1vxyjxKyeEior6HtwcVBGmeNsvaMaUnCpUO6Ogz2CBD72UOYKTy/biv.PNG" width="467" class="align-center"/></a></p>
<p style="text-align: center;"><em><strong>Figure 4</strong>: bi-variate example with c = 1/2, <span style="font-family: 'symbol', geneva;">a</span> = SQRT(31), <span style="font-family: 'symbol', geneva;">b</span> = SQRT(17) and n = 1000 points</em></p>
<p>Figure 4 shows an unusual example in two dimensions, with strong departure from randomness, at least when looking at the first 1,000 points. Usually, the point pattern looks much more random, albeit not perfectly random, as in Figure 5.</p>
<p style="text-align: center;"><a href="http://api.ning.com:80/files/QzmRjEe-2paSdrJZvK-BHRQxwI*tUYGdGk28ST-vzHjO4HlXin1L4YcJzzRuwYVkmXWnmXnR003FmDBZ2QUYnKl3NryeeMnF/vvv.PNG" target="_self"><img src="http://api.ning.com:80/files/QzmRjEe-2paSdrJZvK-BHRQxwI*tUYGdGk28ST-vzHjO4HlXin1L4YcJzzRuwYVkmXWnmXnR003FmDBZ2QUYnKl3NryeeMnF/vvv.PNG" width="467" class="align-center"/></a> <em><strong>Figure 5</strong>: bi-variate example with c = 1/2, <span style="font-family: 'symbol', geneva;">a</span> = SQRT(13), <span style="font-family: 'symbol', geneva;">b</span> = SQRT(26) and n = 1000 points</em></p>
<p style="text-align: left;">Computations are found <a href="http://api.ning.com:80/files/QzmRjEe-2pZ3iaGKOuGtjwCfeTA7dpgyYjRCGBMY8WSco0FnNqXlGyGv67job4C6PfHwvKQsMeE7Znnok-7y-OddV0W5Vww7/bivariate.xlsx" target="_self">in this spreadsheet</a>. Note that we have mostly discussed the case <em>c</em> = 1 in this article; it creates very regular patterns (points evenly spread, just as in one dimension). The case <em>c</em> = 1/2 creates interesting patterns, while <em>c</em> = 2 produces more random-looking patterns.</p>
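<p>Reproducing these point sets takes only a few lines of stdlib Python (plotting, e.g. with matplotlib, is left out; the Figure 4 parameters below are taken from the caption):</p>

```python
import math

def bivariate_points(alpha, beta, n, c=1.0):
    """Points ({alpha * k^c}, {beta * k^c}) for k = 1..n."""
    return [((alpha * k ** c) % 1.0, (beta * k ** c) % 1.0) for k in range(1, n + 1)]

# Figure 4 configuration: c = 1/2, alpha = sqrt(31), beta = sqrt(17), n = 1000
pts_structured = bivariate_points(math.sqrt(31), math.sqrt(17), 1000, c=0.5)

# c = 1 gives the very evenly spread pattern discussed in the text
pts_regular = bivariate_points(math.sqrt(31), math.sqrt(17), 1000, c=1.0)
```

<p>Scatter-plotting <code>pts_structured</code> against <code>pts_regular</code> shows the same contrast as Figures 4 and 5.</p>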
<p><strong>Related articles</strong></p>
<ul>
<li><a href="http://www.analyticbridge.datasciencecentral.com/profiles/blogs/10-interesting-reads-for-math-geeks" target="_blank">12 Interesting Reads for Math Geeks</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank">My Best Articles</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank">The Fundamental Statistics Theorem Revisited</a></li>
</ul>
<div><p><span class="font-size-4"><b>DSC Resources</b></span></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>
<p>Popular Articles</p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between Machine Learning, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">What is Data Science? 24 Fundamental Articles Answering This Question</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/hitchhiker-s-guide-to-data-science-machine-learning-r-python">Hitchhiker's Guide to Data Science, Machine Learning, R, Python</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
</ul>
</div>Open sourcing 'spot the difference'tag:www.analyticbridge.datasciencecentral.com,2017-07-21:2004291:BlogPost:3681462017-07-21T07:30:00.000ZDan Kelletthttps://www.analyticbridge.datasciencecentral.com/profile/DanKellett
<p>Capital One UK’s Data Science team has been focused on moving from proprietary (paid-for) software to open source for some time now.</p>
<p>There are several key benefits to making this change. Open source software is prevalent in academia which makes it much easier for our new starters to hit the ground running, building models and analysing data on day one with the company (the switch has also been a terrific development opportunity for my team to learn new skills). Our team now has greater and quicker access to cutting-edge techniques and approaches; as soon as a package is available, our team can install and get moving rather than having to wait for software updates and upgrade projects. Finally, there is a real cost benefit in using open source software.</p>
<p>Our switch to open source has been a journey requiring the team to learn a heap of new skills. Migrating large codebases from our legacy systems has involved a lot of work, and after two years, everything we do uses open source. Having come so far along this transformation, the time was right to start giving back to the open source community. Today, I’m excited to announce our first analytic package in R: <b>dataCompareR</b>.</p>
<p> <a href="http://api.ning.com:80/files/NNjuEVHX27ysq1lmiTUVxXqpd7qcjn3Rlu7bXwWI-B4-oMwreWsGVc3EBse41eZhgwnmyihqHPnuxxDTpgZ*jKw2wGkbwKEm/dataCompareR.png" target="_self"><img src="http://api.ning.com:80/files/NNjuEVHX27ysq1lmiTUVxXqpd7qcjn3Rlu7bXwWI-B4-oMwreWsGVc3EBse41eZhgwnmyihqHPnuxxDTpgZ*jKw2wGkbwKEm/dataCompareR.png" width="519" class="align-full"/></a></p>
<p>In the Data Science team we often have to move code across environments or re-code from one language into another. The key to making sure this has been done correctly is the ability to compare two datasets (before and after) to make sure they are the same – if they’re not, you want to find out where they are different and why. Historically, our software had a handy procedure to do this, but we didn’t have an equivalent in R. So we thought: “Let’s build it!”</p>
<p>With <b>dataCompareR</b> you are able to point the package at two datasets. It will compare the two and highlight any differences. Simple. The process to build <b>dataCompareR</b> was actually a load of fun. After some initial planning the entire team, comprising 18 people, spent two full days in hackathon mode building the functionality, and testing harnesses and outputs for the package. That’s a lot of coffee and pizza! It was a great couple of days and the team learnt a lot – both about R and how to work together in an agile manner.</p>
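<p>dataCompareR itself is an R package; purely to illustrate the match-then-diff idea behind it in a language-neutral way, here is a hypothetical toy sketch in Python (not the package's implementation):</p>

```python
def compare_datasets(left, right, key):
    """Match rows of two datasets on `key`, then report row and cell differences."""
    li = {row[key]: row for row in left}
    ri = {row[key]: row for row in right}
    only_left = sorted(set(li) - set(ri))
    only_right = sorted(set(ri) - set(li))
    mismatches = []
    for k in sorted(set(li) & set(ri)):
        for col in li[k]:
            if col in ri[k] and li[k][col] != ri[k][col]:
                mismatches.append((k, col, li[k][col], ri[k][col]))
    return {"only_left": only_left, "only_right": only_right, "mismatches": mismatches}

before = [{"id": 1, "x": 10}, {"id": 2, "x": 20}]
after = [{"id": 1, "x": 10}, {"id": 2, "x": 21}, {"id": 3, "x": 30}]
report = compare_datasets(before, after, "id")
print(report)
```

<p>A real comparison tool also needs to handle column type coercion, tolerances for numeric comparisons, and readable reporting, which this sketch ignores.</p>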
<p>We feel pretty proud of the <b>dataCompareR</b> package and would love people to start using it and give us some feedback.</p>
<p>So get yourself to the <a href="https://cran.r-project.org/web/packages/dataCompareR/index.html" target="_blank">CRAN</a> and enjoy!</p>Text Clustering : Get quick insights from Unstructured Datatag:www.analyticbridge.datasciencecentral.com,2017-07-06:2004291:BlogPost:3666342017-07-06T03:30:00.000ZVivek Kalyanaranganhttps://www.analyticbridge.datasciencecentral.com/profile/VivekKalyanarangan
<p>In this two-part series, we will explore text clustering and how to get insights from unstructured data. It will be quite powerful and industrial strength. The first part will focus on the motivation. The second part will be about implementation.</p>
<p>This post is the first part of the two-part series on how to get insights from unstructured data using text clustering. We will build this in a very modular way so that it can be applied to any dataset. Moreover, we will also focus on <span>exposing the functionalities as an API</span> so that it can serve as a plug and play model without any disruptions to the existing systems.<span> </span></p>
<ul>
<li>Text Clustering: How to get quick insights from Unstructured Data – Part 1: The Motivation</li>
<li>Text Clustering: How to get quick insights from Unstructured Data – Part 2: The Implementation</li>
</ul>
<p>In case you are in a hurry, you can find the full code for the project on my <a href="https://github.com/vivekkalyanarangan30/Text-Clustering-API/">Github Page</a>.</p>
<p>Just a sneak peek into how the final output is going to look –</p>
<p><a href="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-4-300x204.png" target="_blank"><img src="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-4-300x204.png?width=300" width="300" class="align-full"/></a></p>
<p>It is established beyond reasonable doubt that data is the new oil. Organizations across the globe are aggressively building in-house analytics capabilities to harness this untapped treasure trove. However, sustainable business benefits arising from analytics initiatives remain elusive at large, as organizations are yet to discover the secret recipe that makes it all work.</p>
<p>As per a recent study, the average ROI from analytics initiatives is still negative for most organizations. Most organizations are in one of the following stages of evolution towards becoming a data-driven organization –</p>
<p><a href="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-300x49.png" target="_blank"><img src="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-300x49.png?width=300" width="300" class="align-full"/></a></p>
<h2>Dealing with Unstructured Data</h2>
<p>Organizations today are sitting on vast heaps of data and unfortunately, most of it is unstructured in nature. There is an abundance of data in the form of free flow text residing in our data repositories.</p>
<p>While there are many analytical techniques in place that help process and analyze structured (i.e. numeric) data, fewer techniques exist that are targeted towards analyzing natural language data.</p>
<h2>The Solution</h2>
<p>In order to overcome these problems, we will devise an <span>unsupervised</span> text clustering approach that enables businesses to programmatically bin this data. The bins themselves are programmatically generated based on the algorithm’s understanding of the data. This helps tone down the volume of the data and makes the broader picture easy to grasp: instead of trying to understand millions of rows, it is enough to understand the top keywords in about 50 clusters.</p>
<p>Based on this, a world of opportunities open up –</p>
<ol>
<li>In a customer support module, these clusters help identify the show stopper issues and can become subjects of increased focus or automation.</li>
<li>Customer reviews on a particular product or brand can be summarized, which can lay out a road map for the organization</li>
<li>Surveys data can be easily segmented</li>
<li>Resumes and other unstructured data in the HR world can be effortlessly analyzed</li>
</ol>
<p>This list is endless but the point of focus is a generic machine learning algorithm that can help derive insights in an amenable form from large parts of unstructured text.</p>
<h2>Text Clustering: Some Theory</h2>
<p>The algorithm first performs a series of transformations on the free flow text data (elaborated in subsequent sections) and then performs k-means clustering on the vectorized form of the transformed data. Subsequently, the algorithm creates cluster-wise tags, derived from the cluster centers, that are representative of the data contained in each cluster.<br/> The solution is automated end to end and generic enough to operate on <span>any dataset</span>.</p>
<p>The text clustering algorithm works in five stages enumerated below:-</p>
<ul>
<li>Transformations on raw stream of free flow text</li>
<li>Creation of Term Document Matrix</li>
<li>TF-IDF (Term Frequency – Inverse Document Frequency) Normalization</li>
<li>K-Means Clustering using Euclidean Distances</li>
<li>Auto-Tagging based on Cluster Centers</li>
</ul>
<p>These are elaborated below along with illustrations:-</p>
<p>The free flow text data is first curated in the following stages:-</p>
<ul>
<li>Stage 1<ul>
<li>Removing punctuations</li>
<li>Transforming to lower case</li>
<li>Grammatically tagging sentences and removing pre-identified stop phrases (<a href="http://www.nltk.org/book/ch07.html">Chunking</a>)</li>
<li>Removing numbers from the document</li>
<li>Stripping any excess white spaces</li>
</ul>
</li>
<li>Stage 2<ul>
<li><span>Removing generic words</span> of the English language viz. determiners, articles, conjunctions and other parts of speech.</li>
</ul>
</li>
<li>Stage 3<ul>
<li><span>Document Stemming</span> which reduces each word to its root using <span>Porter’s stemming algorithm</span>.</li>
</ul>
</li>
</ul>
<p>These steps are best explained through the illustration below:-</p>
<p><a href="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-1-300x116.png" target="_blank"><img src="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-1-300x116.png?width=300" width="300" class="align-full"/></a></p>
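<p>The three curation stages above can be sketched in plain Python. This is a simplified stand-in, not the post's exact implementation: the stop-word list is a tiny illustrative subset, and the <code>stem</code> function is a crude suffix-stripping approximation of Porter’s algorithm (in practice one would use NLTK’s <code>PorterStemmer</code> and a full English stop-word corpus):</p>

```python
import re
import string

# Tiny illustrative stop-word list; a real implementation would use a full corpus
STOP_WORDS = {"the", "a", "an", "and", "or", "but", "of", "to", "in", "on",
              "is", "are", "was", "were", "it", "this", "that"}

def stem(word):
    """Crude suffix stripping; a stand-in for Porter's stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(doc):
    # Stage 1: lower-case, strip punctuation, remove numbers, squeeze whitespace
    doc = doc.lower().translate(str.maketrans("", "", string.punctuation))
    doc = re.sub(r"\d+", " ", doc)
    tokens = doc.split()
    # Stage 2: remove generic (stop) words
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stage 3: reduce each word to its root
    return [stem(t) for t in tokens]

print(preprocess("The engines are failing, and 3 warnings were logged!"))
# → ['engin', 'fail', 'warning', 'logg']
```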
<p>Once all the documents in the corpus are transformed as explained above, a term document matrix is created and the documents are mapped into this vector space model using a <span>1-gram vectorizer </span>(see below). More sophisticated implementations use n-grams (where n is a reasonably small integer).</p>
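<p>As an illustration of this vectorization step, here is a minimal pure-Python construction of a 1-gram term document matrix (a sketch only; production code would typically use a library vectorizer such as scikit-learn's <code>CountVectorizer</code>):</p>

```python
from collections import Counter

def term_document_matrix(docs):
    """Build a 1-gram term-document matrix: one row per document,
    one column per vocabulary term, cell = raw term count."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    rows = [[Counter(toks).get(term, 0) for term in vocab] for toks in tokenized]
    return vocab, rows

vocab, tdm = term_document_matrix(["server crash report", "crash crash again"])
print(vocab)  # → ['again', 'crash', 'report', 'server']
print(tdm)    # → [[0, 1, 1, 1], [1, 2, 0, 0]]
```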
<h3>TF-IDF (Term Frequency – Inverse Document Frequency) Normalization</h3>
<p>This is an optional step and can be performed when there is high variability in the document corpus and the number of documents is extremely large (of the order of several million). This normalization increases the importance of terms that appear multiple times in the same document while decreasing the importance of terms that appear across many documents (which are mostly generic terms). The term weights are computed as follows:-</p>
<p><a href="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-2-300x100.png" target="_blank"><img src="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-2-300x100.png?width=300" width="300" class="align-full"/></a></p>
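<p>The standard TF-IDF weighting weights term t in document d as tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing t. A minimal illustration follows (a sketch; the exact variant used by the post's implementation, e.g. smoothed or cosine-normalized, may differ):</p>

```python
import math
from collections import Counter

def tf_idf(docs):
    """Weight each term by tf * log(N / df): high when frequent within a
    document, zero when the term appears in every document."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    df = Counter(t for toks in tokenized for t in set(toks))  # document frequency
    weights = []
    for toks in tokenized:
        tf = Counter(toks)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

w = tf_idf(["printer jam error", "printer works fine"])
print(round(w[0]["printer"], 3))  # "printer" occurs in both docs → 0.0
print(round(w[0]["jam"], 3))      # "jam" is specific to doc 0 → log(2) ≈ 0.693
```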
<h3>K-Means Clustering using Euclidean Distances</h3>
<p>After the TF-IDF transformation, the document vectors are put through a <span>K-Means clustering algorithm</span> which computes the <span>Euclidean Distances</span> among these documents and clusters nearby documents together.</p>
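<p>A minimal version of Lloyd's K-Means algorithm over Euclidean distance can be sketched as below. The 2-D points are stand-ins for high-dimensional TF-IDF document vectors, and a production system would use an optimized library implementation rather than this illustration:</p>

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm: assign each point to the nearest center,
    then move each center to the mean of its cluster."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Recompute centers; keep the old center if a cluster emptied out
        centers = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Two well-separated groups of 2-D points
pts = [(0.1, 0.2), (0.0, 0.1), (5.0, 5.1), (5.2, 4.9)]
centers, clusters = kmeans(pts, k=2)
print(sorted(len(cl) for cl in clusters))  # → [2, 2]
```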
<h3>Auto-Tagging based on Cluster Centers</h3>
<p>The algorithm then generates cluster tags from the cluster centers, which represent the documents contained in each cluster. The clustering and auto-generated tags are best depicted in the illustration below (principal components 1 and 2 are plotted along the x and y axes respectively):-</p>
<p><a href="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-3-300x229.png" target="_blank"><img src="https://machinelearningblogs.com/wp-content/uploads/2017/01/Capture-3-300x229.png?width=300" width="300" class="align-full"/></a></p>
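<p>Auto-tagging itself is straightforward once the cluster centers are known: each center is a vector of per-term weights, and the highest-weighted terms become the cluster's tags. A sketch, with a hypothetical vocabulary and center vector:</p>

```python
def cluster_tags(center, vocab, top_n=3):
    """Tag a cluster with the vocabulary terms carrying the largest
    weights in its center vector."""
    ranked = sorted(range(len(vocab)), key=lambda i: center[i], reverse=True)
    return [vocab[i] for i in ranked[:top_n]]

# Hypothetical vocabulary and TF-IDF cluster center (illustrative values)
vocab = ["refund", "shipping", "login", "password", "delay"]
center = [0.02, 0.41, 0.05, 0.03, 0.37]
print(cluster_tags(center, vocab))  # → ['shipping', 'delay', 'login']
```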
<p>In order for more and more users to benefit from this solution and analyze their unstructured text data, I have created a <span>RESTful</span> <span>web service </span>that users can access in two ways:-</p>
<ul>
<li>A web interface for this service, which is a <span>Swagger API Docs front end</span>, a very popular solution for RESTful web services. The user navigates to the web interface URL, uploads the data-set, specifies the column containing the natural language data to be analyzed and the desired number of clusters, and within a few minutes a downloadable link appears containing the results of the analysis.</li>
<li>Since the web service works on the concept of <span>Application Programming Interface (API)</span>, the computation engine that performs the analysis is a separate component which is scalable, portable and can be accessed from any other application through <span>RESTful HTTP</span>.</li>
</ul>
<p>Since all computations are performed in-memory, the results are lightning fast.</p>
<h1>Conclusion</h1>
<p>A mathematical approach to understanding and analyzing natural language data could prove instrumental in unlocking the enormous value and insights currently trapped within it, and could vastly improve our understanding of our organization and its eco-system. The next post will cover the ground-level implementation details; follow along if you are interested. The code is available on my <a href="https://github.com/vivekkalyanarangan30/Text-Clustering-API/">Github Page</a>.</p>For Companies, Data Analytics is a Pain; But Why?tag:www.analyticbridge.datasciencecentral.com,2017-06-22:2004291:BlogPost:3657902017-06-22T19:30:00.000ZChirag Shivalkerhttps://www.analyticbridge.datasciencecentral.com/profile/ChiragShivalker
<p>Businesses across the globe are facing a multi-pronged challenge: a huge influx of data, increasing data complexity, and of course market volatility. To address these challenges, companies and all their verticals are turning to data-driven analytics and insights as a means to better understand their customer bases, grow their businesses, and manage the increasing uncertainty to a certain extent. </p>
<p></p>
<p><img src="http://api.ning.com:80/files/pS32jW-GgixU4xtXcijVVGgBMG9ic9l50lyLQOo7YPixzkKh1lhG49CaLb2AHuyaDRBCX7tzAeRnYsu5So7RfU-GpvIQAB*D/dataanalyticsisapainbutwhy.jpg?width=750" width="750"/></p>
<p></p>
<p>The shift from conventional to data-driven analytics is steered by technology and automation across organizations. Growth in digital technologies is enhancing the ability to analyze more and more data, ultimately increasing the appetite of enterprises for more data, better data, advanced analytics, and best practices. Data analytics is the primary enabler for deriving insights and reaching meaningful truths, resulting in business growth and increased revenue.</p>
<p></p>
<p>The world is going gaga over the promise of analytics and what enterprises can attain by harnessing it, compelling brands to make significant investments in analytics tools and analytics service providers. However, somewhere down the line it feels as if analytics is a bubble that is likely to burst at any time. Brands trying to leverage analytics for data-driven improvement across the enterprise report various reasons and related technology pain points. In no particular order, here are some of them:</p>
<p></p>
<p><span class="font-size-5">1. Analytics is not a vaccine, but a routine workout</span></p>
<p></p>
<p>Companies looking for instant solutions tend to think of analytics as a vaccine shot and undertake it in an ad-hoc way, as a one-time process to find value. It should not be the case at all. If enterprises are keen on improving their businesses continuously, they need analytics to be systematic and repetitive, like a routine workout at the gym.</p>
<p></p>
<p><span class="font-size-5">2. Insights are just the beginning, and don’t add immediate value to your business</span></p>
<p></p>
<p>The USP of some analytics players is the promise to convert data into "insights". But organizations should ask: what is an insight? Usually it is a static or interactive dashboard of graphs that enables slicing and dicing tons of data any way one wants.</p>
<p></p>
<p>But does it add any immediate value to the business? For these insights to really matter, human intervention is required to make sense of them, and to figure out what actions should be taken. You as a business would not invest in analytics just because you want insights – would you? Unless it yields answers, such insight is of no use. Businesses need specific, practical answers that improve the metrics with immediate value.</p>
<p></p>
<p><span class="font-size-5">3. Scalability</span></p>
<p></p>
<p>In the rat race, enterprises and organizations collect, or start collecting, high volumes of data from every available machine and transaction. The question that needs thought is whether they are equipped with the right tools and data analytics team, or have partnered with decision analysts who can help them keep pace with the volume and speed at which the data is generated. Most have not, and unfortunately a few have yet to even consider taking up a data analytics approach.</p>
<p></p>
<p><span class="font-size-5">4. Descriptive analytics is a post-mortem – does it really help?</span></p>
<p></p>
<p>The data analytics offered by most online tools and analytics service providers is a kind of post-mortem: a look back at old data to assess what happened and why, in order to make beneficial changes in the future. It really is helpful to know that male customers who visited your eCommerce site for a particular product will easily churn if no promotion is offered in the first three months. If you succeed in pitching a promotional offer to customers meeting this kind of profile, you will succeed in reducing the churn as well – a valuable takeaway. </p>
<p></p>
<p>However, it is even more important to know when to talk to customers with this profile so as to make the offer. Along with this, you are also required to keep a tab on other profiles that can increase your churn ratio. This is where predictive analytics walks into the picture. It empowers you to recognize which events, transactions and interactions are likely to lead to a particular outcome – churn, in this case. It also helps identify such cases while they are happening, so you can take the required action at the right time.</p>
<p></p>
<p><span class="font-size-5">5. Human intervention in analytics is a friend and a foe too</span></p>
<p></p>
<p>Usually data analytics needs humans to query the data, and the results reflect only the questions the analyst or data scientist thought fit to ask or put on paper, so the answers can be biased and incomplete.</p>
<p></p>
<p>Decision making that depends on instincts and intuition is risky, as human beings are not inherently impartial. Many cognitive biases and logical fallacies exist, and they have the potential to affect decisions most of the time. The best way to reduce cognitive bias is to rely on data to make informed decisions rather than pure human intuition. Data scientists should be made responsible for staying impartial when supervising machine learning, and organizations should embrace data collection from all available avenues, keeping the objective of the data in mind.</p>
<p></p>
<p><span class="font-size-5">6. Opportunity cost is huge; stale answers make dents </span></p>
<p></p>
<p>Conducting data analytics is a process in which experts and data scientists spend months or years on data collection, cleansing, validation, modeling and visualization before reaching final conclusions and deploying tactics. And let’s not forget that enterprises have been collecting terabytes of data – daily.</p>
<p></p>
<p>So by the time answers are produced and new tactics are developed and deployed, they are outdated, and at times obsolete, because of the long cycle times. The reason is that competitors, customers and environmental pressures keep changing the ground reality every minute of every day, depending on the nature of your business. That is why stale answers make irrevocable dents, and the opportunity cost is really huge.</p>
<p></p>
<p><span class="font-size-5">7. Manually intensive</span></p>
<p></p>
<p>Integrating an analytics thought process into an organization’s set of beliefs is mostly manually intensive. There are a lot of whys and whats, and the curiosity is welcome; but an overdose of it is time consuming and proves really costly. After all this, the actual analytics usually consumes hours, days, weeks or months of querying, coding, modeling, experimentation and, of course, deployment.</p>
<p></p>
<p><span class="font-size-5">8. Numerical data is analyzed, but what about categorical values? </span></p>
<p></p>
<p>Analyzing numerical data is a cakewalk for analytics solution providers and for any online or licensed analytics tool; but what about categorical values? Most organizations fail miserably at aligning relevant data across silos to understand how information gathered by one department or its systems, combined with variables from another department, drives performance and efficiency up or down. They also fail at tapping into the unstructured free-form text from sources like email, social media and calls.</p>
<p></p>
<p>Brands have invested significant resources in wringing value from data, but many are only tapping a small percentage of data available to them, leaving enormous value on the table.</p>
<p></p>
<p><span class="font-size-5">9. Users without expertise</span></p>
<p></p>
<p>Most analytics tools, from coding-heavy data science toolkits to drag-and-drop studios, require users with significant expertise in data science, statistics, coding and software to choose and develop models, transform data, and so on. The irony is that the people who most need to exploit the data are department managers with little or no such expertise.</p>
<p></p>
<p><span class="font-size-5">10. Increased lead time to value</span></p>
<p></p>
<p>Installing software packages and analytics tools is a time-consuming task, and the set-up required to get started increases the lead time to value. Enterprises want to get started NOW, and many cannot afford to wait.</p>Descriptive, Predictive & Prescriptive Analytics will fail to Help You Understand Your Businesstag:www.analyticbridge.datasciencecentral.com,2017-05-12:2004291:BlogPost:3642612017-05-12T13:30:00.000ZChirag Shivalkerhttps://www.analyticbridge.datasciencecentral.com/profile/ChiragShivalker
<p><span style="font-family: arial, helvetica, sans-serif;"><a href="http://api.ning.com:80/files/VP3FukQGz*pz64BEOD0jhAEEVMEAfq1H5NS*6GPs65ormHxc8*6NoV9kyvBtfgMeWumbCfnDYPK107mRU-DBvHtjVtrLLv-d/44B2770B8FE8551BBB717B8DD99866E7C82117099A76A19F75pimgpsh_fullsize_distr.jpg" target="_self"><img src="http://api.ning.com:80/files/VP3FukQGz*pz64BEOD0jhAEEVMEAfq1H5NS*6GPs65ormHxc8*6NoV9kyvBtfgMeWumbCfnDYPK107mRU-DBvHtjVtrLLv-d/44B2770B8FE8551BBB717B8DD99866E7C82117099A76A19F75pimgpsh_fullsize_distr.jpg" width="700" class="align-center"/></a></span></p>
<p></p>
<p><span style="font-family: arial, helvetica, sans-serif;">It has been a practice followed religiously by companies and organizations to analyze how they have performed over a period of time. It is mandatory for them to do so; some do it to survive and some do it to thrive in current market dynamics. Throughout the history of big data, it has been common practice to try to understand how the world and various businesses work, resting entirely on the analysis of the first-hand data available. This is true to the extent that people in the business describe their career progression as BBD and ABD: before big data and after big data.</span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">A few years back, it was like looking into a crystal ball: the wizard would describe what had happened to your business in the past. Today this is called descriptive analytics. Advances in data technology, along with a growing understanding of BIG data, brought answers to questions like "what is going to happen to my business in the future?", called predictive analytics.</span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">The times have changed, and so have market dynamics and world economies, which are more colorful than rainbows. This has compelled organizations across the globe to resort to the final stage of analytics: prescriptive analytics, which is all about “so what”, “why not”, and the like. The other forms of analytics can tell businesses what their customers buy, when and where; prescriptive analytics is what makes them understand “why”.</span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">Today every business is subject to new expectations, competitors, channels, threats and opportunities. Organizational leaders and C-suites across the globe, from the largest companies, have clearly understood that in order to boost revenues, increase profitability and build customer loyalty, they must make decisions with an understanding of customer actions, attitudes and opinions. Descriptive, predictive and prescriptive analytics can provide that complete insight, if adapted and adopted while taking two important aspects into consideration.</span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">One of them is that all three types of analytics should co-exist; none is better than, or useful without, the others. If you are aware of the basics of data analytics, you will agree that though the three are consecutive stages, they all contribute to the objective of improved decision making. The second, and more important, aspect that needs utmost attention is something we will discuss towards the end of this article.</span></p>
<p><span class="font-size-4" style="font-family: arial, helvetica, sans-serif;"><b>Know the past of your business through descriptive analytics</b></span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">The term past here refers to any timeline, from one minute to a number of years back. Such analytics mainly assists in comprehending the relationship between your products and your customers, where the main purpose of the assessment is to shape the future approach. It is a kind of learning from past behavior to influence future outcomes.</span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">To be candid, this type of analytics should not come to you as a surprise. You and others in your position have been using it, more or less, in the form of reports that give better insight into finances, operations, customer preferences, sales and so on. <a href="http://www.hitechbpo.com/market-research-and-data-analytics.php" target="_blank">Descriptive analysis</a> has played a vital role in determining what to do next, by transforming data into information from which the future outcome of acts and events can be concluded.</span></p>
<p><span class="font-size-4" style="font-family: arial, helvetica, sans-serif;"><b>Know the future of your business through predictive analytics</b></span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">This type of analytics is effective in providing businesses with actionable, data-based insights. Data mining, data modeling, game theory, machine learning and more are put to work to estimate future outcomes. Simply put, predictive analysis identifies potential risks and opportunities – if any. The 3 components of predictive analytics are <b>predictive modeling, decision analysis and optimization, </b>and<b> transaction profiling</b>. Useful across a wide range of departments in your organization, they help in forecasting demand for operations, determining risk profiles for the finance team, and predicting customer behavior in sales & marketing. Determining risk profiles needs a lot of data, both public and social.</span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">Predictive analytics is also useful for forecasting product or service demand for a particular geography, or for taking a segment approach to customer service and adjusting manpower and production accordingly. Data sets put to the task include weather data, sales data, social media data and so on. The use of historical and transactional data deserves a special mention here: it is used to identify patterns, whereas statistical models and algorithms are utilized to assess the relationships between data sets. With the advent of Big Data, predictive analytics has taken a really big leap: the more data you have on hand, in an organized manner, the more accurate the predictions.</span></p>
<p><span class="font-size-4" style="font-family: arial, helvetica, sans-serif;"><b>Intelligent insights derived from descriptive & predictive analytics, is Prescriptive Analytics</b></span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">Though it has been in existence for a few years short of a decade, it would not be wrong to say that prescriptive analytics is still taking baby steps. Maybe this is why <a href="http://www.gartner.com/newsroom/id/2575515?utm_source=datafloq&utm_medium=ref&utm_campaign=datafloq">Gartner</a> considers it an “Innovation Trigger” that will take 5-10 years to become fully functional and productive. All that said, the best part about prescriptive analytics is that it not only tells you what will happen and when, but also why it will happen. Along with this, it suggests how to act to reap the appropriate benefits of the predictions made.</span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">A fine blend of business rule algorithms, computational modelling techniques and machine learning is used for prescriptive analytics, along with a wide range of historical, transactional, public and social data sets. Its beauty lies in <a href="http://gizmodo.com/how-prescriptive-analytics-could-harness-big-data-to-se-512396683">how prescriptive analytics foresees</a> the effect of a particular decision and suggests adjustments to the decisions actually made, ultimately enhancing the decision making process and, of course, the bottom line. But as mentioned earlier, owing to its recent emergence, very few companies utilize this technique, and those that do see a large amount of error. The best example is <a href="https://waymo.com/">Google's self-driving cars</a>, which must decide based on predictions and future outcomes.</span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">Prescriptive analytics has the potential to leave a gigantic impact on your business, making it operationally efficient and effective against the competition by optimizing scheduling, production, inventory management and supply chain design.</span></p>
<p><span class="font-size-5" style="font-family: arial, helvetica, sans-serif;"><b>Final phase of analytics</b></span></p>
<p><span style="font-family: arial, helvetica, sans-serif;">So now we come to that second, more important aspect which needs utmost attention: data collection or extraction, followed by data cleansing and processing. If these processes are not in place, then even though descriptive, predictive and prescriptive analytics are known for making people understand their businesses, you are set to fail miserably. Better informed decisions about future outcomes will become a fantasy, as you will be struggling even to manage your day-to-day operations. Prescriptive analytics, as per IBM, is the “<a href="https://youtu.be/VtETirgVn9c">final phase of analytics</a>”; but in the absence of accurate and well managed data sets, your business could reach the final phase of liquidation or bankruptcy. </span></p>Business Analytics and Intelligence Comparedtag:www.analyticbridge.datasciencecentral.com,2017-04-25:2004291:BlogPost:3628832017-04-25T14:00:00.000ZLaura Bucklerhttps://www.analyticbridge.datasciencecentral.com/profile/LauraBuckler
<p>Business analytics and business intelligence are two different notions, but only a few people understand the difference. Interestingly, even people who have worked in the business industry struggle with this topic, or give varying answers when someone asks: what is the difference between business analytics and business intelligence?</p>
<p>Some people define business analytics as an umbrella term and place intelligence as one of its parts, together with data warehousing, enterprise performance management, risk, compliance and analytic applications. Meanwhile, others use the term business analytics as a level of domain knowledge related to predictive or statistical analytics.</p>
<p>So, how can you differentiate between the two?</p>
<p>Business intelligence can be defined as the need to get the most out of particular information. This need has become harder to fulfill due to the increased complexity of today's economy, but generally speaking it has not really changed in the past few decades.</p>
<p>Business analytics is what is used, or done, to help deliver a particular business need.</p>
<p>The answer to the question is quite simple: intelligence is something you have, while analytics is what you do with it. This applies to business too and makes it much easier to differentiate between the two terms. People often mix these terms up and confuse others, since 'business intelligence' is now used to refer to both notions, which could not be more wrong.</p>
<p>Once you determine the role both play in your particular business, you can easily understand the difference between both notions. Business intelligence is related to everything about accessing big chunks of data and consists of the infrastructure and software you will use with the goal of funneling data to analysis.</p>
<p>On the other hand, business analytics is the thing you do with the data you have at your disposal. Once you have gathered a specific amount of data by using business intelligence, you can use it to optimize the performance of your business. Additionally, business analytics help businesspeople to determine the client satisfaction rate.</p>
<h1><span class="font-size-5">Functions of Business Intelligence and Business Analytics</span></h1>
<p>Business intelligence is used to look backward and provide you with an insight into the data that has already occurred. The function of business analytics, on the other hand, is to anticipate the needs and trends of the future.</p>
<p>In order to have a successful business, it is crucial to follow both. For example, you must look into business intelligence to see what needs to be changed and what worked well for your company. Then, you will use this information and continue with business analytics to anticipate the hypothetical changes that will be done with each of your actions. If you learn to do all this, both business intelligence and analytics will help you make the right changes in the right way.</p>
<p>Many business people do not consider the difference to be very relevant, or to extend beyond a single company. Indeed, once you establish a way to optimize your business, the difference will not really matter. Still, many analysts consider the distinction crucial if one wants their business to succeed. Therefore, it is best to divide the two terms into different categories, understand their meanings and explain them to everyone who needs to know. By doing this, you can achieve internal precision.</p>
<p>People need to know what you are talking about. Learn the difference between the terms business intelligence and analytics to explain to others if you are thinking of big data or predicting the future of the market and your business decisions. Establishing an understanding of both terms can help avoid any confusion.</p>12 Interesting Reads for Math Geekstag:www.analyticbridge.datasciencecentral.com,2017-05-03:2004291:BlogPost:3632492017-05-03T17:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><span>Many data scientists have a passion for mathematics, and many modern math problems can be explored using data science. Below is a selection of interesting articles, many about challenging, deep mathematical problems, by a data scientist who developed </span><a href="http://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel" target="_blank">math-free algorithms</a><span>. Some of these articles cover statistical theory and thus belong to data science; some are just about mathematics and number theory for their own sake. Most can be understood by the layman. Some include R code to produce visualizations, and some involve processing vast amounts of data -- trillions of data points -- thus providing an excellent sandbox to test distributed architecture implementations and high performance computing.</span></p>
<p><a href="http://api.ning.com:80/files/NZ1ilwiKtKPIiq-jrsH0YJvhxQGnEcc6q6ZaJT-bs7L16Rr6n6QjIpqwYM0vG-QmXE5r9fm6zS0QZsK3L*Xy6z6qR9FLLpi*/airportsall.png" target="_self"><img src="http://api.ning.com:80/files/NZ1ilwiKtKPIiq-jrsH0YJvhxQGnEcc6q6ZaJT-bs7L16Rr6n6QjIpqwYM0vG-QmXE5r9fm6zS0QZsK3L*Xy6z6qR9FLLpi*/airportsall.png" width="226" class="align-center"/></a></p>
<p style="text-align: center;"><em>Math model: Tessellation</em></p>
<p><strong>12 Interesting Reads for Math Geeks</strong></p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/simple-proof-of-prime-number-theorem" target="_blank">Simple Proof of the Prime Number Theorem</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/prime-numbers-interesting-distribution-and-density-results" target="_blank">Fascinating Facts and Conjectures about Primes and Other Special Nu...</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/factoring-massive-numbers-a-new-machine-learning-approach" target="_blank">Factoring Massive Numbers: Machine Learning Approach</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/how-and-why-decorrelate-time-series" target="_blank">How and Why: Decorrelate Time Series</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/distribution-of-arrival-times-of-extreme-events" target="_blank">Distribution of Arrival Times of Extreme Events</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/the-fundamental-statistics-theorem-revisited" target="_blank">The Fundamental Statistics Theorem Revisited</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/88-per-cent-of-all-integers-have-a-factor-under-100" target="_blank">88 percent of all integers have a factor under 100</a></li>
<li><a href="http://www.analyticbridge.com/profiles/blogs/interesting-math-challenge-average-rotational-speed-of-earth" target="_blank">Math Challenge: Computing the Average Rotational Speed of Earth</a></li>
<li><a href="http://www.datasciencecentral.com/forum/topics/challenge-representation-of-numbers-as-infinite-products" target="_blank">Representation of Numbers as Infinite Products</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/a-beautiful-probability-theorem" target="_blank">A Beautiful Probability Theorem</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/mars-craters-an-interesting-stochastic-geometry-problem" target="_blank">Mars Craters: An Interesting Stochastic Geometry Problem</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/the-art-and-data-science-of-leveraging-economic-bubbles" target="_blank">The art and (data) science of leveraging economic bubbles</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/seasons-in-binary-star-planetary-systems" target="_blank">Seasons in Binary Star Planetary Systems</a></li>
</ul>
<p><b>DSC Resources</b></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>16 Data Science Repositoriestag:www.analyticbridge.datasciencecentral.com,2017-05-03:2004291:BlogPost:3632472017-05-03T16:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Each one is a repository in its own right, and together they cover topics such as time series, regression, outliers, clustering, correlation, Hadoop, deep learning, Python, IoT, data sets, cheat sheets, infographics, and more (AI coming soon).</p>
<p>Each one features a number of popular articles and resources.</p>
<ul>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/10-timeless-reference-books" target="_blank">14 Timeless Reference Books</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-big-data-repositories-you-should-check-out-1" target="_blank">20 Big Data Repositories You Should Check Out</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/a-plethora-of-data-set-repositories" target="_blank">A Plethora of Data Set Repositories</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/four-great-pictures-illustrating-machine-learning-concepts" target="_blank">Four Great Pictures Illustrating Machine Learning Concepts</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/15-deep-learning-tutorials" target="_blank">15 Deep Learning Tutorials</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/11-great-hadoop-spark-and-map-reduce-articles" target="_blank">11 Great Hadoop, Spark and Map-Reduce Articles</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/16-great-iot-articles-published-in-2016" target="_blank">16 Great IoT Articles Published in 2016</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-articles-about-core-data-science">24 Core Articles About Data Science</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/14-great-articles-and-tutorials-on-clustering">14 Great Articles and Tutorials on Clustering</a> </li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/13-great-articles-and-tutorials-about-correlation" target="_blank">13 Great Articles and Tutorials about Correlation</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/26-great-articles-and-tutorials-about-regression-analysis" target="_blank">26 Great Articles and Tutorials about Regression Analysis</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/11-articles-and-tutorials-about-outliers" target="_blank">10 Articles and Tutorials about Outliers</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/21-great-articles-and-tutorials-on-time-series" target="_blank">21 Great Articles and Tutorials on Time Series</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/17-amazing-infographics-and-other-visual-tutorials" target="_blank">15 Amazing Infographics and Other Visual Tutorials</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/misuses-of-statistics-examples-and-solutions" target="_blank">Misuses of Statistics: Examples and Solutions</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blogs/20-cheat-sheets-python-ml-data-science" target="_blank">20 Cheat Sheets: Python, ML, Data Science, R, and More</a></li>
</ul>
<p><b>DSC Resources</b></p>
<ul>
<li>Services: <a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a> | <a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a> | <a href="http://classifieds.datasciencecentral.com/">Classifieds</a> | <a href="http://www.analytictalent.com/">Find a Job</a></li>
<li>Contributors: <a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a> | <a href="http://www.datasciencecentral.com/forum/topic/new">Ask a Question</a></li>
<li>Follow us: <a href="http://www.twitter.com/datasciencectrl">@DataScienceCtrl</a> | <a href="http://www.twitter.com/analyticbridge">@AnalyticBridge</a></li>
</ul>The Ultimate Guide for Choosing Algorithms for Predictive Modelingtag:www.analyticbridge.datasciencecentral.com,2017-04-03:2004291:BlogPost:3584602017-04-03T06:00:00.000ZSteven M. Mehlerhttps://www.analyticbridge.datasciencecentral.com/profile/StevenMMehler
<p><a href="http://api.ning.com:80/files/nlbHGOcp6COKC1PzlHV4cREpCw2LWJb1gCfpaSGnRLD5n99uPiO3lPhBj5lPpLifV5oJRbRBXIbYprwKYLKiJ4nTLlS1DuPA/bigstock124077614.jpg" target="_self"><img src="http://api.ning.com:80/files/nlbHGOcp6COKC1PzlHV4cREpCw2LWJb1gCfpaSGnRLD5n99uPiO3lPhBj5lPpLifV5oJRbRBXIbYprwKYLKiJ4nTLlS1DuPA/bigstock124077614.jpg?width=750" width="750" class="align-full"/></a></p>
<p>There are three ways to look at data. The first is analytics: looking at data from the (potentially very recent) past, which lets you explore the questions "what happened?" and "why did it happen?" The second is monitoring: looking at things as they happen, often used to find abnormalities. Finally, there is predictive analytics: looking at data in a way that helps make predictions about what might happen in the future.</p>
<p> </p>
<p>Basically, predictive analytics is what drives the actions that make the changes which will, in turn, be monitored by the analytical phase. As you build your predictive analysis model, you will have various algorithms that you can select in the categories of machine-learning, data-mining, and statistics. Once you know more about your data, and what you want to accomplish, making this decision will become a bit easier.</p>
<p> </p>
<p>The algorithms that are right for you depend on what you are trying to accomplish. For example:</p>
<p> </p>
<ul>
<li>Classification algorithms are great if customer retention is your focus or if you are trying to put together a recommendation system.</li>
<li>Clustering algorithms work well for segmentation or use with social data.</li>
<li>Regression algorithms are generally used to predict outcomes from calendar-driven events.</li>
</ul>
<p> </p>
<p>It is best practice to use as many algorithms as you can, as long as they are the types of algorithms you need. The more information you have to compare and analyze, the better off you will be. It can be quite enlightening to find surprises or reveal interesting bits of information, which can lead to solved problems. Perhaps more importantly, this can reveal which information in your data can be used to predict future trends. Let's begin by going over some of the most popular predictive algorithms and methodologies.</p>
<p> </p>
<p><b>The Ensemble Model</b></p>
<p> </p>
<p>Many people have found that using an ensemble model is the best method for successful predictive analytics. An ensemble is a set of multiple models that all use the same data set. A mechanism is created to gather the output from the various models, and this combined information is used to provide a final analysis to the person running the test.</p>
<p> </p>
<p>The specifics of each model can vary: decision trees, scenarios, queries, and so on are all models. To pick the correct models, you have to understand what works best for your data and the problem you are trying to solve. Before proceeding, you will need to clearly define the questions you are trying to answer. For example:</p>
<p> </p>
<ul>
<li>Will a new target audience be receptive to our current email marketing efforts?</li>
<li>Should we create a microsite or a reviews page for a new product line?</li>
<li>Are customers with poor credit going to default if we offer in-house financing?</li>
<li>Will consumers buy clothing made with cheaper fabrics if prices are cut?</li>
</ul>
<p> </p>
<p><b>Unsupervised Clustering Algorithms</b></p>
<p> </p>
<p>These algorithms are very useful for finding relationships that may not be clear at first glance. If you are interested in finding similarities between various user personas, clustering algorithms might be the way to go. You can also use them to discover product relationships. If you've ever wanted to bundle services, or wondered how you could influence customers to respond to your upselling efforts, these are algorithms to consider.</p>
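<p>To make the segmentation idea concrete, here is a minimal one-dimensional k-means sketch (one of many clustering algorithms, chosen here only as an illustration) that splits invented customer-spend figures into two segments:</p>

```python
import random

def kmeans_1d(values, k, iters=20, seed=0):
    """Minimal 1-D k-means: group values around k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)           # pick k starting centers
    for _ in range(iters):
        # Assign each value to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Hypothetical monthly customer spend with two obvious segments.
spend = [10, 12, 11, 95, 101, 99, 13, 97]
centers = kmeans_1d(spend, k=2)   # low spenders vs. high spenders
```

The two returned centers land near the means of the low-spend and high-spend groups; a real segmentation would use many features and a library implementation, but the mechanics are the same.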
<p> </p>
<p><b>Regression Algorithms</b></p>
<p> </p>
<p>If you have data that you receive on a continuing basis, regression algorithms might help you to predict future trends based upon that data. For example, if you purchase raw materials for manufacturing processes, you could use the monthly price data that you gather to predict seasonal fluctuations in those prices.</p>
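<p>A minimal sketch of that idea, assuming twelve months of invented price data with a clean upward trend: an ordinary least-squares line is fit by hand and used to project the next month's price.</p>

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

# Hypothetical monthly raw-material prices, drifting upward.
months = list(range(1, 13))
prices = [100 + 2 * m for m in months]   # a clean trend for illustration

slope, intercept = fit_line(months, prices)
next_month_price = slope * 13 + intercept   # projected price for month 13
```

Real price series are noisy and seasonal, so a production model would add seasonal terms or use a forecasting library; the point is only that regression extrapolates a fitted relationship into the future.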
<p> </p>
<p><b>Not an Exact Science</b></p>
<p> </p>
<p>There is no exact formula for finding the ideal algorithms for predictive modeling. It takes a combination of understanding the types of algorithms available to you, understanding exactly what it is that you need to know, and understanding how to interpret the information that you receive.<br/> <br/> Those who are most successful at choosing the right algorithms for predictive modeling will have a strong understanding of data science, or they will work with people who do. Then, in addition to this, having a strong level of business area expertise and experience is key. This might be considered the ‘art’ of predictive modeling.</p>
<p> </p>
<p><b>Conclusion</b></p>
<p> </p>
<p>Ultimately, the work that goes into selecting algorithms to help to predict future trends and events is worthwhile. It can result in better customer service, improved sales, and better business practices. Each of these things can, of course, result in increased profits or lowered expenses. Both are desirable outcomes. The information above should act as a bit of a primer on the subject for those new to using analytics.</p>Monte Carlo Analysis and Simulationtag:www.analyticbridge.datasciencecentral.com,2017-04-11:2004291:BlogPost:3610282017-04-11T22:00:00.000ZArnaldo Gunzihttps://www.analyticbridge.datasciencecentral.com/profile/ArnaldoGunzi
<p id="fc33" class="graf graf--p graf-after--h3">The Monte Carlo method is a simple way to solve very difficult probabilistic problems. This text is a simple, didactic introduction to the subject, a mixture of history, mathematics and mythology.<br/> <br/> The method has its origins in World War II, proposed by the Polish-American mathematician Stanislaw Ulam and the Hungarian-American mathematician John von Neumann.</p>
<p class="graf graf--p graf-after--h3"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/873/0*5D4L-CR2WwNqYZKZ."/></p>
<p class="graf graf--p graf-after--h3">It is no coincidence that these scientists were Europeans who came to the United States. Several world-class scientists did the same, escaping the Nazis and their military domination.</p>
<p class="graf graf--p graf-after--h3"></p>
<p><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/873/0*0yn7hG570bYC4LwJ."/></p>
<p></p>
<p id="47a4" class="graf graf--p graf-after--figure">These and other bright scientists were involved in the secret project to develop the atomic bomb: the Manhattan Project.</p>
<h3 id="db47" class="graf graf--h3 graf-after--p">What does the Atomic Bomb have to do with Simulation?</h3>
<p id="952f" class="graf graf--p graf-after--h3">This is a very easy, simplified introduction, meant to give an idea of the Monte Carlo method's role in the Manhattan Project.<br/> <br/> Imagine we have an atom of plutonium.</p>
<p id="8f66" class="graf graf--p graf-after--p">"Atom" is a Greek word for "indivisible": "a" = not, "tomos" = cut, division. Scientists believed that these unique elements of nature were the indivisible building blocks of everything in the universe, like some kind of Lego. The idea of the atom has its roots in the philosopher Democritus, circa 500 B.C.</p>
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/0*eNdco3uJWUNWGZfi."/></div>
</div>
<p></p>
<p></p>
<p id="58e1" class="graf graf--p graf-after--figure">In the 1930s, scientists were researching the newly discovered nuclear fission, in which an atom is broken down, divided. The atom was no longer indivisible: it split into two other atoms and released an astounding amount of energy. <span class="markup--strong markup--p-strong">The bomb is the energy of the atom in our hands</span>.<br/> <br/> Enriched plutonium was an excellent atom for this, because it was highly unstable; so was uranium.<br/> <br/> Imagine a ball at the top of a mountain. Any little push will move the ball and release its potential energy as kinetic energy. That is the meaning of instability.</p>
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/0*ZjvSPxUlWLKFL541.GIF"/></div>
</div>
<p></p>
<p></p>
<p id="16a5" class="graf graf--p graf-after--figure">But there were many problems. Processing the natural mineral to enrich plutonium was a very painful process: like carving down an entire mountain of material, spending enormous quantities of energy, just to get a milligram of enriched plutonium. How much of this precious plutonium was enough? How could they use it?</p>
<h3 id="a36b" class="graf graf--h3 graf-after--p">Domino Effect</h3>
<p id="4624" class="graf graf--p graf-after--h3">Releasing the power of just one atom is harmless. To create a bomb, you need a chain reaction: one atom releases its energy, which breaks another atom, and then another: a domino effect. It is like fire: with too little gasoline, the fire extinguishes and never produces the chain reaction that burns a forest.</p>
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/0*FslivgBCpMZNeZb-."/></div>
</div>
<p></p>
<p></p>
<p id="3fe0" class="graf graf--p graf-after--figure">A subcritical reaction is when the bomb doesn't explode: the chain reaction doesn't happen. It is like a domino chain that is interrupted along the way.<br/> <br/> A supercritical reaction is when the bomb explodes: an exponential amount of energy is released.<br/> <br/> The mission of the scientists was to find the conditions for the critical reaction, the line between exploding and not exploding.</p>
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/0*mOgrf7Ohi4BxCSM_."/></div>
</div>
<p></p>
<p></p>
<h3 id="09ef" class="graf graf--h3 graf-after--figure">Do not explode in my hands</h3>
<p id="bf24" class="graf graf--p graf-after--h3">Knowledge of the critical condition served two main goals: the scientists needed to be sure that the bomb would explode, but also that it would not explode in their hands.<br/> <br/> They had to divide the plutonium into pieces small enough not to explode when they didn't want it to, even if an accident happened.<br/> <br/> And they had to be able to join the little pieces into one single great piece, with enough material to cause a chain reaction when they wanted it: at the moment of the explosion.</p>
<h3 id="a3c0" class="graf graf--h3 graf-after--p">How to calculate it?</h3>
<p id="1991" class="graf graf--p graf-after--h3">How do you calculate whether a bomb will explode or not?<br/> <br/> In a very simplified way, there is a model for calculating the behavior of a single atom: the probability of explosion, the amount of energy released by each fission of an atom, and so on.</p>
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/0*yOGcejaO7yT3EcwL."/></div>
</div>
<p></p>
<p></p>
<p id="981e" class="graf graf--p graf-after--figure">Given a model for a single atom, they needed to calculate the behavior of a group of atoms at a certain distance and density. If the first atom fissions, will the second atom do the same? Will the third?<br/> <br/> In other words, each atom is a random variable. The compound effect of two atoms is a sum of two random variables.<br/> <br/> Summing random variables is not an easy task: it means solving an integral equation (not easy even for a genius like von Neumann). The sum of two random variables is manageable, but the sum of hundreds of thousands of variables is not.<br/> <br/> <br/> Ulam and von Neumann proposed two solutions:<br/> <br/> <br/> 1. The Monte Carlo method, with human calculators. Imagine a way to divide the scenarios into many small cases. Each case is defined by a draw from the probability distribution of each atom. Each case was calculated by a woman (men were considered unfit for the task, because they made too many mistakes), using paper and pencil (can you imagine this?), and then a mathematician grouped all the cases together. It was named "Monte Carlo" after the casino, because it recalls a throw of the dice. There was a room full of women doing calculations. Note: they didn't even know what they were calculating, since the project was secret.</p>
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/1*l1l61h042FkXSJAdbnJFeA.jpeg"/></div>
</div>
<p>“Human computers”</p>
<p></p>
<p id="c494" class="graf graf--p graf-after--figure">2. The second solution was to use electronic computers. The only problem: electronic computers didn't exist. Von Neumann's solution was to invent the electronic computer! He designed its conceptual architecture (CPU, memory, input, output, etc.), and the computers we use today still follow the von Neumann architecture. It wasn't completed before the war ended, however, so the atomic bomb is entirely due to the efforts of the human calculators.</p>
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/1*IWmBtqdixTUfqS-UGyg0UA.jpeg"/></div>
</div>
<p></p>
<p></p>
<p id="27eb" class="graf graf--p graf-after--figure">In a probabilistic world, we use random variables to represent stochastic phenomena. We choose the right random variable to represent what we want. If it is an event like the height of a group of people, we use a normal variable. If it is the arrival of clients in a queue, we usually model it by an exponential function. If we have no idea, but somehow we know the minimum and maximum, the uniform distribution is a good choice.</p>
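<p>The three choices above (normal, exponential, uniform) can be sketched with Python's standard <code>random</code> module; the parameters below, such as a mean height of 170 cm and a 5-minute mean arrival gap, are made up for illustration.</p>

```python
import random

rng = random.Random(42)
N = 100_000

# Heights of a group of people: normal distribution.
heights = [rng.gauss(170, 10) for _ in range(N)]

# Gaps between client arrivals in a queue: exponential distribution.
gaps = [rng.expovariate(1 / 5) for _ in range(N)]   # mean gap of 5 minutes

# "I don't know, but I know the min and max": uniform distribution.
guesses = [rng.uniform(0, 10) for _ in range(N)]

mean_height = sum(heights) / N   # close to 170
mean_gap = sum(gaps) / N         # close to 5
mean_guess = sum(guesses) / N    # close to 5, the midpoint of [0, 10]
```

With enough draws, the sample means converge to the parameters we chose, which is exactly the property Monte Carlo simulation relies on.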
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/0*HvhbOrMQ0yxWbiXu."/></div>
</div>
<p></p>
<p></p>
<p id="f939" class="graf graf--p graf-after--figure">A very common mistake is to use a very complicated function unnecessarily. Say the analyst has one year of measurements, and he learned about the Weibull distribution in college and finds one that fits well. Is it a good choice? No, it is not, unless he knows exactly what he is doing: he will be overfitting the model. We don't want to model the past; we want to model the <span class="markup--strong markup--p-strong">future</span>, and the future will not necessarily fit a complicated Weibull. I prefer to be humble and say "I don't know exactly". The way we say "I don't know exactly" is to use a very simple random variable: a normal, a uniform.<br/> <br/> <br/> Once we know the random variables, we can use the Monte Carlo method. It consists of throwing dice on these random variables: depending on the value we get from the dice, we get the value of the random variable.</p>
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/0*uiiM1ZbZyGdJcW7N."/></div>
</div>
<p></p>
<p></p>
<p id="e5a9" class="graf graf--p graf-after--figure">In the first example, we get 4 from the first throw and 0.2 from the second throw, resulting in 4.2.<br/> <br/> In the second example, we get 6.1 from the first throw and 0.7 from the second throw, resulting in 6.7.<br/> <br/> If we do this a million times, we can estimate the probability distribution of the final random variable. Each step is very easy, easy enough to be done by a human calculator or an electronic computer. This way, we can model a very complicated system in a simple way.<br/> <br/> The Monte Carlo method is a fine way to find the variations of a process: in other words, its risk. The risk that a supply chain is understocked or overstocked. The risk that a project's costs exceed the budget. The risk that the work exceeds the timeline. And so on.<br/> <br/> <br/> Today, in a single laptop, we have processing power equivalent to millions of humans doing calculations by hand. In a simple Excel spreadsheet, we can have very complicated models. We have the power of computation in our hands.</p>
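<p>The two-throw example above can be sketched in a few lines of Python. Here the first throw is a die (an integer from 1 to 6) and the second is a continuous draw from [0, 1); these are stand-ins for whatever random variables a real model would use.</p>

```python
import random

def monte_carlo_sum(n_trials, seed=123):
    """Estimate the distribution of the sum of two random variables
    by repeated sampling: one die throw plus one uniform draw."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_trials):
        first = rng.randint(1, 6)     # first throw: a die, 1..6
        second = rng.uniform(0, 1)    # second throw: continuous 0..1
        totals.append(first + second)
    return totals

totals = monte_carlo_sum(100_000)
estimated_mean = sum(totals) / len(totals)  # close to 3.5 + 0.5 = 4.0
```

Summing these two variables analytically would require a convolution integral; the simulation sidesteps the integral entirely, and a histogram of <code>totals</code> approximates the distribution of the sum.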
<p></p>
<p></p>
<div class="aspectRatioPlaceholder is-locked"><div class="aspectRatioPlaceholder-fill"></div>
<div class="progressiveMedia js-progressiveMedia graf-image is-canvasLoaded is-imageLoaded"><img class="progressiveMedia-image js-progressiveMedia-image" src="https://cdn-images-1.medium.com/max/800/0*QsY9_7P9D3FKvGrw."/></div>
</div>
<p></p>
<p></p>
<p id="92e4" class="graf graf--p graf-after--figure">Today, the bottleneck is not the computation but the hypotheses of the model itself.<br/> The data must also have quality: as the saying goes, garbage in, garbage out.<br/> The model must be simple enough, because a complex model may have thousands of parameters, making it impossible to analyze and debug. But it also needs to be complex enough to represent reality in a reasonable way.<br/> <br/> No off-the-shelf software can do a good simulation for every possible situation. It is the <span class="markup--strong markup--p-strong">art of modelling</span>.</p>
<h3 id="c2dd" class="graf graf--h3 graf-after--p">Plutonium and Uranium</h3>
<p id="67e7" class="graf graf--p graf-after--h3 graf--trailing">The scientists of the Manhattan Project researched two atoms: uranium and plutonium. In the end they made two bombs.<br/> <br/> The Hiroshima bomb was "Little Boy", a uranium bomb.<br/> The Nagasaki bomb was "Fat Man", a plutonium bomb.<br/> <br/> Uranus is the god of the sky in Greek mythology. Pluto is the god of the underworld. Two bombs: sky and hell.</p>
<p></p>
<p></p>
<p class="graf graf--p graf-after--h3"></p>