<p><strong>Vincent Granville's Posts - AnalyticBridge</strong> (feed updated 2019-11-19; author profile: <a href="https://www.analyticbridge.datasciencecentral.com/profile/VincentGranville">Vincent Granville</a>)</p>
<p><strong>10 Visualizations Every Data Scientist Should Know</strong> (posted 2019-11-12 by Vincent Granville)</p>
<p><em>This article is by Jorge Castañón, Ph.D., Senior Data Scientist at the IBM Machine Learning Hub.</em></p>
<p id="5920" class="ni nj en ao nk b nl nm nn no np nq nr ns nt nu nv">Data visualization plays two key roles:</p>
<p id="085d" class="ni nj en ao nk b nl nm nn no np nq nr ns nt nu nv">1.<span> </span><em class="op">Communicating results clearly to a general audience.</em></p>
<p id="c440" class="ni nj en ao nk b nl nm nn no np nq nr ns nt nu nv">2.<span> </span><em class="op">Organizing a view of data that suggests a new hypothesis or a next step in a project.</em></p>
<p id="f14e" class="ni nj en ao nk b nl nm nn no np nq nr ns nt nu nv">It’s no surprise that most people prefer visuals to large tables of numbers. That’s why clearly labeled plots with meaningful interpretation always make it to the front of academic papers.</p>
<p class="ni nj en ao nk b nl nm nn no np nq nr ns nt nu nv"><a href="https://storage.ning.com/topology/rest/1.0/file/get/3709852824?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3709852824?profile=RESIZE_710x" class="align-center"/></a></p>
<p id="6028" class="ni nj en ao nk b nl nm nn no np nq nr ns nt nu nv">This post looks at the 10 visualizations you can bring to bear on your data — whether you want to convince the wider world of your theories or crack open your own project and take the next step:</p>
<ol class="">
<li id="53c6" class="ni nj en ao nk b nl nm nn no np nq nr ns nt nu nv oq or os">Histograms</li>
<li id="ddc7" class="ni nj en ao nk b nl ot nn ou np ov nr ow nt ox nv oq or os">Bar/Pie charts</li>
<li id="6fcc" class="ni nj en ao nk b nl ot nn ou np ov nr ow nt ox nv oq or os">Scatter/Line plots</li>
<li id="3613" class="ni nj en ao nk b nl ot nn ou np ov nr ow nt ox nv oq or os">Time series</li>
<li id="6263" class="ni nj en ao nk b nl ot nn ou np ov nr ow nt ox nv oq or os">Relationship maps</li>
<li id="c7df" class="ni nj en ao nk b nl ot nn ou np ov nr ow nt ox nv oq or os">Heat maps</li>
<li id="d07c" class="ni nj en ao nk b nl ot nn ou np ov nr ow nt ox nv oq or os">Geo Maps</li>
<li id="8f76" class="ni nj en ao nk b nl ot nn ou np ov nr ow nt ox nv oq or os">3-D Plots</li>
<li id="3965" class="ni nj en ao nk b nl ot nn ou np ov nr ow nt ox nv oq or os">Higher-Dimensional Plots</li>
<li id="ec17" class="ni nj en ao nk b nl ot nn ou np ov nr ow nt ox nv oq or os">Word clouds</li>
</ol>
<p>Read the full article, with descriptions and illustrations for these visualizations, <a href="https://www.datasciencecentral.com/profiles/blogs/10-visualizations-every-data-scientist-should-know" target="_blank" rel="noopener">here</a>.</p>
<p><strong>More Weird Statistical Distributions</strong> (posted 2019-10-27 by Vincent Granville)</p>
<p>Some original and very interesting material is presented here, with possible applications in Fintech. No need for a PhD in math to understand this article: I tried to make the presentation as simple as possible, focusing on high-level results rather than technicalities. Yet, professional statisticians and mathematicians, even academic researchers, will find some deep and fascinating results worth further exploring.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3681849077?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3681849077?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>Can you identify patterns in this chart? (see section 2.2. in the article for an answer)</em></p>
<p>Let's start with </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3681308901?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3681308901?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Here the<span> </span><em>X</em>(<em>k</em>)'s are independent and identically distributed random variables, each distributed as a generic random variable denoted <em>X</em>. We are trying to find the distribution of<span> </span><em>Z</em>.</p>
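<p>The defining formula for <em>Z</em> is given as an image and is not reproduced in this text. As a rough illustration only, the sketch below assumes the infinite nested radical suggested by the linked article's URL, <em>Z</em> = SQRT(<em>X</em>(1) + SQRT(<em>X</em>(2) + SQRT(<em>X</em>(3) + ...))), and simulates it by truncation. The function names, the truncation depth, and the choice of <em>X</em> (1 or 2 with equal probability) are hypothetical and not taken from the article.</p>

```python
import math
import random

def sample_z(draw_x, depth=40):
    """Approximate one draw of Z by truncating the nested radical after `depth` terms.

    Assumed form (not confirmed by the text above):
        Z = sqrt(X(1) + sqrt(X(2) + sqrt(X(3) + ...)))
    """
    z = 0.0
    # Evaluate from the innermost radical outward; each pass wraps one more X(k).
    for _ in range(depth):
        z = math.sqrt(draw_x() + z)
    return z

def draw_x_discrete(rng):
    """Hypothetical simple discrete X: 1 or 2 with equal probability."""
    return rng.choice((1, 2))

rng = random.Random(42)
samples = [sample_z(lambda: draw_x_discrete(rng)) for _ in range(10_000)]

mean_z = sum(samples) / len(samples)
print(f"min={min(samples):.4f}  max={max(samples):.4f}  mean={mean_z:.4f}")
```

<p>With this particular choice of <em>X</em>, the infinite radical always lies between (1 + SQRT(5))/2 (all terms equal to 1) and 2 (all terms equal to 2), which gives a quick sanity check on the simulated values. A histogram of <code>samples</code> is one way to look at the empirical distribution of <em>Z</em>.</p>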
<p><strong>Contents</strong></p>
<p>1. Using a Simple Discrete Distribution for <em>X</em></p>
<p>2. Towards a Better Model</p>
<ul>
<li>Approximate Solution</li>
<li>The Fractal, Brownian-like Error Term</li>
</ul>
<p>3. Finding <em>X</em> and <em>Z</em> Using Characteristic Functions</p>
<ul>
<li>Test with Log-normal Distribution for <em>X</em></li>
<li>Playing with the Characteristic Functions</li>
<li>Generalization to Continued Fractions and Nested Cubic Roots</li>
</ul>
<p>4. Exercises</p>
<p><em>Read this article <a href="https://www.datasciencecentral.com/profiles/blogs/math-fun-infinite-nested-radicals-of-random-variables" target="_blank" rel="noopener">here</a>. </em></p>
<p><strong>Complete Hands-Off Automated Machine Learning</strong> (posted 2019-10-22 by Vincent Granville)</p>
<p>By Bill Vorhies. </p>
<p><strong><em>Summary:</em></strong><em> Here’s a proposal for real ‘zero touch’, ‘set-em-and-forget-em’ machine learning from the researchers at Amazon. If you have an environment as fast-changing as e-retail and a huge number of models matching buyers and products, you could achieve real cost savings and revenue increases by making the refresh cycle faster and more accurate with automation. This capability will likely be coming soon to your favorite AML platform.</em></p>
<p><em><a href="https://storage.ning.com/topology/rest/1.0/file/get/3674974988?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3674974988?profile=RESIZE_710x" class="align-center"/></a></em></p>
<p>Is there a future in which we can really ‘set-em-and-forget-em’ machine learning? So far Automated Machine Learning (AML) is delivering on vastly simplifying the creation of models but the maintenance, refresh, and update still require manual intervention.</p>
<p>Not that we’re trying to talk ourselves out of a job. But after all, once the model is built and implemented it’s more fun to move on to the next opportunity. If the maintenance and refresh cycle could be truly automated that would be a good thing.</p>
<p>Much of the effort so far has been put into simplifying getting the model out of its AML environment and into its production environment. Facebook’s FBLearner is an example of this. A number of platforms claim to ease this process for the rest of us. At least once we manually refresh the model it’s easier to update it in production.</p>
<p><em>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/complete-hands-off-automated-machine-learning" target="_blank" rel="noopener">here</a>. </em></p>
<p><strong>40+ Modern Tutorials Covering All Aspects of Machine Learning</strong> (posted 2019-10-13 by Vincent Granville)</p>
<p><span>This list of lists contains books, notebooks, presentations, cheat sheets, and tutorials covering all aspects of data science, machine learning, deep learning, statistics, math, and more, with most documents featuring Python or R code and numerous illustrations or case studies. All this material is available for free, and consists of content mostly created in 2019 and 2018, by various top experts in their respective fields. A few of these documents are available on LinkedIn: see last section on how to download them. </span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/3660371847?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3660371847?profile=RESIZE_710x" class="align-center"/></a></span></p>
<p><span>Below are the first two sections.</span></p>
<p><strong>General References</strong></p>
<ul>
<li>Free Deep Learning Book (639 pages) by Prof. Gilles Louppe</li>
<li>Python Crash Course (562 pages) by Eric Matthes</li>
<li>Free Book: Applied Data Science (141 pages) - Columbia University</li>
<li>Data Science in Practice</li>
<li>Machine Learning 101 - By Jason Mayes, Google</li>
<li>The Ultimate guide to AI, Data Science & Machine Learning</li>
<li>Free Handbooks for Data Science Professionals</li>
<li>Free Book: Natural Language Processing with Python</li>
<li>Data Visualization Resources</li>
<li>Textbook: Probability Course - Harvard University</li>
<li>Textbook: The Math of Machine Learning - Berkeley University</li>
<li>Comprehensive Guide to Machine Learning - Berkeley University</li>
<li>Free Book: Foundations of Data Science - by Microsoft Research</li>
<li>Comprehensive Guide on Machine Learning - by J.P. Morgan</li>
<li>Gentle Approach to Linear Algebra - by Vincent Granville</li>
</ul>
<p><strong>Data Science Central Books, Booklets and References</strong></p>
<ul>
<li>Statistics: New Foundations, Toolbox, and Machine Learning Recipes</li>
<li>Deep Learning and Computer Vision with CNNs</li>
<li>Getting Started with TensorFlow 2.0</li>
<li>Classification and Regression in a Weekend</li>
<li>Online Encyclopedia of Statistical Science</li>
<li>Azure Machine Learning in a Weekend</li>
<li>Enterprise AI - An Application Perspective</li>
<li>Applied Stochastic Processes</li>
<li>Comprehensive Repository of Data Science and ML Resources</li>
<li>Foundations of ML and Data Science for Developers</li>
<li>Elegant Representation of Forward/Back Propagation in Neural Networks</li>
<li>Learning the Math of Data Science</li>
</ul>
<p>To access all these documents and more, <a href="https://www.datasciencecentral.com/profiles/blogs/40-tutorials-covering-all-aspects-of-machine-learning" target="_blank" rel="noopener">follow this link</a>.</p>
<p><strong>Surprising Uses of Synthetic Random Data Sets</strong> (posted 2019-10-02 by Vincent Granville)</p>
<p>I have used synthetic data sets many times for simulation purposes, most recently in my articles<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/six-degrees-of-separation-between-any-two-data-sets" target="_blank" rel="noopener">Six Degrees of Separation Between Any Two Data Sets</a><span> </span>and<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/how-to-lie-with-p-values" target="_blank" rel="noopener">How to Lie with p-values</a>. Many applications (including the data sets themselves) can be found in my books<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">Applied Stochastic Processes</a><span> </span>and<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/free-book-statistics-new-foundations-toolbox-and-machine-learning" target="_blank" rel="noopener">New Foundations of Statistical Science</a>. For instance, these data sets can be used to benchmark statistical hypothesis tests (with the null hypothesis known in advance to be true or false) and to assess the power of such tests or confidence intervals. In other cases, they are used to simulate clusters and test cluster detection / pattern detection algorithms, see<span> </span><a href="https://www.analyticbridge.datasciencecentral.com/profiles/blogs/how-to-detect-a-pattern-problem-and-solution" target="_blank" rel="noopener">here</a>. 
I also used such data sets to discover two new deep conjectures in number theory (see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/two-new-deep-conjectures-in-probabilistic-number-theory" target="_blank" rel="noopener">here</a>), to design new Fintech models such as<span> </span><em>bounded Brownian motions</em>, and to find new families of statistical distributions (see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/a-strange-family-of-statistical-distributions" target="_blank" rel="noopener">here</a>).</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3641314354?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3641314354?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>Goldbach's comet </em></p>
<p>In this article, I focus on peculiar random data sets to prove -- heuristically -- two of the most famous math conjectures in number theory, related to prime numbers: the Twin Prime conjecture, and the Goldbach conjecture. The methodology is at the intersection of probability theory, experimental math, and probabilistic number theory. It involves working with infinite data sets, dwarfing any data set found in any business context.</p>
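<p>For readers unfamiliar with the two conjectures, the sketch below computes the quantities they are about on actual primes: the number of ways an even number can be written as a sum of two primes (the values plotted in Goldbach's comet) and the number of twin prime pairs up to a bound. It does not reproduce the random-data-set methodology of the article; the function names and the bound <code>LIMIT</code> are illustrative choices.</p>

```python
import math

def prime_sieve(limit):
    """Return a bytearray where sieve[n] == 1 exactly when n is prime (limit >= 2)."""
    sieve = bytearray([1]) * (limit + 1)
    sieve[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(limit) + 1):
        if sieve[p]:
            # Cross out the multiples of p, starting at p*p.
            sieve[p * p :: p] = bytes(len(range(p * p, limit + 1, p)))
    return sieve

def goldbach_count(n, sieve):
    """Number of unordered prime pairs (p, q), p <= q, with p + q == n."""
    return sum(1 for p in range(2, n // 2 + 1) if sieve[p] and sieve[n - p])

def twin_prime_count(limit, sieve):
    """Number of pairs (p, p + 2), both prime, with p + 2 <= limit."""
    return sum(1 for p in range(2, limit - 1) if sieve[p] and sieve[p + 2])

LIMIT = 2_000  # illustrative bound
sieve = prime_sieve(LIMIT)

# Goldbach's comet: one point (n, count) per even n >= 4.
comet = {n: goldbach_count(n, sieve) for n in range(4, LIMIT + 1, 2)}

print("ways to write 100 as a sum of two primes:", comet[100])
print("smallest count over all even n <= LIMIT:", min(comet.values()))
print("twin prime pairs up to LIMIT:", twin_prime_count(LIMIT, sieve))
```

<p>Plotting <code>comet</code> as a scatter plot of count against <em>n</em> gives the comet shape shown in the picture. The Goldbach conjecture states that every count is at least 1; the Twin Prime conjecture states that the twin prime count grows without bound as the limit increases.</p>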
<p>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/surprising-uses-of-synthetic-random-data-sets?xg_source=activity" target="_blank" rel="noopener">here</a>. </p>
<p><strong>Six Degrees of Separation Between Any Two Data Sets</strong> (posted 2019-09-09 by Vincent Granville)</p>
<p>This is an interesting data science conjecture, inspired by the well known<span> </span><a href="https://www.bigdatanews.datasciencecentral.com/profiles/blogs/graph-theory-six-degrees-of-separation-problem" target="_blank" rel="noopener">six degrees of separation problem</a>, stating that there is a link involving no more than 6 connections between any two people on Earth, say between you and anyone living (say) in North Korea. </p>
<p>Here the link is between any two univariate data sets of the same size, say Data A and Data B. The claim is that there is a chain involving no more than 6 intermediary data sets, each highly correlated to the previous one (with a correlation above 0.8), between Data A and Data B. The concept is illustrated in the example below, where only 4 intermediary data sets (labeled Degree 1, Degree 2, Degree 3, and Degree 4) are actually needed. </p>
<p><img src="https://storage.ning.com/topology/rest/1.0/file/get/3547469050?profile=RESIZE_710x" class="align-center"/></p>
<p style="text-align: center;"><em>Correlation table for the 6 data sets</em></p>
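<p>One possible way to build such a chain is sketched below. It is not necessarily the construction used in the author's spreadsheets: it treats the centered, normalized data sets as unit vectors and rotates from Data A to Data B in equal angular steps, so that consecutive data sets all have the same correlation. The function names and the random test data are assumptions made for illustration.</p>

```python
import math
import random

def center_and_normalize(xs):
    """Subtract the mean and scale to unit length (requires non-constant data)."""
    mean = sum(xs) / len(xs)
    centered = [x - mean for x in xs]
    norm = math.sqrt(sum(c * c for c in centered))
    return [c / norm for c in centered]

def correlation(xs, ys):
    """Pearson correlation, computed as a dot product of normalized vectors."""
    a, b = center_and_normalize(xs), center_and_normalize(ys)
    return sum(p * q for p, q in zip(a, b))

def build_chain(data_a, data_b, n_intermediate=4):
    """Return [A', D1, ..., Dk, B'] where A' and B' are standardized copies of A and B."""
    a = center_and_normalize(data_a)
    b = center_and_normalize(data_b)
    cos_theta = max(-1.0, min(1.0, sum(p * q for p, q in zip(a, b))))
    theta = math.acos(cos_theta)
    # Part of b orthogonal to a (Gram-Schmidt step).
    u = [q - cos_theta * p for p, q in zip(a, b)]
    u_norm = math.sqrt(sum(x * x for x in u))
    if u_norm < 1e-12:
        raise ValueError("A and B are perfectly (anti)correlated; not handled in this sketch")
    u = [x / u_norm for x in u]
    steps = n_intermediate + 1
    chain = []
    for k in range(steps + 1):
        angle = k * theta / steps
        chain.append([math.cos(angle) * p + math.sin(angle) * q for p, q in zip(a, u)])
    return chain

rng = random.Random(1)
data_a = [rng.gauss(0, 1) for _ in range(50)]
data_b = [rng.gauss(0, 1) for _ in range(50)]

chain = build_chain(data_a, data_b, n_intermediate=4)
links = [correlation(chain[i], chain[i + 1]) for i in range(len(chain) - 1)]
print("corr(A, B):", round(correlation(data_a, data_b), 3))
print("consecutive correlations:", [round(c, 3) for c in links])
```

<p>In this construction the correlation between consecutive data sets is the cosine of the angle between A and B divided by the number of steps. Since that angle is at most 180 degrees, five steps (four intermediaries) give a consecutive correlation of at least cos(36°), about 0.809, whenever the degenerate case flagged in the code is avoided.</p>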
<p>To view the (random) data sets, understand how the chain of intermediary data sets was built, and access the spreadsheets to reproduce the results or test on different data, <a href="https://www.datasciencecentral.com/profiles/blogs/six-degrees-of-separation-between-any-two-data-sets" target="_blank" rel="noopener">follow this link</a>. <span>It makes for an interesting theoretical data science research project, for people with too much free time on their hands. </span></p>
<p><strong>Two New Deep Conjectures in Probabilistic Number Theory</strong> (posted 2019-09-08 by Vincent Granville)</p>
<p>The material discussed here is also of interest to machine learning, AI, big data, and data science practitioners, as much of the work is based on heavy data processing, algorithms, efficient coding, testing, and experimentation. Also, it's not just two new conjectures, but paths and suggestions to solve these problems. The last section contains a few new, original exercises, some with solutions, and may be useful to students, researchers, and instructors offering math and statistics classes at the college level: they range from easy to very difficult. Some great probability theorems are also discussed, in layman's terms: see section 1.2. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3546311327?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3546311327?profile=RESIZE_710x" class="align-center"/></a></p>
<p>The two deep conjectures highlighted in this article (conjectures B and C) are related to the digit distribution of well known math constants such as Pi or log 2, with an emphasis on binary digits of SQRT(2). This is an old problem, one of the most famous ones in mathematics, still unsolved today.</p>
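<p>To make the digit-distribution question concrete, here is a minimal sketch that extracts the first binary digits of SQRT(2) with exact integer arithmetic and reports the proportion of digits equal to 1. It only illustrates the kind of experiment involved; it does not implement the recursive formula or the conjectures of the article, and the function names and the number of digits are arbitrary choices.</p>

```python
import math

def sqrt2_binary_digits(n):
    """Return the first n binary digits of sqrt(2) after the binary point."""
    # floor(sqrt(2) * 2**n), computed exactly with integer arithmetic.
    scaled = math.isqrt(2 << (2 * n))
    bits = bin(scaled)[2:]  # n + 1 characters; the leading '1' is the integer part
    return [int(c) for c in bits[1:]]

def proportion_of_ones(digits):
    """Fraction of digits equal to 1."""
    return sum(digits) / len(digits)

digits = sqrt2_binary_digits(10_000)
print("first 16 digits:", "".join(map(str, digits[:16])))
print("proportion of 1s:", proportion_of_ones(digits))
```

<p>A proportion close to 0.5 on a finite sample is consistent with uniformly distributed digits, but it proves nothing about the infinite sequence, which is why the question remains open.</p>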
<p><strong>Content of this article</strong></p>
<p>A Strange Recursive Formula</p>
<ul>
<li>Conjecture A</li>
<li>A deeper result</li>
<li>Conjecture B</li>
<li>Connection to the Berry-Esseen theorem</li>
<li>Potential path to solving this problem</li>
</ul>
<p>Potential Solution Based on Special Rational Number Sequences</p>
<ul>
<li>Interesting statistical result</li>
<li>Conjecture C</li>
<li>Another curious statistical result</li>
</ul>
<p>Exercises</p>
<p><em>Read the full article <a href="https://www.datasciencecentral.com/profiles/blogs/two-new-deep-conjectures-in-probabilistic-number-theory" target="_blank" rel="noopener">here</a>. </em></p>
<p><strong>10 Machine Learning Methods that Every Data Scientist Should Know</strong> (posted 2019-08-30 by Vincent Granville)</p>
<p id="a572" class="nj nk eo ao nl b nm nn no np nq nr ns nt nu nv nw">Machine learning is a hot topic in research and industry, with new methodologies developed all the time. The speed and complexity of the field make keeping up with new techniques difficult even for experts, and potentially overwhelming for beginners.</p>
<p id="0d4d" class="nj nk eo ao nl b nm nn no np nq nr ns nt nu nv nw">To demystify machine learning and to offer a learning path for those who are new to the core concepts, let’s look at ten different methods, including simple descriptions, visualizations, and examples for each one.</p>
<p class="nj nk eo ao nl b nm nn no np nq nr ns nt nu nv nw"><a href="https://storage.ning.com/topology/rest/1.0/file/get/3487793979?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3487793979?profile=RESIZE_710x" class="align-center"/></a></p>
<p id="64a5" class="nj nk eo ao nl b nm nn no np nq nr ns nt nu nv nw">A machine learning algorithm, also called a model, is a mathematical expression that represents data in the context of a problem, often a business problem. The aim is to go from data to insight. For example, if an online retailer wants to anticipate sales for the next quarter, they might use a machine learning algorithm that predicts those sales based on past sales and other relevant data. Similarly, a windmill manufacturer might visually monitor important equipment and feed the video data through algorithms trained to identify dangerous cracks.</p>
<p id="00c2" class="nj nk eo ao nl b nm nn no np nq nr ns nt nu nv nw">The ten methods described offer an overview — and a foundation you can build on as you hone your machine learning knowledge and skill:</p>
<ol class="">
<li id="b886" class="nj nk eo ao nl b nm nn no np nq nr ns nt nu nv nw nx ny nz">Regression</li>
<li id="2763" class="nj nk eo ao nl b nm ob no oc nq od ns oe nu of nw nx ny nz">Classification</li>
<li id="54dd" class="nj nk eo ao nl b nm ob no oc nq od ns oe nu of nw nx ny nz">Clustering</li>
<li id="c007" class="nj nk eo ao nl b nm ob no oc nq od ns oe nu of nw nx ny nz">Dimensionality Reduction</li>
<li id="1af1" class="nj nk eo ao nl b nm ob no oc nq od ns oe nu of nw nx ny nz">Ensemble Methods</li>
<li id="91ed" class="nj nk eo ao nl b nm ob no oc nq od ns oe nu of nw nx ny nz">Neural Nets and Deep Learning</li>
<li id="5128" class="nj nk eo ao nl b nm ob no oc nq od ns oe nu of nw nx ny nz">Transfer Learning</li>
<li id="2251" class="nj nk eo ao nl b nm ob no oc nq od ns oe nu of nw nx ny nz">Reinforcement Learning</li>
<li id="6975" class="nj nk eo ao nl b nm ob no oc nq od ns oe nu of nw nx ny nz">Natural Language Processing</li>
<li id="429f" class="nj nk eo ao nl b nm ob no oc nq od ns oe nu of nw nx ny nz">Word Embeddings</li>
</ol>
<p><em>Read the full article, with a detailed description of each method, <a href="https://www.datasciencecentral.com/profiles/blogs/10-machine-learning-methods-that-every-data-scientist-should-know" target="_blank" rel="noopener">here</a>. </em></p>
<p><strong>A Strange Family of Statistical Distributions</strong> (posted 2019-08-30 by Vincent Granville)</p>
<p><span>I introduce here a family of very peculiar statistical distributions governed by two parameters: </span><em>p</em><span>, a real number in [0, 1], and </span><em>b</em><span>, an integer > 1. </span></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3487729021?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3487729021?profile=RESIZE_710x" class="align-center"/></a></p>
<p><span>Potential applications are found in cryptography, Fintech (stock market modeling), Bitcoin, number theory, random number generation, benchmarking statistical tests (see </span><a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">here</a><span>) and even gaming (see </span><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-foundations-for-a-new-stock-market" target="_blank" rel="noopener">here</a><span>). However, the most interesting application is probably to gain insights about what non-normal numbers look like, especially their chaotic nature. It is a fundamental tool to help solve one of the most intriguing mathematical conjectures of all time (still unsolved): are the digits of standard constants such as Pi or SQRT(2) uniformly distributed or not? For instance, when </span><em>b</em><span> = 2, any departure from </span><em>p</em><span> = 0.5 (a normal seed) results in a strong discontinuity for </span><em>f</em><span>(</span><em>x</em><span>) at </span><em>x</em><span> = 0.5. If you look at the above chart, </span><em>f</em><span>(0) = </span><em>f</em><span>(1/2) = </span><em>f</em><span>(1) regardless of </span><em>p</em><span>, but discontinuities are masking this fact. </span></p>
<p><span><a href="https://www.datasciencecentral.com/profiles/blogs/a-strange-family-of-statistical-distributions" target="_blank" rel="noopener">Read full article here</a>. </span></p>
<p><strong>Extreme Events Modeling Using Continued Fractions</strong> (posted 2019-08-30 by Vincent Granville)</p>
<p>Continued fractions are usually considered a beautiful, curious mathematical topic, but with applications mostly theoretical and limited to math and number theory. Here we show how they can be used in applied business and economics contexts, leveraging the mathematical theory developed for continued fractions, to model and explain natural phenomena. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3487696331?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3487696331?profile=RESIZE_710x" class="align-center"/></a></p>
<p>The interest in this project started when analyzing sequences such as<span> </span><em>x</em>(<em>n</em>) = {<span> </span><em>nq</em><span> </span>} =<span> </span><em>nq</em><span> </span>- INT(<em>nq</em>) where<span> </span><em>n</em><span> </span>= 1, 2, and so on, and<span> </span><em>q</em><span> </span>is an irrational number in [0, 1] called the<span> </span><em>seed</em>. The brackets denote the fractional part function. The values<span> </span><em>x</em>(<em>n</em>) are also in [0, 1] and get arbitrarily close to 0 and 1 infinitely often, and indeed arbitrarily close to any number in [0, 1] infinitely often. I became interested in what happens when the sequence gets very close to 1, and more precisely, in the distribution of the arrival times<span> </span><em>t</em>(<em>n</em>) of successive records. I was curious to compare these arrival times with those from truly random numbers, or from real-life time series such as temperature, stock market or gaming/sports data. Such arrival times are known to have an infinite expectation under stable conditions, though their medians always exist: after all, any record could be the final one, never to be surpassed again in the future. This always happens at some point with the sequence<span> </span><em>x</em>(<em>n</em>) if<span> </span><em>q</em><span> </span>is a rational number -- thus our focus on irrational seeds: they yield successive records that keep growing over and over, without end, although the gaps between successive records eventually grow very large, in a chaotic, unpredictable way, just like records in traditional time series.</p>
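<p>The record arrival times described above can be explored with a short sketch like the one below. The seed SQRT(2) - 1, the rational seed 3/8, the sample size and the function names are illustrative assumptions. Note also that a floating-point seed is only an approximation of an irrational number, so results for very large <em>n</em> should be treated with care.</p>

```python
import math
import random

def record_times(values):
    """Return the 1-based indices n at which values[n-1] exceeds all earlier values."""
    times, best = [], float("-inf")
    for n, x in enumerate(values, start=1):
        if x > best:
            best = x
            times.append(n)
    return times

def fractional_sequence(q, n_max):
    """x(n) = {nq} = nq - INT(nq) for n = 1, ..., n_max (floating-point approximation)."""
    return [(n * q) % 1.0 for n in range(1, n_max + 1)]

N = 100_000
irrational_seed = math.sqrt(2) - 1  # hypothetical choice of seed
rational_seed = 3 / 8               # exactly representable; the sequence has period 8

print("irrational seed:", record_times(fractional_sequence(irrational_seed, N)))
print("rational seed:  ", record_times(fractional_sequence(rational_seed, N)))

# Comparison with independent uniform random numbers.
rng = random.Random(0)
print("uniform random: ", record_times([rng.random() for _ in range(N)]))
```

<p>With the rational seed, the list of record times stops after a few terms because the sequence is periodic, which is the "final record" situation mentioned above. With the irrational seed, new records keep appearing, with increasingly large gaps between them.</p>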
<p><a href="https://www.datasciencecentral.com/profiles/blogs/extreme-events-modeling-using-continued-fractions" target="_blank" rel="noopener">Read the full article here</a>.</p>
<p><strong>Content</strong>:</p>
<ul>
<li>Theoretical background (simplified)</li>
<li>Generalization and potential applications to real life problems</li>
<li>Original applications in music and probabilistic number theory</li>
</ul>
<p><strong>Comparing Model Evaluation Techniques</strong> (posted 2019-08-08 by Vincent Granville)</p>
<p>In my previous posts, I compared model evaluation techniques using Statistical Tools &amp; Tests and commonly used Classification and Clustering evaluation techniques.</p>
<p>In this post, I'll take a look at how you can compare regression models. Comparing regression models is perhaps one of the trickiest tasks in the "comparing models" arena, because there are literally dozens of statistics you can calculate to compare them, including:</p>
<p><strong>1. Error measures in the estimation period (in-sample testing) or validation period (out-of-sample testing):</strong></p>
<ul>
<li>Mean Absolute Error (MAE),</li>
<li>Mean Absolute Percentage Error (MAPE),</li>
<li>Mean Error,</li>
<li>Root Mean Squared Error (RMSE).</li>
</ul>
<p><br/><strong>2. Tests on Residuals and Goodness-of-Fit:</strong></p>
<ul>
<li>Plots: actual vs. predicted value; cross correlation; residual autocorrelation; residuals vs. time/predicted values,</li>
<li>Changes in mean or variance,</li>
<li>Tests: normally distributed errors; excessive runs (e.g. of positives or negatives); outliers, extreme values, or influential observations.</li>
</ul>
<p>This list isn't exhaustive--there are many other tools, tests and plots at your disposal. Rather than discuss the statistics in detail, I chose to focus this post on comparing a few of the most popular regression model evaluation techniques and discuss when you might want to use them (or when you might not want to). The techniques listed below tend to be on the "easier to use and understand" end of the spectrum, so if you're new to model comparison it's a good place to start.</p>
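<p>The error measures listed above can be computed in a few lines. The sketch below is only an illustration: the function name regression_errors is hypothetical, the error is taken as actual minus predicted (one common convention), and MAPE assumes no actual value is zero.</p>

```python
import math


def regression_errors(actual, predicted):
    """Compute MAE, MAPE (in percent), mean error and RMSE for paired values."""
    n = len(actual)
    # Convention assumed here: error = actual - predicted.
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    # MAPE is undefined when an actual value equals zero.
    mape = 100.0 * sum(abs(e / a) for e, a in zip(errors, actual)) / n
    mean_error = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return {"MAE": mae, "MAPE": mape, "ME": mean_error, "RMSE": rmse}


# Made-up numbers, used only to exercise the function.
print(regression_errors([100.0, 200.0, 400.0], [110.0, 190.0, 400.0]))
```

<p>Running the same function on the estimation period and on the validation period gives the in-sample and out-of-sample versions of each measure.</p>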
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3414342046?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3414342046?profile=RESIZE_710x" class="align-center"/></a></p>
<p>The above picture (comparing models) was originally posted <a href="https://www.datasciencecentral.com/profiles/blogs/model-evaluation-techniques-in-one-picture" target="_blank" rel="noopener">here</a>. </p>
<p><em>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/comparing-model-evaluation-techniques-part-3-regression-models" target="_blank" rel="noopener">here</a>. </em></p>Elegant Representation of Forward and Back Propagation in Neural Networkstag:www.analyticbridge.datasciencecentral.com,2019-08-08:2004291:BlogPost:3934122019-08-08T16:29:52.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Sometimes, you see a diagram and it gives you an ‘aha’ moment. Here is one representing forward propagation and back propagation in a neural network:<br/><a href="https://storage.ning.com/topology/rest/1.0/file/get/3388408048?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3388408048?profile=RESIZE_710x" class="align-center"/></a></p>
<p>A brief explanation is:</p>
<ul>
<li>Using the input variables x and y, the forward pass (left half of the figure) calculates the output z as a function of x and y, i.e. z = f(x,y).</li>
<li>The right side of the figure shows the backward pass.</li>
<li>Receiving dL/dz (the derivative of the total loss with respect to the output z), we can calculate the gradients of the loss with respect to x and y by applying the chain rule, as shown in the figure.</li>
</ul>
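<p>The forward and backward passes described above can be sketched for a single node. The choice f(x, y) = x * y is an assumption made purely for illustration (the figure uses a generic f), and the function names are hypothetical.</p>

```python
def forward(x, y):
    """Forward pass: compute the output z = f(x, y), here f(x, y) = x * y."""
    return x * y


def backward(x, y, dL_dz):
    """Backward pass: apply the chain rule to get dL/dx and dL/dy from dL/dz."""
    dz_dx = y  # local gradient of z = x * y with respect to x
    dz_dy = x  # local gradient of z = x * y with respect to y
    return dL_dz * dz_dx, dL_dz * dz_dy


z = forward(3.0, -2.0)
dL_dx, dL_dy = backward(3.0, -2.0, dL_dz=0.5)
print(z, dL_dx, dL_dy)
```

<p>Each node only needs its own local gradients and the upstream value dL/dz, which is why the same pattern scales to a full network.</p>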
<p>I give a more detailed explanation in the full article.</p>
<p><em>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/an-elegant-way-to-represent-forward-propagation-and-back" target="_blank" rel="noopener">here</a>. </em></p>Decision Tree vs Random Forest vs Gradient Boosting Machinestag:www.analyticbridge.datasciencecentral.com,2019-08-08:2004291:BlogPost:3934102019-08-08T16:25:09.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Decision Trees, Random Forests and Boosting are among the top 16 data science and machine learning tools used by data scientists. The three methods are similar, with a significant amount of overlap. In a nutshell:</p>
<ul>
<li>A decision tree is a simple decision-making diagram.</li>
<li>Random forests are a large number of trees, combined (using averages or "majority rules") at the end of the process.</li>
<li>Gradient boosting machines also combine decision trees, but start the combining process at the beginning, instead of at the end.</li>
</ul>
<p><strong>Decision Trees and Their Problems</strong></p>
<p>Decision trees are a series of sequential steps designed to answer a question and provide probabilities, costs, or other consequences of making a particular decision.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3414325027?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3414325027?profile=RESIZE_710x" class="align-center"/></a></p>
<p>They are simple to understand, providing a clear visual to guide the decision-making process. However, this simplicity comes with a few serious disadvantages, including overfitting, error due to bias and error due to variance.</p>
<ul>
<li>Overfitting happens for many reasons, including the presence of noise and a lack of representative instances. A single large (deep) tree can easily overfit.</li>
<li>Bias error happens when you place too many restrictions on target functions. For example, constraining your result to a restrictive functional form (e.g. a linear equation) or to a simple binary algorithm (like the true/false choices in the above tree) will often result in bias.</li>
<li>Variance error refers to how much a result will change based on changes to the training set. Decision trees have high variance, which means that tiny changes in the training data have the potential to cause large changes in the final result.</li>
</ul>
<p><strong>Random Forest vs Decision Trees</strong></p>
<p>As noted above, decision trees are fraught with problems. A tree generated from 99 data points might differ significantly from a tree generated with just one different data point. If you could generate a very large number of trees and average their solutions, you would likely get an answer very close to the true answer.</p>
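<p>The idea of averaging many trees can be sketched with bootstrap aggregation (bagging) of one-split regression trees. This is a simplified stand-in and not a full random forest: it uses a single feature and omits random feature selection. The names fit_stump and fit_forest and the toy data are hypothetical.</p>

```python
import random
from statistics import mean


def fit_stump(points):
    """Fit a one-split regression tree (a 'stump') to (x, y) pairs."""
    pts = sorted(points)
    best = None
    for i in range(1, len(pts)):
        if pts[i][0] == pts[i - 1][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pts[:i]]
        right = [y for _, y in pts[i:]]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pts[i - 1][0] + pts[i][0]) / 2, lm, rm)
    if best is None:  # all x values identical: predict the overall mean
        overall = mean(y for _, y in pts)
        return lambda x: overall
    _, threshold, lm, rm = best
    return lambda x: lm if x < threshold else rm


def fit_forest(points, n_trees, rng):
    """Train each stump on a bootstrap resample and average the predictions."""
    trees = [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_trees)]
    return lambda x: mean(tree(x) for tree in trees)


# Toy step-shaped data, made up for illustration.
data = [(x, 0 if x < 5 else 1) for x in range(10)]
forest = fit_forest(data, n_trees=25, rng=random.Random(0))
print(forest(2), forest(8))
```

<p>Because every tree sees a slightly different resample, the averaged prediction depends less on any single data point than an individual tree does.</p>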
<p><em>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/decision-tree-vs-random-forest-vs-boosted-trees-explained" target="_blank" rel="noopener">here</a>. </em></p>How the Mathematics of Fractals Can Help Predict Stock Markets Shiftstag:www.analyticbridge.datasciencecentral.com,2019-07-08:2004291:BlogPost:3930292019-07-08T16:25:57.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>In financial markets, two of the most common trading strategies used by investors are the momentum and mean reversion strategies. If a stock exhibits momentum (or trending behavior as shown in the figure below), its price in the current period is more likely to increase (decrease) if it has already increased (decreased) in the previous period.</p>
<p>When the return of a stock at time t depends in some way on the return at the previous time t-1, the returns are said to be autocorrelated. In the momentum regime, returns are positively correlated.</p>
<p>In contrast, the price of a mean-reverting stock fluctuates randomly around its historical mean and displays a tendency to revert to it. When there is mean reversion, if the price increased (decreased) in the current period, it is more likely to decrease (increase) in the next one.</p>
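<p>One simple way to tell the two regimes apart is the lag-1 autocorrelation of returns: positive values suggest momentum, negative values suggest mean reversion. The sketch below uses a hypothetical function name and made-up return series, and is not the fractal-based method of the full article.</p>

```python
def lag1_autocorrelation(returns):
    """Sample autocorrelation between each return and the previous one."""
    n = len(returns)
    m = sum(returns) / n
    num = sum((returns[t] - m) * (returns[t - 1] - m) for t in range(1, n))
    den = sum((r - m) ** 2 for r in returns)
    return num / den


# Made-up series for illustration only.
trending = [1, 1, 1, -1, -1, -1]    # moves tend to persist
reverting = [1, -1, 1, -1, 1, -1]   # moves tend to reverse
print(lag1_autocorrelation(trending))
print(lag1_autocorrelation(reverting))
```

<p>On real data the estimate is noisy, so a value near zero is consistent with a random walk rather than evidence of either regime.</p>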
<p>A section of the time series of log returns of the Apple stock (adjusted closing price), shown below, is an example of mean-reverting behavior.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3211474393?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3211474393?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Note that, since the two regimes occur in different time frames (trending behavior usually occurs in larger timescales), they can, and often do, coexist.</p>
<p>In both regimes, the current price contains useful information about the future price. In fact, trading strategies can only generate profit if asset prices are either trending or mean-reverting since, otherwise, prices are following what is known as a random walk (see the animation below).</p>
<p>Read full (long) article <a href="https://www.datasciencecentral.com/profiles/blogs/how-the-mathematics-of-fractals-can-help-predict-stock-markets" target="_blank" rel="noopener">here</a>. <span>For free books about machine learning and data science, </span><a href="https://www.datasciencecentral.com/profiles/blogs/new-books-and-resources-for-dsc-members" target="_blank" rel="noopener">follow this link</a><span>. </span></p>Where’s the Love – Trends in Data Science Career Opportunitiestag:www.analyticbridge.datasciencecentral.com,2019-07-08:2004291:BlogPost:3933392019-07-08T16:18:23.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong><em>Summary:</em></strong><span> </span><em> The annual Burtch Works salary survey tells us a lot about which industries are using the most data scientists and the difference between higher and lower skilled data scientists. Salary increases show us whether demand is increasing, and finally we take a shot at determining which skills are most in demand.</em></p>
<p>What a difference a few years can make. We used to say that everyone loves a data scientist – and wants to be one. That’s still true. But as data science has increasingly been adopted by businesses at all levels, industries, and geographies, the nature of the opportunities available to data scientists has also changed.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3211466047?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3211466047?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Yes, it’s still one of the most interesting and rewarding career choices you can make. I wouldn’t trade it for anything. Where else can you create value out of previously unvalued data while basically predicting the future? Of course, I’m talking about what customers will do, what prices or values will be, or whether something is abnormal: all the things we’re involved with on a day-to-day basis.</p>
<p>Read the full article <a href="https://www.datasciencecentral.com/profiles/blogs/where-s-the-love-trends-in-data-science-career-opportunities" target="_blank" rel="noopener">here</a>. For free books about machine learning and data science, <a href="https://www.datasciencecentral.com/profiles/blogs/new-books-and-resources-for-dsc-members" target="_blank" rel="noopener">follow this link</a>. </p>How to learn the maths of Data Science using your high school maths knowledgetag:www.analyticbridge.datasciencecentral.com,2019-06-27:2004291:BlogPost:3931012019-06-27T18:22:15.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>By Ajit Jaokar. This post is a part of my forthcoming book on Mathematical foundations of Data Science. In this post, we use the Perceptron algorithm to bridge the gap between high school maths and deep learning. </p>
<p><strong>Background</strong></p>
<p>As part of my role as course director of the Artificial Intelligence: Cloud and Edge Computing at the University of Oxford, I see more students who are familiar with programming than with mathematics.</p>
<p>They last learnt maths years ago at university. Then, suddenly, they encounter matrices, linear algebra, etc. when they start learning data science.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/3138240717?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/3138240717?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Ideas they thought they would not face again after college! Worse still, in many cases, they do not know where precisely these concepts apply to data science.</p>
<p>If you consider the maths foundations needed to learn data science, you could divide them into four key areas:</p>
<ul>
<li>Linear Algebra</li>
<li>Probability Theory and Statistics</li>
<li>Multivariate Calculus</li>
<li>Optimization</li>
</ul>
<p>All of these are taught (at least partially) in high schools (14 to 17 years of age). In this book, we start with these ideas and relate them to data science and AI.</p>
<p>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/how-to-learn-the-maths-of-data-science-using-your-high-school" target="_blank" rel="noopener">here</a>. </p>Machine Learning and Data Science Cheat Sheettag:www.analyticbridge.datasciencecentral.com,2019-06-07:2004291:BlogPost:3931312019-06-07T02:27:48.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Originally published in 2014 and viewed more than 200,000 times, this is the oldest data science cheat sheet - the mother of all the numerous cheat sheets that are so popular nowadays. I decided to update it in June 2019. While the first half, dealing with installing components on your laptop and learning UNIX, regular expressions, and file management hasn't changed much, the second half, dealing with machine learning, was rewritten entirely from scratch. It is amazing how things changed in just five years!</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2802101885?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2802101885?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Written for people who have never seen a computer in their life, it starts with the very beginning: buying a laptop! You can skip the first half and jump to sections 5 and 6 if you are already familiar with UNIX. This new cheat sheet will be included in my upcoming book<span> </span><em>Machine Learning: Foundations, Toolbox, and Recipes</em><span> </span>to be published in September 2019, and available (for free) to Data Science Central members exclusively. This cheat sheet is 14 pages long.</p>
<p><strong>Content</strong></p>
<p>1. Hardware</p>
<p>2. Linux environment on Windows laptop</p>
<p>3. Basic UNIX commands</p>
<p>4. Scripting languages</p>
<p>5. Python, R, Hadoop, SQL, DataViz</p>
<p>6. Machine Learning</p>
<ul>
<li>Algorithms</li>
<li>Getting started</li>
<li>Applications</li>
<li>Data sets and sample projects</li>
</ul>
<p>This new cheat sheet is available <a href="https://www.datasciencecentral.com/profiles/blogs/data-science-cheat-sheet" target="_blank" rel="noopener">here</a>. </p>7 Simple Tricks to Handle Complex Machine Learning Issuestag:www.analyticbridge.datasciencecentral.com,2019-06-04:2004291:BlogPost:3925262019-06-04T18:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>We propose simple solutions to important problems that all data scientists face almost every day. In short, a toolbox for the handyman, useful to busy professionals in any field.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2760849159?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2760849159?profile=RESIZE_710x" class="align-center"/></a></p>
<p><strong>1. Eliminating sample size effects</strong>. <span>Many statistics, such as correlations or R-squared, depend on the sample size, making it difficult to compare values computed on two data sets of different sizes. This easy trick, based on re-sampling techniques, lets you compare apples with other apples, not with oranges. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-normalize-correlations-r-squared-and-so-on" target="_blank" rel="noopener">here</a>. </span></p>
<p><span><strong>2. Sample size determination, and simple, model-free confidence intervals</strong>. We propose a generic methodology, also based on re-sampling techniques, to compute any confidence interval and for testing hypotheses, without using any statistical theory. Also, it is easy to implement, even in Excel. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/modern-re-sampling-and-statistical-recipes" target="_blank" rel="noopener">here</a>. </span></p>
<p><span><strong>3. Determining the number of clusters in non-supervised clustering</strong>. This modern version of the elbow rule also tells you how strong the global optimum is, and can help you identify local optima too. It can also be automated. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/how-to-automatically-determine-the-number-of-clusters-in-your-dat" target="_blank" rel="noopener">here</a>. </span></p>
<p><span><strong>4. Fixing issues in regression models when the assumptions are violated</strong>. If your data has serial correlation, unequal variances and other similar problems, this simple trick removes the issue and allows you to perform more meaningful regressions, or to detect flaws in your data set. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-remove-serial-correlation-in-regression-models" target="_blank" rel="noopener">here</a>. </span></p>
<p><strong>5. Performing joins on poor quality data</strong>. This 40-year-old trick allows you to perform a join when your data is infested with typos, multiple names representing the same entity, and other similar issues. In short, it performs a fuzzy join. Read more <a href="https://www.datasciencecentral.com/forum/topics/40-year-old-trick-to-clean-data-efficiently" target="_blank" rel="noopener">here</a>. </p>
<p><strong>6. Scale invariant techniques</strong>. Sometimes, transforming your data, even changing the scale of one feature, say from meters to feet, has a dramatic impact on the results. Sometimes, you want your conclusions to be scale-independent. This trick solves this problem. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/scale-invariant-clustering-and-regression" target="_blank" rel="noopener">here</a>. </p>
<p><strong>7. Blending data sets with incompatible data, adding consistency to your metrics</strong>. We are all too familiar with metrics that change over time and result in inconsistencies when comparing the past to the present, or when comparing different segments with incompatible measurements. This trick will allow you to design systems where again, apples are compared to other apples, not to oranges. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/how-to-stabilize-data-to-avoid-decay-in-model-performance" target="_blank" rel="noopener">here</a>.</p>
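<p>To make the fuzzy join of trick 5 concrete, here is a minimal sketch using approximate string matching from Python's standard difflib module. This is a stand-in technique, not necessarily the trick described in the linked post; the function name fuzzy_join, the cutoff of 0.8, and the sample keys are all assumptions.</p>

```python
import difflib


def fuzzy_join(left_keys, right_rows, cutoff=0.8):
    """Match each left key to the most similar right key, or None if too different."""
    # Compare in lower case, but remember the original spelling of each right key.
    lookup = {key.lower(): key for key in right_rows}
    joined = {}
    for key in left_keys:
        matches = difflib.get_close_matches(key.lower(), list(lookup), n=1, cutoff=cutoff)
        joined[key] = right_rows[lookup[matches[0]]] if matches else None
    return joined


# Made-up keys with typos, for illustration only.
right = {"Microsoft": 1, "Google": 2}
print(fuzzy_join(["Micorsoft", "Gooogle", "Oracle"], right))
```

<p>The cutoff trades false matches against missed matches, so it should be tuned on a sample of known pairs before being trusted on a full data set.</p>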
<p><em>To not miss this type of content in the future,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter">subscribe</a><span> </span>to our newsletter. For related articles from the same author, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">click here</a><span> </span>or visit<span> </span><a href="http://www.vincentgranville.com/" target="_blank" rel="noopener">www.VincentGranville.com</a>. Follow me<span> </span><a href="https://www.linkedin.com/in/vincentg/" target="_blank" rel="noopener">on LinkedIn</a>, or visit my old web page<span> </span><a href="http://www.datashaping.com">here</a>.</em></p>
<p></p>Gentle Approach to Linear Algebra, with Machine Learning Applicationstag:www.analyticbridge.datasciencecentral.com,2019-05-29:2004291:BlogPost:3925052019-05-29T03:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><span>This simple introduction to matrix theory offers a refreshing perspective on the subject. Using a basic concept that leads to a simple formula for the power of a matrix, we see how it can solve time series, Markov chains, linear regression, data reduction, principal components analysis (PCA) and other machine learning problems. These problems are usually solved with more advanced matrix calculus, including eigenvalues, diagonalization, generalized inverse matrices, and other types of matrix normalization. Our approach is more intuitive and thus appealing to professionals who do not have a strong mathematical background, or who have forgotten what they learned in math textbooks. It will also appeal to physicists and engineers. Finally, it leads to simple algorithms, for instance for matrix inversion. The classical statistician or data scientist will find our approach somewhat intriguing. </span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/2716936013?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2716936013?profile=RESIZE_710x" class="align-center"/></a></span></p>
<p><strong>Content</strong></p>
<p>1. Power of a matrix</p>
<p>2. Examples, Generalization, and Matrix Inversion</p>
<ul>
<li>Example with a non-invertible matrix</li>
<li>Fast computations</li>
</ul>
<p>3. Application to Machine Learning Problems</p>
<ul>
<li>Markov chains</li>
<li>Time series</li>
<li>Linear regression</li>
</ul>
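<p>As an illustration of why matrix powers matter for Markov chains, the sketch below raises a transition matrix to a high power with standard repeated squaring. This is the textbook method, not the formula developed in the article, and the 2x2 transition matrix is made up.</p>

```python
def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def mat_pow(m, n):
    """Compute the n-th power of a square matrix by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = m
    while n > 0:
        if n & 1:  # this bit of the exponent is set: fold the current base in
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical two-state transition matrix (each row sums to 1).
P = [[0.9, 0.1],
     [0.5, 0.5]]
for row in mat_pow(P, 50):
    print(row)
```

<p>For this matrix both rows of the 50th power are close to the stationary distribution (5/6, 1/6), which is the long-run share of time the chain spends in each state.</p>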
<p><span><a href="https://www.datasciencecentral.com/profiles/blogs/new-approach-to-linear-algebra-in-machine-learning" target="_blank" rel="noopener">Read the full article</a>. </span></p>New Book: Classification and Regression In a Weekend (in Python)tag:www.analyticbridge.datasciencecentral.com,2019-05-17:2004291:BlogPost:3927002019-05-17T00:24:08.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>We have added a new free book in our selection exclusively for DSC members. See the first entry below, to get started with machine learning with Python.</p>
<p><strong>1. Book: Classification and Regression In a Weekend</strong></p>
<p>This tutorial began as a series of weekend workshops created by Ajit Jaokar and Dan Howarth. The idea was to work with a specific (longish) program such that we explore as much of it as possible in one weekend. This book is an attempt to take this idea online. The best way to use this book is to work with the Python code as much as you can. The code has comments. But you can extend the comments by the concepts explained here.</p>
<p>The table of contents is available<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/free-book-classification-and-regression-in-a-weekend" target="_blank" rel="noopener">here</a>. The book can be accessed<span> </span><a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a><span> </span>(members only.)</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2626374029?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2626374029?profile=RESIZE_710x" class="align-center"/></a></p>
<p><strong>2. Book: Enterprise AI - An Application Perspective</strong> </p>
<p>Enterprise AI: An applications perspective takes a use case driven approach to understand the deployment of AI in the Enterprise. Designed for strategists and developers, the book provides a practical and straightforward roadmap based on application use cases for AI in Enterprises. The authors (Ajit Jaokar and Cheuk Ting Ho) are data scientists and AI researchers who have deployed AI applications for Enterprise domains. The book is used as a reference for Ajit and Cheuk's new course on Implementing Enterprise AI.</p>
<p>The table of contents is available<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/free-ebook-enterprise-ai-an-applications-perspective" target="_blank" rel="noopener">here</a>. The book can be accessed<span> </span><a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a><span> </span>(members only.)</p>
<p><strong>3. Book: Applied Stochastic Processes</strong></p>
<p>Full title:<span> </span><em>Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems</em>. Published June 2, 2018. Author: Vincent Granville, PhD. (104 pages, 16 chapters.)</p>
<p>This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.</p>
<p>New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without using jargon or arcane theory. It unifies topics that are usually part of different fields (data science, operations research, dynamical systems, computer science, number theory, probability) broadening the knowledge and interest of the reader in ways that are not found in any other book. This short book contains a large amount of condensed material that would typically be covered in 500 pages in traditional publications. Thanks to cross-references and redundancy, the chapters can be read independently, in random order.</p>
<p>The table of contents is available<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">here</a>. The book (PDF) can be accessed<span> </span><a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a><span> </span>(members only.) </p>
<p>We propose a simple model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations available in your data set. In addition, we propose a mechanism to sharpen the confidence intervals, reducing their width by an order of magnitude. The methodology works with any estimator (mean, median, variance, quantile, correlation and so on) even when the data set violates the classical requirements necessary to make traditional statistical techniques work. In particular, our method also applies to observations that are auto-correlated, non-identically distributed, non-normal, and even non-stationary. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2383098025?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2383098025?profile=RESIZE_710x" class="align-center"/></a></p>
<p>No statistical knowledge is required to understand, implement, and test our algorithm, nor to interpret the results. Its robustness makes it suitable for black-box, automated machine learning technology. It will appeal to anyone dealing with data on a regular basis, such as data scientists, statisticians, software engineers, economists, quants, physicists, biologists, psychologists, system and business analysts, and industrial engineers. </p>
<p>In particular, we provide a confidence interval (CI) for the width of confidence intervals without using Bayesian statistics. The width is modeled as<span> </span><em>L</em><span> </span>=<span> </span><em>A</em><span> </span>/<span> </span><em>n^B</em> and we compute, using Excel alone, a 95% CI for<span> </span><em>B</em><span> </span>in the classic case where<span> </span><em>B</em><span> </span>= 1/2. We also exhibit an artificial data set where<span> </span><em>L</em><span> </span>= 1 / (log<span> </span><em>n</em>)^Pi. Here<span> </span><em>n</em><span> </span>is the sample size.</p>
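<p>As a rough illustration of the model <em>L</em> = <em>A</em> / <em>n^B</em>, the sketch below measures the width of an empirical 95% interval for the mean at several sample sizes and fits <em>B</em> by least squares on a log-log scale. It is not the Excel procedure from the full article: the synthetic uniform data, the function names <code>ci_width</code> and <code>fit_power_law</code>, and the number of replicates are all assumptions made for the example.</p>

```python
import math
import random

def ci_width(data, n, replicates=400, level=0.95, rng=random):
    """Width of an empirical CI for the mean, built from sub-samples of size n."""
    means = sorted(sum(rng.choices(data, k=n)) / n for _ in range(replicates))
    lo = means[int((1 - level) / 2 * replicates)]
    hi = means[int((1 + level) / 2 * replicates) - 1]
    return hi - lo

def fit_power_law(sizes, widths):
    """Least-squares fit of log L = log A - B log n; returns (A, B)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(w) for w in widths]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return math.exp(my - slope * mx), -slope

rng = random.Random(42)                       # fixed seed so the run is repeatable
data = [rng.random() for _ in range(5000)]    # synthetic, independent observations
sizes = [50, 100, 200, 400, 800]
widths = [ci_width(data, n, rng=rng) for n in sizes]
A, B = fit_power_law(sizes, widths)
print(f"A = {A:.3f}, B = {B:.3f}")
```

<p>With independent observations such as these, the fitted <em>B</em> should land near the classic value 1/2; data with strong dependence would be expected to give a different exponent.</p>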
<p><span>Despite the apparent simplicity of our approach, we are dealing here with martingales. But you don't need to know what a martingale is to understand the concepts and use our methodology. </span></p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/confidence-intervals-without-pain" target="_blank" rel="noopener">Read the full article here</a>.</p>Re-sampling: Amazing Results and Applicationstag:www.analyticbridge.datasciencecentral.com,2019-05-04:2004291:BlogPost:3925562019-05-04T18:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>This crash course features a new fundamental statistics theorem -- even more important than the central limit theorem -- and a new set of statistical rules and recipes. We discuss concepts related to determining the optimum sample size, the optimum<span> </span><em>k</em><span> </span>in<span> </span><em>k</em>-fold cross-validation, bootstrapping, new re-sampling techniques, simulations, tests of hypotheses, confidence intervals, and statistical inference using a unified, robust, simple approach with easy formulas, efficient algorithms and illustration on complex data.</p>
<p>Little statistical knowledge is required to understand and apply the methodology described here, yet it is more advanced, more general, and more applied than standard literature on the subject. The intended audience is beginners as well as professionals in any field faced with data challenges on a daily basis. This article presents statistical science in a different light, hopefully in a style more accessible, intuitive, and exciting than standard textbooks, and in a compact format yet covering a large chunk of the traditional statistical curriculum and beyond.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2301106250?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2301106250?profile=RESIZE_710x" class="align-center"/></a></p>
<p>In particular, the concept of<span> </span><em>p</em>-value is not explicitly included in this tutorial. Instead, following the new trend after the recent <em>p</em>-value debacle (addressed<span> </span>by the president of the American Statistical Association), it is replaced with a range of values computed on multiple sub-samples. </p>
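<p>A minimal sketch of what "a range of values computed on multiple sub-samples" can look like in practice is given below. The splitting scheme (ten disjoint, consecutive blocks), the helper name <code>subsample_range</code>, and the simulated data are assumptions for illustration, not the algorithm from the full article.</p>

```python
import random
import statistics

def subsample_range(data, estimator, parts=10):
    """Apply `estimator` to `parts` disjoint sub-samples; return (min, max, values)."""
    size = len(data) // parts
    values = [estimator(data[i * size:(i + 1) * size]) for i in range(parts)]
    return min(values), max(values), values

rng = random.Random(7)                               # fixed seed for repeatability
data = [rng.gauss(0.0, 1.0) for _ in range(2000)]    # simulated observations
low, high, values = subsample_range(data, statistics.median)
print(f"median ranges from {low:.3f} to {high:.3f} across {len(values)} sub-samples")
```

<p>Reporting the interval from <code>low</code> to <code>high</code> shows directly how much the estimate moves from one sub-sample to the next.</p>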
<p>Our algorithms are suitable for inclusion in black-box systems, batch processing, and automated data science. Our technology is data-driven and model-free. Finally, our approach to this problem shows the contrast between the data science unified, bottom-up, and computationally-driven perspective, and the traditional top-down statistical analysis consisting of a collection of disparate results that emphasizes the theory. </p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/modern-re-sampling-and-statistical-recipes" target="_blank" rel="noopener">Read the full article here</a>.</p>
<p><span><strong>Contents</strong></span></p>
<p><span>1. Re-sampling and Statistical Inference</span></p>
<ul>
<li><span>Main Result</span></li>
<li><span>Sampling with or without Replacement</span></li>
<li><span>Illustration</span></li>
<li><span>Optimum Sample Size </span></li>
<li><span>Optimum <em>K</em> in <em>K</em>-fold Cross-Validation</span></li>
<li><span>Confidence Intervals, Tests of Hypotheses</span></li>
</ul>
<p><span>2. Generic, All-purpose Algorithm</span></p>
<ul>
<li><span>Re-sampling Algorithm with Source Code</span></li>
<li><span>Alternative Algorithm</span></li>
<li><span>Using a Good Random Number Generator</span></li>
</ul>
<p><span>3. Applications</span></p>
<ul>
<li><span>A Challenging Data Set</span></li>
<li><span>Results and Excel Spreadsheet</span></li>
<li><span>A New Fundamental Statistics Theorem</span></li>
<li><span>Some Statistical Magic</span></li>
<li><span>How does this work?</span></li>
<li><span>Does this contradict entropy principles?</span></li>
</ul>
<p><span>4. Conclusions</span></p>Some Fun with Gentle Chaos, the Golden Ratio, and Stochastic Number Theorytag:www.analyticbridge.datasciencecentral.com,2019-04-25:2004291:BlogPost:3923832019-04-25T13:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>So many fascinating and deep results have been written about the number (1 + SQRT(5)) / 2 and its related sequence - the Fibonacci numbers - that it would take years to read all of them. This number has been studied both for its applications (population growth, architecture) and its mathematical properties, for over 2,000 years. It is still a topic of active research.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2197458362?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2197458362?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>Lag-1 auto-correlation in digit distribution of good seeds, for b-processes</em></p>
<p>I show here how I used the golden ratio for a new number guessing game (to generate chaos and randomness in ergodic time series) as well as new intriguing results, in particular:</p>
<ul>
<li>Proof that the<span> </span><a href="http://mathworld.wolfram.com/RabbitConstant.html" target="_blank" rel="noopener">rabbit constant</a><span> </span>is not normal in any base; this might be the first instance of a non-artificial mathematical constant for which the normality status is formally established.</li>
<li>Beatty sequences, pseudo-periodicity, and infinite-range auto-correlations for the digits of irrational numbers in the numeration system derived from perfect stochastic processes</li>
<li>Properties of multivariate<span> </span><em>b</em>-processes, including integer or non-integer bases.</li>
<li>Weird behavior of auto-correlations for the digits of normal numbers (good seeds) in the numeration system derived from stochastic<span> </span><em>b</em>-processes</li>
<li>A strange recursion that generates all the digits of the rabbit constant</li>
</ul>
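<p>For readers who want to experiment with the rabbit constant, the sketch below generates its binary digits with the standard substitution rule (1 becomes 10, 0 becomes 1) and sums them into a decimal approximation. This is the textbook construction, not necessarily the "strange recursion" discussed in the full article; the function names are chosen for the example.</p>

```python
def rabbit_bits(n_bits):
    """First n_bits binary digits of the rabbit constant (substitution 1 -> 10, 0 -> 1)."""
    word = "1"
    while len(word) < n_bits:
        # Each pass rewrites every digit; earlier passes are prefixes of later ones.
        word = "".join("10" if c == "1" else "1" for c in word)
    return word[:n_bits]

def rabbit_constant(n_bits=64):
    """Approximate the rabbit constant from its first n_bits binary digits."""
    return sum(int(b) * 2.0 ** -(k + 1) for k, b in enumerate(rabbit_bits(n_bits)))

bits = rabbit_bits(13)
value = rabbit_constant()
print(bits)
print(value)
```

<p>The word lengths after each pass are Fibonacci numbers, which is where the link to the golden ratio comes from.</p>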
<p><strong>Content of this article</strong></p>
<p>1. Some Definitions</p>
<p>2. Digits Distribution in b-processes</p>
<p>3. Strange Facts and Conjectures about the Rabbit Constant</p>
<p>4. Gaming Application</p>
<ul>
<li>De-correlating Using Mapping and Thinning Techniques</li>
<li>Dissolving the Auto-correlation Structure Using Multivariate b-processes</li>
</ul>
<p>5. Related Articles</p>
<p><em>Read the full article <a href="https://www.datasciencecentral.com/profiles/blogs/some-fun-with-the-golden-ratio-time-series-and-number-theory" target="_blank" rel="noopener">here</a>. </em></p>Causality – The Next Most Important Thing in AI/MLtag:www.analyticbridge.datasciencecentral.com,2019-04-25:2004291:BlogPost:3923012019-04-25T01:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong><em>Summary:</em></strong><em> Finally there are tools that let us transcend ‘correlation is not causation’ and<span> </span><strong>identify true causal factors</strong><span> </span>and their relative strengths in our models. This is what prescriptive analytics was meant to be.</em></p>
<p> <a href="https://storage.ning.com/topology/rest/1.0/file/get/2132982369?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2132982369?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>Just when I thought we’d figured it all out, something comes along to make me realize I was wrong. And that something in AI/ML is as simple as realizing that everything we’ve done so far is just curve-fitting. Whether it’s a scoring model or a CNN to recognize cats, it’s all about association; reducing the error between the distribution of two data sets. </p>
<p>What we should have had our eye on is CAUSATION. How many times have you repeated ‘correlation is not causation’? Well, it seems we didn’t stop to ask how AI/ML can actually determine causality. And now it turns out it can.</p>
<p>But achieving an understanding of causality requires us to let go of many of the common tools and techniques we’ve been trained to apply and to understand the data from a wholly new perspective. Fortunately, the constant advance of research and ever-increasing compute capability now make it possible for us to use new, relatively friendly tools to measure causality. </p>
<p>However, make no mistake, you’ll need to master the concepts of causal data analysis or you will most likely misunderstand what these tools can do.</p>
<p><em>Read the full article by Bill Vorhies, <a href="https://www.datasciencecentral.com/profiles/blogs/causality-the-next-most-important-thing-in-ai-ml" target="_blank" rel="noopener">here</a>. </em></p>New Stock Trading and Lottery Game Rooted in Deep Mathtag:www.analyticbridge.datasciencecentral.com,2019-04-15:2004291:BlogPost:3923672019-04-15T16:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>I describe here the ultimate number guessing game, played with real money. It is a new trading and gaming system, b<span>ased on state-of-the-art mathematical engineering, robust architecture, and patent-pending technology. It offers an alternative to the stock market and traditional gaming. This system is also far more transparent than the stock market, and cannot be manipulated, as formulas to win the biggest returns (with real money) are made public. Also, it simulates a neutral, efficient stock market. In short, nothing is random: everything is deterministic, fixed in advance, and known to all users. Yet it behaves in a way that looks perfectly random, and the public algorithms offered to win the biggest gains require so much computing power that, for all practical purposes, they are useless -- except to comply with gaming laws and to establish trustworthiness.</span></p>
<p><span>We use private algorithms to determine the winning numbers, and while they produce the exact same results as the public algorithms (we tested this extensively), they are incredibly more efficient, by many orders of magnitude. Also, it can be mathematically proved that the public and private algorithms are equivalent, and we actually proved it. We go through this verification process for any new algorithm introduced in our system. </span></p>
<p><span>In the last section, we offer a competition: can you use the public algorithm to identify the winning numbers computed with the private (secret) algorithm? If yes, the system is breakable, and a more sophisticated approach is needed, to make it work. I don't think anyone can find the winning numbers (you are welcome to prove me wrong), so the award will be offered to the contestant providing the best insights on how to improve the robustness of this system. And if by chance you manage to identify those winning numbers, great, you'll get a bonus! But it is not a requirement to win the award.</span></p>
<p></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/2006368707?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2006368707?profile=RESIZE_710x" class="align-center"/></a></span></p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-foundations-for-a-new-stock-market" target="_blank" rel="noopener">Read the full article</a></p>
<p><strong>Content</strong></p>
<p>1. Description, Main Features and Advantages</p>
<p>2. How it Works: the Secret Sauce</p>
<ul>
<li>Public Algorithm</li>
<li>The Winning Numbers</li>
<li>Using Seeds to Find the Winning Numbers</li>
<li>ROI Tables</li>
</ul>
<p>3. Business Model and Applications</p>
<ul>
<li>Managing the Money Flow</li>
</ul>
<p>4. Challenge and Statistical Results</p>
<ul>
<li>Data Science / Math Competition</li>
<li>Controlling the Variance of the Portfolio Value</li>
<li>Probability of Cracking the System</li>
</ul>
<p>5. Designing 16-bit and 32-bit Systems</p>
<ul>
<li>Layered ROI Tables</li>
<li>Smooth ROI Tables</li>
<li>Systems with Winning Numbers in [0, 1]</li>
</ul>A Radical AI Strategy - Platformicationtag:www.analyticbridge.datasciencecentral.com,2019-04-09:2004291:BlogPost:3920582019-04-09T05:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong><em>Summary:</em></strong><em> A new business model strategy based around intermediary platforms powered by AI/ML is promising the most direct path to fastest growth, profitability, and competitive success. Adopting this new approach requires a deep change in mindset and is quite different from just adopting AI/ML to optimize your current operations.</em></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1741416922?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1741416922?profile=RESIZE_710x" width="350" class="align-full"/></a>As a data scientist you may be wondering why you need to be concerned about strategy and business models. It’s simple. Different types of AI/ML are most appropriate for different business objectives. So whether you’re a data scientist being asked to plan and present the most appropriate portfolio of projects, or a CXO looking to support your new digital business model, you need to understand the relationship between data science and strategy.</p>
<p>In<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/now-that-we-ve-got-ai-what-do-we-do-with-it"><em><u>our last article</u></em></a><span> </span>we laid out the four major AI/ML powered business models. We set up a structure to help you think about “AI Inside”, which is essentially pasted on and used to optimize an existing old-style business model, versus “AI-First”, business models that can lead to real digital transformation.</p>
<p>AI-First models are typically associated with startups so not necessarily the first place a mature existing business would look for a strategy in its digital journey. But hidden in plain sight within AI-First is a business model strategy so bold that mature companies that have embraced it have outpaced their competitors by a wide margin. That’s adopting a “Platform Strategy”.</p>
<p><em>Read the full article, by Bill Vorhies, <a href="https://www.datasciencecentral.com/profiles/blogs/a-radical-ai-strategy-platformication" target="_blank" rel="noopener">here</a>. For more articles by the same author, <a href="https://www.datasciencecentral.com/profiles/blog/list?user=0h5qapp2gbuf8" target="_blank" rel="noopener">follow this link</a>. For more about AI applications, <a href="https://www.datasciencecentral.com/page/search?q=ai" target="_blank" rel="noopener">click here</a>. </em></p>Long-range Correlations in Time Series: Modeling, Testing, Case Studytag:www.analyticbridge.datasciencecentral.com,2019-04-01:2004291:BlogPost:3922472019-04-01T19:00:06.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>We investigate a large class of auto-correlated, stationary time series, proposing a new statistical test to measure departure from the base model, known as Brownian motion. We also discuss a methodology to deconstruct these time series, in order to identify the root mechanism that generates the observations. The time series studied here can be discrete or continuous in time, and can have various degrees of smoothness (typically measured using the Hurst exponent) as well as long-range or short-range correlations between successive values. Applications are numerous, and we focus here on a case study arising from an interesting number theory problem. In particular, we show that one of the time series investigated in my article on randomness theory [<a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">see here</a>, read section 4.1.(c)] is not Brownian despite the appearance. This has important implications regarding the problem in question. Applied to finance or economics, it makes the difference between an efficient market and one that can be gamed.</p>
<p>This article is accessible to a large audience, thanks to its tutorial style, illustrations, and easily replicable simulations. Nevertheless, we discuss modern, advanced, and state-of-the-art concepts. This is an area of active research. </p>
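<p>To make the idea of "departure from Brownian motion" concrete, here is a generic sanity check, not the new test proposed in the full article: the increments of a discrete Brownian-like walk should show a lag-1 auto-correlation close to zero, while a series with short-range dependence should not. The simulated series, the seed, and the function name are assumptions for the example.</p>

```python
import random

def lag1_autocorrelation(xs):
    """Sample lag-1 auto-correlation of a sequence."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

rng = random.Random(0)                        # fixed seed for repeatability
steps = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Random walk: cumulative sum of independent steps (discrete Brownian-like path).
walk = [0.0]
for s in steps:
    walk.append(walk[-1] + s)
increments = [b - a for a, b in zip(walk, walk[1:])]

# Contrast case: averaging neighbouring steps introduces short-range correlation.
smoothed = [(a + b) / 2 for a, b in zip(steps, steps[1:])]

r_walk = lag1_autocorrelation(increments)
r_smooth = lag1_autocorrelation(smoothed)
print(f"random-walk increments: {r_walk:.3f}, smoothed increments: {r_smooth:.3f}")
```

<p>A value near zero is consistent with the Brownian base model but does not prove it; higher-order lags and the Hurst exponent would need to be examined as well.</p>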
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1741616599?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1741616599?profile=RESIZE_710x" class="align-center"/></a> <strong>Content</strong></p>
<p>1. Introduction and time series deconstruction</p>
<ul>
<li>Example</li>
<li>Deconstructing time series</li>
<li>Correlations, Fractional Brownian motions</li>
</ul>
<p>2. Smoothness, Hurst exponent, and Brownian test</p>
<ul>
<li>Our Brownian tests of hypothesis</li>
<li>Data</li>
</ul>
<p>3. Results and conclusions</p>
<ul>
<li>Charts and interpretation</li>
<li>Conclusions</li>
</ul>
<p><strong>Read the full article, <a href="https://www.datasciencecentral.com/profiles/blogs/long-range-correlation-in-time-series-tutorial-and-case-study" target="_blank" rel="noopener">here</a>. </strong></p>Fascinating Developments in the Theory of Randomnesstag:www.analyticbridge.datasciencecentral.com,2019-03-21:2004291:BlogPost:3916312019-03-21T13:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>I present here some innovative results from my most recent research on stochastic processes, chaos modeling, and dynamical systems, with applications to Fintech, cryptography, number theory, and random number generators. While covering advanced topics, this article is accessible to professionals with limited knowledge in statistical or mathematical theory. It introduces new material not covered in my recent book (available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">here</a>) on applied stochastic processes. You don't need to read my book to understand this article, but the book is a nice complement and introduction to the concepts discussed here.</p>
<p>None of the material presented here is covered in standard textbooks on stochastic processes or dynamical systems. In particular, it has nothing to do with the classical logistic map or Brownian motions, though the systems investigated here exhibit very similar behaviors and are related to the classical models. This cross-disciplinary article is targeted to professionals with interests in statistics, probability, mathematics, machine learning, simulations, signal processing, operations research, computer science, pattern recognition, and physics. Because of its tutorial style, it should also appeal to beginners learning about Markov processes, time series, and data science techniques in general, offering fresh, off-the-beaten-path content not found anywhere else, contrasting with the material covered again and again in countless, identical books, websites, and classes catering to students and researchers alike. </p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1529825331?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1529825331?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Some problems discussed here could be used by college professors in the classroom, or as original exam questions, while others are extremely challenging questions that could be the subject of a PhD thesis or even well beyond that level. This article constitutes (along with my book) a stepping stone in my endeavor to solve one of the biggest mysteries in the universe: are the digits of mathematical constants such as Pi, evenly distributed? To this day, no one knows if these digits even have a distribution to start with, let alone whether that distribution is uniform or not. Part of the discussion is about statistical properties of numeration systems in a non-integer base (such as the golden ratio base) and its applications. All systems investigated here, whether deterministic or not, are treated as stochastic processes, including the digits in question. They all exhibit strong chaos, albeit easily manageable due to their ergodicity.<span> </span></p>
<p>Interesting connections to the golden ratio, Fibonacci numbers, Pisano periods, special polynomials, Brownian motions, and other special mathematical constants, are discussed throughout the article. All the analyses were done in Excel. You can download my spreadsheets from this article; all the results are replicable. Also, numerous illustrations are provided. </p>
<p></p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">Read the full article here</a>.</p>
<p><strong>Content of this article</strong></p>
<p>1. General framework, notations and terminology</p>
<ul>
<li>Finding the equilibrium distribution</li>
<li>Auto-correlation and spectral analysis</li>
<li>Ergodicity, convergence, and attractors</li>
<li>Space state, time state, and Markov chain approximations</li>
<li>Examples</li>
</ul>
<p>2. Case study</p>
<ul>
<li>First fundamental theorem</li>
<li>Second fundamental theorem</li>
<li>Convergence to equilibrium: illustration</li>
</ul>
<p>3. Applications</p>
<ul>
<li>Potential application domains</li>
<li>Example: the golden ratio process</li>
<li>Finding other useful b-processes</li>
</ul>
<p>4. Additional research topics</p>
<ul>
<li>Perfect stochastic processes</li>
<li>Characterization of equilibrium distributions (the attractors)</li>
<li>Probabilistic calculus and number theory, special integrals</li>
</ul>
<p>5. Appendix</p>
<ul>
<li>Computing the auto-correlation at equilibrium</li>
<li>Proof of the first fundamental theorem</li>
<li>How to find the exact equilibrium distribution</li>
</ul>
<p>6. Additional Resources</p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">Read the full article here</a>.</p>How to Automatically Determine the Number of Clusters in your Data - and moretag:www.analyticbridge.datasciencecentral.com,2019-03-14:2004291:BlogPost:3912662019-03-14T00:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don't exhibit well-separated clusters, and two people asked to tell the number of clusters by looking at a chart are likely to provide two different answers. Sometimes clusters overlap with each other, and large clusters contain sub-clusters, making the decision difficult.</p>
<p>For instance, how many clusters do you see in the picture below? What is the optimum number of clusters? No one can tell with certainty, not AI, not a human being, not an algorithm. </p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1405294997?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1405294997?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>How many clusters here? (source: see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-wizardry" target="_blank" rel="noopener">here</a>)</em></p>
<p>In the above picture, the underlying data suggests that there are three main clusters. But an answer such as 6 or 7 seems equally valid. </p>
<p>A number of empirical approaches have been used to determine the number of clusters in a data set. They usually fit into two categories:</p>
<ul>
<li>Model fitting techniques: an example is using a<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/decomposition-of-statistical-distributions-using-mixture-models-a" target="_blank" rel="noopener">mixture model</a> to fit your data and determine the optimum number of components; or using density estimation techniques and testing for the number of modes (see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-original-underused-statistical-tests" target="_blank" rel="noopener">here</a>). Sometimes, the fit is compared with that of a model where observations are uniformly distributed on the entire support domain, thus with no cluster; you may have to estimate the support domain in question, and assume that it is not made of disjoint sub-domains; in many cases, the convex hull of your data set, as an estimate of the support domain, is good enough. </li>
<li>Visual techniques: for instance, the silhouette or elbow rule (very popular).</li>
</ul>
<p>In both cases, you need a criterion to determine the optimum number of clusters. In the case of the elbow rule, one typically uses the percentage of unexplained variance. This number is 100% with zero clusters, and it decreases (initially sharply, then more modestly) as you increase the number of clusters in your model. When each point constitutes a cluster, this number drops to 0. Somewhere in between, the curve that displays your criterion exhibits an elbow (see picture below), and that elbow determines the number of clusters. For instance, in the chart below, the optimum number of clusters is 4.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1405610723?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1405610723?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>The elbow rule tells you that here, your data set has 4 clusters (elbow strength in red)</em></p>
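<p>The criterion behind the elbow rule can be sketched in a few lines. The example below computes the percentage of unexplained variance (within-cluster sum of squares over total sum of squares) for each number of clusters on a small one-dimensional data set. The data values and the toy clustering rule (cutting the sorted points at the widest gaps) are assumptions chosen to keep the example short; any clustering algorithm could be substituted, and the sketch does not attempt to locate the elbow itself.</p>

```python
def gap_labels(points, k):
    """Toy 1-D clustering: cut the sorted points at the k-1 widest gaps."""
    order = sorted(range(len(points)), key=lambda i: points[i])
    gaps = [(points[order[j + 1]] - points[order[j]], j)
            for j in range(len(order) - 1)]
    cuts = {j for _, j in sorted(gaps, reverse=True)[:k - 1]}
    labels = [0] * len(points)
    label = 0
    for pos, i in enumerate(order):
        labels[i] = label
        if pos in cuts:          # a cut after this position starts a new cluster
            label += 1
    return labels

def unexplained_pct(points, labels):
    """Within-cluster sum of squares as a percentage of the total sum of squares."""
    mean = sum(points) / len(points)
    total = sum((x - mean) ** 2 for x in points)
    groups = {}
    for x, lab in zip(points, labels):
        groups.setdefault(lab, []).append(x)
    within = 0.0
    for g in groups.values():
        center = sum(g) / len(g)
        within += sum((x - center) ** 2 for x in g)
    return 100.0 * within / total

# Illustrative data only: three loose groups on a line.
points = [0.0, 1.0, 3.0, 11.0, 15.0, 20.0, 41.0, 47.0, 54.0]
curve = [unexplained_pct(points, gap_labels(points, k))
         for k in range(1, len(points) + 1)]
for k, pct in enumerate(curve, start=1):
    print(f"{k} cluster(s): {pct:6.2f}% unexplained")
```

<p>A single cluster containing every point leaves 100% of the variance unexplained, and one cluster per point leaves 0%, matching the two ends of the curve described above.</p>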
<p>Good references on the topic are available. Some R functions are available too, for instance fviz_nbclust. However, I could not find in the literature how the elbow point is explicitly computed. Most references mention that it is hand-picked by visual inspection, or based on some predetermined but arbitrary threshold. In the next section, we solve this problem.</p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/how-to-automatically-determine-the-number-of-clusters-in-your-dat" target="_blank" rel="noopener">Read full article here</a>.</p>
<p><strong>Deep Analytical Thinking and Data Science Wizardry</strong></p>
<p><em>Vincent Granville, 2019-03-07</em></p>
<p>Complex models are often insufficient, too heavy, or simply unnecessary for getting robust, sustainable insights out of data. Deep analytical thinking may prove more useful, and it can be done by people not necessarily trained in data science, even by people with limited coding experience. Here we explore what we mean by deep analytical thinking, using a case study, and how it works: combining craftsmanship, business acumen, and the use and creation of tricks and rules of thumb to provide sound answers to business problems. These skills are usually acquired through experience more than through training, and data science generalists (see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/why-you-should-be-a-data-science-generalist" target="_blank" rel="noopener">here</a><span> </span>how to become one) usually possess them.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1308372685?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1308372685?profile=RESIZE_710x" class="align-center"/></a></p>
<p>This article is aimed at data science managers and decision makers, as well as junior professionals who want to become one at some point in their career. Deep thinking, unlike deep learning, is also more difficult to automate, so it provides better job security. Those automating deep learning are actually the new data science wizards, who can think outside the box. Much of what is described in this article is also data science wizardry, and is taught neither in standard textbooks nor in the classroom. By reading this tutorial, you will learn and be able to use these data science secrets, and possibly change your perspective on data science. Data science is like an iceberg: everyone knows and can see the tip of the iceberg (regression models, neural nets, cross-validation, clustering, Python, and so on, as presented in textbooks). Here I focus on the unseen bottom, using a level of statistics almost accessible to the layman, avoiding jargon and complicated math formulas, yet discussing a few advanced concepts.</p>
<p><span><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-wizardry" target="_blank" rel="noopener">Read full article here</a>. </span></p>
<p><strong>Content</strong></p>
<p>1. Case Study: The Problem</p>
<p>2. Deep Analytical Thinking</p>
<ul>
<li>Answering hidden questions</li>
<li>Business questions</li>
<li>Data questions</li>
<li>Metrics questions</li>
</ul>
<p>3. Data Science Wizardry</p>
<ul>
<li>Generic algorithm</li>
<li>Illustration with three different models</li>
<li>Results</li>
</ul>
<p>4. A few data science hacks</p>