<p><strong>Vincent Granville's Posts - AnalyticBridge</strong></p>
<p><strong>Machine Learning and Data Science Cheat Sheet</strong> (June 7, 2019)</p>
<p>Originally published in 2014 and viewed more than 200,000 times, this is the oldest data science cheat sheet - the mother of all the numerous cheat sheets that are so popular nowadays. I decided to update it in June 2019. While the first half, dealing with installing components on your laptop and learning UNIX, regular expressions, and file management, hasn't changed much, the second half, dealing with machine learning, was rewritten entirely from scratch. It is amazing how much things have changed in just five years!</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2802101885?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2802101885?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Written for people who have never seen a computer in their life, it starts at the very beginning: buying a laptop! You can skip the first half and jump to sections 5 and 6 if you are already familiar with UNIX. This new cheat sheet will be included in my upcoming book <em>Machine Learning: Foundations, Toolbox, and Recipes</em>, to be published in September 2019 and available (for free) exclusively to Data Science Central members. This cheat sheet is 14 pages long.</p>
<p><strong>Content</strong></p>
<p>1. Hardware</p>
<p>2. Linux environment on Windows laptop</p>
<p>3. Basic UNIX commands</p>
<p>4. Scripting languages</p>
<p>5. Python, R, Hadoop, SQL, DataViz</p>
<p>6. Machine Learning</p>
<ul>
<li>Algorithms</li>
<li>Getting started</li>
<li>Applications</li>
<li>Data sets and sample projects</li>
</ul>
<p>This new cheat sheet is available <a href="https://www.datasciencecentral.com/profiles/blogs/data-science-cheat-sheet" target="_blank" rel="noopener">here</a>.</p>
<p><strong>7 Simple Tricks to Handle Complex Machine Learning Issues</strong> (June 4, 2019)</p>
<p>We propose simple solutions to important problems that all data scientists face almost every day. In short, a toolbox for the handyman, useful to busy professionals in any field.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2760849159?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2760849159?profile=RESIZE_710x" class="align-center"/></a></p>
<p><strong>1. Eliminating sample size effects</strong>. Many statistics, such as correlations or R-squared, depend on the sample size, making it difficult to compare values computed on two data sets of different sizes. This easy trick, based on re-sampling techniques, lets you compare apples with apples rather than with oranges. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-normalize-correlations-r-squared-and-so-on" target="_blank" rel="noopener">here</a>.</p>
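<p>As a flavor of what such a re-sampling normalization might look like (a generic sketch, not necessarily the exact recipe of the linked article; the function name and parameters are ours), one can repeatedly subsample both data sets down to a common size and average the correlation at that size:</p>
<pre>
# Generic sketch: compare correlations at a common effective sample size n0,
# by averaging the correlation over many random subsamples of size n0.
import numpy as np

def corr_at_common_size(x, y, n0=100, n_resamples=500, seed=42):
    """Average Pearson correlation of (x, y) over random subsamples of size n0."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    vals = []
    for _ in range(n_resamples):
        idx = rng.choice(len(x), size=n0, replace=False)  # subsample to size n0
        vals.append(np.corrcoef(x[idx], y[idx])[0, 1])
    return float(np.mean(vals))

# Compare two data sets of very different sizes at the same effective size n0
rng = np.random.default_rng(0)
x1 = rng.normal(size=5000); y1 = x1 + rng.normal(size=5000)
x2 = rng.normal(size=300);  y2 = x2 + rng.normal(size=300)
print(corr_at_common_size(x1, y1), corr_at_common_size(x2, y2))
</pre>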
<p><strong>2. Sample size determination, and simple, model-free confidence intervals</strong>. We propose a generic methodology, also based on re-sampling techniques, to compute any confidence interval and to test hypotheses without using any statistical theory. It is also easy to implement, even in Excel. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/modern-re-sampling-and-statistical-recipes" target="_blank" rel="noopener">here</a>.</p>
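<p>A minimal sketch of the re-sampling idea, using a generic percentile bootstrap for an arbitrary estimator (the linked article's methodology differs in its details and also covers sample size determination; this snippet only illustrates the model-free spirit):</p>
<pre>
# Generic percentile bootstrap: a model-free confidence interval for any estimator
# (mean, median, correlation, ...), with no distributional assumptions.
import numpy as np

def resampling_ci(data, estimator=np.median, level=0.95, n_resamples=2000, seed=1):
    """Percentile bootstrap confidence interval for an arbitrary estimator."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = [estimator(rng.choice(data, size=len(data), replace=True))
             for _ in range(n_resamples)]
    alpha = 100 * (1 - level) / 2
    return tuple(np.percentile(stats, [alpha, 100 - alpha]))

# Example: a 95% interval for the median of skewed data
data = np.random.default_rng(7).exponential(scale=2.0, size=400)
print(resampling_ci(data, estimator=np.median))
</pre>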
<p><strong>3. Determining the number of clusters in unsupervised clustering</strong>. This modern version of the elbow rule also tells you how strong the global optimum is, and can help you identify local optima too. It can also be automated. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/how-to-automatically-determine-the-number-of-clusters-in-your-dat" target="_blank" rel="noopener">here</a>.</p>
<p><strong>4. Fixing issues in regression models when the assumptions are violated</strong>. If your data has serial correlation, unequal variances, or other similar problems, this simple trick removes the issue and allows you to perform more meaningful regressions, or to detect flaws in your data set. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/simple-trick-to-remove-serial-correlation-in-regression-models" target="_blank" rel="noopener">here</a>.</p>
<p><strong>5. Performing joins on poor quality data</strong>. This 40-year-old trick allows you to perform a join when your data is infested with typos, multiple names representing the same entity, and other similar issues. In short, it performs a fuzzy join. Read more <a href="https://www.datasciencecentral.com/forum/topics/40-year-old-trick-to-clean-data-efficiently" target="_blank" rel="noopener">here</a>.</p>
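<p>For illustration only, here is a generic fuzzy-join sketch based on a string-similarity score over normalized keys; the specific 40-year-old trick is described at the link and may work differently:</p>
<pre>
# Generic fuzzy-join sketch: match records whose normalized keys are similar enough.
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase and keep only alphanumeric characters."""
    return "".join(ch for ch in s.lower() if ch.isalnum())

def fuzzy_join(left, right, threshold=0.85):
    """Return (left, right, score) triples whose normalized keys are similar enough."""
    matches = []
    for l in left:
        for r in right:
            score = SequenceMatcher(None, normalize(l), normalize(r)).ratio()
            if score >= threshold:
                matches.append((l, r, round(score, 3)))
    return matches

print(fuzzy_join(["Data Science Central", "AnalyticBridge"],
                 ["data-science central inc.", "Analytic Bridge"]))
</pre>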
<p><strong>6. Scale-invariant techniques</strong>. Sometimes, transforming your data, even changing the scale of one feature (say from meters to feet), has a dramatic impact on the results. Sometimes, you want your conclusions to be scale-independent. This trick solves that problem. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/scale-invariant-clustering-and-regression" target="_blank" rel="noopener">here</a>.</p>
<p><strong>7. Blending data sets with incompatible data, adding consistency to your metrics</strong>. We are all too familiar with metrics that change over time and result in inconsistencies when comparing the past to the present, or when comparing different segments with incompatible measurements. This trick will allow you to design systems where again, apples are compared to other apples, not to oranges. Read more <a href="https://www.datasciencecentral.com/profiles/blogs/how-to-stabilize-data-to-avoid-decay-in-model-performance" target="_blank" rel="noopener">here</a>.</p>
<p><em>To not miss this type of content in the future, <a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter">subscribe</a> to our newsletter. For related articles from the same author, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">click here</a> or visit <a href="http://www.vincentgranville.com/" target="_blank" rel="noopener">www.VincentGranville.com</a>. Follow me <a href="https://www.linkedin.com/in/vincentg/" target="_blank" rel="noopener">on LinkedIn</a>, or visit my old web page <a href="http://www.datashaping.com">here</a>.</em></p>
<p><span style="font-size: 12pt;"><strong>Resources from our sponsors</strong></span></p>
<ul>
<li dir="ltr"><a href="https://dsc.news/2WFHJ0q" target="_blank" rel="noopener">The State of Data Preparation in 2019</a> - June 25</li>
<li dir="ltr"><a href="https://dsc.news/2JWn6XR" target="_blank" rel="noopener">AI in Action: Real-time Anomaly Detection</a> - June 18</li>
<li dir="ltr"><a href="https://dsc.news/2GZmBtn" target="_blank" rel="noopener">Balancing AI Endeavors with Analytic Talent</a> - DSC Podcast</li>
</ul>
<p><strong>Gentle Approach to Linear Algebra, with Machine Learning Applications</strong> (May 29, 2019)</p>
<p><span>This simple introduction to matrix theory offers a refreshing perspective on the subject. Using a basic concept that leads to a simple formula for the power of a matrix, we see how it can solve time series, Markov chains, linear regression, data reduction, principal components analysis (PCA) and other machine learning problems. These problems are usually solved with more advanced matrix calculus, including eigenvalues, diagonalization, generalized inverse matrices, and other types of matrix normalization. Our approach is more intuitive and thus appealing to professionals who do not have a strong mathematical background, or who have forgotten what they learned in math textbooks. It will also appeal to physicists and engineers. Finally, it leads to simple algorithms, for instance for matrix inversion. The classical statistician or data scientist will find our approach somewhat intriguing. </span></p>
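<p>As one illustration of how powers of a matrix can lead to a simple inversion algorithm, here is a sketch based on the Neumann series A^(-1) = I + (I - A) + (I - A)^2 + ..., valid when the spectral radius of I - A is below 1 (an example offered in the same spirit, not necessarily the exact algorithm of the article):</p>
<pre>
# Sketch: matrix inversion via a power series in (I - A). Converges when the
# spectral radius of (I - A) is below 1, e.g. after suitable scaling of A.
import numpy as np

def neumann_inverse(A, n_terms=200):
    """Approximate A^(-1) with the series I + (I - A) + (I - A)^2 + ..."""
    A = np.asarray(A, dtype=float)
    I = np.eye(A.shape[0])
    M = I - A
    term, total = I.copy(), I.copy()
    for _ in range(n_terms):
        term = term @ M      # successive powers of (I - A)
        total += term
    return total

A = np.array([[1.0, 0.2], [0.1, 0.9]])
print(neumann_inverse(A) @ A)   # close to the identity matrix
</pre>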
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/2716936013?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2716936013?profile=RESIZE_710x" class="align-center"/></a></span></p>
<p><strong>Content</strong></p>
<p>1. Power of a matrix</p>
<p>2. Examples, Generalization, and Matrix Inversion</p>
<ul>
<li>Example with a non-invertible matrix</li>
<li>Fast computations</li>
</ul>
<p>3. Application to Machine Learning Problems</p>
<ul>
<li>Markov chains</li>
<li>Time series</li>
<li>Linear regression</li>
</ul>
<p><span><a href="https://www.datasciencecentral.com/profiles/blogs/new-approach-to-linear-algebra-in-machine-learning" target="_blank" rel="noopener">Read the full article</a>. </span></p>New Book: Classification and Regression In a Weekend (in Python)tag:www.analyticbridge.datasciencecentral.com,2019-05-17:2004291:BlogPost:3927002019-05-17T00:24:08.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>We have added a new free book in our selection exclusively for DSC members. See the first entry below, to get started with machine learning with Python.</p>
<p><strong>1. Book: Classification and Regression In a Weekend</strong></p>
<p>This tutorial began as a series of weekend workshops created by Ajit Jaokar and Dan Howarth. The idea was to work with a specific (longish) program such that we explore as much of it as possible in one weekend. This book is an attempt to take this idea online. The best way to use this book is to work with the Python code as much as you can. The code has comments, but you can expand on them using the concepts explained in the book.</p>
<p>The table of contents is available <a href="https://www.datasciencecentral.com/profiles/blogs/free-book-classification-and-regression-in-a-weekend" target="_blank" rel="noopener">here</a>. The book can be accessed <a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a> (members only).</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2626374029?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2626374029?profile=RESIZE_710x" class="align-center"/></a></p>
<p><strong>2. Book: Enterprise AI - An Application Perspective</strong> </p>
<p>Enterprise AI: An Applications Perspective takes a use-case-driven approach to understanding the deployment of AI in the enterprise. Designed for strategists and developers, the book provides a practical and straightforward roadmap, based on application use cases, for AI in enterprises. The authors (Ajit Jaokar and Cheuk Ting Ho) are data scientists and AI researchers who have deployed AI applications for enterprise domains. The book is used as a reference for Ajit and Cheuk's new course on Implementing Enterprise AI.</p>
<p>The table of contents is available <a href="https://www.datasciencecentral.com/profiles/blogs/free-ebook-enterprise-ai-an-applications-perspective" target="_blank" rel="noopener">here</a>. The book can be accessed <a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a> (members only).</p>
<p><strong>3. Book: Applied Stochastic Processes</strong></p>
<p>Full title:<span> </span><em>Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems</em>. Published June 2, 2018. Author: Vincent Granville, PhD. (104 pages, 16 chapters.)</p>
<p>This book is intended for professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers, and executives in various quantitative fields.</p>
<p>New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without using jargon or arcane theory. It unifies topics that are usually part of different fields (data science, operations research, dynamical systems, computer science, number theory, probability) broadening the knowledge and interest of the reader in ways that are not found in any other book. This short book contains a large amount of condensed material that would typically be covered in 500 pages in traditional publications. Thanks to cross-references and redundancy, the chapters can be read independently, in random order.</p>
<p>The table of contents is available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">here</a>. The book (PDF) can be accessed <a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a> (members only).</p>
<p><strong>Confidence Intervals Without Pain, with Excel</strong> (May 9, 2019)</p>
<p>We propose a simple, model-free solution to compute any confidence interval and to extrapolate these intervals beyond the observations available in your data set. In addition, we propose a mechanism to sharpen the confidence intervals, reducing their width by an order of magnitude. The methodology works with any estimator (mean, median, variance, quantile, correlation, and so on) even when the data set violates the classical requirements necessary to make traditional statistical techniques work. In particular, our method also applies to observations that are auto-correlated, non-identically distributed, non-normal, and even non-stationary. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2383098025?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2383098025?profile=RESIZE_710x" class="align-center"/></a></p>
<p>No statistical knowledge is required to understand, implement, and test our algorithm, nor to interpret the results. Its robustness makes it suitable for black-box, automated machine learning technology. It will appeal to anyone dealing with data on a regular basis, such as data scientists, statisticians, software engineers, economists, quants, physicists, biologists, psychologists, system and business analysts, and industrial engineers. </p>
<p>In particular, we provide a confidence interval (CI) for the width of confidence intervals without using Bayesian statistics. The width is modeled as <em>L</em> = <em>A</em> / <em>n</em>^<em>B</em>, and we compute, using Excel alone, a 95% CI for <em>B</em> in the classic case where <em>B</em> = 1/2. We also exhibit an artificial data set where <em>L</em> = 1 / (log <em>n</em>)^Pi. Here <em>n</em> is the sample size.</p>
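<p>The article performs the fit in Excel; as a hedged illustration, the exponent <em>B</em> can also be estimated by ordinary least squares on the log-log relationship log <em>L</em> = log <em>A</em> - <em>B</em> log <em>n</em> (the widths below are simulated, not the article's data):</p>
<pre>
# Sketch: estimate B in L = A / n^B by least squares on log L = log A - B log n.
import numpy as np

rng = np.random.default_rng(3)
n = np.array([100, 200, 500, 1000, 2000, 5000, 10000])
# Simulated confidence-interval widths with true A = 4 and true B = 1/2, plus noise
L = 4.0 / np.sqrt(n) * np.exp(rng.normal(scale=0.05, size=n.size))

slope, intercept = np.polyfit(np.log(n), np.log(L), 1)   # slope = -B, intercept = log A
print("estimated B:", -slope, "estimated A:", np.exp(intercept))
</pre>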
<p><span>Despite the apparent simplicity of our approach, we are dealing here with martingales. But you don't need to know what a martingale is to understand the concepts and use our methodology. </span></p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/confidence-intervals-without-pain" target="_blank" rel="noopener">Read the full article here</a>.</p>Re-sampling: Amazing Results and Applicationstag:www.analyticbridge.datasciencecentral.com,2019-05-04:2004291:BlogPost:3925562019-05-04T18:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>This crash course features a new fundamental statistics theorem -- even more important than the central limit theorem -- and a new set of statistical rules and recipes. We discuss concepts related to determining the optimum sample size, the optimum <em>k</em> in <em>k</em>-fold cross-validation, bootstrapping, new re-sampling techniques, simulations, tests of hypotheses, confidence intervals, and statistical inference, using a unified, robust, simple approach with easy formulas, efficient algorithms, and illustrations on complex data.</p>
<p>Little statistical knowledge is required to understand and apply the methodology described here, yet it is more advanced, more general, and more applied than standard literature on the subject. The intended audience is beginners as well as professionals in any field faced with data challenges on a daily basis. This article presents statistical science in a different light, hopefully in a style more accessible, intuitive, and exciting than standard textbooks, and in a compact format yet covering a large chunk of the traditional statistical curriculum and beyond.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2301106250?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2301106250?profile=RESIZE_710x" class="align-center"/></a></p>
<p>In particular, the concept of<span> </span><em>p</em>-value is not explicitly included in this tutorial. Instead, following the new trend after the recent <em>p</em>-value debacle (addressed<span> </span>by the president of the American Statistical Association), it is replaced with a range of values computed on multiple sub-samples. </p>
<p>Our algorithms are suitable for inclusion in black-box systems, batch processing, and automated data science. Our technology is data-driven and model-free. Finally, our approach to this problem shows the contrast between the data science unified, bottom-up, and computationally-driven perspective, and the traditional top-down statistical analysis consisting of a collection of disparate results that emphasizes the theory. </p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/modern-re-sampling-and-statistical-recipes" target="_blank" rel="noopener">Read the full article here</a>.</p>
<p><span><strong>Contents</strong></span></p>
<p><span>1. Re-sampling and Statistical Inference</span></p>
<ul>
<li><span>Main Result</span></li>
<li><span>Sampling with or without Replacement</span></li>
<li><span>Illustration</span></li>
<li><span>Optimum Sample Size </span></li>
<li><span>Optimum <em>K</em> in <em>K</em>-fold Cross-Validation</span></li>
<li><span>Confidence Intervals, Tests of Hypotheses</span></li>
</ul>
<p><span>2. Generic, All-purposes Algorithm</span></p>
<ul>
<li><span>Re-sampling Algorithm with Source Code</span></li>
<li><span>Alternative Algorithm</span></li>
<li><span>Using a Good Random Number Generator</span></li>
</ul>
<p><span>3. Applications</span></p>
<ul>
<li><span>A Challenging Data Set</span></li>
<li><span>Results and Excel Spreadsheet</span></li>
<li><span>A New Fundamental Statistics Theorem</span></li>
<li><span>Some Statistical Magic</span></li>
<li><span>How does this work?</span></li>
<li><span>Does this contradict entropy principles?</span></li>
</ul>
<p><span>4. Conclusions</span></p>
<p><strong>Some Fun with Gentle Chaos, the Golden Ratio, and Stochastic Number Theory</strong> (April 25, 2019)</p>
<p>So many fascinating and deep results have been written about the number (1 + SQRT(5)) / 2 and its related sequence - the Fibonacci numbers - that it would take years to read all of them. This number has been studied both for its applications (population growth, architecture) and its mathematical properties, for over 2,000 years. It is still a topic of active research.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2197458362?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2197458362?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>Lag-1 auto-correlation in digit distribution of good seeds, for b-processes</em></p>
<p>I show here how I used the golden ratio for a new number guessing game (to generate chaos and randomness in ergodic time series) as well as new intriguing results, in particular:</p>
<ul>
<li>Proof that the<span> </span><a href="http://mathworld.wolfram.com/RabbitConstant.html" target="_blank" rel="noopener">rabbit constant</a><span> </span>is not normal in any base; this might be the first instance of a non-artificial mathematical constant for which the normality status is formally established.</li>
<li>Beatty sequences, pseudo-periodicity, and infinite-range auto-correlations for the digits of irrational numbers in the numeration system derived from perfect stochastic processes</li>
<li>Properties of multivariate<span> </span><em>b</em>-processes, including integer or non-integer bases.</li>
<li>Weird behavior of auto-correlations for the digits of normal numbers (good seeds) in the numeration system derived from stochastic<span> </span><em>b</em>-processes</li>
<li>A strange recursion that generates all the digits of the rabbit constant</li>
</ul>
<p><strong>Content of this article</strong></p>
<p>1. Some Definitions</p>
<p>2. Digits Distribution in b-processes</p>
<p>3. Strange Facts and Conjectures about the Rabbit Constant</p>
<p>4. Gaming Application</p>
<ul>
<li>De-correlating Using Mapping and Thinning Techniques</li>
<li>Dissolving the Auto-correlation Structure Using Multivariate b-processes</li>
</ul>
<p>5. Related Articles</p>
<p><em>Read the full article <a href="https://www.datasciencecentral.com/profiles/blogs/some-fun-with-the-golden-ratio-time-series-and-number-theory" target="_blank" rel="noopener">here</a>.</em></p>
<p><strong>Causality – The Next Most Important Thing in AI/ML</strong> (April 25, 2019)</p>
<p><strong><em>Summary:</em></strong><em> Finally there are tools that let us transcend ‘correlation is not causation’ and<span> </span><strong>identify true causal factors</strong><span> </span>and their relative strengths in our models. This is what prescriptive analytics was meant to be.</em></p>
<p> <a href="https://storage.ning.com/topology/rest/1.0/file/get/2132982369?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2132982369?profile=RESIZE_710x" width="400" class="align-center"/></a></p>
<p>Just when I thought we’d figured it all out, something comes along to make me realize I was wrong. And that something in AI/ML is as simple as realizing that everything we’ve done so far is just curve-fitting. Whether it’s a scoring model or a CNN to recognize cats, it’s all about association; reducing the error between the distribution of two data sets. </p>
<p>What we should have had our eye on is CAUSATION. How many times have you repeated ‘correlation is not causation’? Well, it seems we didn’t stop to ask how AI/ML can actually determine causality. And now it turns out it can.</p>
<p>But achieving an understanding of causality requires us to cast off many of the common tools and techniques we’ve been trained to apply, and to understand the data from a wholly new perspective. Fortunately, the constant advance of research and ever-increasing compute capability now makes it possible for us to use new, relatively friendly tools to measure causality. </p>
<p>However, make no mistake, you’ll need to master the concepts of causal data analysis or you will most likely misunderstand what these tools can do.</p>
<p><em>Read the full article by Bill Vorhies <a href="https://www.datasciencecentral.com/profiles/blogs/causality-the-next-most-important-thing-in-ai-ml" target="_blank" rel="noopener">here</a>.</em></p>
<p><strong>New Stock Trading and Lottery Game Rooted in Deep Math</strong> (April 15, 2019)</p>
<p>I describe here the ultimate number guessing game, played with real money. It is a new trading and gaming system, based on state-of-the-art mathematical engineering, robust architecture, and patent-pending technology. It offers an alternative to the stock market and traditional gaming. This system is also far more transparent than the stock market, and cannot be manipulated, as the formulas to win the biggest returns (with real money) are made public. Also, it simulates a neutral, efficient stock market. In short, there is nothing random; everything is deterministic, fixed in advance, and known to all users. Yet it behaves in a way that looks perfectly random, and the public algorithms offered to win the biggest gains require so much computing power that, for all practical purposes, they are useless -- except to comply with gaming laws and to establish trustworthiness.</p>
<p><span>We use private algorithms to determine the winning numbers, and while they produce the exact same results as the public algorithms (we tested this extensively), they are incredibly more efficient, by many orders of magnitude. Also, it can be mathematically proved that the public and private algorithms are equivalent, and we actually proved it. We go through this verification process for any new algorithm introduced in our system. </span></p>
<p><span>In the last section, we offer a competition: can you use the public algorithm to identify the winning numbers computed with the private (secret) algorithm? If yes, the system is breakable, and a more sophisticated approach is needed, to make it work. I don't think anyone can find the winning numbers (you are welcome to prove me wrong), so the award will be offered to the contestant providing the best insights on how to improve the robustness of this system. And if by chance you manage to identify those winning numbers, great, you'll get a bonus! But it is not a requirement to win the award.</span></p>
<p><span><a href="https://storage.ning.com/topology/rest/1.0/file/get/2006368707?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2006368707?profile=RESIZE_710x" class="align-center"/></a></span></p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-foundations-for-a-new-stock-market" target="_blank" rel="noopener">Read the full article</a></p>
<p><strong>Content</strong></p>
<p>1. Description, Main Features and Advantages</p>
<p>2. How it Works: the Secret Sauce</p>
<ul>
<li>Public Algorithm</li>
<li>The Winning Numbers</li>
<li>Using Seeds to Find the Winning Numbers</li>
<li>ROI Tables</li>
</ul>
<p>3. Business Model and Applications</p>
<ul>
<li>Managing the Money Flow</li>
</ul>
<p>4. Challenge and Statistical Results</p>
<ul>
<li>Data Science / Math Competition</li>
<li>Controlling the Variance of the Portfolio Value</li>
<li>Probability of Cracking the System</li>
</ul>
<p>5. Designing 16-bit and 32-bit Systems</p>
<ul>
<li>Layered ROI Tables</li>
<li>Smooth ROI Tables</li>
<li>Systems with Winning Numbers in [0, 1]</li>
</ul>
<p><strong>A Radical AI Strategy - Platformication</strong> (April 9, 2019)</p>
<p><strong><em>Summary:</em></strong><em> A new business model strategy based around intermediary platforms powered by AI/ML is promising the most direct path to fastest growth, profitability, and competitive success. Adopting this new approach requires a deep change in mindset and is quite different from just adopting AI/ML to optimize your current operations.</em></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1741416922?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1741416922?profile=RESIZE_710x" width="350" class="align-full"/></a>As a data scientist you may be wondering why you need to be concerned about strategy and business models. It’s simple. Different types of AI/ML are most appropriate for different business objectives. So whether you’re a data scientist being asked to plan and present the most appropriate portfolio of projects, or a CXO looking to support your new digital business model, you need to understand the relationship between data science and strategy.</p>
<p>In <a href="https://www.datasciencecentral.com/profiles/blogs/now-that-we-ve-got-ai-what-do-we-do-with-it"><em><u>our last article</u></em></a> we laid out the four major AI/ML powered business models. We set up a structure to help you think about “AI Inside”, essentially pasted on and used to optimize an existing old-style business model, versus “AI-First” business models that can lead to real digital transformation.</p>
<p>AI-First models are typically associated with startups, so they are not necessarily the first place a mature existing business would look for a strategy in its digital journey. But hidden in plain sight within AI-First is a business model strategy so bold that mature companies that have embraced it have outpaced their competitors by a wide margin. That’s adopting a “Platform Strategy”.</p>
<p><em>Read the full article by Bill Vorhies <a href="https://www.datasciencecentral.com/profiles/blogs/a-radical-ai-strategy-platformication" target="_blank" rel="noopener">here</a>. For more articles by the same author, <a href="https://www.datasciencecentral.com/profiles/blog/list?user=0h5qapp2gbuf8" target="_blank" rel="noopener">follow this link</a>. For more about AI applications, <a href="https://www.datasciencecentral.com/page/search?q=ai" target="_blank" rel="noopener">click here</a>.</em></p>
<p><strong>Long-range Correlations in Time Series: Modeling, Testing, Case Study</strong> (April 1, 2019)</p>
<p>We investigate a large class of auto-correlated, stationary time series, proposing a new statistical test to measure departure from the base model, known as Brownian motion. We also discuss a methodology to deconstruct these time series, in order to identify the root mechanism that generates the observations. The time series studied here can be discrete or continuous in time, they can have various degrees of smoothness (typically measured using the Hurst exponent) as well as long-range or short-range correlations between successive values. Applications are numerous, and we focus here on a case study arising from some interesting number theory problem. In particular, we show that one of the times series investigated in my article on randomness theory [<a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">see here</a>, read section 4.1.(c)] is not Brownian despite the appearance. It has important implications regarding the problem in question. Applied to finance or economics, it makes the difference between an efficient market, and one that can be gamed.</p>
<p>This article is accessible to a large audience, thanks to its tutorial style, illustrations, and easily replicable simulations. Nevertheless, we discuss modern, advanced, and state-of-the-art concepts. This is an area of active research. </p>
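<p>As a flavor of how departure from Brownian motion can be quantified, here is a simple sketch estimating the Hurst exponent from the way the variance of increments scales with the lag (H is about 1/2 for ordinary Brownian motion); this is one standard estimator, not necessarily the specific test proposed in the article:</p>
<pre>
# Sketch: Hurst exponent estimate from the scaling of increment variances,
# Var[X(t+lag) - X(t)] ~ lag^(2H) for fractional Brownian motion (H = 1/2 for
# ordinary Brownian motion).
import numpy as np

def hurst_exponent(x, lags=range(2, 50)):
    x = np.asarray(x, dtype=float)
    tau = np.array(list(lags))
    v = np.array([np.var(x[lag:] - x[:-lag]) for lag in tau])
    slope, _ = np.polyfit(np.log(tau), np.log(v), 1)   # slope = 2H
    return slope / 2.0

rng = np.random.default_rng(0)
bm = np.cumsum(rng.normal(size=100_000))   # discrete Brownian-like random walk
print(hurst_exponent(bm))                  # expected to be close to 0.5
</pre>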
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1741616599?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1741616599?profile=RESIZE_710x" class="align-center"/></a> <strong>Content</strong></p>
<p>1. Introduction and time series deconstruction</p>
<ul>
<li>Example</li>
<li>Deconstructing time series</li>
<li>Correlations, Fractional Brownian motions</li>
</ul>
<p>2. Smoothness, Hurst exponent, and Brownian test</p>
<ul>
<li>Our Brownian tests of hypothesis</li>
<li>Data</li>
</ul>
<p>3. Results and conclusions</p>
<ul>
<li>Charts and interpretation</li>
<li>Conclusions</li>
</ul>
<p><strong>Read the full article <a href="https://www.datasciencecentral.com/profiles/blogs/long-range-correlation-in-time-series-tutorial-and-case-study" target="_blank" rel="noopener">here</a>.</strong></p>
<p><strong>Fascinating Developments in the Theory of Randomness</strong> (March 21, 2019)</p>
<p>I present here some innovative results from my most recent research on stochastic processes, chaos modeling, and dynamical systems, with applications to Fintech, cryptography, number theory, and random number generators. While covering advanced topics, this article is accessible to professionals with limited knowledge in statistical or mathematical theory. It introduces new material not covered in my recent book (available <a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">here</a>) on applied stochastic processes. You don't need to read my book to understand this article, but the book is a nice complement and introduction to the concepts discussed here.</p>
<p>None of the material presented here is covered in standard textbooks on stochastic processes or dynamical systems. In particular, it has nothing to do with the classical logistic map or Brownian motions, though the systems investigated here exhibit very similar behaviors and are related to the classical models. This cross-disciplinary article is targeted to professionals with interests in statistics, probability, mathematics, machine learning, simulations, signal processing, operations research, computer science, pattern recognition, and physics. Because of its tutorial style, it should also appeal to beginners learning about Markov processes, time series, and data science techniques in general, offering fresh, off-the-beaten-path content not found anywhere else, contrasting with the material covered again and again in countless, identical books, websites, and classes catering to students and researchers alike. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1529825331?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1529825331?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Some problems discussed here could be used by college professors in the classroom, or as original exam questions, while others are extremely challenging questions that could be the subject of a PhD thesis or even well beyond that level. This article constitutes (along with my book) a stepping stone in my endeavor to solve one of the biggest mysteries in the universe: are the digits of mathematical constants such as Pi evenly distributed? To this day, no one knows if these digits even have a distribution to start with, let alone whether that distribution is uniform or not. Part of the discussion is about the statistical properties of numeration systems in a non-integer base (such as the golden ratio base) and their applications. All systems investigated here, whether deterministic or not, are treated as stochastic processes, including the digits in question. They all exhibit strong chaos, albeit easily manageable due to their ergodicity.</p>
<p>Interesting connections to the golden ratio, Fibonacci numbers, Pisano periods, special polynomials, Brownian motions, and other special mathematical constants, are discussed throughout the article. All the analyses were done in Excel. You can download my spreadsheets from this article; all the results are replicable. Also, numerous illustrations are provided. </p>
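<p>To give a concrete feel for numeration systems in a non-integer base, here is a minimal sketch that generates digits of a seed in the golden ratio base, using the standard greedy expansion (digit equals the integer part of b times the current state, then keep the fractional part); the article's exact definition of a b-process may differ from this simplified version:</p>
<pre>
# Minimal sketch: digits of a seed x in a non-integer base b via the greedy
# expansion x -> b*x, digit = floor(b*x), keep the fractional part as the next state.
import math

def digits_in_base_b(x, b, n_digits=30):
    digits = []
    for _ in range(n_digits):
        y = b * x
        d = math.floor(y)
        digits.append(d)
        x = y - d            # fractional part becomes the next state
    return digits

phi = (1 + math.sqrt(5)) / 2                 # golden ratio base
print(digits_in_base_b(math.pi - 3, phi))    # digits (0 or 1) of a sample seed in base phi
</pre>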
<p><a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">Read the full article here</a>.</p>
<p><strong>Content of this article</strong></p>
<p>1. General framework, notations and terminology</p>
<ul>
<li>Finding the equilibrium distribution</li>
<li>Auto-correlation and spectral analysis</li>
<li>Ergodicity, convergence, and attractors</li>
<li>Space state, time state, and Markov chain approximations</li>
<li>Examples</li>
</ul>
<p>2. Case study</p>
<ul>
<li>First fundamental theorem</li>
<li>Second fundamental theorem</li>
<li>Convergence to equilibrium: illustration</li>
</ul>
<p>3. Applications</p>
<ul>
<li>Potential application domains</li>
<li>Example: the golden ratio process</li>
<li>Finding other useful b-processes</li>
</ul>
<p>4. Additional research topics</p>
<ul>
<li>Perfect stochastic processes</li>
<li>Characterization of equilibrium distributions (the attractors)</li>
<li>Probabilistic calculus and number theory, special integrals</li>
</ul>
<p>5. Appendix</p>
<ul>
<li>Computing the auto-correlation at equilibrium</li>
<li>Proof of the first fundamental theorem</li>
<li>How to find the exact equilibrium distribution</li>
</ul>
<p>6. Additional Resources</p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/fascinating-new-results-in-the-theory-of-randomness" target="_blank" rel="noopener">Read the full article here</a>.</p>How to Automatically Determine the Number of Clusters in your Data - and moretag:www.analyticbridge.datasciencecentral.com,2019-03-14:2004291:BlogPost:3912662019-03-14T00:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Determining the number of clusters when performing unsupervised clustering is a tricky problem. Many data sets don't exhibit well-separated clusters, and two human beings asked to visually tell the number of clusters by looking at a chart are likely to provide two different answers. Sometimes clusters overlap with each other, and large clusters contain sub-clusters, making a decision not easy.</p>
<p>For instance, how many clusters do you see in the picture below? What is the optimum number of clusters? No one can tell with certainty, not AI, not a human being, not an algorithm. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1405294997?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1405294997?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>How many clusters here? (source: see<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-wizardry" target="_blank" rel="noopener">here</a>)</em></p>
<p>In the above picture, the underlying data suggests that there are three main clusters. But an answer such as 6 or 7 seems equally valid. </p>
<p>A number of empirical approaches have been used to determine the number of clusters in a data set. They usually fit into two categories:</p>
<ul>
<li>Model fitting techniques: an example is using a <a href="https://www.datasciencecentral.com/profiles/blogs/decomposition-of-statistical-distributions-using-mixture-models-a" target="_blank" rel="noopener">mixture model</a> to fit your data and determine the optimum number of components; or using density estimation techniques and testing for the number of modes (see <a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-original-underused-statistical-tests" target="_blank" rel="noopener">here</a>). Sometimes, the fit is compared with that of a model where observations are uniformly distributed on the entire support domain, thus with no cluster; you may have to estimate the support domain in question, and assume that it is not made of disjoint sub-domains; in many cases, the convex hull of your data set, as an estimate of the support domain, is good enough.</li>
<li>Visual techniques: for instance, the silhouette or elbow rule (very popular).</li>
</ul>
<p>In both cases, you need a criterion to determine the optimum number of clusters. In the case of the elbow rule, one typically uses the percentage of unexplained variance. This number is 100% with zero clusters, and it decreases (initially sharply, then more modestly) as you increase the number of clusters in your model. When each point constitutes a cluster, this number drops to 0. Somewhere in between, the curve that displays your criterion exhibits an elbow (see picture below), and that elbow determines the number of clusters. For instance, in the chart below, the optimum number of clusters is 4.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1405610723?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1405610723?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><em>The elbow rule tells you that here, your data set has 4 clusters (elbow strength in red)</em></p>
<p>Good references on the topic are available. Some R functions are available too, for instance fviz_nbclust. However, I could not find in the literature how the elbow point is explicitly computed. Most references mention that it is mostly hand-picked by visual inspection, or based on some predetermined but arbitrary threshold. In the next section, we solve this problem.</p>
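<p>As a hedged illustration of what an explicit elbow computation might look like (the article defines its own elbow strength; this generic sketch simply picks the cluster count where the second difference, i.e. the discrete curvature, of the unexplained-variance curve peaks):</p>
<pre>
# Generic sketch: pick the elbow as the k maximizing the second difference
# (discrete curvature) of the percentage-of-unexplained-variance curve.
import numpy as np

def elbow_point(unexplained):
    """unexplained[k-1] = % of unexplained variance with k clusters (k = 1, 2, ...)."""
    u = np.asarray(unexplained, dtype=float)
    curvature = u[:-2] - 2 * u[1:-1] + u[2:]      # second difference at k = 2..K-1
    return int(np.argmax(curvature)) + 2          # +2 converts index to cluster count

curve = [100, 60, 35, 15, 12, 10, 9, 8.5]         # illustrative values, elbow near k = 4
print(elbow_point(curve))                         # prints 4
</pre>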
<p><a href="https://www.datasciencecentral.com/profiles/blogs/how-to-automatically-determine-the-number-of-clusters-in-your-dat" target="_blank" rel="noopener">Read full article here</a>. </p>Deep Analytical Thinking and Data Science Wizardrytag:www.analyticbridge.datasciencecentral.com,2019-03-07:2004291:BlogPost:3913552019-03-07T20:46:51.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Complex models are often not enough (or too heavy), or not even necessary, to get great, robust, sustainable insights out of data. Deep analytical thinking may prove more useful, and can be done by people not necessarily trained in data science, even by people with limited coding experience. Here we explore what we mean by deep analytical thinking, using a case study, and how it works: combining craftsmanship, business acumen, and the use and creation of tricks and rules of thumb, to provide sound answers to business problems. These skills are usually acquired by experience more than by training, and data science generalists (see <a href="https://www.datasciencecentral.com/profiles/blogs/why-you-should-be-a-data-science-generalist" target="_blank" rel="noopener">here</a> how to become one) usually possess them.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1308372685?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1308372685?profile=RESIZE_710x" class="align-center"/></a></p>
<p>This article is targeted to data science managers and decision makers, as well as to junior professionals who want to become one at some point in their career. Deep thinking, unlike deep learning, is also more difficult to automate, so it provides better job security. Those automating deep learning are actually the new data science wizards, who can think out-of-the box. Much of what is described in this article is also data science wizardry, and not taught in standard textbooks nor in the classroom. By reading this tutorial, you will learn and be able to use these data science secrets, and possibly change your perspective on data science. Data science is like an iceberg: everyone knows and can see the tip of the iceberg (regression models, neural nets, cross-validation, clustering, Python, and so on, as presented in textbooks.) Here I focus on the unseen bottom, using a statistical level almost accessible to the layman, avoiding jargon and complicated math formulas, yet discussing a few advanced concepts. <span> </span></p>
<p><span><a href="https://www.datasciencecentral.com/profiles/blogs/data-science-wizardry" target="_blank" rel="noopener">Read full article here</a>. </span></p>
<p><strong>Content</strong></p>
<p>1. Case Study: The Problem</p>
<p>2. Deep Analytical Thinking</p>
<ul>
<li>Answering hidden questions</li>
<li>Business questions</li>
<li>Data questions</li>
<li>Metrics questions</li>
</ul>
<p>3. Data Science Wizardry</p>
<ul>
<li>Generic algorithm</li>
<li>Illustration with three different models</li>
<li>Results</li>
</ul>
<p>4. A few data science hacks</p>
<p><strong>New Perspectives on Statistical Distributions and Deep Learning</strong> (February 23, 2019)</p>
<p>In this data science article, emphasis is placed on <em>science</em>, not just on data. State-of-the-art material is presented in simple English, from multiple perspectives: applications, theoretical research asking more questions than it answers, scientific computing, machine learning, and algorithms. I attempt here to lay the foundations of a new statistical technology, hoping that it will plant the seeds for further research on a topic with a broad range of potential applications. It is based on mixture models. Mixtures have been studied and used in applications for a long time, and they are still a subject of active research. Yet you will find here plenty of new material.</p>
<p><span><strong>Introduction and Context</strong></span></p>
<p>In a previous article (see <a href="https://www.datasciencecentral.com/profiles/blogs/new-perspective-on-central-limit-theorem-and-related-stats-topics" target="_blank" rel="noopener">here</a>) I attempted to approximate a random variable representing real data by a weighted sum of simple <em>kernels</em>, such as independently and identically distributed uniform random variables. The purpose was to build Taylor-like series approximations to more complex models (each term in the series being a random variable), to</p>
<ul>
<li>avoid over-fitting,</li>
<li>approximate any empirical distribution (the inverse of the percentiles function) attached to real data,</li>
<li>easily compute data-driven confidence intervals regardless of the underlying distribution,</li>
<li>derive simple tests of hypothesis,</li>
<li>perform model reduction, </li>
<li>optimize data binning to facilitate feature selection, and to improve visualizations of histograms</li>
<li>create perfect histograms,</li>
<li>build simple density estimators,</li>
<li>perform interpolations, extrapolations, or predictive analytics,</li>
<li>perform clustering and detect the number of clusters,</li>
<li>create deep learning Bayesian systems.</li>
</ul>
<p>While I found very interesting properties of stable distributions during this research project, I could not come up with a solution that solves all these problems. The fact is that these weighted sums would usually converge (in distribution) to a normal distribution if the weights did not decay too fast -- a consequence of the central limit theorem. And even when using uniform kernels (as opposed to Gaussian ones) with fast-decaying weights, they would converge to an almost symmetrical, Gaussian-like distribution. In short, very few real-life data sets could be approximated by this type of model.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1187940877?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1187940877?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Now, in this article, I offer a full solution, using mixtures rather than sums. The possibilities are endless. </p>
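<p>As a quick illustration of the mixture idea (not the fitting algorithm developed in the article), a small Gaussian mixture fitted with scikit-learn's EM implementation can already capture an asymmetric, bimodal empirical distribution:</p>
<pre>
# Illustration only: fit a two-component Gaussian mixture to bimodal data with
# scikit-learn's EM-based GaussianMixture, and inspect the fitted parameters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 700), rng.normal(3, 1.5, 300)]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(data)
print("weights:", gm.weights_)
print("means:  ", gm.means_.ravel())
print("stdevs: ", np.sqrt(gm.covariances_).ravel())
</pre>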
<p><span style="font-size: 14pt;"><strong>Content of this article</strong></span></p>
<p><strong>1. Introduction and Context</strong></p>
<p><strong>2. Approximations Using Mixture Models</strong></p>
<ul>
<li>The error term</li>
<li>Kernels and model parameters</li>
<li>Algorithms to find the optimum parameters</li>
<li>Convergence and uniqueness of solution</li>
<li>Find near-optimum with fast, black-box step-wise algorithm</li>
</ul>
<p><strong>3. Example</strong></p>
<ul>
<li>Data and source code</li>
<li>Results</li>
</ul>
<p><strong>4. Applications</strong></p>
<ul>
<li>Optimal binning</li>
<li>Predictive analytics</li>
<li>Test of hypothesis and confidence intervals</li>
<li>Deep learning: Bayesian decision trees</li>
<li>Clustering</li>
</ul>
<p><strong>5. Interesting problems</strong></p>
<ul>
<li>Gaussian mixtures uniquely characterize a broad class of distributions</li>
<li>Weighted sums fail to achieve what mixture models do</li>
<li>Stable mixtures</li>
<li>Nested mixtures and Hierarchical Bayesian Systems</li>
<li>Correlations</li>
</ul>
<p>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/decomposition-of-statistical-distributions-using-mixture-models-a" target="_blank" rel="noopener">here</a>. </p>A Plethora of Original, Not Well-Known Statistical Teststag:www.analyticbridge.datasciencecentral.com,2019-02-14:2004291:BlogPost:3909012019-02-14T02:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Many of the following statistical tests are rarely discussed in textbooks or in college classes, much less in data camps. Yet they help answer a lot of different and interesting questions. I used most of them without even computing the underlying distribution under the null hypothesis, but instead, using simulations to check whether my assumptions were plausible or not. In short, my approach to statistical testing is model-free and data-driven. Some tests are easy to implement even in Excel. Some of them are illustrated here, with examples that do not require statistical knowledge for understanding or implementation.</p>
<p>This material should appeal to managers, executives, industrial engineers, software engineers, operations research professionals, economists, and to anyone dealing with data, such as biometricians, analytical chemists, astronomers, epidemiologists, journalists, or physicians. Statisticians with a different perspective are invited to discuss my methodology and the tests described here, in the comment section at the bottom of this article. In my case, I used these tests mostly in the context of experimental mathematics, which is a branch of data science that few people talk about. In that context, the theoretical answer to a statistical test is sometimes known, making it a great benchmarking tool to assess the power of these tests, and determine the minimum sample size to make them valid.</p>
<p>I provide here a general overview, as well as my simple approach to statistical testing, accessible to professionals with little or no formal statistical training. Detailed applications of these tests are found<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">in my recent book</a> and in<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/new-perspective-on-central-limit-theorem-and-related-stats-topics" target="_blank" rel="noopener">this article</a>. Precise references to these documents are provided as needed, in this article.</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/1048765124?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/1048765124?profile=RESIZE_710x" class="align-center"/></a></p>
<p style="text-align: center;"><span><em>Examples of traditional tests</em></span></p>
<p><span><strong>1. General Methodology</strong></span></p>
<p>Despite my strong background in statistical science, over the years, I moved away from relying too much on traditional statistical tests and statistical inference. I am not the only one: these tests have been abused and misused, see for instance<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/statistical-significance-and-p-values-take-another-blow" target="_blank" rel="noopener">this article</a><span> </span>on<span> </span><em>p</em>-hacking. Instead, I favored a methodology of my own, mostly empirical, based on simulations, data- rather than model-driven. It is essentially a non-parametric approach. It has the advantage of being far easier to use, implement, understand, and interpret, especially to the non-initiated. It was initially designed to be integrated in black-box, automated decision systems. Here I share some of these tests, and many can be implemented easily in Excel. </p>
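<p>As one concrete, hedged example of this simulation-based, model-free style of testing (not necessarily one of the specific tests described in the full article), here is a two-sample permutation test that makes no distributional assumptions; the sample sizes and effect size below are illustrative.</p>
<pre>
import numpy as np

def permutation_test(x, y, n_perm=10_000, seed=0):
    """Model-free test of 'same mean' via random relabeling of the pooled sample."""
    rng = np.random.default_rng(seed)
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[:len(x)].mean() - pooled[len(x):].mean())
        if diff >= observed:
            count += 1
    return count / n_perm        # empirical p-value

rng = np.random.default_rng(42)
x = rng.normal(0.0, 1.0, 200)
y = rng.normal(0.3, 1.0, 200)
print("empirical p-value ~", permutation_test(x, y))
</pre>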
<p><em><a href="https://www.datasciencecentral.com/profiles/blogs/a-plethora-of-original-underused-statistical-tests" target="_blank" rel="noopener">Read the full article here</a>. </em></p>Machine Learning Glossarytag:www.analyticbridge.datasciencecentral.com,2019-02-12:2004291:BlogPost:3909852019-02-12T19:31:40.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><span>For background to this post, please see </span><a rel="nofollow noopener" href="https://www.datasciencecentral.com/profiles/blogs/learn-machinelearning-coding-basics-in-a-weekend-a-new-approach" target="_blank">Learn Machine Learning Coding Basics in a weekend</a><span>. Here, we present the glossary that we use for the coding and the mindmap attached to these classes and upcoming book. About 80 terms are included in the glossary, covering Ensembles, Regression, Classification, Algorithms, Training, Validation, Model Evaluation and more. For instance, the section about Classification contains the following entries (a few of them are made concrete in the short code sketch after the list):</span></p>
<ul>
<li>Class </li>
<li>Hyperplane </li>
<li>Decision Boundary </li>
<li>False Negative (FN) </li>
<li>False Positive (FP) </li>
<li>True Negative (TN) </li>
<li>True Positive (TP) </li>
<li>Precision </li>
<li>Recall </li>
<li>F1 Score </li>
<li>Few-Shot Learning </li>
<li>Hinge Loss </li>
<li>Log Loss </li>
</ul>
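<p>To make a few of these entries concrete, here is a minimal sketch -- the confusion-matrix counts are hypothetical, chosen only for illustration -- computing Precision, Recall, and the F1 Score from true/false positives and negatives.</p>
<pre>
# Hypothetical confusion-matrix counts (illustration only, not from the glossary).
tp, fp, fn, tn = 80, 10, 20, 890

precision = tp / (tp + fp)       # of the predicted positives, how many are correct
recall = tp / (tp + fn)          # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(precision, recall, f1)     # roughly 0.889, 0.800, 0.842
</pre>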
<p>To download the glossary, <a href="https://www.datasciencecentral.com/profiles/blogs/learn-machinelearning-coding-basics-in-a-weekend-glossary-and" target="_blank" rel="noopener">follow this link</a>. </p>
<p><span style="font-size: 12pt;"><strong>DSC Resources</strong></span></p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-books-and-resources-for-dsc-members">Free Books</a></li>
<li><a href="https://www.datasciencecentral.com/forum">Forum Discussions</a></li>
<li><a href="https://www.datasciencecentral.com/page/search?q=cheat+sheets">Cheat Sheets</a></li>
<li><a href="https://www.analytictalent.datasciencecentral.com/">Jobs</a></li>
<li><a href="https://www.datasciencecentral.com/page/search?q=one+picture" target="_blank" rel="noopener">Search DSC</a></li>
<li><a href="https://twitter.com/DataScienceCtrl" target="_self">DSC on Twitter</a></li>
<li><a href="https://www.facebook.com/DataScienceCentralCommunity/" target="_blank" rel="noopener">DSC on Facebook</a></li>
</ul>Alternatives to Logistic Regressiontag:www.analyticbridge.datasciencecentral.com,2019-02-07:2004291:BlogPost:3909802019-02-07T22:23:19.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong>Logistic regression (LR)</strong><span> </span>models estimate the probability of a binary response, based on one or more predictor variables. Unlike linear regression models, the dependent variables are categorical. LR has become very popular, perhaps because of the wide availability of the procedure in software. Although LR is a good choice for many situations, it doesn't work well for<span> </span><em>all</em><span> </span>situations. For example:</p>
<ul>
<li>In propensity score analysis where there are many covariates, LR performs poorly.</li>
<li>For classifications, LR usually requires more variables to achieve the same (or better) misclassification rate than Support Vector Machines (SVM) for multivariate and mixture distributions.</li>
</ul>
<p>In addition, LR is prone to issues like overfitting and multicollinearity.</p>
<p></p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/995605995?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/995605995?profile=RESIZE_710x" class="align-center"/></a></p>
<p></p>
<p>A<span> </span><strong>wide range of alternatives</strong><span> </span>is available, from statistics-based procedures (e.g. log binomial, ordinary or modified Poisson regression and Cox regression) to those rooted more deeply in data science such as machine learning and neural network theory. Which one you choose depends largely on what tools you have available to you, what theory (e.g. statistics vs. neural networks) you want to work with, and what you're trying to achieve with your data. For example, tree-based methods are a good alternative for assessing risk factors, while Neural Networks (NN) and Support Vector Machines (SVM) work well for propensity score estimation and categorization/classification.</p>
<p>There are literally hundreds of viable alternatives to logistic regression, so it isn't possible to discuss them all within the confines of a single blog post. What follows is an outline of some of the more popular choices.</p>
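<p>As a hedged, minimal illustration of putting LR next to one such alternative (the synthetic dataset and the choice of a random forest are assumptions made for the example, not recommendations from the article), a quick cross-validated comparison with scikit-learn might look like this.</p>
<pre>
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary-classification data (assumption): many covariates, some noise.
X, y = make_classification(n_samples=2000, n_features=30, n_informative=10, random_state=0)

for name, model in [("logistic regression", LogisticRegression(max_iter=1000)),
                    ("random forest", RandomForestClassifier(random_state=0))]:
    score = cross_val_score(model, X, y, cv=5).mean()
    print(name, "mean accuracy:", round(score, 3))
</pre>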
<p><em>Read the full article, <a href="https://www.datasciencecentral.com/profiles/blogs/alternatives-to-logistic-regression" target="_blank" rel="noopener">here</a>. </em></p>From Infinite Matrices to New Integration Formulatag:www.analyticbridge.datasciencecentral.com,2019-02-04:2004291:BlogPost:3910832019-02-04T00:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>This is another interesting problem, off-the-beaten-path. It ends up with a formula to compute the integral of a function, based on its derivatives solely. </p>
<p>For simplicity, I'll start with some notations used in the context of matrix theory, familiar to everyone: T(<em>f</em>) = <em>g</em>, where <em>f</em> and <em>g</em> are vectors, and T a square matrix. The notation T(<em>f</em>) represents the product between the matrix T, and the vector <em>f</em>. Now, imagine that the dimensions are infinite, with <em>f</em> being a vector whose entries represent all the real numbers in some peculiar order. </p>
<p>In mathematical analysis, T is called an operator, mapping all real numbers (represented by the vector <em>f</em>) onto another infinite vector <em>g</em>. In other words, <em>f</em> and <em>g</em> can be viewed as real-valued functions, and T transforms the function <em>f</em> into a new function <em>g</em>. A simple case is when T is the derivative operator, transforming any function <em>f</em> into its derivative <em>g</em> = d<i>f/</i>d<i>x</i>. We define the powers of T as T^0 = I (the identity operator, with I(<em>f</em>) = <em>f</em>), T^2(<em>f</em>) = T(T(<em>f</em>)), T^3(<em>f</em>) = T(T^2(<em>f</em>)) and so on, just like the powers of a square matrix. Now let the fun begin.</p>
<p><strong>Exponential of the Derivative Operator</strong></p>
<p>We assume here that T is the derivative operator. Using the same notation as above, we have the same formula as if T were a matrix:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/954724656?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/954724656?profile=RESIZE_710x" class="align-center"/></a></p>
<p>Applied to a function <em>f</em>, we have:</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/957090133?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/957090133?profile=RESIZE_710x" class="align-center"/></a></p>
<p>This is a simple application of Taylor series. So the exponential of the derivative operator is a shift operator.</p>
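<p>Since the formulas above are embedded as images, here is the standard form of the identity in question, written in LaTeX as a reconstruction rather than a verbatim copy, assuming a shift parameter <em>a</em>:</p>
<p>$$ e^{aT} f(x) \;=\; \sum_{n=0}^{\infty} \frac{a^n}{n!}\, f^{(n)}(x) \;=\; f(x+a). $$</p>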
<p><strong>Inverse of the Derivative Operator</strong></p>
<p>Likewise, as for matrices, we can define the inverse of T as</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/954798699?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/954798699?profile=RESIZE_710x" class="align-center"/></a></p>
<p>If T were a matrix, the condition for convergence is that <span>all of the eigenvalues of T - I have absolute value smaller than 1.</span> For the derivative operator T applied to a function <em>f</em>, and under some conditions that guarantee convergence, it is easy to show that</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/954846726?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/954846726?profile=RESIZE_710x" class="align-center"/></a></p>
<p>The coefficients (for instance 1, -4, 6, -4, 1 in the last term displayed above) are just the binomial coefficients, with alternating signs.</p>
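<p>Because these formulas are also images, here is a reconstruction of what the series looks like, based on the surrounding text about the eigenvalues of T - I and the alternating binomial coefficients; treat it as a sketch, not a verbatim copy of the original:</p>
<p>$$ T^{-1} \;=\; \big(I - (I - T)\big)^{-1} \;=\; \sum_{n=0}^{\infty} (I - T)^n, \qquad (I-T)^n f \;=\; \sum_{k=0}^{n} (-1)^k \binom{n}{k}\, f^{(k)}. $$</p>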
<p>We call the inverse of the derivative operator the <em>pseudo-integral</em> operator. It is easy to prove that the pseudo-integral operator (as defined above), applied to the exponential function, yields the exponential function itself. So the exponential function is a fixed point (the only continuous one) of the pseudo-integral operator. More interestingly, in this case, the pseudo-integral operator is just the standard integral operator: they are both the same. Is this always the case regardless of the function <em>f</em>? It turns out that this is true for any function <em>f</em> that can be written as </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/954958880?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/954958880?profile=RESIZE_710x" class="align-center"/></a></p>
<p>This covers a large class of functions, especially since the coefficients can also be complex numbers. These functions usually have a Taylor series expansion too. However, the formula does not apply to functions such as polynomials, because the series fails to converge in that case.</p>
<p>In short, we have found a formula to compute the integral of a function, based solely on the function itself and its successive derivatives. The same technique can be used to invert more complicated linear operators, such as Laplace transforms.</p>
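<p>As a hedged numerical sanity check of this claim, restricted to the exponential family where T acts as multiplication by a scalar, here is a short sketch; the truncation length, test point, and value of <em>b</em> are assumptions chosen for illustration.</p>
<pre>
import numpy as np

def pseudo_integral_exp(b, x, n_terms=50):
    # Truncated series sum_{n>=0} (I - T)^n applied to f(x) = exp(b*x).
    # Since T f = b f for exponentials, each term reduces to (1 - b)**n * f(x).
    f = np.exp(b * x)
    return sum((1.0 - b) ** n * f for n in range(n_terms))

x, b = 1.3, 0.7                  # convergence needs |1 - b| smaller than 1
approx = pseudo_integral_exp(b, x)
exact = np.exp(b * x) / b        # antiderivative of exp(b*x), up to a constant
print(approx, exact)             # the two values agree closely
</pre>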
<p><strong>Exercise</strong></p>
<p>Apply the derivative operator to the pseudo-integral of a function <em>f</em>, using the above formula for the pseudo-integral. The result should be equal to <em>f</em>. This is the case if <em>f</em> belongs to the same family of functions as described above. Can you identify functions not belonging to that family of functions, for which the theory is still valid? Hint: try <em>f</em>(<em>x</em>) = exp(<i>b</i> <em>x</em>^2) or <em>f</em>(<em>x</em>) = <em>x</em> exp(<em>b</em> <em>x</em>), where <i>b</i> is a parameter.</p>
<p><em>To not miss this type of content in the future,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter">subscribe</a><span> </span>to our newsletter. For related articles from the same author, <a href="http://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles" target="_blank" rel="noopener">click here</a><span> </span>or visit<span> </span><a href="http://www.vincentgranville.com/" target="_blank" rel="noopener">www.VincentGranville.com</a>. Follow me<span> </span><a href="https://www.linkedin.com/in/vincentg/" target="_blank" rel="noopener">on LinkedIn</a>, or visit my old web page<span> </span><a href="http://www.datashaping.com">here</a>.</em></p>
<p><span style="font-size: 14pt;"><b>DSC Resources</b></span></p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-books-and-resources-for-dsc-members">Book and Resources for DSC Members</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/comprehensive-repository-of-data-science-and-ml-resources">Comprehensive Repository of Data Science and ML Resources</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between ML, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles">Selected Business Analytics, Data Science and ML articles</a></li>
<li><a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a><span> </span>|<span> </span><a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a><span> </span>|<span> </span><a href="http://www.analytictalent.com">Find a Job</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a><span> </span>|<span> </span><a href="http://www.datasciencecentral.com/forum/topic/new">Forum Questions</a></li>
</ul>
<p><span>Follow us: </span><a href="https://twitter.com/DataScienceCtrl">Twitter</a><span> | </span><a href="https://www.facebook.com/DataScienceCentralCommunity/">Facebook</a></p>Top 10 Technology Trends of 2019tag:www.analyticbridge.datasciencecentral.com,2019-01-29:2004291:BlogPost:3911522019-01-29T18:43:29.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p class="justifyfull" dir="ltr"><span>First days after the celebration of the New Year is the time when looking back we can analyze our actions, promises and draw conclusions whether our predictions and expectations came true. As 2018 came to its end, it is perfect time to analyze it and to set trends for the next year. The amount of data generated every minute is enormous. Therefore new approaches, techniques, and solutions have been developed.…</span></p>
<p class="justifyfull" dir="ltr"><span>First days after the celebration of the New Year is the time when looking back we can analyze our actions, promises and draw conclusions whether our predictions and expectations came true. As 2018 came to its end, it is perfect time to analyze it and to set trends for the next year. The amount of data generated every minute is enormous. Therefore new approaches, techniques, and solutions have been developed.</span></p>
<p class="justifyfull" dir="ltr"><span>Looking back to our article</span><span> </span><a rel="nofollow noopener" href="https://www.datasciencecentral.com/profiles/blogs/blog/the-top-10-technology-trends-of-2018/" target="_blank"><span>Top 10 Technology Trends of 2018</span></a><span> </span><span>we can say that we were preparing you for the upcoming changes related to aspects of security, changes provoked by the AI in business operations, extensive application of blockchains, further development of the Internet of Things (IoT), growing of NLP, etc. Some of these statements have been implemented in 2018, yet some will remain topical in 2019 as well. Only one factor remains stable - development. There is no doubt, the technologies will continue to develop, improve and upgrade to fit their purposes better.</span></p>
<p class="justifyfull" dir="ltr"><span>Primarily smart data technologies were actively applied only by huge enterprises and corporations. Today, big data has become available to a wide range of small businesses and companies. Both big enterprises and small companies tend to rely on big data in the questions of the intelligent business insights in their decision-making.</span></p>
<p class="justifyfull" dir="ltr"><span>The ever-growing stream of data may also present a challenge to business people. The prediction of changes in the role of big data and technologies is even more difficult. Thus, our top technology trends of 2019 are to serve a comprehensible roadmap for you.</span></p>
<p class="justifyfull" dir="ltr"><strong>2019 Trends</strong></p>
<p>1. Data security will reinforce its positions</p>
<p>2. Internet of Things will deliver new opportunities</p>
<p>3. Automation continues to be game-changing</p>
<p>4. AR is expected to overcome VR</p>
<p>To read the 10 trends with detailed information for each trend, <a href="https://www.datasciencecentral.com/profiles/blogs/top-10-technology-trends-of-2019" target="_blank" rel="noopener">follow this link</a>. </p>
<p><span style="font-size: 14pt;"><b>DSC Resources</b></span></p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-books-and-resources-for-dsc-members">Book and Resources for DSC Members</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/comprehensive-repository-of-data-science-and-ml-resources">Comprehensive Repository of Data Science and ML Resources</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between ML, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles">Selected Business Analytics, Data Science and ML articles</a></li>
<li><a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a><span> </span>|<span> </span><a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a><span> </span>|<span> </span><a href="http://www.analytictalent.com">Find a Job</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a><span> </span>|<span> </span><a href="http://www.datasciencecentral.com/forum/topic/new">Forum Questions</a></li>
</ul>
<p><span>Follow us: </span><a href="https://twitter.com/DataScienceCtrl">Twitter</a><span> | </span><a href="https://www.facebook.com/DataScienceCentralCommunity/">Facebook</a></p>
<h2 class="justifyfull"></h2>
<p><span style="font-size: 12pt;"><strong><a href="https://www.facebook.com/DataScienceCentralCommunity/"></a></strong></span></p>Great Sunday Readingtag:www.analyticbridge.datasciencecentral.com,2019-01-27:2004291:BlogPost:3909422019-01-27T22:20:28.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Extract from the upcoming Monday newsletter published by Data Science Central. Previous editions can be found <a href="https://www.datasciencecentral.com/page/previous-digests" rel="noopener" target="_blank" title="This external link will open in a new window">here</a>. The contribution flagged with a + is our selection for the picture of the week. To subscribe,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" rel="noopener" target="_blank" title="This external link will open in a new window">follow this link</a>. To check the full digest and see the picture of the week, follow<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/weekly-digest-january-28" target="_blank" rel="noopener">this link</a>. </p>
<p><strong>Featured Resources and Technical Contributions</strong> </p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/23-statistical-concepts-explained-in-simple-english-part-7">23 Statistical Concepts Explained in Simple English - Part 7</a><span> </span></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-home-sales-projection-a-time-series-forecasting">New Home Sales Projection: Time Series Forecasting</a><span> </span></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/best-dynamically-typed-programming-languages-for-data-analysis">Best dynamically-typed programming languages for data analysis</a><span> </span></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/the-10-statistical-techniques-data-scientists-need-to-master-10">The 10 Statistical Techniques Data Scientists Need to Master</a><span> </span></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/sip-text-log-analysis-using-pandas">SIP text log analysis using Pandas</a><span> </span></li>
</ul>
<p><strong>Featured Articles and Forum Questions</strong></p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/how-to-flourish-in-industry-4-0-the-fourth-industrial-revolution">How to Flourish in Industry 4.0, the Fourth Industrial Revolution</a> +</li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/advice-to-a-fresh-graduate-for-getting-a-job-in-ai-data-science">Advice to a fresh graduate for getting a job in AI/ Data Science</a> </li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/top-10-technology-trends-of-2019-1" target="_blank" rel="noopener">Top 10 Technology Trends of 2019</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/data-driven-marketing-strategy-spatial-analytics-for-micro-2">Spatial Analytics for Micro-marketing</a> </li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/doctors-are-from-venus-data-scientists-from-mars-or-why-ai-ml-is-">Why AI/ML is Moving so Slowly in Healthcare</a> </li>
<li><a href="https://www.analyticbridge.datasciencecentral.com/profiles/blogs/turn-customer-reviews-into-business-growth">Mining Customer Reviews to drive Business Growth</a> </li>
<li><a href="https://www.analyticbridge.datasciencecentral.com/profiles/blogs/graph-analytics-to-reinforce-anti-fraud-programs">Graph Analytics to Reinforce Anti-fraud Programs</a> </li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/bill-schmarzo-s-retrospective-data-science-ml-big-data-analytics-">Bill Schmarzo's Retrospective: Data Science, ML, Big Data Analytics...</a> </li>
</ul>
<p>Follow us: <a href="https://twitter.com/DataScienceCtrl" target="_blank" title="This external link will open in a new window" rel="noopener">Twitter</a><span> | </span><a href="https://www.facebook.com/DataScienceCentralCommunity/" target="_blank" title="This external link will open in a new window" rel="noopener">Facebook</a>. </p>
<p></p>Great Sunday Readingtag:www.analyticbridge.datasciencecentral.com,2019-01-20:2004291:BlogPost:3908042019-01-20T19:15:49.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>Extract from the upcoming Monday newsletter published by Data Science Central. Previous editions can be found <a href="https://www.datasciencecentral.com/page/previous-digests" rel="noopener" target="_blank" title="This external link will open in a new window">here</a>. The contribution flagged with a + is our selection for the picture of the week. To subscribe,<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" rel="noopener" target="_blank" title="This external link will open in a new window">follow this link</a>. To check the full digest and see the picture of the week, follow<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/weekly-digest-january-21" rel="noopener" target="_blank" title="This external link will open in a new window">this link</a>. </p>
<p><strong>Featured Resources and Technical Contributions</strong> </p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/your-guide-to-natural-language-processing-nlp" target="_blank" title="This external link will open in a new window" rel="noopener">Your Guide to Natural Language Processing (NLP)</a><span> </span></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/stocks-significance-testing-amp-p-hacking-how-volatile-is" target="_blank" title="This external link will open in a new window" rel="noopener">Stocks, Significance Testing & p-Hacking:<span> </span></a>How volatile is volatile?</li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/the-mathematics-of-data-science-understanding-the-foundations-of" target="_blank" title="This external link will open in a new window" rel="noopener">The Mathematics of Data Science<span> </span></a>- Understanding the foundations of Deep Learning through Linear Regression</li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/tableau-in-10-minutes-step-by-step-guide" target="_blank" title="This external link will open in a new window" rel="noopener">Tableau in 10 Minutes: Step-by-Step Guide</a><span> </span></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/pancake-a-python-package-for-model-stacking" target="_blank" title="This external link will open in a new window" rel="noopener">Pancake: A Python package for model stacking</a><span> </span></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/900-most-popular-ds-ml-articles-in-2018" target="_blank" title="This external link will open in a new window" rel="noopener">900 Most Popular DS & ML Articles in 2018</a><span> </span></li>
</ul>
<p><strong>Featured Articles and Forum Questions</strong></p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/how-do-you-win-the-data-science-wars-you-cheat-by-doing-the" target="_blank" title="This external link will open in a new window" rel="noopener">How Do You Win the Data Science Wars? </a>You Cheat By Doing The Necessary Pre-work +</li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/supervised-vs-unsupervised-learning-whats-the-big-deal" target="_blank" title="This external link will open in a new window" rel="noopener">Supervised vs Unsupervised Learning...Whats the Big Deal?</a> </li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/exploit-the-economics-of-artificial-intelligence-with-design" target="_blank" title="This external link will open in a new window" rel="noopener">Exploit the Economics of AI with Design Thinking and Data Science</a> </li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/the-ai-ml-opportunity-landscape-in-healthcare-do-it-right-or-it-w" target="_blank" title="This external link will open in a new window" rel="noopener">The AI/ML Opportunity Landscape in Healthcare</a> </li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/19-controversial-articles-about-data-science" target="_blank" title="This external link will open in a new window" rel="noopener">19 Controversial Articles about Data Science</a> </li>
</ul>
<p></p>
<p>Follow us: <a href="https://twitter.com/DataScienceCtrl" target="_blank" title="This external link will open in a new window" rel="noopener">Twitter</a><span> | </span><a href="https://www.facebook.com/DataScienceCentralCommunity/" target="_blank" title="This external link will open in a new window" rel="noopener">Facebook</a>. </p>Understanding the foundations of Deep Learning through Linear Regressiontag:www.analyticbridge.datasciencecentral.com,2019-01-16:2004291:BlogPost:3904972019-01-16T16:48:52.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>This article was written by <a href="https://www.datasciencecentral.com/profile/ajitjaokar" target="_blank" rel="noopener">Ajit Jaokar</a>. </p>
<p>In this longish post, I have tried to explain Deep Learning starting from familiar ideas like machine learning. This approach forms a part of my forthcoming book. I have used this approach in my teaching. It is based on ‘learning by exception,' i.e. understanding one concept and its limitations, and then understanding how the subsequent concept overcomes that limitation.</p>
<p>The roadmap we follow is:</p>
<ul>
<li>Linear Regression</li>
<li>Multiple Linear Regression</li>
<li>Polynomial Regression</li>
<li>General Linear Model</li>
<li>Perceptron Learning</li>
<li>Multi-Layer Perceptron</li>
</ul>
<p>We thus develop a chain of thought that starts with linear regression and extends to multilayer perceptron (Deep Learning). Also, for simplification, I have excluded other forms of Deep Learning such as CNN and LSTM, i.e. we confine ourselves to the multilayer Perceptron when it comes to Deep Learning. Why start with Linear Regression? Because it is an idea familiar to many even at high school levels.</p>
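<p>As a small, hedged illustration of that first jump in the chain -- from linear regression to a multilayer perceptron -- here is a sketch; the toy data and hyperparameters are assumptions made for the example and are not taken from the article.</p>
<pre>
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

# Toy nonlinear target (assumption): a straight line cannot fit it, a small MLP can.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(1000, 1))
y = X.ravel() ** 2 + 0.1 * rng.normal(size=1000)

linear = LinearRegression().fit(X, y)
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0).fit(X, y)

print("linear R^2:", round(linear.score(X, y), 3))   # near 0: a line misses the parabola
print("MLP R^2:   ", round(mlp.score(X, y), 3))      # close to 1
</pre>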
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/779088792?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/779088792?profile=RESIZE_710x" class="align-center"/></a></p>
<p>To read the full article, <a href="https://www.datasciencecentral.com/profiles/blogs/the-mathematics-of-data-science-understanding-the-foundations-of" target="_blank" rel="noopener">follow this link</a>. For more about deep learning, <a href="https://www.datasciencecentral.com/page/search?q=deep+learning" target="_blank" rel="noopener">click here</a>. For more about regression, <a href="https://www.datasciencecentral.com/page/search?q=regression" target="_blank" rel="noopener">click here</a>. </p>
<p><span style="font-size: 14pt;"><b>DSC Resources</b></span></p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/new-books-and-resources-for-dsc-members">Book and Resources for DSC Members</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/comprehensive-repository-of-data-science-and-ml-resources">Comprehensive Repository of Data Science and ML Resources</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between ML, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles">Selected Business Analytics, Data Science and ML articles</a></li>
<li><a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a><span> </span>|<span> </span><a href="http://www.datasciencecentral.com/page/search?q=Python">Search DSC</a><span> </span>|<span> </span><a href="http://www.analytictalent.com">Find a Job</a></li>
<li><a href="http://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a><span> </span>|<span> </span><a href="http://www.datasciencecentral.com/forum/topic/new">Forum Questions</a></li>
</ul>
<p><span>Follow us: </span><a href="https://twitter.com/DataScienceCtrl">Twitter</a><span> | </span><a href="https://www.facebook.com/DataScienceCentralCommunity/">Facebook</a></p>5 Predictions about Data Science, Machine Learning, and AI for 2019tag:www.analyticbridge.datasciencecentral.com,2018-12-21:2004291:BlogPost:3900312018-12-21T01:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong><em>Summary:</em></strong><em> Here are our 5 predictions for data science, machine learning, and AI for 2019. We also take a look back at last year’s predictions to see how we did.</em></p>
<p> </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/401132209?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/401132209?profile=original&width=250" width="250" class="align-right"/></a>It’s that time of year again when we do a look back in order to offer a look forward. What trends will speed up, what things will actually happen, and what things won’t in the coming year for data science, machine learning, and AI.</p>
<p>We’ve been watching and reporting on these trends all year and we scoured the web and some of our professional contacts to find out what others are thinking. </p>
<p> </p>
<p><span><strong>Here’s a Quick Look at Last Year’s Predictions and How We Did.</strong></span></p>
<ol>
<li><em>What we said: Both model production and data prep will become increasingly automated. Larger data science operations will converge on a single platform (of many available). Both of these trends are in response to the groundswell movement for efficiency and effectiveness. In a nutshell allowing fewer data scientists to do the work of many.</em> </li>
</ol>
<p>Clearly a win. <span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/practicing-no-code-data-science" target="_self">No-code data science</a><span> </span>is on the rise, as is end-to-end integration in advanced analytics platforms.</p>
<ol start="2">
<li><em>What we said: Data Science continues to develop specialties that mean the mythical ‘full stack’ data scientist will disappear.</em></li>
</ol>
<p>To read all 2018 predictions, and compare with the updated 2019 version, <a href="https://www.datasciencecentral.com/profiles/blogs/5-predictions-about-data-science-machine-learning-and-ai-for-2019" target="_blank" rel="noopener">click here</a>. </p>
<p><span style="font-size: 14pt;"><strong>Announcement</strong></span></p>
<ul>
<li><a href="https://dsc.news/2UZnoQ6">Leverage All Your Data With Cloud Analytics<span> </span></a>- On-demand Webinar<span> </span></li>
</ul>New Books in AI, Machine Learning, and Data Sciencetag:www.analyticbridge.datasciencecentral.com,2018-12-02:2004291:BlogPost:3896612018-12-02T01:26:14.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>We are in the process of writing and adding new material (compact eBooks) exclusively available to our members, and written in simple English, by world leading experts in AI, data science, and machine learning. In the upcoming months, the following will be added:</p>
<ul>
<li>The Machine Learning Coding Book</li>
<li>Off-the-beaten-path Statistics and Machine Learning Techniques </li>
<li>Encyclopedia of Statistical Science</li>
<li>Original Math, Stat and Probability Problems - with Solutions</li>
<li>Computational Number Theory for Data Scientists</li>
<li>Randomness, Pattern Recognition, Simulations, Signal Processing - New developments</li>
</ul>
<p>We invite you to<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">sign up here</a><span> </span>to not miss these free books. Previous material (also for members only) can be found<span> </span><a href="https://www.datasciencecentral.com/page/member" target="_blank" rel="noopener">here</a>.</p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/135807237?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/135807237?profile=original" class="align-center"/></a></p>
<p></p>
<p>Currently, the following content is available:</p>
<p><strong>1. Book: Enterprise AI - An Application Perspective</strong> </p>
<p>Enterprise AI: An applications perspective takes a use case driven approach to understand the deployment of AI in the Enterprise. Designed for strategists and developers, the book provides a practical and straightforward roadmap based on application use cases for AI in Enterprises. The authors (Ajit Jaokar and Cheuk Ting Ho) are data scientists and AI researchers who have deployed AI applications for Enterprise domains. The book is used as a reference for Ajit and Cheuk's new course on Implementing Enterprise AI.</p>
<p>The table of content is available<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/free-ebook-enterprise-ai-an-applications-perspective" target="_blank" rel="noopener">here</a>. The book can be accessed<span> </span><a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a><span> </span>(members only.)</p>
<p><strong>2. Book: Applied Stochastic Processes</strong></p>
<p>Full title:<span> </span><em>Applied Stochastic Processes, Chaos Modeling, and Probabilistic Properties of Numeration Systems</em>. Published June 2, 2018. Author: Vincent Granville, PhD. (104 pages, 16 chapters.)</p>
<p>This book is intended to professionals in data science, computer science, operations research, statistics, machine learning, big data, and mathematics. In 100 pages, it covers many new topics, offering a fresh perspective on the subject. It is accessible to practitioners with a two-year college-level exposure to statistics and probability. The compact and tutorial style, featuring many applications (Blockchain, quantum algorithms, HPC, random number generation, cryptography, Fintech, web crawling, statistical testing) with numerous illustrations, is aimed at practitioners, researchers and executives in various quantitative fields.</p>
<p>New ideas, advanced topics, and state-of-the-art research are discussed in simple English, without using jargon or arcane theory. It unifies topics that are usually part of different fields (data science, operations research, dynamical systems, computer science, number theory, probability) broadening the knowledge and interest of the reader in ways that are not found in any other book. This short book contains a large amount of condensed material that would typically be covered in 500 pages in traditional publications. Thanks to cross-references and redundancy, the chapters can be read independently, in random order.</p>
<p>The table of content is available<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/fee-book-applied-stochastic-processes" target="_blank" rel="noopener">here</a>. The book can be accessed<span> </span><a href="https://www.datasciencecentral.com/page/free-books-1" target="_blank" rel="noopener">here</a><span> </span>(members only.)</p>
<p><span><b>DSC Resources</b></span></p>
<ul>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/comprehensive-repository-of-data-science-and-ml-resources">Comprehensive Repository of Data Science and ML Resources</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/advanced-machine-learning-with-basic-excel">Advanced Machine Learning with Basic Excel</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/difference-between-machine-learning-data-science-ai-deep-learning">Difference between ML, Data Science, AI, Deep Learning, and Statistics</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blogs/my-data-science-machine-learning-and-related-articles">Selected Business Analytics, Data Science and ML articles</a></li>
<li><a href="http://careers.analytictalent.com/jobs/products">Hire a Data Scientist</a><span> </span>|<span> </span><a href="https://www.datasciencecentral.com/page/search?q=Python">Search DSC</a><span> </span>|<span> </span><a href="http://www.analytictalent.com/">Find a Job</a></li>
<li><a href="https://www.datasciencecentral.com/profiles/blog/new">Post a Blog</a><span> </span>|<span> </span><a href="https://www.datasciencecentral.com/forum/topic/new">Forum Questions</a></li>
</ul>Things that Aren’t Working in Deep Learningtag:www.analyticbridge.datasciencecentral.com,2018-11-21:2004291:BlogPost:3894292018-11-21T17:00:42.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong><em>Summary:</em></strong><span> </span><em> This may be the golden age of deep learning but a lot can be learned by looking at where deep neural nets aren’t working yet. This can be a guide to calming the hype. It can also be a roadmap to future opportunities once these barriers are behind us. The full article is accessible <a href="https://www.datasciencecentral.com/profiles/blogs/things-that-aren-t-working-in-deep-learning" target="_blank" rel="noopener">here</a>, below is a snapshot.. </em></p>
<p>We are living in the golden age of deep learning. This is quite literally the technology that launched 10,000 startups (to paraphrase Kevin Kelly’s prophetic prediction from 2014 “The business plans of the next 10,000 startups are easy to forecast:<span> </span><em>Take X and add AI</em>.”) Well that happened.</p>
<p>Kelly was speaking more broadly about AI, but over the last four years we’ve come to understand that it is CNNs and RNN/LSTMs that are actually commercially ready and driving this. </p>
<p>Although the last two years have been fairly quiet in terms of new technique and technology breakthroughs for data science, it hasn’t been totally quiet. Like the emergence of Temporal Convolutional Nets<span> </span><a href="https://www.datasciencecentral.com/profiles/blogs/temporal-convolutional-nets-tcns-take-over-from-rnns-for-nlp-pred"><em><u>(TCNs) to replace RNNs</u></em></a><span> </span>in language translation, research goes on to see how deep learning and specifically CNN architecture can be pushed into new applications.</p>
<p> </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/135609852?profile=original" target="_blank" rel="noopener"><img src="https://storage.ning.com/topology/rest/1.0/file/get/135609852?profile=original&width=225" width="225" class="align-center"/></a></p>
<p> </p>
<p><span><strong>Roadblocks to Deep Learning</strong></span></p>
<p>Which brings us to our current topic which is to understand what some of the major roadblocks in research are in trying to expand deep learning into new areas. </p>
<p>In calling our attention to ‘things that aren’t working in deep learning’, we aren’t suggesting that these things will never work, but rather that researchers are currently identifying major stumbling blocks to moving forward.</p>
<p>The value of this is two-fold. First it can help steer us away from projects that might on the surface look like deep learning will work, but in fact may take a year or years to work out. Second, we should keep our eye on these particular issues since once they are resolved they will represent opportunities that others may have decided weren’t possible.</p>
<p>Here are several that we spotted in the research.</p>
<p>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/things-that-aren-t-working-in-deep-learning" target="_blank" rel="noopener">here</a>. </p>
<p><em>To make sure you keep getting these emails, please add mail@newsletter.datasciencecentral.com to your address book or whitelist us. <span>To subscribe, </span><a href="https://www.datasciencecentral.com/profiles/blogs/check-out-our-dsc-newsletter" target="_blank" rel="noopener">follow this link</a><span>. </span></em></p>Lots of Open Source Datasets to Make Your AI Bettertag:www.analyticbridge.datasciencecentral.com,2018-10-03:2004291:BlogPost:3887132018-10-03T16:49:20.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong>Summary</strong>: There are several approaches to reducing the cost of training data for AI, one of which is to get it for free. Here are some excellent sources.</p>
<p>Recently we wrote that training data (not just data in general) is the new oil. It’s the difficulty and expense of acquiring labeled training data that causes many deep learning projects to be abandoned.</p>
<p>It also matters a great deal just how good you want your new deep learning app to be. A 2016 study by Goodfellow, Bengio and Courville concluded you could get ‘acceptable’ performance with about 5,000 labeled examples per category BUT it would take 10 Million labeled examples per category to “match or exceed human performance”.</p>
<p>There are a number of technologies coming up through research now that promise more accurate auto labeling to make creating training data less costly and time consuming. Snorkel from the Stanford Dawn Project is one we covered recently. This area is getting a lot of research attention.</p>
<p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2220289789?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2220289789?profile=original" width="415" class="align-center"/></a></p>
<p></p>
<p>Another approach is to build on someone else’s work using publicly available datasets. You can begin by building your model in the borrowed set, you can blend your data with the borrowed data, or you could use the transfer learning approach to repurpose the front end of an existing model to train on your more limited data.</p>
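<p>As a hedged sketch of the transfer-learning option mentioned above, the base network, input size, and classification head below are illustrative choices only, and the placeholder at the end stands in for your own limited labeled dataset.</p>
<pre>
import tensorflow as tf

# Reuse a publicly available pretrained front end and retrain only a small head.
base = tf.keras.applications.MobileNetV2(weights="imagenet", include_top=False,
                                         input_shape=(224, 224, 3))
base.trainable = False                     # freeze the borrowed feature extractor

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(5, activation="softmax")(x)   # 5 classes: an assumption
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
# model.fit(your_small_labeled_dataset, epochs=5)   # placeholder: supply your own data
</pre>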
<p>Whatever your strategy, the ability to build on publicly available datasets is always something you’ll want to consider, so your ability to find them becomes key.</p>
<p>Here are some notes on where you might start your search. These won’t all be labeled image and text but a lot of them are. And for those of you looking to use ML and statistical techniques, there’s plenty here for you too.</p>
<p><a href="https://www.datasciencecentral.com/profiles/blogs/lots-of-free-open-source-datasets-to-make-your-ai-better" target="_blank" rel="noopener">Read full article here</a>. </p>
<p><em>To make sure you keep getting these emails, please add mail@newsletter.datasciencecentral.com </em><span><em>to your address book or whitelist us. </em> </span></p>Introduction to Deep Learningtag:www.analyticbridge.datasciencecentral.com,2018-09-21:2004291:BlogPost:3889382018-09-21T18:00:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><em>Guest blog post by Zied HY. Zied is <span>Senior Data Scientist at Capgemini Consulting. He</span><span> specializes in building predictive models utilizing both traditional statistical methods (Generalized Linear Models, Mixed Effects Models, Ridge, Lasso, etc.) and modern machine learning techniques (XGBoost, Random Forests, Kernel Methods, neural networks, etc.). Zied</span><span> runs workshops for university students (ESSEC, HEC, Ecole polytechnique) interested in Data Science and its applications, and he is </span><span>the co-founder of Global International Trading (GIT), a central purchasing office based in Paris.</span></em></p>
<p>I have started reading about Deep Learning for over a year now through several articles and research papers that I came across mainly in LinkedIn, Medium and Arxiv.</p>
<p><a href="http://storage.ning.com/topology/rest/1.0/file/get/2220289590?profile=original" target="_self"><img src="http://storage.ning.com/topology/rest/1.0/file/get/2220289590?profile=original" width="666" class="align-center"/></a></p>
<p>After virtually attending the MIT 6.S191 Deep Learning courses over the last few weeks, I decided to start putting some structure into my understanding of Neural Networks through this series of articles.</p>
<p>I will go through the first four courses:</p>
<ol>
<li>Introduction to Deep Learning</li>
<li>Sequence Modeling with Neural Networks</li>
<li>Deep learning for computer vision - Convolutional Neural Networks</li>
<li>Deep generative modeling</li>
</ol>
<p>For each course, I will outline the main concepts and add more details and interpretations from my previous readings and my background in statistics and machine learning.</p>
<p>Starting with the second course, I will also add, for each one, an application on an open-source dataset.</p>
<p>That said, let’s go!</p>
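<p>Before diving in, and purely as an illustration (this example is not taken from the MIT course material), here is what the kind of model covered in the first course looks like in code: a small fully connected network trained on the classic MNIST digits, written in Keras.</p>
<pre><code># Illustrative only (not from the MIT 6.S191 material): a small fully connected
# network on MNIST, the "hello world" of deep learning.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=2, validation_data=(x_test, y_test))
</code></pre>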
<p>Read the first part, <a href="https://www.datasciencecentral.com/profiles/blogs/introduction-to-deep-learning" target="_blank" rel="noopener">here</a>. </p>Analytics Translator – The Most Important New Role in Analyticstag:www.analyticbridge.datasciencecentral.com,2018-09-12:2004291:BlogPost:3888422018-09-12T23:30:00.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p><strong><em>Summary:</em></strong><em> The role of Analytics Translator was recently identified by McKinsey as the most important new role in analytics, and a key factor in the failure of analytic programs when the role is absent.</em></p>
<p> <a href="https://storage.ning.com/topology/rest/1.0/file/get/2220290243?profile=original" target="_self"><img class="align-center" src="https://storage.ning.com/topology/rest/1.0/file/get/2220290243?profile=RESIZE_1024x1024" width="500"></img></a></p>
<p>The role of Analytics Translator was recently<span> identified by McKinsey </span>as the most important new role in analytics, and a key factor in the failure of analytic programs when the role is absent.</p>
<p>As our profession of data science has evolved, any number of authors, myself included, have offered taxonomies to describe the different ‘tribes’ of data scientists. We may disagree on the categories, but we agree that we’re not all alike.</p>
<p>Ten years ago, around the time that Hadoop and Big Data went open source, there was still a perception that a data scientist should be capable of performing every task in the analytics lifecycle. </p>
<p>The obvious skills were model creation and deployment, along with data blending and munging. Other important skills in this bucket included setting up data infrastructure (data lakes, streaming architectures, Big Data NoSQL databases, etc.). Finally, there were the skills assumed to come with seniority: storytelling (explaining the results to executive sponsors) and strong project management.</p>
<p>Frankly, when I entered the profession this was true, and in those early projects I did, for the most part, do it all.</p>
<p><span><strong>Data Science – A Profession of Specialties</strong></span></p>
<p>It’s fair to say that today nobody expects this. Ours is rapidly becoming a field of specialists, defined by data types (NLP, image, streaming, classic static data), role (data engineer, junior data scientist, senior data scientist), or by use cases (predictive maintenance, inventory forecasting, personalized marketing, fraud detection, chatbot UIs, etc.). These aren’t rigid boundaries and a good data scientist may bridge several of these, but not all.</p>
<p><em>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/analytics-translator-the-most-important-new-role-in-analytics" target="_blank" rel="noopener">here</a>. (By Bill Vorhies)</em></p>New Perspective on the Central Limit Theorem and Statistical Testingtag:www.analyticbridge.datasciencecentral.com,2018-09-11:2004291:BlogPost:3887582018-09-11T03:07:16.000ZVincent Granvillehttps://www.analyticbridge.datasciencecentral.com/profile/VincentGranville
<p>You won't learn this in textbooks, college classes, or data camps. Some of the material in this article is very advanced yet presented in simple English, with an Excel implementation for various statistical tests, and no arcane theory, jargon, or obscure theorems. It has a number of applications, in finance in particular. This article covers several topics under a unified approach, so it was not easy to find a title. In particular, we discuss:</p>
<ul>
<li>When the central limit theorem fails: what to do, and case study</li>
<li>Various original statistical tests, some unpublished, for instance to test if an empirical statistical distribution (based on observations) is symmetric or not, or whether two distributions are identical</li>
<li>The power and mysteries of stable (also called divisible) statistical distributions</li>
<li>Dealing with weighted sums of random variables, especially with decaying weights</li>
<li>Fun number theory problems and algorithms associated with these statistical problems</li>
<li>Decomposing a (theoretical or empirical / observed) statistical distribution into elementary components, just like decomposing a complex molecule into atoms</li>
</ul>
<p>The focus is on principles, methodology, and techniques that are applicable to, and useful in, many contexts. For those willing to do a deeper dive into these topics, many references are provided. This article, written as a tutorial, is accessible to professionals with elementary statistical knowledge, such as a stats 101 course. It is also written in a compact style, so that you can grasp all the material in hours rather than days. It covers topics that you could learn in classes aimed at PhD students at MIT, Stanford, Berkeley, Princeton, or Harvard. Some of it consists of state-of-the-art research results published here for the first time, made accessible to the data science or data engineering novice. I think mathematicians (being one myself) will also enjoy it. Yet the emphasis is on applications rather than theory. </p>
<p><a href="https://storage.ning.com/topology/rest/1.0/file/get/2220289349?profile=original" target="_self"><img src="https://storage.ning.com/topology/rest/1.0/file/get/2220289349?profile=original" class="align-center"/></a></p>
<p>Finally, we focus here on sums of random variables. The next article will focus on mixtures rather than sums, which provide more flexibility for modeling purposes and for decomposing a complex distribution into elementary components. In both cases, my approach is mostly non-parametric and based on robust statistical techniques that handle outliers without problems and are not subject to over-fitting.</p>
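<p>To make the central idea concrete before the outline below, here is a small simulation sketch (illustrative only; it is not the Excel implementation discussed in the article): normalized sums of uniform variables behave as the central limit theorem predicts, while sums of Cauchy variables, a classic stable distribution with no finite variance, keep their heavy tails no matter how many terms are added.</p>
<pre><code># Quick simulation (illustrative, not the article's Excel implementation):
# normalized sums of uniforms behave as the CLT predicts, while averages of
# Cauchy variables remain Cauchy-distributed, so the heavy tails never vanish.
import numpy as np

rng = np.random.default_rng(0)
n_terms, n_samples = 500, 10_000

# CLT-compliant case: sum of uniforms, centered and scaled by sqrt(n).
u = rng.uniform(-1, 1, size=(n_samples, n_terms))
s_uniform = u.sum(axis=1) / np.sqrt(n_terms)

# Non-compliant case: the Cauchy distribution has no finite mean or variance;
# the average of n standard Cauchy variables is itself standard Cauchy.
c = rng.standard_cauchy(size=(n_samples, n_terms))
s_cauchy = c.mean(axis=1)

print("uniform sums, 99.9th percentile of |S|:", np.percentile(np.abs(s_uniform), 99.9))
print("cauchy means, 99.9th percentile of |S|:", np.percentile(np.abs(s_cauchy), 99.9))
# The first number stays modest (Gaussian tails); the second is enormous.
</code></pre>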
<p><strong>Content</strong></p>
<p>1. Central Limit Theorem: New Approach</p>
<p>2. Stable and Attractor Distributions</p>
<ul>
<li>Using decaying weights</li>
<li>More about stable distributions and their applications</li>
</ul>
<p>3. Non CLT-compliant Weighted Sums, and their Attractors</p>
<ul>
<li>Testing for normality</li>
<li>Testing for symmetry and dependence on kernel</li>
<li>Testing for semi-stability</li>
<li>Conclusions</li>
</ul>
<p>Read full article <a href="https://www.datasciencecentral.com/profiles/blogs/new-perspective-on-central-limit-theorem-and-related-stats-topics" target="_blank" rel="noopener">here</a>. </p>