

This article is by Jorge Castañón, Ph.D., Senior Data Scientist at the IBM Machine Learning Hub.

Data visualization plays two key roles:
1. Communicating results clearly to a general audience.
2. Organizing a view of data that suggests a new hypothesis or a next step in a project.

It’s no surprise that most people prefer visuals to large tables of numbers. That’s why clearly labeled plots with meaningful interpretation always make it to the front of academic papers.

This post looks at the 10 visualizations you can bring to bear on your data, whether you want to convince the wider world of your theories or crack open your own project and take the next step:
1. Histograms
2. Bar/Pie charts
3. Scatter/Line plots
4. Time series
5. Relationship maps
6. Heat maps
7. Geo Maps
8. 3-D Plots
9. Higher-Dimensional Plots
10. Word clouds

Read the full article, with descriptions and illustrations for these visualizations, here.

Some original and very interesting material is presented here, with possible applications in Fintech. No need for a PhD in math to understand this article: I tried to make the presentation as simple as possible, focusing on high-level results rather than technicalities. Yet professional statisticians and mathematicians, even academic researchers, will find some deep and fascinating results worth exploring further.

Can you identify patterns in this chart? (See section 2.2 in the article for an answer.)

Let's start with [formula not reproduced in this excerpt]. Here the X(k)'s are independent and identically distributed random variables, with a common distribution referred to as X. We are trying to find the distribution of Z.

Contents
1. Using a Simple Discrete Distribution for X
2. Towards a Better Model
   Approximate Solution
   The Fractal, Brownian-like Error Term
3. Finding X and Z Using Characteristic Functions
   Test with Log-normal Distribution for X
   Playing with the Characteristic Functions
   Generalization to Continued Fractions and Nested Cubic Roots
4. Exercises

Read this article here.

By Bill Vorhies.

Summary: Here’s a proposal for real ‘zero touch’, ‘set-em-and-forget-em’ machine learning from the researchers at Amazon. If you have an environment as fast changing as e-retail and a huge number of models matching buyers and products, you could achieve real cost savings and revenue increases by making the refresh cycle faster and more accurate with automation. This capability will likely be coming soon to your favorite AML platform.

Is there a future in which we can really ‘set-em-and-forget-em’ machine learning? So far Automated Machine Learning (AML) is delivering on vastly simplifying the creation of models, but the maintenance, refresh, and update still require manual intervention.

Not that we’re trying to talk ourselves out of a job. But after all, once the model is built and implemented, it’s more fun to move on to the next opportunity. If the maintenance and refresh cycle could be truly automated, that would be a good thing.

Much of the effort so far has been put into simplifying getting the model out of its AML environment and into its production environment. Facebook’s FBLearner is an example of this. A number of platforms claim to ease this process for the rest of us. At least once we manually refresh the model, it’s easier to update it in production.

Read full article here.

This list of lists contains books, notebooks, presentations, cheat sheets, and tutorials covering all aspects of data science, machine learning, deep learning, statistics, math, and more, with most documents featuring Python or R code and numerous illustrations or case studies. All this material is available for free, and consists of content mostly created in 2019 and 2018, by various top experts in their respective fields. A few of these documents are available on LinkedIn: see last section on how to download them. Below are the first two sections.

General References
Free Deep Learning Book (639 pages) by Prof. Gilles Louppe
Python Crash Course (562 pages) by Eric Matthes
Free Book: Applied Data Science (141 pages) - Columbia University
Data Science in Practice
Machine Learning 101 - By Jason Mayes, Google
The Ultimate guide to AI, Data Science & Machine Learning
Free Handbooks for Data Science Professionals
Free Book: Natural Language Processing with Python
Data Visualization Resources
Textbook: Probability Course - Harvard University
Textbook: The Math of Machine Learning - Berkeley University
Comprehensive Guide to Machine Learning - Berkeley University
Free Book: Foundations of Data Science - by Microsoft Research
Comprehensive Guide on Machine Learning - by J.P. Morgan
Gentle Approach to Linear Algebra - by Vincent Granville

Data Science Central Books, Booklets and References
Statistics: New Foundations, Toolbox, and Machine Learning Recipes
Deep Learning and Computer Vision with CNNs
Getting Started with TensorFlow 2.0
Classification and Regression in a Weekend
Online Encyclopedia of Statistical Science
Azure Machine Learning in a Weekend
Enterprise AI - An Application Perspective
Applied Stochastic Processes
Comprehensive Repository of Data Science and ML Resources
Foundations of ML and Data Science for Developers
Elegant Representation of Forward/Back Propagation in Neural Networks
Learning the Math of Data Science

To access all these documents and more, follow this link.

I have used synthetic data sets many times for simulation purposes, most recently in my articles Six degrees of Separations between any two Datasets and How to Lie with p-values. Many applications (including the data sets themselves) can be found in my books Applied Stochastic Processes and New Foundations of Statistical Science. For instance, these data sets can be used to benchmark statistical hypothesis tests (with the null hypothesis known in advance to be true or false) and to assess the power of such tests or the accuracy of confidence intervals. In other cases, they are used to simulate clusters and to test cluster detection / pattern detection algorithms, see here. I also used such data sets to discover two new deep conjectures in number theory (see here), to design new Fintech models such as bounded Brownian motions, and to find new families of statistical distributions (see here).

Goldbach's comet

In this article, I focus on peculiar random data sets to prove (heuristically) two of the most famous math conjectures in number theory, both related to prime numbers: the Twin Prime conjecture and the Goldbach conjecture. The methodology is at the intersection of probability theory, experimental math, and probabilistic number theory. It involves working with infinite data sets, dwarfing any data set found in any business context.

Read full article here.
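
The benchmarking use mentioned above (a test run on synthetic data where the truth of the null hypothesis is known in advance) can be sketched as follows. This is a minimal illustration, not the author's code: the function name `rejection_rate`, the choice of a two-sided z-test on Gaussian data with known unit variance, and all parameter values are assumptions made for the example.

```python
import math
import random

def rejection_rate(true_mean, n=30, trials=2000, seed=0):
    """Fraction of synthetic samples for which a two-sided z-test of
    H0: mean = 0 (known standard deviation 1) rejects at the 5% level.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    z_crit = 1.959963984540054  # 97.5% quantile of the standard normal
    rejections = 0
    for _ in range(trials):
        # Synthetic data: the true mean is chosen by us, so we know
        # in advance whether H0 holds.
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        z = (sum(sample) / n) * math.sqrt(n)
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials

# H0 true: the rejection rate estimates the type I error (nominally 5%).
type_one_error = rejection_rate(true_mean=0.0)
# H0 false: the rejection rate estimates the power against mean = 0.5.
power = rejection_rate(true_mean=0.5)
print("estimated type I error:", type_one_error)
print("estimated power:", power)
```

Swapping in a different test statistic or a different data generator only requires changing the two lines that build `sample` and compute `z`.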


This is an interesting data science conjecture, inspired by the well-known six degrees of separation problem, which states that there is a link involving no more than 6 connections between any two people on Earth, for instance between you and anyone living in, say, North Korea. Here the link is between any two univariate data sets of the same size, say Data A and Data B. The claim is that there is a chain involving no more than 6 intermediary data sets, each highly correlated to the previous one (with a correlation above 0.8), between Data A and Data B. The concept is illustrated in the example below, where only 4 intermediary data sets (labeled Degree 1, Degree 2, Degree 3, and Degree 4) are actually needed.

Correlation table for the 6 data sets

To view the (random) data sets, understand how the chain of intermediary data sets was built, and access the spreadsheets to reproduce the results or test on different data, follow this link. It makes for an interesting theoretical data science research project, for people with too much free time on their hands.
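
One possible way to build such a chain is sketched below. It is not necessarily the construction used in the linked article: it assumes that the intermediary data sets may be arbitrary blends of the two (standardized) endpoints, and it uses spherical interpolation so that every consecutive pair has the same correlation, cos(φ/5) for 4 intermediaries, where cos(φ) is the correlation between Data A and Data B. Since φ is at most π, that value is at least cos(36°) ≈ 0.809. The function names are hypothetical.

```python
import math
import random

def _standardize(x):
    """Center x and scale it to unit length (x must not be constant)."""
    mean = sum(x) / len(x)
    centered = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def correlation(x, y):
    """Pearson correlation between two equally sized data sets."""
    u, v = _standardize(x), _standardize(y)
    return sum(p * q for p, q in zip(u, v))

def correlation_chain(a, b, n_intermediate=4):
    """Return [A', Degree 1, ..., Degree n, B'], where A' and B' are the
    standardized versions of a and b (correlation 1 with the originals).

    Assumption: intermediaries are blends of the endpoints, built by
    spherical interpolation; this is an illustrative choice.
    """
    u, v = _standardize(a), _standardize(b)
    r = max(-1.0, min(1.0, sum(p * q for p, q in zip(u, v))))
    phi = math.acos(r)  # angle between the two standardized data sets
    if math.sin(phi) < 1e-12:
        raise ValueError("a and b are perfectly (anti)correlated")
    steps = n_intermediate + 1
    chain = []
    for k in range(steps + 1):
        t = k / steps
        wa = math.sin((1 - t) * phi) / math.sin(phi)
        wb = math.sin(t * phi) / math.sin(phi)
        chain.append([wa * p + wb * q for p, q in zip(u, v)])
    return chain

rng = random.Random(1)
data_a = [rng.random() for _ in range(50)]
data_b = [rng.random() for _ in range(50)]
chain = correlation_chain(data_a, data_b)
for left, right in zip(chain, chain[1:]):
    print(round(correlation(left, right), 3))
```

Under this construction the 0.8 threshold is met with 4 intermediaries whenever the two data sets are not exactly collinear; whether blends of the endpoints count as admissible intermediary data sets is a modeling choice that the summary above does not settle.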

The material discussed here is also of interest to machine learning, AI, big data, and data science practitioners, as much of the work is based on heavy data processing, algorithms, efficient coding, testing, and experimentation. Also, it's not just two new conjectures, but paths and suggestions to solve these problems. The last section contains a few new, original exercises, some with solutions, and may be useful to students, researchers, and instructors offering math and statistics classes at the college level: they range from easy to very difficult. Some great probability theorems are also discussed, in layman's terms: see section 1.2.

The two deep conjectures highlighted in this article (conjectures B and C) are related to the digit distribution of well-known math constants such as Pi or log 2, with an emphasis on binary digits of SQRT(2). This is an old problem, one of the most famous ones in mathematics, still unsolved today.

Content of this article
A Strange Recursive Formula
Conjecture A
A deeper result
Conjecture B
Connection to the Berry-Esseen theorem
Potential path to solving this problem
Potential Solution Based on Special Rational Number Sequences
Interesting statistical result
Conjecture C
Another curious statistical result
Exercises

Read the full article here.
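
To experiment with the digit distribution mentioned above, the binary digits of SQRT(2) can be computed exactly with integer arithmetic. The sketch below is an illustration only and is not taken from the article; the function name `sqrt2_binary_digits` is hypothetical. It relies on the fact that floor(SQRT(2) × 2^n) equals the integer square root of 2 × 4^n.

```python
import math

def sqrt2_binary_digits(n):
    """Return the first n binary digits of SQRT(2) after the binary
    point, as a string of '0' and '1' characters."""
    # floor(sqrt(2) * 2**n) == isqrt(2 * 4**n); exact integer arithmetic.
    scaled = math.isqrt(2 << (2 * n))
    bits = bin(scaled)[2:]  # leading '1' is the integer part of sqrt(2)
    return bits[1:]

digits = sqrt2_binary_digits(10_000)
proportion_of_ones = digits.count("1") / len(digits)
print("first 20 digits:", digits[:20])
print("proportion of 1s in 10,000 digits:", proportion_of_ones)
```

An empirical proportion close to 50% is consistent with the conjectured behavior but proves nothing: the question remains open.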