The methodology described here has broad applications, leading to new statistical tests, new types of ANOVA (analysis of variance), improved design of experiments, interesting fractional factorial designs, a better understanding of irrational numbers leading to cryptography, gaming and Fintech applications, and high-quality random number generators (and when you really need them). It also features exact arithmetic, high performance computing and distributed algorithms to compute millions of binary digits for an infinite family of real numbers, including detection of auto- and cross-correlations (or lack thereof) in the digit distributions.
The data processed in my experiment, consisting of raw irrational numbers (described by a new class of elementary recurrences), led to the discovery of unexpected apparent patterns in their digit distribution: in particular, the fact that a few of these numbers, contrary to popular belief, do not have 50% of their binary digits equal to 1. It turned out that perfectly random digits simulated in large numbers, with a good enough pseudo-random generator, also exhibit the same strange behavior, pointing to the fact that pure randomness may not be as random as we imagine it is. Ironically, failure to exhibit these patterns would be an indicator that there really is a departure from pure randomness in the digits in question.
In addition to new statistical and mathematical methods, discoveries and interesting applications, you will learn in my article how to avoid the statistical traps that lead to erroneous conclusions when performing a large number of statistical tests, and how not to be misled by false appearances. I call them statistical hallucinations and false outliers.
This article has two main sections: section 1, with deep research in number theory, and section 2, with deep research in statistics, with applications. You may skip one of the two sections depending on your interests and how much time you have.
Both sections, despite being state-of-the-art in their respective fields, are written in simple English. It is my wish that this article gets data scientists interested in math, and the other way around: the topics in both cases have been chosen to be exciting and modern. I also hope that this article will give you new powerful tools to add to your arsenal of tricks and techniques. Both topics are related, the statistical analysis being based on the numbers discussed in the math section. One of the interesting new topics discussed here for the first time is the cross-correlation between the digits of two irrational numbers. These digit sequences are treated as multivariate time series. I believe this is the first time that this subject is not only investigated in detail, but also comes with a deep, spectacular probabilistic number theory result about the distributions in question, with important implications for security and cryptography systems. Another related topic discussed here is a generalized version of the Collatz conjecture, with some insights on how to potentially solve it.
Read the full article here.
Content
1. On the Digits Distribution of Quadratic Irrational Numbers
   - Properties of the recursion
   - Reverse recursion
   - Properties of the reverse recursion
   - Connection to Collatz conjecture
   - Source code
   - New deep probabilistic number theory results
   - Spectacular new result about cross-correlations
   - Applications
2. New Statistical Techniques Used in Our Analysis
   - Data, features, and preliminary analysis
   - Doing it the right way
   - Are the patterns found a statistical illusion, or caused by errors, or real?
   - Pattern #1: Non-Gaussian behavior
   - Pattern #2: Illusionary outliers
   - Pattern #3: Weird distribution for block counts
Related articles and books
Appendix
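To make the digit analysis more concrete, here is a minimal Python sketch. It is not the article's source code: it assumes math.isqrt as a stand-in for the article's recurrences (which are not reproduced in this summary), and the helper names sqrt_binary_digits and correlation are hypothetical. It extracts binary digits of sqrt(2) and sqrt(3) with exact integer arithmetic, then estimates the proportion of 1s and the cross-correlation between the two digit sequences.

```python
from math import isqrt


def sqrt_binary_digits(n, count):
    """Return the first `count` binary digits of the fractional part of sqrt(n).

    Uses exact integer arithmetic: isqrt(n * 4**count) == floor(sqrt(n) * 2**count),
    so the lowest `count` bits of that integer are the fractional digits.
    """
    scaled = isqrt(n << (2 * count))
    frac = scaled & ((1 << count) - 1)      # keep only the fractional bits
    return [int(bit) for bit in format(frac, f"0{count}b")]


def correlation(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    count = 20000                            # number of digits; an arbitrary choice
    digits_2 = sqrt_binary_digits(2, count)
    digits_3 = sqrt_binary_digits(3, count)
    print("proportion of 1s in sqrt(2):", sum(digits_2) / count)
    print("proportion of 1s in sqrt(3):", sum(digits_3) / count)
    # Lag-1 auto-correlation: a digit sequence against itself shifted by one position
    print("lag-1 auto-correlation, sqrt(2):", correlation(digits_2[:-1], digits_2[1:]))
    print("cross-correlation, sqrt(2) vs sqrt(3):", correlation(digits_2, digits_3))
```

Integer square roots keep every digit exact, which matters here: floating-point square roots would only give about 52 reliable binary digits.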

Summary: The Gartner Magic Quadrant for Data Science and Machine Learning Platforms is just out, and the big news is how much more capable all the platforms have become. Of course, there are also some interesting winner and loser stories.
The Gartner Magic Quadrant for Data Science and Machine Learning Platforms is just out for 2020. The really big news is how many excellent choices are now available. In a remarkable move, the whole field of competitors has moved strongly up and to the right, offering more Leaders and near-Leader Visionaries than ever before.
It’s a mark of maturity in our industry that so many platforms offer fully capable model development, operationalizing, and management features. The list of requirements defined by Gartner grows longer every year, and earning a better rating requires increasing capability and increasing customer satisfaction.
What Are the Major Changes?
As in previous years, we’ve charted the major changes in position using green arrows for improvement and red arrows to indicate a reduced rating. The blue dots are current ratings and the gray dots are from a year ago.
Read the full article here with the 2020 version of the above chart, with comments.

In this notebook, we try to predict the positive (label 1) or negative (label 0) sentiment of a sentence. We use the UCI Sentiment Labelled Sentences Data Set.
Sentiment analysis is very useful in many areas. For example, it can be used to moderate internet conversations. It can also be used to predict the ratings that users would assign to a product (food, household appliances, hotels, films, etc.) based on their reviews.
In this notebook we use two families of machine learning algorithms: Naive Bayes (NB) and long short-term memory (LSTM) neural networks.
- AYLIEN
- Deeplearning4j
- Understanding LSTM Networks
- Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling
- The Unreasonable Effectiveness of Recurrent Neural Networks
We will use pandas and numpy for data manipulation, nltk for natural language processing, matplotlib, seaborn and plotly for data visualization, and sklearn and keras for learning the models.
Read the full article with source code and illustrations, here.
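As a rough illustration of the Naive Bayes side of this approach, here is a minimal sketch in plain Python. It is not the notebook's code (which relies on sklearn and keras): the functions train and predict are hypothetical names, the six training sentences are invented for the example rather than taken from the UCI data set, and the classifier is a bare-bones multinomial Naive Bayes with add-one (Laplace) smoothing.

```python
import math
from collections import Counter


def tokenize(text):
    """Very crude tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train(samples):
    """Fit a multinomial Naive Bayes model.

    samples: list of (sentence, label) pairs, label 1 = positive, 0 = negative.
    """
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for text, label in samples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return the label (0 or 1) with the highest log-posterior score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)      # log prior
        for word in tokenize(text):
            if word in vocab:                                  # ignore unseen words
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)                    # log likelihood
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy data, for illustration only
TRAINING = [
    ("the food was great", 1),
    ("i love this phone", 1),
    ("excellent service and great battery", 1),
    ("the movie was terrible", 0),
    ("i hate the slow service", 0),
    ("bad battery and poor screen", 0),
]

model = train(TRAINING)
print(predict(model, "great phone"))
print(predict(model, "terrible screen"))
```

With real data, a library implementation with proper tokenization and a held-out test set would be the sensible choice; the sketch only shows the counting and scoring logic.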

Probably the worst error is thinking there is a correlation when that correlation is purely artificial. Take a data set with 100,000 variables, say with 10 observations. Compute all the (99,999 * 100,000) / 2 cross-correlations. You are almost guaranteed to find one above 0.999. This is best illustrated in my article How to Lie with P-values (which also discusses how to handle and fix it).
This is being done on such a large scale that I think it is probably the main cause of fake news, and the impact is disastrous on people who take for granted what they read in the news or what they hear from the government. Some people are sent to jail based on evidence tainted with major statistical flaws. Government money is spent, propaganda is generated, wars are started, and laws are created based on false evidence. Sometimes the data scientist has no choice but to knowingly cook the numbers to keep her job. Usually, these “bad stats” end up being featured in beautiful but faulty visualizations: axes are truncated, charts are distorted, observations and variables are carefully chosen just to make a (wrong) point.
Read the full article here.
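The effect is easy to reproduce on a small scale. The sketch below is an illustration under stated assumptions, not the author's experiment: it uses 400 independent Gaussian noise variables instead of 100,000 (to keep the run time short), 10 observations each, and a hypothetical helper named max_spurious_correlation. Even though every variable is pure noise, some pairs end up strongly correlated just because so many pairs are tested.

```python
import random


def standardize(values):
    """Center a vector and scale it to unit length, so that the dot product
    of two standardized vectors is their Pearson correlation."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    norm = sum(c * c for c in centered) ** 0.5
    return [c / norm for c in centered]


def max_spurious_correlation(n_vars, n_obs, seed=0):
    """Generate n_vars independent noise variables with n_obs observations each,
    and return (largest absolute pairwise correlation, number of pairs tested)."""
    rng = random.Random(seed)
    data = [
        standardize([rng.gauss(0.0, 1.0) for _ in range(n_obs)])
        for _ in range(n_vars)
    ]
    best = 0.0
    n_pairs = 0
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            r = sum(a * b for a, b in zip(data[i], data[j]))
            best = max(best, abs(r))
            n_pairs += 1
    return best, n_pairs


if __name__ == "__main__":
    best, n_pairs = max_spurious_correlation(n_vars=400, n_obs=10)
    print(f"pairs tested: {n_pairs}, largest |correlation|: {best:.3f}")
```

Increasing n_vars pushes the largest correlation closer to 1, while increasing n_obs pulls it back down, which is why few observations combined with many variables is the dangerous regime.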
Related articles
- How to Lie with P-values
- Four Types of Data Scientist
- Debunking Forbes Article about the Death of the Data Scientist
- Why You Should be a Data Science Generalist - and How to Become One
- Becoming a Billionaire Data Scientist vs Struggling to Get a $100k Job
- Is a PhD helpful for a data science career?
- If data science is in demand, why is it so hard to get a job?
- Why do people with no experience want to become data scientists?
- Why is Becoming a Data Scientist so Difficult?
- Full Stack Data Scientist: The Elusive Unicorn and Data Hacker
- Statistical Significance and p-Values Take Another Blow
- Are data science or stats curricula in US too specialized?
- How do you identify an actual data scientist?
- Is it still possible today to become a self-taught data scientist?
- Will the job outlook for data scientists severely decline after 2020?
- Why Logistic Regression should be the last thing you learn
Source for picture: here


Fermat's last conjecture puzzled mathematicians for more than 300 years, and was proved only recently. In this note, I propose a generalization that could actually lead to a much simpler proof and a more powerful result with broader applications, including solving numerous similar equations. As usual, my research involves a significant amount of computation and experimental math, as an exploratory step before stating new conjectures and eventually trying to prove them. The methodology is very similar to that used in data science, involving the following steps:
1. Identify and process the data. Here the data set consists of all real numbers; it is infinite, which brings its own challenges. On the plus side, the data is public and accessible to everyone, though very powerful computation techniques are required, usually involving a distributed architecture. Data cleaning: in this case, inaccuracies are caused by not using enough precision; the solution consists of finding better / faster algorithms for your computations, and sometimes having to work with exact arithmetic, using Bignum libraries.
2. Sample data and perform exploratory analysis to identify patterns. Formulate hypotheses. Perform statistical tests to validate (or not) these hypotheses. Then formulate conjectures based on this analysis. Build models (about how your numbers seem to behave) and focus on models offering the best fit. Perform simulations based on your model, and see if your numbers agree with your simulations by testing on a much larger set of numbers. Discard conjectures that do not pass these tests.
3. Formally prove or disprove retained conjectures, when possible. Then write a conclusion if possible: in this case, a new, major mathematical theorem, showing potential applications. This last step is similar to data scientists presenting the main insights of their analysis to a lay audience.
See full article for explanations about this table (representing the number of solutions)
The motivation in this article is two-fold:
- Presenting a new path that can lead to new interesting results and theoretical research in mathematics (yet my writing style and content are accessible to the layman).
- Offering data scientists and machine learning / AI practitioners (including newbies) an interesting framework to test their programming, discovery and analysis skills, using a huge (infinite) data set that has been available to everyone since the beginning of time, and applied to a fascinating problem.
Read full article here. For more math-oriented articles, visit this page (check the math section), or download my books, available here.
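The data cleaning step above (precision errors versus exact arithmetic) can be illustrated with a short Python sketch. This is an assumption-laden illustration, not the article's code: it uses the classical equation x^n + y^n = z^n rather than the author's generalization, Python's built-in arbitrary-precision integers in place of a separate Bignum library, and hypothetical function names. The triple (3987, 4365, 4472) with n = 12 is a well-known near miss that fools a tolerance-based floating-point check.

```python
def looks_like_solution_float(x, y, z, n, tol=1e-6):
    """Floating-point check: is (x^n + y^n)^(1/n) within `tol` of z?
    Prone to false positives when not enough precision is used."""
    root = (float(x) ** n + float(y) ** n) ** (1.0 / n)
    return abs(root - z) < tol


def is_solution_exact(x, y, z, n):
    """Exact check using Python's arbitrary-precision integers."""
    return x ** n + y ** n == z ** n


def count_solutions(n, limit):
    """Count pairs 1 <= x <= y <= limit such that x^n + y^n = z^n
    for some integer z <= limit, using exact arithmetic only."""
    nth_powers = {z ** n for z in range(1, limit + 1)}
    return sum(
        1
        for x in range(1, limit + 1)
        for y in range(x, limit + 1)
        if x ** n + y ** n in nth_powers
    )


if __name__ == "__main__":
    near_miss = (3987, 4365, 4472, 12)
    print("float check :", looks_like_solution_float(*near_miss))
    print("exact check :", is_solution_exact(*near_miss))
    for n in (2, 3):
        print(f"n = {n}: {count_solutions(n, 20)} solutions with z <= 20")
```

The floating-point test accepts the near miss while the exact test rejects it, which is the kind of inaccuracy the data cleaning step is meant to catch before any pattern is turned into a conjecture.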

Hundreds of programming languages are used in the data science and statistics market; Python, R, SAS and SQL are standouts. If you're looking to branch out and add a new programming language to your skill set, which one should you learn? This one picture breaks down the differences between the four languages.
View the full picture (with pluses and minuses) as well as related articles, here. Below are more resources for specific languages, including comparisons between languages, and the same algorithms illustrated in different languages.
- Python
- Python vs R
- R
- SQL
- SAS
- Julia
- Scala
- Java
- C
- Matlab
To quickly learn these languages or refresh your skills, check out our cheat sheets.


While many programming libraries encapsulate the inner workings of graph and other algorithms, as a data scientist it helps a lot to have a reasonably good familiarity with such details. A solid understanding of the intuition behind such algorithms not only helps in appreciating the logic behind them but also helps in making conscious decisions about their applicability in real-life cases. There are several graph-based algorithms, and the most notable are the shortest path algorithms. Algorithms such as Dijkstra’s, Bellman-Ford, A*, Floyd-Warshall and Johnson’s are commonly encountered. While these algorithms are discussed in many textbooks and informative resources online, I felt that not many provided visual examples that illustrate the processing steps at sufficient granularity to make the working details easy to understand. As such, I had to use simple enough graphs to visualize the algorithmic flow for my own understanding, and I wanted to share my examples along with the explanations through this article. Since there are many algorithms to illustrate, I decided to divide the article into several parts. In part 1, I have illustrated Dijkstra’s and Bellman-Ford algorithms. Before diving into the algorithms, I also wanted to highlight salient points about the graph data structure.
Content of this article:
- Quick Primer On Graph Data Structure
- Dijkstra’s Algorithm
- Bellman-Ford Algorithm
- More Algorithms To Cover
Read the full article here. Written by Murali Kashaboina, Tech. Executive, PhD Researcher AI/ML/DS, Data Scientist, Industry Speaker, Entrepreneur.
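For readers who prefer code to pictures, here is a compact Python sketch of the two algorithms covered in part 1. It is a textbook-style illustration, not the author's implementation; the adjacency-list representation (a dict mapping each node to a list of (neighbor, weight) pairs) and the two small example graphs are assumptions made for this example.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from `source`; requires non-negative edge weights."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    heap = [(0, source)]                       # (distance so far, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                           # stale entry, a shorter path was found
        for v, weight in graph[u]:
            candidate = d + weight
            if candidate < dist[v]:
                dist[v] = candidate
                heapq.heappush(heap, (candidate, v))
    return dist


def bellman_ford(graph, source):
    """Shortest distances from `source`; allows negative edge weights and
    raises ValueError if a reachable negative-weight cycle exists."""
    dist = {node: float("inf") for node in graph}
    dist[source] = 0
    edges = [(u, v, w) for u in graph for v, w in graph[u]]
    for _ in range(len(graph) - 1):            # at most |V| - 1 relaxation rounds
        updated = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                updated = True
        if not updated:
            break                              # distances have converged early
    for u, v, w in edges:                      # one extra pass detects negative cycles
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist


# Example graphs (invented for illustration)
POSITIVE = {"A": [("B", 4), ("C", 1)], "B": [("D", 1)], "C": [("B", 2), ("D", 5)], "D": []}
WITH_NEGATIVE_EDGE = {"A": [("B", 4), ("C", 5)], "B": [("D", 3)], "C": [("B", -3)], "D": []}

print(dijkstra(POSITIVE, "A"))
print(bellman_ford(WITH_NEGATIVE_EDGE, "A"))
```

Dijkstra’s algorithm is the faster choice when all weights are non-negative; Bellman-Ford is slower but remains correct when some edges have negative weights.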

In 2019, Google announced TensorFlow 2.0, a major leap from the existing TensorFlow 1.x. The key differences are as follows:
- Ease of use: Many old libraries (for example tf.contrib) were removed, and some were consolidated. For example, in TensorFlow 1.x a model could be built using contrib, layers, Keras or Estimators; so many options for the same task confused many new users. TensorFlow 2.0 promotes TensorFlow Keras for model experimentation and Estimators for scaled serving, and the two APIs are very convenient to use.
- Eager Execution: In TensorFlow 1.x, writing code was divided into two parts: building the computational graph and later creating a session to execute it. This was quite cumbersome, especially if a small error existed somewhere near the beginning of a big model. In TensorFlow 2.0, Eager Execution is enabled by default, i.e. you no longer need to create a session to run the computational graph; you can see the result of your code directly.
- Model building and deploying made easy: With TensorFlow 2.0 providing the high-level TensorFlow Keras API, the user has greater flexibility in creating the model. One can define a model using the Keras functional or sequential API. The TensorFlow Estimator API allows one to run a model on a local host or in a distributed multi-server environment without changing the model. Computational graphs are powerful in terms of performance; in TensorFlow 2.0 you can use the decorator tf.function so that the decorated function is run as a single graph. This is done via the powerful AutoGraph feature of TensorFlow 2.0. This allows users to optimize the function and increase portability. And the best part: you can write the function using natural Python syntax.
Read the full article here. To access the author's books covering machine learning, Azure, TensorFlow, deep learning and related topics (free for DSC members), follow this link.
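The difference between the two execution styles can be illustrated without TensorFlow at all. The following is a toy sketch in plain Python, not TensorFlow code: Node, constant, add, multiply and run are hypothetical names invented for this example, with run loosely playing the role that a session played in TensorFlow 1.x. It shows why deferred (graph) execution makes early mistakes harder to spot than eager execution.

```python
class Node:
    """A deferred operation: building a Node computes nothing."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs


def constant(value):
    return Node(lambda: value)


def add(a, b):
    return Node(lambda x, y: x + y, a, b)


def multiply(a, b):
    return Node(lambda x, y: x * y, a, b)


def run(node):
    """Evaluate a graph of Nodes recursively (the 'session' step)."""
    args = [run(parent) for parent in node.inputs]
    return node.fn(*args)


# Graph style: first describe the whole computation, then execute it.
graph = multiply(add(constant(2), constant(3)), constant(4))
deferred_result = run(graph)

# Eager style: each operation is evaluated as soon as it is written.
eager_result = (2 + 3) * 4

# A mistake near the beginning of the graph (a string instead of a number)
# goes unnoticed while building, and only surfaces when the graph is run.
broken_graph = multiply(add(constant(2), constant("3")), constant(4))
try:
    run(broken_graph)
    error_seen_at_run_time = False
except TypeError:
    error_seen_at_run_time = True

print(deferred_result, eager_result, error_seen_at_run_time)
```

In the eager style, the equivalent mistake (2 + "3") would fail immediately on the line where it is written, which is the debugging convenience the Eager Execution point above refers to.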