A Data Science Central Community
Given n observations x1, ..., xn, the generalized mean (also called power mean) is defined as $M_p = \left(\frac{1}{n}\sum_{i=1}^{n} x_i^p\right)^{1/p}$.
The case p = 1 corresponds to the traditional arithmetic mean,…Continue
Added by Vincent Granville on August 30, 2020 at 9:30am — No Comments
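As a minimal sketch of the generalized mean from the entry above, assuming the standard definition (the p-th root of the average of the p-th powers) and strictly positive observations. The function name `power_mean` and the sample data are illustrative, not taken from the article; the p = 0 case uses the geometric mean, which is the usual limit as p tends to 0.

```python
import math

def power_mean(xs, p):
    """Generalized (power) mean of strictly positive numbers xs with exponent p."""
    if not xs:
        raise ValueError("xs must be non-empty")
    if any(x <= 0 for x in xs):
        raise ValueError("this sketch assumes strictly positive observations")
    n = len(xs)
    if p == 0:
        # Limit case p -> 0: geometric mean, computed via logarithms
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** p for x in xs) / n) ** (1.0 / p)

data = [1.0, 2.0, 4.0]              # hypothetical sample
arithmetic = power_mean(data, 1)    # p = 1: arithmetic mean
geometric = power_mean(data, 0)     # p -> 0: geometric mean
harmonic = power_mean(data, -1)     # p = -1: harmonic mean
```

For positive data the power mean is non-decreasing in p, so the three values above come out in the order harmonic, geometric, arithmetic.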
Bernoulli lattice processes may be one of the simplest examples of point processes, and can be used as an introduction to learn about more complex spatial processes that rely on advanced measure theory for their definition. In this article, we show the differences and analogies between Bernoulli lattice processes on the standard rectangular or hexagonal grid, and the Poisson process, including convergence of discrete lattice processes to a continuous Poisson process, mainly in two…Continue
Added by Vincent Granville on June 5, 2020 at 1:11pm — No Comments
Summary: Explaining data science to a non-data scientist isn’t as easy as it sounds. You may know a lot about math, tools, techniques, data, and computer architecture, but the question is how to explain this briefly without getting buried in the detail. You might try this approach.
Added by Vincent Granville on June 4, 2020 at 5:05pm — No Comments
Products of two large primes are at the core of many encryption algorithms, as factoring the product is very hard for numbers with a few hundred digits. The two prime factors are associated with the encryption keys (public and private keys). Here we describe a new approach to factoring a big number that is the product of two primes of roughly the same size. It is designed especially to handle this problem and identify flaws in encryption algorithms. …Continue
Added by Vincent Granville on May 27, 2020 at 12:20pm — No Comments
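For context on why two primes of roughly the same size form a special case, here is a sketch of Fermat's classic factorization method. This is a textbook baseline, not the new approach announced in the entry above, which the excerpt does not describe. The function name `fermat_factor` and the small test number are illustrative.

```python
import math

def fermat_factor(n):
    """Factor an odd number n as (a - b) * (a + b), searching for a^2 - n = b^2.

    The search is short when the two factors are close to sqrt(n).
    For a prime n it ends with the trivial factorization (1, n).
    """
    if n % 2 == 0:
        return 2, n // 2
    a = math.isqrt(n)
    if a * a < n:
        a += 1                      # start at ceil(sqrt(n))
    while True:
        b2 = a * a - n
        b = math.isqrt(b2)
        if b * b == b2:             # a^2 - n is a perfect square
            return a - b, a + b
        a += 1

p, q = fermat_factor(10403)         # 10403 = 101 * 103, two close primes
```

The number of loop iterations grows with the gap between the two factors, which is why the method is only practical when they are very close.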
We discuss a simple trick to significantly accelerate the convergence of an algorithm when the error term decreases in absolute value over successive iterations, with the error term oscillating (not necessarily periodically) between positive and negative values.
We first illustrate the technique on a well-known and simple case: the computation of log 2 using its well-known, slow-converging series. We then discuss a very interesting and more complex case, before finally focusing on a…Continue
Added by Vincent Granville on May 5, 2020 at 5:37pm — No Comments
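One generic way to exploit an error term that alternates in sign is to average two consecutive iterates. The sketch below applies this to the alternating harmonic series for log 2. It is an assumption about the kind of trick meant in the entry above, since the excerpt does not spell out the actual method; the function name `partial_sums_log2` is illustrative.

```python
import math

def partial_sums_log2(n_terms):
    """Partial sums of 1 - 1/2 + 1/3 - 1/4 + ..., which converges to log 2."""
    sums, s = [], 0.0
    for k in range(1, n_terms + 1):
        s += (-1) ** (k + 1) / k
        sums.append(s)
    return sums

sums = partial_sums_log2(1000)
plain = sums[-1]                          # last partial sum
averaged = 0.5 * (sums[-1] + sums[-2])    # mean of two consecutive partial sums

err_plain = abs(plain - math.log(2))      # roughly 1 / (2n)
err_avg = abs(averaged - math.log(2))     # roughly 1 / (4 n^2)
```

Because consecutive partial sums fall on opposite sides of the limit, their average cancels most of the leading error term.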
The methodology described here has broad applications, leading to new statistical tests, new types of ANOVA (analysis of variance), improved design of experiments, interesting fractional factorial designs, a better understanding of irrational numbers leading to cryptography, gaming and Fintech applications, and high-quality random number generators (and when you really need them). It also features exact arithmetic / high performance computing and distributed algorithms to compute millions of…Continue
Added by Vincent Granville on February 29, 2020 at 11:00pm — No Comments
Summary: The Gartner Magic Quadrant for Data Science and Machine Learning Platforms is just out, and the big news is how much more capable all the platforms have become. Of course, there are also some interesting winner and loser stories.
The Gartner Magic Quadrant for Data Science and Machine Learning Platforms is just out for 2020. The really big news is how many excellent choices are now available. In a remarkable move, the whole field…Continue
Added by Vincent Granville on February 21, 2020 at 9:25am — No Comments
In this notebook, we try to predict the positive (label 1) or negative (label 0) sentiment of the sentence. We use the UCI Sentiment Labelled Sentences Data Set.
Sentiment analysis is very useful in many areas. For example, it can be used to moderate internet conversations. It can also be used to predict the ratings that users would assign to a certain product (food, household appliances, hotels,…Continue
Added by Vincent Granville on February 19, 2020 at 8:42pm — No Comments
Probably the worst error is thinking there is a correlation when that correlation is purely artificial. Take a data set with 100,000 variables, say with 10 observations. Compute all the (99,999 * 100,000) / 2 cross-correlations. You are almost guaranteed to find one above 0.999. This is best illustrated in my article How to Lie with P-values (also discussing…Continue
Added by Vincent Granville on February 7, 2020 at 9:48am — No Comments
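The effect described in the entry above can be reproduced on a much smaller scale. The sketch below is an illustration under stated assumptions: 300 independent Gaussian variables (instead of 100,000), 10 observations each, a fixed seed, and the standard library only. All names are illustrative.

```python
import math
import random

def standardize(x):
    """Center x and scale it to unit Euclidean norm."""
    m = sum(x) / len(x)
    centered = [v - m for v in x]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

random.seed(0)
n_vars, n_obs = 300, 10

# Independent random variables: any correlation between them is spurious.
data = [standardize([random.gauss(0.0, 1.0) for _ in range(n_obs)])
        for _ in range(n_vars)]

# Pearson correlation of two standardized vectors is their dot product.
max_abs_corr = max(
    abs(sum(a * b for a, b in zip(data[i], data[j])))
    for i in range(n_vars)
    for j in range(i + 1, n_vars)
)
n_pairs = n_vars * (n_vars - 1) // 2
```

Even with only 300 variables there are 44,850 pairs, and the largest absolute correlation among them is typically very high although every variable is pure noise.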
Fermat's last conjecture puzzled mathematicians for more than 300 years, and was proved only recently. In this note, I propose a generalization that could actually lead to a much simpler proof and a more powerful result with broader applications, including solving numerous similar equations. As usual, my research involves a significant amount of computations and experimental math, as an exploratory step before stating new conjectures, and eventually trying to prove them. The…Continue
Added by Vincent Granville on January 30, 2020 at 1:09am — No Comments
Hundreds of programming languages are used in the data science and statistics market; Python, R, SAS and SQL are standouts. If you're looking to branch out and add a new programming language to your skill set, which one should you learn? This one picture breaks down the differences between the four languages.…Continue
Added by Vincent Granville on January 28, 2020 at 8:41pm — No Comments
While many of the programming libraries encapsulate the inner working details of graph and other algorithms, as a data scientist it helps a lot to have a reasonably good familiarity with such details. A solid understanding of the intuition behind such algorithms not only helps in appreciating the logic behind them but also helps in making conscious decisions about their applicability in real-life cases. There are several graph-based algorithms, and the most notable are the shortest path…Continue
Added by Vincent Granville on January 21, 2020 at 10:12am — No Comments
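To give a flavor of the inner workings that libraries hide, here is a minimal sketch of Dijkstra's algorithm, one standard shortest-path method for graphs with non-negative edge weights. The excerpt above is truncated and does not say which algorithms the article covers, so the choice of Dijkstra, the adjacency-list format, and the toy graph are all assumptions.

```python
import heapq

def dijkstra(graph, source):
    """Shortest distances from source; graph maps node -> list of (neighbor, weight)."""
    dist = {source: 0}
    heap = [(0, source)]                 # priority queue of (distance, node)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # stale entry: a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Hypothetical toy graph with non-negative weights
graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("C", 2), ("D", 5)],
    "C": [("D", 1)],
    "D": [],
}
distances = dijkstra(graph, "A")
```

Here the direct edge from A to C costs 4, but the detour through B costs 3, so the algorithm settles on the cheaper route before reaching D.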
In 2019, Google announced TensorFlow 2.0, a major leap from the existing TensorFlow 1.0. The key differences are as follows:
Ease of use: Many old libraries (for example, tf.contrib) were removed, and some were consolidated. For example, in TensorFlow 1.x the model could be made using Contrib, layers, Keras or estimators; so many options for the same task confused many new users. TensorFlow 2.0 promotes TensorFlow Keras for model experimentation and Estimators…Continue
Added by Vincent Granville on January 9, 2020 at 9:49am — No Comments
Summary: AI/ML itself is the next big thing for many fields if you’re on the outside looking in. But if you’re a data scientist it’s possible to see those advancements that will propel AI/ML to its next phase of utility.
Added by Vincent Granville on January 7, 2020 at 7:41am — No Comments
Another good article by Ajit Jaokar.
"Correlation does not equal causation" is a mantra drilled into a data scientist from an early age.
That’s fine. But very few talk about the follow-on question:
How exactly do you determine causation?
This problem is further compounded because most books and examples are based on standard datasets (e.g., Boston, Iris, etc.). These examples do not discuss…Continue
Added by Vincent Granville on December 17, 2019 at 2:30pm — No Comments
Written by Ajit Jaokar.
Firstly, there are three broad categories of algorithms:
Added by Vincent Granville on December 17, 2019 at 9:00am — No Comments
There's no doubt about it, probability and statistics is an enormous field, encompassing topics from the familiar (like the average) to the complex (regression analysis, correlation coefficients and hypothesis testing to name but a few). If you want to be a great data scientist, you have to know some basic statistics. The following picture shows which statistics topics you must know if you're going to excel in data science.…Continue
Added by Vincent Granville on December 12, 2019 at 6:30pm — No Comments
At the time of writing, I'm a 52-year-old working in the fields of mathematics and data science. In mathematics, that makes me well-seasoned (and probably well-tenured, if I had chosen to continue in academia). In data science, some would consider me a dinosaur. In fact, many older people considering a career in data science might be put off by the thought that data science is tough to break into at a later age. But is that statement true? Should the over-50 crowd put down their textbooks…Continue
Added by Vincent Granville on December 10, 2019 at 11:51am — No Comments
We study the properties of a typical chaotic system to derive general insights that apply to a large class of unusual statistical distributions. The purpose is to create a unified theory of these systems. These systems can be deterministic or random, yet due to their gentle chaotic nature, they exhibit the same behavior in both cases. They lead to new models with numerous applications in Fintech, cryptography, simulation and benchmarking tests of statistical hypotheses. They are also…Continue
Added by Vincent Granville on November 29, 2019 at 2:30am — No Comments
Summary: 99% of our applications of NLP have to do with chatbots or translation. This is a very interesting story about expanding the bounds of NLP and feature creation to predict bestselling novels. The authors created over 20,000 NLP features, about 2,700 of which proved to be predictive, with a 90% accuracy rate in predicting NYT bestsellers.…Continue
Added by Vincent Granville on November 28, 2019 at 10:00pm — No Comments