
Below is my personal list of statistical and machine learning methods that every data scientist should know in 2016:

1. Statistical Hypothesis Testing (t-test, chi-squared test & ANOVA)
2. Multiple Regression (Linear Models)
3. Generalized Linear Models (GLM: Logistic Regression, Poisson Regression)
4. Random Forest
5. Xgboost (eXtreme Gradient Boosted Trees)
6. Deep Learning
7. Bayesian Modeling with MCMC
8. word2vec
9. K-means Clustering
10. Graph Theory & Network Analysis
(A1) Latent Dirichlet Allocation & Topic Modeling
(A2) Factorization (SVD, NMF)

Based on my four years in the data science industry, I think these 12 methods are currently the most popular, the most useful, and suitable for the widest range of problems requiring data science. As far as I know, there have already been quite a few lists of "representative methods in data science". However, some of them feel out of date because they neglect the latest advances of data science in industry. So I made this list as one written by a business person who knows the practical problems, and the data science solutions to them, including statistics and machine learning, in industry.

In addition to the list itself, I provide R or Python scripts for an experiment on a sample dataset for each method, so that readers can try each one easily. The original post, including the R or Python scripts and the experiments on sample datasets, is here.
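To give a taste of what those per-method scripts look like, here is a minimal pure-Python sketch of one of the listed methods, K-means clustering (Lloyd's algorithm). This is my own toy illustration on two synthetic blobs, not the code from the original post:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means (Lloyd's algorithm) on 2-D points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0] - centers[c][0]) ** 2
                                + (p[1] - centers[c][1]) ** 2)
            clusters[i].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = (sum(p[0] for p in cl) / len(cl),
                              sum(p[1] for p in cl) / len(cl))
    return centers

# Two well-separated blobs, around (0.2, 0.2) and (10.2, 10.2).
pts = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
pts += [(10 + 0.1 * i, 10 + 0.1 * j) for i in range(5) for j in range(5)]
print(sorted(kmeans(pts, 2)))
```

With blobs this far apart, the centers converge to the two blob means regardless of the random initialization.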

I've actually known about MXnet for weeks as one of the most popular libraries/packages among Kagglers, but only recently did I hear that bug fixing was almost done, and some friends said the latest version looked stable, so I finally installed it.

MXnet: https://github.com/dmlc/mxnet

I think the most important feature of MXnet is that it implements not only Deep Neural Networks (DNN) but also Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) in R; as far as I know, there has been no other R package implementing CNNs (or RNNs).

In the original post on my blog, I tried a CNN with the {mxnet} R package on a short version of the MNIST handwritten digit dataset, whose maximum achievable accuracy is probably below 0.98 because of its small sample size. The {mxnet} CNN reached an accuracy of 0.976: better than Random Forest (0.951), Xgboost (0.953), or a DNN with {h2o} (0.962). In addition, the {mxnet} CNN ran very fast (270 sec), much faster than the {h2o} DNN (tens of minutes).

MXnet is a framework distributed by DMLC, the team also known for Xgboost. Its documentation now looks complete, and even pre-trained models for ImageNet are distributed. This should be good news for R users who love machine learning, so let's go.

CNN is a variant of Deep Learning, well known for its excellent performance in image recognition. In particular, after a CNN won ILSVRC 2012, CNNs became more and more popular in image recognition. The most recent success of CNNs is probably AlphaGo, I believe.

The original (long) post, with source code, an illustration on a classification problem, and detailed explanations, is here.
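For readers who haven't touched CNNs yet, the core operation is just a small kernel slid across the image. Here is a minimal pure-Python sketch of that operation (my own illustration, not MXnet code; note that deep learning frameworks actually compute cross-correlation rather than flipped convolution):

```python
def conv2d(image, kernel):
    """'Valid'-mode 2-D convolution as CNN frameworks compute it
    (cross-correlation): slide the kernel over the image and take
    the dot product at each position."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            s = sum(image[r + i][c + j] * kernel[i][j]
                    for i in range(kh) for j in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical-edge detector on a tiny "image" with a bright right half.
img = [[0, 0, 1, 1]] * 4
k = [[-1, 1],
     [-1, 1]]
print(conv2d(img, k))  # the edge column lights up: [[0, 2, 0], ...]
```

A CNN stacks many such learned kernels with nonlinearities and pooling, which is what makes it so strong on images like MNIST.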

I wrote a blog post inspired by Jamie Goode's book "Wine Science: The Application of Science in Winemaking". In this book, Goode argues that a reductionist approach cannot explain the relationship between chemical ingredients and the taste of wine. Indeed, we know that not all high-alcohol wines are excellent, although in general high-alcohol wines are believed to be good. Usually the taste of wine is shaped by a complicated balance of many components, such as sweetness, acidity, tannin, and density, that are given by the corresponding chemical entities.

However, I think (and probably many other data science experts would agree) that this is not a limitation of the reductionist approach, but a limitation of univariate modeling. To illustrate this, I performed a series of multivariate modeling experiments with random forests and other models on the "Wine Quality" dataset from the UCI Machine Learning Repository. A random forest classifier predicted the tasting score of wine better than intuitive univariate modeling, and at the same time it revealed some hidden and complicated dynamics between the chemical ingredients and the taste. I believe modern multivariate modeling, such as machine learning, can reveal more of the complicated relationship between chemical ingredients and the taste of wine.

See my blog post below for more details.
http://tjo-en.hatenablog.com/entry/2015/11/27/002241
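To make the univariate-vs-multivariate point concrete, here is a tiny pure-Python illustration (synthetic data, not the UCI Wine Quality set; the "alcohol"/"acidity" names are hypothetical). Quality depends on an interaction between two variables, so neither variable alone correlates with it, yet a two-variable rule predicts it perfectly:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient, computed by hand."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Toy data: "quality" is high only when alcohol and acidity are
# balanced (same sign) -- an interaction effect.
rows = [(a, v, 1 if (a > 0) == (v > 0) else 0)
        for a in (-1, -1, 1, 1) for v in (-1, 1)]
alcohol = [r[0] for r in rows]
quality = [r[2] for r in rows]

# Univariate view: alcohol alone has zero correlation with quality.
print(pearson(alcohol, quality))  # 0.0

# Multivariate rule using both variables classifies perfectly.
pred = [1 if (a > 0) == (v > 0) else 0 for a, v, _ in rows]
acc = sum(p == q for p, (_, _, q) in zip(pred, rows)) / len(rows)
print(acc)  # 1.0
```

This is exactly the kind of structure a random forest can pick up automatically, and a univariate analysis will always miss.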

I wrote a series of blog posts on Bayesian modeling with R and Stan.

Bayesian modeling with R and Stan (1): Overview
Bayesian modeling with R and Stan (2): Installation and an easy example
Bayesian modeling with R and Stan (3): Simple hierarchical Bayesian model
Bayesian modeling with R and Stan (4): Time series with a nonlinear trend
Bayesian modeling with R and Stan (5): Time series with seasonality

Stan is a growing platform for MCMC computation implemented in C++. Compared to WinBUGS or OpenBUGS, it is very fast and can be programmed intuitively. This series of posts shows how to install Stan alongside R, how to run it, and how to apply it to actual datasets. I hope it makes practicing Bayesian modeling easier than ever.
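Under the hood, Stan uses a far more sophisticated MCMC algorithm (Hamiltonian Monte Carlo), but the basic idea behind all MCMC can be sketched with a random-walk Metropolis sampler in a few lines. This toy pure-Python version (my own illustration, not Stan code) samples from a standard normal "posterior":

```python
import math, random, statistics

def metropolis(logpost, init, n=20000, step=1.0, seed=1):
    """Random-walk Metropolis, the simplest MCMC algorithm:
    propose a jump, accept it with probability min(1, posterior ratio)."""
    rng = random.Random(seed)
    x, lp = init, logpost(init)
    samples = []
    for _ in range(n):
        prop = x + rng.gauss(0, step)
        lp_prop = logpost(prop)
        # Accept with probability exp(lp_prop - lp), capped at 1.
        if rng.random() < math.exp(min(0.0, lp_prop - lp)):
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

# Target: a standard normal posterior, log p(x) = -x^2 / 2 + const.
draws = metropolis(lambda x: -x * x / 2, init=0.0)
print(statistics.mean(draws), statistics.stdev(draws))  # near 0 and 1
```

The sample mean and standard deviation converge to those of the target distribution; Stan's HMC does the same job with far fewer effective samples wasted.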

A/B testing is widely used for online marketing, management of Internet ads, and other everyday analytics. In general, people use it to look for "golden features" (metrics) that are vital points for growth hacking. To validate an A/B test, statistical hypothesis tests such as the t-test are used, and people try to find a metric with a significant effect across conditions. If you find a metric with a significant difference between designs A and B of a click button, you'll be happy: such a metric provides a rule-based predictor for a KGI/KPI. For example, a landing page with button A increases the conversion rate by 2%.

But unfortunately you may encounter a very bad situation: no metric shows any significant difference between the conditions. In that case, do you give up on golden features and their rule-based predictors?

You don't have to give up. Multivariate modeling, such as (generalized) linear models or machine learning classifiers, can build a good model to predict a KGI/KPI without any "golden features". In the latest post on my blog, I discuss such a case, in which no golden features show any significant differences yet multivariate modeling works.

This is a long article. Click here for details (datasets, R source code, and the statistical tests and models used: an L1-penalized logistic regression with the glmnet library, and Welch's two-sample t-test, in R).

Conclusions

The result tells us that univariate statistics, and the rule-based predictors given by the usual hypothesis testing on them, sometimes fail, while multivariate models such as (generalized) linear models or machine learning classifiers work well. In general, multi-dimensional and multivariate features represent more complex information and internal structure of a dataset than univariate features.
But in many marketing situations, quite a few people neglect the importance of multivariate information and persist in running univariate A/B tests, looking for "golden features or metrics". Even when multiple features have partial correlations, such univariate A/B testing can go wrong, because partial correlation easily distorts the usual univariate correlation (and hence univariate testing). If you have multivariate datasets, please try multivariate modeling, and don't insist on univariate A/B testing alone.
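As a reference point for the univariate side of the argument, Welch's two-sample t-test mentioned above is easy to compute by hand. A minimal pure-Python sketch, with made-up per-user conversion metrics for designs A and B (the numbers are illustrative, not from the post):

```python
import math, statistics

def welch_t(a, b):
    """Welch's two-sample t statistic and Welch-Satterthwaite
    degrees of freedom (no equal-variance assumption)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se2 = va / na + vb / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df

# Hypothetical conversion metrics for designs A and B.
a = [2.1, 2.5, 2.3, 2.7, 2.4, 2.6]
b = [2.0, 2.2, 2.1, 2.3, 2.2, 2.1]
t, df = welch_t(a, b)
print(t, df)
```

Comparing t against the t-distribution with those degrees of freedom gives the p-value; R's t.test() does all of this (Welch's variant by default) in one call.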

In my own blog I wrote a series of articles about how major machine learning classifiers work, with visualizations of their decision boundaries on various datasets.

Machine learning for package users with R (0): Prologue
Machine learning for package users with R (1): Decision Tree
Machine learning for package users with R (2): Logistic Regression
Machine learning for package users with R (3): Support Vector Machine
Machine learning for package users with R (4): Neural Network
Machine learning for package users with R (5): Random Forest
Machine learning for package users with R (6): Xgboost (eXtreme Gradient Boosting)
What kind of decision boundaries does Deep Learning (Deep Belief Net) draw? Practice with R and the {h2o} package

This series of articles has a simple goal: to help people easily understand the algorithm and theoretical features of each ML classifier. I believe visualization is the most important part.
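The idea behind those visualizations is simple: evaluate a trained classifier on every point of a grid and render each point's predicted class. A bare-bones pure-Python version with a hand-set linear classifier (my own sketch, not the R plotting code from the articles):

```python
def draw_boundary(classify, xs, ys):
    """Render a classifier's decision regions as an ASCII grid:
    '#' where classify(x, y) is true, '.' elsewhere."""
    return "\n".join(
        "".join("#" if classify(x, y) else "." for x in xs)
        for y in ys)

# A linear classifier: predict class 1 where x + y > 0.
grid = [i - 2 for i in range(5)]  # -2 .. 2
print(draw_boundary(lambda x, y: x + y > 0, grid, grid))
```

The diagonal edge between the '.' and '#' regions is the decision boundary; the articles do the same thing with a fine grid and colored plots for each classifier.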

As part of a series of posts discussing how machine learning classifiers work, I ran a decision tree to classify an XY-plane, trained on XOR patterns or linearly separable patterns.

1. Simple (non-overlapping) XOR pattern: It worked well. Its decision boundary was drawn almost perfectly parallel to the assumed true boundary, i.e. the XY axes.
2. Complex (overlapping) XOR pattern without pruning: An awful result; it never follows the true boundary.
3. Complex XOR pattern with pruning: A little improved, but it still appears to be overfitted.
4. Two-class linearly separable pattern: Never parallel to the true boundary.
5. Three-class linearly separable pattern: Even worse; it appears to overfit more than in the two-class case.

Throughout these experiments, I found that a decision tree alone easily overfits; it clearly requires additional methods, e.g. ensemble learning, in order to generalize.

Click here to read the full article.
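Why XOR is hard for shallow trees can be checked directly: the best single axis-aligned split (i.e. a depth-1 decision tree, a "stump") achieves only chance accuracy on a noise-free XOR pattern, because no one threshold on x or y separates the classes. A small pure-Python sketch (my own illustration, not code from the post):

```python
def stump_accuracy(points, labels):
    """Best training accuracy achievable by a single axis-aligned
    split (a depth-1 decision tree) on labelled 2-D points."""
    best = 0.0
    for axis in (0, 1):
        for thr in sorted({p[axis] for p in points}):
            for sign in (1, -1):  # which side predicts class 1
                pred = [1 if sign * (p[axis] - thr) > 0 else 0
                        for p in points]
                acc = sum(int(a == b)
                          for a, b in zip(pred, labels)) / len(labels)
                best = max(best, acc)
    return best

# Noise-free XOR: label 1 iff x and y have the same sign.
pts = [(x, y) for x in (-1, 1) for y in (-1, 1)]
lab = [1 if x * y > 0 else 0 for x, y in pts]
print(stump_accuracy(pts, lab))  # 0.5 -- no better than chance
```

A deeper tree can fit XOR exactly, but, as the experiments above show, depth alone invites overfitting; ensembles such as random forests average many such trees to generalize.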