A Data Science Central Community

**Is data mining more about fitting data well? - Exercise Results**

Today I am going to share the results of an exercise that I recently carried out for a start-up. The intention of the study was to identify the major attributes that drive less-experienced, inexperienced, or re-skilled data miners towards a given objective, and to understand where they fall short. The twist is that the majority of them gave the same conclusion or explanation for the given objective. The results highlight important aspects of the practice that most of them failed to recognize for the sake of a quick answer or solution.

Sample Observed:

All members of the sample had experience with both R and data mining solutions, either through course projects (free, paid, or part of a curriculum) or through industry experience. The industry-experienced members had between 1 and 3 years of experience, in any domain. Details of the sample are as below:

a) 17 - Freshers from various engineering backgrounds (both graduates and post-graduates)

b) 12 - Freshers from various quantitative backgrounds (Maths, Stats, MBAs, Econometrics, etc.)

c) 18 - Experienced members from different industry backgrounds (data management related, programming, consulting, etc.)

d) All members of the sample belong to two major cities of India.

About Test Data:

Bank data on the customers of a particular city branch, with around 17,000 observations covering a period of one month. It has information about each customer's age, a few demographics, the number of transactions they made in that month, whether they visited the branch in that month, etc., for a total of 12 variables.

Infrastructure Provided:

A computer with 8 GB RAM and an Intel Core i7 processor, with the latest R (3.1.1) and RStudio pre-installed.

Objective:

“Comment about the variables ‘visiting branch’ and ‘age’ relationship”.

Time Limit:

A time limit of 20 minutes was given, which was almost two and a half times the average time that the experienced people took to give their comments.

Highlights from the Exercise:

- As mentioned earlier, almost all participants, except a few, gave the same inference: that ‘number of visits to the branch’ has a positive relationship with the ‘age’ of the customers. In other words, as age increases, customers prefer to visit the branch. It is also worth mentioning that most of them were comfortable with R programming, apart from a few typos; kudos to all the developers making it more user-friendly.
- Astonishingly, only 21% of the sample did some data understanding after reading the data, i.e. looked into descriptive statistics through summary functions or plots before moving on to the modeling part. Among these 21%, not a single member is from an engineering background (by saying this I am not generalizing, nor am I against the engineering background; I am commenting only from the sample's perspective). Another 15% came back to data understanding after fitting at least one or two models.
- One more astonishment is that the techniques employed by participants went as far as deep learning methods. The average number of models applied across all participants was close to 3, although a few participants did not fit even a single technique/model.
- Only 15% of the sample clearly mentioned that the result may be spurious, or declined to comment on the relationship due to noise in the data; however, only half of them came up with explanations for this.
- A notable fact from our exercise is that many of them directly applied the techniques they were aware of (a few among them directly fitted neural networks, and then came back to machine learning classification techniques because they needed to comment on the relationship). More than half of the sample first tested with a variant of the Generalized Linear Model, then moved on to other techniques when they found that the explanatory power of the model was low, and kept chasing data mining techniques until the time limit ended.
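
The data understanding step mentioned in these highlights can be sketched in a few lines. The participants worked in R; the sketch below uses Python's standard library only so that it stays self-contained. The `ages` values, the function names `describe` and `out_of_range`, and the plausibility bounds of 18 and 100 are all assumptions made for illustration; none of them come from the exercise data.

```python
import statistics


def describe(values):
    """Return basic descriptive statistics for a numeric column."""
    ordered = sorted(values)
    return {
        "n": len(ordered),
        "min": ordered[0],
        "median": statistics.median(ordered),
        "mean": statistics.mean(ordered),
        "max": ordered[-1],
    }


def out_of_range(values, low, high):
    """Return the values that fall outside the plausible [low, high] range."""
    return [v for v in values if v < low or v > high]


# Hypothetical 'age' column; the numbers are made up for illustration.
ages = [23, 31, 50, 47, 120, 38, 64, 186, 29, 90]

summary = describe(ages)
# The bounds are an assumption about what a plausible customer age looks like.
suspicious = out_of_range(ages, low=18, high=100)

print(summary)
print("Suspicious ages:", suspicious)
```

A glance at the minimum and maximum of each variable, before any model is fitted, is often enough to raise the question of whether a column means what its name says.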

What was wrong in the data?

When this data was originally received, I observed that, due to a machine or man-made mistake, the column ‘age of the customer’ had been recorded in an additive way: the age was added up once for every visit. For instance, if a customer had visited the branch twice in the month and his original age was 25, it appeared as 50. Hence the positive relationship as age increased; however, this was not the case after the noise was removed.
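
The effect of this kind of error can be reproduced on synthetic data. The following is a minimal sketch, not the original analysis: the simulated ages, the visit counts, the sample size, and the helper `pearson` are all assumptions. In particular, every simulated customer visits at least once, because the write-up does not say how customers with zero visits were recorded.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)  # fixed seed so the sketch is reproducible
n_customers = 2000       # hypothetical sample size

# True age and number of visits are drawn independently of each other.
true_age = [rng.randint(18, 80) for _ in range(n_customers)]
visits = [rng.randint(1, 4) for _ in range(n_customers)]

# The recording error: age is added up once per visit (25 with 2 visits -> 50).
recorded_age = [a * v for a, v in zip(true_age, visits)]

# Noise removal under the same assumption: divide by the number of visits.
restored_age = [r / v for r, v in zip(recorded_age, visits)]

r_recorded = pearson(visits, recorded_age)
r_restored = pearson(visits, restored_age)

print(f"visits vs recorded age: r = {r_recorded:.2f}")
print(f"visits vs restored age: r = {r_restored:.2f}")
```

With the corrupted column the correlation is strongly positive even though age and visits were generated independently; after dividing the recorded age by the number of visits it drops to roughly zero. Any model fitted to the corrupted column would be describing the recording error rather than customer behaviour.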

Summary:

Data mining is a process of many stages, as depicted in CRISP-DM^{1}, and data understanding is key among them. I always suggest processing your data incrementally if you want an efficient analytical solution; ignoring this and simply employing whatever fits the data well may not work in all situations.

^{1} http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Da...

The author thanks the management of the start-up for allowing him to publish the exercise highlights. He has undertaken several programs towards analytical talent development; the views expressed here are from his industry experience. He can be reached at [email protected] for more details.

© 2020 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC
