A Data Science Central Community
Got into an interesting discussion today about the basic requirements for someone to learn and perform data mining within a company.
While I am sure everyone agrees some understanding of statistics is imperative, the discussion was about how much and in what areas.
One person (me) suggested that a good second class in statistics (one that covers regression) is all the classical statistics a person needs to get into learning data mining and other standard predictive models.
My colleague suggested a person needed much more formal statistics training to properly use, build and deploy these types of models.
My comment was partially based on the many Data Science and Analytics degrees out there, which seem to require only one or two stats classes to get started. My colleague stated that if people don't understand the underlying statistics and probability better than that, they will misuse the models.
The conversation started over whether an MBA with a couple of stats classes would be prepared to learn data mining with R (e.g., by getting a data mining certificate).
We did not define "data mining" per se, but we were really thinking of the types of standard data mining a company might do.
So, I thought I would ask for comments, since almost everyone on this board is a data miner of some sort. I am sure there is no right answer; we were just interested in some other opinions.
This might provide some food for thought, from the point of view of your colleague's reservations around model misuse:
Sounds like you and your friend are talking about two different things. Any bright person who has survived statistics 102 (with regression) can learn data mining and mechanically construct mining models. What your colleague is talking about is really what separates the art from the science. If you don't understand how the statistics are operating under the hood, you never really have the ability to question the results you obtain, or to tie them to the real-world problem.
Yes, statistics will ALWAYS give you an answer. Sometimes the answer will look very good.
But whether or not it has anything to do with the problem is another question, and that is why your colleague is right.
I concur with Ralph. I think it is important to understand what is going on "underneath the hood." I also think it is VERY important for people to think about the underlying theory of statistics and how it relates to a problem. Anyone can search the web for some R code or use SPSS or the like to run an analysis, but people should think about the problem. Nate Silver, in his book "The Signal and the Noise," describes why models fail in different domains. To me, his central thesis was that analysts need to use a practical approach and think about how variables relate to the problem being analyzed and consider different approaches. He also advocates for the Bayesian approach over the Frequentist, but that is another topic.
Nice thought-provoking post. Through the comments of Ralph and Mike, I found that I am not alone in thinking about the importance of knowing the details and assumptions behind the statistics used. I have never really believed in the black-box, no-modification-to-parameters approach to getting an answer from data mining analytics.
While math is a large part of data mining, and it provides a means to articulate the findings, it is also important to understand the data itself. Understanding the data is crucial whether you work at a $50k company or a $500 million company. A $50k company's own data, based on its own efforts, is a little more difficult to draw conclusions from even with that understanding; more often than not, external data must be used.
As for the $500 million company, the data can be extensive and very useful. The analytics must be done so as to NOT come to the wrong conclusion. Striking a balance between noise and signal in the data is important. You must be able to understand the data you are looking at before doing any kind of analytics or using any kind of statistics.
The art of analytics is being able to match the data with the appropriate statistics/math. The science of analytics is being able to apply that statistics/math on a longer-term basis to the repeated incoming data.
For example: businesses influence customers' actions through a variety of means. In understanding the data and performing data mining, you would want to separate out the customers who were influenced by the business's specific actions from those whose decisions the business did not influence. In short, does the data illustrate the dynamics of reality (with minimal noise), or does the data need to be augmented with other data or knowledge? This seems simplistic, but I have often come across cases where this was missed and the wrong conclusions were drawn from the data.
There are a lot of types of analytics: sales strategy, pricing of products and the changes in pricing, advertising effects, policy effects, business strategy effects, and there is the whole economics side of the problem.
This is an interesting topic.
There appears to be no universal BoK (Body of Knowledge) for this domain. While it would be good to have one, it might be difficult to define.
There are all kinds of people calling themselves data professionals/experts. Some are domain experts who have basic training in statistics, while others are certified statisticians trying to learn the domain. There is also a third category, i.e., the hardcore software professionals who are trying to learn statistics as well as the domain.
The big challenge for each individual is to "skill up" adequately in each of these areas. Are we asking for too much?
I feel data professionals should have a sound knowledge of classical statistics. Just my opinion.
I'm pretty new to this stuff, but I agree with Myles.
In my work I've needed to do some unorthodox predictive modelling. I usually try to bounce ideas off people who know more than I do, so I approached several MSc and PhD statisticians, who were completely useless. I then went to some machine learning experts and found them much more able to think outside the box. The machine learning guys generally had "stats 102" or less as far as classical statistics background goes, but seemed to have a better understanding of how to truly explore, analyse, and predict from data.
I agree that it's poor form to simply hit the "analyse" button on some software platform without understanding what's going on in the background. But is extensive stats training giving us the right kind of understanding?
Thanks for the responses thus far. Let me throw some more gas on the fire.
For those who propose that more stats is better, help me with this one: how much stats, and in what areas?
Many of us come to this field from different areas. We are strong in one area and tend to learn the rest.
Now, I have a soft spot for stats and certainly value all that I learned. But as others have pointed out, a successful data miner or data scientist needs a good foundation in stats, programming/CS, a keen business sense (to understand the problem), an investigative nature, and good communication skills. All of these are learned skills.
Let's say you have an MBA with a strong background in econ, finance, or even MIS, who currently works as a business analyst (and therefore certainly had at least two stats classes (stats 102) and a little work with linear algebra).
They want to upgrade themselves to become a data miner and a data scientist. What else would you recommend they take to properly understand the black box?
Do they need to get a master's in stats and then learn the DS field (all the other pieces I mentioned)? Are there specific stats classes beyond 102 that one needs in order to be a better practitioner?
How would you advise this person? (This is really the gist of my entire question- as I get it all the time).
PS: My background was littered with graduate-level stats and econometrics. I try to think which of those classes helped me better understand the black box, and I have a hard time identifying which one(s) gave me the insight.
Thanks for this fun conversation.
Good point to focus the conversation. Let me preface my thoughts by stating that I do NOT claim to be a data miner nor a statistician. I am an Industrial Engineer who had several stats courses. That being said, I have analyzed numerous experiments as well as explored raw data for information. Back to the question at hand: I think everyone should have a base knowledge of stats (e.g., stats 101). After that, I suggest people learn data reduction techniques (e.g., multiple regression, clustering, PCA, and FA). Based on my experience, I have had to use a lot of nonparametric techniques, so I think they come in handy. It is important to keep in mind that stats is a growing field. To me, being able to research different techniques and choose the one that best fits the needs of a specific project is more important than taking all the stats classes you can. I highly suggest Nate Silver's book, "The Signal and the Noise." It is a quick, easy read for the novice. It shows how simple stats can be used for a myriad of problems, as well as how complex stats can muddy the waters.
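To make the data-reduction suggestion above concrete, here is a minimal sketch of PCA followed by clustering on the reduced data. The thread talks about R, but an equivalent Python/scikit-learn version is shown here; the dataset is synthetic and purely illustrative.

```python
# Sketch: PCA for data reduction, then k-means on the reduced data.
# Synthetic data; assumes numpy and scikit-learn are installed.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 200 observations, 10 features, with two features made highly correlated
X = rng.normal(size=(200, 10))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

# Reduce the 10 original features to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                      # 200 rows, 2 components
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained

# Cluster in the reduced space
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_reduced)
print(sorted(set(labels)))
```

The point, echoing the post above, is not the mechanics (three lines of library calls) but the choices: how many components to keep, whether clustering in reduced space suits the problem, and what the components mean for the business question.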
This is a very important topic, and I have seen most organizations struggling to understand the right mix of qualities. My experience suggests that one needs to wear three hats simultaneously:
1. Hat of a theory expert: at least a statistician/mathematician, and on top of that, depending on the need, the skills of an econometrician, NLP specialist, etc. may help. This means a theoretical understanding of stats AND mathematics. When someone has only superficial knowledge, I have often seen "analysts" using logistic regression "because the needed predictor was binary (Yes/No)", without even caring to check whether the data was suitable for the technique used;
2. Hat of a statistical tool expert: given that one handles 'tons' of data, one needs to know at least 2-3 tools. Every tool has its limitations and benefits; without knowing what is possible where, it is difficult to handle the data. E.g., k-means++ may not be available in a particular tool, so one needs to know all this;
3. Hat of a BUSINESS consultant: I have often seen expert statisticians get so swayed by theory that they forget the business issue/context/need! Extreme analytical insights that have little business value are of no use; similarly, an oversimplified view when the stakes are very high may be damaging. We must understand that no analysis is often a much better scenario than an incorrect analysis!
That is exactly where the issue is. One may find good statisticians, but they are very often unable to wear the hat of a consultant! Even when they do, the expertise is often limited, and that too to a single tool... More often, the IT "experts" or the "tool operators" (with little understanding of business context or theory) complicate the situation.
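The logistic-regression misuse in point 1 above is worth making concrete: the technique models a binary *outcome*; a binary *predictor* on its own is no reason to reach for it. A hedged sketch in Python/scikit-learn (the thread mentions R; this synthetic example is only an illustration of the distinction):

```python
# Logistic regression models a binary OUTCOME (y); a binary predictor
# alone does not call for it. Synthetic data; assumes scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)              # continuous predictor
x2 = rng.integers(0, 2, size=n)      # binary predictor -- fine as an INPUT
# True data-generating process: binary outcome driven by both predictors
log_odds = 0.8 * x1 + 1.2 * x2 - 0.5
y = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)  # binary OUTCOME

X = np.column_stack([x1, x2])
model = LogisticRegression().fit(X, y)
print(model.coef_)  # estimated effects of x1 and x2 on the log-odds of y
```

The mechanics take two lines; knowing that the suitability check concerns the response variable, the linearity of the log-odds, and the sampling, is exactly the "theory hat" being described.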
Hello V Shekhar!
Your phrase, "if the data was suitable for the technique," is refreshing. I find that some, perhaps most, business analysts just "whip out the regression" technique because they learned it in an MBA class, without understanding how the data is distributed or whether a large enough sample was collected. Classical methods often assume the data follow a known distribution, such as the standard normal (mean zero, standard deviation one) or chi-squared. It has been several years since I practiced econometrics or built credit risk and interest rate risk models for scoring and pricing loans; however, I seem to recall basic stats and econometrics having some fundamental theories around classical assumptions. I remember having to test whether multicollinearity or heteroscedasticity existed in the regression model and determine workarounds.