A Data Science Central Community
As an Applied Statistician with a Bayesian background ["Bayesian Inference in Life-Testing," Ph.D. 1968, Dept. of Math., Indian Institute of Technology, Kharagpur, India], I would like to add some views on this topic of Data and Data Science. Living and working in the USA for the last few decades has given me experience with, and exposure to, different phases of data development. Earlier, not much data was available, and we had to go after it like a precious gem. Now there is so much data, of all kinds, that there is a shortage (or at least it appears so) of trained people to handle it.
Let me take this moment to add that although my Ph.D. was awarded by IIT in India, my thesis on 'Bayesian Inference in Life Testing' was examined in the USA and accepted as a new piece of work at that time. Having got that load of humiliation off my chest, which I carried for years for not having another Ph.D. from the USA, I can assure you that I have always kept my knowledge up to date with the state of the art. None of my USA employers ever thought that I required another degree, as long as I did my work and satisfied the other requirements of the job, which I did. So, now that I am retired, I feel I have earned the right to express my professional opinion in this area. Please bear with me until the end of this write-up, and then please do comment if you wish. Here are my observations on the role of statistical methods in Data Science in general.
1. Traditional or Classical Approach: A very short summary in the simplest words possible
In an estimation problem, where one looks at data to draw an inference about a 'characteristic' of a population, this approach mainly uses a sample taken at 'random' from a collection of similar items. An 'estimate' of that characteristic (also known as a parameter) of the collection (or universe, or population) is computed from the sample. The estimate is then tested to find out how close it might be to the true parameter, which is usually unknown. Graphical methods such as EDA (Exploratory Data Analysis) are also used to study and guess the nature of the characteristic in the population, based on the data from the sample. Sampling is repeated or replicated several times to reduce the error in the estimate. Most of the time, certain assumptions are made in the classical approach. In an estimation process followed by a testing of hypotheses, one assumption is that the parameter in the universe is a fixed quantity and has no variation. Other assumptions are also made.
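The sampling idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population (10,000 component lifetimes drawn from an exponential distribution with a mean of 500 hours), the sample size of 50, and the 200 replications are all made-up values, and the function names are hypothetical.

```python
import random
import statistics

def sample_mean(population, n, rng):
    """Estimate the population mean from a simple random sample of size n."""
    return statistics.mean(rng.sample(population, n))

def replicate(population, n, replications, rng):
    """Repeat the sampling and return the list of estimates."""
    return [sample_mean(population, n, rng) for _ in range(replications)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 10,000 lifetimes (hours), exponential, mean 500.
population = [rng.expovariate(1 / 500) for _ in range(10_000)]
true_mean = statistics.mean(population)  # the fixed, normally unknown parameter

estimates = replicate(population, n=50, replications=200, rng=rng)
pooled = statistics.mean(estimates)     # average over all replications
spread = statistics.stdev(estimates)    # how much single estimates vary

print(f"true mean            : {true_mean:.1f}")
print(f"one sample estimate  : {estimates[0]:.1f}")
print(f"mean of 200 estimates: {pooled:.1f}")
print(f"spread of estimates  : {spread:.1f}")
```

A single sample of 50 can miss the true mean by a fair margin, while the average over the replications should sit much closer to it. Note that the sketch treats the true mean as one fixed number, which is exactly the classical assumption mentioned above.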
The field of Statistics with the classical approach is a vastly developed field of excellent work, with applications in many different areas of science and technology. However, the assumptions made are not always entirely realistic, which brings me to my next comment.
2. Bayesian Approach: A short summary
Thomas Bayes, a clergyman interested in probability, wrote an essay, published in the Philosophical Transactions of the Royal Society in 1763, suggesting a logical reasoning approach that uses conditional and marginal probabilities to resolve a probability problem. Thus the idea of prior and posterior probabilities was born. However, this approach was largely ignored by the statistical community for nearly two centuries. Not until the twentieth century did the Bayesian method and Bayes' rule start taking root in statistical thinking.
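To make the step from prior to posterior concrete, here is a minimal Python sketch of Bayes' rule over a finite set of hypotheses. The two suppliers, their shares and their defect rates are made-up numbers used only for illustration, and the function name `posterior` is hypothetical.

```python
def posterior(prior, likelihood):
    """Apply Bayes' rule over a finite set of hypotheses.

    prior:      dict mapping hypothesis -> P(H)
    likelihood: dict mapping hypothesis -> P(data | H)
    """
    joint = {h: prior[h] * likelihood[h] for h in prior}  # P(H and data)
    marginal = sum(joint.values())                        # P(data)
    return {h: joint[h] / marginal for h in joint}        # P(H | data)

# Hypothetical numbers: share of parts from each supplier (the prior) and
# each supplier's defect rate (conditional probability of a defect).
prior = {"supplier_A": 0.6, "supplier_B": 0.4}
defect_rate = {"supplier_A": 0.01, "supplier_B": 0.05}

post = posterior(prior, defect_rate)
for supplier, p in post.items():
    print(f"P({supplier} | defect) = {p:.3f}")
```

With these figures, a defective part is more likely to have come from supplier B, even though B supplies fewer parts. The observed defect has shifted the prior probabilities into posterior ones.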
The approach applies when there is some prior knowledge about the parameter: for example, that it is not fixed but varies within a range, and that a probability can be assigned to that variation. The traditional methods cannot make use of such knowledge, and so cannot resolve certain problems. These are mostly stochastic situations, where the population itself is shifting along with all its parameters. A few examples are climate data (carbon emissions, temperature rise, etc.), medical data, automobile data, and electronic components or assemblies, where changes, progress and refinements in the original population are going on continuously. So real-time data acquisition has been emphasized more and more, to reflect the real situation as closely as possible.
With the innovations in high-speed computation and simulation techniques, Bayesian methods are proving useful where acceptable results were not possible any other way. The earlier unpopularity of Bayesian methods had two main reasons: the difficulty of assigning the prior probability, and the complicated mathematical expressions encountered, which were mostly analytically intractable.
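As a small illustration of how computation can replace an intractable formula, the Python sketch below approximates a posterior numerically on a grid. It is a simplified stand-in for the simulation techniques mentioned above, not any particular published method. The assumptions are mine: lifetimes are exponential with an unknown failure rate, the prior says the rate lies somewhere between 0.001 and 0.010 per hour with equal probability, and the eight failure times are made-up data.

```python
import math

def grid_posterior(times, lo, hi, steps=1000):
    """Posterior of an exponential failure rate under a uniform prior on [lo, hi].

    Returns the grid of candidate rates and their posterior probabilities.
    """
    n, total = len(times), sum(times)
    width = (hi - lo) / steps
    grid = [lo + (i + 0.5) * width for i in range(steps)]  # cell midpoints
    # Log-likelihood of exponential lifetimes: n*log(rate) - rate*total.
    # The uniform prior only adds a constant, so it drops out on normalizing.
    log_post = [n * math.log(r) - r * total for r in grid]
    peak = max(log_post)                       # subtract the peak for stability
    weights = [math.exp(lp - peak) for lp in log_post]
    norm = sum(weights)
    return grid, [w / norm for w in weights]

# Hypothetical failure times in hours.
times = [120, 340, 90, 410, 260, 175, 530, 60]

grid, probs = grid_posterior(times, lo=0.001, hi=0.010)
post_mean = sum(r * p for r, p in zip(grid, probs))
classical = len(times) / sum(times)  # maximum likelihood estimate of the rate

print(f"classical estimate of rate: {classical:.5f}")
print(f"posterior mean of rate    : {post_mean:.5f}")
```

No closed-form expression is needed here: the prior, the likelihood and the normalization are all handled by summing over the grid. The same idea is what makes more difficult priors workable on a computer.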
3. Data And Data Science:
Now, this is an entirely different scenario. The continuous data streams of all kinds, structured, unstructured, formatted, not formatted, images, tweets, words, etc., make it a challenge for all. Not only the analyses, but also the storage of and access to the data are challenging undertakings. Obviously, the role of a Statistician is no longer the traditional one, but more of an advisory capacity, should there be a requirement. I am not sure what a typical job description for a Data Scientist is, but I am certain it is not to be a Statistician.
So, in my view, the nature of the incoming data needs to be studied. I suppose each organization has its own quantitative definition of the qualitative term 'Big Data'. Most of the time, the stream gives a single piece or multiple pieces of information on a single data point.
a. A visualization or graphical software embedded at the right location in the processing network is desirable. The Hadoop network is well established and allows such software to be installed. This first visualization is important because it gives an overall view of the data stream. One is no longer studying a sample but the whole population, with all the changes going on in real time. Simple Bar Charts or Pareto Charts can give the basic information for a certain time frame and guide how the next step is developed.
b. The important factors, signals or variables need to be isolated. The collection of statistical tools known as Data Mining (DM) is well tested and widely used. These tools need some fundamental knowledge of Multivariate Analysis. To develop a basic understanding of Multivariate Analysis, I suggest the book by Anderson, a classic that can be used both as a text and as a reference. There are several other books available in the market.
c. Known methods in DM are Factor Analysis (FA), Principal Component Analysis (PCA), Cluster Analysis, and Regression Analysis, including Linear, Logistic, Ridge and Multiple regression.
All of these are available on websites, and some programs written in R are also on the web. Software such as SPLUS can be used to write customized programs. Regular software packages such as MatLab, SWP and SAS are also available. However, embedding the capabilities of these into the data stream processing and the network might be challenging and needs to be studied.
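Returning to the first visualization step in item a, the numbers behind a Pareto Chart are easy to compute before any plotting software is involved. The Python sketch below tallies the categories seen in one time window, sorts them by frequency and adds a cumulative percentage. The event names and counts are made up, and the function name `pareto_table` is hypothetical.

```python
from collections import Counter

def pareto_table(events):
    """Return (category, count, cumulative percent) rows, most frequent first."""
    counts = Counter(events).most_common()  # sorted by descending count
    total = sum(count for _, count in counts)
    rows, running = [], 0
    for category, count in counts:
        running += count
        rows.append((category, count, 100.0 * running / total))
    return rows

# Hypothetical events seen in one time window of the data stream.
window = ["timeout"] * 7 + ["bad_format"] * 3 + ["missing_field"] * 2

rows = pareto_table(window)
for category, count, cumulative in rows:
    print(f"{category:15s} {count:3d} {cumulative:6.1f}%")
```

The bars of the chart are the counts and the line is the cumulative percentage. Feeding these rows to whatever charting tool is embedded in the network is then a separate, simpler step.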
Those who are interested in a more rigorous mathematical background for FA and PCA, and in how those two tools differ from each other, please refer to the following links. We found them to be informative.
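For readers who would like to see PCA at work before going into the mathematics, here is a minimal Python sketch for the two-variable case, where the eigenvalues of the covariance matrix have a simple closed form. It is a teaching illustration only: real data usually have many variables and call for a proper statistical package. The five data points are made up and the function name `pca_2d` is hypothetical.

```python
import math

def pca_2d(points):
    """PCA for two variables via the 2x2 sample covariance matrix.

    Returns the two eigenvalues (largest first) and the unit vector
    giving the direction of the first principal component.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form.
    mid = (sxx + syy) / 2
    half_gap = math.hypot((sxx - syy) / 2, sxy)
    eigenvalues = (mid + half_gap, mid - half_gap)
    # Angle of the principal axis belonging to the larger eigenvalue.
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(angle), math.sin(angle))
    return eigenvalues, direction

# Hypothetical data lying close to the line y = 2x.
points = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1)]

(lam1, lam2), direction = pca_2d(points)
share = lam1 / (lam1 + lam2)  # fraction of total variance on the first component

print(f"first component direction: ({direction[0]:.3f}, {direction[1]:.3f})")
print(f"variance explained       : {share:.4f}")
```

Because the points lie almost on a straight line, nearly all of the variance falls on the first component, and two variables can be summarized by one. That reduction is the purpose of PCA.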
I will post more suitable references from time to time in this subgroup of Data Science Central.
Thank you for giving me the opportunity to post my write-up.