
And the rise of the data scientist. These pictures speak better than words. They represent keyword popularity according to Google. These numbers and charts are available on Google.

Other public data sources include Indeed and LinkedIn (number of applicants per job ad), though they tend to be more job market related. 

Feel free to add your charts, for keywords such as newsql, map reduce, R, graph database, nosql, predictive analytics, machine learning, statistics etc.


Comment by Randy Bartlett on September 29, 2013 at 1:47pm

@Vincent, I agree.  Incentives, e.g., tenure, for academic statisticians are all about publishing.  The rankings of statistics departments are usually based on publications.  Sadly, the courses are publication-oriented.  If statistics departments can find a way to split research from teaching, that would be a strong first step.  Also, they could proactively seek feedback from the market place--take a random sample of past graduates. 

Even so, the number of statistics degrees is on the rise (http://magazine.amstat.org/blog/2013/05/01/stats-degrees/):

At the same time, I think the statistics departments are starting to feel the heat.  E.g., non-statistics professors want to teach statistics courses to their students.  There were some articles in Amstat News and there will be another next month (October, 2013). 

Comment by Vincent Granville on September 27, 2013 at 8:27am

I think one of the issues is that academic statisticians, who publish theoretical articles not based on data analysis, are... not statisticians anymore. That might be a way to look at it. Also many statisticians think data science is about analyzing data, but it is more than that. It also involves implementing algorithms that process data automatically, to provide automated predictions and actions, e.g.

  • automated bidding systems
  • estimating (in real time) the value of all houses in the US
  • high frequency trading
  • matching a Google Ad with a user and a web page to maximize chances of conversion
  • returning highly relevant results to any Google search
  • book and friend recommendations on Amazon or Facebook
  • tax fraud detection, detection of terrorism
  • scoring all credit card transactions (fraud detection)
  • computational chemistry to simulate new molecules for cancer treatment
  • early detection of an epidemic
  • analyzing NASA pictures to find new planets or asteroids
  • weather forecasts
  • automated piloting (planes, cars)
  • client-customized pricing system (in real time) for all hotel rooms 

All this involves both statistical science and terabytes of data.

Comment by Vincent Granville on August 30, 2013 at 11:15am

@Phillip: Regarding sparse matrices, read my article fast clustering for big data to see an example of how a 10,000,000 by 10,000,000 sparse matrix was replaced by a 50,000,000-entry hash table. This type of processing does not require knowledge of matrix theory. As far as time series are concerned, I believe it is necessary for a data scientist to be very familiar with correlograms and model-fitting (and how correlograms uniquely represent a time series model, under some circumstances), but not with spectral theory, deconvolution, fast Fourier transforms, or signal theory. That said, knowledge of entropy / signal compression and noise filtering techniques would be good to have.
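To make the hash-table idea concrete, here is a minimal sketch in Python. It is not the code from the article mentioned above: the function names, the toy data, and the choice of a (row, col) tuple as the key are all assumptions made for illustration. The point is only that storing the nonzero cells in a dictionary avoids allocating the full matrix, and that simple aggregates can be computed by looping over the stored entries.

```python
from collections import defaultdict

def build_sparse(entries):
    """Store only the nonzero cells of a matrix, keyed by (row, col).

    `entries` is an iterable of (row, col, value) triples; repeated
    keys are accumulated rather than overwritten.
    """
    table = defaultdict(float)
    for row, col, value in entries:
        table[(row, col)] += value
    return table

def row_sums(table):
    """Sum the stored values per row by scanning the hash table once."""
    sums = defaultdict(float)
    for (row, _col), value in table.items():
        sums[row] += value
    return sums

# Hypothetical toy data: a 10,000,000 x 10,000,000 matrix with 3 nonzero cells.
entries = [
    (0, 9_999_999, 2.0),
    (3, 7, 1.5),
    (3, 7, 0.5),          # same cell as above, so the values are added
    (9_999_999, 0, 4.0),
]
table = build_sparse(entries)
sums = row_sums(table)
```

Memory use grows with the number of stored entries, not with the nominal matrix dimensions. Reading a missing cell with table.get((i, j), 0.0) returns zero without adding a new key.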

By the way, thanks a lot for your great reply. I haven't read it yet (I am about to read it), but we always love to have great contributions from great bloggers! 

Comment by Phillip Middleton on August 30, 2013 at 10:52am

@Vincent: I would disagree that advanced components of matrix theory are unimportant to Data science. One of the more difficult problems to solve involves the issue of sparse matrices, and there aren't great solutions to this at this point.  A mere intro course into this area is unlikely to yield an understanding on how to deal with this problem at the level required to develop a solution. It requires a deeper understanding, with some wit, creativity, and perhaps some luck. 

The jury is out on whether or not I agree that understanding stochastic processes with ODEs is useful. When it comes to time series analysis, many methods are derived from 1st (and depending on your area, higher) order PDEs, whose mechanics are better understood when dealing with time-series signal processing (whether that be an EEG or an economic forecast). I think understanding the underlying machinery at this level will help further the creative development of better ways to process time series.

Comment by Phillip Middleton on August 30, 2013 at 10:43am

This ended up being more of a blog than a comment lengthwise. So bear with me :)

I really believe that the intense discussion of 'data scientist' vs. 'statistician' is one that's driven by quite a number of factors. And I think it is inappropriate, presumptuous at best to say that there is a 'death' of any one particular discipline in this area. The quantitative and computational arena, both in the theoretical and applied arena is evolving, but nevertheless much is grounded in the disciplines of statistical practitioners.

Here are some simple armchair thoughts about how and why I believe that 'data scientist' and 'statistician' are at their core nearly synonymous. Yet, social dynamics, technological development, and economic drivers have begun within the last 5-6 years to rapidly test the mettle of the 'traditional' statistician's role, and thus new paradigms of thought which call for the 'old guard' to evolve have come to life.

1) Social Dynamics
There is nothing secret about the coining of the term 'data scientist', for which 2 original members of LinkedIn were largely responsible. Interestingly, their goals were the same as any statistician's: to extract information from data. However, they approached the problem from the view of a computer scientist. That is, they began realizing that a lot of data could indeed be aggregated in interesting ways to elicit learning. Those who were somewhat AI savvy realized that one could train a machine to repeat, within a tolerable range of error, an inference that a technically adept human could make, but on a more massive scale. For some, this created the notion that Big Data equated to Big Information. That is, having Big Data would obviate the need for theoretical constructs to make inferences (for many, including me, this is an unreasonable and careless assumption, based on conjecture alone and little more). In other words, statistical inference was for the most part reaching a dead end.

Given that in general, algorithmic (and ML) approaches to extracting information could be more easily iterated into smaller chunks than the production of a robust statistical model, naturally industry was quick to adopt the idea. It was quick, sexy, and for some procedures quite intuitive. It basically ushered in the 'good enough' into decision-making thought for industry. However many of these approaches alone were never carefully examined for their intuitiveness. Few could explain the reasons for misclassification rates in unstructured data, nor quantify the uncertainty from decisions based on uncertain insights produced by these methods. Fewer still can clearly explain to a lay person how a neural net scores a particular entity for the probability of a future behavior given a multitude of factors (say vs generalized stat models, in which the contributing factors are spelled out quite well).

But when an idea becomes exciting, healthy skepticism for risk in the use of generally algorithmic approaches or a deeper understanding of precisely how and how well a business/research question is being asked can go by the wayside. Our emotions can easily distort our sense for reason.

I did mention 'speed'. Speed culture is the norm in industry, not necessarily because it *must* be always moving at 300mph, but because it is perceived, without any real evidence, that it must be this way. This is one of the reasons why most companies cannot pry themselves from the flurry of reports generated minute by minute which are merely aggregations of data, sometimes presented in a way which allows for quick 'at-a-glance' understanding. Yet, where this completely fails originates in our own neurologic construct: we can be easily biased, particularly based on what information we bring to the table when viewing something novel and previously unknown to us. We also have varying capacities to infer 'bigger picture' meaning from reports.

The point of this is, who is the audience of the end product, and to what end does the product provide value, given risk? If you sacrifice speed for accuracy/reliability/comprehensiveness, was the level of accuracy really necessary for a sufficiently evidence-based decision? If the converse, what critical factor may have been overlooked? Could one reach the same conclusion repeatably? What level of uncertainty in my decision poses excessive risk to my company?

In a recent forum on data science (I believe Teradata was monitoring this closely), several premier speakers would ask, what value is your work bringing to bear on your institution? How can that be asserted a priori when embarking on a project that may have a 'black box' construct? There was no person on the panel who could answer that question without major fumbling.

Someone who is sufficiently trained in statistics *should* be able to answer questions like that clearly, as statistical work includes developing an expertise in the subject under study. For medicine this means becoming truly immersed in some relevant aspect of it. For business, it means developing a business acumen and understanding in depth the target and dependent areas of the business under study. This goes far beyond sparse matrix problems or regression coefficients. It is the art of translating the business need into a scientific construct, or innovating and selling the need to develop something which demonstrates that broader level of expertise. One in this field can epitomize, in every way, a fairly Renaissance character. The perceived notion of a 'data scientist' should be no different in this regard.

But so far we're only talking at the level of the individual. Who has evolved into this multidisciplinary proficiency? I would argue very few at this point, though I expect this to change.

In this sense I would say that today, 'data science' works best in the 'hive think' environment, where math/stats/physics/etc, software dev, business, and architecture disciplines integrate in a way which communally develops solutions and products. This is not to say that there is no one 'data scientist'. There are, like I said, a number of talented individuals who break down the interdisciplinary walls quite well. But generally speaking, in the current context of industry, calling someone generically a 'data scientist' would in my mind be akin to calling someone a department. Each person has a role, and each may develop skills which overlap other roles to an ultimate level of proficiency in which they might indeed be this singular 'data scientist'. But these are mere titles, right? I suppose that depends on your potential paycheck (see #3 below).

One thing which concerns me is the rise of vendors, such as the up and coming DataRobot.com, whose mission appears to be turning data scientists out of otherwise lay individuals with data science aspirations. They mention the ability to build incredibly accurate models with a certain amount of ease. Last I checked, this was relevant to the problem of 'buttonology' from vendor creations. You know, the typical case scenario in which an analyst has no idea about the machinery underlying the button they are about to push to receive output, and knows less than nothing about the soundness of what they are doing. But hey, using technical vernacular is super cool, and sounds very 'smart' (and worse, unquestionable) to those who can't evaluate otherwise with any semblance of skeptical inquiry.

My sincere hope is that this is NOT where we are all headed, as a very large population of underqualified folks who tap dance rather well (Toastmaster virtuosos by any standard), yet are adept at little more than using SQL / SAS / some other platform as an expensive reporting instrument, are sitting in precisely these positions. They are leading, directing, and many times way off base despite political appearances and self-promotion.


2) Technological Development
As I mentioned, I believe that the evolution of 'data scientist' was wrought out of a particular set of skills that, like the statistician, also inferred information from data, but from a different angle - that of computer science. It was an independent discovery, but I believe driven by a need. The need was in essence a trade-off, generally accuracy for speed. Algorithms could be rapid-prototyped in the paradigm of engineering.

Statistical models, well, are not always able to be iterated in the same way.

But what was the purpose of iterative development in the first place? Has it been simply to make consumers happy? Is a 'first iteration' sufficient? Much of this goes back to the central argument in industry about what is 'good' and what is 'good enough'. It also has strong implications in short vs long range goals and risk of the analytic product being released at various points of immaturity/maturity.

With the introduction of new data architectures (column based, streaming, batch/file-based, direct I/O) and the exponential increase in power of computer hardware (cloud computing, GPU computing, RAM speeds, solid-state storage, CPU capacity, etc.), the ability to scale processing of much larger quantities of data was within reach, something which whetted the appetites of those both engaged in the hype of Big Data and those engaged in its complex realities.

Additionally software and its architecture, which has always lagged behind hardware in leveraging the underlying technologies for performance, began making some unique moves as well. PMML helped create model portability, computing algorithms began to leverage GPU processing, multiple CPU threads, and manage memory better. The ability to satiate the need for more instant response from input to output began to increase in amplitude.

Given the history of AI technologies, it was a natural step to apply the learning paradigm to many kinds of data. Keep in mind however that learning is a stochastic process, whose essence falls into the probability and statistics arena. That is, the nature of those engines was grounded in this very discipline to create 'grey' decision logic areas from mere black and white, much like how we think.

In the end, the statistical discipline has pervaded the underpinnings of the technologies in 'data science', whose proponents appear to be foretelling its demise. A bit of irony.

3) Economic Drivers
It is no wonder, when Patil and Hammerbacher innovated interesting ways to view data, and LinkedIn experienced the success it did, that a landmark of value in their role was created. When people begin talking about pay, and headlines such as "Data Scientists Earning $300,000 / year" roll out, it is also no wonder that people who were once classified as Statisticians at the JSM 2 years ago renamed their role as Data Scientists en masse the next year. This isn't neuroscience, it's money.

Interestingly, this sort of thing is common. People attend a university program in a discipline, yet the job market may have little that matches the title of their degree. Except for biostatistics (yes, that was a money-maker, too), statisticians, like physicists, enter or create jobs that only reflect their expertise and talent, not their terminal degree. Hence physicists might only be found in name at academic and government or specialty private institutions. Else they will be found mostly in the tech world, with more generic, quantitative, researcher, or engineering type titles. In this way, it can be quite hard to track who earns what (especially for the Bureau of Labor Statistics), and what the likelihood of success is in attaining a position with certain degrees.

Data science was a moniker given by industry, not academia. And so meeting the learning needs in an institution where the moniker itself wildly varies in definition from place to place, is a very difficult exercise. In the effort to produce a workforce in this area (and earn some cash for a departmental dean :) ) institutions have developed certification programs, and a few may be on the track to something more long term, provided that the 'name hype' doesn't evolve into some other title, or new attention attractor that diverts from the current synthesis that describes data science.

Conclusion: The Statistical Community-at-Large
Much of the original work in the data science field today was driven by a more solid understanding of matrix theory and how data vectors could provide information, both numerically and visually. However, this evolution could not have gone any further without the understanding of the logic of uncertainty - statistics (as opposed to mathematics, the logic of certainty: I quote Joe Blitzstein's very elegant definition here). Bayes' Theorem, for example, fit particularly well in learning algorithms and uncertainty quantification. When dealing with complex systems, there were many components that integrated - physics, mathematics, statistics, and computer science. Then came the addition of folks like linguists, whose entrance into the fray of the math, stat, and computational understanding of the intricate relationships between words to elicit meaning gave us NLP. Add in the graph theorists, and out came new ways to understand relationships visually. Essentially, the 'umbrella' widens, much like Vincent describes.

For whatever reason, part of the statistical community did not venture out into these other areas (except for predominantly the Computational and Graphical stats community). At least, there was not, in my view, a public outcry within the community to expand and correct its course. Only a handful of universities (à la Carnegie Mellon) have seen the need to develop students further (for these folks know that things like Big Data will require some big theory as well, and computational science concurrent with statistics is an integral part of this).

I believe that the lack of a more publicized voice in the ASA toward moving in this direction is bound historically to the core of the statistician's role, which has been, and will remain, the primary need to design and ensure the scientific soundness in experiments and inferences from data. The introduction of expansions into computational science and graphics has not changed that central need. However the integration of those skillsets has been slow. This is what must change, and we are responsible for helping out. This is not about a $300,000 paycheck and having to put on heavy detectors for pseudoscience. This is where I advocate groups like Interface, and a coalition between the ASA, IEEE, APS, AMS, MAA, SIAM (a very forward thinking group in the computing and data inference arena), and ACM to bring this together.

The past president of the ASA wrote a very important letter in The American Statistician, noting the importance of nurturing and facilitating this new synthesis. If anything, this is what we all need to be driving, else the entire Data Science experiment may prove to fail in the end.

Comment by Nicole on August 29, 2013 at 11:04am

Whatever you call the statisticians/modelers/analysts/etc., the work is still there, and in growing numbers!  I'm not a huge fan of buzz words, but it seems that the role of the Statistician has been divided into several different, more specialized roles.  Additionally, the need for great visuals that make data easy to read has increased the popularity for those who can analyze data and present it well.  I think that what we used to consider a Statistician is now found more in the healthcare field (i.e. biostatistics).  

Comment by Vincent Granville on August 28, 2013 at 1:48pm

@Gary: Check my profile to see what a data scientist does, and the kind of knowledge and expertise he/she has. Data Science encompasses machine learning, inference theory and some of statistical science, but not everything. For example, Wishart distributions, advanced matrix algebra, stochastic differential equations are not part of Data Science. But the theory of error (modern, revisited, least square and model fitting replaced by more robust methodology), design of experiments, cause vs. correlation, time series, and (model-free) confidence intervals are definitely stuff that all data scientists should master and use all the time.
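As an illustration of what a model-free confidence interval can look like, here is a minimal sketch of a percentile bootstrap in Python. This is an assumption about one common approach, not necessarily the method the comment above has in mind. The function name, the default parameters, and the toy sample are all hypothetical. The interval is built only by resampling the observed data, with no distributional model for it.

```python
import random
import statistics

def bootstrap_ci(data, stat=statistics.median, n_resamples=2000,
                 level=0.95, seed=0):
    """Percentile bootstrap interval for `stat`, using resampling only.

    Resamples `data` with replacement, recomputes the statistic each
    time, and returns the empirical quantiles of those estimates.
    """
    rng = random.Random(seed)            # fixed seed for repeatable output
    n = len(data)
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * n_resamples)
    hi_idx = int((1.0 - alpha) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]

# Hypothetical toy sample with one outlier; the median is robust to it.
sample = [2.1, 2.4, 2.2, 9.8, 2.3, 2.5, 2.0, 2.6, 2.2, 2.4]
low, high = bootstrap_ci(sample)
```

Swapping in stat=statistics.mean gives an interval for the mean instead. With a sample this small the interval is only a rough guide, whichever statistic is used.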

Comment by Vincent Granville on August 27, 2013 at 10:24am

I also believe the contrast between statistician and data scientist is more pronounced in the US than in Europe, because the American Statistical Association and the statistical curriculum in the US are out of touch with the real world, except in the narrow fields of biostatistics and government statistics. It would be interesting to produce the same charts for Asia only, Europe only, and US only.

When I completed my PhD in Belgium in 1993, it was about image processing and computational statistics. In many ways, it was data science. But in the US today, my research would be classified as engineering, not statistics.

Comment by Ivan on August 26, 2013 at 10:55pm

Interesting post and charts, Vincent. However, I wouldn't "kill" anyone or any keyword yet based only on Google trends. Personally, I believe that this is just a transition and a smart way to make people believe in statisticians' work. Before, it was very hard to convince a CEO that a statistician would create/run some algorithms to help increase ROI. Nowadays, everybody believes in data scientists' results and takes action based on them, even if these are not well supported by statistical and mathematical assumptions. I consider myself a data scientist, but I started working as a statistician, just like yourself. And we're still alive!

Comment by Richard on August 26, 2013 at 7:13pm

What surprises me is the absence of entrepreneurs, start-up founders or CEOs among statisticians, at least over the last 10 years. This is in contrast with data scientists, computer scientists, operations research professionals and data miners. FICO, Westat, Statsoft, SAS, SPSS, Insightful (and thus S+, and then R) might have been founded by statisticians, but that was a long time ago.
