
A story that has been making the news recently, and that got me thinking about the widening reach of data science, involves a rather shadowy US organisation called the National Security Agency (NSA).  These guys have a mandate to gather data that enemies of the state would like to keep secret.  Unfortunately the NSA have been rather over-zealous in their pursuit of information and have wound up gathering supposedly private data from the likes of Google, Facebook, Microsoft, Skype, YouTube, Apple, LinkedIn and Yahoo.  This has led to a huge amount of head scratching from the press, public, politicians and dot com companies about how so much private data was gathered without people knowing or consenting to it.  It has also led me to think that the NSA must have built a pretty impressive database.


As I thought more about this database, which could arguably be one of the most wide-reaching and controversial databases ever built, I wondered how it would relate to the work I do within my analytics consultancy.  In our company we have a team of data scientists, statisticians and programmers who help our clients better understand their world through data.  So whilst the NSA have waded through arguments around national security needs versus personal privacy, I wonder if we could have helped them be more successful with their data.


Perhaps they could sit down with us and explain how they've built this large data asset, something that in the beginning was seen as a great opportunity to gain competitive advantages, but over time has rather mushroomed into an unmanageable drain on resources, with few visible benefits.  Perhaps we’d then offer to help them use their new data asset to find insights that are actually useful and bring about some genuine stakeholder value.  Now whilst we don’t typically work with shadowy government organizations, the story that they've just (fictionally) explained about their data certainly wouldn't be new to us.


So given this scenario, I think the really interesting question for a data scientist comes next.  If you have a database that includes lots of data from all the giants of the online community and are given a mandate to do something valuable with it, what would you do?  There are obviously a lot of potential answers, but I want to pick a single one to get started. 

Given that the NSA is a government organization I felt that the solution should serve the public in general.  I also felt that as the PR story around this database hasn’t been great, the outcome should include some kind of positive social impact.

Therefore I decided to think through how the data could be used to rate a person's 'success' in a given field and then create a model to assist other people’s future success.  So for example, a hospital might want to identify the drivers of what makes a good doctor and use this to improve how doctors are educated and trained.  A bank may want to model the attributes associated with traders whose careers show longevity built on gradual improvement, and use these attributes to improve recruitment processes.  Taking this example of predicting 'success' in a given field, I then considered how the NSA’s DB could be used.

First of all what data is (allegedly) in there?  Within NSA's giant database we probably have a good picture of who most people are through their Facebook and LinkedIn data – what region do they live in, what is their job, how has their career progressed, what are their likes?  We can also get hold of people’s buying habits from the likes of Amazon.  Their consumption history is available in terms of what they read (Kindle purchases, online newspapers) and what they watch and listen to (YouTube).  Finally we know the types of things they communicate about via text analysis of their tweets and emails.

So taking the example of identifying what makes a good doctor, could hospitals use the surreptitiously gathered data from NSA to identify drivers of 'success' in this area?  Well, maybe.  We could first of all have a go at splitting all those interested in medicine from everyone else.  Anyone listed on LinkedIn or Facebook as having taken a degree in a medical related subject will probably do as a starting point.  Then we can try to define success versus failure.  This can be done in many ways but it's best to start simple.  So we could say that anyone who reaches a reasonably senior position in a hospital is termed as a success and anyone who drops out, maybe deciding that their passion is for dentistry, as a non-success. 
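As a minimal sketch of that starting point, the cohort split and the simple success label could look something like the following. Everything here is an assumption made for illustration: the profile fields (`degree`, `current_title`, `employer_type`) and the keyword lists are hypothetical, and a real definition of 'senior' would need input from people who know the field.

```python
from typing import Optional

# Hypothetical keyword lists, chosen only for illustration.
MEDICAL_DEGREE_KEYWORDS = ("medicine", "medical", "surgery")
SENIOR_TITLE_KEYWORDS = ("consultant", "chief", "head of", "senior")


def in_medical_cohort(profile: dict) -> bool:
    """True if the (hypothetical) degree field mentions a medical subject."""
    degree = profile.get("degree", "").lower()
    return any(keyword in degree for keyword in MEDICAL_DEGREE_KEYWORDS)


def label_success(profile: dict) -> Optional[bool]:
    """Return True/False for people in the medical cohort, None for everyone else.

    'Success' is the deliberately simple definition from the text:
    a reasonably senior position in a hospital.
    """
    if not in_medical_cohort(profile):
        return None
    title = profile.get("current_title", "").lower()
    works_in_hospital = profile.get("employer_type") == "hospital"
    is_senior = any(keyword in title for keyword in SENIOR_TITLE_KEYWORDS)
    return works_in_hospital and is_senior
```

Returning `None` for people outside the cohort keeps them out of any later comparison, rather than silently counting them as non-successes.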

Next we can start to define a whole set of variables and look for correlations between them and what we've termed success.  As we have a lot of data sources I'd go with throwing some plausible hypotheses at the data and testing each one for a relationship with success.  So we could start with book purchases - perhaps a lot of senior doctors own certain books or books by a certain author.  These could be useful text books or books that inspired them towards their career goal.  Also, do certain newspaper articles or topics in the press pop up more than most?  We could also look at their viewing habits: do successful doctors have a tendency to purchase the 15 season box set of E.R.?  Are particular universities related to success - some obvious names might arise here but maybe also some surprises?  Finally what role does medicine play in their communications - do they exchange emails about medical issues or tweet about the success or failure of their latest operation (a worrying thought)?
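One simple way to test a yes/no hypothesis like "owns a particular book" against the success label is the phi coefficient, which is the correlation between two binary variables computed from a 2x2 table of counts. The sketch below is only an illustration: the record layout and the feature name `owns_book` are hypothetical, and phi is one reasonable choice among several measures of association.

```python
import math


def phi_coefficient(records, feature, outcome="success"):
    """Phi coefficient between a binary feature and a binary outcome.

    `records` is an iterable of dicts; `feature` and `outcome` are keys
    holding truthy/falsy values. Returns 0.0 when a row or column of the
    2x2 table is empty, since the coefficient is undefined there.
    """
    a = b = c = d = 0  # a: both true, b: feature only, c: outcome only, d: neither
    for record in records:
        has_feature = bool(record[feature])
        succeeded = bool(record[outcome])
        if has_feature and succeeded:
            a += 1
        elif has_feature:
            b += 1
        elif succeeded:
            c += 1
        else:
            d += 1
    denominator = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denominator if denominator else 0.0
```

A value near +1 or -1 indicates a strong association and a value near 0 indicates none. When many hypotheses are tested this way, some will look strong purely by chance, so any promising result would need checking on data that was not used to find it.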

Whilst a lot of noise would exist in the above methods, it may be possible for the NSA DB to identify some key correlates of success and failure among doctors.  This could then lead the medical industry to consider how it develops future doctors.  Perhaps the recommended reading material could change.  Universities associated with success could be analysed alongside those with higher drop-out rates and recommendations created around what lower-rated universities can do to improve.  Maybe online grocery deliveries can be analysed to see if good doctors tend to survive on a diet of carrots and houmous or if they spend their days drinking the strongest coffee available to humankind.

So given all these rich data sources I think it might be possible to model some predictors of what helps make a great doctor.  This model could then be used by hospitals, universities and individuals to help them be as successful as possible in the field of medicine.  Maybe with this type of output, the NSA could also become a little more loved by the public and reduce the number of disillusioned staff who defect to the competition.


Finally I should state that I'm not arguing that this model justifies breaching people’s data privacy rights.  It's also worth pointing out that a lot of the data mentioned is already publicly available and that a similar approach could work using less controversially gathered data.  A valid point, however, is that with all the rich data sources that are now available, a whole raft of new types of analysis is opening up.  This new analysis will focus less on silo-based data, where you analyse a person's interactions with a single company, and will instead span multiple datasets, each touching upon different aspects of a person's life.  Although the press for this hasn't been great so far, this article hopefully shows that benefits can come from building up a more complete picture of people from their online activity.  In time the NSA may even be seen as pioneers in creating these wide-ranging analytics capabilities, although at the moment this still seems a long way off.




Comment by Davide Imperati on May 30, 2014 at 1:40am

The idea is interesting, but I am quite curious about the results one can obtain.

E.g. once we have the prototypical profile of a good doctor, and considering that most of those data sources are publicly available and do not check the authenticity of the information entered by users, it won't be difficult to code bots that perfectly mimic the behaviour of e.g. a prototypical high-ranking doctor.
So fundamentally this work could produce a very good source of counter-intelligence.

On the other hand I see quite a lot of subjectivity in defining the classes, e.g. what is a 'successful doctor', which again might lead to controversial results.

The third issue concerns the reproducibility of human behaviour. When working at the scale of institutions, or with aggregates of many random variables, we can exploit large-number results and be confident that the observed outcome will be statistically close to the expectation. At the individual level, however, there may be so many subjective variables that knowing the prototypical behaviour won't be of any use if we intend to reproduce it to 'engineer' successful doctors. E.g. 97% of heroin addicts started by consuming cannabis, but consuming cannabis does not make one more likely to escalate to heroin.
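The gap between those two statements can be made concrete with Bayes' rule, which turns P(cannabis | heroin) into P(heroin | cannabis). In this small sketch the 97% figure is the one quoted above, while the two base rates are made-up numbers used purely for illustration.

```python
def bayes_reverse(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Return P(A | B) given P(B | A), P(A) and P(B), using Bayes' rule."""
    return p_b_given_a * p_a / p_b


# P(cannabis | heroin) as quoted in the comment above.
p_cannabis_given_heroin = 0.97
# Hypothetical base rates, assumed only for this illustration.
p_heroin = 0.005
p_cannabis = 0.30

p_heroin_given_cannabis = bayes_reverse(p_cannabis_given_heroin, p_heroin, p_cannabis)
print(f"P(heroin | cannabis) = {p_heroin_given_cannabis:.3f}")
```

Under these assumed base rates the reversed probability comes out under 2%, even though the original figure is 97%. Neither number says anything about causation. The same caution applies to reading "most successful doctors do X" as "doing X makes a successful doctor".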

To my knowledge, after the Arab Spring there have been a few studies of the movement on e.g. Facebook and Twitter. The most advanced analytic tools were able to identify some prototypical situations that might lead to a riot, but in all those scenarios no analytic tool was able to identify the special conditions that ignite the violence. So it may be that Nash and Shannon are still protecting the last bit of humanity in each individual.
