A Data Science Central Community
By Dan Kellett, Director of Data Science, Capital One UK
Disclaimer: This is my attempt to explain some of the ‘Big Data’ concepts using basic analogies. There are inevitably nuances my analogy misses.
What is HDFS?
When people talk about ‘Hadoop’ they are usually referring to either the efficient storing or processing of large amounts of data. MapReduce is a framework for efficient processing using a parallel, distributed algorithm (see my previous blog here). The standard approach to reliable, scalable data storage in Hadoop is through the use of HDFS (Hadoop Distributed File System).
Imagine you wanted to find out how much food each type of animal eats in a day. How would you do it?
The gigantic warehouse
One approach would be to buy or rent out a huge warehouse and store some of every type of animal in the world. Then you could study each type one at a time to get the information you needed – presumably starting with aardvarks and finishing with zebras.
The downside of this approach (other than the smell & the risk the lions would eat the antelopes) is that this would take a looooong time and would be very expensive (renting out a huge building for many years would add up).
This is similar to the approach we take with our data at the moment. Huge amounts of information about our customers are available and we are restricted to analyzing it through a fairly narrow pipeline.
An alternative approach would be to split up the animals and send them to lots of smaller centers where each would could be assessed. Maybe the penguins go to Penzance, the lemurs to Leeds and the oxen to Oxford.
This would mean each center could study their animals and send the information back to a head office to be summarized. This would be much faster and a lot cheaper.
This is an example of parallel processing whereby you split up your data, analyze each part separately and bring it back together at the end.
The key drawback is that you are highly susceptible to failure in one of these mini-centers. What happens if there’s a break-in at the center in Lincoln and all the chickens escape? What if the systems go down in Dundee and all the information on sparrows is lost?
The solution to this is to still use mini-centers but to send animals to multiple centers. For example, you may send some rabbits to Birmingham, some to Edinburgh and some to Cardiff. You are then protected from individual failures and can still carry out the survey quickly.
This is from a very high level what HDFS (Hadoop Distributed File System) does. Data are split up in a way that the overall task is not impacted if an individual node fails. Each node carries out its designated task and then passes the results back to a central node to be aggregated.
When would I use HDFS?
As with any technique related to data science HDFS is one of many approaches you could take to solve a business problem using large amounts of data. The key is being able pick and choose when to take HDFS off the shelf. At a high level: HDFS may help you if your problem is huge but not hard. If you can parallelize your problem, then HDFS coupled with MapReduce should help.