A Data Science Central Community
What is MapReduce?
When people talk about ‘Hadoop’ they are usually referring to either the efficient storage or the efficient processing of large amounts of data. The standard approach to reliable, scalable data storage in Hadoop is HDFS (Hadoop Distributed File System), which may be a topic for a future blog. MapReduce is a framework for efficient processing using a parallel, distributed algorithm. Over the past 18 months we have used MapReduce for a variety of analytic needs, building up use cases as we went along.
The first rule of MapReduce is that there are four words you need to understand: Key, Value, Map and Reduce:
- Keys and Values come as a pair: for every key, there is a value. However, don’t think of this as just two columns of numbers, for example ID and balance. The key and the value can be ANYTHING, as long as there is some sort of list of keys and corresponding values.
- Map: a function which takes a set of values and returns another set of keys and values. It’s just a function; it can be anything you can write.
- Reduce: a function which takes a key and its associated values, and returns a value (and optionally a key). It is important to note that when MapReduce calls the Reduce function, it calls it with one key and all of that key's values. If the data has many keys, the Reduce function will be called many times.
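These four terms can be sketched in plain Python. The sketch below is a simplified, single-machine model of the idea, not the Hadoop API; `run_mapreduce`, `map_fn` and `reduce_fn` are illustrative names of my own.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn=None):
    """Toy single-machine MapReduce (illustrative only; Hadoop
    distributes these phases across many nodes)."""
    # Map phase: map_fn yields (key, value) pairs for each input record.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Map-only job: return the grouped pairs as-is.
    if reduce_fn is None:
        return dict(intermediate)
    # Reduce phase: reduce_fn is called once per key, with ALL of
    # that key's values -- one call per distinct key.
    return {key: reduce_fn(key, values)
            for key, values in intermediate.items()}
```

For instance, counting occurrences of each key is a map that emits `(key, 1)` and a reduce that sums the values for each key.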
Below I have tried to make this concrete with a few examples on a fictional dataset. Assume we have some data on credit card customers: in the table we have an ID, the colour of their card and their balance (see example below).
Here are a few problems you may have to solve and how you would use MapReduce to do this:
Example 1: Selecting all the ‘blue’ cards
This is a very simple example, in fact so simple that we do not need a Reduce function. All we need is a map function which looks at the data and outputs only the accounts with blue cards. There are no input Keys and the input Value here is CARD_COLOUR. The Map here excludes accounts where card colour is not blue. In this case, there is no output key and the output values are all the accounts with blue cards, as we expected.
It’s worth also noting the values have retained their row numbers from the original table. This was the original input key, and we can think of this as where the data was stored originally. It is not required in this example, but note the values are sorted in order of original key.
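As a rough sketch, Example 1 might look like the following in plain Python (the dataset, field layout and function name are invented for illustration; a map-only job needs no reduce step).

```python
# Hypothetical account records: (row ID, card colour, balance).
accounts = [
    (1, "blue", 500),
    (2, "gold", 1200),
    (3, "blue", 2500),
    (4, "red", 800),
]

def map_blue(record):
    """Emit an account only if its card is blue; otherwise emit nothing."""
    row_id, colour, balance = record
    if colour == "blue":
        # The original row ID is carried through as the key.
        yield row_id, (colour, balance)

# Map phase only -- no reduce is needed for a simple filter.
blue_cards = [pair for record in accounts for pair in map_blue(record)]
# blue_cards == [(1, ("blue", 500)), (3, ("blue", 2500))]
```

Because the row IDs are carried through as keys, the output stays in the order of the original table, as noted above.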
Example 2: Summing the card balance by card colour
This is another relatively straightforward example, a query of the sum of balance on each of the card types. In this instance we want to limit the amount of data we retrieve to just the card colour and balance – this might be appropriate if we had a lot of accounts. The map function therefore returns all the accounts, but just the card colour and balance for each account.
The map returns the card colour as the key, and the balance as the value. The reduce function takes the key colour and also returns that as its key. The input value to reduce will be all the balances, and the output value will be a sum of all the values for each key.
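A single-machine sketch of Example 2 follows (sample data invented for illustration; in real MapReduce the reduce calls would be spread across the cluster, one per distinct key).

```python
from collections import defaultdict

# Hypothetical account records: (ID, card colour, balance).
accounts = [
    (1, "blue", 500),
    (2, "gold", 1200),
    (3, "blue", 2500),
    (4, "red", 800),
]

def map_colour_balance(record):
    """Keep only what we need: emit (card colour, balance)."""
    _id, colour, balance = record
    yield colour, balance

def reduce_sum(colour, balances):
    """Called once per colour with all of that colour's balances."""
    return sum(balances)

# Group the mapped pairs by key, then reduce each key's values.
grouped = defaultdict(list)
for record in accounts:
    for key, value in map_colour_balance(record):
        grouped[key].append(value)

totals = {colour: reduce_sum(colour, balances)
          for colour, balances in grouped.items()}
# totals == {"blue": 3000, "gold": 1200, "red": 800}
```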
Example 3: Calculating the mean balance by balance group (rounded up to the nearest 1,000), where the ID is even and the balance is greater than 1,000.
This is a more involved example which I will go through quickly, just to demonstrate some more functionality. The map step will identify the accounts we want to select, and then assign a key (balance group, for example 1000-2000) and values (balance) to each account. The Reduce step will calculate the mean balance by balance group. Remember the keys are the buckets, and the values are the means.
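The steps above can be sketched as follows (sample data and the exact banding convention are my own assumptions; the key is a 1,000-wide balance band such as 1000–2000, and the filtering happens in the map step).

```python
import math
from collections import defaultdict

# Hypothetical account records: (ID, card colour, balance).
accounts = [
    (1, "blue", 500),    # excluded: odd ID and balance <= 1000
    (2, "gold", 1200),
    (4, "red", 1800),
    (6, "blue", 2500),
    (7, "gold", 3100),   # excluded: odd ID
]

def map_balance_band(record):
    """Filter to even IDs with balance > 1000, keyed by balance band."""
    account_id, _colour, balance = record
    if account_id % 2 == 0 and balance > 1000:
        upper = math.ceil(balance / 1000) * 1000  # round up to nearest 1000
        yield (upper - 1000, upper), balance      # e.g. 1200 -> (1000, 2000)

def reduce_mean(band, balances):
    """Called once per band with all of that band's balances."""
    return sum(balances) / len(balances)

grouped = defaultdict(list)
for record in accounts:
    for key, value in map_balance_band(record):
        grouped[key].append(value)

means = {band: reduce_mean(band, values)
         for band, values in grouped.items()}
# means == {(1000, 2000): 1500.0, (2000, 3000): 2500.0}
```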
Examples where we’ve used MapReduce at Capital One:
New covariate generation / assessment
The most important decision we make, for both us and our customers, is whether to accept their application for credit with us. The better we can predict whether a customer is high or low credit risk, the more applicants we can approve. Our primary tools for this decision are our credit scoring risk models. By looking at the data available to us in different ways we were able to create 1.4 million new variables, but the downside was that it took around 6 weeks of digging to identify the highly predictive ones. By structuring this problem using MapReduce we were able to carry out the equivalent work in approximately 15 hours.
One idea was to use the behaviour of our back book to spot geographical patterns of performance. The hope was that this would help with our understanding of the economy at a macro and micro level. To make this work we had to solve the compute-intensive problem of creating metrics for all 1.6 million UK postcodes using the data from 2.5 million open accounts. This is intensive because you can’t tell whether an account’s behaviour should contribute to a postcode score until you know how far it is from that postcode. That means that the positions of all accounts need to be compared with the positions of all postcodes – a join which would create ~0.5 × 1.6MM × 2.5MM = 2,000 billion distance calculations. The final job takes 9 minutes to produce results identical to the previous day-long job, an improvement of some 160x!
When would I use MapReduce?
As with any technique related to data science, MapReduce is one of many approaches you could take to solve a business problem using large amounts of data. The key is being able to pick and choose when to take MapReduce off the shelf. At a high level: MapReduce may help you if your problem is huge but not hard. If you can parallelise your problem, then MapReduce should help.
Credits: Sam Bennett, Sarah Pollicott, Kevin Chisholm, Jeremy Penzer, Cheryl Renton