I have noticed that when I go to a fast food outlet, no matter what I get to drink with my meal, it is almost always listed as "Cola" on the receipt. But I didn't order cola. Ever. Usually I get juice, or milk. So every time I order a burger, I'm clearly a source of bad-quality data.
I have looked over the counter on many occasions while waiting for my burger and watched the server key in other people's orders; their fingers flew across the keypads, but only ever hit the cola key (always in the most central location, it seemed). I could actually see the extra wear on the surface of the touch pad. I suspect that the number one reason for keypad replacement in the fast food industry is "cola key not working". I am guessing that employees understand that speed is important; it is fast food, after all. I wonder how much data quality is discussed.
Now, this is hardly a scientific study, and falls clearly into the "anecdotal evidence" column, but when I see this it strikes me that somewhere a data warehouse is probably capturing my drink, and it's being ignored because all the analysts know that it's always cola.
Or, perhaps, there is a complicated ETL job, written over hundreds of hours of expensive consulting time, that cross-feeds drink information from the inventory system tracking the quantities of syrup each location requires, estimates the drinks sold, and randomly allocates those drinks across the number of meals sold.
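To make the absurdity concrete, here is a minimal sketch of what such an allocation job might look like. Everything in it is an illustrative assumption: the servings-per-litre yield, the function name, and the sample numbers are invented for this sketch, not taken from any real point-of-sale or inventory system.

```python
import random

SERVINGS_PER_LITRE = 8  # assumed yield per litre of syrup

def allocate_drinks(syrup_used_litres, meal_ids, seed=0):
    """Estimate drinks sold from syrup usage, then randomly
    assign those drinks across individual meal records."""
    rng = random.Random(seed)
    allocation = {meal_id: None for meal_id in meal_ids}
    # Estimate total servings of each drink from syrup consumed.
    estimates = {drink: int(litres * SERVINGS_PER_LITRE)
                 for drink, litres in syrup_used_litres.items()}
    # Build a pool of drink labels and shuffle it over the meals.
    pool = [d for d, n in estimates.items() for _ in range(n)]
    rng.shuffle(pool)
    for meal_id, drink in zip(meal_ids, pool):
        allocation[meal_id] = drink
    return allocation

# 10 L of cola syrup, 2 L of orange, 0.5 L of milk across 100 meals.
orders = list(range(100))
result = allocate_drinks({"cola": 10, "orange": 2, "milk": 0.5}, orders)
```

Note that the shuffle is doing all the work: each meal's drink is pure invention, which is exactly why the downstream analysis in the next paragraph falls apart.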
If this were done, you would not have good information about who drank what with what: is orange soda or milk more popular with the cheeseburger? Are the fancy fruit drinks (which have a lower margin) more likely to be ordered by people getting the spicy wrap or the regular one? What is the real margin on each meal, taking the drink into account?
Or maybe the drink dimension is a special dimension that only shows drink categories at a summarized level, because that's the granularity the inventory system uses.
Messy. It reduces the value of the information, and it's hard to explain to the end users. But what can you do if you don't collect the data at the level of each individual order?
Of course, the point is not that I think this wrong-drink-keyed-in issue is an important one for the fast food industry. The point is that if the information is wrong at the point of capture, we can spend far more effort in the extract, transform, and load (ETL) logic than we need to, with little or no result.
In fact, if we spend enough time on the ETL to make the final data warehouse data appear to be telling us something, it might even be damaging, since the ETL itself might be generating patterns that don't exist and will lead analysts down dead ends, forever chasing an apparent relationship between Dr. Pepper and curly fries.
And like many issues in business intelligence, and data quality in general, the root problem is one of process and people. Here is what I think the problem is: it is harder to get a thousand data entry people to be careful about their data capture than to hire one ETL developer to write some crazy twenty-thousand-dollar chunk of code. The result is that instead of fixing the problem at its source, in this case right at the point of order, we try to fix it in the database, after the fact.
We need to get the people on the ground who actually experience the event to be motivated to get the data right, right from the start.
Obviously this is true for retail; it's true for the loading dock, it's true for the order desk, and it's true even for self-serve and online processes where the data comes directly from the customer. It's true for all data. Get it right as soon as it goes in, and you've won a big part of the battle.
In my experience, the key is often to ask only for the information you really want, and when you do ask for it, make it clear that it must be accurate, and put in place closed-loop processes that ensure it is. Syrup purchases don't match the 100%-cola data? Ask why. Include data accuracy in the store supervisor's assessment criteria. Obviously, the more the process can be automated with bar codes, radio-frequency ID (RFID) tags, or other technologies, the better.
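A closed-loop check like the syrup-versus-receipts question could be sketched as follows. All of the names, the yield constant, and the tolerance threshold here are illustrative assumptions; the point is only that comparing two independent records of the same event flags the "100% cola" pattern automatically.

```python
SERVINGS_PER_LITRE = 8  # assumed yield per litre of syrup

def flag_mismatches(pos_counts, syrup_litres, tolerance=0.15):
    """Return drinks whose point-of-sale counts diverge from the
    servings implied by syrup consumption by more than `tolerance`."""
    flagged = {}
    for drink, litres in syrup_litres.items():
        implied = litres * SERVINGS_PER_LITRE
        if implied == 0:
            continue
        recorded = pos_counts.get(drink, 0)
        drift = abs(recorded - implied) / implied
        if drift > tolerance:
            flagged[drink] = (recorded, round(implied))
    return flagged

# The everything-keyed-as-cola pattern shows up immediately:
pos = {"cola": 100, "orange": 0, "milk": 0}
syrup = {"cola": 10.0, "orange": 2.0, "milk": 0.5}
issues = flag_mismatches(pos, syrup)
```

The check costs a few lines, not twenty thousand dollars of ETL, and the questions it raises land with the people who can actually fix the keying behaviour.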
Data quality starts on the ground. The further from the ground, and the deeper into various operational systems, ETL jobs, staging tables, data warehouses or data marts we try to fix the problem, the harder it will be.