A Data Science Central Community
I tried to essentially do a database join using vlookup in Excel, on two tables A and B that had just one field (email address) and 80,000 records each, to find which records in A were also present in B.
It took 10 minutes with Excel. With Python, using hash tables, it took less than 10 secs. I'm curious to know what would be the performance
Here the data set is so small, it doesn't even make sense to use a MapR architecture - the gain would be very small.
I presume the Python 'hash table' used the native dictionaries data structure, which assigns an integer value to each entry. Joining two tables of data using an integer value is probably quite a bit faster than doing a scan of values between two columns of data, which is what Excel does in a VLOOKUP. Similiar to using an index to locate a record in a table, versus scanning the table for it. And in Excel, scanning two columns of character data looking for common values is like joining two non-indexed database tables together to create an INTERSECT - it will require a scan.
R is really good at this. You can use the merge() function
or for real speed you can use the data.table package
The only restriction is that the tables must fit into RAM. With a 64-bit Windows machine you can use all the RAM you have got. With a 32-bit Windows you are limited to 4 GB however much RAM is there.
See R examples at
Excel performance tips.
What I found with Access is its inability to properly import/export data types. Automated and even forced type casting, from a database table to text or CSV files (or the other way around) is poor, resulting in many errors. It is because this environment is too rigid, not flexible enough to do smart type detection and/or casting. Ever had to deal with a date that was recorded as an integer? Or or comma in a data field value, that screws up your CSV file?
I've had a comma in a data field many times. I now prefer to use tab-separated instead of CSV for exactly that reason.
The issue with truly big data is that you will end up with field separators that are actually data values (text data). What are the chances to find a double tab in a one GB file? Not that high. In an 100 TB file, the chance is very high. Now the question is: is it a big issue, or maybe it's fine as long as less than 0.01% of the data is impacted. In some cases, once the glitch occurs, ALL the data after the glitch is corrupted, because it is not read correctly - this is especially true when a data value contains text that is identical to a row or field separator, such as CR / LF (carriage return / line feed). The problem gets worse when data is exported from UNIX or MAC to WINDOWS.
I've not yet encountered stray tabs in character fields, but I agree it's possible. As you say, there's no standard. I did encounter a frustrating 18 GB file recently that was tab-separated as I requested but two adjacent fields had no separator and had to be parsed by (multiple) Python scripts. At one point I replaced the tabs by a solidus (|, or vertical bar) to make them visible.
Trivial example of merging data of the size you are talking about in SAS, R, and Python.
I'm most familiar with SAS and least familiar with Python. So take it with a grain of salt.
A reader suggested to use INDEX + MATCH instead of VLOOKUP.
The base function to VLOOKUP in R is merge() as previously stated. However, I use plyr join() to make quick work of it with flexibility. For every situation there seems to always be a more efficacious way of executing a simple task like this. With that I tested a few methods in R after creating two dataset with 80K random sample email addresses from a 200K list without replacement. I tested merge, data.table, ff, sqldf, and plyr.
Using rbenchmark with 10 replications here was the outcome:
When I run into memory issues I'll try a few solutions before taking this out of R.
Excel does its job pretty well as long as you double check you have clean data. International email addresses can be interesting to work with.
If you first sort the lists before doing the vlookup's you will see a very big improvement in the execution time.
Alfred Vachris Excel VBA Developer