A Data Science Central Community
I tried to essentially do a database join using vlookup in Excel, on two tables A and B that had just one field (email address) and 80,000 records each, to find which records in A were also present in B.
It took 10 minutes with Excel. With Python, using hash tables, it took less than 10 secs. I'm curious to know what would be the performance
Here the data set is so small, it doesn't even make sense to use a MapR architecture - the gain would be very small.
Related articles
Tags:
The easy answer is that excel is a resource hog & despite being a tremendously powerful tool for mining small datasets, when you start to push past the traditional 65565 rows, you start to move into realms where Microsoft is traditionally not good (memory handling, i/o management, efficient processing).
First a few questions:
1. Cardinality: Were both tables internally unique? Excel bogs down on cartesian products in my experience -- you need one to many or one to one matches. Am assuming you ran a pivot table on both counting the unique instances of each email address, comparing the row count in the table with the grand total (both should be the same).
2. Sorting: Were both tables sorted by email address before attempting the match?
3. Were you using the exact (match only if true) to set a match flag?
Am assuming that this answer is going to be yes across the board, since you're probably a really good excel jockey. I do this processing all the time with 150,000 records looking up against 60 or 70,000 records and while it usually takes 1-2 minutes, 10 sounds like an awful lot -- but I do have a fairly new laptop.
Vlookup formula I use:
=if(isna(vlookup(a2,sheet2($A::$A),0,1)),0,1) sets a flag of 0 if there's no match & assumes that all other situations are a match.
I apologize for not answering your other questions...in my experience with the range of tools listed above, below 10 seconds is normal -- provided that you stack the deck in the favor of all the tools.
1. Ordered data match/merges faster than non-ordered data in almost any situation.
2. Cartesian products blow up match/merges - make sure you dedupe both lists first. Even with sql logic in the database engines or SAS or SPSS, if you do a group by ascending, you should still be able to complete the processing and do the match merge in less than 10 seconds of clock time.
Am assuming that you're not counting the time it will take to load the data into all the different db engines :-)
consider the IFNA function as well
Congratulations on 80k addresses!
I'm not an Excel banger so I've got to assume that traditional research constraints apply:
Did you convert the email addresses to text strings first?
How does MATCH, used for text, compare to VLOOKUP, used for values?
Can you sort the columns first so you start with clusters?
These are email addresses, so character strings of variable length rather than a set of numbers internally converted to 4 bytes (IEEE 754) get compared byte by byte. What happens when you pad the strings to a uniform length?
You are looking for dups so a full table scan of table B for each record in table A forces something like 6.4x10^9 match tests. Pretty busy.
Hi Todd - I suspect Excel does a full join. The hash tables in Python solves the problem in O(n log n), rather than O(n^2).
Good question. I've run into Excel limitations because my data > 1M rows.
I'm turning to Python for exactly this reason. I tried R, but my laptop ran out of memory!
I'm using Python(x,y) and Numpy to do the equivalent of Excel pivot tables. Pretty long setup time (b/c I had to teach myself Python), but it's something I'm going to do daily basically forever, so a Python script seems to be the way to go.
As far as I understand things, a VLOOKUP that is doing exact-matching will use a binary search on your column, which on average will take 15 comparisons per match on your 80,000 addresses. If your join was naively implemented, it repeated this binary search for every single item in Table A, looking up each item in Table B.
I don't have 80,000 email addresses, but I did a test in R where I generated two lists of 200,000 10-digit numbers (stored as character strings), and a merge was way faster than converting the numbers to character strings. A second or so.
VLOOKUP can do both EXACT (TRUE or 0) and APPROXIMATE (FALSE or 1) as part of the [range_lookup]
Assuming your tables are local in SAS, an inner join with 80,000 records should take less than a second.
Basic example of 2 tables randomly ordered, 150,000 records with 80,000 matching, and joining on a character column of length 32. Take .15 seconds on my laptop.
9461 proc sql noprint;
9462 create table a_and_b as
9463 select a.address,
9464 a.a,
9465 b.b
9466 from a inner join
9467 b
9468 on a.address = b.address;
NOTE: Table WORK.A_AND_B created, with 80000 rows and 3 columns.
9469 quit;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.15 seconds
cpu time 0.12 seconds
I don't see why standard Excel pivottables wouldn't do the trick for you.
copy the 2 files into a single sheet (column), adding a column to identify the list where the records came from.
create a pivottable with the email addresses as row labels, the list IDs as columns, and count() as value field.
just tested on 2 sets of 80000 random strings ... in less than 5 seconds (Excel 2010, 8-core CPU)
And presorting is very fast for 80,000 rows - even in Excel if you sort just one column, which is the case here.