A join is, conceptually, a Cartesian product followed by filtering, and in many database systems joins can be very slow. What are the best ways to optimize them?
As an illustration, if you join two tables A and B, each with one million rows, with a "where" condition narrowing rows down to 10,000 in table A and a "where" condition narrowing rows down to 500,000 in table B, then an efficient approach is to apply the two "where" conditions first, materialize the two filtered subsets (for instance in temporary tables), and join those subsets rather than the original tables.
This is far more efficient than computing the full Cartesian product, which would involve 1,000,000,000,000 (a million times a million) candidate row pairs. Obviously the final result (unless it is an outer join) would consist of at most 10,000 rows, one per qualifying row of table A, assuming the join key is unique in B.
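A minimal SQL sketch of that strategy; the table names a and b match the example, but the filter columns status and event_date and the join key id are hypothetical stand-ins for your actual schema, and the temporary-table syntax shown is the MySQL/PostgreSQL form:

-- 1. Apply each WHERE condition first, materializing the small subsets.
CREATE TEMPORARY TABLE a_filtered AS
SELECT id, status
FROM   a
WHERE  status = 'active';           -- narrows a to ~10,000 rows

CREATE TEMPORARY TABLE b_filtered AS
SELECT id, event_date
FROM   b
WHERE  event_date >= '2014-01-01';  -- narrows b to ~500,000 rows

-- 2. Join the filtered subsets instead of the full tables.
SELECT f.id, f.status, g.event_date
FROM   a_filtered f
JOIN   b_filtered g ON g.id = f.id;

In practice, most modern optimizers push the where predicates below the join on their own; materializing the filtered subsets explicitly mainly makes the strategy, and its cost, visible and reusable.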
The only thing I would add to your posting is that indexes need to be created on the fields used in your where clause. Just be careful not to create too many indexes: too many indexes degrade insert performance. A primary key, in most database implementations, is a form of index.
Without those indexes, the criteria in the where clause would trigger a full table scan, which is very slow.
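As a sketch, again with the same hypothetical column names as above, the indexes would cover both the filter columns and the join key:

-- Index the columns referenced in the WHERE clauses so the filters
-- avoid full table scans ...
CREATE INDEX idx_a_status     ON a (status);
CREATE INDEX idx_b_event_date ON b (event_date);

-- ... and the join key, so the join itself can use index lookups.
CREATE INDEX idx_b_id ON b (id);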
At the scale of a trillion-row Cartesian product, attention to detail in how you construct the query is critical. Indexes alone will not help.
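One concrete habit at that scale is to inspect the execution plan before running anything expensive. A sketch using the same hypothetical tables and columns; the plain EXPLAIN keyword shown here is the MySQL/PostgreSQL form, and other engines have their own equivalents:

-- Ask the engine how it intends to execute the query before running it.
EXPLAIN
SELECT a.id, b.event_date
FROM   a
JOIN   b ON b.id = a.id
WHERE  a.status = 'active'
  AND  b.event_date >= '2014-01-01';
-- Warning sign at this scale: a nested-loop join over two full table
-- scans; you want to see index lookups and/or a hash join instead.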
I would also suggest posting your question on www.bigdataspecialist.com; you might get good answers there, since it has a large community of Big Data professionals.
In a Hadoop distributed computing environment you would improve on this by filtering and partitioning the tables by join key in the map phase, distributing the partitions across the nodes, and then performing the actual merge in the reduce phase (a reduce-side join); a sketch follows below.
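One way to express the same filter-then-join strategy on Hadoop is through Hive, assuming Hive is available on the cluster and reusing the hypothetical tables and columns from the earlier examples. On classic MapReduce, Hive compiles a query like this into map tasks that filter and partition the rows by the join key, and reduce tasks that perform the merge itself:

-- Hive sketch with hypothetical tables and columns. The inner SELECTs
-- run in the map phase (filtering), rows are then partitioned across
-- nodes by the join key id, and the reducers perform the join.
SELECT small.id, big.event_date
FROM   (SELECT id
        FROM   a
        WHERE  status = 'active') small                 -- ~10,000 rows
JOIN   (SELECT id, event_date
        FROM   b
        WHERE  event_date >= '2014-01-01') big          -- ~500,000 rows
       ON big.id = small.id;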