
How are database joins optimized? How can you do better to handle big data?

Conceptually, a join is a filtered Cartesian product, and in many database systems joins can be very slow. What are the best solutions for optimizing joins?

  1. From a SQL coding point of view: what are the best coding practices for joins?
  2. From a database point of view: which database architectures have the best optimizers, and how are these optimizers implemented?

As an illustration, suppose you join two tables A and B, each with one million rows, with a "where" condition narrowing table A down to 10,000 rows and another "where" condition narrowing table B down to 500,000 rows. An efficient way to perform the join is to:

  • Identify the 10,000 qualifying rows in table A first,
  • Put these 10,000 rows in a hash table keyed on the join column,
  • Scan table B once; for each of the 500,000 rows satisfying B's "where" criterion, probe the hash table to find the matching rows from A.

This is far more efficient than computing the full Cartesian product, which would involve 1,000,000,000,000 row combinations. Obviously the final result (unless it is an outer join) would consist of at most 10,000 rows.
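A minimal sketch of this hash-join idea in plain Python (the row layout, the "key" column, and the filter predicates are made-up placeholders for illustration, not part of the original post):

# Hash join sketch: table_a and table_b are lists of dicts standing in for rows.
def hash_join(table_a, table_b, a_filter, b_filter):
    # Step 1: apply the "where" condition on A first (the 10,000 rows in the example).
    filtered_a = [row for row in table_a if a_filter(row)]

    # Step 2: build a hash table keyed on the join column.
    hash_index = {}
    for row in filtered_a:
        hash_index.setdefault(row["key"], []).append(row)

    # Step 3: scan B once, keep only rows passing B's "where" condition,
    # and probe the hash table instead of forming a Cartesian product.
    result = []
    for row in table_b:
        if not b_filter(row):
            continue
        for a_row in hash_index.get(row["key"], []):
            result.append({**a_row, **row})
    return result

The point of the sketch is that table B is read only once, and each probe into the hash table is roughly constant time, so the work is proportional to the table sizes rather than to their product.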



Replies to This Discussion

Vincent,

The only addition to your posting is that indexes need to be created properly on the fields used in your where clause. Just be careful not to create too many indexes: too many indexes hurts insert performance. In most database implementations, the primary key is itself a form of index.

Without indexes, evaluating the criteria in the where clause requires a full table scan, which is very slow.

At 1 trillion row combinations, attention to detail in your query construction is critical. Indexes alone will not help.
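One quick way to check whether the optimizer actually uses your indexes is SQLite's EXPLAIN QUERY PLAN; a small Python sketch follows, where the table names, columns, and filter values are hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical tables A and B with a join key and a filterable column each.
cur.execute("CREATE TABLE A (id INTEGER PRIMARY KEY, key INTEGER, region TEXT)")
cur.execute("CREATE TABLE B (id INTEGER PRIMARY KEY, key INTEGER, status TEXT)")

# Index only the columns used in the join and the where clauses;
# every extra index slows down inserts, so keep the set small.
cur.execute("CREATE INDEX idx_a_region ON A (region)")
cur.execute("CREATE INDEX idx_b_key_status ON B (key, status)")

# EXPLAIN QUERY PLAN shows whether the planner uses the indexes
# or falls back to a full table scan.
for row in cur.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM A JOIN B ON A.key = B.key "
    "WHERE A.region = 'EU' AND B.status = 'active'"
):
    print(row)

Other databases expose the same idea through their own EXPLAIN or query-plan tools; the output tells you whether the where clause is being served by an index or by a scan.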

Hi,

I would suggest you also post your question on www.bigdataspecialist.com. You might get good answers there, as it has a large community of big data professionals.


In a Hadoop distributed computing environment you would improve upon this by partitioning table B ("Map") across the various nodes via batch jobs and then merging the partial results back together in the reduce phase.

-Ralph Winters
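A rough sketch of that reduce-side join pattern in the style of Hadoop Streaming (Python; the tab-delimited record layout and the "A"/"B" tags are assumptions for illustration, and in a real job the mapper and reducer would run as separate scripts on separate nodes):

import sys
from itertools import groupby

def mapper(lines, table_tag):
    # Emit "join_key <tab> tag <tab> rest" for every input record.
    # Hadoop partitions these records by key across the reducer nodes.
    for line in lines:
        key, rest = line.rstrip("\n").split("\t", 1)
        print(f"{key}\t{table_tag}\t{rest}")

def reducer(lines):
    # Records arrive sorted by key; merge the A-side and B-side rows per key.
    parsed = (line.rstrip("\n").split("\t", 2) for line in lines)
    for key, group in groupby(parsed, lambda rec: rec[0]):
        a_rows, b_rows = [], []
        for _, tag, rest in group:
            (a_rows if tag == "A" else b_rows).append(rest)
        for a in a_rows:
            for b in b_rows:
                print(f"{key}\t{a}\t{b}")

if __name__ == "__main__":
    # Run as the map step with e.g. "mapper A" or "mapper B",
    # and as the reduce step with "reducer".
    if sys.argv[1] == "mapper":
        mapper(sys.stdin, sys.argv[2])
    else:
        reducer(sys.stdin)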
