A Data Science Central Community
I am wondering what proportion of joins (if you sample 10,000 SQL queries from various programmers) involve keys spanning over 2 tables, vs. 3 or 4 or more tables.
Do you think one of the worst bottlenecks with SQL is users writing queries with (say) 3 joins, when these queries could be split into two queries, each with two joins? Typically, the way I write SQL code is to embed it in a programming language such as Python, and store all the lookup tables that I need as hash tables in memory. So I rarely have to do a join, and when I do, it's just two tables at most.
This approach seems more efficient than writing complex SQL queries with multiple joins, but I'd like to have someone else's opinion, either an SQL expert (I'm not one) or a database architect.
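To make the approach concrete, here is a minimal sketch (with made-up tables and column names) of the lookup-tables-as-hash-tables idea: the "join" becomes one dict lookup per foreign key instead of an SQL JOIN.

```python
# Hypothetical lookup tables, e.g. loaded once from the database.
countries = {1: "US", 2: "DE"}          # country_id -> country code
products  = {10: "book", 20: "pen"}     # product_id -> product name

# Fact rows as they would come back from a plain SELECT.
orders = [
    (101, 1, 10),   # (order_id, country_id, product_id)
    (102, 2, 20),
]

# The "join": one dict lookup per foreign key.
joined = [
    (order_id, countries[c_id], products[p_id])
    for order_id, c_id, p_id in orders
]
print(joined)  # [(101, 'US', 'book'), (102, 'DE', 'pen')]
```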
I am a database architect and SQL expert.
If I can do something in the database, I do it there. Joining is what relational databases are made for. You don't need to be an SQL expert to join 4 tables instead of 2.
Do a bit of reading about "window functions" and "common table expressions" in modern SQL databases. You will see that they make your SQL code simpler and more readable and open up an endless array of possibilities. In a modern database, you can easily do things like moving averages, cumulative sums, or rankings within subgroups with a few commands, things that would otherwise take hours of programming.
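For instance, a running total and a per-group ranking each take one window-function expression. A small illustration on a hypothetical sales table, run through Python's built-in `sqlite3` (window functions need SQLite 3.25 or later):

```python
import sqlite3

# Toy data: sales per region and month.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("north", 1, 100.0), ("north", 2, 150.0),
    ("south", 1, 80.0),  ("south", 2, 120.0),
])

rows = conn.execute("""
    SELECT region, month, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY month)
               AS running_total,
           RANK() OVER (PARTITION BY month ORDER BY amount DESC)
               AS rank_in_month
    FROM sales
    ORDER BY region, month
""").fetchall()
for r in rows:
    print(r)
```

The same query shape works in PostgreSQL, Oracle, SQL Server, and other modern engines.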
Usually after 3-way joins you will take a performance hit, typically because one or more of the tables needs to be sorted. But not always. If your tables are indexed, and you have a mix of large and small tables and are accessing, say, 10% of the data, your queries can be much faster. SQL was originally designed as a terse, user-oriented query language, so the business specification of the results can be somewhat different from the execution. The SQL optimizer will usually decide how this is done, and in some SQL variants you can actually specify that you want the join to be done as a hash join, which will perform the lookup essentially as you describe.
Of course, the problem with hash joins occurs when you can't fit all of the lookups in memory, but that is another topic.
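A toy version of what the engine does for a hash join, on hypothetical data: build an in-memory hash table on the smaller relation, then probe it while streaming the larger one.

```python
def hash_join(small, large, small_key, large_key):
    # Build phase: index the smaller relation in memory by its join key.
    index = {}
    for row in small:
        index.setdefault(row[small_key], []).append(row)
    # Probe phase: stream the larger relation once, probing the index.
    for row in large:
        for match in index.get(row[large_key], []):
            yield match + row

dept = [(1, "sales"), (2, "hr")]                 # (dept_id, name)
emp  = [("ann", 1), ("bob", 2), ("cat", 1)]      # (emp, dept_id)
result = list(hash_join(dept, emp, 0, 1))
print(result)
# [(1, 'sales', 'ann', 1), (2, 'hr', 'bob', 2), (1, 'sales', 'cat', 1)]
```

The memory limit mentioned above is exactly the build phase: if `index` does not fit in RAM, real engines partition both inputs to disk first (a "grace" hash join).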
In some (rare) cases where lookup tables were too big to fit in memory, I used sampling methods and worked with subsets and aggregation rules. A typical example is when a field in your dataset (web log files) is a user agent (browser, abbreviated as UA). You have more unique UAs than can fit in memory, but as long as you keep the 10 million most popular and aggregate the 200,000,000 rare UAs into a few million categories (based on the UA string), you get good results in most applications.
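The top-N-plus-buckets strategy can be sketched in a few lines. This is a tiny toy version (made-up UA strings, and a deliberately crude bucketing rule standing in for real UA-string parsing):

```python
from collections import Counter

# Toy UA stream; in practice this would be counted in a streaming pass.
uas = ["Chrome/1", "Chrome/2", "Firefox/1", "Bot/x", "Bot/y", "Chrome/1"]
counts = Counter(uas)

TOP_N = 2  # in practice, e.g. the 10 million most popular UAs
top = {ua for ua, _ in counts.most_common(TOP_N)}

def bucket(ua):
    # Crude stand-in category rule for the long tail:
    # collapse to the first token of the UA string.
    return "tail:" + ua.split("/")[0]

aggregated = Counter()
for ua, n in counts.items():
    aggregated[ua if ua in top else bucket(ua)] += n
print(dict(aggregated))
```

Popular UAs keep exact counts; the tail collapses into far fewer categories, so the whole table fits in memory again.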
Being an algorithm expert (not an SQL expert), it takes me a couple of minutes to do an efficient 4-table join via hash tables in Perl or Python (using my own script templates). Most of what I do is advanced analytics, not database aggregation. Advanced algorithms, but simple to code in Python, such as hidden decision trees. It would be interesting to have the hidden decision trees algorithm translated into SAS or SQL, though - it should be straightforward.
Anyway, my point here is more about non-expert SQL users such as business analysts: is it easier and more effective to train them to write better SQL code, including sophisticated joins, or to train them to learn Python and blend it with SQL code?
It really depends on what you need to do.
With a well-tuned database server, joining an additional table with the right parameters (which "costs" typing 50 characters) can dramatically reduce query times because the database engine can use additional indexes and other tricks.
Often, your round-trip times to remote databases (clients over VPN etc.) are in the range of seconds, not milliseconds as usual. In those cases, it's much better to send one complex query to the database and receive the results in one batch than waiting for 1,000 queries and results travelling over the network.
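A small illustration of that batching point, using an in-memory SQLite database and invented table names: the same rows fetched with one query per key versus a single parameterized `IN (...)` query. Over a high-latency link, each separate round trip adds its full latency to the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer_id INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i % 5) for i in range(20)])

wanted = [1, 2, 3]

# N round trips: one query per customer (slow over a WAN).
per_query = []
for c in wanted:
    per_query += conn.execute(
        "SELECT id FROM orders WHERE customer_id = ?", (c,)).fetchall()

# One round trip: a single parameterized IN query.
placeholders = ",".join("?" * len(wanted))
batched = conn.execute(
    f"SELECT id FROM orders WHERE customer_id IN ({placeholders})",
    wanted).fetchall()

assert sorted(per_query) == sorted(batched)
```

With 1,000 keys and a 100 ms round trip, the loop costs on the order of 100 seconds in latency alone; the batched query costs one round trip.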
If you have the freedom to choose a database server, check out PostgreSQL: with PL/Python or PL/Perl you can bring your advanced algorithms directly into the database.
I am glad you mentioned PostgreSQL. I had researched this and found that it had extended SQL quite well. I was looking at it in terms of text analytics and unstructured data, but I found that it was also able to transform rows and columns and handle user-defined functions quite well using C++.
I routinely use queries with up to 7 or more joins. I do see, though, that the more joins I use the slower the query runs, especially when I am doing something other than just a simple `SELECT` statement.
The short answer is: you need your data from several tables. The best approach would be one (1) SQL statement delivering it all at once. In practice, however, SQL is made cumbersome by its support staff and operations.
It is part of the game in ICT:
- With ever-evolving hardware, what were once bottlenecks and limitations keep changing continuously.
- The more that older habits become common, the more they turn into ivory towers and silos. These become new hurdles that did not previously exist. That is what happened with the DBA role.
By the way, I have been confronted with this challenge for many years, mixing several sides: infrastructure, DBA, and analytics.
The technical view:
- SQL was developed for OLTP databases, not for analytics (1980s); ANSI SQL is a standard.
- The analytics approach has a gap around missing values. When researching data there can be many reasons for a value to be missing, and the reason is itself information. Analytical procedures very often do not tolerate missing values, so imputation is part of preparing data.
- OLTP DBMSs were designed with a lot of operational requirements (ACID) and under the limitation that the data does not fit into the machine's internal memory. Everything is driven by optimizing I/O to disk (DASD). To manage those DBMSs, a new role was born in the ICT field: the DBA.
The DBA ivory tower:
- MDL (Metadata Language) is the holy part of a DBA's domain (metadata, security, pages, physical layout).
- DDL (Data Definition Language) is the communication channel for serving their clients (indexing, structure, schema, fields). This is commonly not allowed for users/analysts; there must be a service request, incident, or problem ticket.
- DML (Data Manipulation Language), i.e. reading records/tables (SELECT), is allowed, and sometimes adding/writing.
With this segregation, a good, well-performing analytics environment has become very difficult to achieve.
The DBMS is often normalized and/or oriented to serve as the destination of an ETL process delivering cubes.
The cubes are sometimes called analytics, but this is not the analytics of Vincent's area; a whole world of confusion is arising around that.
Yes, some good, well-performing designs are possible with a DBMS and many joins/views. The logic is no different from what Vincent has already described doing himself with an in-memory (hash table) approach.
Every SQL query should be able to perform well given full DBA access, DBA knowledge, and the technical tuning effort. But does an analyst want to be a DBA? Does he want to wait for one? Is a DBA even available?
The hype around "big data" and NoSQL has a reason.
Newer databases like Vertica and Teradata have been designed with more of an analytics goal in mind.
A lot of descriptive statistics can be stored within those kinds of DBMSs. Still, the fit is not complete, ironically. A DBMS approach starts from the premise of "having all data", while analysts are quickly subsetting, and changing their subsets as insight changes.
Almost every modern DBMS is capable of some in-database processing (Java, C++), but it requires cooperation and is specific to each technical implementation. That requires very good cooperation across all the areas involved. I have never witnessed that kind of cooperation; quite the contrary.
What I had in mind is a system where you have to download the lookup tables infrequently (maybe once a week) and access the main (fact) table more frequently. If you must re-download the lookup tables very frequently, then the Python approach loses its efficiency, and you make your colleagues unhappy because your frequent downloads slow down the whole system.
That is not the traditional DBMS, although it can be used that way. It is about OLAP/MDX.
Creating the information is the whole process that comes before it, sometimes needing a lot of corrections and assumptions. ETL, data cleansing, and verification require a lot of work by specialists. This is the focus of the BI world of Kimball/Inmon. The data is duplicated a lot, as the goal is quick retrieval and visualization.
Running on multiple computers is referred to as a distributed DBMS. Duplicating data for retrieval should be a possibility; the difficulties of ACID in a distributed system are to be avoided.
The whole concept of duplicating data and optimizing data access for fast retrieval was the initiator of Hadoop: the promise of relatively cheap hardware with many duplicates serving the data as fast as possible (caching as much as possible in memory). Some evolutions of this approach are the commercial offerings around in-memory analytics and visual analytics.
When the effort of the ETL building the DWH (cubes) could be eliminated (no IT staff anymore), doing "self-service BI" could change the OLAP approach completely.
Vincent, if your goal is just to eliminate the join clauses in the select statement, then you can remove the joins and write the selection (join) criteria in the WHERE clause. And of course, you would have to be sure that you have indexes and caching set properly on the database.
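The two styles side by side, on a toy schema (invented table names), run through Python's `sqlite3`: the explicit `JOIN ... ON` form and the old comma-list form with the join criterion moved into the WHERE clause. Both produce the same inner join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, name TEXT);
    CREATE TABLE orders (id INTEGER, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (10, 1), (11, 2), (12, 1);
""")

# Explicit join syntax.
explicit = conn.execute("""
    SELECT c.name, o.id
    FROM customers c JOIN orders o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# Join criterion in the WHERE clause instead.
implicit = conn.execute("""
    SELECT c.name, o.id
    FROM customers c, orders o
    WHERE o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

assert explicit == implicit
print(explicit)  # [('ann', 10), ('bob', 11), ('ann', 12)]
```

Most optimizers plan the two identically; the explicit form is generally preferred because outer joins cannot be expressed cleanly in the WHERE-clause style.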
I am of the opinion that you should do as little as possible in SQL (just enough to retrieve the data) and then pass the data into a Hadoop- (or MPI-, or PVM-) based application to do the number crunching. The language of my choice would be C.
In general, the rule of thumb is to "bring algorithms close to data". The approach you describe does not scale, especially to terabyte+ dataset sizes.
There are several standard benchmarks, and you can check how various database server products are doing at http://www.tpc.org/tpcds/default.asp Historically, TPC-D was used as a standard benchmark for data warehousing. The TPC Web site says it has been obsolete since 1999, but when I last worked on it at Dataupia and Infobright (in 2006 - 2009), vendors still actively competed in it. In any case, TPC-DS is its replacement. I mention this because if you look at the benchmark database schema and at the queries comprising the benchmarks, you will see the challenge and where the issues are coming from.
In a nutshell, if you look at the list of TPC-D benchmark query definitions, you will see several multi-table joins involving very large tables, joined on different fields. The queries are complex, with subqueries that cannot be rewritten into something else (e.g., correlated 'EXISTS' subqueries with self-joins, outer joins, and so on). Can you process such queries the way you describe it in memory if you have several multi-terabyte tables? The answer is no.
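To show the query shape being referred to, here is a simplified, hypothetical correlated `EXISTS` subquery (a toy stand-in, not an actual TPC-D query): the inner query references a column of the outer row, so it cannot be answered as an independent lookup against a hash table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER, name TEXT);
    CREATE TABLE orders (customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO orders VALUES (1, 50.0), (1, 500.0), (2, 20.0);
""")

# Correlated subquery: the inner WHERE refers to c.id from the outer row.
rows = conn.execute("""
    SELECT c.name FROM customer c
    WHERE EXISTS (
        SELECT 1 FROM orders o
        WHERE o.customer_id = c.id AND o.total > 100
    )
""").fetchall()
print(rows)  # [('ann',)]
```

In the real benchmark, both tables would be multi-terabyte, which is what rules out the fully in-memory approach.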
There have been hundreds of person-centuries of software engineering labor spent and thousands of research papers published on decision-support query processing. Most database products capable of processing such complex queries at multi-terabyte scale (up to 100 TB or even larger) use a combination of sharding, bitmap indexing, and columnar storage. A shortlist of database server product names includes Exasol, Oracle, Sybase IQ, Netezza, Infobright, Vertica, Teradata (Kickfire), etc.
Sharding by itself does not solve the problem, because (i) join queries are only processable within individual shards if the tables are joined on the fields also used for sharding (Netezza and Dataupia used "data redistribution" message-passing operations to allow joins in other cases); (ii) aggregation functions might not always be "grand-aggregated"; (iii) data might be skewed due to unbalanced loading and, more importantly, as a result of applying local filters. These are some of the reasons why MapReduce is a Big Data integration capability, as opposed to a replacement for columnar databases (such as Infobright, Vertica, or Sybase IQ), as far as data warehousing is concerned.