A Data Science Central Community
For those of us who are familiar with writing SQL queries against traditional relational databases, we have been conditioned to avoid Cartesian joins. However, with the explosion of Big Data, maybe we should reconsider that conditioning.
A Cartesian join may be called a Cartesian product in a beginner's SQL guide, or a Cross Join in others. Regardless of the name, the term means that for the tables listed in your FROM clause, no common relationship is specified in your WHERE clause. The result is every row of the first table paired with every row of the second table.
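A minimal sketch of that behavior, using Python's built-in sqlite3 module (the table names and data here are purely illustrative):

```python
import sqlite3

# In-memory database with two small, illustrative tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE colors (color TEXT);
    CREATE TABLE sizes  (size  TEXT);
    INSERT INTO colors VALUES ('red'), ('blue'), ('green');
    INSERT INTO sizes  VALUES ('S'), ('M');
""")

# No join condition: every row of colors pairs with every row of sizes.
rows = conn.execute("SELECT color, size FROM colors CROSS JOIN sizes").fetchall()
print(len(rows))  # 3 rows x 2 rows = 6 combinations
```

With 3 rows in one table and 2 in the other, the result has 3 × 2 = 6 rows, which is why row counts explode so quickly on real-sized tables.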
Big Data has introduced a new era of statistical thinking; we can ask more questions of the data, and now a Cartesian join may be exactly what you need. However, Cartesian conditions are still very dangerous without a clear understanding of what you are doing, so caution is still warranted. It depends on what question you are trying to answer. Here are a few examples:
There are many questions that can be answered with Cartesian products. However, the most important factor is the skill set of the worker. This is serious business: the worker must know where the data is coming from. If you are working with multiple data sources, the part of the data set that comes from a relational database may not be desired as a Cartesian result. It all depends on the type of question you are answering.
What do you think?
IMHO it's more about how you build the dataset for questions that can only be answered with Cartesian joins. There are many ways to optimize: you almost always have scenarios where you need to consider all the dimensions, but you can often find common subsets and then join them, or write data pipelines that do aggregations first and then join, etc. Running unnecessary Cartesian joins will slow down your analysis. Part of the promise of parallel computing is getting a model done quickly enough to leave room for analyzing the results and iterating on the model, and running very large Cartesian products as part of data prep can hurt that.
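The aggregate-first-then-join idea in the comment above can be sketched briefly; the tables and numbers here are hypothetical, the point is that the join then operates on one row per group instead of one row per raw record:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('east', 10), ('east', 20), ('west', 5), ('west', 15), ('west', 30);
    CREATE TABLE targets (region TEXT, target REAL);
    INSERT INTO targets VALUES ('east', 25), ('west', 60);
""")

# Aggregate first, then join: the join sees one row per region
# instead of one row per individual sale.
rows = conn.execute("""
    SELECT s.region, s.total, t.target
    FROM (SELECT region, SUM(amount) AS total
          FROM sales GROUP BY region) AS s
    JOIN targets AS t ON t.region = s.region
    ORDER BY s.region
""").fetchall()
print(rows)  # [('east', 30.0, 25.0), ('west', 50.0, 60.0)]
```

Here five raw sales rows collapse to two aggregate rows before the join, so the join's row-multiplication is bounded by the number of groups rather than the number of raw records.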
All these examples are like experimental designs using all combinations of a factor. This is an old solution, not a new one. Fractional designs for this problem have been known since the 1930s and are very efficient. There are other optimal screening designs available now, for example in the JMP DOE platform.
3-4 are process optimisation questions that would benefit from response surface modelling
Fortunately there are many ways to construct optimal experiments now, so you can select cases or, as in (2), set up product-choice experiments.
Yes, they are using all combinations of a factor and they have been used for years. That does not mean that they are no longer valid.
Thanks for mentioning the new designs. I will look into using them. Are there any others?