A Data Science Central Community

__Category: Model__

OLAP is an important constituent part of BI(business intelligence).

Understood literally, OLAP is online analytical processing, that is, users conduct analytical operation on real-time business data.

But, currently the concept of OLAP is seriously narrowed, and only it refers to operations such as conducting drilling, aggregating, pivoting and slicing based on multi-dimensional data, namely, multi-dimensional interaction analysis.

To apply this kind of OLAP, it is necessary to create in advance a group of topic specific data CUBEs. Then users can display these data in the form of crosstab or graph and conduct in various real-time transformations (pivoting and drilling) on them, in the hope to find in the transformation process a certain law of the data or the argument to support a certain conclusion, thereby achieving the aim of analysis.

**Is this kind of OLAP we need?**

To answer this question, we need to carefully investigate the real application process of the OLAP, thereby finding out what the technical problem the OLAP needs to solve is on earth.

Employees with years’ working experiences in any industry generally have some educated guesses about the businesses they engage in, such as:

*A stock analyst may guess stocks meeting a certain condition are likely to go up.*

*An employee of an airline company may guess what kinds of people are accustomed to buying what kind of flights.*

*A supermarket operator may also guess the commodity at what price is more suitable for the people around the supermarket.*

*…*

These guesses are just the basis for forecast. After operating for a period of time, a constructed business system can also accumulate large quantities of data, and these guesses have most probably been evaluated by these accumulated data, when evaluated to be true, they can be used in forecast; when evaluated to be false they will be re-guessed.

It needs to be noted that **these guesses are made by users themselves** instead of the computer system! What a computer should do is to help a user to evaluate, according to the existing data, the guess to be true or false, namely, on-line data query (including certain aggregation computation). This is just the application process of OLAP. The reason why on-line analysis is needed is that many query computations are temporarily required after a user has seen a certain intermediate result. In the whole process, model in advance is impossible and unnecessary.

We call the above process **evaluation process**, whose purpose is to find from historical data some laws or evidences for conclusions, and the means adopted is to conduct interactive query computation on historical data.

The following are a few examples actually requiring computations (or queries):

* The first n customers whose purchases from the company account for half of the sales volume of the company of the current year;*

* The stocks which go up to the limit for three consecutive days within one month;*

* Commodities in the supermarket which are sold out at 5 P.M for three times within one month;*

* Commodities whose sales volumes in this month have decreased by more than 20% over those of the preceding month;*

* …*

Evidently, this type of computation demand is ubiquitous in business analysis process and all can be computed out from historical database.

Then, can the narrowed OLAP be used to complete the above-mentioned computation process?

**Of course NOT!**

Currently OLAP system has two key disadvantages:

- The multi-dimensional cube is prepared in advance by the application system and user does not have the capability to temporarily design or reconstruct the cube, so once there is new analysis demand, it is necessary to re-create the cube.
- The analysis actions could be implemented by cube are rather monotonous. The defined actions are quite few, such as the drilling, aggregating, slicing, and pivoting. The complicated analysis behavior requiring multi-steps is hard to implement.

Although the current OLAP products are splendid regarding its look and feel, few on-line analysis capabilities powerful enough are provided actually.

Then, what kind of OLAP do we need?

It is very simple, and we need a kind of on-line analytical system that can support evaluation process!

Technically speaking, steps for evaluation process can be regarded as computation regarding data (query can be understood to be filter computation). This kind of computation can be freely defined by user and user can occasionally decide the next computation action according to the existing intermediate result, without having to model beforehand. Additionally, as data source is generally database system, it is necessary to require this kind of computation to be able to very well support mass structured data instead of simple numeric computation.

Then, can SQL (or MDX) play this role?

SQL is indeed invented for this aim and it owns complete computation capability and it adopts a writing style similar to natural language.

But, as SQL computation system is too basic, it is very difficult and over-elaborate to use it to achieve complex computation, such as problems listed in the preceding paragraphs. It is even not so easy for programmers who have received professional training, so ordinary users can only use SQL to implement some of the simplest queries and aggregate computation (based on the filter and summarization of a single table). This result leads to the fact that the application of SQL has already deviated far away from its original intention of invention, almost becoming the expertise for programmers.

We should follow the working thought of SQL to carefully study the specific disadvantage of SQL and find the way to overcome it in an effort to develop a new generation of computation system, thereby implementing the evaluation process, namely, the real OLAP.

__Category: Model__

SQL is invented primarily to provide a method to access structured data in order to transparentise the physical storage scheme, so a lot of various types of English vocabularies and syntaxes are used in SQL to reduce the difficulty in understanding and writing it. And the relational algebra as the basic theory of SQL is a complete computation system, which can compute everything in principle. In terms of this, we certainly should use SQL to satisfy various demands for data computation.

But, though relational database has achieved a huge success, evidently SQL fails to realize its original aim of invention. Except very few simple queries can be completed by end user using SQL, most of SQL users are still technical personnel, and even many complex queries are no easy job for technical personnel.

Why? We inspect the disadvantage of SQL in computation through a very simple example.

Suppose there is a sales performance table consisting of three fields (to simplify the problem, date information is omitted):

sales_amount |
Sales performance table |

sales |
Name of salesman, suppose there is no duplicate name. |

product |
Products sold |

amount |
Sales amount of the salesman on the product |

Now we want to know the name list of the salespersons whose sales amounts rank among the top 10 places both in air-conditioners and TV sets.

This question is rather simple and people will very naturally design out the computation process as follows:

- Arrange the sequence according to the sales amount of air-conditioner and find out the top 10 places.
- Arrange the sequence according to the sales amount of TV and find out the top 10 places.
- Get the intersection of the results of 1 and 2 and obtain the answer.

Now we use SQL to do it.

- Find out the top 10 places of the sales amount of air-conditioner. This is very simple: select top 10 sales from sales_amount where product='AC' order by amount desc
- Find out the top 10 places of the sales amount of TV. The action is the same: select top 10 sales from sales_amount where product='TV' order by amount desc
- Seek the intersection of 1 and 2. This is somewhat troublesome, as SQL does not support computation by steps. The computation result of the above two steps cannot be saved, and thus it is necessary to copy it once again:

select * from

( select top 10 sales from sales_amount where product='AC' order by amount desc )

intersect

( select top 10 sales from sales_amount where product='TV' order by amount desc )

A simple 3-step computation has to be written like this using SQL, and daily computations of more than 10 steps are in great numbers. So this evidently goes beyond the acceptability of many people.

In this way, we know the first important disadvantage of SQL: **Do not support computation by steps**. Dividing complex computation into several steps can reduce the difficulty of a problem to a great extent. On the contrary, completing many steps of computation into one step can increase the difficulty of a problem to a great extent.

It can be imagined that, if a teacher requires pupils to create only one calculation formula to complete the calculation in solving application problems, how distressed the pupils will feel (of course, there are certain clever children who can solve the problem)!

SQL query cannot by conducted by steps, but the stored procedure written out with SQL can operate by steps. Then, is it possible to use the stored procedure to conveniently solve this problem?

For the time being, we just ignore how complex is the technical environment in which the stored procedure is used (this is enough to make most people give it up) and the incompatibility caused by differences of databases. We only try to know theoretically whether it is possible to use SQL to make this computation simpler and faster.

- Compute the top 10 places sales amount of air-conditioners. The statement is still the same, but we need to save the result for use by Step 3, while in SQL, it is only possible to use table to store set data. So we need to create a temporary table:

create temporary table x1 asselect top 10 sales from sales_amount where product='AC' order by amount desc

- Compute the top 10 places of the sales amount of TV. Similarly

create temporary table x2 asselect top 10 sales from sales_amount where product='TV' order by amount desc

- Seek the intersection, the preceding steps are troublesome but this step is simpler:
*select * from x1 intersect x2*

After the computation is done in steps, the working thought becomes clear, but it still appears over-elaborate to use a temporary table. In the computation of mass structured data, temporary set, as intermediate result, is rather common. If the temporary table is created for storage in all cases, the computation efficiency is low and it is not intuitive.

Moreover, SQL does not allow the value of a certain field to be a set (namely temporary table), so in this way, it is impossible to implement some computations even if we tolerate the over elaborate.

If we change the problem into computing the salespersons whose sales amounts of all products rank among the top 10 places, try thinking how to compute it. By continuing to use the above-mentioned working thought, it is very easy to get the below points:

- Group the data according to products, arrange the sequence of each group, and get the top 10 places;
- Get the intersection of the top 10 places of all products;

As we do not know beforehand how many products there are, so it is necessary to also store the grouping result in a temporary table. There is a field in this table that needs to store the corresponding group members, which is not supported by SQL, so the method is unfeasible.

If supported by window function (SQL2003 standard), it is possible to change the working thought. After grouping by product, compute the number of times each salesman appears in the top 10 places of the sales amounts of all product category group. If the number of times is the same as the total number of the product categories, it indicates this salesman is within the top 10 places regarding the sales amounts of all product categories.

select sales

from ( select sales,

from ( select sales,

rank() over (partition by product order by amount desc ) ranking

from sales_amount)

where ranking <=10 )

group by sales

having count(*)=(select count(distinct product) from sales_amount)

**How many people can write such complex SQL?**

Moreover, in many databases, the window functions are not supported. Then, it is only possible to use the stored procedure to develop a loop, according to the sequence, the top 10 places of each product, and seek the intersection of the result of the preceding time. This process is not very much simpler than using high level language to develop, and it is also necessary to cope with the triviality of the temporary table.

Now, we know the second important disadvantage of SQL: **Set-lization is not complete**. Though SQL has the concept of set, it fails to provide set as a kind of basic data type, which makes it necessary to transform a lot of natural set computations in thinking and writing.

In the above computation, we have used the keyword top**.** In fact there is not such a thing (it can be combined out by other computation computations) in the theory of relational algebra, and this is not the standard writing style of SQL.

**Let us see how difficult it is to look for the top 10 places when there is no top.**

Rough working thought: Seek out the number of members whose sales amount are higher than itself to rank the sales person, and then get the members whose places do not exceed 10, and the SQL is written as follows:

select sales

from ( select A.sales sales, A.product product,

(select count(*)+1 from sales_amount

where A.product=product AND A.amount<=amount) ranking

from sales_amount A )

where product='AC' AND ranking<=10

or

select sales

from ( select A.sales sales, A.product product, count(*)+1 ranking

from sales_amount A, sales_amount B

where A.sales=B.sales and A.product=B.product AND A.amount<=B.amount

group by A.sales,A.product )

where product='AC' AND ranking<=10

Professional technical personnel may not necessarily write such SQL statement well! And only the first ten places are computed.

To say the least, even if there is top**,** it only makes it easy to get the preceding part lightly. If we change the problem into getting the 6^{th} place to the 10^{th} place, or seeking the salesman whose sales amount is 10% more than that of the next one, the difficulty is still there.

The reason causing this phenomenon lies in the third important disadvantage of SQL: **Lack the support of ordered set**. SQL inherits the unordered set in mathematics, which directly causes the fact that the computations relating to sequence are rather difficult. And it can be imagined how common the computations relating to sequence (such as over the preceding month, over the same period last year, the first 20%, and rankings) will be.

The newly added window functions in SQL2003 standard provides some computation capabilities relating to sequence, which makes it possible to solve some problems in a relatively simple method and alleviate the problem of SQL to a certain extent. But the use of window functions is often accompanied by sub-query, and it cannot enable user to directly use the sequence number to access set member, so there are still many ordered computations that are difficult to solve.

Now we want to pay attention to the gender proportion of the “good” salespersons that are computed out, that is, how many males and females there are respectively. Generally, the gender information is recorded in the employee table but not in the performance table, and it is simplified as follows:

employee |
Employees table |

name |
Names of employees, suppose there is no repeated name. |

gender |
Genders of employees. |

We have already computed out the name list of “good” salespersons, and the relatively natural idea is to seek out their genders from the employee table using name list, and count the number. But in SQL, it is necessary to use join operation to get information across tables . In this way, following the initial result, SQL will be written as:

select employee.gender,count(*)

from employee,

( ( select top 10 sales from sales_amount where product='AC' order by amount desc )

intersect

( select top 10 sales from sales_amount where product='TV' order by amount desc ) ) A

where A.sales=employee.name

group by employee.gender

With only an associated table more, it is made so over-elaborate and in reality there are rather more cross-table storages and they are often multi-layered. For example, for salespersons, there are departments where there are managers, and now we want to know by which managers these “good” salespersons are managed. Then there are three table joins, and it is indeed no easy job to write clear where and group in this computation.

This is just the fourth important disadvantage of SQL as we want to say: **Lack of object reference**, in relational algebra, the relations between objects completely depends on foreign key. This not only makes the efficiency very low in looking for relation, but also makes it impossible to directly treat the record pointed by foreign key as the attribute of primary record . Try thinking, can the above statement be written as this:

select sales.gender,count(*)

from (…) // …is the SQL computing the “good” salespersons above

group by sales.gender

Evidently, this statement is not only clearer, and at the same time, the computation will also be more efficient (without join computation).

We have analyzed, through a simple example, the four important difficulties of SQL. We believe this is just the main reason why SQL fails to reach the original intention of its invention. The process of solving business problem based on a kind of computation system is in fact the process of **translating business problems into formalized computation syntax** (similar to the case in which a pupil solves application problem, translates the problem into formalized four arithmetic operations). Before overcoming these difficulties, SQL model system rather does not comply with people’s natural thinking habit, causing great barriers in translating problems, making it very difficult for SQL to be applied, on a large scale, in data computation for business problems.

For still another example which is easily understood by programmer, use SQL as data computation, which is similar to the case in which assembly language is used to complete four arithmetic operations. We very easily write out the calculation expression such as 3+5*7, but to use assembly language (take X86 as the example), it needs to be written as

mov ax,3

mov bx,5

mul bx,7

add ax,bx

In either writing or reading, such code is far inferior to 3+5*7 (it will be more troublesome if we come across decimal). Though it cannot be regarded as a big problem to a skilled programmer, to most people, however, this kind of writing is too hard to understand. In this sense, FORTRAN is really a great invention.

__Category: Model__

The computing power of SQL for mass structured data is complete, that is to say, it is impossible to find anything that SQL cannot compute. But its support layer is too low, which can lead to over-elaborate operation in practical application.

The over-elaborate operation is specifically reflected in the following four aspects:

**Computation without sub-step**: SQL requires computation to be written out in one statement, and it is necessary to adopt storage procedure to implement computation step by step. No sub-step not only causes difficulty in thinking, but also makes it difficult to use intermediate result.**Set is unordered**: SQL does not directly provide the mechanism of using position to refer to set members, and conversion is needed to implement computation relating to order and positioning.**Set-lization is not complete**: SQL set function is simple and is only used to indicate the query result set and cannot be explicitly applied as basic data type.**It lacks object reference**: SQL does not support record reference, the association between data tables adopts equivalent foreign key scheme, and in conducting multi-table joint computation, it is necessary to conduct join operation. So it is not only difficult to understand, but also low in efficiency.

Implementing data computation process based on a type of computation system is in fact the **process of translating business problem into formalized computation syntax (**which is similar to the case in which a primary-school student solves an application problem by translating the problem into formalized four arithmetic operations). Because of the above-mentioned four problems of SQL, in handling complex computation, its model system is inconsistent with people’s natural thinking habit. It causes a great barrier in translating problems, leading to the case that **the difficulty to formalize the problem-solving method into computation syntax is much greater than to find the solution of the problem**.

We give the following examples to describe respectively the problems in the four aspects.

*To make the statement in the examples as simple as possible, here a large number of SQL2003 standard window functions are used. So we adopt the ORACLE database syntax that does a relatively good job in supporting SQL2003 standard as it will be generally more complex to adopt the syntax of other databases to program these SQLs.*

**Computation without sub-step**

Carrying out complex computation step by step can reduce the difficulty of the problem to a great extent, conversely, collecting a multi-step computation into one to be completed in just one step increases the complexity of the problem.

**Task 1** The number of persons of the sales department, where, the number of persons whose native place is NY, and where, the number of female employees?

The number of persons of the sales department

1 |
select count(*) from employee where department=‘sales’ |

Where, the number of persons whose native place is Beijing

1 |
select count(*) from employee where department=‘sales ’ and native_place=‘NY’ |

And where, the number of female employees

1 |
select countt (*) from employee |

2 |
where department=‘sales’ and native_place=‘NY’ and gender =‘female’ |

**Conventional thought**: Select the persons of the sales department for counting, and from it, find out the persons whose native place is NY for counting, and then further find out the number of female employees for counting. The query each time is based on the existing result last time, so it is not only simple in writing but also higher in efficiency.

But, the computation of SQL cannot be conducted in steps, and it is impossible to reuse the preceding result in answering the next question, and it is only possible to copy the query condition once more.

**Task 2** Each department selects a pair of male and female employees to form a game team.

1 |
with A as |

2 |
(select name, department, |

3 |
row_number() over (partition by department order by 1) seq |

4 |
from employee where gender =‘female’), |

5 |
B as |

6 |
(select name, department, |

7 |
row_number() over(partition by department order by 1) seq |

8 |
from employee where sex =‘female’) |

9 |
select name, department from A |

10 |
where department in ( select distinct department from B ) and seq=1 |

11 |
union all |

12 |
select name, department from B |

13 |
where department in (select distinct department from A ) and seq=1 |

Computation without sub-step sometimes not only causes trouble in writing and low efficiency in computation, but even causes serious deformation in the train of thought.

The intuitive thought of this task: For each department cycle, if this department has male and female employees, then select one male employee and one female employee and add them to the result set. But SQL does not support this kind of writing with which the result set is completed step by step (to implement this kind of scheme, it is necessary to use the stored procedure). At this time, it is necessary to change the train of thought into: Select male employee from each department, select female employee from each department, select out, respectively from the two result sets, members whose departments appear in another result set, and finally seek the union of the sets.

Fortunately, there are still with sub-statement and window function over (SQL2003 standard begins to support); otherwise this SQL statement will be simply ugly.

**The set is unordered**.

Ordered computation is very common in mass data computation (obtain the first 3 places/the third place, compare with the preceding period). But SQL adopts the mathematical concept of unordered set, so ordered computation cannot be conducted directly, and it is necessary to adjust the train of thought and change the method.

**Task 3** Company’s employees whose ages are in the middle

1 |
select name, birthday |

2 |
from (select name, birthday, row_number() over (order by birthday) ranking |

3 |
from employee ) |

4 |
where ranking=(select floor((count(*)+1)/2) from employee) |

Median is a very common computation, and originally it is only necessary to simple get out, from the ordered set, the members whose positions are in the middle. But SQL unordered set mechanism does not provide the mechanism which directly uses position to access member. It is necessary to create a man-made sequence number field, and then use the condition query method to select it out, causing the case in which a sub-query is needed to complete the query.

**Task 4** For how many trading days has this stock gone up consecutively in the longest?

1 |
select max(consecutive_day) |

2 |
from (select count(*) (consecutive_day |

3 |
from (select sum(rise_mark) over(order by trade_date) days_no_gain |

4 |
from (select trade_date, |

5 |
case when |

6 |
closing_price >lag(closing_price) over(order by trade_date) |

7 |
then 0 else 1 end rise_mark |

8 |
from stock_price) ) |

9 |
group by days_no_gain) |

Unordered set can also cause train of thought to deform.

The conventional train of thought for computing the number of consecutive days in which the stock rises: Set up a temporary variable whose initial value is 0 to record the consecutive dates in which the stock rises, and then compare it with the preceding day. If the stock does not rise, then clear the variable to 0; if it rises, add 1 to the variable, and see the maximum value appearing from the variable when the cycle is over.

In using SQL, it is impossible to describe this process, so it is necessary to change the train of thought. To compute the accumulate number of days in which stock does not rise from the initial date to the current date, and the one with the same number of days in which stock does not rise is the consecutive trading days in which the stock rises, and from its sub-group, it is possible to find out the interval in which the stock rises, and then seek its maximum count. It is already not so easy to read and understand this statement and it is more difficult to write it out.

**Set-lization is not complete**.

It is beyond any doubt that set is the basis of mass data computation. Although SQL has the concept of set, it is limited to describing simple result set, and it does not take the set as a basic data type to enlarge its application scope.

**Task 5** Employees in the company whose birthday are the same as those of others

1 |
select * from employee |

2 |
where to_char (birthday, ‘MMDD’) in |

3 |
( select to_char(birthday, ‘MMDD’) from employee |

4 |
group by to_char(birthday, ‘MMDD’) |

5 |
having count(*)>1 ) |

The original intention of grouping is to split the source set into several subsets, and its returned values are also these sub-sets. But SQL cannot describe this kind of "set consisting of sets", so it forcibly conducts the next step aggregating computation on these sub-sets and forms conventional result set.

But sometimes what we want is not the summary value on sub-sets, but rather the subsets themselves. At this time, it is necessary to use from the source set the condition obtained from grouping to query again, so sub-query appears again unavoidably.

**Task 6** Find out students whose scores ranks in top 10 for all subjects

1 |
select name |

2 |
from (select name |

3 |
from (select name, |

4 |
rank() over(partition by subject order by score DESC) ranking |

5 |
from score) |

6 |
where ranking<=10) |

7 |
group by name |

8 |
having count(*)=(select count(distinct subject) from score) |

Use set-lized train of thought, order and filter the sub-sets of subjects after grouping to select the top 10 of every subject, and then it is possible to complete the task by finding out the intersection set of these sub-sets. But SQL cannot describe the "set of set" and has not the intersection operation to cope with indefinite quantity set. At this time, it is necessary to change the train of thought and use the window function to find out the top 10 of every subject, and then find out, according to student sub-group, the students whose number of appearances is the same as the quantity of subjects, which causes difficulty in understanding.

It lacks object reference.

In SQL, the reference relation between tables depends on equivalent foreign key for maintenance and it is impossible to directly use the record at which the foreign key point as the field of this record. In query, it is necessary to seek help of multi-table join or sub-query to complete the query, which causes not only trouble in writing but also low efficiency in operations.

**Task 7 ** Female manager’s male employees

Use multi-table join.

1 |
select A.* |

2 |
from employee A, department B, employee C |

3 |
where A. department=B. department and B. manager=C. name and |

4 |
A.sex =‘male’ AND C. gender =‘female’ |

Use sub-query.

1 |
select * from employee |

2 |
where department in |

3 |
(select department from department |

4 |
where manager in |

5 |
(select name from employee where gender =‘female’)) |

If the department field in the employee table points at the record in the department table while the manager field in the department table points at the record in the employee table, then it is only necessary to write this query condition simply as this kind of intuitive high-efficiency form:

where department. manager. sex ='female' and sex ='male'

But in SQL, it is only possible to use multi-table join or sub-query to write out the two kinds of obviously obscure statements.

**Task 8** Companies with which employees have their first jobs

Use multi-table join.

1 |
select name, company, first_company |

2 |
from (select employee.name name, resume.company company, |

3 |
row_number() over(partition by resume. name |

4 |
order by resume.start_date) work_seq |

5 |
from employee, resume where employee.name = resume.name) |

6 |
where work_seq=1 |

Use sub-query.

1 |
select name, |

2 |
(select company from resume |

3 |
where name=A. name and |

4 |
start date=(select min(start_date) from resume |

5 |
where name=A. name)) first_company |

6 |
from employee A |

Without object reference mechanism and the completely set-lized of SQL, it is naturally impossible to handle the sub-table as an attribute of the primary table (field value). Regarding the query of sub-table, there are two methods. The first is to use multi-table join, increase the complexity of the statement, and use filter or grouping to convert the result set into the situation having one-to-one correspondence with the primary table record (the joined record has one-to-one correspondence with the sub-table). The second is to adopt sub-query, and each time compute temporarily the sub-table relating to the primary table record to record sub-sets, and increase the overall computation workload (it is impossible to use with sub-statement in sub-query) and trouble in writing.

© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

## You need to be a member of AnalyticBridge to add comments!

Join AnalyticBridge