A Data Science Central Community
The computing power of SQL for mass structured data is complete, that is to say, it is impossible to find anything that SQL cannot compute. But its support layer is too low, which can lead to over-elaborate operation in practical application.
The over-elaborate operation is specifically reflected in the following four aspects:
Implementing data computation process based on a type of computation system is in fact the process of translating business problem into formalized computation syntax (which is similar to the case in which a primary school student solves an application problem by translating the problem into formalized four arithmetic operations). Because of the abovementioned four problems of SQL, in handling complex computation, its model system is inconsistent with people’s natural thinking habit. It causes a great barrier in translating problems, leading to the case that the difficulty to formalize the problemsolving method into computation syntax is much greater than to find the solution of the problem.
We give the following examples to describe respectively the problems in the four aspects.
To make the statement in the examples as simple as possible, here a large number of SQL2003 standard window functions are used. So we adopt the ORACLE database syntax that does a relatively good job in supporting SQL2003 standard as it will be generally more complex to adopt the syntax of other databases to program these SQLs.
Carrying out complex computation step by step can reduce the difficulty of the problem to a great extent, conversely, collecting a multi-step computation into one to be completed in just one step increases the complexity of the problem.
Task1 The number of persons of the sales department, where, the number of persons whose native place is NY, and where, the number of female employees?
The number of persons of the sales department
Conventional thought: Select the persons of the sales department for counting, and from it, find out the persons whose native place is NY for counting, and then further find out the number of female employees for counting. The query each time is based on the existing result last time, so it is not only simple in writing but also higher in efficiency.
But, the computation of SQL cannot be conducted in steps, and it is impossible to reuse the preceding result in answering the next question, and it is only possible to copy the query condition once more.
Task2 Each department selects a pair of male and female employees to form a game team.
Computation without sub-step sometimes not only causes trouble in writing and low efficiency in computation, but even causes serious deformation in the train of thought.
The intuitive thought of this task: For each department cycle, if this department has male and female employees, then select one male employee and one female employee and add them to the result set. But SQL does not support this kind of writing with which the result set is completed step by step (to implement this kind of scheme, it is necessary to use the stored procedure). At this time, it is necessary to change the train of thought into: Select male employee from each department, select female employee from each department, select out, respectively from the two result sets, members whose departments appear in another result set, and finally seek the union of the sets.
Fortunately, there are still with sub-statement and window function over (SQL2003 standard begins to support); otherwise this SQL statement will be simply ugly.
Ordered computation is very common in mass data computation (obtain the first 3 places/the third place, compare with the preceding period). But SQL adopts the mathematical concept of unordered set, so ordered computation cannot be conducted directly, and it is necessary to adjust the train of thought and change the method.
Task3 Company’s employees whose ages are in the middle
Median is a very common computation, and originally it is only necessary to simple get out, from the ordered set, the members whose positions are in the middle. But SQL unordered set mechanism does not provide the mechanism which directly uses position to access member. It is necessary to create a manmade sequence number field, and then use the condition query method to select it out, causing the case in which a sub-query is needed to complete the query.
Task4 For how many trading days has this stock gone up consecutively in the longest?
Unordered set can also cause train of thought to deform.
The conventional train of thought for computing the number of consecutive days in which the stock rises: Set up a temporary variable whose initial value is 0 to record the consecutive dates in which the stock rises, and then compare it with the preceding day. If the stock does not rise, then clear the variable to 0; if it rises, add 1 to the variable, and see the maximum value appearing from the variable when the cycle is over.
In using SQL, it is impossible to describe this process, so it is necessary to change the train of thought. To compute the accumulate number of days in which stock does not rise from the initial date to the current date, and the one with the same number of days in which stock does not rise is the consecutive trading days in which the stock rises, and from its sub-group, it is possible to find out the interval in which the stock rises, and then seek its maximum count. It is already not so easy to read and understand this statement and it is more difficult to write it out.
It is beyond any doubt that set is the basis of mass data computation. Although SQL has the concept of set, it is limited to describing simple result set, and it does not take the set as a basic data type to enlarge its application scope.
Task5 Employees in the company whose birthday are the same as those of others
The original intention of grouping is to split the source set into several subsets, and its returned values are also these sub-sets. But SQL cannot describe this kind of "set consisting of sets", so it forcibly conducts the next step aggregating computation on these sub-sets and forms conventional result set.
But sometimes what we want is not the summary value on subsets, but rather the subsets themselves. At this time, it is necessary to use from the source set the condition obtained from grouping to query again, so sub-query appears again unavoidably.
Task6 Find out students whose scores ranks in top 10 for all subjects
Use set-lized train of thought, order and filter the sub-sets of subjects after grouping to select the top 10 of every subject, and then it is possible to complete the task by finding out the intersection set of these sub-sets. But SQL cannot describe the "set of set" and has not the intersection operation to cope with indefinite quantity set. At this time, it is necessary to change the train of thought and use the window function to find out the top 10 of every subject, and then find out, according to student sub-group, the students whose number of appearances is the same as the quantity of subjects, which causes difficulty in understanding.
In SQL, the reference relation between tables depends on equivalent foreign key for maintenance and it is impossible to directly use the record at which the foreign key point as the field of this record. In query, it is necessary to seek help of multitable join or subquery to complete the query, which causes not only trouble in writing but also low efficiency in operations.
Task7 Female manager’s male employees
Use multi-table join.
If the department field in the employee table points at the record in the department table while the manager field in the department table points at the record in the employee table, then it is only necessary to write this query condition simply as this kind of intuitive high-efficiency form:
where department. manager. sex ='female' and sex ='male'
But in SQL, it is only possible to use multi-table join or sub-query to write out the two kinds of obviously obscure statements.
Task8 Companies with which employees have their first jobs
Use multi-table join.
Without object reference mechanism and the completely setlized of SQL, it is naturally impossible to handle the sub-table as an attribute of the primary table (field value). Regarding the query of sub-table, there are two methods. The first is to use multitable join, increase the complexity of the statement, and use filter or grouping to convert the result set into the situation having one-to-one correspondence with the primary table record (the joined record has one-to-one correspondence with the subtable). The second is to adopt subquery, and each time compute temporarily the subtable relating to the primary table record to record subsets, and increase the overall computation workload (it is impossible to use with sub-statement in sub-query) and trouble in writing.