A Data Science Central Community
Both esProc and the R language are typical data processing and analysis languages built around two-dimensional structured data objects, and both are good at multi-step complex computations. However, their two-dimensional structured data objects differ considerably in the underlying mechanism. As a result, esProc is better at computation with structured data and is especially suitable for developers doing business computing, while R is better at matrix computation and more suitable for scientists doing scientific or engineering computation.
esProc's two-dimensional structured data type is the sequence table object (TSeq). A sequence table is based on records: multiple records form a row-oriented two-dimensional table which, together with the column names, makes up a complete data structure. R's data frame is based on vectors: multiple vectors form a column-oriented two-dimensional table which, together with the column names, makes up a complete data structure.
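The column-oriented layout of a data frame can be checked directly in R. The following is a minimal sketch, assuming a small hypothetical data frame `df` that is not part of the original article; it shows that a data frame is a list of column vectors, and that a "row" is only a one-row slice across those vectors.

```r
# Hypothetical sample data, used only for illustration
df <- data.frame(id = 1:3, amount = c(10.5, 20, 30))

is_list_of_columns <- is.list(df)  # a data frame is stored as a list of columns
column_count <- length(df)         # length() counts columns, not rows
amount_column <- df[["amount"]]    # one column is a plain numeric vector
second_row <- df[2, ]              # a row is itself a one-row data frame
```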
These underlying mechanisms affect the actual user experience. In the following sections we compare the sequence table object and the data frame in practical use, in terms of basic functions, advanced features, actual use cases and test results.
Note: Only the built-in functions of each language are used in the following comparisons; third-party extension packages are not involved.
Example 1: Retrieve two-dimensional structured data from a file, and access the value of the second column in the first row by coordinates.
Comparison: there is no significant difference in the most basic functions.
Note: the sales.txt file contains tab-separated structured data, and its first few lines are as follows:
Comparison: there is no significant difference between the two.
Example 3: Access column data. There are two scenarios: retrieve only the second column, or retrieve a combination of the second and fourth columns. Each scenario can be done in two ways: by column number or by column name.
Comparison: Both can access column data. The only difference is in the syntax for retrieving multiple columns. With a data frame the column numbers are used directly, while with a sequence table a new sequence table is built with the new function. Although the syntax differs, the actual method is the same: both copy two columns of data from the original object into a new object.
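On the data frame side, the four access patterns in Example 3 can be sketched as follows. This is an illustrative sketch only: the sample rows are hypothetical, and only the column names follow the sales data used in this article.

```r
# Hypothetical rows; column names follow the sales data in this article
data <- data.frame(
  OrderID = c(1, 2, 3),
  Client = c("CA", "RA", "CA"),
  SellerId = c(5, 4, 5),
  Amount = c(2961.40, 1931.20, 500.00),
  stringsAsFactors = FALSE
)

second_by_number <- data[[2]]                 # second column by position
second_by_name <- data$Client                 # same column by name
two_by_number <- data[, c(2, 4)]              # columns 2 and 4 by position
two_by_name <- data[, c("Client", "Amount")]  # same columns by name
```

Both multi-column forms return a new data frame that holds copies of the selected columns.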
Example 4: Record manipulation. This includes retrieving the first two records, appending a record, inserting a record in the second row, and deleting the record in the second row.
append <- data.frame(OrderID=152, Client="CA", SellerId=5, Amount=2961.40, OrderDate="2010-12-5 0:00:00")
data <- rbind(data, append)
insert <- data.frame(OrderID=153, Client="RA", SellerId=4, Amount=1931.20, OrderDate="2009-11-5 0:00:00")
Comparison: record manipulation is possible with both. esProc is somewhat more convenient: its insert function can append or insert records directly into a sequence table, while in R we need to split the data frame and then merge the pieces again, achieving the same result indirectly.
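The split-and-merge approach for a data frame might look like the sketch below. The three starting rows are hypothetical; the `insert` record is the one shown above. Row names are reset after each step so that they match the new row positions.

```r
# Hypothetical starting data with the same columns as the sales data
data <- data.frame(
  OrderID = c(1, 2, 3),
  Client = c("AA", "BB", "CC"),
  SellerId = c(1, 2, 3),
  Amount = c(100, 200, 300),
  OrderDate = c("2009-01-01 0:00:00", "2009-01-02 0:00:00", "2009-01-03 0:00:00"),
  stringsAsFactors = FALSE
)
insert <- data.frame(OrderID = 153, Client = "RA", SellerId = 4,
                     Amount = 1931.20, OrderDate = "2009-11-5 0:00:00",
                     stringsAsFactors = FALSE)

# Insert at row 2: split after the first row, then merge the three pieces
inserted <- rbind(data[1, ], insert, data[2:nrow(data), ])
rownames(inserted) <- NULL

# Delete the record in the second row with a negative index
deleted <- inserted[-2, ]
rownames(deleted) <- NULL
```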
As both the sequence table and the data frame are structured two-dimensional data objects, there is no significant difference between them in basic functions such as data reading/writing, data access and maintenance.
Example 5: Modification through association. A1 and A2 are two-dimensional structured data objects that share the field id. We need to add the bonus field values of A2 to the salary field values of A1, matched by id.
A1=db.query("select id,name,salary from salary order by id")
A2=db.query("select id,bonus from bonus order by id")
The data frame has no function for modification through association. This has to be coded manually, and the code is omitted here.
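One possible manual approach in base R is sketched below, using match to line up the ids. The sample values are hypothetical, and treating a missing bonus as zero is an assumption made for this sketch rather than something stated in the example.

```r
# Hypothetical data with the same fields as A1 and A2 above
A1 <- data.frame(id = c(1, 2, 3), name = c("Tom", "Ann", "Joe"),
                 salary = c(3000, 3500, 4000), stringsAsFactors = FALSE)
A2 <- data.frame(id = c(1, 3), bonus = c(200, 500))

idx <- match(A1$id, A2$id)    # position of each A1 id in A2, NA if absent
bonus <- A2$bonus[idx]
bonus[is.na(bonus)] <- 0      # assumption: no bonus record means no change
A1$salary <- A1$salary + bonus
```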
Example 6: Association by merging. A1, A2 and A3 are two-dimensional structured data objects that share the field id. Associate them with a left join. As the data is already sorted by id, use a merge method to speed up the association.
Sequence table: join@m1(A1:salary,id; A2:bonus,id; A3:attendance,id)
The data frame supports association of two tables, for example: merge(A1,A2,by.x="id",by.y="id",all=TRUE). Note that all=TRUE produces a full outer join; all.x=TRUE gives the left join requested here. As three tables are involved in this case, the result has to be achieved indirectly through two two-table associations.
In addition, the data frame does not support association by merging, so no speed improvement is possible. In other words, the data frame cannot take advantage of ordered data to improve performance, neither in association nor in other operations.
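The two-step association could be written as in the following sketch, with hypothetical sample values. Here all.x = TRUE keeps every row of the left table, which is what makes each step a left join.

```r
# Hypothetical data sharing the key field id
A1 <- data.frame(id = c(1, 2, 3), salary = c(3000, 3500, 4000))
A2 <- data.frame(id = c(1, 3), bonus = c(200, 500))
A3 <- data.frame(id = c(2, 3), attendance = c(20, 22))

# First left join: A1 with A2
step1 <- merge(A1, A2, by = "id", all.x = TRUE)
# Second left join: the intermediate result with A3
joined <- merge(step1, A3, by = "id", all.x = TRUE)
```

Ids that have no match in A2 or A3 get NA in the corresponding column.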
Example 7: Record lookup. There are four scenarios: retrieve the records with Amount greater than 1000; retrieve the sequence numbers of the records with Amount greater than 1000; return the record whose primary key value is "v"; return the sequence number of the record whose primary key value is "v".
Data frame: only the first two scenarios can be achieved directly, with the following code:
newdata <- data[data$Amount > 1000, ]
which(data$Amount > 1000)
The data frame has no concept of a primary key, so for the other two scenarios we need to code an indirect method manually, or use a third-party package (e.g. data.table). The code is omitted here.
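An indirect method in base R could look like the sketch below. It assumes that an ordinary column (here OrderID, with hypothetical string values) plays the role of the primary key; the data frame itself does not enforce uniqueness, so match simply returns the first hit.

```r
# Hypothetical data; OrderID is treated as the key by convention only
data <- data.frame(OrderID = c("u", "v", "w"),
                   Amount = c(500, 1500, 2500),
                   stringsAsFactors = FALSE)

key_pos <- match("v", data$OrderID)  # sequence number of the record, NA if absent
key_rec <- data[key_pos, ]           # the record itself
```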
Example 8: Grouping and aggregation. The data is grouped by Client and SellerId, and then two other fields are aggregated: a sum of the Amount field and a count of the OrderID field.
Data frame: a single aggregate statement applies only one aggregation, such as the sum of Amount, as follows:
To apply two different aggregations at the same time with a data frame, we have to use two separate aggregate statements and then merge the results. The code is omitted here.
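The two-statement approach can be sketched as follows, with hypothetical sample rows. The count column is renamed to OrderCount (a name chosen for this sketch) so that it is not confused with the original OrderID field.

```r
# Hypothetical rows with the fields used in this example
data <- data.frame(Client = c("CA", "CA", "RA", "RA", "CA"),
                   SellerId = c(5, 5, 4, 4, 3),
                   OrderID = 1:5,
                   Amount = c(100, 200, 300, 400, 500),
                   stringsAsFactors = FALSE)

# One aggregate statement per aggregation
amount_sum <- aggregate(Amount ~ Client + SellerId, data = data, FUN = sum)
order_count <- aggregate(OrderID ~ Client + SellerId, data = data, FUN = length)
names(order_count)[3] <- "OrderCount"

# Merge the two results on the grouping fields
result <- merge(amount_sum, order_count, by = c("Client", "SellerId"))
```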
Example 9: Reuse of grouping. Group the data by Client, then complete multiple subsequent computations on the grouping result, including a sum of Amount per group and a count after further grouping by SellerId.
The data frame does not support reuse of grouping directly. Grouping and aggregation usually need to be done in one step, which means two identical grouping operations are needed to accomplish the same purpose, as follows:
If we want to reuse the grouping, we must use the split function and a loop. The code is lengthy and performs poorly.
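A rough sketch of the split-based approach is given below, with hypothetical sample rows. The grouping is computed once with split and then traversed twice; sapply and lapply stand in for the explicit loop.

```r
# Hypothetical rows with the fields used in this example
data <- data.frame(Client = c("CA", "CA", "RA", "RA", "CA"),
                   SellerId = c(5, 5, 4, 4, 3),
                   Amount = c(100, 200, 300, 400, 500),
                   stringsAsFactors = FALSE)

groups <- split(data, data$Client)  # grouping by Client, done only once

# First computation on the groups: total Amount per Client
amount_by_client <- sapply(groups, function(g) sum(g$Amount))

# Second computation on the same groups: record count per SellerId
seller_counts <- lapply(groups, function(g) table(g$SellerId))
```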
The sequence table and the data frame are quite different in terms of advanced features. This is mainly demonstrated in the following five ways:
In this part we use a real case for a comprehensive comparison of the data frame and the sequence table.
Computation target: based on daily transaction data, select the blue-chip stocks whose prices have risen for 5 days in a row.
Approach: import the data; filter out the previous month's data; group it by ticker; sort each group by date; compute the change in closing price over the previous day; compute the number of consecutive days of positive growth; select the stocks that have risen for 5 or more days in a row.
Sequence Table Solution:
Data frame Solution:
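One way the steps listed above could be written in base R is sketched here. The column names Code, Date and Close, the sample prices and the helper max_rising_run are all hypothetical, and the data is assumed to be already restricted to the month of interest, so the import and date filter steps are left out.

```r
# Hypothetical helper: longest run of consecutive day-over-day rises
max_rising_run <- function(close) {
  rises <- diff(close) > 0             # TRUE where the price rose over the previous day
  if (length(rises) == 0) return(0L)
  runs <- rle(rises)                   # run-length encoding of rise/fall streaks
  up <- runs$lengths[runs$values]      # lengths of the rising streaks only
  if (length(up) == 0) 0L else max(up)
}

# Hypothetical daily closing prices for two tickers
stock <- data.frame(
  Code = rep(c("A", "B"), each = 7),
  Date = rep(as.Date("2012-01-02") + 0:6, times = 2),
  Close = c(10, 11, 12, 13, 14, 15, 14,
            20, 21, 20, 21, 22, 21, 22),
  stringsAsFactors = FALSE
)

stock <- stock[order(stock$Code, stock$Date), ]  # sort by ticker, then date
runs <- sapply(split(stock$Close, stock$Code), max_rising_run)
selected <- names(runs)[runs >= 5]               # tickers with 5 or more rises in a row
```

The sketch relies on rle to count streaks instead of an explicit loop over days, which keeps the per-ticker logic in one small function.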
Test 1: Generate 10 million records in memory, each consisting of three fields whose values are all random numbers. The records are then filtered, and each field is summed.
Comparison: the sequence table needs 50.534 seconds, while the data frame needs 91.999 seconds, nearly twice as long.
Test 2: Read a 1.2 GB txt file, then filter and sum on two fields.
Comparison: the sequence table takes 87.122 seconds, while the data frame takes 1.1347 hours, a difference of roughly 47 times. This is mainly because the data frame reads the file extremely slowly.
From the above comparison, we can see that the sequence table is better than the data frame in terms of richness of features, ease of syntax, memory consumption, development effort, library function performance and coding performance. Of course, the data frame does not represent the full strength of the R language. R has powerful vector and matrix types and a large body of associated functions, which make it more professional than esProc in scientific and engineering computation.