A Data Science Central Community

Cross-row and group computation often involves computing link relative ratio and year-on-year comparison. Link relative ratio refers to comparison between the current data and data of the previous period. Generally, it takes month as the time interval. For example, compare the sales amount of April with that of March, and the growth rate we get is the link relative ratio of April. Hour, day, week and quarter can also be used as the time interval. Year-on-year comparison is the comparison between the current data and data of the corresponding period of the previous year. For example, compare the sales amount of April 2014 with that of April 2013 and compute the growth rate which is April’s year-on-year comparison. Data of multiple periods are usually compared to find the variation trend in practical business.

Now let’s look at the method of computing link relative ratio and year-on-year comparison in R language through an example.

**Case description:**

Compute the link relative ratio and year-on-year comparison of each month’s sales amount during a specified period of time. The data come from orders table *sales*, in which column Amount contains order amount and column OrderDate contains order dates. Some of the data are as follows:

**Code:**

1 sales<-read.table("E:\\ salesGroup.txt",sep="\t", header=TRUE)

2 filtered<-subset(sales,as.POSIXlt(OrderDate)>=as.POSIXlt('2011-01-01 00:00:00') &as.POSIXlt(OrderDate)<=as.POSIXlt('2014-08-29 00:00:00'))

3 filtered$y<-format(as.POSIXlt(filtered$OrderDate),'%Y')

4 filtered$m<-format(as.POSIXlt(filtered$OrderDate),'%m')

5 agged<-aggregate(filtered$Amount, filtered[,c("m","y")],sum)

6 agged$lrr<- c(0, (agged$x[-1]-agged$x[-length(agged$x)])/agged$x[-length(agged$x)])

7 result<-agged[order(agged$m),]

8 result$yoy<-NA

9 for(i in 1:nrow(result)){

10 if(i>1 && result[i,]$m==result[i-1,]$m){

11 result[i,]$yoy<-(result[i,]$x-result[i-1,]$x)/result[i-1,]$x

12 }

13 }

**Code interpretation:**

1. The first four lines of code are easy to understand. *read.table* is used to read data from the table and *subset* to filter data, and two *format* functions are used to generate year and month respectively. Note that the beginning and ending time should be output dynamically from the console using *scan* function; here they are simplified as fixed constants.

After computing, some of the values of database frame *filtered* are:

2. agged<-aggregate(filtered$Amount, filtered[,c("m","y")],sum), this line of code summates the order amount of each month of each year. Note that in the code, the month must be written before the year though data are grouped by the year and the month according to business logic. Otherwise R language will perform grouping first by the month, then by the year, which will get result inconsistent with business logic and make data viewing inconvenient.

After computing, some of the values of data frame *agged* are:

3. agged$lrr<- c(0, (agged$x[-1]-agged$x[-length(agged$x)])/agged$x[-length(agged$x)])，this line of code computes link relative ratio. The result will be stored in the new column *Irr*. Business logic is (order amount of the current month – order amount of the previous month)\order amount of the previous month.

Note: [-N] in the code represents that the Nth row of data is removed. So agged$x[-1]means the first row of data is removed and agged$x[-length(agged$x)]means the last row of data is removed. By performing certain operation between the two, link relative ratio can be obtained indirectly. But the result won’t include the link relative ratio of the first month (i.e. January 2011), so a zero should be added to the code. We can see that the code logic and the business logic share some similarities but are quite different. The code is difficult to understand.

At this point, some of the values of data frame *agged*are:

4. result<-agged[order(agged$m),], this line of code sorts data by the month and the year. Since the data of the year are ordered, we just need to perform sorting by the month. result$yoy<-NA initializes a new column which will be used to store the year-on-year comparison of sales amount.

Now the value of *result* is:

5. The loop judgment in the last four lines of code is to compute the year-on-year comparison. Business logic: (order amount of the current month – order amount of the previous month)\order amount of the previous month. Code logic: from the second line, if the month in the current line is the same as that in the previous line, the code will compute year-on-year comparison. Detailed code is result[i,]$yoy<-(result[i,]$x-result[i-1,]$x)/result[i-1,]$x. We can see that the code written in this way is easy to understand and its logic is quite similar to the business logic.

The only weakness of this piece of code is that it cannot use the *loop* function of R language, which makes it a little lengthy. But compared with the difficult operation of link relative ratio, maybe a longer but simple code is better.

The final results are as follows:

**Summary:**

R language can compute link relative ratio and year-on-year comparison, but the operation of link relative ratio is difficult to understand and the code of year-on-year comparison is a little lengthy. The codes of both operations are not easy to learn.

**The third-party solution**

Python, esProc and Perl, all of which can perform structured data computation, can be used to handle this case. In the following, we’ll briefly introduce esProc and Python’s solutions.

**esProc**

esProc is good at expressing business logic freely with agile syntax. Its code is concise and easy, as shown below:

In the above code, **groups** function is used to group and summarize data by the year and the month. The *derive* functions in A4 and A6 generate link relative ratio and year-on-year comparison respectively.

As can be seen from the code, esProc also uses[-N]. Different from [-N] in R language, it doesn’t represent removing the Nth row; it represents the Nth row counted from the current line. For example, [-1] is the previous line. In this way, the operation of link relative ratio can be simply expressed as (x-x[-1])/x[-1].But R language hasn’t expressions for relative positions, which makes its code difficult to understand.

In the year-on-year comparison operation, esProc uses judgment function *if* in *loop* function, making it avoid the lengthy loop statement and its code simpler. While R language only has the judgment statement but hasn’t the judgment function. This is the reason why its code is lengthy.

Finally, these are the computed results:

**Python****（Pandas****）**

*Pandas* is*Python*’s third-party package. Its basic data type is created by imitating R’s dataframe but gets improved greatly. At present, its latest version is 0.14. Its code for handling this case is as follows:

1 sales = pandas.read_csv('E:\\salesGroup.txt',sep='\t')

2 sales['OrderDate']=pandas.to_datetime(sales.OrderDate,format='%Y-%m-%d %H:%M:%S')

3 filtered=sales[(sales.OrderDate>='2011-01-01 00:00:00') & (sales.OrderDate<='2014-08-29 00:00:00')]

4 filtered['y']=filtered.OrderDate.apply(lambda x: x.year)

5 filtered['m']=filtered.OrderDate.apply(lambda x: x.month)

6 grouped=filtered.groupby(['y','m'],as_index=False)

7 agged=grouped.agg({'Amount':[sum]})

8 agged['lrr']=agged['Amount'].pct_change()

9 result=agged.sort_index(by=['m','y'])

10 result.reset_index(drop=True,inplace=True)

11 result['yoy']=result.apply(lambda _:numpy.nan, axis=1)

12 for row_index, row in result.iterrows():

13 if(row_index>0 and result.ix[row_index,'m']==result.ix[row_index-1,'m']):

14 result.ix[row_index,'yoy']=(result.ix[row_index,'Amount']-result.ix[row_index-1,'Amount'])/result.ix[row_index-1,'Amount']

In the code, *pct_change()* function is used to directly compute the link relative ratio, which is more convenient than the method used by R language and esProc. But this kind of function is not universal and can only deal with isolated cases. When it is required to compute link relative ratio or year-on-year comparison, Pandas can only complete the task by combining *div* function and *shift* function, which makes its code more difficult to understand than R’s.

In computing year-on-year comparison, Pandas’ code is as lengthy as R’s. This is because *Pandas* also cannot use *if* function in *loop* function. I’m afraid cooperation of *apply* function and *lambda* syntax is needed if we want to write simpler code.

Finally, let’s look at the computed results:

Please pay attention to the following easy-to-get-wrong details:

1 The code must be sort_index(by=['m','y'])when we perform sorting by the month and the year. The simple form sort(m), which used in R language and esProc, is not allowed.

2 *Pandas* has the assignment syntax asresult.loc[row_index,‘yoy’]=value. But when assigning value to a certain element in data frame, we should write the code as result.ix[row_index,'yoy']=value.

3 When iterrows()is used to perform loop, its loop number row_indexis index instead of row number. To make the row number conform to the index, reset_index() should be used to reset the indexes.

© 2020 TechTarget, Inc. Powered by

Badges | Report an Issue | Privacy Policy | Terms of Service

**Most Popular Content on DSC**

To not miss this type of content in the future, subscribe to our newsletter.

- Book: Applied Stochastic Processes
- Long-range Correlations in Time Series: Modeling, Testing, Case Study
- How to Automatically Determine the Number of Clusters in your Data
- New Machine Learning Cheat Sheet | Old one
- Confidence Intervals Without Pain - With Resampling
- Advanced Machine Learning with Basic Excel
- New Perspectives on Statistical Distributions and Deep Learning
- Fascinating New Results in the Theory of Randomness
- Fast Combinatorial Feature Selection

**Other popular resources**

- Comprehensive Repository of Data Science and ML Resources
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- 100 Data Science Interview Questions and Answers
- Cheat Sheets | Curated Articles | Search | Jobs | Courses
- Post a Blog | Forum Questions | Books | Salaries | News

**Archives:** 2008-2014 |
2015-2016 |
2017-2019 |
Book 1 |
Book 2 |
More

**Most popular articles**

- Free Book and Resources for DSC Members
- New Perspectives on Statistical Distributions and Deep Learning
- Time series, Growth Modeling and Data Science Wizardy
- Statistical Concepts Explained in Simple English
- Machine Learning Concepts Explained in One Picture
- Comprehensive Repository of Data Science and ML Resources
- Advanced Machine Learning with Basic Excel
- Difference between ML, Data Science, AI, Deep Learning, and Statistics
- Selected Business Analytics, Data Science and ML articles
- How to Automatically Determine the Number of Clusters in your Data
- Fascinating New Results in the Theory of Randomness
- Hire a Data Scientist | Search DSC | Find a Job
- Post a Blog | Forum Questions

## You need to be a member of AnalyticBridge to add comments!

Join AnalyticBridge