Can anyone please help me understand the logic and reasoning behind this piece of code:

data xyz;
set xyz;
if (_n_ eq 1) then set incdata(keep=incstd inc99);
if incstd > 2*inc99 then inc_est2 = min(inc_est, (4*inc99));
else inc_est2 = inc_est;
run;

Thanks!

Raghu

Tags: capping, outliers, removal, rule


Replies to This Discussion

Raghu,

I have a vague idea, but I would need to know precisely what the variables inc_est, incstd, and inc99 are. It looks like something along the lines of: if a value is more than two 99th percentiles (roughly 5 sd) above the mean, then truncate it.

Graham

 

Thanks Graham!

Raghu:

*** Code line number identification ***;
* 1; data xyz;
* 2; set xyz;
* 3; if (_n_ eq 1) then set incdata(keep=incstd inc99);
* 4; if incstd > 2*inc99 then inc_est2 = min(inc_est, (4*inc99));
* 5; else inc_est2 = inc_est;
* 6; run;


Line 1: "data xyz;" creates a new data set named xyz, reading from the input data set named on Line 2: ("set xyz;"). Because the input and output names are identical, the step rewrites xyz in place.

Line 3: A conditional SET statement. The "_n_" is a special SAS-provided automatic variable available during execution/run time that counts DATA step iterations and is not output to the resultant dataset. Because of the "if (_n_ eq 1)" condition, the "set incdata" statement executes only on the first iteration, reading a single observation from "incdata". Variables brought in with a SET statement are automatically retained, so the values of "incstd" and "inc99" read on that first pass stay attached to every subsequent observation of "xyz". The additional code "(keep=incstd inc99)" is a way of reading only those two variables from the "incdata" data set. This is a common idiom for attaching one row of summary statistics to every row of a detail data set (see the short demo after these line-by-line notes).

Line 4: Another SAS IF statement tests whether the condition of "incstd" being greater than 2 times "inc99" is true. When it is, a new variable, "inc_est2", is created and assigned the minimum of two inputs: "inc_est" or the value of 4 times "inc99".

Line 5: If the condition evaluates as FALSE--that is, when "incstd" is not greater than 2 times "inc99"--then "inc_est2" is populated with the value in "inc_est".

Line 6: "run;" closes out the DATA step begun on Line 1.
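
As a quick, self-contained demonstration of the Line 3 idiom, here is a minimal sketch; all data set names and values below are made up for illustration:

* Demo: values read by a conditional SET are retained on every iteration;
data demo_stats;
  incstd = 5000; inc99 = 120000;       * one row of summary statistics;
run;

data demo_main;
  do id = 1 to 3; output; end;         * three dummy detail records;
run;

data demo_out;
  set demo_main;
  if (_n_ eq 1) then set demo_stats;   * executes on the first iteration only;
  put id= incstd= inc99=;              * log shows incstd/inc99 populated on all three rows;
run;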

*** Code Analysis ***;
So, it appears that the data set "xyz" is rewritten in place (using the exact same name "xyz"), with the two summary values "incstd" and "inc99" read once from "incdata" and attached to every observation. "incstd" most likely refers to the standard deviation of an income variable, and "inc99" is most likely its 99th percentile value. The rule then reads: if the standard deviation is more than twice the 99th percentile (a sign that extreme outliers are inflating the spread), treat it as an outlier situation and impute with the smaller of the two: "inc_est" or 4 times the value of "inc99". If that condition is not met (i.e., if it's not an outlier), then simply keep the value found in "inc_est". This logic is evaluated once for every observation in "xyz".
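
For context, here is one plausible upstream step that could produce a one-row "incdata" like the one this code consumes; the PROC MEANS call and the raw income variable name "inc" are assumptions for illustration, not traced from the original source:

* Assumed upstream step: summarize income into a one-row lookup data set;
proc means data=xyz noprint;
  var inc;                             * "inc" is a hypothetical raw income variable;
  output out=incdata (keep=incstd inc99) std=incstd p99=inc99;
run;

The capping DATA step above would then pick up "incstd" and "inc99" from this one-row data set via the conditional SET.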

Matt

Hi Matt,

Thanks a lot in the first place! Your explanation was very descriptive.

I wanted to take some time to understand percentiles, standard deviation, and their relations. You have explained the code part very well; thanks for that. I am still very much confused by the logic part. (incstd represents the standard deviation.)

From the knowledge I have gained regarding percentiles and standard deviation, I interpret the code as follows:

if incstd>2*inc99 then inc_est2= min(inc_est,(4*inc99));
else inc_est2=inc_est;

If the std dev of income is more than twice the 99th percentile, i.e., 2 times three std devs (6 sigmas), then the estimated income would be the minimum of the two, i.e., the estimated income and 4 times the 99th percentile, i.e., the min of the estimated income and 12 sigmas.

Consider the statement: I want my data to be between 20 and 30. If a data point goes below 15, then I would cap it to 15, and if it goes above 35, then I would cap it at 35.

The above statement and the logic behind it are very clear to me. It would be really great if you could draw an analogy between the question in context and this statement.

 

Thanks & Regards,

Raghu

Thanks for the additional background, clarity, and the example. I would strongly suggest keeping things more linear; that is, instead of thinking of 3 vs. 6 vs. 12 sigmas, think of p99 as the value below which 99 percent of all values for that variable fall.

Your example highlights a common practice in a data analysis/model scoring exercise: ensuring outliers do not result in scores outside certain tolerance limits (for example, outside the range of legitimate scores). Typical cut-point choices are the 1st percentile (p1, to trim extreme low values) and the 99th percentile (p99, to trim extreme high values). In your example, these would correspond to the values of 15 (p1) and 35 (p99), respectively.
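
As a concrete sketch of that practice (the data set name "scores" and variable name "score" are illustrative, not from the book), the cut points can be computed once and applied with a MIN/MAX pair:

* Winsorize a score to its own p1/p99 range (e.g., 15 and 35 in your example);
proc univariate data=scores noprint;
  var score;
  output out=cuts pctlpts=1 99 pctlpre=p;   * creates variables p1 and p99;
run;

data scores_capped;
  set scores;
  if (_n_ eq 1) then set cuts;              * same one-row lookup idiom as the book's code;
  score_cap = min(max(score, p1), p99);     * pull lows up to p1 and highs down to p99;
run;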

p1:

The problem is that the code has no specification for how to handle low extremes; otherwise there would be code to address p1. Given that you're studying income, this *might* not be as important, since income is bounded below by 0 and any negative values would either be data errors or transactional states prior to becoming an appropriate positive value.

p99:

Unfortunately, the code does not accomplish what you've laid out with your example. It is not pulling outliers back in to the 99th percentile; the logic only engages in cases where the "incstd" value is greater than twice p99, and even then it caps at four times p99 rather than at p99 itself.
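
To make the contrast concrete (the numbers below are made up, purely for illustration):

* Contrast: a plain p99 cap vs. the book's conditional cap;
data contrast;
  input inc_est incstd inc99;
  cap_p99 = min(inc_est, inc99);            * always pulls highs back to p99;
  if incstd > 2*inc99 then cap_book = min(inc_est, 4*inc99);
  else cap_book = inc_est;                  * book's rule: no change unless the spread is extreme;
  datalines;
200000 5000 120000
900000 300000 120000
;
run;

In the first row the spread is modest, so the conditional rule leaves inc_est untouched while the plain cap pulls it down to inc99; in the second row the spread trips the condition and inc_est is capped at 4*inc99 (480,000), which is still well above p99 itself.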

NOTE: This "incstd" variable is the key; we need to know more about it before we proceed. How was it created? What is its origin? What are the host variable(s)? I'm wary of over-interpreting until we understand this variable. Can you trace the code back to uncover how "incstd" was computed?

Hi Matt,

Thanks for the prompt response. I have taken the example code from a Wiley e-book. If you google the search term "data mining cookbook modeling data for marketing risk and customer relationship management pdf", the e-book comes up first in the results. The code is on page 63.

Please refer to it for the background!

Also, Matt, it would be great if you could suggest a more understandable piece of code for capping the outliers in my data.

I shall be awaiting your response.

Thanks a lot once again! You are really very helpful.

Thanks & Regards,

Raghu

Sorry for the belated response - it's been a busy summer so far.

I'm familiar with the book and found it, no problem. I think we'd accomplish more if I could provide you with a more concrete example - matching code and data. To do so will require a software package. Which statistical software(s) are you using?

Hello Matt,

That's fine. I tried to explore the solution myself but am still not able to relate the terms. Currently I am using SAS (version 9.3).

It would be great if you could explain the capping rule to me, with code, in a simpler way.

I shall be awaiting your response.

Regards,

Raghu
