
I'm just wondering how most data mining algorithms handle data measured only at the ordinal level. R doesn't seem to have an ordinal data type - it only has factors (categorical) and continuous variable types. So there's no real way of flagging that the data is measured in that way.

 

I'm guessing that it doesn't matter for most algorithms (except linear regression) as most data mining algorithms can handle non-linearity with ease and don't have assumptions about normality etc. So other than LR, are there any other algorithms that need to be avoided when using ordinal data?

Replies to This Discussion

R does have ordered factors, and you can create them a couple of ways. When building the factor, use factor(myvector, ordered=TRUE), or use the ordered() function to convert an existing vector or factor. I believe there are also R packages for ordinal data processing; see the "ordinal" package.
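To make the two routes concrete, here is a minimal sketch in base R. The vector `responses` and its level names are made up for illustration. Note that if `levels` is not given, R sorts the levels alphabetically ("high", "low", "medium"), which is usually not the order you want.

```r
# Hypothetical survey responses; the level names are illustrative only.
responses <- c("low", "high", "medium", "low", "high")

# Route 1: declare the order when the factor is created.
sat1 <- factor(responses, levels = c("low", "medium", "high"), ordered = TRUE)

# Route 2: convert an existing vector (or factor) with ordered().
sat2 <- ordered(responses, levels = c("low", "medium", "high"))

is.ordered(sat1)   # checks that the factor carries an ordering
levels(sat2)       # levels in the declared order
sat1 < "high"      # comparisons respect the declared order
max(sat1)          # highest level observed in the data
```

Comparison operators and max()/min() only work on ordered factors; on a plain factor they return NA with a warning or raise an error, which is a quick way to check which kind you have.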
That's some great info Larry, thanks for that! I didn't realise that the factor type actually had that parameter... I suppose the remaining question is, which algorithms can take advantage of that ordering facility, and for which algorithms does it not help?
I would imagine you could use a generalized linear model, i.e. the glm() function, to develop a predictor. Of course you could also use many non-linear methods such as rpart, Bayesian inference, or neural nets.
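As a rough illustration of how glm() makes use of the ordering: with R's default settings, an ordered factor used as a predictor is expanded into orthogonal polynomial contrasts (linear, quadratic, cubic, ...) rather than dummy variables. The data below (`dose`, `outcome`) are invented for the sketch.

```r
# Hypothetical data: an ordinal predictor with four levels and a binary outcome.
dose <- ordered(rep(c("none", "low", "medium", "high"), each = 5),
                levels = c("none", "low", "medium", "high"))
outcome <- c(0, 0, 0, 0, 1,
             0, 0, 1, 0, 1,
             0, 1, 1, 0, 1,
             1, 1, 1, 0, 1)

# Ordered factors get polynomial contrasts (contr.poly) by default;
# unordered factors get treatment (dummy) contrasts.
contrasts(dose)

fit <- glm(outcome ~ dose, family = binomial)

# Coefficient names carry .L, .Q, .C suffixes for the linear,
# quadratic and cubic trend terms across the ordered levels.
names(coef(fit))
```

The .L term is the one that answers "does the response trend up or down with the level?", so the ordering does change how the model is parameterised, even though the fitted values match those from an unordered factor.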
Now that I know that you can have ordered factors in R, that idea makes good sense, thanks again for your help Larry.

Bootstrapping and cross-validation techniques can be used to "prove" whether or not the algorithm is sensitive to normality. If your model doesn't hold up to these tests, it doesn't matter whether or not the assumptions are met. But, generally, DM algorithms are more forgiving of violations of the normality assumption, and you can gain a lot by just looking into subsets of samples.

 

-Ralph Winters
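A bare-bones version of the bootstrapping check suggested above might look like this in base R. Everything here is an assumption made for the sketch: the simulated data, the choice of lm() as the model, and the slope as the quantity being monitored. In practice you would substitute your own model and a prediction-error measure.

```r
set.seed(1)  # make the resampling reproducible

# Hypothetical data: an ordinal predictor coded 1..4 and a noisy response.
n <- 200
level <- sample(1:4, n, replace = TRUE)
dat <- data.frame(level = level, y = 2 * level + rnorm(n))

# Refit the model on B bootstrap resamples and keep the slope each time.
B <- 500
slopes <- replicate(B, {
  idx <- sample(n, replace = TRUE)   # draw n row indices with replacement
  coef(lm(y ~ level, data = dat[idx, ]))[["level"]]
})

# A wide or erratic spread across resamples suggests the fit is unstable.
quantile(slopes, c(0.025, 0.975))
```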

That's a good idea to use bootstrapping... But is proving normality all that needs to be considered? If an algorithm isn't affected by violations of normality, can it still be affected by other problems due to non-linearity in the ordinal predictors?

 

The answer to that question's probably implicit in the mathematics behind the algorithms, but my maths skills need some brushing up!

You need at least a basic understanding of the algorithms before you use them. E.g., some decision tree splitting algorithms will use a chi-square test to determine if a node is to be split. Chi-square is a non-parametric test, and does not assume that the data follow a particular underlying distribution such as the normal.

 

-Ralph Winters
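One point worth adding for ordinal data: the Pearson chi-square test mentioned above treats the categories as nominal, so it gives the same statistic however the levels are ordered. A small sketch with invented counts (whether a particular tree implementation separately uses the ordering when generating candidate splits depends on that implementation):

```r
# Hypothetical counts: rows are an ordinal rating, columns a binary class.
counts <- matrix(c(30, 10,
                   20, 20,
                   10, 30),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(rating = c("low", "medium", "high"),
                                 class  = c("no", "yes")))

in_order <- chisq.test(counts)
shuffled <- chisq.test(counts[c("high", "low", "medium"), ])

# The statistic is unchanged when the rows are permuted,
# i.e. the test does not use the ordering of the levels.
all.equal(unname(in_order$statistic), unname(shuffled$statistic))
```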

I do have a basic understanding of the assumptions behind most of the algorithms. But there are always different implementations of them, especially in R. Often (especially when building several different types of model) I don't have an incredibly in-depth knowledge of every implementation of every model that I'm throwing at a problem.

 

My comment was actually referring more to whether a lack of a normality assumption usually meant a lack of a linearity assumption... which it usually does, but not always.
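A toy example of the linearity issue with ordinal predictors, using base R and made-up data. Coding the levels as integers 1, 2, 3, 4 assumes they are equally spaced and act along a straight line; keeping them as a factor drops that assumption. Here the response only jumps at the top level, so the integer coding fits badly.

```r
# Hypothetical ordinal predictor whose effect appears only at the top level.
rating <- ordered(rep(c("low", "medium", "high", "very high"), each = 3),
                  levels = c("low", "medium", "high", "very high"))
y <- rep(c(0, 0, 0, 10), each = 3)

# Integer coding: equally spaced levels, straight-line effect.
fit_linear <- lm(y ~ as.integer(rating))

# Factor coding: one free parameter per level, no linearity assumed.
fit_factor <- lm(y ~ rating)

# Residual sum of squares for each fit; smaller means a closer fit.
rss <- function(fit) sum(resid(fit)^2)
c(linear = rss(fit_linear), factor = rss(fit_factor))
```

Neither model here says anything about normality, so the example separates the two concerns: the integer-coded fit is poor purely because of the linearity assumption.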

© 2019 AnalyticBridge.com is a subsidiary and dedicated channel of Data Science Central LLC