I have a set of independent variables, both categorical and continuous. There is also the predictor variable, which has certain classes, say C1 to Cn. The aim is to predict category membership!

I'm facing two issues. First, any discriminant procedure requires only continuous variables for prediction. Second, logistic regression, which could be used instead, produces probability values of category membership; it does not convey the inter-class variance through distance measures the way a Canonical Discriminant Analysis does with the %plotit macro.

Hence, I've got two questions.
1. If I've got mixed variables, both continuous and categorical, can I still predict category membership in the predictor variable? If yes, how?
2. If the answer to the above is to use logistic regression or GENMOD/CATMOD, can I still obtain a plot of the observations, grouped by category, on a distance-measure plot, so that I can see the between-category variance/distance and understand visually how the categories are arranged?

Also, I'm not able to plot using %plotit because of the large number of observations I've got (1.5 million). Do I need to downsample to bring the data to a manageable size? Or can I plot a contour to get an idea of the area coverage?
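
If downsampling is the answer, I assume something like this PROC SURVEYSELECT step would do it (dataset name and sample size below are made up, just a sketch):

/* Draw a simple random sample so %plotit has a manageable number of points */
proc surveyselect data=full_data out=plot_sample
                  method=srs      /* simple random sampling */
                  sampsize=50000  /* hypothetical target size */
                  seed=12345;     /* fixed seed for reproducibility */
run;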

Replies to This Discussion

You can perform this classification using C&RT in STATISTICA Data Miner. It allows you to use both continuous and categorical predictor variables for classification.
Thanks for your reply Atul.

I was actually looking to do a discriminant analysis, not just classification. A discriminant analysis presents the classes through visual aids that help you understand the extent of separation. Some examples I found were Canonical Discriminant Analysis in SAS, etc. But all of it accepts only numerical arguments!

I also think clustering with mixed variable types is a big concern. C&RT would be a solution where it is available, but I use SAS, so that doesn't help me much.
I hit upon a few other methodologies:
1. Use of the AUTOCLASS algorithm
2. Use of Jaccard's coefficient to determine the distance between classes (see the sketch after this list)
3. Kohonen clustering
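
For option 2, I believe SAS's PROC DISTANCE can produce Jaccard dissimilarities directly from asymmetric binary flags; a rough sketch with made-up variable names:

/* Jaccard dissimilarity (DJACCARD) between observations, based on */
/* asymmetric 0/1 flags where 1 means the attribute is present.    */
proc distance data=segments out=jacc_dist method=djaccard;
   var anominal(flag_web flag_store flag_promo);  /* hypothetical yes/no flags */
   id cust_id;                                    /* hypothetical row identifier */
run;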

Some of these look good, but they are a little too complicated in places. I would appreciate it if someone with experience in any of these could come forward and help me execute them.

Many Thanks.
I am a little confused: do you already have a predictor variable or not?

A short remark:
I did not know that Kohonen clustering is directly applicable to discrete data. I guess it works for the same reason as in the case of logistic regression: transform a discrete variable with k values into k binary variables.
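
To make that concrete, this transformation is plain dummy (one-hot) coding. A minimal SAS sketch, assuming a discrete variable REGION with three known values 'N', 'S' and 'W' (all names hypothetical):

/* Expand REGION into three 0/1 indicator variables; in SAS a */
/* logical comparison evaluates to 1 (true) or 0 (false).     */
data coded;
   set raw;
   region_n = (region = 'N');
   region_s = (region = 'S');
   region_w = (region = 'W');
run;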

But beware: if you perform this transformation you lose the original context of the variable, which a) makes interpretation hard and b) requires some contemplation about the metric to use (if you want to cluster) to control the influence of the discrete variables on the overall distance.

If this does not frighten you, I recommend http://databionic-esom.sourceforge.net/ as a start. Among other features, it allows the visualization of classes (if you already have them).

But if you do not have any classes yet, I would try Autoclass ... I am ashamed to admit this, but I cannot help you out with this one.

kind regards,

Steffen
Arun

There may be some confusion over terminology here.

First, for the sake of clarity, I should mention that I will use the term "predictor" to indicate those (continuous or categorical) variables that are being used to make predictions via a model, not the one that contains the original known classification. I would usually refer to the latter as the "target", "response" or "output" variable.

Second, I would normally think of discriminant and classification methods as doing the same thing - generating a model that will allow you to predict group/class membership from a set of training data with a known grouping variable. In machine learning this is usually referred to as supervised learning. Methods include Discriminant Analysis, C&RT, CHAID, MLP, RBF, SVM and many others.

Kohonen is a clustering method, which starts with no known classification and forms clusters of cases or variables based on their inherent similarity, in the same way as classical k-means cluster analysis. This is referred to as unsupervised learning.

It is true that Fisher's original Discriminant Analysis only included continuous predictor variables, but there is a generalisation of this method that allows you to include both continuous and categorical predictors and gives the same kind of output (probabilities of group membership, etc.). It works by generating a set of dummy variables for each categorical predictor, as for General Linear Models. You could do this by hand before putting these variables into a classical discriminant analysis (or logistic regression, or whatever method you choose), but it is much better if the software handles this for you automatically.
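
Since you are working in SAS: one way to do the by-hand route is to feed data-step dummies into PROC CANDISC along with the continuous predictors, or to let PROC LOGISTIC build the design matrix itself through its CLASS statement. Both sketches below use hypothetical dataset and variable names:

/* Canonical discriminant analysis on continuous predictors plus  */
/* hand-made dummies (drop one dummy per categorical variable to  */
/* avoid a redundant column).                                     */
proc candisc data=coded out=canscores ncan=2;
   class segment;                       /* known grouping variable */
   var spend tenure region_n region_s;
run;

/* Multinomial logistic regression with automatic dummy coding */
proc logistic data=raw;
   class region decile_band / param=ref;
   model segment = spend tenure region decile_band / link=glogit;
run;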

The chief limitation of classical methods such as discriminant analysis and logistic regression is their reliance on linear equations, which can cause them to fail in some cases, for example when the classes are not contiguous or cannot be separated by linear planes/hyperplanes. Tree-based methods (e.g. C&RT) and neural networks (e.g. MLP, RBF) are much better at handling this type of problem.
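
For completeness, newer SAS/STAT releases include a dedicated tree procedure as well; a hedged sketch (PROC HPSPLIT may not be available in older installations, and the variable names are hypothetical):

/* Classification tree: accepts mixed predictors directly and */
/* can carve out non-linear, non-contiguous class regions.    */
proc hpsplit data=raw;
   class segment region decile_band;   /* target and categorical inputs */
   model segment = spend tenure region decile_band;
run;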

Needless to say, STATISTICA includes all of these methods, and can handle both categorical and continuous predictors for all model types.

Matt
That's a lot of input! Thanks to both Steffen & Matt. I'll address some of your questions while also describing in detail how I wish to proceed.

To Steffen's comment:
I guess that's how any kind of linear-equation modelling will work if it is given categorical variables. Matt is right in saying that if linear equations won't work, it's best to use trees or neural networks. But the question is: how do we know that the linear equations aren't working well? Is there any convergence criterion that can tell me this?
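
The only proxy I can think of myself is the cross-validated misclassification rate of the linear model: if it stays high while a tree or neural net does noticeably better, the linear boundaries are probably inadequate. Something like PROC DISCRIM's leave-one-out option, I suppose (variable names made up):

/* Leave-one-out (cross-validated) error-rate estimates for a  */
/* linear discriminant; compare this error rate with a tree or */
/* neural-net fit on the same data.                            */
proc discrim data=coded crossvalidate;
   class segment;
   var spend tenure region_n region_s;
run;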
Also, after Matt's clarification: I have about 10-20 predictor variables and one 'target' variable (a class assigned by business segmentation for marketing). Among the predictor variables, 50% are categorical, such as decile ranking of spend and yes/no flags, while the rest are quantitative. There are two things I want to do.
1. Supervised learning - to understand the class membership and inter-class variance (I've got about 12 classes). This will let me judge whether the current classes are good by themselves and whether I only need to enhance the clusters with additional dimensions to increase inter-class variance while decreasing within-class variance.
2. Unsupervised learning - to unearth the underlying relationships within the data, understand the optimum number of clusters, and finally arrive at a different set of clusters based on the variables.

The assumption is that the current segmentation has been followed owing to historical practice, and after a few initial analyses I found it not very good. In fact, it's based purely on business criteria. So the problem at hand is to find a more optimal segmentation/clustering.

Thanks a lot for the link, I'll take a look at that and get back. Also, I'm looking into AUTOCLASS docs, but if anyone else can help me with it, it would be great.

To Matt's Comment:
First, I was unaware that Kohonen is similar to k-means. The Kohonen method I was referring to uses Self-Organising Maps, which are neural networks, so it's not linear-discriminating in nature; self-adjusting weights come into play with each iteration. But from what you just said, does the self-adjustment take place with reference to the 'means' of each cluster? If that's the case, I'll need to think about it. Thanks for letting me know, since I think I missed that part.
Since 50% of my variables are categorical, I'm of the opinion that a 'means'-based method will not work. Please let me know if I'm wrong in assuming so.

I agree with what you elaborated on supervised learning. I just didn't know CHAID and C&RT were also part of it, since I've used them for unsupervised work before, not for class membership! Also, I'm using SAS, and Enterprise Miner has most of the things we're discussing here, except AUTOCLASS.

Once again, thanks for all your help. I'll keep you all updated on my progress if I come across anything. If you have more to tell me, I'm more than willing to listen! :)

Thanks.
I do not have much time at the moment, but I cannot hold back from remarking on this:

Kohonen is equivalent to k-means if a) k equals the number of neurons used in the map and b) k-means is implemented so that the center is adjusted incrementally instead of simply calculating the mean value.

So... if you use SOM/ESOM with a really large number of neurons (e.g. 80*50, depending on the size of your input), you can reveal more structure than k-means is able to.

Unfortunately, k-means is not a standardized algorithm but more like a framework, with the option to implement each of the intermediate steps in a different way. There is no such thing as THE k-means. ;)
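
As an aside, since you use SAS: if I remember correctly, PROC FASTCLUS has a DRIFT option that implements exactly this incremental variant, where each assigned observation pulls its cluster seed toward itself. A hypothetical sketch:

/* k-means with drifting centers: seeds are updated as each */
/* observation is assigned, not only after a complete pass. */
proc fastclus data=coded maxclusters=6 drift out=clusters;
   var spend tenure region_n region_s;   /* numeric inputs only */
run;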

kind regards,

Steffen

PS: Yes, please keep us updated! Your problem is indeed a general one and hence of general interest ;).
Arun, we use n-d methods, including visualisation, which have been likened to a high-speed form of discriminant analysis. We have no problem handling a mix of continuous and categorical variables. It's a little-known method, but if you would like to investigate further, please contact me offline: Robin_Brooks AT www.curvaceous.com
UPDATES:

I used the unsupervised Kohonen method to arrive at a couple of solutions (not robust or checked for optimality). The first time I used 10 clusters; the second time, 6 clusters. After this, I used the mean statistics of each cluster to compute a distance measure and then located the clusters on an MDS map.
Also, just after the Kohonen step, I ran CANDISC to obtain the canonical form of the model, so I could make a 2D scatter plot of the segments.
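
For reference, the pipeline was roughly the following (dataset and variable names changed, so treat this as a sketch rather than verified code):

/* 1. Euclidean distances between the cluster mean profiles */
proc distance data=cluster_means out=dmat method=euclid;
   var interval(spend_mean tenure_mean flag_rate);
   id cluster_id;
run;

/* 2. Metric MDS on the resulting distance matrix */
proc mds data=dmat level=absolute out=mds_coords;
   id cluster_id;
run;

/* 3. Canonical scores and a 2D scatter of the segments */
proc candisc data=scored out=canscores ncan=2;
   class cluster_id;
   var spend tenure region_n region_s;
run;
proc gplot data=canscores;
   plot can2*can1=cluster_id;
run;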

1. What I need to know is: is my segmentation/clustering anywhere near optimal?
2. In the canonical plot, the scatter looks very different from a good cluster solution. Is this because the canonical form includes the effect of other variables?
How can I visually check a scatter plot of the individual observations assigned to their respective clusters?
3. From the MDS, I see that using 6 clusters gives an equidistant set of clusters, which as far as I know is the most optimal solution (equidistant meaning they have maximum inter-cluster variability). Here again, was it the right approach to use the mean statistics to derive a distance measure and plot the MDS to depict the inter-cluster variability visually?
4. Lastly, I wanted to know whether I can arrive at an 'optimum number of cluster nodes' for a set that includes categorical variables. I know the knee curve works for numerical variables; how about for this data? (See the sketch after this list.)
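
For numeric data, what I have in mind is a loop like the one below (made-up names, just a sketch), taking the fit statistics from each OUTSTAT dataset and looking for the knee when they are plotted against k; my question is what the analogue would be when half the inputs are categorical.

/* Run k-means for k = 2..12 and keep the fit statistics of */
/* each solution; the knee of the criterion-vs-k curve hints */
/* at the number of clusters.                                */
%macro elbow(maxk=12);
  %do k = 2 %to &maxk;
    proc fastclus data=coded maxclusters=&k outstat=stat&k noprint;
       var spend tenure region_n region_s;
    run;
  %end;
%mend;
%elbow(maxk=12)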

Thanks.
[Figure: Scatter plot with 10 clusters, using canonical structures]
[Figure: Scatter plot with 6 clusters, using canonical structures]
[Figure: MDS plot of distances between the 10 clusters, computed with PROC DISTANCE]
[Figure: MDS plot of distances between the 6 clusters, computed with PROC DISTANCE]
