A Data Science Central Community
I recently developed a cross sell application that took product purchase history and flagged the record with a 1 for purchased, and 0 if not, within orders.
I then used Multinomial Logistic Regression to assign new orders to the cluster.
I found that at a certain point, the classification did not work with too many clusters, but was flawless with fewer clusters @ 100% correct classification.
The goal was to produce cross sell tables and rates according to the cluster membership as opposed to one single table.
This may seem like a lot of effort but then I put the whole thing into a spreadsheet application that call center reps could use while talking with customers who were buying.
The reason I did this, was because they had not solutions for this and none are in sight.
Anyway, what could I have done better in this situation?
If you could tell me what a cross sell application does, I could share thoughts on it using theory of Multinomial Logit or other decision based categorical models. At first sight kindly ensure u r not violating the IIA assumption : that a phenyl bottle with a green label can be bought regardless of if another bottle with a red label has in bought. Another problem is this I find or re-reading. Log-log models(multi-nomial) logit are found inconsistent while dealing with data residuals which different distributions. In your case, the choice if toothpastes, vaccines and laptop are all included, and you are trying to find out basically different residual patterns even robust adjustment cant solve much. Try to follow methods that give a set of shapes or slopes, and then try to predict a class of products matching the spending propensity. I know this is too vague, and I do not know what cross selling is But, this is certain, if your results are significantly different from ols eg wrong signs, then the Model(logistic) is WRONG. mail me @[email protected] Your work seems interesting to me. The reason I am taking too much of Interest in all this is ven I am new to this field and need to pick up skills fast.
I wonder how you fitted the Multinomial Logistic Regression to Dichotomous dependent variable? Anyway I would suggest fitting Simple Logist Models for Binary dependent variables. Multinomial Logistic models are for dependent variables with more than two categories in response. Is this the case with your purchase orders? I do not see such from your discussion. Anyway you would find much details in the Online Course on Logistic Regression Models on http://elearning.aneconomist.com.
Okay, Glade to discuss and learn Cluster Analysis. But I ought to suggest you can even fit a Logistic Regression Models where the Response Variable is Binary and across cluster. You can fit the Binary Logistic Regression Models and accommodate the Cluster by setting the VCE for Cluster. Do you think the case is not what I assume. I am aware of Stata's routine of logit where VCE can be set to cluster by an option after the main command line, vce(cluster clustvarid)
Thank you. Muhammad, I will follow up on that. Regards, Sean.
Could you please point some elarning resources on cluster analysis.
Hi Sean. My concern here is that you have over-fitted the data or have perfect separation. 100% correct classification is a big red flag that your model has one of these problems. Are you getting very large coefficients in your model which is also a sign of this. Even if not I would be worried.
It means that the data used to build the model will be fit perfectly by the model but that new cases might be predicted very badly. For any model with say 50 cases and 50 predictors you will get 100% fit. Separation occurs when a predictor (or linear combination of many predictors) always yields one category for the dependent variable as a result of sparsity in the data. See http://en.wikipedia.org/wiki/Separation_%28statistics%29.
You really ought to crossvalidate your model. You can either hold some sample back (for testing) or use a more sophisticated method like Correlated Component Regression designed for high dimensional logistic problems (number of predictors approaching or exceeding the number of cases) which does the crossvalidation as part of the model selection process. See http://www.logitresearch.com/Introduction_to_CCR
Gary, I forgot to mention in my reply that your approach seems very exciting. It is as if the logitresearch algorithm was tailor made for this specific type of application. The interface is pretty slick too. Thanks again for your help. That did achieve one objective of the post! The remaining objective is for someone to say use another completely different method altogether.
Bayesian Network should be a good alternative.