A Data Science Central Community
Some colleagues of mine are working with survey responses, and are attempting to predict behaviors with demographic data. So, the plan is to define a dependent variable from some combination of responses to the survey questions, and then use a regression technique to model this dependent variable using other characteristics of the respondents. We all agree on the 5 or so questions that will define the dependent variable, but we disagree on how to specify the definition.
I want to look at the actual questions being answered, and create a "score" as a weighted count of the 'yeses' to the questions (weights based on how "on point" each question is to the behavior we are trying to define). My colleagues thought that this was too imprecise, and particularly criticised the 'intuitive' weight assignment.
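For concreteness, the weighted-score idea can be sketched in a few lines of numpy. The answers and the weights below are entirely hypothetical — in practice the weights would be the judgment-based "on point" ratings being debated:

```python
import numpy as np

# Hypothetical responses: rows = respondents, columns = the 5 questions (1 = yes, 0 = no)
answers = np.array([
    [1, 1, 0, 1, 0],
    [0, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
])

# Illustrative "on point" weights assigned by judgment, not estimated from data
weights = np.array([0.30, 0.25, 0.20, 0.15, 0.10])

# Weighted count of yeses per respondent
scores = answers @ weights
print(scores)
```

The score is continuous, so it can be used directly as the dependent variable in a regression.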
My colleagues want to apply a clustering algorithm (probably K-means) to define clusters based on the 5 questions, and then use these cluster assignments as the dependent variable in a subsequent regression. I think this is the most ridiculous approach that I have ever heard of, but I can't find any material to back me up.
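To make the colleagues' proposal concrete, here is a bare-bones Lloyd's-algorithm K-means run on hypothetical binary answers (in practice one would use a library implementation such as scikit-learn's `KMeans`; this sketch just shows what "cluster assignments as the dependent variable" means):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary answers to the 5 questions (1 = yes, 0 = no)
answers = rng.integers(0, 2, size=(200, 5)).astype(float)

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance)
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(answers, k=3)
# These categorical labels would then serve as the dependent variable
print(np.bincount(labels))
```

Note that the resulting dependent variable is categorical, so the follow-up model would be a multinomial (or binary, after picking a cluster) regression rather than an ordinary one.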
What do you guys think of these approaches? Am I crazy, are they, or are we all crazy?
I would have chosen factor analysis instead of cluster analysis. With cluster analysis your variables can all still end up in the same cluster and remain correlated with each other; factor analysis will remove the correlation. However, the reality is that (k-means) cluster analysis is not based upon hypothesis testing (it is based upon geometry), so it is essentially exploratory analysis and there is no one "right" or precise answer. This is also true of any kind of latent class analysis. There are a couple of things I would suggest:
- Compare your binned scores to the clusters and see which has better separation and minimum variance.
- Set up some out-of-sample test groups, and see which model is more robust.
- See which segment names are the most meaningful, and which make the most sense.
You have suggested two different approaches, and it makes sense to use them in tandem to reach consensus.
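One way to compare "separation and minimum variance" for the two groupings is to compute the average within-group variance under each: the grouping whose groups are tighter around their centroids wins on that criterion. A minimal sketch on hypothetical data (the score bins and the stand-in cluster labels below are both illustrative; in practice the labels would come from the actual K-means run):

```python
import numpy as np

def within_group_variance(X, labels):
    """Average squared distance of each point to its group centroid
    (lower = tighter, better-separated groups)."""
    total = 0.0
    for g in np.unique(labels):
        members = X[labels == g]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total / len(X)

rng = np.random.default_rng(1)
answers = rng.integers(0, 2, size=(100, 5)).astype(float)

# Grouping 1: bins of a simple (unweighted, for illustration) yes-count score
score_bins = np.digitize(answers.sum(axis=1), bins=[2, 4])  # low / mid / high

# Grouping 2: stand-in cluster labels; in practice these come from K-means
cluster_labels = rng.integers(0, 3, size=100)

print(within_group_variance(answers, score_bins))
print(within_group_variance(answers, cluster_labels))
```

The same function can be applied to held-out respondents for the out-of-sample comparison suggested above.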
Thanks for your response.
My concern with using any of these methods stems from the fact that I am presently just trying to define people with a propensity to have a certain behavior. All of these questions address aspects of that behavior to some extent. Clustering and principal components (PC) both tend to exclude certain answers. Clustering might say that people who answered yes to questions 1 and 2 are in a cluster (because there are a lot of them), but people who ALSO answered yes to questions 3 and 4 are in another cluster (because they are 'way over there' in the answer space). We will end up picking a cluster (or group of clusters) to model based on the pattern of yeses in the clusters.
PC would come closer to including all of the responses, and would give us a 'weighted' sum of answers that means something statistically. I just don't think it means what we want it to mean. I want some measure of how likely a respondent is to exhibit the (latent) behavior addressed by all of these questions, so I can try to model that likelihood. PC will only give me linear combinations of the answers that address the correlations within the answers. Or am I missing something?
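To illustrate the point about PC: the first principal component is exactly a data-derived weighted sum of the (centered) answers, with weights chosen to capture maximum variance, not propensity toward the behavior. A minimal sketch via SVD on hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical binary answers to the 5 questions
answers = rng.integers(0, 2, size=(150, 5)).astype(float)

# Center the answers, then take the top right-singular vector as the PC1 loadings
centered = answers - answers.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
loadings = vt[0]           # one weight per question, unit length
pc1 = centered @ loadings  # the "statistically weighted" sum of answers

print(loadings)
```

The loadings are driven entirely by the correlation structure of the answers, which is the concern raised above: nothing forces them to line up with how "on point" each question is.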
Do you know what the "certain behavior" is? Then set it up as a regression problem. If not, I would look into performing a root cause analysis to see what the true drivers are that are leading to the latent response. Sometimes, that is more than what statistics can offer.
Perhaps you should calculate Cronbach's alpha to check whether the 5 questions do measure the same construct. If not, try to find out which questions do measure the construct you're interested in. If you have determined the correct questions, you can combine them in a simple way (e.g. take the mean).
Another way would be to set up a structural equation model with your latent construct and the questions as its indicators. One advantage of this type of analysis is that it's theory-driven.