A Data Science Central Community
I have this question. I have a dataset with unique IDs (people).
Each one has some attributes. I want to classify them into good and bad customers.
Since I do not have a training set (i.e. a score of 0 or 1 for at least some IDs), how can I classify them into 2 groups?
I understand that regression (logistic, for example) cannot be used since I do not have a dependent variable.
One solution could be clustering, for example with only 2 clusters (if I am lucky and they are well differentiated)?
This is what is often called unsupervised learning, or more specifically unsupervised classification. You are right that many types of procedure, such as logistic regression, neural networks, decision trees and so on, require an outcome label (a dependent variable) in order to work; i.e. they are supervised learners.
When you say "good" and "bad" I presume that translates into something like good / bad payers on a loan, those that did/did not respond to a direct marketing communication or those that did or did not make an insurance claim or some similar problem.
To be clear, for these types of problem you are not going to be able to classify them as accurately as if you had a dependent variable.
However, you are also right to suggest that some form of clustering is probably the best option to follow and may still provide a good solution. This is because all this type of clustering does is group cases together on the basis of them having similar attributes, regardless of whether they are "good" or "bad." The assumption is then that those with similar attributes (in the same cluster) will behave in a similar way.
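The grouping described above can be sketched with k-means, one common clustering method. The reply does not name a specific algorithm, so that choice is an assumption, and the two features and eight customers below are made up for illustration. In practice a library implementation would normally be used instead of hand-written code.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids picked from the data
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignments

# Hypothetical customers: (share of bills paid late, credit utilisation), both in [0, 1].
customers = [
    (0.05, 0.20), (0.10, 0.25), (0.00, 0.15), (0.08, 0.30),
    (0.70, 0.90), (0.65, 0.85), (0.80, 0.95), (0.75, 0.80),
]
centroids, assignments = kmeans(customers, k=2)
print(assignments)  # one cluster id per customer; the ids carry no good/bad meaning
```

Note that the output is only a cluster id per customer. Nothing in it says which cluster is "good" and which is "bad".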
For the above types of problem what you will find is that certain clusters are more pure than others. However, even if your data is very good, your clusters probably won't be entirely pure. For example, in the best cluster maybe 90% are good and 10% are bad, and in the worst cluster 90% are bad and 10% are good. How pure your clusters are will depend very much on the type of problem and the quality of your data.
Two clusters might seem like an obvious choice, but you may well get better results with several clusters, and some trial and error will help establish how many clusters are best for your problem.
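One way to run that trial and error is to compare the within-cluster sum of squares for several cluster counts and look for the point where adding clusters stops helping much. This is a rough sketch using plain k-means with a deterministic farthest-point start; the data, the function names, and the choice of k-means are all assumptions for illustration.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def farthest_point_init(points, k):
    """Start from the first point, then repeatedly add the point farthest from all chosen centroids."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    return centroids

def kmeans_wcss(points, k, iterations=10):
    """Run k-means and return the within-cluster sum of squares."""
    centroids = farthest_point_init(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[c]
            for c, g in enumerate(groups)
        ]
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical two-attribute data with three visible groups.
points = [
    (1, 1), (1, 2), (2, 1),
    (8, 8), (8, 9), (9, 8),
    (1, 8), (2, 9), (1, 9),
]
wcss = {k: kmeans_wcss(points, k) for k in range(1, 5)}
for k, value in wcss.items():
    print(k, round(value, 2))
```

On this toy data the total drops sharply up to three clusters and only slightly after that, which would suggest three groups. Real customer data rarely gives such a clean answer.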
The big problem that I expect you will then encounter is how to label the resulting clusters. For example, if the process generates two clusters, how do you determine which one contains mostly goods and which one mostly bads?
If you have even a small number of examples with a class label (good or bad), that could be invaluable, because you can then examine the proportion of goods and bads in each cluster and assume that it is representative of the cluster as a whole (the assumption here is that the class labels you do have are not biased in some way). Likewise, a small sample of goods and bads would enable you to perform something called K-nearest neighbor classification, which generally gives good results.
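As a minimal sketch of the first idea, the proportion of goods in each cluster can be estimated from the few labelled cases and then applied to everyone in that cluster. The customer IDs, cluster assignments, labels, and the 0.5 cut-off below are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical cluster assignment for every customer ID (output of any clustering step).
cluster_of = {
    "c01": 0, "c02": 0, "c03": 0, "c04": 0, "c09": 0,
    "c05": 1, "c06": 1, "c07": 1, "c08": 1, "c10": 1,
}
# Hypothetical small sample of customers whose outcome is actually known.
known_labels = {
    "c01": "good", "c02": "good", "c03": "bad", "c09": "good",
    "c05": "bad", "c06": "bad", "c07": "good",
}

# Count goods and bads among the labelled cases in each cluster.
counts = defaultdict(Counter)
for cid, label in known_labels.items():
    counts[cluster_of[cid]][label] += 1

# Estimated share of goods per cluster, based only on the labelled sample.
good_rate = {cluster: c["good"] / sum(c.values()) for cluster, c in counts.items()}

# Give every customer the majority label of their cluster (assumed cut-off: 0.5).
predicted = {
    cid: ("good" if good_rate[cluster] >= 0.5 else "bad")
    for cid, cluster in cluster_of.items()
}
print(good_rate)
print(predicted["c04"], predicted["c10"])
```

With so few labelled cases the estimated rates are very rough, and they are only trustworthy if the labelled sample is unbiased, as noted above.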
However, if you have no class labels at all, then an alternative is to apply expert opinion to examine the average properties of the clusters. It is maybe best to illustrate this with an example:
Let's say that we are talking about a credit granting problem. In cluster 1, 50% of individuals are bankrupt and 70% have arrears on their credit cards. In cluster 2, only 3% of individuals are bankrupt and only 2% have arrears on their credit cards. The expert view is that cluster 1 is "Bad" and cluster 2 is "Good", because I know that people who have experienced problems with credit in the past tend to be "Bad" borrowers again in the future.
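Producing the cluster averages that the expert looks at is simple to sketch. The records and field names below are hypothetical, and the final labelling rule (the cluster with the most credit problems is "Bad") is only a stand-in for the expert's judgement, not a substitute for it.

```python
# Hypothetical customer records after clustering.
records = [
    {"cluster": 1, "bankrupt": True,  "arrears": True},
    {"cluster": 1, "bankrupt": True,  "arrears": False},
    {"cluster": 1, "bankrupt": False, "arrears": True},
    {"cluster": 1, "bankrupt": False, "arrears": True},
    {"cluster": 2, "bankrupt": False, "arrears": False},
    {"cluster": 2, "bankrupt": False, "arrears": False},
    {"cluster": 2, "bankrupt": False, "arrears": True},
    {"cluster": 2, "bankrupt": False, "arrears": False},
]

def cluster_profile(records, cluster):
    """Share of customers in the cluster with each credit problem."""
    rows = [r for r in records if r["cluster"] == cluster]
    return {
        "bankrupt_rate": sum(r["bankrupt"] for r in rows) / len(rows),
        "arrears_rate": sum(r["arrears"] for r in rows) / len(rows),
    }

profiles = {c: cluster_profile(records, c) for c in (1, 2)}

# Stand-in for expert opinion: the cluster with the most past credit problems is "Bad".
worst = max(profiles, key=lambda c: profiles[c]["bankrupt_rate"] + profiles[c]["arrears_rate"])
labels = {c: ("Bad" if c == worst else "Good") for c in profiles}

for c in profiles:
    print(c, profiles[c], labels[c])
```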
Hope this is of some use.
Clustering will work only if you are lucky and the features are able to separate the two classes; this is not guaranteed. You might have strong features that heavily influence the distance measure you are using but are irrelevant to the good/bad distinction, and they will split your data into two groups that you don't want.
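A tiny illustration of that risk, with made-up numbers: the account number below has nothing to do with credit behaviour, but because its values are large it dominates a Euclidean distance. The customers, features, and helper name are hypothetical.

```python
import math

# Hypothetical customers: (share of payments in arrears, account number).
customers = {
    "A": (0.90, 1000.0),
    "B": (0.10, 1010.0),
    "C": (0.85, 3000.0),
}

def nearest(name, data, dims):
    """Closest other customer to `name`, using only the feature indices in `dims`."""
    def dist(other):
        return math.sqrt(sum((data[name][d] - data[other][d]) ** 2 for d in dims))
    return min((o for o in data if o != name), key=dist)

with_irrelevant = nearest("A", customers, dims=(0, 1))  # both features
relevant_only = nearest("A", customers, dims=(0,))      # arrears share only
print(with_irrelevant, relevant_only)
```

With both features, A is matched to B even though their arrears behaviour is very different; using the arrears share alone, A is matched to C. A clustering run on the raw features would tend to group customers by account number.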
In addition, verifying your results will require that you examine random instances of the two classes to check that they have been assigned correctly. This might take a long time, maybe as long as it would take to create a set of labeled instances of your data. So you might be better off labeling a subset of your data and using a random tree to classify the rest. If you have a huge data set, look at bootstrapping your training data: using a small training subset to create larger training data sets.
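The "label a subset, then classify the rest" workflow can be sketched as follows. To keep the example self-contained it uses a simple k-nearest-neighbour vote in place of the tree classifier mentioned above; that swap, the features, and the data are all assumptions for illustration.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority label among the k labelled examples closest to `query`."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Hypothetical hand-labelled subset: (share of bills paid late, credit utilisation) -> label.
labelled = [
    ((0.05, 0.20), "good"), ((0.10, 0.25), "good"), ((0.00, 0.15), "good"),
    ((0.70, 0.90), "bad"), ((0.65, 0.85), "bad"), ((0.80, 0.95), "bad"),
]
# Hypothetical customers that still need a label.
unlabelled = {"c07": (0.08, 0.30), "c08": (0.75, 0.80), "c09": (0.60, 0.70)}

predictions = {cid: knn_predict(labelled, x) for cid, x in unlabelled.items()}
print(predictions)
```

Any supervised learner could take the place of `knn_predict` here; the point is only that a modest labelled subset lets the remaining customers be scored directly instead of being inferred from cluster membership.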