A Data Science Central Community
A prospect asked us to build a predictive model. They will provide some tagged data for model development and some untagged data for testing. We are required to send back the untagged data with our decisions; the returned file will have only two columns: ID and decision (good/bad).
This is a modeling contest, and our results will be compared with those of our competitors, but the returned data required by the prospect does not make sense to me. If our model rejects 1,000 with a 20% false positive rate and another model rejects 1,500 with a 30% false positive rate, there is no way to tell which model is better. I feel the prospect should ask every competitor to return the same number of rejects (e.g. 1,000, based on the bad rate in the tagged data). To my surprise, nobody else has brought up this issue. Did I miss something? Any thoughts? Thanks.
The output they're requesting seems consistent with what I've seen for similar competitions in the past. The idea is accuracy: if your models are going to be used for real-time credit decisions, then the only thing they're going to deliver is good/bad in real life. They're just asking you for what you would be asked to deliver in the future, right?
Very likely they'll make the 2x2 correct/incorrect classification matrix for each of the entries - just as you will to validate your models on the input data. What you might want to do to build and test your models is create a large number of samples with different underlying rejection rates -- the best solution will minimize type 1 and type 2 error, regardless of the underlying quality of the input data.
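To make the 2x2 matrix concrete, here is a minimal sketch of how the client could tabulate correct/incorrect classifications from the submitted decisions. The labels and decisions below are made-up toy data, and treating "bad" as the positive class is an assumption for illustration.

```python
# Build the 2x2 classification matrix from tagged labels and
# submitted good/bad decisions ("bad" treated as the positive class).
from collections import Counter

actual    = ["bad", "good", "bad", "good", "good", "bad", "good", "good"]
predicted = ["bad", "good", "good", "good", "bad", "bad", "good", "good"]

counts = Counter(zip(actual, predicted))
tp = counts[("bad", "bad")]    # correctly rejected bads
fn = counts[("bad", "good")]   # bads we accepted (type 2 error)
fp = counts[("good", "bad")]   # goods we rejected (type 1 error)
tn = counts[("good", "good")]  # correctly accepted goods

print(tp, fp, fn, tn)
```

Each competitor's submission would yield one such matrix, which is presumably what the evaluation would be based on.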
Thanks, Lynne. I agree with you theoretically, but in practice the prospect should have a single measure to evaluate models. There is no way to minimize type 1 and type 2 error at the same time; we need to find a balance between the two, and this is why I need to know what evaluation criteria will be used.
In many situations both the false positives and the false negatives are important. There is a trade-off between increased "recall" (finding all the cases of interest) and increased "precision" (reduced false positives). So evaluating both false positive and false negatives will tell you more about the model's performance.
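The trade-off can be seen directly from the counts: precision and recall are computed from the same confusion-matrix cells, and moving the decision cutoff typically raises one at the expense of the other. The counts below are toy numbers for illustration, not real model output.

```python
# Precision/recall from confusion-matrix counts.
def precision_recall(tp, fp, fn):
    precision = tp / (tp + fp)   # of the cases we flagged, how many are real
    recall    = tp / (tp + fn)   # of the real cases, how many we found
    return precision, recall

# Strict cutoff: flag few cases, mostly correct.
print(precision_recall(tp=40, fp=10, fn=60))   # high precision, lower recall
# Loose cutoff: flag many cases, more mistakes.
print(precision_recall(tp=90, fp=60, fn=10))   # lower precision, high recall
```

This is exactly why a single good/bad column hides information: the cutoff that produced it fixes one point on this trade-off curve.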
Sounds like your prospect is uninformed about the criteria for a good predictive model. But then again, a lot of clients can be swayed by these model "competitions" (e.g. Netflix), in which the best model is judged by simple criteria, without regard to other factors such as model stability.
It's a classification model: the correct classification rate, with type 1 and type 2 error minimized, is the most important measure of model fitness. The client is actually doing it the right way.
Our job as statisticians and modelers is to deliver accuracy and stability. That's where OUR experimental design in the model building process becomes critical -- if we cannot deliver accuracy and stability, then our models won't deliver financial results.
Lynne - This is a modeling contest and the judging criteria can be whatever they want it to be. Take a look at some of the evaluation criteria used by a company such as Kaggle and you will see how this can affect the scoring.
I don't see the false positive issue you're talking about, but I do think the client can do a lot better than what they are asking for. 2x2 confusion matrices are a poor way to judge model performance. For instance, if there is a low natural positive rate, say 1%, it is very hard to get higher accuracy than by forecasting everything as negative. What they should be asking for is a model score, so they can then evaluate the model across the range of scores and do things like calculating ROC curves.
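A small sketch of why a score beats a hard decision: with a 1% positive rate, predicting "negative" for everyone is 99% accurate yet useless, while a ranked score lets the client trace the whole ROC curve. Here AUC is computed via the rank (Mann-Whitney) formula on made-up toy scores; all the numbers are illustrative assumptions.

```python
# AUC by the rank formula: the fraction of (positive, negative)
# pairs that the model's score ranks in the correct order.
def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

positives = [0.9, 0.8, 0.55]           # scores for actual bads
negatives = [0.7, 0.4, 0.3, 0.2, 0.1]  # scores for actual goods

print(auc(positives, negatives))  # 14 of 15 pairs ranked correctly
```

None of this is recoverable from a single good/bad column, which is the point of asking for scores instead of decisions.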
Edmund - I totally agree with you. If the client asked for a score instead of a decision in the returned data, I wouldn't have to worry about what evaluation criteria they are going to use.