A Data Science Central Community
In the context of credit scoring, one tries to develop a predictive model using a regression formula such as Y = Σ wi Ri, where Y is the logarithm of the fraud vs. non-fraud odds ratio. In a different but related framework, we are dealing with a logistic regression where Y is binary, e.g. Y = 1 means fraudulent transaction, Y = 0 means non-fraudulent. The variables Ri, also referred to as fraud rules, are binary flags, e.g. Ri = 1 if a given fraud condition is triggered by the transaction, Ri = 0 otherwise.
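To make the setup concrete, here is a minimal sketch of the first order model fit as a logistic regression on binary rule flags. All data, rule definitions, and weights below are hypothetical, and the fit uses plain gradient descent on the log-loss rather than any particular package:

```python
import numpy as np

# Hypothetical data: 3 binary fraud rules R1..R3 and a binary target Y.
rng = np.random.default_rng(0)
n = 1000
R = rng.integers(0, 2, size=(n, 3)).astype(float)

# Assumed ground truth for illustration: rules 1 and 2 raise the fraud
# odds, rule 3 is noise. Y is drawn from the implied logistic model.
true_w = np.array([1.5, 1.0, 0.0])
logits = R @ true_w - 1.0
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logits))).astype(float)

# First order model: fit intercept + rule weights wi by gradient
# descent on the logistic log-loss.
X = np.hstack([np.ones((n, 1)), R])   # column of 1s for the intercept
w = np.zeros(X.shape[1])
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted fraud probabilities
    w -= 0.1 * X.T @ (p - y) / n      # average log-loss gradient step

print(np.round(w, 2))                  # intercept, then the rule weights
```

The second order model would simply append the cross-product columns Ri × Rj to X before fitting.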
This is the first order model. The second order model involves cross products Ri × Rj to correct for rule interactions. The purpose of this question is to determine how best to compute the regression coefficients wi, also referred to as rule weights. The issue is that the rules substantially overlap, making the regression approach highly unstable. One approach consists of constraining the weights, forcing them to be binary (0/1) or to have the same sign as the correlation between the associated rule and the dependent variable Y. This approach is related to ridge regression. We are wondering what the best solutions and software are for this problem, given that the variables are binary.
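One way to combine the two ideas mentioned above (ridge-style shrinkage plus a sign constraint) is projected gradient descent: add an L2 penalty to the log-loss, and after each step clip to zero any weight whose sign disagrees with the marginal correlation between its rule and Y. The sketch below is one illustrative implementation of that combination, not an established standard; the overlap structure and penalty strength are made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Two heavily overlapping rules: R2 agrees with R1 about 90% of the time,
# which is the kind of redundancy that destabilizes plain regression.
R1 = rng.integers(0, 2, n)
R2 = np.where(rng.random(n) < 0.9, R1, 1 - R1)
R = np.column_stack([R1, R2]).astype(float)
y = (rng.random(n) < np.where(R1 == 1, 0.7, 0.2)).astype(float)

# Sign of the marginal correlation between each rule and Y.
signs = np.sign([np.corrcoef(R[:, j], y)[0, 1] for j in range(R.shape[1])])

lam = 0.1                                   # ridge penalty strength (assumed)
w, b = np.zeros(R.shape[1]), 0.0
for _ in range(3000):
    p = 1.0 / (1.0 + np.exp(-(R @ w + b)))
    g = R.T @ (p - y) / n + lam * w         # log-loss gradient + ridge term
    w -= 0.1 * g
    b -= 0.1 * np.mean(p - y)
    # Projection: a weight may not oppose its rule's correlation with Y.
    w = np.where(w * signs < 0, 0.0, w)

print(np.round(w, 2))
```

The fully binary (0/1) weight case does not fit this mold; it is the combinatorial problem discussed next.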
Note that when the weights are binary, this is a typical combinatorial optimization problem. When the weights are constrained to be linearly independent over the integers (for instance wi = 2^(i-1)), each value of Σ wi Ri (sometimes called the unscaled score) corresponds to one unique combination of rules. It also uniquely identifies a final node of the underlying decision tree defined by the rules.
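The uniqueness property is easy to verify: with weights wi = 2^(i-1), the unscaled score is just the binary number whose bits are the rule flags, so distinct rule combinations can never collide on the same score. A short check, using 3 rules for illustration:

```python
from itertools import product

# Weights 1, 2, 4 are linearly independent over the integers: no two
# distinct 0/1 combinations of them produce the same sum.
k = 3
weights = [2 ** i for i in range(k)]

# Map each unscaled score to the rule combination that produced it.
scores = {sum(w * r for w, r in zip(weights, flags)): flags
          for flags in product([0, 1], repeat=k)}

# All 2**k combinations survive as dictionary keys, so every score
# identifies exactly one combination (one leaf of the rule tree).
print(len(scores))        # prints 8
print(scores[5])          # prints (1, 0, 1): rules 1 and 3 fired
```

With equal weights (say all wi = 1) the same dictionary would collapse to only k + 1 keys, which is exactly the information loss the integer-independence constraint avoids.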