A Data Science Central Community
You know that you want to build a predictive model. You've framed your problem in terms of classification or regression. You've prepared some training data (which took an age). Now there's just the small matter of choosing an appropriate algorithm.
You've heard, or experienced first-hand, that Random Forests, Elastic Net Regression or Deep Belief Networks are "the business", and so you're going to use one of these. (You've probably already verified that these algorithms suit your problem based on their general capabilities: their ability to deal with real-valued data, "big" streaming data, multiple classes and so on.)
However, no two algorithms are the same (if they were, we'd simply have fewer to choose from). As such, there are a host of questions that you may not have even thought to ask, any of which could make or break your choice.
Here are three very important questions, in no particular order:
How heterogeneous is your problem?
Something I hear from problem owners fairly often is that they would like separate models for different groups of cases in their data, the motivation being along the lines that customer group A is known to behave markedly differently from customer group B, and so a one-size-fits-all model would just not make sense.
While this reflects a very good understanding of the client's own data, and an all-too-rare awareness of the limitations of old-school statistical modeling techniques, it also betrays a misunderstanding of the capabilities of modern machine learning algorithms.
The central premise of predictive modeling is precisely that one size does not fit all - otherwise we would just assign the same outcome to all cases and be done with it. The intention is that, to whatever extent customer group A is different to customer group B, our algorithms should recognize this and the resultant model should treat the two groups differently. At the same time, to whatever extent customer group A is similar to customer group B, we would still like our algorithms to identify those similarities and discover general rules. Our joint model - having benefited from the entire data - will be much stronger overall.
This is all fine, but it does mandate using algorithms which are capable of modeling such heterogeneous or discontinuous data. And to be fair not all algorithms can.
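To make the point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the toy data, the group labels "A" and "B", and the one-split "model tree" (a single split followed by a straight-line fit in each leaf), which stands in for the family of algorithms that can handle heterogeneity. It is not any particular library's implementation.

```python
# Toy data (made up): the outcome rises with x for group A and falls for group B.
rows = [("A", float(x), 2.0 * x + 1.0) for x in range(10)]
rows += [("B", float(x), 20.0 - 3.0 * x) for x in range(10)]


def fit_line(points):
    """Least-squares fit of y = a + b*x to (x, y) pairs; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0.0:  # all x identical: fall back to predicting the mean
        return mean_y, 0.0
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(rows, predict):
    """Mean squared error of predict(group, x) over all rows."""
    return sum((predict(g, x) - y) ** 2 for g, x, y in rows) / len(rows)


def candidate_splits(rows):
    """Yield (name, test) pairs; test(group, x) sends a row to the left leaf."""
    for g in sorted({g for g, _, _ in rows}):
        yield f"group == {g}", (lambda grp, x, g=g: grp == g)
    for t in sorted({x for _, x, _ in rows})[:-1]:
        yield f"x <= {t}", (lambda grp, x, t=t: x <= t)


def fit_split(rows, test):
    """One split, then a separate line in each leaf; returns a predict function."""
    left = fit_line([(x, y) for g, x, y in rows if test(g, x)])
    right = fit_line([(x, y) for g, x, y in rows if not test(g, x)])

    def predict(g, x):
        a, b = left if test(g, x) else right
        return a + b * x

    return predict


# Model 1: a single global line that cannot treat the groups differently.
a, b = fit_line([(x, y) for _, x, y in rows])
global_mse = mse(rows, lambda g, x: a + b * x)

# Model 2: one joint model that searches all candidate splits by itself.
scored = [(mse(rows, fit_split(rows, test)), name)
          for name, test in candidate_splits(rows)]
joint_mse, best_split = min(scored)

print(f"global line MSE: {global_mse:.2f}")
print(f"joint model MSE: {joint_mse:.2f} using split '{best_split}'")
```

Nobody told the second model that the groups differ. It was offered the group label as just another candidate split and picked it because that split explains the data best. On real data, where the groups also share structure, that is where a joint model could earn its keep.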
Here's a wee cheat sheet:
Some algorithms which can readily model discontinuous or highly heterogeneous data
Some algorithms which cannot (using these algorithms it might be advisable to build multiple models from different groups of data)
Is a probabilistic approach suitable?
For the purpose of this article a probabilistic model is any model whose rules derive more-or-less from the frequency of patterns or events observed in the data. Often these models will include prior probabilities to describe influences which are beyond the scope of the variables contained in the dataset: i.e. the probability of an event (be it a loan default, or a customer conversion) in the absence of any other information or "all other things being equal". In practice prior probabilities can influence or dominate a model even in the presence of considerable contextual information.
At the opposite end of the spectrum from probabilistic models are those which assume or exhibit something akin to idempotence. In this world view it doesn't matter how often an algorithm is presented with a specific set of events during training (statistical significance notwithstanding), those events will never come to dominate the model. Such algorithms are often associated with "robustness" because they give consistent performance in the face of considerable changes to the distribution (relative frequency) of the events observed in the data at production time.
This distinction has major ramifications for algorithm choice. If you can guarantee that the probabilities of various events present in your training data (e.g. customer types A and B) are a faithful reflection of those that your model will encounter in use (and therefore also that such real-world probabilities will not change significantly over time), then probabilistic models often give unmatched predictive performance. If, however, these assumptions are unreasonable (or simply cannot be checked), then the use of probabilistic algorithms may be extremely ill-advised. The worst thing is that you might not know until your model is in production!
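A small worked example shows how much weight a prior can carry. The sketch below applies Bayes' rule to a single piece of evidence. The "late payment" signal, its likelihoods and both base rates are hypothetical numbers chosen for illustration, not figures from any real portfolio.

```python
def posterior_default(p_signal_given_default, p_signal_given_repay, prior_default):
    """Bayes' rule: P(default | signal observed)."""
    numerator = p_signal_given_default * prior_default
    denominator = numerator + p_signal_given_repay * (1.0 - prior_default)
    return numerator / denominator


# Hypothetical evidence: a late payment is four times as common among defaulters.
P_SIGNAL_DEFAULT = 0.40
P_SIGNAL_REPAY = 0.10

# Same evidence, two different base rates of default.
at_training_rate = posterior_default(P_SIGNAL_DEFAULT, P_SIGNAL_REPAY, prior_default=0.05)
at_production_rate = posterior_default(P_SIGNAL_DEFAULT, P_SIGNAL_REPAY, prior_default=0.30)

print(f"P(default | signal) with a 5% base rate:  {at_training_rate:.3f}")
print(f"P(default | signal) with a 30% base rate: {at_production_rate:.3f}")
```

With the 5% base rate the posterior stays well below one half, so a model that learned that prior would wave the borrower through. With the 30% base rate the same evidence pushes the posterior above one half. A model that baked in the first prior would keep making the first decision after the world had moved to the second.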
Not every algorithm wears its probabilistic heart on its sleeve, but there are clues you can look for. In particular, algorithms that employ probabilistic reasoning will tend to attain very competitive Accuracy scores during cross-validation, but may perform less competitively on balanced measures such as F-score and average (per-class) Recall.
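The gap between Accuracy and the balanced measures is easy to reproduce by hand. In the sketch below the labels and both sets of predictions are invented for illustration: `pred_prior` always predicts the majority class (the extreme case of leaning on the prior), while `pred_balanced` catches most of the rare class at the cost of some false alarms.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_per_class_recall(y_true, y_pred):
    """Recall computed separately for each class, then averaged."""
    recalls = []
    for cls in sorted(set(y_true)):
        members = [i for i, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[i] == cls for i in members) / len(members))
    return sum(recalls) / len(recalls)


def f1(y_true, y_pred, positive):
    """F1 score for one class, written as 2TP / (2TP + FP + FN)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Invented, imbalanced ground truth: 90 repayers followed by 10 defaulters.
y_true = ["repay"] * 90 + ["default"] * 10

# Always predicts the majority class.
pred_prior = ["repay"] * 100

# Raises 15 false alarms among repayers but catches 8 of the 10 defaulters.
pred_balanced = ["repay"] * 75 + ["default"] * 15 + ["default"] * 8 + ["repay"] * 2

for name, pred in [("prior-driven", pred_prior), ("balanced", pred_balanced)]:
    print(f"{name:>12}: accuracy={accuracy(y_true, pred):.2f} "
          f"mean recall={mean_per_class_recall(y_true, pred):.2f} "
          f"F1(default)={f1(y_true, pred, 'default'):.2f}")
```

Accuracy ranks the prior-driven predictions first, while average Recall and F-score rank them last. If your candidate algorithms disagree in this way during cross-validation, it is worth asking how much of the Accuracy comes from the class frequencies alone.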
Some algorithms which embody probabilistic assumptions
Some distinctly non-probabilistic algorithms
Does your data contain high-order variable interactions?
It is fairly well known that some algorithms impose (often unrealistic) independence assumptions: e.g. that a movie recommendation may be seen as depending on the genre, and may also be seen as depending on the year of release, but that the genre can have no bearing on which years of release are considered appropriate. We say that such algorithms do not model interactions between the explanatory variables.
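The movie example can be sketched in a few lines. The ratings table below is made up, as are the genre and era names. The "additive" model is the ordinary main-effects fit for a complete two-way table (genre mean plus era mean minus overall mean), which is one simple way an algorithm can treat the two variables as independent.

```python
# Made-up viewer: loves 1950s musicals and 2010s sci-fi, dislikes the reverse.
ratings = {
    ("musical", "1950s"): 5.0,
    ("musical", "2010s"): 1.0,
    ("scifi", "1950s"): 1.0,
    ("scifi", "2010s"): 5.0,
}

genres = sorted({g for g, _ in ratings})
eras = sorted({e for _, e in ratings})

# Main effects only: each variable contributes on its own, with no interaction.
overall = sum(ratings.values()) / len(ratings)
genre_mean = {g: sum(ratings[g, e] for e in eras) / len(eras) for g in genres}
era_mean = {e: sum(ratings[g, e] for g in genres) / len(genres) for e in eras}

additive = {(g, e): genre_mean[g] + era_mean[e] - overall
            for g in genres for e in eras}

for key in sorted(ratings):
    print(key, "actual:", ratings[key], "additive prediction:", additive[key])
```

Because neither genre nor era matters on its own for this viewer, the additive model predicts the same middling score for every film and misses all of the structure, which lives entirely in the interaction.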
What is less well known is that there are various "flavours" of independence: just because an algorithm can model interactions between variables, doesn't mean it can do so very well in the context of a given problem.
For example, given figures for income and existing debt, a Decision Tree algorithm can readily identify different sets of rules regarding income for borrowers with debt below and above a certain threshold (this ability is what makes Decision Trees particularly useful for handling heterogeneous data - see above).
However, you could give the same Decision Tree as much data as you like and it will never come up with a rule which says "If income exceeds debt by more than a factor of...", no matter how strong the evidence for such a rule may be in the data and no matter how simple that rule seems to you or me. This is because each split in a standard tree tests a single variable against a threshold, so a rule about the ratio of two variables can only be approximated by a staircase of such splits. The upshot is that a Decision Tree or Random Forest might be able to build a model that describes the data, but it will often be unduly complex (a tell-tale sign here is that a tree will reference the same variable many times, even within one branch) and it will not generalize well to new cases.
And here we've only considered interactions between pairs of variables: hardly "high order".
If you suspect that your data might contain complex variable interactions (and by default you probably should - for to not suspect is to assume), but you don't have a suitably powerful general-purpose learning algorithm in your toolbox, one effective strategy can be to employ a separate feature extraction step, for example using Independent Component Analysis (ICA). So in the above example one of the outputs of this feature extraction step might be a new variable which captures the concept of "debt-to-income ratio" (bear in mind that ICA finds linear combinations of its inputs, so a ratio is within its reach only if the inputs are log-transformed first); this extracted variable can then be consumed directly by a Decision Tree or a regression algorithm.
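The effect of handing a tree the right feature can be sketched with the smallest possible tree, a single-split "stump". Everything below is an assumption made for illustration: the grid of incomes and debts, the factor of 3 in the labelling rule, and the fact that the ratio is computed by hand rather than discovered by ICA or any other extraction method.

```python
# Made-up borrowers on a grid; a loan is "good" when income exceeds 3x debt.
incomes = [20, 40, 60, 80, 100]
debts = [5, 10, 15, 20, 25, 30]
borrowers = [(inc, debt) for inc in incomes for debt in debts]
labels = [inc > 3 * debt for inc, debt in borrowers]


def best_stump_accuracy(values, labels):
    """Best accuracy of any single-threshold rule on one feature, either polarity."""
    best = 0.0
    for threshold in sorted(set(values)):
        predictions = [v > threshold for v in values]
        acc = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
        best = max(best, acc, 1.0 - acc)  # 1 - acc covers the flipped rule
    return best


income_only = best_stump_accuracy([inc for inc, _ in borrowers], labels)
debt_only = best_stump_accuracy([debt for _, debt in borrowers], labels)

# Hand-built extracted feature standing in for the feature extraction step.
ratio_feature = best_stump_accuracy([inc / debt for inc, debt in borrowers], labels)

print(f"stump on income alone:     {income_only:.2f}")
print(f"stump on debt alone:       {debt_only:.2f}")
print(f"stump on income/debt:      {ratio_feature:.2f}")
```

On this toy grid, no single threshold on income or on debt separates the good loans from the bad, whereas one threshold on the ratio does so perfectly. A deeper tree on the raw variables would get closer by stacking many splits, which is exactly the unduly complex model described above.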
Some algorithms which can natively discover high-order variable interactions
Some algorithms which are strongly limited by independence assumptions