A Data Science Central Community
Hi all, I would like to get the group's view on the advantages and disadvantages of Random Forests and MARS modelling versus linear regression. It would be interesting to compare them at the level of statistical principles, and also in terms of their usefulness to econometrics.
There's no way to give you a good answer within a forum post, so I'll summarize my thoughts in a few short sentences. RF is a very powerful modeling approach, but it is pretty much a black box. To put it in terms of linear regression, it is like building 200 regression models, each on data and predictors chosen at random, and letting the overall prediction be the average (or the majority vote) of all 200 models. In an actual RF, each of those models is a decision tree rather than a linear regression. With linear regression, you have one model built on all predictors, or on predictors chosen by a variable-selection approach such as stepwise or best subsets. That comparison also shows how different the prediction equations are: the linear regression equation is fairly easy to understand, whereas with RF...well...there really isn't an equation per se. The utility really comes down to your purpose. Are you primarily focused on accurate predictions? If so, RF may be your answer. Do you need to understand how the variables work together towards a prediction? If so, you may need linear regression (or another easily interpretable model).
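To make the "one equation versus an average of 200 models" contrast concrete, here is a rough sketch using only the Python standard library. It is not a real random forest: it bags 200 one-split regression trees ("stumps") on a single predictor, whereas an actual RF grows deeper trees and also samples predictors at each split. The synthetic step-shaped data, the seeds, and all function names (`fit_line`, `fit_stump`, `fit_bagged_stumps`, and so on) are made up for illustration.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - slope * mx, slope

def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    overall = sum(ys) / len(ys)
    # Fallback if no valid split exists: predict the overall mean on both sides.
    best = (float("inf"), pairs[0][0], overall, overall)
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:k]]
        right = [y for _, y in pairs[k:]]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, (pairs[k - 1][0] + pairs[k][0]) / 2, ml, mr)
    return best[1:]

def fit_bagged_stumps(xs, ys, n_models=200, seed=0):
    """Fit each stump on its own bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_models):
        rows = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([xs[i] for i in rows], [ys[i] for i in rows]))
    return stumps

def predict_bagged(stumps, x):
    """The ensemble prediction is the plain average over all stumps."""
    return sum(left if x <= t else right for t, left, right in stumps) / len(stumps)

def mse(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def make_data(n, rng):
    """Synthetic, deliberately non-linear data: a noisy step at x = 0.5."""
    xs = [rng.uniform(0, 1) for _ in range(n)]
    ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

rng = random.Random(42)
train_x, train_y = make_data(100, rng)
test_x, test_y = make_data(100, rng)

intercept, slope = fit_line(train_x, train_y)
stumps = fit_bagged_stumps(train_x, train_y)

line_mse = mse([intercept + slope * x for x in test_x], test_y)
bagged_mse = mse([predict_bagged(stumps, x) for x in test_x], test_y)

print(f"linear model: y = {intercept:.2f} + {slope:.2f} * x, test MSE = {line_mse:.3f}")
print(f"{len(stumps)} bagged stumps (no single equation), test MSE = {bagged_mse:.3f}")
```

The linear fit can be read off as a single intercept and slope, while the ensemble is just a list of 200 thresholds and means with no comparable summary. On data this far from linear, the averaged stumps should predict noticeably better; on data that really is close to linear, the ranking could easily reverse.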
Linear regression can be difficult to interpret, is subject to over-fitting, is sensitive to outliers, and only works in contexts in which associations are nearly linear. If you only have a few variables and a few hundred observations, it might be enough.