Regression trees are a way to partition your explanatory variables to (potentially) better predict an outcome of interest. A regression tree starts with an outcome (let’s call it y) and a vector of explanatory variables (X).
For instance, let y be health care spending, X=(X1,X2) where X1 is the patient’s age and X2 is the patient’s gender.
The regression tree will identify a cutoff for each of X1 and X2 that minimizes the error from a regression. The error to be minimized could be the sum of squared errors, the sum of the absolute values of the errors, or any other error structure of interest. For patient age, the regression tree will pick a cutoff (say age 40) that minimizes the error. Patient gender is already binary, so the regression tree would simply split the data into males and females. The regression tree logic then decides whether the age or gender division best fits the data.
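To make the cutoff search concrete, here is a minimal sketch of picking a single split on age by minimizing the sum of squared errors. The data values are made up for illustration; real implementations scan candidate cutoffs the same way.

```python
# Choosing a single split: for each candidate cutoff on age, predict the
# mean spending on each side and measure the sum of squared errors (SSE).
# The cutoff with the lowest total SSE wins. Data values are illustrative.

ages =     [25, 30, 35, 45, 50, 55, 60, 70]
spending = [1.0, 1.2, 1.1, 2.8, 3.0, 3.5, 4.0, 5.2]  # e.g., in $1,000s

def sse(ys):
    """Sum of squared errors around the group mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)

def best_split(xs, ys):
    """Return (cutoff, error) minimizing SSE over all candidate cutoffs."""
    best = (None, float("inf"))
    for cutoff in sorted(set(xs))[1:]:          # candidate cutoffs
        left  = [y for x, y in zip(xs, ys) if x <  cutoff]
        right = [y for x, y in zip(xs, ys) if x >= cutoff]
        error = sse(left) + sse(right)
        if error < best[1]:
            best = (cutoff, error)
    return best

cutoff, error = best_split(ages, spending)
print(f"best cutoff: age {cutoff}")  # best cutoff: age 45
```

The same search would be run for every explanatory variable, and the variable whose best cutoff gives the lowest error is chosen for the split.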
Regression trees are recursive. For instance, if age minimized the error at the first node, the second set of nodes would look at health care spending for four groups: <40 and male, <40 and female, ≥40 and male, ≥40 and female.
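A hedged sketch of the recursion, again on made-up data (gender encoded 0 = female, 1 = male): at each node we try every cutoff on every variable, keep the split with the lowest sum of squared errors, and recurse until a node is too small to split further. This is illustrative, not a full CART implementation.

```python
# rows are (age, gender, spending); gender encoded 0 = female, 1 = male.
rows = [
    (25, 0, 1.0), (30, 1, 2.0), (35, 0, 1.2), (38, 1, 2.2),
    (45, 0, 3.0), (50, 1, 4.4), (55, 0, 3.2), (60, 1, 4.6),
]

def sse(ys):
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split(rows):
    """Return (error, var_index, cutoff) for the best split, or None."""
    best = None
    for var in (0, 1):                          # 0 = age, 1 = gender
        for cutoff in sorted({r[var] for r in rows})[1:]:
            left  = [r[2] for r in rows if r[var] <  cutoff]
            right = [r[2] for r in rows if r[var] >= cutoff]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, var, cutoff)
    return best

def grow(rows, min_size=2, depth=0):
    split = best_split(rows)
    if len(rows) < 2 * min_size or split is None:
        mean = sum(r[2] for r in rows) / len(rows)
        print("  " * depth + f"leaf: mean spending = {mean:.2f}")
        return
    err, var, cutoff = split
    name = "age" if var == 0 else "gender"      # cutoff 1 on gender = male vs female
    print("  " * depth + f"split on {name} at {cutoff}")
    grow([r for r in rows if r[var] <  cutoff], min_size, depth + 1)
    grow([r for r in rows if r[var] >= cutoff], min_size, depth + 1)

grow(rows)
```

With this toy data the tree first splits on age, then on gender within each age branch, producing exactly the four leaf groups described above.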
Visualizing Decision Trees
Luis Torgo’s tree-based regression model tutorial gives both a visual and mathematical description of a two-variable regression tree.
In practice, however, regressions have numerous variables. Further, more complex regression trees can allow for variable interactions while creating the partitions. Thus, regression trees risk becoming too detailed and losing predictive power (i.e., overfitting). Typically, the user imposes stopping rules, such as a minimum sample size per node, a minimum increase in R-squared for each additional node, or using a validation sample to prune some of the leaf nodes.
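These stopping rules map onto standard software options. As one sketch, scikit-learn's DecisionTreeRegressor exposes each of them as a parameter; the threshold values below are arbitrary examples, not recommendations.

```python
# Common stopping/pruning controls in scikit-learn's DecisionTreeRegressor.
# The specific thresholds here are arbitrary illustrations.
from sklearn.tree import DecisionTreeRegressor

tree = DecisionTreeRegressor(
    max_depth=4,                  # cap the number of recursive splits
    min_samples_leaf=50,          # minimum sample size per leaf node
    min_impurity_decrease=0.01,   # require a minimum error reduction per split
    ccp_alpha=0.005,              # cost-complexity pruning of leaf nodes
)
```

Cost-complexity pruning (`ccp_alpha`) plays the role of the validation-sample pruning mentioned above: it trims leaves whose error reduction does not justify the added complexity.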
Real World Decision Tree Example with Interactions
A paper by Buchner, Wasem and Schillo (2015) uses the regression tree approach and applies it to a risk adjustment model for setting health insurance premiums in Germany. They allow for the risk adjustment to include disease interactions as one would assume that an additional disease acts in a non-linear way with respect to cost. The authors find that adding interaction terms to the risk adjustment model adds little predictive value.
The resulting risk adjustment formula shows an improvement in adjusted R-squared from 25.43% to 25.81% on the evaluation data set. Predictive ratios are calculated for subgroups affected by the interactions. The R-squared improvement detected is only marginal: according to the sample-level performance measures used, leaving out a considerable number of morbidity interactions causes no relevant loss of accuracy.
Thus, although added detail and model sophistication can be a good thing, one must be sure to closely examine whether the additional complexity adds significant predictive value. In this case it did not.
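One way to run this kind of check yourself is to compare out-of-sample R-squared across models of increasing complexity. The sketch below uses synthetic data (not the paper's model or data): the outcome depends on two made-up risk factors with no true interaction, and trees of increasing depth are scored on held-out data.

```python
# Hedged sketch: does added tree complexity improve out-of-sample fit?
# Synthetic data with NO true interaction between the two risk factors.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(20, 80, size=(2000, 2))       # two made-up risk factors
y = 0.05 * X[:, 0] + 0.03 * X[:, 1] + rng.normal(0, 1, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for depth in (2, 4, 8, 16):
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores[depth] = model.fit(X_tr, y_tr).score(X_te, y_te)  # held-out R-squared
    print(depth, round(scores[depth], 3))
```

If deeper (more interaction-capable) trees do not raise the held-out R-squared, the extra complexity is not adding predictive value, which is exactly the pattern the paper reports.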