
Chapter 8. Tree-Based Methods

  • Background
    A tree is grown by asking binary questions, choosing splitting variables and split points.
    It partitions the input space and fits a constant to each region.
    $X = (X_1,\ldots,X_p)$ is mapped onto regions $R_1,\ldots,R_M$, which are p-dimensional rectangles,
    where \(R_m\cap R_{m'}=\emptyset\) for \(m\ne m'\), and \(\bigcup_{m=1}^M R_m = \mathbb{X}\), the input space.

  • Model:
    \(f(X)=\sum_{m=1}^M C_m I(X\in R_m)\)
    where $C_m$ is a constant for each region m.
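  • A small, made-up illustration of this model: with one predictor and two regions, say $R_1=\{X: X_1<3\}$ and $R_2=\{X: X_1\ge 3\}$ with constants $C_1=5$ and $C_2=10$,
    \(f(X)=5\cdot I(X\in R_1)+10\cdot I(X\in R_2),\)
    so an observation with $X_1=2$ falls in $R_1$ and receives the prediction 5.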

8.1. The Basics of Decision Trees

8.1.1. Regression Trees

Prediction via Stratification of the Feature Space

  • Two steps of building a regression tree:
    1. Divide the predictor space (the set of possible values for $X_1, \ldots, X_p$) into J distinct and non-overlapping regions, $R_1, \ldots, R_J$.
    2. For every observation that falls into the region $R_j$, we make the same prediction, which is simply the mean of the response values for the training observations in $R_j$.
  • We divide the predictor space into high-dimensional rectangles, or boxes, and the goal is to find the boxes that minimize the RSS, given by:
    \(\sum_{j=1}^J\sum_{i\in R_j}(y_i-\hat{y}_{R_j})^2\)
    where $\hat{y}_{R_j}$ is the mean response for the training observations within the jth box.
    I.e., the least squares criterion within the given regions:
    \(\begin{align*} \text{min}_{C_m} \text{RSS} &= \text{min}_{C_m}\sum_{i=1}^N(y_i-f(x_i))^2 \\ &= \text{min}_{C_m}\sum_{i=1}^N\left[y_i - \sum_{m=1}^M C_m I(x_i\in R_m) \right]^2 \\ \rightarrow \hat{C}_m &= \text{ave}(y_i|x_i\in R_m) = \bar{y}_m \end{align*}\)

  • How do we split the predictor space?
    Because considering every possible partition is computationally infeasible, we take a top-down, greedy approach known as recursive binary splitting. To perform recursive binary splitting, we first select the predictor $X_j$ and the cutpoint s such that splitting the predictor space into the regions \(\{X|X_j<s\}\) and \(\{X|X_j\ge s\}\) leads to the greatest possible reduction in RSS.

  • For splitting variable $X_j$ and split point s,
    In two binary partitions \(R_1(j,s) = \{X|X_j<s\}\) and \(R_2(j,s) = \{X|X_j\ge s\}\),
    \(\text{min}_{j,s} \left[ \sum_{i:x_i\in R_1(j,s)}(y_i-\hat{y}_{R_1})^2 + \sum_{i:x_i\in R_2(j,s)}(y_i-\hat{y}_{R_2})^2 \right]\), or
    \(\text{min}_{C_1}\sum_{i:x_i\in R_1(j,s)}(y_i-C_1)^2 + \min_{C_2}\sum_{i:x_i\in R_2(j,s)}(y_i-C_2)^2\).
    \(\begin{align*} \rightarrow & \text{ave}(y_i|x_i\in R_1(j,s)) = \hat{C}_1 = \hat{y}_{R_1} \\ & \text{ave}(y_i|x_i\in R_2(j,s)) = \hat{C}_2 = \hat{y}_{R_2} \end{align*}\)

  • We repeat this process over the two previously identified regions, looking for the best predictor and best cutpoint in order to split the data further so as to minimize the RSS within each of the resulting regions. The process continues until a stopping criterion is reached; for instance, until no region contains more than five observations.
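  • A minimal sketch of this greedy fitting procedure, assuming scikit-learn and synthetic data (the variable names and the two-predictor setup are illustrative, not from the text):

```python
# Grow a regression tree by greedy recursive binary splitting (CART).
# min_samples_split=6 mirrors the stopping rule "no region with more than
# five observations is split further".
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))                   # two predictors X1, X2
y = np.where(X[:, 0] < 5, 3.0, 8.0) + rng.normal(scale=0.5, size=200)

tree = DecisionTreeRegressor(min_samples_split=6, random_state=0).fit(X, y)

# Each terminal node predicts the mean response of its training observations.
print(tree.predict([[2.0, 7.0], [9.0, 1.0]]))           # roughly [3, 8]
```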

Tree Pruning

  • Tree size is determined by the number of regions (M) and model complexity by the number of parameters ($C_m$). Since there is an extremely large number of possible subtrees, estimating the CV error for every possible subtree would be too cumbersome. Instead, to find the optimal M, we use cost-complexity pruning, reducing a very large tree $T_0$ to a subtree with the lowest test error rate.

  • For each value of a nonnegative tuning parameter $\alpha$ there corresponds a subtree $T\subset T_0$ that minimizes
    \(\sum_{m=1}^{|T|}\sum_{i:x_i\in R_m}(y_i-\hat{y}_{R_m})^2 + \alpha|T|\).
    Here |T| indicates the number of terminal nodes of the tree T, $R_m$ is the rectangle, or the subset of predictor space corresponding to the mth terminal node, and $\hat{y}_{R_m} = \hat{C}_m$ is the predicted response, the mean of the training observations in $R_m$. When $\alpha=0$, then the subtree T will simply equal $T_0$. As $\alpha$ increases, there is a price to pay for having a tree with many terminal nodes, and so the quantity will tend to be minimized for a smaller subtree.

  • Algorithm

    1. Use recursive binary splitting to grow a large tree on the training data, stopping only when each terminal node has fewer than some minimum number of observations (a threshold).
    2. Apply cost complexity pruning to obtain a sequence of best subtrees, as a function of $\alpha$.
    3. Use K-fold CV to choose $\alpha$:
      (a) Repeat Steps 1 and 2 on all but the kth fold of the training data.
      (b) Evaluate the mean squared prediction error on the data in the left-out kth fold, as a function of $\alpha$.
      Average the results for each value of $\alpha$, and pick an $\alpha$ to minimize the average error.
    4. Return the subtree from Step 2 that corresponds to the chosen value of $\alpha$.
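  • A hedged sketch of this algorithm in scikit-learn, reusing the X, y arrays from the earlier sketch: cost_complexity_pruning_path returns the sequence of $\alpha$ values (ccp_alpha) that index the best subtrees, and K-fold CV picks the $\alpha$ with the smallest average error:

```python
# Cost-complexity pruning: grow a large tree, get the alpha sequence,
# choose alpha by 5-fold cross-validation, then refit the pruned tree.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

full_tree = DecisionTreeRegressor(random_state=0).fit(X, y)
alphas = full_tree.cost_complexity_pruning_path(X, y).ccp_alphas

cv_mse = [-cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=0),
                           X, y, cv=5, scoring="neg_mean_squared_error").mean()
          for a in alphas]

best_alpha = alphas[int(np.argmin(cv_mse))]
pruned_tree = DecisionTreeRegressor(ccp_alpha=best_alpha, random_state=0).fit(X, y)
```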

8.1.2. Classification Trees

  • Instead of predicting the mean response of the training observations in a terminal node, a classification tree predicts that each observation belongs to the most commonly occurring class of training observations in the region to which it belongs. In interpreting the results, we are often interested not only in the class prediction, but also in the class proportions among the training observations that fall into that region.

  • A natural alternative to RSS is the classification error rate: the fraction of the training observations in a region that do not belong to the most common class:
    \(E=1-\text{max}_k(\hat{p}_{mk})\), where \(\hat{p}_{mk} = \frac{1}{N_m}\sum_{x_i\in R_m}I(y_i=k)\),
    represents the proportion of training observations in the mth region that are from the kth class. By the majority vote rule, \(k(m) = \text{argmax}_k \hat{p}_{mk}\).

  • However, the classification error rate is not sufficiently sensitive for tree-growing, so in practice two other measures are preferable.
    • Gini index:
      \(G=\sum_{k=1}^K\hat{p}_{mk}(1-\hat{p}_{mk})\),
      a measure of total variance across the K classes. It is often referred to as a measure of node purity; a small value indicates that a node contains predominantly observations from a single class.
    • Cross-entropy or deviance:
      \(D=-\sum_{k=1}^K\hat{p}_{mk}\log\hat{p}_{mk}\),
      a nonnegative value that takes on a small value if the mth node is pure.
      These two measures are typically used to evaluate the quality of a particular split, since they are more sensitive to node purity than is the classification error rate. The classification error rate is preferable if prediction accuracy of the final pruned tree is the goal.
  • Some splits yield two terminal nodes that have the same predicted value; they are performed because they increase node purity. Even though such a split does not reduce the classification error, it improves the Gini index and the entropy.
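  • A small sketch computing the three node measures from the class proportions $\hat{p}_{mk}$ of a single region (the proportions below are made up for illustration):

```python
# Node impurity measures for one region, given its class proportions p_mk.
import numpy as np

p = np.array([0.7, 0.2, 0.1])        # \hat{p}_{mk} for K = 3 classes

class_error = 1 - p.max()            # E = 1 - max_k p_mk        -> 0.30
gini = np.sum(p * (1 - p))           # G = sum_k p_mk (1 - p_mk) -> 0.46
entropy = -np.sum(p * np.log(p))     # D = -sum_k p_mk log p_mk  -> ~0.80
```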

8.1.3. Trees Versus Linear Models

  • Linear regression model is:
    \(f(X) = \beta_0 + \sum_{j=1}^p X_j\beta_j\),
    whereas the regression tree model is:
    \(f(X) = \sum_{m=1}^M c_m\cdot 1_{(X\in R_m)}\)

  • If there is a highly non-linear and complex relationship between the features and the response, then decision trees may outperform classical approaches. Also, trees may be preferred for the sake of interpretability and visualization.

8.1.4. Advantages and Disadvantages of Trees

  • Pros
    • Good interpretability and easy to explain.
    • More closely mirror human decision-making than do some of the regression and classification approaches.
    • Can be displayed graphically, and are easily interpreted even by a non-expert.
    • Can easily handle qualitative predictors without the need to create dummy variables.
  • Cons
    • Generally do not have the same level of predictive accuracy as some of the other regression and classification approaches.
    • Non-robust: a small change in the data can cause a large change in the final estimated tree (high variance).
    • Lack of smoothness, can be far from the true function.

By aggregating many decision trees, the predictive performance of trees can be substantially improved.

8.2. Bagging, Random Forests, Boosting, and Bayesian Additive Regression Trees

  • An ensemble method is an approach that combines many simple “building block” models in order to obtain a single and potentially very powerful model. These simple building block models are sometimes known as weak learners, since they may lead to mediocre predictions on their own.

8.2.1. Bagging

  • Bootstrap aggregation, or bagging, is a general-purpose procedure for reducing the variance of a statistical learning method. Given a set of n independent observations $Z_1,\ldots,Z_n$, each with variance $\sigma^2$, the variance of the mean $\bar{Z}$ of the observations is given by $\sigma^2/n$. In other words, averaging a set of observations reduces variance.
    Hence it is a natural way to reduce the variance and increase the test set accuracy of a statistical learning method. We could calculate \(\hat{f}^1(x), \ldots, \hat{f}^B(x)\) using B separate training sets, and average them in order to obtain a single low-variance model, \(\hat{f}_{\text{avg}}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^b(x)\).

  • In practice, we do not have access to multiple training sets. Instead, we can bootstrap, taking repeated samples from the (single) training data set. The bagging model is
    \(\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^B\hat{f}^{*b}(x)\).

  • It is particularly useful for decision trees. To apply bagging to regression trees, we simply construct B regression trees using B bootstrapped training sets, and average the resulting predictions. These trees are grown deep but not pruned. Each individual tree has high variance but low bias; averaging these trees reduces the variance. Bagging improves the accuracy by combining hundreds or even thousands of trees into a single procedure.

  • For a classification problem where Y is qualitative, the simplest approach is a majority vote: for a given test observation, we record the class predicted by each of the B trees, and the overall prediction is the most commonly occurring class among those predictions.
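  • A minimal sketch of bagging regression trees by hand (scikit-learn's BaggingRegressor / BaggingClassifier package the same idea); X, y, and X_test are assumed arrays:

```python
# Bagging: fit B deep, unpruned trees on bootstrap samples and average.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X, y, X_test, B=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.zeros((B, len(X_test)))
    for b in range(B):
        idx = rng.integers(0, n, size=n)                     # bootstrap sample, with replacement
        tree = DecisionTreeRegressor().fit(X[idx], y[idx])   # grown deep, not pruned
        preds[b] = tree.predict(X_test)
    return preds.mean(axis=0)                                # f_bag(x): average of B predictions
```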

Out-of-Bag Error Estimation

  • Without cross-validation or the validation set approach, there is a very straightforward way to estimate the test error of a bagged model. On average, each bagged tree makes use of around two-thirds of the observations; the remaining observations not used to fit a given bagged tree are referred to as the out-of-bag (OOB) observations. We can predict the response for the ith observation using each of the trees in which that observation was OOB. To obtain a single prediction for the ith observation, we average these predicted responses in the regression setting, or take a majority vote in the classification setting. An OOB prediction can be obtained in this way for each of the n observations, from which the overall OOB MSE or classification error can be computed.

  • Since the response for each observation is predicted using only the trees that were not fit using that observation, the resulting OOB error is a valid estimate of the test error for the bagged model.
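  • A hedged sketch of OOB error estimation with scikit-learn's BaggingRegressor, which tracks the out-of-bag observations for each tree (X and y are the arrays assumed earlier; for a regressor, oob_score_ is reported as $R^2$ rather than MSE):

```python
# Out-of-bag error estimation with a bagged ensemble of regression trees.
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=200,
                       oob_score=True, random_state=0).fit(X, y)

print(bag.oob_score_)        # OOB estimate of test-set performance (R^2)
print(bag.oob_prediction_)   # per-observation OOB predictions
```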

Variable Importance Measures

  • Bagging improves prediction accuracy at the expense of interpretability. One can obtain an overall summary of the importance of each predictor using the RSS or the Gini index. In the case of bagging regression trees, we can record the total amount that the RSS is decreased due to splits over a given predictor, averaged over all B trees. A large value indicates an important predictor.
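  • A minimal sketch: scikit-learn tree ensembles expose this impurity-based summary as feature_importances_. Here a RandomForestRegressor with max_features=None (all predictors considered at every split) stands in for bagged trees, and the predictor names are assumptions:

```python
# Impurity-based variable importance: average decrease in RSS (regression)
# or Gini index (classification) due to splits on each predictor.
from sklearn.ensemble import RandomForestRegressor

bagged = RandomForestRegressor(n_estimators=500, max_features=None,
                               random_state=0).fit(X, y)
for name, imp in zip(["X1", "X2"], bagged.feature_importances_):
    print(name, round(imp, 3))       # larger value => more important predictor
```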

8.2.2. Random Forests

  • If there is one very strong predictor along with a number of other moderately strong predictors, then with bagging most trees will use this strong predictor in the top split. Consequently, all of the bagged trees will look quite similar and the predictions from the bagged trees will be highly correlated. Averaging highly correlated quantities does not lead to a substantial reduction in variance over a single tree in this setting.

  • Random forests overcome this problem by forcing each split to consider only a subset of the predictors. As in bagging, we build a number of decision trees on bootstrapped training samples. But when building these decision trees, each time a split in a tree is considered, a random sample of m predictors is chosen as split candidates from the full set of p predictors. Typically, the number of predictors considered at each split m is approximately equal to the square root of the total number of predictors p; $m\approx\sqrt{p}$.

  • On average $(p-m)/p$ of the splits will not even consider the strong predictor, and so other predictors will have more of a chance. This process decorrelates the trees, thereby making the average of the resulting trees less variable and more reliable.
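  • A hedged sketch of the bagging-versus-random-forest contrast in scikit-learn: the only change is restricting each split to $m\approx\sqrt{p}$ candidate predictors via max_features="sqrt" (OOB scores are used just for a rough comparison):

```python
# Random forest = bagged trees + a random subset of m predictors per split.
from sklearn.ensemble import RandomForestRegressor

bagging = RandomForestRegressor(n_estimators=500, max_features=None,    # m = p (bagging)
                                oob_score=True, random_state=0).fit(X, y)
forest  = RandomForestRegressor(n_estimators=500, max_features="sqrt",  # m ~ sqrt(p)
                                oob_score=True, random_state=0).fit(X, y)

print(bagging.oob_score_, forest.oob_score_)
```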

Permutation Importance

  • Calculate the estimated test error with the original OOB samples as validation data, then randomly shuffle (permute) the values of the jth variable in the OOB samples to break the relationship between $X_j$ and Y. With these permuted OOB samples, recalculate the measure. If the permuted measure is substantially worse than the reference measure, we consider $X_j$ an important variable. Repeating this for all variables gives a ranking of variable importance. However, permutation importance can overestimate the importance of correlated predictors.
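  • A hedged sketch using scikit-learn's permutation_importance. It permutes each predictor on a held-out validation set rather than on the OOB samples described above, but the idea, shuffling $X_j$ and measuring the drop in performance, is the same:

```python
# Permutation importance: shuffle one predictor at a time and measure
# how much the model's validation score degrades.
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_tr, y_tr)

result = permutation_importance(rf, X_val, y_val, n_repeats=20, random_state=0)
print(result.importances_mean)       # larger drop => more important variable
```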

8.2.3. Boosting

  • Bagging creates multiple copies of the original training data set using the bootstrap, fits a separate decision tree to each copy and then combines all of the trees in order to create a single predictive model. Each tree is built on a bootstrap data set, independent of the other trees.

  • Boosting works in a similar way, except that the trees are grown sequentially: each tree is grown using information from previously grown trees. This approach does not involve bootstrap sampling; instead each tree is fit on a modified version of the original data set.

  • Algorithm
    1. Set $\hat{f}(x)=0$ and $r_i=y_i$ for all i in the training set.
    2. For $b=1,2,\ldots,B$, repeat:
      (a) Fit a tree $\hat{f}^b$ with d splits ($d+1$ terminal nodes) to the training data (X,r).
      (b) Update $\hat{f}$ by adding in a shrunken version of the new tree:
      \(\hat{f}(x)\leftarrow\hat{f}(x)+\lambda\hat{f}^b(x)\).
      (c) Update the (unexplained) residuals, removing the effect of the bth tree:
      \(r_i\leftarrow r_i - \lambda\hat{f}^b(x_i)\).
    3. Output the boosted model,
      \(\hat{f}(x)=\sum_{b=1}^B\lambda\hat{f}^b(x)\).
  • Boosting combines multiple weak learners (models with poor predictive power on their own) to make a better prediction. In this procedure the data (the residuals, or the weights of the observations) are repeatedly modified and the model learns slowly; a minimal sketch of the regression version appears after the tuning-parameter list below.

  • Boosting tuning parameters
    1. B, the number of trees. Unlike bagging and random forests, boosting can overfit if B is too large, although this overfitting tends to occur slowly if at all. We use cross-validation to select B.
    2. $\lambda$, the shrinkage parameter, a small positive number. This controls the rate at which boosting learns. Typical values are 0.01 or 0.001. A very small $\lambda$ can require a very large value of B to achieve good performance.
    3. d, the number of splits in each tree, which controls the complexity of the boosted ensemble. Often $d=1$ works well, in which case each tree is a stump, consisting of a single split. In this case, the boosted ensemble fits an additive model, since each term involves only a single variable. More generally, d is the interaction depth, and controls the interaction order of the boosted model.
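  • A minimal sketch of the boosting algorithm above for regression trees, hand-rolled for clarity (libraries such as scikit-learn's GradientBoostingRegressor implement the production version). X and y are assumed training arrays; B, lam ($\lambda$), and d follow the tuning parameters just described:

```python
# Boosting for regression trees: fit a small tree with d splits to the
# current residuals, shrink it by lambda, add it to the model, repeat.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost(X, y, B=1000, lam=0.01, d=1):
    r = y.astype(float).copy()                  # 1. f_hat(x) = 0, r_i = y_i
    trees = []
    for _ in range(B):                          # 2. grow B trees sequentially
        tree = DecisionTreeRegressor(max_leaf_nodes=d + 1).fit(X, r)   # (a) d splits
        r -= lam * tree.predict(X)              # (c) update residuals
        trees.append(tree)                      # (b) model += lambda * f^b
    return trees

def boost_predict(trees, X_new, lam=0.01):
    return lam * sum(t.predict(X_new) for t in trees)   # 3. sum_b lambda * f^b(x)
```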

8.2.4. Bayesian Additive Regression Trees

  • Bagging and random forests make predictions from an average of independent trees, while boosting uses a weighted sum of trees. BART is related to both approaches: each tree is constructed in a random manner, and each tree tries to capture signal not yet accounted for by the current model.

  • Let K denote the number of regression trees, and B the number of iterations for which BART will be run. The notation $\hat{f}_k^b(x)$ represents the prediction at x for the kth regression tree used in the bth iteration. At the end of each iteration, the K trees from that iteration are summed: \(\hat{f}^b(x)=\sum_{k=1}^K\hat{f}_k^b(x)\) for $b=1,\ldots,B$.

  1. In the first iteration, all trees are initialized to have a single root node, with \(\hat{f}_k^1(x) = \frac{1}{nK}\sum_{i=1}^n y_i\), the mean of the response values divided by the total number of trees. Thus, \(\hat{f}^1(x)=\sum_{k=1}^K\hat{f}_k^1(x)=\frac{1}{n}\sum_{i=1}^n y_i = \bar{y}\).

  2. Next, BART updates each of the K trees, one at a time.
    In the bth iteration, we subtract from each response value the predictions from all but the kth tree to obtain a partial residual
    \(r_i = y_i - \sum_{k'<k}\hat{f}_{k'}^b(x_i) - \sum_{k'>k}\hat{f}_{k'}^{b-1}(x_i)\)
    Rather than fitting a fresh tree to this partial residual, BART randomly chooses a perturbation to the tree from the previous iteration from a set of possible perturbations, favoring ones that improve the fit to the partial residual.

  • Two components to the perturbation:
    We may change the structure of the tree by adding or pruning branches.
    We may change the prediction in each terminal node of the tree.
  3. The output of BART is a collection of prediction models,
    \(\hat{f}^b(x) = \sum_{k=1}^K\hat{f}_k^b(x)\)
  • Typically, the first few of these prediction models, obtained in the earliest iterations (known as the burn-in period), tend not to provide very good results. Letting L denote the number of burn-in iterations (for instance, L=200), we simply discard the first L iterations and take the average of the rest: \(\hat{f}(x)=\frac{1}{B-L}\sum_{b=L+1}^B\hat{f}^b(x)\). We can also compute quantities other than the average.
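  • A heavily simplified sketch of the sum-of-trees bookkeeping that BART iterates over (K trees, B iterations, partial residuals, burn-in). The random tree perturbations and posterior sampling of real BART are replaced by simply refitting a shallow tree to each partial residual, so this is only an illustration of the structure, not BART itself; X and y are assumed training arrays:

```python
# Sum-of-trees backfitting loop illustrating BART's bookkeeping.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sum_of_trees_fit(X, y, K=50, B=100, L=20, max_depth=2):
    n = len(y)
    preds = np.full((K, n), y.mean() / K)       # 1. every tree starts at y_bar / K
    kept = []
    for b in range(1, B + 1):
        for k in range(K):                      # 2. update the K trees one at a time
            r = y - preds.sum(axis=0) + preds[k]        # partial residual for tree k
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, r)
            preds[k] = tree.predict(X)          # real BART would perturb, not refit
        if b > L:                               # discard the burn-in iterations
            kept.append(preds.sum(axis=0))
    return np.mean(kept, axis=0)                # average of post-burn-in fitted values
```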

8.2.5. Summary of Tree Ensemble Methods

  • Bagging: The trees are grown independently on random samples of the observations and tend to be quite similar to each other. Thus, bagging can get caught in local optima and fail to thoroughly explore the model space.
  • Random Forests: The trees are grown independently on random samples but each split on each tree is performed using a random subset of the features, decorrelating the trees and leading to a more thorough exploration of the model space.
  • Boosting: We only use the original data and do not draw any random samples. Trees are grown successively, using a slow learning approach: each new tree is fit to the signal that is left over from the earlier trees, and shrunken down before it is used.
  • BART: We only use the original data and trees are grown successively. However, each tree is perturbed in order to avoid local minima and achieve a more thorough exploration of the model space.