ISLR - Chapter 5. Resampling Methods
- Repeatedly drawing samples from a training set and refitting a model of interest on each sample, in order to obtain additional information about the fitted model.
- The goal is to obtain information that would not be available from fitting the model only once using the original training sample.
  e.g. to estimate the variability of a model fit, draw different samples, fit the model to each new sample, then examine the extent to which the resulting fits differ.
5.1. Cross-Validation
- In the absence of a very large designated test set that can be used to directly estimate the test error rate, cross-validation refers to a class of methods that estimate the test error rate by holding out a subset of the training observations from the fitting process, then applying the statistical learning method to those held-out observations.
5.1.1. The Validation Set Approach
- Randomly divide the available set of observations into two parts:
  a training set and a validation set (or hold-out set).
- The model is fit on the training set, and the fitted model is used to predict the responses for the observations in the validation set. The validation set error rate estimates the test error rate.
- Repeating this procedure gives a different estimate of the test MSE for each random split of the observations, and there are two issues:
  - The validation estimate of the test error rate can be highly variable, depending on which observations are included in the training set and which in the validation set.
  - Only a subset of the observations is used to fit the model. Trained on fewer observations, the validation set error rate may overestimate the test error rate for the model fit on the entire data set.
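The validation set approach can be sketched as follows. This is a minimal numpy sketch on simulated data; the data-generating process (a quadratic signal with Gaussian noise) is a hypothetical example, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): y is quadratic in x plus noise.
n = 200
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - x**2 + rng.normal(0, 0.5, n)

# Randomly split the observations into a training set and a validation set.
idx = rng.permutation(n)
train, val = idx[: n // 2], idx[n // 2 :]

# Fit the model (here, a degree-2 polynomial) on the training set only.
coefs = np.polyfit(x[train], y[train], deg=2)

# The validation set MSE serves as an estimate of the test MSE.
y_hat = np.polyval(coefs, x[val])
val_mse = np.mean((y[val] - y_hat) ** 2)
```

Rerunning with a different seed changes the split, and hence the estimate, which is exactly the variability issue noted above.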
5.1.2. Leave-One-Out Cross-Validation
- Each single observation in turn is used as the validation set, and the remaining observations form the training set. The statistical learning method is fit on the n-1 training observations, and a prediction is made for the excluded observation.
- LOOCV estimate for the test MSE:
  \(CV_{(n)} = \frac{1}{n}\sum_{i=1}^n MSE_i\)
  It does not tend to overestimate the test error (each fit uses almost all the data), and there is no randomness in the splits, so it yields the same result every time. But it is expensive: the model must be fit n times.
- A shortcut for LOOCV with least squares (linear regression):
  \(\begin{align*} CV_{(n)} = \frac{1}{n} \sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1-h_i}\right)^2 \end{align*}\)
  where $\hat y_i$ is the i-th fitted value from the original least squares fit on the full data set, and $h_i = \frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum_{i^\prime=1}^n(x_{i^\prime}-\bar{x})^2}$ is the leverage. The leverage lies between 1/n and 1, and reflects the amount that an observation influences its own fit. A single fit of the full model thus suffices to compute the LOOCV estimate.
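The shortcut can be checked numerically: the single-fit formula and the brute-force n refits agree exactly. A numpy sketch on simulated data (the data itself is a hypothetical example):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 3 + 2 * x + rng.normal(size=n)

# One full-data least squares fit.
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta

# Leverages h_i for simple linear regression.
h = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

# Shortcut: CV_(n) from the single fit.
cv_shortcut = np.mean(((y - y_hat) / (1 - h)) ** 2)

# Brute force: refit n times, leaving out one observation each time.
errs = []
for i in range(n):
    mask = np.arange(n) != i
    b = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    errs.append((y[i] - X[i] @ b) ** 2)
cv_brute = np.mean(errs)
```

The two quantities agree up to floating-point rounding, confirming the identity.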
5.1.3. k-Fold Cross-Validation
- Randomly divide the observations into k groups, or folds, of approximately equal size. One fold is used as the validation set, the method is fit on the remaining k-1 folds, and the MSE is computed on the observations in the held-out fold. The procedure is repeated k times, with each fold serving once as the validation set.
- k-fold CV estimate for the test MSE:
  \(CV_{(k)} = \frac{1}{k}\sum_{i=1}^k MSE_i\)
  When k = n, k-fold CV reduces to LOOCV. With smaller k, k-fold CV has a computational advantage over LOOCV.
- We perform CV:
  - To determine how well a given model can be expected to perform on independent data.
  - To identify the model that results in the lowest test error, across different models or different levels of flexibility.
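Both uses appear in the following numpy sketch, which applies 5-fold CV to choose among polynomial degrees. The data-generating process (quadratic signal) and the candidate degrees are hypothetical examples:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 200, 5
x = rng.uniform(-2, 2, n)
y = 1 + 2 * x - x**2 + rng.normal(0, 0.5, n)

# Random division into k folds of approximately equal size.
folds = np.array_split(rng.permutation(n), k)

def cv_mse(degree):
    """k-fold CV estimate of the test MSE for a polynomial of given degree."""
    mses = []
    for fold in folds:
        train = np.setdiff1d(np.arange(n), fold)
        coefs = np.polyfit(x[train], y[train], deg=degree)
        resid = y[fold] - np.polyval(coefs, x[fold])
        mses.append(np.mean(resid**2))
    return np.mean(mses)  # CV_(k) = (1/k) * sum of fold MSEs

# Compare flexibility levels; the degree with the lowest CV error is selected.
scores = {d: cv_mse(d) for d in range(1, 6)}
best = min(scores, key=scores.get)
```

The degree-1 fit misses the curvature and its CV error is visibly larger; the minimum lands at (or near) the true degree 2.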
5.1.4. Bias-Variance Trade-Off for k-Fold Cross-Validation
- Besides the computational advantage, k-fold CV often gives more accurate estimates of the test error rate than does LOOCV.
- LOOCV gives approximately unbiased estimates of the test error, since each training set contains n-1 observations, almost as many as the full data set. By contrast, k-fold CV leads to an intermediate level of bias, since each training set contains (k-1)n/k observations. From the perspective of bias reduction, LOOCV is therefore preferred.
- However, LOOCV averages the outputs of n fitted models, each trained on an almost identical set of observations, so these outputs are highly (positively) correlated with each other. The mean of highly correlated quantities has higher variance than the mean of less correlated quantities, so the test error estimate from LOOCV tends to have higher variance than the one from k-fold CV.
5.1.5. Cross-Validation on Classification Problems
- LOOCV on classification:
  \(CV_{(n)} = \frac{1}{n}\sum_{i=1}^n Err_i\), where \(Err_i = I(y_i \ne \hat{y}_i)\).
- k-fold CV on classification:
  \(CV_{(k)} = \frac{1}{k}\sum_{i=1}^k Err_i\), where \(Err_i\) is now the misclassification rate on the i-th fold.
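The k-fold classification error can be sketched the same way, with the fold MSE replaced by the fold misclassification rate. The classifier below (nearest class centroid on one feature) and the simulated two-class data are hypothetical stand-ins for whatever method is being evaluated:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 200, 10
# Simulated two-class data (hypothetical): class 1 is shifted by +1.5.
y = rng.integers(0, 2, n)
x = rng.normal(size=n) + 1.5 * y

folds = np.array_split(rng.permutation(n), k)

errs = []
for fold in folds:
    train = np.setdiff1d(np.arange(n), fold)
    # Fit on the k-1 training folds: class means (nearest-centroid classifier).
    m0 = x[train][y[train] == 0].mean()
    m1 = x[train][y[train] == 1].mean()
    # Predict the held-out fold and record Err_i, its misclassification rate.
    pred = (np.abs(x[fold] - m1) < np.abs(x[fold] - m0)).astype(int)
    errs.append(np.mean(pred != y[fold]))

cv_err = np.mean(errs)  # CV_(k) = (1/k) * sum of Err_i
```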
5.2. The Bootstrap
- Sampling with replacement:
  - Dataset $Z = (z_1, \ldots, z_n)$, $z_i = (x_i, y_i)$
  - Bootstrap samples $Z^{*b}$, $b = 1, \ldots, B$
- For any statistic $S(Z)$ computed from the full dataset Z, with $S(Z^{*b})$ computed from the bootstrap samples, the bootstrap estimate of its variance is
  \(\begin{align*} \widehat{Var}(S(Z)) = \frac{1}{B-1}\sum_{b=1}^B\left(S(Z^{*b})-\bar{S}^*\right)^2 \end{align*}\)
  where $\bar{S}^{*} = \frac{1}{B}\sum_{b=1}^B S(Z^{*b})$.
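The bootstrap variance estimate translates directly into code. In this numpy sketch the statistic S is the sample median and the data are simulated; both choices are hypothetical examples:

```python
import numpy as np

rng = np.random.default_rng(4)
n, B = 100, 1000
z = rng.normal(loc=5.0, scale=2.0, size=n)  # dataset Z (simulated)

# Statistic S(Z): here, the sample median.
def S(sample):
    return np.median(sample)

# Draw B bootstrap samples with replacement and recompute S on each.
s_star = np.array([S(rng.choice(z, size=n, replace=True)) for _ in range(B)])

# Bootstrap estimate of Var(S(Z)), per the formula above.
s_bar = s_star.mean()
var_hat = np.sum((s_star - s_bar) ** 2) / (B - 1)
se_hat = np.sqrt(var_hat)
```

The appeal is that no closed-form variance for the median is needed; the same loop works for any statistic S.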