
Chapter 2. Statistical Learning

2.1. What is Statistical Learning?

  • Format of Statistical Learning:
    $ Y=f(X)+\epsilon $
    assume a relationship between $Y$ and $X$;
    $Y$: observed quantitative response
    $X$: $p$ different predictors $(X_1, X_2, \ldots , X_p)$
    $f$ : Fixed but UNKNOWN function of predictors $X$
    $\epsilon$ : a random error term, independent of $X$, with mean zero; $E(\epsilon)=0$
    $\rightarrow$ Statistical Learning: A set of methods estimating $f$.
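
  • A minimal simulation sketch of this setup (the particular $f$ and noise level below are made-up choices, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical "true but unknown" function, chosen only for illustration
    return 2.0 + 3.0 * np.sin(x)

n = 100
X = rng.uniform(0, 6, size=n)          # p = 1 predictor here, for simplicity
eps = rng.normal(0.0, 1.0, size=n)     # error term: mean zero, independent of X
Y = f(X) + eps                         # observed responses follow Y = f(X) + eps
```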

2.1.1. Why Estimate $f$ ?

  • Why do we estimate $f$ ?
    1) Prediction: $Y$
    2) Inference: Relationship between $X$ & $Y$

Prediction

  • $\text{Prediction of } \; Y : \hat{Y} = \hat{f}(X)\quad (\hat{f} : \text{Estimate of }f) $
    $\text{Accuracy of } \; \hat{Y} \text{ depends on reducible & irreducible errors}$;

  • with a given (fixed) estimate $\hat{f}$ and a fixed set of predictors $X$
    (the true function $f$ is also fixed by definition),
    \(\underbrace{E[Y-\hat{Y}]^2}_{MSE} = \underbrace{[f(X)-\hat{f}(X)]^2}_{\text{reducible error}} + \underbrace{Var(\epsilon)}_{\text{irreducible error}}\)

  • derivation
    \(\begin{align*} E[Y-\hat{Y}]^2 &= E\left[\{f(X)+\epsilon\} - \hat{f}(X) \right]^2 \\ &= E\left[ \{f(X)-\hat{f}(X)\}^2 + 2 \epsilon \ (f(X)-\hat{f}(X)) + \epsilon^2 \right] \\ &= E\left[\{f(X)-\hat{f}(X)\}^2\right] + 2E[\epsilon]\cdot E\left[f(X)-\hat{f}(X)\right] + E[\epsilon^2] \\ & \quad \quad \scriptstyle f, \hat{f} \text{ and } X \text{ are fixed, } E(\epsilon)=0 \text{ and } E(\epsilon^2)\ =\ Var(\epsilon) \\ &= [f(X)-\hat{f}(X)]^2 + Var(\epsilon) \end{align*}\)

  • Reducible error: controllable; can be reduced by selecting a better model for $\hat{f}$
    Irreducible error: uncontrollable; inherent in the data, since $\epsilon$ cannot be predicted from $X$
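
  • A quick Monte Carlo check of this decomposition, assuming a hypothetical true $f$, a deliberately imperfect fixed estimate $\hat{f}$, and a made-up noise level:

```python
import numpy as np

rng = np.random.default_rng(1)

f     = lambda x: 2.0 + 3.0 * np.sin(x)   # hypothetical true f
f_hat = lambda x: 2.0 + 2.5 * np.sin(x)   # a fixed, imperfect estimate of f
sigma = 1.0                               # standard deviation of the error term

x = 1.3                                   # a fixed value of the predictor
y = f(x) + rng.normal(0.0, sigma, size=200_000)   # repeated draws of Y at this x

mse         = np.mean((y - f_hat(x)) ** 2)        # E[Y - Y_hat]^2, estimated
reducible   = (f(x) - f_hat(x)) ** 2              # [f(x) - f_hat(x)]^2
irreducible = sigma ** 2                          # Var(eps)
print(mse, reducible + irreducible)               # the two should nearly agree
```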

Inference

  • interested in: Which predictors are associated with the response?
    What is the relationship between the response and each predictor?
    Can the relationship between Y and each predictor be adequately summarized
    using a linear equation, or is the relationship more complicated?

  • Characteristics of linear & nonlinear models:
    1) Linear : Relatively Simple & interpretable inference but poor prediction
    (when true function $f$ is nonlinear)
    2) Nonlinear : More accurate but difficult to interpret
    $\rightarrow$ Depends on your Goal of analysis

2.1.2. Estimating $f$ : By Using Training Data

  • Estimation Methods:
    1) Parametric: Assume a specific functional form for $f$ (can be linear or nonlinear)
    2) Non-parametric: No Assumption; Data-driven method (nonlinear)

Parametric Methods: Model-based approach

  • Step 1: Assumption about the functional form of $f$
    Training data consist of n obs. {$(x_1,y_1), (x_2,y_2), \ldots , (x_n,y_n)$}
    where $x_i = ( x_{i1}, x_{i2}, \ldots , x_{ip} ) ^T$
    with $(p+1)$ coefficients $\ \beta_0, \beta_1, \ldots, \beta_p$
    $\rightarrow $ linear form $ f(X)= \beta_0 + \beta_1X_1 + \cdots + \beta_pX_p$

  • Step 2: Estimate Parameters
    $Y \approx \hat{\beta}_0 + \hat{\beta}_1X_1 + \cdots + \hat{\beta}_pX_p = \hat{f}(X)$
    $\rightarrow $ most common approach to fitting the model: (ordinary) least squares
    (a minimal two-step sketch appears after this list)

  • Advantage:
    Simplifies the problem of estimating $f$ to estimating a set of parameters

  • Disadvantage: the specific form chosen for $f$
    may not match the true $f$: poor estimation & poor prediction
    $\rightarrow\;$ remedy: a flexible parametric model (a model with more parameters)
    • Flexible model $\equiv$ Complex model,
      but the overfitting problem still remains:
      the model may follow the errors (noise) too closely
  • Underfitting vs Overfitting problems
    underfit: the model is not flexible enough and fails to capture the true structure of $f$
    overfit: the model is more complex than the true $f$ and follows the noise too closely
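
A minimal sketch of the two parametric steps referenced above (assume a linear form, then estimate the coefficients by least squares); the simulated data and true coefficients are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# simulated training data with p = 2 predictors (illustrative only)
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, 2.0, -0.5])               # beta_0, beta_1, beta_2
y = beta_true[0] + X @ beta_true[1:] + rng.normal(0.0, 1.0, size=n)

# Step 1: assume the linear form f(X) = beta_0 + beta_1 X_1 + beta_2 X_2
X_design = np.column_stack([np.ones(n), X])          # prepend an intercept column

# Step 2: estimate the (p + 1) coefficients by (ordinary) least squares
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)                                      # should be close to beta_true
```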

Non-parametric Methods: Data-driven approach

  • No assumptions about functional form of $f$
    $\Rightarrow $ potential to accurately fit a wider range of possible shapes of $f$
  • Disadvantage:
    1) A large number of obs. is required to obtain an accurate estimate of $f$, relative to a parametric method
    2) Risk of overfitting $\Leftrightarrow$ level of smoothness; model complexity
    (how to determine it?)

Example of Nonparametric model: KNN Reg.

  • Idea: Similar inputs have Similar outputs
    Steps: when predicting $\hat{Y}_0$ at input $X_0$
    • 1) Determine $K$ (the level of smoothness)
    • if $K = N$: $\hat{Y}_0 = \bar{Y}$, the overall mean of the training responses, for every input (smoothest fit)
      if $K = 1$: the fit passes through every training point; perfect fit, no training error

    • 2) find the $K$ training obs. closest to the target input point ($X_0$), using Euclidean distance

    • 3) average the $y$ values of these $K$ neighbors (see the sketch after this list)
      \(\rightarrow\hat{Y}_0 = \frac{1}{K}\sum_{x_i\in \mathcal N_0} y_i\)
  • Effect of $K$ (tuning parameter/ model flexibility parameter)
    • decides the level of smoothness (under or overfitting)
    • As $K \uparrow \; \Rightarrow \text{flexibility} \downarrow$ : the fitted line is simple
    • As $K \downarrow \; \Rightarrow \text{flexibility} \uparrow$ : the fitted line is wiggly
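
A minimal KNN regression sketch following the three steps above (Euclidean distance, plain averaging); the training data here are simulated and purely illustrative:

```python
import numpy as np

def knn_regress(X_train, y_train, x0, K):
    # 2) find the K training observations closest to x0 (Euclidean distance)
    dist = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dist)[:K]
    # 3) average the responses of those K neighbors
    return y_train[nearest].mean()

rng = np.random.default_rng(3)
X_train = rng.uniform(0, 6, size=(100, 1))
y_train = np.sin(X_train[:, 0]) + rng.normal(0.0, 0.3, size=100)

x0 = np.array([2.0])
# 1) choose K: K = 1 interpolates the training data, K = n returns the overall mean
for K in (1, 5, 100):
    print(K, knn_regress(X_train, y_train, x0, K))
```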

2.1.3. Trade-off between Prediction Accuracy & Interpretability

  • tradeoff between flexibility and interpretability

[Figure: trade-off between flexibility and interpretability of statistical learning methods]

  • based on our goal:
    inference ($\rightarrow$ restrictive model)
    prediction ($\rightarrow$ flexible model)
    BUT NOT ALWAYS THE CASE: even for prediction, a less flexible model can sometimes give better predictions, because a highly flexible model may overfit

2.1.4. Supervised Vs. Unsupervised Learning

  • Supervised:
    for each observation $x_i$, there is an associated response measurement $y_i$
    goal is to fit a model, to predict or interpret(inference)

  • Unsupervised:
    no associated response $y_i$; it is not possible to fit a supervised model, since there is no response to predict
    goal is to understand the relationships between the variables or between the observations

2.1.5. Regression Vs. Classification Problems

  • Variables are either quantitative or qualitative (categorical);
    whether the response is quantitative or qualitative determines regression vs. classification
  • Many models can be used for both types of problems

2.2. Assessing Model Accuracy

  • Evaluate the performance of models on a given dataset and select the best model
  • Criteria, e.g. better prediction of $Y$

2.2.1. Measuring the Quality of Fit

Mean squared error (MSE)

  • Training MSE = \(\frac{1}{n} \sum_{i=1}^n (y_i - \hat{f}(x_i))^2\)
    the most commonly-used measure in the regression setting
    used when fitting (training) the model, but not our real interest;
    cannot serve as a measure for model selection

  • with training RSS = \(\sum(y_i - \hat{y_i})^2\), training MSE = \(\frac{RSS}{n}\)
    \(R^2 = 1 - \frac{RSS}{TSS}\)
    • as more $X$ variables are added, even if they are not important predictors,
      the training RSS & MSE decrease and $R^2$ increases, regardless of performance on test data
  • Test MSE \(= \frac{1}{m}\sum_{i=1}^m(y_i^0 - \hat{f}(x_i^0))^2\)
    computed on unseen test observations that were not used to train;
    select the model that gives the lowest test MSE.
    • Test MSE is typically larger than the training MSE,
      because we estimated $\hat{f}$ to minimize the training error
      (see the simulation sketch after this list)
    • if there’s NO test data: Sample re-use methods
      (e.g., bootstrap, cross-validation)
  • for Classification problem: Misclassification rate
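
A sketch of the training vs. test MSE behavior described above, using polynomial degree as the flexibility knob (the true $f$, noise level, and sample sizes are made up): the training MSE keeps decreasing with flexibility, while the test MSE is typically U-shaped.

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: np.sin(2 * x)                     # hypothetical true f

def make_data(n):
    x = rng.uniform(0, 3, size=n)
    return x, f(x) + rng.normal(0.0, 0.3, size=n)

x_tr, y_tr = make_data(50)                      # training data
x_te, y_te = make_data(1000)                    # unseen test data

for degree in (1, 3, 9, 15):                    # increasing model flexibility
    coefs = np.polyfit(x_tr, y_tr, degree)      # least-squares polynomial fit
    mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
    mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
    print(degree, round(mse_tr, 3), round(mse_te, 3))
```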

2.2.2. Bias-Variance Trade-off

  • expected test MSE($\scriptstyle\text{ideal measure}$)= $E[y_0-\hat{f}(x_0)]^2$ when $(x_0,y_0)$ is a test obs.
  • conceptually: $K$ different training sets would yield $K$ fitted functions $\hat{f}_k$
    Estimation of the expected test MSE by averaging over them (see the simulation sketch after this subsection): \(\\ \hat{E}[y_0 - \hat{f}(x_0)]^2 = \frac{1}{K} \sum_{k=1}^K[y_0-\hat{f}_k(x_0)]^2\)

  • derivation
    For a given $x_0$ (where $\epsilon$ is the irreducible error), \(\\ \begin{align*} E[y_0 - \hat{f}(x_0)]^2 &= E [ \{ y_0 -E(\hat{f}(x_0)) \} - \{ \hat{f}(x_0) - E(\hat{f}(x_0)) \} ]^2 \\ (1) \cdots &= E [ y_0 -E(\hat{f}(x_0)) ] ^2 \\ (2) \cdots & \; + E [\hat{f}(x_0) - E(\hat{f}(x_0)) ] ^2 \\ (3) \cdots & \; - 2E[ \{ y_0 -E(\hat{f}(x_0)) \} \{ \hat{f}(x_0) - E(\hat{f}(x_0)) \} ] \end{align*}\)

  • (1) \(\\ \begin{align*} &= E [(f(x_0) + \epsilon - E(\hat{f}(x_0)) ] ^2 \\ &= E[ f(x_0)^2 +\epsilon^2 + [E(\hat{f}(x_0)) ]^2 + 2f(x_0)\epsilon -2f(x_0)E(\hat{f}(x_0)) - 2\epsilon E(\hat{f}(x_0)) ] \\ & \quad \scriptstyle\text{using } E(\epsilon)\ =\ 0 \text{ and } E(\epsilon^2)\ =\ Var(\epsilon) \\ &= [E(\hat{f}(x_0)) - f(x_0)] ^2 + Var(\epsilon) \\ & \quad \scriptstyle f(x_0)\, -\, E(\hat{f}(x_0)) \text{ is the bias of } \hat{f}(x_0) \\ &= [Bias(\hat{f}(x_0))]^2 + Var(\epsilon) \end{align*}\)

  • (2) = $Var(\hat{f}(x_0)); $ model variance ($\scriptstyle\text{different fits from the randomness of training sets}$)
  • (3) = 0

$\therefore E[y_0 - \hat{f}(x_0)]^2 = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon) $

  • $Var(\hat{f}(x_0))$ : model variance

  • $[Bias(\hat{f}(x_0))]^2$ : squared systematic model error
    (caused by model assumptions; e.g. the true $f$ is nonlinear but a simpler form is assumed)

  • $ Var(\epsilon)$ : Irreducible error

Model Flexibility $\uparrow$ $\Rightarrow$ model variance $\uparrow$ & model bias $\downarrow$
Model Flexibility $\downarrow$ $\Rightarrow$ model variance $\downarrow$ & model bias $\uparrow$

[Figure: squared bias, variance, and test MSE as a function of model flexibility, for different datasets]

$\rightarrow$ the optimal model flexibility differs for each dataset
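
The decomposition can be estimated by simulation, as in the $K$-training-sets idea above: refit $\hat{f}$ on many independent training sets and read off the variance and squared bias at a fixed $x_0$ (the true $f$, noise level, and model below are made-up choices):

```python
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: np.sin(2 * x)                     # hypothetical true f
sigma, x0, degree = 0.3, 1.0, 3                 # noise sd, test point, model flexibility

preds = []
for _ in range(2000):                           # many independent training sets
    x = rng.uniform(0, 3, size=50)
    y = f(x) + rng.normal(0.0, sigma, size=50)
    coefs = np.polyfit(x, y, degree)            # refit the model on each training set
    preds.append(np.polyval(coefs, x0))         # the prediction f_hat(x0)
preds = np.array(preds)

variance = preds.var()                          # Var(f_hat(x0))
bias_sq  = (preds.mean() - f(x0)) ** 2          # [Bias(f_hat(x0))]^2
y0 = f(x0) + rng.normal(0.0, sigma, size=preds.size)   # fresh test responses at x0
test_mse = np.mean((y0 - preds) ** 2)           # estimate of E[y0 - f_hat(x0)]^2
print(test_mse, variance + bias_sq + sigma ** 2)        # the two should nearly agree
```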

Trade-off in KNN Reg.

  • given $Y = f(X) + \epsilon$, $E(\epsilon) = 0 $ and $Var(\epsilon) = \sigma^2$
  • Expected test MSE at $x_0$:
    \(E[ (Y- \hat{f_k}(x_0) )^2 | X = x_0 ] = Var(\hat{f}(x_0)) + [Bias(\hat{f}(x_0))]^2 + Var(\epsilon)\)

  • (1) \(\\ \begin{align*} Var(\hat{f_k}(x_0)) &= Var\left(\frac 1 K \sum_{i \in \mathcal N (x_0)} Y_i \right)\\ &= \frac 1 {K^2} \sum Var( Y_i ) \\ &= \frac 1 {K^2} \sum Var( f( x_i) + \epsilon_i ) \\ &= \frac 1 {K^2} \sum \sigma^2 = \frac 1 {K^2} \cdot K \cdot \sigma^2 = \frac {\sigma^2} K \end{align*}\)

  • (2) \(\\ \begin{align*} Bias(\hat{f_k}(x_0)) &= f(x_0) - E(\hat{f}(x_0)) \\ &= f(x_0) - E\left[\frac1 K \sum_{i \in \mathcal N (x_0)} Y_i \right] \\ &= f(x_0)- \frac 1 K \sum E (f(x_i)+\epsilon_i) \\ &= f(x_0) - \frac 1 K \sum f(x_i) \end{align*}\)

\(\therefore\) Expected test MSE = \(\sigma ^2 + \left[f(x_0) - \frac{1}{K}\sum_{x_i \in \mathcal N (x_0)} f(x_i) \right]^2 + \frac {\sigma^2}{K}\)

  • as $K$ increases (large $K$),
    model complexity and model variance decrease, while the (squared) bias increases;
    the fit becomes a simpler, smoother model (see the sketch below)
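
A small numerical check of the variance term above: holding the training inputs (and hence the neighborhood of $x_0$) fixed and redrawing only the errors, the variance of the KNN fit at $x_0$ should match $\sigma^2/K$ (setup values are made up):

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(2 * x)                     # hypothetical true f
sigma, n, x0 = 0.3, 200, 1.5

x = rng.uniform(0, 3, size=n)                   # training inputs held fixed
for K in (1, 5, 25, 100):
    nearest = np.argsort(np.abs(x - x0))[:K]    # fixed neighborhood N(x0)
    preds = []
    for _ in range(2000):                       # redraw only the error terms
        y = f(x) + rng.normal(0.0, sigma, size=n)
        preds.append(y[nearest].mean())         # KNN fit at x0
    print(K, round(np.var(preds), 5), round(sigma ** 2 / K, 5))
```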

2.2.3. The Classification Setting

  • training misclassification rate \(= \frac 1 n \sum_{i=1}^n I(y_i \not = \hat y_i )\)
    $\hat y_i$: predicted class label for the $i_{th}$ observation using $\hat f$
    $I(y_i \not = \hat y_i )$: indicator variable;
    1 if $y_i \not = \hat y_i$ and 0 if $y_i = \hat y_i$

  • test error = $ Ave(I(y_0 \not = \hat y_0))$ where;
    $\hat y_0$ is the predicted class label for the test observation with predictor $x_0$

The Bayes Classifier - Two-class Problem

  • Assign a test observation with predictor vector $x_0$ to the class $j$ for which;
    conditional probability $Pr(Y=j|X=x_0)$ is largest.
    with only two possible response values, say class 1 or class 2,
    predict class 1 if $Pr(Y=1|X=x_0)$ > 0.5, and class 2 otherwise.

  • Bayes error rate
    since the Bayes classifier chooses the class with the largest $Pr(Y=j|X=x_0)$,
    the error rate at $X=x_0$ will be $1 - \max_j Pr(Y=j|X=x_0)$

    • Overall Bayes error rate = \(1 - E\left[\max_j Pr(Y=j|X) \right]\)
      expectation averages the probability over all possible values of X
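
A sketch of the Bayes classifier in a setting where the conditional probabilities are known exactly (a made-up two-class Gaussian setup with equal priors), including an empirical check of the Bayes error rate:

```python
import numpy as np

rng = np.random.default_rng(8)

# Made-up setup: X | Y=1 ~ N(2, 1), X | Y=2 ~ N(0, 1), Pr(Y=1) = Pr(Y=2) = 0.5,
# so Pr(Y = j | X = x) is known in closed form.
def pr_class1(x):
    d1 = np.exp(-0.5 * (x - 2.0) ** 2)          # density of X given Y = 1 (up to a constant)
    d2 = np.exp(-0.5 * (x - 0.0) ** 2)          # density of X given Y = 2
    return d1 / (d1 + d2)                       # Pr(Y = 1 | X = x) under equal priors

n = 200_000
y = rng.integers(1, 3, size=n)                  # true labels, 1 or 2
x = rng.normal(np.where(y == 1, 2.0, 0.0), 1.0)

y_bayes = np.where(pr_class1(x) > 0.5, 1, 2)    # Bayes classifier: pick the larger posterior
print(np.mean(y_bayes != y))                    # empirical Bayes error rate
print(1 - np.mean(np.maximum(pr_class1(x), 1 - pr_class1(x))))   # 1 - E[max_j Pr(Y=j|X)]
```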

K-Nearest Neighbors (KNN)

  • \(Pr(Y=j|X=x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i=j)\)
    conditional probability for class j
    $\mathcal{N}_0$: K points in the training data that are closest to $x_0$
    KNN classifies the test obs. to the class with the largest probability.
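
A minimal KNN classifier sketch matching the probability estimate above (the two simulated classes below are made up):

```python
import numpy as np

def knn_classify(X_train, y_train, x0, K):
    # N_0: the K training points closest to x0 (Euclidean distance)
    nearest = np.argsort(np.linalg.norm(X_train - x0, axis=1))[:K]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    probs = counts / K                          # estimated Pr(Y = j | X = x0) for each class j
    return labels[np.argmax(probs)], dict(zip(labels, probs))

rng = np.random.default_rng(7)
X_train = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),    # class 1 cloud
                     rng.normal(2.0, 1.0, size=(50, 2))])   # class 2 cloud
y_train = np.array([1] * 50 + [2] * 50)

print(knn_classify(X_train, y_train, np.array([1.0, 1.0]), K=5))
```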