ISLR - Chapter 3. Linear Regression
- Chapter 3. Linear Regression
- 3.1. Simple Linear Regression
- 3.2. Multiple Linear Regression
- 3.3. Other Considerations in the Regression Model
- 3.5. Comparison with K-Nearest Neighbors
Chapter 3. Linear Regression
- input variable X can be:
Quantitative or Transformations of quantitative inputs
Basis expansions leading to polynomial representation
Dummy coding of qualitative variables: for G groups, G-1 dummy variables required
Interactions between inputs ( $X_3 = X_1 \cdot X_2$ )
3.1. Simple Linear Regression
-
$Y \approx \beta_0 + \beta_1 X$
read as "regressing Y on X" (or "Y onto X"),
for predicting a quantitative response Y
on the basis of a single predictor variable X;
assumes a linear relationship between X and Y -
$\beta$: model coefficients or parameters
this case, two unknown constants
$\beta_0$ = intercept
$\beta_1$ = slope
3.1.1. Estimating the Coefficients
-
for given data of n observations: {$(x_1,y_1), (x_2,y_2), \ldots, (x_n,y_n)$}
\(\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i\) for prediction of Y on ith value of X,
\(e_i = y_i - \hat{y}_i\) for representation of ith residual,
\(\bar{y} \equiv \frac{1}{n}\sum_{i=1}^n y_i\) and \(\bar{x} \equiv \frac{1}{n}\sum_{i=1}^n x_i\) are sample means -
the least squares coefficient estimates
\(\begin{align*} RSS &= e_1^2 + e_2^2 + \cdots + e_n^2 \\ &= \sum_{i=1}^n(y_i-\beta_0-\beta_1 x_i)^2 \end{align*}\) is a quadratic function of the coefficients $\beta_0, \beta_1$,
then \(\frac{\partial RSS}{\partial\beta_0} = -2\sum(y_i - \beta_0 - \beta_1 x_i)\)
in least squares,
\(\sum y_i - n\hat\beta_0 - \sum\hat\beta_1 x_i = 0\)
\(\begin{align*} \Leftrightarrow \hat\beta_0 &= \frac{1}{n}\sum_{i=1}^n y_i - \frac{\hat\beta_1}{n}\sum_{i=1}^n x_i \\ &= \bar{y} - \hat\beta_1\bar{x} \\ \therefore\hat\beta_0 &= \bar{y} - \hat\beta_1 \bar{x} \end{align*}\)
for the other partial derivative \(\frac{\partial RSS}{\partial\beta_1} = -2\sum x_i(y_i - \beta_0 - \beta_1 x_i)\),
substituting $ \bar{y} - \hat\beta_1 \bar{x} $ for $ \hat\beta_0$:
\(\sum x_i(y_i - \hat\beta_0 - \hat\beta_1 x_i) = 0 \\ \sum x_i(y_i - \bar{y} - \hat\beta_1(x_i - \bar{x})) = 0 \\ \sum x_i(y_i - \bar{y}) - \hat\beta_1 \sum x_i(x_i - \bar{x}) = 0 \\ \Leftrightarrow \sum x_i(y_i - \bar{y}) = \hat\beta_1 \sum x_i(x_i - \bar{x})\)
\(\begin{align*} \therefore \hat\beta_1 &= \frac{\sum x_i(y_i - \bar{y})} {\sum x_i(x_i - \bar{x})} \\ &= \frac{\sum(x_i-\bar{x})(y_i - \bar{y})} {\sum(x_i - \bar{x})^2} \end{align*}\)
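A minimal NumPy sketch (not from the book) of these closed-form estimates, using a toy dataset with arbitrary true coefficients:

```python
# Closed-form least squares estimates for simple linear regression (toy data).
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)   # true beta0 = 2, beta1 = 0.5 (arbitrary)

x_bar, y_bar = x.mean(), y.mean()
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
beta0_hat = y_bar - beta1_hat * x_bar
print(beta0_hat, beta1_hat)   # should be close to 2 and 0.5
```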
3.1.2. Assessing the Accuracy of the Coefficient Estimates
-
when the true relationship is $Y = f(X) + \epsilon$,
we approximate f with the linear function $Y = \beta_0 + \beta_1 X + \epsilon$ -
are these unbiased estimates?
analogy: using the sample mean $\hat\mu$ to estimate $\mu$
is unbiased in the sense that, on average, we expect $\hat\mu$ to equal $\mu$.
for one particular set of obs., $\hat\mu$ may not equal $\mu$ exactly,
but if we could average a huge number of estimates from many sets of obs.,
this average would exactly equal $\mu$.
in the same way, $\hat\beta_0$ and $\hat\beta_1$ are unbiased estimators. -
how accurate is the sample mean $\hat\mu$ as an estimate of $\mu$?
for a single estimate $\hat\mu$, by computing the standard error $SE(\hat\mu)$:
\(Var(\hat\mu) = SE(\hat\mu)^2 = \frac{\sigma^2}{n}\) -
Assume that $\epsilon_i$ have common variance $\sigma^2$ and are uncorrelated,
standard errors of unbiased estimators in linear regression
\(SE(\hat\beta_0)^2 = \sigma^2\left[\frac{1}{n}+\frac{\bar{x}^2}{\sum_{i=1}^n(x_i-\bar{x})^2}\right]\),
\(SE(\hat\beta_1)^2 = \frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2}\) -
from the data, $\sigma$ is estimated by the residual standard error
RSE = \(\sqrt{RSS/(n-2)}\) -
standard error and confidence interval
confidence interval is defined as a range of values such that with given probability,
the range will contain the true unknown value of the parameter
defined in terms of lower and upper limits computed from the sample of data
95% confidence interval for $\beta_1$: \(\hat\beta_1 \pm 2 \cdot SE(\hat\beta_1)\)
95% confidence interval for $\beta_0$: \(\hat\beta_0 \pm 2 \cdot SE(\hat\beta_0)\) -
Hypothesis Test
null hypothesis $H_0$: no relationship between X and Y; $\beta_1 = 0$
alternative hypothesis $H_a$: some relationship between X and Y; $\beta_1 \ne 0$
How far from 0 must $\hat\beta_1$ be before we can be confident that the true $\beta_1$ is non-zero? -
t-statistic
\(t = \frac{\hat\beta_1 - 0}{SE(\hat\beta_1)}\)
measures the number of standard deviations that $\hat\beta_1$ is away from 0 -
p-value
prob. of observing any value equal to $|t|$ or larger in absolute value, assuming $\beta_1 = 0$
with small p-value, infer that there is an association between the predictor and the response,
we can reject the null hypothesis
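A sketch of the standard errors, approximate 95% intervals, t-statistic and p-value, continuing the toy x, y, beta0_hat, beta1_hat from the sketch above (the toy data itself is an assumption, not from the book):

```python
# Residual standard error, SEs, approx. 95% CIs, t-statistic and p-value for beta1.
import numpy as np
from scipy import stats

n = len(x)
y_hat = beta0_hat + beta1_hat * x
rss = np.sum((y - y_hat) ** 2)
rse = np.sqrt(rss / (n - 2))                      # estimate of sigma

sxx = np.sum((x - x.mean()) ** 2)
se_beta1 = rse / np.sqrt(sxx)
se_beta0 = rse * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)

ci_beta1 = (beta1_hat - 2 * se_beta1, beta1_hat + 2 * se_beta1)   # approx. 95% CI

t_stat = beta1_hat / se_beta1                     # H0: beta1 = 0
p_value = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)
print(rse, se_beta1, ci_beta1, t_stat, p_value)
```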
3.1.3. Assessing the Accuracy of the Model
Residual Standard Error (RSE)
- RSE = \(\sqrt{\frac{1}{n-2}RSS} = \sqrt{\frac{1}{n-2}\sum_{i=1}^n(y_i - \hat{y}_i)^2}\)
an estimate of the standard deviation of \(\epsilon\)
a measure of the lack of fit of the model to the data
$R^2$ Statistics
-
\(R^2 = 1 - \frac{RSS}{TSS}\)
TSS = $\sum_{i=1}^n(y_i-\bar{y})^2$; the total variance in the response Y, the amount of variability inherent in the response before the regression is performed
RSS; amount of variability that is left unexplained after performing the regression
TSS-RSS; amount of variability in the response that is explained
$R^2$; proportion of variability in Y that can be explained using X -
\(Cor(X,Y) = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^n(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^n(y_i-\bar{y})^2}}\)
is also a measure of the linear relationship between X and Y
we can use $r = Cor(X,Y) $ instead of $R^2$ to assess the fit of the linear model
in simple linear regression, $R^2 = r^2$,
i.e. the squared correlation and the $R^2$ statistic are identical.
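A quick sketch verifying $R^2 = 1 - RSS/TSS$ and the SLR identity $R^2 = r^2$, continuing the toy x, y, y_hat from the sketches above (an assumed toy dataset, not from the book):

```python
# R^2 via 1 - RSS/TSS, and its equality with the squared correlation in SLR.
import numpy as np

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
r_squared = 1 - rss / tss

r = np.corrcoef(x, y)[0, 1]        # sample correlation Cor(X, Y)
print(r_squared, r ** 2)           # identical (up to rounding) in simple linear regression
```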
3.2. Multiple Linear Regression
- Model: $ Y = f(X) + \epsilon = \beta_0 + \sum_{j=1}^p \beta_j X_j + \epsilon$
$\Rightarrow$ Prediction of Y from X (p input variables)
$\quad$ f: Regression function $E(Y|X)$
3.2.1. Estimating the Regression Coefficients
-
prediction: $\hat{y} = \hat\beta_0 + \sum_{j=1}^p \hat\beta_j X_j$
-
How to estimate coefficients $\beta = (\beta_0, \beta_1, \ldots , \beta_p)^T$?
Least Squares $\hat\beta$: $\beta$ minimizing RSS
when y is an (n by 1) vector and X is an (n by (p+1)) matrix,
RSS = $ ( y - X\hat\beta)^T ( y - X\hat\beta) $,
Least square estimator (LSE): $ \hat\beta = (X^T X)^{-1} X^T Y $ -
derivation
\(\begin{align*} RSS &= y^T y - \hat\beta^T X^T y - y^T X \hat\beta + \hat\beta^T X^T X \hat\beta \\ &= y^T y - 2\hat\beta^T X^T y + \hat\beta^T X^T X \hat\beta \end{align*}\)
taking the derivative with respect to $\hat\beta$ and setting it to zero to minimize RSS,
\(\frac{\partial RSS}{\partial \hat\beta} = -2X^T y + 2X^T X\hat\beta = 0\)
we get ‘normal equations’ $(X^T X)\hat\beta = X^T y$
$\therefore \hat\beta = (X^T X)^{-1} X^T y $
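A minimal NumPy sketch of solving the normal equations on a toy design matrix (the dimensions and true coefficients are arbitrary assumptions):

```python
# Multiple linear regression: solve (X^T X) beta = X^T y for the OLS estimate.
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 2
X_raw = rng.normal(size=(n, p))
X = np.column_stack([np.ones(n), X_raw])          # n x (p+1), first column = intercept
beta_true = np.array([1.0, 2.0, -3.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # (X^T X)^{-1} X^T y
# np.linalg.lstsq(X, y, rcond=None) is the numerically preferred equivalent
print(beta_hat)                                   # approx [1, 2, -3]
```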
Ordinary Least Squares Estimator
-
With Assumption:
1) $Y_i$ responses are uncorrelated and Var($Y_i$) = $\sigma^2$
$\quad$($\equiv$ $\epsilon_i$ are independent and Var($\epsilon_i$) = $\sigma^2$)
2) $X = (X_1, \ldots, X_p)^T$ is fixed (not random) -
By (1) & (2), for OLS estimator $\hat\beta$;
\(\begin{align*} \hat\beta &= (X^TX)^{-1}X^TY & \cdots Y=X\beta + \epsilon \\ &= \beta + (X^TX)^{-1}X^T\epsilon \end{align*}\)
$\hat\beta$ is a linear estimator of $\beta$
\(\begin{align*} E(\hat\beta) &= E(\beta) + (X^TX)^{-1}X^T E(\epsilon) \\ & \quad \scriptstyle\text{ where } E[\epsilon] = 0 \\ &= \beta \end{align*}\)
so that $\hat\beta$ is an unbiased estimator of $\beta$
\(\begin{align*} Var(\hat\beta) &= E\left[(\hat\beta - E(\hat\beta))(\hat\beta - E(\hat\beta))^T\right] \\ &= E\left[ (X^TX)^{-1}X^T\epsilon\epsilon^T X(X^TX)^{-1} \right] \\ &= (X^TX)^{-1}X^TE(\epsilon\epsilon^T)X(X^TX)^{-1}\\ &= \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1}\\ &=\sigma^2(X^T X)^{-1} \end{align*}\)
while $\hat\sigma^2$ as estimator of the variance of $\epsilon$ (= $\sigma^2$):
\(\begin{align*} \hat\sigma^2 &= \frac 1 {n-p-1} \sum_{i=1}^n (y_i - \hat y_i)^2 \\ &= \frac1{n-p-1}\sum_{i=1}^n e_i^2 \\ \rightarrow E(\hat\sigma^2) &= \sigma^2 \end{align*}\)
$\therefore \frac{RSS}{n-p-1}$ is an unbiased estimator of $\sigma^2$ -
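A sketch of $\hat\sigma^2$ and $\widehat{Var}(\hat\beta) = \hat\sigma^2(X^TX)^{-1}$, continuing X, y, beta_hat from the multiple-regression sketch above (toy data is assumed):

```python
# Unbiased estimate of sigma^2 and the estimated covariance matrix of the OLS estimator.
import numpy as np

n, p_plus_1 = X.shape
p = p_plus_1 - 1
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / (n - p - 1)          # RSS / (n - p - 1)

cov_beta_hat = sigma2_hat * np.linalg.inv(X.T @ X)        # sigma^2 (X^T X)^{-1}
se_beta_hat = np.sqrt(np.diag(cov_beta_hat))              # standard error of each coefficient
print(sigma2_hat, se_beta_hat)
```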
With Assumption:
3) $\epsilon \overset{\text{iid}}{\sim} N(0,\sigma^2)$ : normal distribution assumption on the errors -
$\hat\beta\sim MVN(\beta,\sigma^2(X^TX)^{-1})$, (equivalently $Y\sim MVN(X\beta,\sigma^2 I)$)
$\frac{(n-p-1)\hat\sigma^2}{\sigma^2}\sim\chi^2_{n-p-1}$
$\hat\beta$ and $\hat\sigma^2$ are independent
Gauss-Markov Theorem
- Assumption: $E(\epsilon_i) = 0, Var(\epsilon_i) = \sigma^2 < \infty\text{, } \epsilon_i$’s are independent
- among all linear unbiased estimators $\tilde{\beta} = Cy$ with $E(\tilde{\beta}) = \beta$,
$Var(\hat\beta) \le Var(\tilde{\beta})$ for the least squares estimator $\hat\beta$ - i.e. the G-M theorem says that the LSE $\hat\beta$ is the best linear unbiased estimator (BLUE)
- However, there may exist biased estimators with smaller MSE than the LSE.
Properties of Good estimators
- unbiasedness
- efficiency (small variance)
- consistency (as n goes to infinity, the estimator converges to the true parameter)
- sufficiency
3.2.2. Important Questions
3.2.2.1. Hypothesis Test
-
1) Test for all coefficients
$H_0$: all slope coefficients are zero; $\beta_1 = \beta_2 = \cdots = \beta_p = 0$
$H_a$: at least one $\beta_j$ is non-zero -
Test statistic: $ F = \frac{(TSS-RSS)/p}{RSS/(n-p-1)}$
where TSS = $\sum_{i=1}^n(y_i-\bar y)^2$ and RSS = $\sum_{i=1}^n(y_i-\hat{y}_i)^2$ -
from the linear model assumptions, \(E\left[RSS/(n-p-1)\right] = \sigma^2\)
- if $H_0$ is true, \(E\left[(TSS-RSS)/p\right] = \sigma^2\), so the F-statistic value is close to 1
- if $H_a$ is true, \(E\left[(TSS-RSS)/p\right] > \sigma^2\), so the F-statistic value is greater than 1
-
2) Test for a particular subset of coefficients
$H_0: \beta_{p-q+1} = \beta_{p-q+2} = \cdots = \beta_{p} = 0$
$H_a$: at least one of these coefficients is non-zero -
Full model: $Y=\beta_0 + \beta_1X_1 + \cdots + \beta_pX_p + \epsilon$
Reduced model under $H_0: Y= \beta_0 + \beta_1X_1 + \cdots + \beta_{p-q}X_{p-q} + \epsilon$ -
Test statistic: F = \(\frac{(RSS_0 - RSS)/q}{RSS/(n-p-1)}\)
where RSS is from full model and $RSS_0$ is from reduced model -
Check the F-test first, before t-tests for each $\beta_j$
F-test: test for all coefficients; t-test: test for an individual coefficient
when the F-test is significant (the overall $H_a$ is supported)
$\Rightarrow$ perform t-tests, but still need to control $\alpha$
but if the F-test is not significant $\Rightarrow$ don't trust individual t-test results; conclude all coefficients are zero
- if $p > n$, there are more coefficients $\beta_j$ to estimate than observations,
we cannot fit the MLR model using least squares
and F-statistic cannot be used
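A sketch of the overall F-test, continuing the toy X, y, beta_hat from the multiple-regression sketches above (the data is an assumption; the p-value comes from the F distribution with p and n-p-1 degrees of freedom):

```python
# Overall F-test of H0: beta_1 = ... = beta_p = 0.
import numpy as np
from scipy import stats

n, p_plus_1 = X.shape
p = p_plus_1 - 1
rss = np.sum((y - X @ beta_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)

f_stat = ((tss - rss) / p) / (rss / (n - p - 1))
p_value = stats.f.sf(f_stat, dfn=p, dfd=n - p - 1)
print(f_stat, p_value)
```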
3.2.2.2. Variable Selection
- often, the response is only associated with a subset of the predictors,
  but we can't consider all $2^p$ models that contain subsets of the p variables
  $\Rightarrow$ subset selection, e.g. forward, backward, mixed selection
- Criteria to judge the quality of a model
  - statistics, e.g. Mallow's $C_p$, Akaike information criterion (AIC), Bayesian information criterion (BIC), adjusted $R^2$
  - model outputs: residuals, patterns
3.2.2.3. Model fit
- most common measures of model fit: RSE and $R^2$
SLR model: $R^2$ is the square of the correlation between the response and the predictor
MLR model: $R^2 = Cor(Y,\hat{Y})^2$, the squared correlation between the response and the fitted values;
the fitted linear model maximizes this correlation among all possible linear models
3.2.2.4. Prediction Errors
-
Uncertainty between $\hat Y$ and $f(X)$
the least squares plane $\hat{Y} = X^T \hat\beta$
the true population regression plane $f(X) = X^T\beta$
(1) Variation due to $\hat\beta$ (model variance)
in an ideal situation, we could draw many training datasets from the population and obtain many estimates $\hat f(x)$ and $\hat\beta$
$\scriptstyle\text{Confidence interval}$ C.I. for the mean response $f(x)$:
$E(\hat Y) = x^T\beta = f(x)$, $Var(\hat Y) = \sigma^2 x^T( X^TX)^{-1}x$
$\therefore (1-\alpha)100$% C.I. = $\hat Y \pm t_{(1-\alpha/2, n-p-1)} \hat\sigma\sqrt{x^T(X^TX)^{-1}x}$ -
(2) Model bias:
caused by assuming a linear model for f(x).
$\rightarrow$ (1)(2) are reducible error -
Uncertainty between Y and $\hat Y$
random error $\epsilon$ is in the model, irreducible error
$\scriptstyle\text{Prediction interval}$ P.I. = reducible + irreducible error
for a new, unseen test obs. $x_0 = (1, x_{01}, \ldots, x_{0p})^T$, $\hat Y_0 = x_0^T \hat\beta$
$Var(Y_0 - \hat Y_0) = \sigma^2 + \sigma^2 x_0^T( X^TX)^{-1}x_0$ $\scriptstyle\text{(irreducible + reducible)}$
$\therefore (1-\alpha)100$% P.I. for $Y_0$ = $\hat Y_0 \pm t_{(1-\alpha/2, n-p-1)} \hat\sigma\sqrt{1+x_0^T(X^TX)^{-1}x_0}$
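A sketch computing both intervals at a new point, continuing the toy X, y, beta_hat, sigma2_hat from the sketches above (the point x0 is an arbitrary assumed value):

```python
# Confidence interval for f(x0) vs. prediction interval for Y0 at a new point x0.
import numpy as np
from scipy import stats

n, p_plus_1 = X.shape
p = p_plus_1 - 1
x0 = np.array([1.0, 0.5, -1.0])                 # (1, x01, x02): includes the intercept term
y0_hat = x0 @ beta_hat

t_crit = stats.t.ppf(0.975, df=n - p - 1)       # 95% -> alpha = 0.05
h0 = x0 @ np.linalg.inv(X.T @ X) @ x0           # x0^T (X^T X)^{-1} x0

ci = y0_hat + np.array([-1, 1]) * t_crit * np.sqrt(sigma2_hat * h0)          # for f(x0)
pi = y0_hat + np.array([-1, 1]) * t_crit * np.sqrt(sigma2_hat * (1 + h0))    # for Y0
print(ci, pi)                                   # the P.I. is always wider than the C.I.
```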
3.3. Other Considerations in the Regression Model
3.3.1. Qualitative Predictors
-
for a qualitative predictor with p levels, create p-1 dummy variables;
the level with no dummy variable of its own serves as the baseline (a small coding sketch follows) -
various dummy coding approaches lead to equivalent model fits, but with different coefficients and interpretations
that are designed to measure particular contrasts
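A small pandas sketch of dummy coding with a baseline level; the "region" factor and its levels are made up for illustration:

```python
# Dummy coding: a qualitative predictor with 3 levels becomes 2 dummy columns;
# the dropped level acts as the baseline.
import pandas as pd

df = pd.DataFrame({"region": ["East", "West", "South", "East", "South"]})
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)   # columns region_South, region_West; "East" is the baseline
```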
3.3.2. Extensions of the Linear Model
-
the standard linear regression model makes the restrictive assumptions of additivity and linearity,
which are often violated in practice. -
additivity: the effect of predictor $X_j$ on Y does not depend
on the values of the other predictors -
linearity: the change in Y associated with a one-unit change in $X_j$ is constant (constant slope), regardless of the value of $X_j$
-
Removing the Additive Assumption
in real-world problems, the effects of predictors are often not independent;
there can be a synergy or interaction effect.
by introducing an interaction term, we can relax the additive assumption (see the sketch after the example below) -
e.g.
\(\begin{align*} Y &= \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon \\ &= \beta_0 + (\beta_1 + \beta_3 X_2)X_1 + \beta_2 X_2 + \epsilon \\ &= \beta_0 + \tilde{\beta_1}X_1 + \beta_2 X_2 + \epsilon \end{align*}\) -
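A sketch of fitting such an interaction model by least squares on a toy dataset (the coefficient values and data generation are arbitrary assumptions):

```python
# Relax the additive assumption: add an interaction column X1*X2 to the design matrix.
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 3.0 * x1 * x2 + rng.normal(scale=0.5, size=n)

X_int = np.column_stack([np.ones(n), x1, x2, x1 * x2])      # main effects + interaction
beta_hat_int, *_ = np.linalg.lstsq(X_int, y, rcond=None)
print(beta_hat_int)    # approx [1, 2, -1, 3]
```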
hierarchical principle
if we include an interaction in a model, we should also include the main effects,
even if the p-values associated with their coefficients are not significant. -
Non-linear Relationships; e.g. polynomial regression
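A sketch of polynomial regression as a linear model in the basis expansion $(1, x, x^2)$; the toy data and true coefficients are arbitrary assumptions:

```python
# Degree-2 polynomial regression fit by ordinary least squares.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, size=200)
y = 1.0 - 0.5 * x + 0.8 * x**2 + rng.normal(scale=0.3, size=200)

X_poly = np.column_stack([np.ones_like(x), x, x**2])        # basis expansion of x
beta_hat_poly, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(beta_hat_poly)   # approx [1, -0.5, 0.8]
```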
3.3.3. Potential Problems
- Non-linearity of the response-predictor relationships
  - residual plots for identifying non-linearity:
    plot the residuals $e_i = y_i - \hat{y}_i$ versus a predictor $x_i$, or versus the fitted values $\hat{y}_i$
-
Correlation of error terms
we assumed the error terms $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are uncorrelated
but if the error terms are correlated, the estimated standard errors understate the true uncertainty, so the confidence intervals and p-values of our model cannot be trusted
correlations frequently occur in the context of time series data,
where we may see tracking in the residuals
- Non-constant variance of error terms
standard errors, confidence intervals and hypothesis tests in the linear model
rely upon the assumption $Var(\epsilon_i) = \sigma^2$,
i.e. that the error terms have a constant variance
- one solution for this heteroscedasticity:
transform the response using a concave function such as $\log Y$ or $\sqrt{Y}$
-
Outliers
to identify outliers, plot the studentized residuals,
computed by dividing each residual $e_i$ by its estimated standard error
- High-leverage points: observations with an unusual value of $x_i$
the least squares line can be heavily affected by just a couple of such observations,
but there is no simple way to plot all dimensions of the data simultaneously
- the leverage statistic $h_i$ quantifies an observation's leverage
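A sketch of leverages and studentized residuals, continuing the toy X, y, beta_hat, sigma2_hat from the multiple-regression sketches above; the cutoffs used for flagging are common rules of thumb, not fixed thresholds:

```python
# Leverage statistics h_i (diagonal of the hat matrix) and studentized residuals.
import numpy as np

H = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix: y_hat = H y
h = np.diag(H)                              # leverage of each observation
residuals = y - X @ beta_hat
stud_resid = residuals / np.sqrt(sigma2_hat * (1 - h))

print(np.where(np.abs(stud_resid) > 3)[0])  # possible outliers
print(np.where(h > 2 * h.mean())[0])        # possible high-leverage points
```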
- Collinearity
  when two or more predictor variables are closely related to one another, it is difficult to separate out the individual effects of the collinear variables. This reduces the accuracy of the estimates of the regression coefficients: the standard error for $\hat\beta_j$ increases and the t-statistic decreases,
  so we may fail to reject $H_0: \beta_j = 0$;
  the power of the hypothesis test is reduced by collinearity.
  - detect with the correlation matrix of the predictors
  - or with the variance inflation factor (VIF) to assess multicollinearity (see the sketch after this list),
    where \(R^2_{X_j|X_{-j}}\) is from a regression of \(X_j\) onto all of the other predictors,
    \(X_j = \sum_{i\ne j}\beta_i X_i + e\), and \(VIF(\hat\beta_j) = \frac{1}{1-R^2_{X_j|X_{-j}}}\).
    VIF has a smallest possible value of 1 when there is no collinearity among the predictors, and larger values when there is more collinearity. Practically, a VIF value that exceeds 5 or 10 indicates a problematic amount of collinearity.
  - solution: drop one of the collinear variables, or combine them into a single predictor
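A minimal NumPy sketch of the VIF computation described above; the function name `vif` and the toy example of two highly correlated predictors are illustrative choices, not from the book:

```python
# VIF for each predictor: regress X_j on the other predictors, then 1 / (1 - R^2).
import numpy as np

def vif(X_raw):
    """X_raw: n x p matrix of predictors WITHOUT the intercept column."""
    n, p = X_raw.shape
    vifs = []
    for j in range(p):
        xj = X_raw[:, j]
        others = np.column_stack([np.ones(n), np.delete(X_raw, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, xj, rcond=None)
        rss = np.sum((xj - others @ coef) ** 2)
        tss = np.sum((xj - xj.mean()) ** 2)
        r2 = 1 - rss / tss
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# example: two highly correlated predictors -> large VIFs for both
rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)
x3 = rng.normal(size=200)
print(vif(np.column_stack([x1, x2, x3])))   # first two VIFs large, third near 1
```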
3.5. Comparison with K-Nearest Neighbors
- KNN regression
  a non-parametric method: no assumption about the form of $f(x)$
  \(\hat f(x_0) = \frac{1}{K}\sum_{x_i\in\mathcal{N}_0} y_i\)
  - ($x_i,y_i$): training data
  - $\mathcal N_0$ : the K neighbors of the prediction point $x_0$
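A from-scratch sketch of the KNN average above; `knn_predict`, the Euclidean distance, and the toy training data are illustrative assumptions:

```python
# KNN regression: predict at x0 by averaging the responses of the K nearest training points.
import numpy as np

def knn_predict(X_train, y_train, x0, k=5):
    dists = np.linalg.norm(X_train - x0, axis=1)     # distance to every training point
    nearest = np.argsort(dists)[:k]                  # indices of the K neighbors N_0
    return y_train[nearest].mean()                   # average their responses

rng = np.random.default_rng(4)
X_train = rng.uniform(0, 1, size=(100, 2))
y_train = np.sin(2 * np.pi * X_train[:, 0]) + rng.normal(scale=0.1, size=100)
print(knn_predict(X_train, y_train, x0=np.array([0.3, 0.7]), k=5))
```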
- Preferable situations to linear regression:
- True $f(x)$ is nonlinear
- when Goal is Prediction rather than Inference
- a large number of observations per predictor (i.e. low dimension p)
- Curse of Dimensionality
  to capture 10% of the data as the neighborhood for each predictor:
  - p=1 : X has range (0,1), edge length $e_1(0.1) = 0.1$
  - p=2 : $X_1, X_2$ each range over (0,1), edge length $e_2(0.1) = \sqrt{0.1} = 0.1^{1/2} \approx 0.316$
  - p=3 : $e_3(0.1) = 0.1^{1/3} \approx 0.464$
  - p=10 : $e_{10}(0.1) = 0.1^{1/10} \approx 0.794$
-
Reduction of r : averaging over fewer obs.
small K : higher variance of the fit, poor prediction
- for the same sampling density: when p = 1, n = 100 suffices $\rightarrow$ when p = 10, n = $100^{10}$ is needed
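A two-line check of the edge lengths $e_p(r) = r^{1/p}$ quoted in the list above:

```python
# Edge length of the hypercube needed to capture a fraction r = 0.1 of uniform data in p dims.
for p in (1, 2, 3, 10):
    print(p, round(0.1 ** (1 / p), 3))   # 0.1, 0.316, 0.464, 0.794
```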