ISLR - Chapter 7. Moving Beyond Linearity

7.1. Polynomial Regression

  • Extend the standard linear model $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$
    to a polynomial function $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_d x_i^d + \epsilon_i$.
    For a large enough degree d, polynomial regression produces an extremely non-linear curve. The coefficients $\beta_0, \beta_1, \ldots, \beta_d$ can still be estimated easily using least squares, since the model is linear in the coefficients.
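  • A minimal sketch of such a fit, assuming simulated data and an arbitrary degree d = 4; the key point is that a design matrix of powers of x is fed to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)    # toy non-linear relationship

d = 4                                                 # polynomial degree (illustrative)
X = np.column_stack([x**j for j in range(d + 1)])     # columns 1, x, x^2, ..., x^d
beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # plain least squares

x_grid = np.linspace(-3, 3, 100)
X_grid = np.column_stack([x_grid**j for j in range(d + 1)])
y_hat = X_grid @ beta                                 # fitted polynomial curve
```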

7.2. Step Functions

  • Break the range of X into bins, and fit a different constant in each bin; this converts a continuous variable into an ordered categorical variable.

  • Given cutpoints $c_1, c_2, \ldots, c_K$ in the range of X, construct K+1 new variables
    \(\begin{align*} C_0(X) &= I(X < c_1), \\ C_1(X) &= I(c_1 \le X < c_2), \\ C_2(X) &= I(c_2 \le X < c_3), \\ &\vdots \\ C_{K-1}(X) &= I(c_{K-1} \le X < c_K), \\ C_{K}(X) &= I(c_K \le X), \\ \end{align*}\)
    where $I(\cdot)$ is an indicator function that returns 1 if its condition holds and 0 otherwise. These are sometimes called dummy variables. For any value of X, $\sum_{i=0}^K C_i(X) = 1$.

  • The least squares model: $y_i = \beta_0 + \beta_1 C_1(x_i) + \cdots + \beta_K C_K(x_i) + \epsilon_i$
    When $X<c_1$, all of the predictors $C_1(X), C_2(X), \ldots, C_K(X)$ are zero, so $\beta_0$ is the mean value of Y for $X<c_1$. By comparison, the model predicts a response of $\beta_0 + \beta_j$ for $c_j \le X < c_{j+1}$, so $\beta_j$ represents the average increase in the response for X in that range, relative to $X < c_1$. (A least squares sketch follows this list.)

  • Unless there are natural breakpoints in the predictors, piecewise-constant functions can miss the action.
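  • A minimal sketch of a step-function fit, assuming simulated data and hand-picked cutpoints; the dummy variables are built with numpy and the model is again ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = np.where(x < 5, 1.0, 3.0) + rng.normal(scale=0.5, size=x.size)  # toy step-shaped truth

cuts = np.array([2.5, 5.0, 7.5])           # cutpoints c_1, c_2, c_3 (illustrative)
bins = np.digitize(x, cuts)                # bin index 0..K for each observation
C = (bins[:, None] == np.arange(1, cuts.size + 1)).astype(float)  # dummies C_1..C_K

X = np.column_stack([np.ones_like(x), C])  # the intercept plays the role of C_0
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
# beta[0] is the mean of y for x < c_1; beta[j] is the average shift for the j-th bin.
```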

7.3. Basis Functions

  • Polynomial and piecewise-constant regression models are in fact special cases of a basis function approach: given a family of functions or transformations applied to a variable X, $b_1(X), b_2(X), \ldots, b_K(X)$, fit the linear model
    $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_K b_K(x_i) + \epsilon_i$
    as a standard linear model with predictors $b_1(x_i), b_2(x_i), \ldots, b_K(x_i)$. Hence, we can use least squares and all of the usual inference tools, such as standard errors for the coefficient estimates and F-statistics for the model’s overall significance.

  • Note that the basis functions $b_k(\cdot)$ are fixed and known (i.e., we choose the functions ahead of time). E.g., for polynomial regression, the basis functions are $b_j(x_i)=x_i^j$.
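  • A minimal sketch of the basis-function view, assuming statsmodels is available; the basis $\{x, x^2, \log(1+x)\}$ is an arbitrary illustration, and the point is that the fit is ordinary least squares, so standard errors and the overall F-statistic come for free:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, 200)
y = 2 + 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)

# Any fixed, known transformations of x can serve as basis functions.
B = np.column_stack([x, x**2, np.log1p(x)])
B = sm.add_constant(B)                     # add the intercept column

fit = sm.OLS(y, B).fit()
print(fit.bse)                             # standard errors of the coefficient estimates
print(fit.fvalue, fit.f_pvalue)            # overall F-statistic and its p-value
```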

7.4. Regression Splines

7.4.1. Piecewise Polynomials

  • Instead of fitting a high-degree polynomial over the entire range of X, piecewise polynomial regression fits separate low-degree polynomials over different regions: divide the range of X at the knots and fit a separate polynomial in each region. If we place K knots throughout the range of X, then we fit K+1 different models.

  • A cubic regression model, $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i$,
    is a piecewise cubic with no knots (d = 3).
    On the other hand,
    \(y_i = \begin{cases} \beta_{01}+\beta_{11}x_i + \beta_{21}x_i^2 + \beta_{31}x_i^3 + \epsilon_i & \mbox{if }x_i < c \\ \beta_{02}+\beta_{12}x_i + \beta_{22}x_i^2 + \beta_{32}x_i^3 + \epsilon_i & \mbox{if }x_i \ge c. \\ \end{cases}\)
    is a piecewise cubic with a single knot at a point c (an unconstrained two-piece fit is sketched after this list).

  • For other degrees d:
    piecewise-constant functions are piecewise polynomials of degree 0, and piecewise-linear functions are piecewise polynomials of degree 1.
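  • A minimal sketch of an unconstrained piecewise cubic with one knot, fit as two separate least squares cubics on simulated data with an illustrative knot at c = 0; the two pieces will generally not meet at the knot, which motivates the constraints in 7.4.2:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-2, 2, 200))
y = np.sin(2 * x) + rng.normal(scale=0.3, size=x.size)

c = 0.0                                    # single knot (illustrative)
left, right = x < c, x >= c

# Fit a separate cubic on each side of the knot; no continuity is imposed.
beta_left = np.polyfit(x[left], y[left], deg=3)
beta_right = np.polyfit(x[right], y[right], deg=3)

# Predictions use whichever piece the point falls in.
y_hat = np.where(x < c, np.polyval(beta_left, x), np.polyval(beta_right, x))
```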

7.4.2. Constraints and Splines

  • To remove the discontinuities at the knots and rein in the model’s complexity, we add constraints:
    a degree-d spline is a piecewise degree-d polynomial that is required to have continuous derivatives up to degree d-1 at each knot.
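  • A short counting argument (standard, though not spelled out above) for why these constraints leave exactly K+4 free parameters for a cubic spline with K knots, matching the basis count in 7.4.3:
    \(\underbrace{4(K+1)}_{\text{cubic coefficients in the }K+1\text{ regions}} - \underbrace{3K}_{\text{continuity of }g,\,g',\,g''\text{ at each knot}} = K+4 \text{ free parameters.}\)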

7.4.3. The Spline Basis Representation

  • $y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \cdots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i$
    is a cubic spline model with K knots.

  • Such a model can be built from the polynomial terms $X, X^2, X^3$ plus one truncated power basis function per knot $\xi$, defined as \(h(x,\xi) = (x-\xi)_{+}^3 = \begin{cases} (x-\xi)^3 & \mbox{if }x>\xi \\ 0 & \mbox{otherwise,} \end{cases}\)
    so, in fitting a cubic spline with K knots, we perform least squares regression with an intercept and 3+K predictors, of the form $X, X^2, X^3, h(X,\xi_1), h(X,\xi_2), \ldots, h(X,\xi_K)$.
    This amounts to estimating a total of K+4 regression coefficients; for this reason, fitting a cubic spline with K knots uses K+4 degrees of freedom.
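  • A minimal sketch of a cubic spline fit via the truncated power basis, on simulated data with illustrative knots; the design matrix has columns $x, x^2, x^3$ plus one $h(x,\xi_k)$ per knot, and the fit is ordinary least squares:

```python
import numpy as np

def truncated_power(x, knot):
    """h(x, xi) = (x - xi)_+^3, the cubic truncated power basis function."""
    return np.where(x > knot, (x - knot) ** 3, 0.0)

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

knots = np.array([2.5, 5.0, 7.5])          # K = 3 knots (illustrative)
X = np.column_stack(
    [np.ones_like(x), x, x**2, x**3] + [truncated_power(x, k) for k in knots]
)                                          # intercept + (3 + K) predictors

beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # K + 4 coefficients in total
y_hat = X @ beta
```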

  • However, splines can have high variance at the outer range of the predictors, i.e., when X takes on a value smaller than the smallest knot or larger than the largest knot; the confidence bands in the boundary regions can appear fairly wild.
    A natural spline adds boundary constraints: the function is required to be linear beyond the outermost knots. This generally produces more stable estimates, and the corresponding confidence intervals are narrower.
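  • A minimal sketch comparing a cubic regression spline basis with a natural cubic spline basis, assuming the patsy package’s bs() and cr() builders and statsmodels for the fit; the df values are illustrative:

```python
import numpy as np
import statsmodels.api as sm
from patsy import dmatrix

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 10, 300))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Cubic regression spline basis (B-spline parameterization); dmatrix adds the intercept.
Xb = dmatrix("bs(x, df=6, degree=3)", {"x": x})
# Natural cubic spline basis: constrained to be linear beyond the boundary knots.
Xn = dmatrix("cr(x, df=6)", {"x": x})

fit_bs = sm.OLS(y, Xb).fit()
fit_ns = sm.OLS(y, Xn).fit()
# The natural spline fit tends to give narrower confidence bands near the boundaries.
```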

7.4.4. Choosing the Number and Locations of the Knots

  • The regression spline is most flexible in regions that contain a lot of knots, because in those regions the polynomial coefficients can change rapidly. Hence, one option is to place more knots in places where the function might vary most rapidly, and to place fewer knots where it seems more stable. In practice, it is common to place knots in a uniform fashion.

  • To determine the number of knots, or equivalently the degrees of freedom that the spline should contain, we can use cross-validation.

7.4.5. Comparison to Polynomial Regression

  • Regression splines often give superior results to polynomial regression. This is because unlike polynomials, which must use a high degree to produce flexible fits, splines introduce flexibility by increasing the number of knots but keeping the degree fixed. Generally, this approach produces more stable estimates.

  • Splines can produce a reasonable fit at the boundaries and also allow us to place more knots, or flexibility, over regions where the function f seems to be changing rapidly, and fewer knots where f appears more stable.

7.5. Smoothing Splines

7.5.1. An Overview of Smoothing Splines

  • In fitting a smooth curve to a set of data, we want some function $g(x)$ that fits the observed data well, i.e., minimizes $RSS = \sum_{i=1}^n(y_i-g(x_i))^2$. But if we don’t put any constraints on g, then we can always make RSS zero simply by choosing g to interpolate all of the $y_i$. Such a function would be far too flexible and would woefully overfit the data.

  • A smoothing spline is the function g that minimizes
    \(\sum_{i=1}^n(y_i - g(x_i))^2 + \lambda\int g''(t)^2 \, dt\)
    with a nonnegative tuning parameter $\lambda$; this takes the “Loss+Penalty” form of the ridge and lasso. The term $\sum_{i=1}^n(y_i - g(x_i))^2$ is a loss function that encourages g to fit the data well, and the term \(\lambda\int g''(t)^2 \, dt\) is a penalty term that reduces the variability in g. \(g''(t)\) is the second derivative of g, the amount by which the slope (the first derivative) is changing. Broadly speaking, the second derivative is a measure of roughness: it is large in absolute value if g(t) is very wiggly near t, and close to zero otherwise; it is zero for a straight line, which is perfectly smooth. The integral sums over the range of t, so \(\int g''(t)^2 \, dt\) is a measure of the total change in $g'(t)$ over its entire range. If g is very smooth, then $g'(t)$ will be close to constant and \(\int g''(t)^2 \, dt\) will take on a small value. Therefore, the penalty term encourages g to be smooth, and the larger the $\lambda$, the smoother g will be.
    When $\lambda=0$, there is no penalty and g will be very jumpy, interpolating the data exactly. When $\lambda\rightarrow\infty$, g will be perfectly smooth: the linear least squares line $g(x)=ax+b$. Thus $\lambda$ controls the bias-variance trade-off of the smoothing spline.

  • The smoothing spline g(x) has some special properties: it is a piecewise cubic polynomial with knots at the unique values of $x_1,\ldots,x_n$, and it has continuous first and second derivatives at each knot. Furthermore, it is linear in the regions outside of the extreme knots.
    In other words, the smoothing spline is a natural cubic spline with knots at $x_1,\ldots,x_n$; it is a shrunken version of such a natural spline, where the value of the tuning parameter $\lambda$ controls the level of shrinkage.
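  • A minimal sketch of fitting a smoothing spline, assuming a recent SciPy (1.10 or newer) that provides scipy.interpolate.make_smoothing_spline; its lam argument plays the role of $\lambda$, and when omitted it is, per the SciPy documentation, chosen by generalized cross-validation:

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline   # SciPy >= 1.10 (assumption)

rng = np.random.default_rng(6)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# lam plays the role of lambda: larger values give a smoother (eventually linear) curve.
spl_rough = make_smoothing_spline(x, y, lam=1e-3)
spl_smooth = make_smoothing_spline(x, y, lam=10.0)
spl_auto = make_smoothing_spline(x, y)                 # lam chosen automatically

x_grid = np.linspace(0, 10, 500)
y_rough, y_smooth = spl_rough(x_grid), spl_smooth(x_grid)
```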

7.5.2. Choosing the Smoothing Parameter $\lambda$

  • Having seen that a smoothing spline is simply a natural spline with knots at every unique value of $x_i$, it might seem that a smoothing spline will have far too many degrees of freedom, since a knot at each data point allows a great deal of flexibility. But $\lambda$ controls the roughness of the smoothing spline, and hence the effective degrees of freedom. We can show that as $\lambda$ increases from 0 to $\infty$, the effective degrees of freedom $df_\lambda$ decrease from n to 2.

  • Usually degrees of freedom refer to the number of free parameters, such as the number of coefficients fit in a polynomial or cubic spline. Though a smoothing spline has n parameters and hence n nominal degrees of freedom, these n parameters are heavily constrained or shrunk down. Thus $df_\lambda$ is a measure of the flexibility of the smoothing spline: the higher it is, the more flexible the smoothing spline.

  • To define the effective degrees of freedom, a measure of model complexity, write
    \(\hat{\mathbf{g}}_\lambda = \mathbf{S}_\lambda\mathbf{y}\),
    where \(\hat{\mathbf{g}}_\lambda\) is the solution to the smoothing spline problem for a particular choice of $\lambda$: an n-vector containing the fitted values of the model at the training points $x_1,\ldots,x_n$, obtained by applying the $n \times n$ matrix $\mathbf{S}_\lambda$ (which depends on $\lambda$ and the $x_i$, but not on $\mathbf{y}$) to the response vector.
    Then the effective degrees of freedom is defined to be
    \(df_\lambda = \text{trace}(\mathbf{S}_\lambda) = \sum_{i=1}^n\{ \mathbf{S}_\lambda \}_{ii}\),
    the sum of the diagonal elements of the matrix $\mathbf{S}_\lambda$.
    E.g., for a linear regression model with p predictors and an intercept, the hat matrix plays the role of $\mathbf{S}_\lambda$:
    \(\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\),
    \(\text{trace}(\mathbf{H}) = p+1.\)

  • In fitting a smoothing spline, we do not need to select the number or location of the knots; there will be a knot at each training observation. Instead, we need to choose the value of $\lambda$. For this case, LOOCV can be computed very efficiently with essentially the same cost as computing a single fit, using the following formula:
    \(RSS_{cv}(\lambda) = \sum_{i=1}^n(y_i - \hat{g}_\lambda^{(-i)}(x_i))^2 = \sum_{i=1}^n\left[\frac{y_i-\hat{g}_\lambda(x_i)}{1-\{\mathbf{S}_\lambda\}_{ii}}\right]^2\)
    where \(\hat{g}_\lambda^{(-i)}(x_i)\) indicates the fitted value for this smoothing spline evaluated at $x_i$, where the fit uses all the training observations except for the ith observation. In contrast, \(\hat{g}_\lambda(x_i)\) indicates the smoothing spline evaluated at $x_i$, where the function is fit to all of the data. Thus, all n leave-one-out fits can be obtained from a single fit of the smoothing spline to the full data.
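  • A minimal sketch verifying the LOOCV shortcut for a linear smoother, using ordinary least squares (whose hat matrix plays the role of $\mathbf{S}_\lambda$) as the smoother on simulated data; the same identity is what makes LOOCV essentially free for smoothing splines:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(size=n)

S = X @ np.linalg.solve(X.T @ X, X.T)         # hat / smoother matrix
y_hat = S @ y
print(np.trace(S))                            # effective degrees of freedom = p + 1 = 4

# Shortcut: LOOCV residual_i = (y_i - y_hat_i) / (1 - S_ii).
shortcut = np.sum(((y - y_hat) / (1 - np.diag(S))) ** 2)

# Brute force: refit with observation i held out, predict it, accumulate squared errors.
brute = 0.0
for i in range(n):
    mask = np.arange(n) != i
    beta_i = np.linalg.lstsq(X[mask], y[mask], rcond=None)[0]
    brute += (y[i] - X[i] @ beta_i) ** 2

print(np.isclose(shortcut, brute))            # True: the two computations agree
```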

  • Using the effective degrees of freedom, we can directly compare the model complexities of the methods discussed so far, such as linear regression, ridge regression, smoothing splines, and cubic splines.

7.6. Local Regression

  • A different approach for fitting flexible non-linear functions, related to the idea of KNN but with closer observations receiving larger weights. This is sometimes referred to as a memory-based procedure, because we need all the training data each time we compute a prediction.

  • Algorithm: Local Regression At $X = x_0$
    1. Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$.
    2. Assign a weight $K_{i0} = K(x_i,x_0)$ to each point in this neighborhood, so that the point furthest from $x_0$ has weight zero, and the closest has the highest weight. All points outside the neighborhood (i.e., all but these k nearest neighbors) get weight zero.
    3. Fit a weighted least squares regression using the aforementioned weights;
      Objective function: \(\text{min}_{\beta_0,\beta_1} \left[\sum_{i=1}^n K_{i0}(y_i - \beta_0 - \beta_1 x_i)^2 \right]\)
    4. The fitted value at $x_0$ is given by $\hat{f}(x_0) = \hat\beta_0 + \hat\beta_1 x_0$.
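  • A minimal sketch of the algorithm above on simulated data; the tricube kernel is one common choice of weighting function (an assumption here, not prescribed by the algorithm), and the span is illustrative:

```python
import numpy as np

def local_linear_fit(x0, x, y, span=0.3):
    """Weighted least squares line through the points nearest x0 (steps 1-4 above)."""
    n = x.size
    k = max(2, int(np.ceil(span * n)))        # k = s * n neighbors
    dist = np.abs(x - x0)
    neighbors = np.argsort(dist)[:k]          # step 1: the k closest training points

    # Step 2: tricube weights inside the neighborhood, zero outside it.
    w = np.zeros(n)
    d_max = dist[neighbors].max()
    w[neighbors] = (1 - (dist[neighbors] / d_max) ** 3) ** 3

    # Step 3: weighted least squares for beta_0, beta_1.
    X = np.column_stack([np.ones(n), x])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    return beta[0] + beta[1] * x0             # step 4: fitted value at x0

rng = np.random.default_rng(8)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
curve = np.array([local_linear_fit(x0, x, y, span=0.3) for x0 in np.linspace(0, 10, 100)])
```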
  • In performing local regression, there are choices to be made, such as how to define the weighting function K, and which regression model to fit. The most important choice is the span s, the proportion of points used to compute the local regression at $x_0$; “how many neighbors?”. It plays the role of the tuning parameter $\lambda$, controlling the flexibility of the fit: the smaller s, the more local and wiggly the fit. We can use cross-validation to choose s, or we can specify it directly.

  • The idea of local regression can be generalized in many different ways. In multivariate settings, one very useful generalization involves fitting a multiple linear regression model that is global in some variables, but local in another, such as time. Such varying coefficient models are a useful way of adapting a model to the most recently gathered data.

  • Local regression can perform poorly if the dimension p is much larger than about 3 or 4; it suffers from the curse of dimensionality. It also has a boundary problem, because fewer data points are available as neighbors near the boundary.

7.7. Generalized Additive Models

  • Approaches presented in sections above can be seen as extensions of simple linear regression, flexibly predicting a response Y on the basis of a single predictor X. GAMs provide a general framework for extending a standard linear model by allowing non-linear functions of each of the variables, while maintaining additivity. They can be applied with both quantitative and qualitative responses.

7.7.1. GAMs for Regression Problems

  • standard multiple linear regression model is
    \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i\).
    For non-linear relationships, replace each linear component \(\beta_j x_{ij}\) with an unspecified smooth non-linear function \(f_j(x_{ij})\); the model is now
    \(\begin{align*} y_i &= \beta_0 + \sum_{j=1}^p f_j(x_{ij}) + \epsilon_i \\ &= \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \cdots + f_p(x_{ip}) + \epsilon_i. \end{align*}\)
    It is called an additive model because we calculate a separate $f_j$ for each $X_j$, and then add together all of their contributions.
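  • A minimal sketch of a least-squares GAM, assuming statsmodels’ formula interface and patsy’s natural cubic spline basis cr(); the simulated variables loosely mirror the book’s Wage example, and the df values are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(9)
n = 500
df = pd.DataFrame({
    "year": rng.uniform(2003, 2009, n),
    "age": rng.uniform(18, 80, n),
    "education": rng.choice(["HS", "College", "Advanced"], n),
})
df["wage"] = (
    0.5 * (df["year"] - 2003)
    + 10 * np.sin((df["age"] - 18) / 20)
    + df["education"].map({"HS": 0, "College": 5, "Advanced": 10})
    + rng.normal(scale=2, size=n)
)

# Each quantitative predictor gets its own spline basis, the qualitative one gets dummies;
# because every basis is fixed, the whole additive model is fit by ordinary least squares.
gam = smf.ols("wage ~ cr(year, df=4) + cr(age, df=5) + education", data=df).fit()
print(gam.summary())
```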

  • If smoothing splines are used as the building blocks of a GAM, fitting is not quite as simple as in the natural spline case, since least squares cannot be used. However, standard software can fit GAMs with smoothing splines via an approach known as backfitting: a method that fits a multivariate model by repeatedly updating the fit for each predictor in turn, holding the others fixed. Each time we update a function, we simply apply the fitting method for that variable to a partial residual (see the sketch below).
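  • A minimal sketch of backfitting under simplifying assumptions: a cubic polynomial smoother stands in for the per-variable smoothing spline, and a fixed number of sweeps is used instead of a convergence check:

```python
import numpy as np

def backfit(X, y, degree=3, n_sweeps=20):
    """Additive model y ~ beta0 + sum_j f_j(X[:, j]) via backfitting."""
    n, p = X.shape
    beta0 = y.mean()
    f = np.zeros((n, p))                       # current estimate of each f_j(x_ij)

    for _ in range(n_sweeps):
        for j in range(p):
            # Partial residual: remove the intercept and all the other fitted functions.
            r = y - beta0 - f.sum(axis=1) + f[:, j]
            # Smooth the partial residual against x_j (polynomial stand-in smoother).
            coefs = np.polyfit(X[:, j], r, deg=degree)
            f[:, j] = np.polyval(coefs, X[:, j])
            f[:, j] -= f[:, j].mean()          # center f_j so the intercept is identifiable
    return beta0, f

rng = np.random.default_rng(10)
X = rng.uniform(-2, 2, size=(400, 2))
y = 1.0 + np.sin(2 * X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.3, size=400)
beta0, f = backfit(X, y)
y_hat = beta0 + f.sum(axis=1)
```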

7.7.2. GAMs for Classification Problems

  • For qualitative response Y, standard logistic regression model is
    \(\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p.\)
    An extension allowing non-linear relationships; a logistic regression GAM is
    \(\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + f_1(X_1) + f_2(X_2) + \cdots + f_p(X_p).\)
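  • A minimal sketch of a logistic-regression GAM, assuming statsmodels’ formula GLM with a binomial family and patsy’s cr() basis; the variables, df values, and simulated response are illustrative:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(11)
n = 1000
df = pd.DataFrame({"age": rng.uniform(18, 80, n), "year": rng.uniform(2003, 2009, n)})
logit = -4 + 0.08 * df["age"] - 0.0005 * (df["age"] - 50) ** 2 + 0.1 * (df["year"] - 2003)
df["high_wage"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# A non-linear f_j for each predictor on the logit scale; the GLM is fit by IRLS, not least squares.
gam_logit = smf.glm(
    "high_wage ~ cr(age, df=5) + cr(year, df=4)",
    data=df,
    family=sm.families.Binomial(),
).fit()
print(gam_logit.summary())
```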

Pros and Cons of GAMs

  • Pros
    1. GAMs allow us to fit a non-linear function to each variable, so that we can automatically model non-linear relationships. This means we do not need to manually try out many different transformations on each variable individually.
    2. Potentially make more accurate predictions with non-linear fits.
    3. Because the model is additive, we can examine the effect of each variable on the response individually while holding all of the other variables fixed.
    4. The smoothness of the function $f_j$ for the variable $X_j$ can be summarized via degrees of freedom.
  • Cons
    1. The model is restricted to be additive; with many variables, important interactions can be missed. However, we can manually add interaction terms to the model by including additional predictors of the form like $X_j \times X_k$.
    2. The solution of the optimization is not unique: $\beta_0$ is not identifiable because each $f_j$ could absorb its own intercept term, so a GAM effectively has p+1 intercepts that cannot be distinguished. To resolve this, we restrict each $f_j$ to be centered over the training data, \(\sum_{i=1}^n f_j(x_{ij}) = 0\), which gives $\hat\beta_0 = \bar{y}$.