Hi all and Merry Christmas,
I have some questions about linear regression, having read about it in Chapter 3 of Elements of Statistical Learning (ESL). My questions are not about finer details, but more about high-level concepts. I've put my questions as "claims" in bold font, and I would really appreciate it if you could confirm or deny them. I apologise in advance for the long post.
Please don't reply only about the claims that are wrong; please also explicitly let me know which of my claims are correct.
First, we assume that the response (denoted by the random variable [imath]Y[/imath]) is related to the features (denoted by the random vector [imath]X ∈ \mathbb{R}^p[/imath]) through some function [imath]f[/imath], plus an additive Gaussian error term (denoted by [imath]ε[/imath]). That is:
[math]Y = f(X) + ε[/math]
Now we assume that [imath]f(X)[/imath] is linear, so we have:
[math]f(X) \equiv β_0 + \sum_{j=1}^{p} β_j X_j = X^Tβ,[/math]
where the last equality assumes the intercept [imath]β_0[/imath] is absorbed into [imath]β[/imath] by prepending a constant 1 to [imath]X[/imath].
Suppose we have [imath]N[/imath] observations of [imath]X[/imath] and [imath]Y[/imath], i.e., we have [imath]N[/imath] pairs [imath](x_i, y_i)[/imath]. We compute the residual sum of squares (RSS) as follows:
[math]RSS(β) = \sum_{i=1}^{N}(y_i - f(x_i))^2 = \sum_{i=1}^{N}(y_i - x_i^Tβ)^2 = (\mathbf{y} - \mathbf{X}β)^T(\mathbf{y} - \mathbf{X}β),[/math]
where the last expression is the matrix form, with [imath]\mathbf{X}[/imath] the [imath]N × (p+1)[/imath] matrix whose rows are the [imath]x_i^T[/imath] and [imath]\mathbf{y}[/imath] the vector of responses.
Let [imath]\hat{β}[/imath] be the vector of coefficients which minimises the RSS for our observations. A closed-form solution for [imath]\hat{β}[/imath] can be derived by differentiating the RSS with respect to [imath]β[/imath], setting the derivative to zero (which gives the normal equations [imath]\mathbf{X}^T\mathbf{X}β = \mathbf{X}^T\mathbf{y}[/imath]) and solving for [imath]β[/imath]:
[math]\hat{β} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}[/math]
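As a sanity check on the above, here is a minimal NumPy sketch (the data and variable names are made up purely for illustration) that builds [imath]\mathbf{X}[/imath] with a leading column of ones, solves the normal equations, and compares the result against NumPy's own least-squares routine:
[code]
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: N observations, p features, plus a leading column of ones
# so that the intercept beta_0 is absorbed into beta.
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Closed-form solution via the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from a library least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# RSS in matrix form, evaluated at beta_hat.
rss = (y - X @ beta_hat) @ (y - X @ beta_hat)
print(beta_hat, beta_lstsq, rss)
[/code]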
Claim 1: Computing [imath](\mathbf{X}^T\mathbf{X})^{-1}[/imath] might be numerically unstable if [imath]\mathbf{X}^T\mathbf{X}[/imath] is singular or nearly singular, so we use QR decomposition to solve for [imath]\hat{β}[/imath] instead.
We can compute the QR decomposition in many ways. Some common methods are the Gram-Schmidt process, Householder reflections and Givens rotations.
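Regarding Claim 1, here is how the QR route looks as a minimal sketch (same made-up data as above): with [imath]\mathbf{X} = \mathbf{Q}\mathbf{R}[/imath], the normal equations reduce to [imath]\mathbf{R}β = \mathbf{Q}^T\mathbf{y}[/imath], so [imath]\mathbf{X}^T\mathbf{X}[/imath] is never formed or inverted explicitly.
[code]
import numpy as np

rng = np.random.default_rng(0)
# Same made-up data as before: a leading column of ones for the intercept.
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# X = QR with Q having orthonormal columns and R upper triangular.
# Since Q^T Q = I, the normal equations X^T X beta = X^T y become R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)  # a triangular system, cheap to solve
print(beta_qr)
[/code]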
Claim 2: Multicollinearity in [imath]\mathbf{X}[/imath] results in numerical instabilities when computing [imath](\mathbf{X}^T\mathbf{X})^{-1}[/imath]. To deal with multicollinearity, we use techniques like ridge or lasso regression.
Ridge regression boils down to minimising the RSS subject to the constraint [imath]∑β_j^2 ≤ t[/imath]. This constrained problem is equivalent to minimising a penalised criterion [imath]RSS^{ridge}[/imath] (with a one-to-one correspondence between [imath]t[/imath] and the penalty parameter [imath]λ[/imath] below), which can be written in matrix form as:
[math]RSS^{ridge} = (\mathbf{y} - \mathbf{X}β)^T(\mathbf{y} - \mathbf{X}β) + λβ^Tβ[/math]
Let [imath]β^{ridge}[/imath] be the vector of coefficients which minimises [imath]RSS^{ridge}[/imath]. The above equation can be manipulated to obtain a closed-form solution for [imath]β^{ridge}[/imath].
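For reference, that closed form works out to
[math]β^{ridge} = (\mathbf{X}^T\mathbf{X} + λ\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}.[/math]
Note that for [imath]λ > 0[/imath] the matrix [imath]\mathbf{X}^T\mathbf{X} + λ\mathbf{I}[/imath] is nonsingular even when [imath]\mathbf{X}^T\mathbf{X}[/imath] itself is singular, which is one way of seeing how ridge regression addresses the multicollinearity issue in Claim 2.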
On the other hand, lasso regression boils down to minimising the RSS but with the constraint [imath]∑|β_j| ≤ t[/imath].
Claim 3: Unlike ridge regression, there is no closed-form solution for the vector of coefficients in lasso regression; the lasso is a quadratic programming problem. A modified version of the least angle regression (LAR) algorithm can be used to obtain the lasso coefficients*.
*From the last paragraph on page 68 of ESL.
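Regarding Claim 3, a minimal sketch, assuming scikit-learn is available and using its LassoLars estimator (which fits the lasso via the LARS algorithm); the data are made up purely for illustration:
[code]
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
# Made-up data with a sparse true coefficient vector; the intercept is
# fitted separately by the estimator, so no column of ones is needed here.
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

# LassoLars computes the lasso solution with the (modified) least angle
# regression algorithm rather than a generic quadratic programming solver.
lasso = LassoLars(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # some coefficients are shrunk exactly to zero
[/code]
Coordinate descent is another common way of computing lasso estimates (it is what scikit-learn's plain Lasso estimator uses).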
Ridge and lasso regression focus on "shrinking" the individual coefficients to deal with multicollinearity.
Claim 4: Unlike ridge and lasso regression, methods such as principal component regression (PCR) and partial least squares (PLS) focus on modifying the inputs/features to deal with multicollinearity instead of modifying the coefficients.
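Regarding Claim 4, a minimal sketch, again assuming scikit-learn: PCR is not a single estimator there, so it is assembled by chaining PCA with ordinary least squares, while PLSRegression is the built-in PLS estimator; the data are made up, with two nearly collinear features:
[code]
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Made-up data: the first two features are almost perfectly collinear.
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200), rng.normal(size=200)])
y = X @ np.array([1.0, 1.0, -2.0]) + rng.normal(scale=0.1, size=200)

# PCR: regress y on the leading principal components of X instead of on X itself.
pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(X, y)

# PLS: the derived components are chosen using y as well as X.
pls = PLSRegression(n_components=2)
pls.fit(X, y)

print(pcr.predict(X[:3]), pls.predict(X[:3]).ravel())
[/code]
In both cases the regression is carried out on a smaller set of derived inputs (linear combinations of the original features), which is the distinction the claim is drawing; in practice the features are usually standardised first, as ESL does.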
Many thanks.