Hi all and Merry Christmas,
I have some questions about linear regression, having read about it in Chapter 3 of Elements of Statistical Learning (ESL). My questions are not about finer details, but more about high-level concepts. I've put my questions as "claims" in bold font, and I would really appreciate it if you could confirm or deny them. I apologise in advance for the long post.
Please don't reply only about the claims that are wrong; please also explicitly let me know which of my claims are correct.
First, we assume that the response (denoted by the random variable [imath]Y[/imath]) is related to the features (denoted by the random vector [imath]X ∈ \mathbb{R}^p[/imath]) through some function [imath]f[/imath], plus an additive Gaussian error term (denoted by [imath]ε[/imath]). That is:
[math]Y = f(X) + ε[/math]
Now we assume that [imath]f(X)[/imath] is linear, so we have:
[math]f(X) \equiv β_0 + \sum_{j=1}^{p} β_j X_j = X^Tβ,[/math]
where the last equality assumes the intercept [imath]β_0[/imath] is absorbed into [imath]β[/imath] by prepending a constant 1 to [imath]X[/imath].
Suppose we have [imath]N[/imath] observations of [imath]X[/imath] and [imath]Y[/imath], i.e., we have [imath]N[/imath] pairs [imath](x_i, y_i)[/imath]. We compute the residual sum of squares (RSS) as follows:
[math]RSS(β) = \sum_{i=1}^{N}(y_i - f(x_i))^2 = \sum_{i=1}^{N}(y_i - x_i^Tβ)^2 = (\mathbf{y} - \mathbf{X}β)^T(\mathbf{y} - \mathbf{X}β),[/math]
where the last expression is the matrix form, with [imath]\mathbf{X}[/imath] the [imath]N × (p+1)[/imath] matrix whose rows are the [imath]x_i^T[/imath] and [imath]\mathbf{y}[/imath] the vector of responses.
Let [imath]\hat{β}[/imath] be the vector of coefficients which minimises the RSS for our observations. A closed-form solution for [imath]\hat{β}[/imath] can be derived by differentiating the RSS with respect to [imath]β[/imath], setting the derivative to zero (which gives the normal equations [imath]\mathbf{X}^T\mathbf{X}β = \mathbf{X}^T\mathbf{y}[/imath]) and solving for [imath]β[/imath]:
[math]\hat{β} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}[/math]
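As a sanity check on the above, here is a minimal NumPy sketch (the data and variable names are made up purely for illustration) that builds [imath]\mathbf{X}[/imath] with a leading column of ones, solves the normal equations, and compares the result against NumPy's own least-squares routine:
[code]
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: N observations, p features, plus a leading column of ones
# so that the intercept beta_0 is absorbed into beta.
N, p = 100, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Closed-form solution via the normal equations (X^T X) beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Reference solution from a library least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# RSS in matrix form, evaluated at beta_hat.
rss = (y - X @ beta_hat) @ (y - X @ beta_hat)
print(beta_hat, beta_lstsq, rss)
[/code]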
Claim 1: Computing [imath](\mathbf{X}^T\mathbf{X})^{-1}[/imath] might be numerically unstable if [imath]\mathbf{X}^T\mathbf{X}[/imath] is singular or nearly singular, so we use QR decomposition to solve for [imath]\hat{β}[/imath] instead.
We can compute the QR decomposition in many ways. Some common methods are the Gram-Schmidt process, Householder reflections and Givens rotations.
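Regarding Claim 1, here is how the QR route looks as a minimal sketch (same made-up data as above): with [imath]\mathbf{X} = \mathbf{Q}\mathbf{R}[/imath], the normal equations reduce to [imath]\mathbf{R}β = \mathbf{Q}^T\mathbf{y}[/imath], so [imath]\mathbf{X}^T\mathbf{X}[/imath] is never formed or inverted explicitly.
[code]
import numpy as np

rng = np.random.default_rng(0)
# Same made-up data as before: a leading column of ones for the intercept.
X = np.column_stack([np.ones(100), rng.normal(size=(100, 3))])
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# X = QR with Q having orthonormal columns and R upper triangular.
# Since Q^T Q = I, the normal equations X^T X beta = X^T y become R beta = Q^T y.
Q, R = np.linalg.qr(X)
beta_qr = np.linalg.solve(R, Q.T @ y)  # a triangular system, cheap to solve
print(beta_qr)
[/code]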
Claim 2: Multicollinearity in [imath]\mathbf{X}[/imath] results in numerical instabilities when computing [imath](\mathbf{X}^T\mathbf{X})^{-1}[/imath]. To deal with multicollinearity, we use techniques like ridge or lasso regression.
Ridge regression boils down to minimising the RSS subject to the constraint [imath]∑β_j^2 ≤ t[/imath]. This constrained problem is equivalent to minimising a penalised criterion [imath]RSS^{ridge}[/imath] (with a one-to-one correspondence between [imath]t[/imath] and the penalty parameter [imath]λ[/imath] below), which can be written in matrix form as:
[math]RSS^{ridge} = (\mathbf{y} - \mathbf{X}β)^T(\mathbf{y} - \mathbf{X}β) + λβ^Tβ[/math]
Let [imath]β^{ridge}[/imath] be the vector of coefficients which minimises [imath]RSS^{ridge}[/imath]. The above equation can be manipulated to obtain a closed-form solution for [imath]β^{ridge}[/imath].
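For reference, that closed form works out to
[math]β^{ridge} = (\mathbf{X}^T\mathbf{X} + λ\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}.[/math]
Note that for [imath]λ > 0[/imath] the matrix [imath]\mathbf{X}^T\mathbf{X} + λ\mathbf{I}[/imath] is nonsingular even when [imath]\mathbf{X}^T\mathbf{X}[/imath] itself is singular, which is one way of seeing how ridge regression addresses the multicollinearity issue in Claim 2.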
On the other hand, lasso regression boils down to minimising the RSS but with the constraint [imath]∑|β_j| ≤ t[/imath].
Claim 3: Unlike ridge regression, there is no closed-form solution for the vector of coefficients in lasso regression; the lasso is a quadratic programming problem. A modified version of the least angle regression (LAR) algorithm can be used to obtain the lasso coefficients*.
*From the last paragraph on page 68 of ESL.
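Regarding Claim 3, a minimal sketch, assuming scikit-learn is available and using its LassoLars estimator (which fits the lasso via the LARS algorithm); the data are made up purely for illustration:
[code]
import numpy as np
from sklearn.linear_model import LassoLars

rng = np.random.default_rng(0)
# Made-up data with a sparse true coefficient vector; the intercept is
# fitted separately by the estimator, so no column of ones is needed here.
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(scale=0.5, size=100)

# LassoLars computes the lasso solution with the (modified) least angle
# regression algorithm rather than a generic quadratic programming solver.
lasso = LassoLars(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # some coefficients are shrunk exactly to zero
[/code]
Coordinate descent is another common way of computing lasso estimates (it is what scikit-learn's plain Lasso estimator uses).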
Ridge and lasso regression focus on "shrinking" the individual coefficients to deal with multicollinearity.
Claim 4: Unlike ridge and lasso regression, methods such as principal component regression (PCR) and partial least squares (PLS) focus on modifying the inputs/features to deal with multicollinearity instead of modifying the coefficients.
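Regarding Claim 4, a minimal sketch, again assuming scikit-learn: PCR is not a single estimator there, so it is assembled by chaining PCA with ordinary least squares, while PLSRegression is the built-in PLS estimator; the data are made up, with two nearly collinear features:
[code]
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# Made-up data: the first two features are almost perfectly collinear.
x1 = rng.normal(size=200)
X = np.column_stack([x1, x1 + 0.01 * rng.normal(size=200), rng.normal(size=200)])
y = X @ np.array([1.0, 1.0, -2.0]) + rng.normal(scale=0.1, size=200)

# PCR: regress y on the leading principal components of X instead of on X itself.
pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(X, y)

# PLS: the derived components are chosen using y as well as X.
pls = PLSRegression(n_components=2)
pls.fit(X, y)

print(pcr.predict(X[:3]), pls.predict(X[:3]).ravel())
[/code]
In both cases the regression is carried out on a smaller set of derived inputs (linear combinations of the original features), which is the distinction the claim is drawing; in practice the features are usually standardised first, as ESL does.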
Many thanks.