Chapter 9 Parameter Estimation in MLR
9.1 Estimation of β in MLR
The equation for the multiple linear regression (MLR) model is:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$$
To estimate the $\beta_j$ parameters, we follow the same strategy as in SLR: minimizing the sum of squared residuals $\sum_{i=1}^n e_i^2$.
Graphically, this yields the hyperplane (the $p$-dimensional analogue of a plane) that best fits the data, by minimizing the distance between each point and the hyperplane. However, unlike in SLR, where we can draw a scatterplot and the regression line, visualizing this hyperplane is difficult. Instead, we are primarily limited to algebraic representations of the model.
Minimizing the sum of squared residuals means minimizing the function
$$S(\beta_0, \beta_1, \ldots, \beta_k) = \sum_{i=1}^n \left(y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right)^2 \tag{9.1}$$
This is a higher-dimensional analogue of (3.1). To find the values $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ that minimize (9.1), we differentiate $S(\beta_0, \beta_1, \ldots, \beta_k)$ with respect to each of the $(k+1)$ $\beta_j$'s and solve the resulting system of $p = k+1$ equations. However, solving this system of $(k+1)$ equations gets tedious, so we will not go into the details of that approach here. A simpler and more scalable approach is to use a matrix representation of the model.
9.2 Matrix form of the MLR model
9.2.1 Random Vectors
Before introducing the matrix form of the MLR model, we first briefly review notation for random vectors.
Definition 9.1 An $n$-dimensional random vector is a vector, each component of which is a random variable:
$$Y = \begin{bmatrix} Y_1 \\ \vdots \\ Y_n \end{bmatrix}$$
Definition 9.2 The mean vector is a vector whose elements are the element-wise means of a random vector, assuming those means exist:
$$\mu = \mu_Y = E[Y] = \begin{bmatrix} E[Y_1] \\ \vdots \\ E[Y_n] \end{bmatrix}$$
Definition 9.3 The covariance matrix (or variance-covariance matrix) is a matrix containing the variances and covariances of the elements in a random vector (assuming $E[Y_i^2] < \infty$):
$$\Sigma = \Sigma_Y = \text{Var}(Y) = E\left[(Y - \mu_Y)(Y - \mu_Y)^T\right] = \begin{bmatrix} E[(Y_1-\mu_1)(Y_1-\mu_1)] & \cdots & E[(Y_1-\mu_1)(Y_n-\mu_n)] \\ \vdots & \ddots & \vdots \\ E[(Y_n-\mu_n)(Y_1-\mu_1)] & \cdots & E[(Y_n-\mu_n)(Y_n-\mu_n)] \end{bmatrix} = \begin{bmatrix} \text{Var}(Y_1) & \text{Cov}(Y_1, Y_2) & \cdots & \text{Cov}(Y_1, Y_n) \\ \text{Cov}(Y_2, Y_1) & \text{Var}(Y_2) & \cdots & \text{Cov}(Y_2, Y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \text{Cov}(Y_n, Y_1) & \text{Cov}(Y_n, Y_2) & \cdots & \text{Var}(Y_n) \end{bmatrix}$$
An important property of the covariance matrix is that if $A$ is a $q \times n$ matrix of constants and $y$ is an $n$-dimensional random vector, then $\text{Var}[Ay] = A\,\text{Var}[y]\,A^T$ (a $q \times q$ matrix).
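This identity holds exactly for sample covariance matrices as well, which gives a quick numerical check. A small sketch in R (the matrix $A$ and the simulated data are arbitrary choices for illustration):

```r
# Simulate an n x 3 matrix whose rows are draws of a random vector y
set.seed(1)
n <- 500
y <- matrix(rnorm(n * 3), nrow = n, ncol = 3)

# An arbitrary 2 x 3 transformation matrix A
A <- matrix(c(1, 0, 2,
              0, 1, -1), nrow = 2, byrow = TRUE)

# Sample covariance of the transformed rows (each row is A y_i)
lhs <- cov(y %*% t(A))         # Var[Ay], estimated from the data
rhs <- A %*% cov(y) %*% t(A)   # A Var[y] A^T

all.equal(lhs, rhs)            # TRUE: the identity holds exactly for sample covariances
```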
9.2.2 Matrix form of the MLR model
To write the MLR model in matrix form, we translate each component into a vector or matrix. For the predictor variables $x_{ij}$, we create an $(n \times p)$ covariate matrix:
$$X = \begin{bmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{bmatrix}$$
The $i$th row in $X$ corresponds to the $i$th observation, and the $(j+1)$st column corresponds to the $j$th predictor variable (the first column corresponds to the intercept). When needed, we can let $x_j$ denote the $j$th column of $X$. Note that $p = k + 1$.
We then write the $\beta$'s as a $(p \times 1)$ vector:
$$\beta = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \beta_2 \\ \vdots \\ \beta_k \end{bmatrix}$$
We write the $Y_i$'s as an $(n \times 1)$ vector:
$$Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}$$
and write the $\epsilon_i$'s as an $(n \times 1)$ vector:
$$\epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
Together, these pieces give the MLR model in matrix form:
$$Y = X\beta + \epsilon$$
To see the connection to the non-matrix form, we can multiply out the pieces to get:
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & x_{11} & \cdots & x_{1k} \\ 1 & x_{21} & \cdots & x_{2k} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nk} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} = \begin{bmatrix} \beta_0 + x_{11}\beta_1 + \cdots + x_{1k}\beta_k + \epsilon_1 \\ \beta_0 + x_{21}\beta_1 + \cdots + x_{2k}\beta_k + \epsilon_2 \\ \vdots \\ \beta_0 + x_{n1}\beta_1 + \cdots + x_{nk}\beta_k + \epsilon_n \end{bmatrix}$$
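In R, the covariate matrix $X$ (including the leading column of ones) is built automatically from a model formula by `model.matrix()`. A quick sketch using the built-in `mtcars` data, chosen here purely for illustration:

```r
# Covariate matrix for a model with two predictors (k = 2, p = 3)
X <- model.matrix(mpg ~ wt + hp, data = mtcars)

head(X, 3)   # first column is all 1's (the intercept column)
dim(X)       # n rows and p = k + 1 columns
```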
9.2.3 Assumptions of MLR model
The MLR model assumptions are:
- $E[\epsilon] = 0$
- $\text{Var}[\epsilon] = \sigma^2 I$
- $X$ is "full-rank"
Assumption 1 is the same as the $E[\epsilon_i] = 0$ assumption from SLR. Assumption 2 implies constant variance ($\text{Var}(\epsilon_i) = \sigma^2$) and no correlation between the $\epsilon_i$'s ($\text{Cov}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$). Assumption 3 is discussed in detail below (Section 9.4).
9.3 Estimation of β in MLR (Matrix form)
In the matrix form of the MLR model, we can write the vector of residuals as $e = y - X\hat\beta$. Minimizing the sum of squared residuals becomes minimizing $S(\hat\beta) = e^Te$. This criterion can be rewritten:
$$S(\hat\beta) = \sum_{i=1}^n e_i^2 = e^Te = (y - X\hat\beta)^T(y - X\hat\beta) = y^Ty - (X\hat\beta)^Ty - y^TX\hat\beta + (X\hat\beta)^TX\hat\beta = y^Ty - 2\hat\beta^TX^Ty + \hat\beta^TX^TX\hat\beta$$
We minimize this by differentiating with respect to $\beta$:
$$\frac{\partial S(\beta)}{\partial \beta} = -2X^Ty + 2X^TX\beta$$
and then setting the derivative to zero:
$$0 = -2X^Ty + 2X^TX\hat\beta$$
This gives the "normal equations" for MLR:
$$X^TX\hat\beta = X^Ty$$
And by multiplying each side by the inverse of $X^TX$, we obtain the equation for the least-squares estimator of $\beta$:
$$\hat\beta = (X^TX)^{-1}X^Ty$$
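This formula can be verified directly in R by comparing it against the coefficients returned by `lm()`. A sketch using the built-in `mtcars` data (an arbitrary example dataset):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

X <- model.matrix(fit)   # n x p covariate matrix
y <- mtcars$mpg

# Least-squares estimator from the normal equations: (X^T X)^{-1} X^T y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

cbind(beta_hat, coef(fit))   # the two columns agree
```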
9.3.1 Relationship to SLR
What if $k = 1$? In that case, we expect our formulas for MLR to reduce to the quantities we found for SLR. To see this, let
$$X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
Then:
$$\hat\beta = (X^TX)^{-1}X^Ty = \left(\begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}\right)^{-1} \begin{bmatrix} 1 & \cdots & 1 \\ x_1 & \cdots & x_n \end{bmatrix} \begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} n & \sum_{i=1}^n x_i \\ \sum_{i=1}^n x_i & \sum_{i=1}^n x_i^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix}$$
$$= \frac{1}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2} \begin{bmatrix} \sum_{i=1}^n x_i^2 & -\sum_{i=1}^n x_i \\ -\sum_{i=1}^n x_i & n \end{bmatrix} \begin{bmatrix} \sum_{i=1}^n y_i \\ \sum_{i=1}^n x_i y_i \end{bmatrix} = \frac{1}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2} \begin{bmatrix} \sum_{i=1}^n x_i^2 \sum_{i=1}^n y_i - \sum_{i=1}^n x_i \sum_{i=1}^n x_i y_i \\ -\sum_{i=1}^n x_i \sum_{i=1}^n y_i + n \sum_{i=1}^n x_i y_i \end{bmatrix}$$
$$\Rightarrow \hat\beta_1 = \frac{n\sum_{i=1}^n x_i y_i - \sum_{i=1}^n x_i \sum_{i=1}^n y_i}{n\sum_{i=1}^n x_i^2 - \left(\sum_{i=1}^n x_i\right)^2} = \frac{S_{xy}}{S_{xx}}$$
As expected, this gives us exactly the form of $\hat\beta_1$ from SLR.
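We can also check this reduction numerically: with a single predictor, the matrix formula returns the same slope as $S_{xy}/S_{xx}$. A sketch with `mtcars`:

```r
x <- mtcars$wt
y <- mtcars$mpg

# Matrix-form estimate with k = 1 (intercept plus one predictor)
X <- cbind(1, x)
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

# SLR formula for the slope
Sxy <- sum((x - mean(x)) * (y - mean(y)))
Sxx <- sum((x - mean(x))^2)

c(matrix_form = beta_hat[2], slr_form = Sxy / Sxx)   # identical slopes
```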
9.4 Why must X be full-rank?
In the assumptions above, we said that X must be full-rank. Conceptually, this means that each column of X is providing different information; no information is duplicated in the chosen predictor variables. For example, including both speed in miles per hour and kilometers per hour in an MLR model is redundant.
Mathematically, this assumption means that the columns of $X$ are linearly independent, so the dimension of the column space of $X$ is equal to the number of columns in $X$. For example, if $x_3 = x_1 + x_2$, then the matrix $X = \begin{bmatrix} 1 & x_1 & x_2 & x_3 \end{bmatrix}$ is not full-rank (it has rank 3, less than its 4 columns) because the column $x_3$ can be written as a linear combination of the columns $x_1$ and $x_2$.
The underlying reason for this assumption is that we need to be able to find the inverse of $X^TX$ to compute $\hat\beta$. If $X$ is less than full-rank, then $(X^TX)^{-1}$ does not exist.
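This failure is easy to see in R: when one column of $X$ is an exact linear combination of the others, $X^TX$ is singular and `solve()` refuses to invert it. A sketch with simulated predictors (the variable names are arbitrary):

```r
set.seed(2)
x1 <- rnorm(20)
x2 <- rnorm(20)
x3 <- x1 + x2   # redundant: an exact linear combination of x1 and x2

X <- cbind(1, x1, x2, x3)   # 4 columns, but only rank 3
qr(X)$rank

# X^T X is singular, so attempting to invert it raises an error
result <- try(solve(t(X) %*% X), silent = TRUE)
inherits(result, "try-error")   # TRUE
```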
9.5 Fitting MLR in R
To fit an MLR model in R, we again use the `lm()` command. To include multiple predictor variables in the model, separate them by `+`: `y ~ x1 + x2 + x3`. As with SLR, the intercept is automatically included.
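For example, the model summarized in the output below can be fit with a call like the following (this assumes the `penguins` data from the `palmerpenguins` package is loaded; the object name `penguin_fit` is just a choice):

```r
library(palmerpenguins)

penguin_fit <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm + sex,
                  data = penguins)
```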
Detailed output can be obtained from either `tidy()`:
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5434. 287. -19.0 1.05e-54
## 2 flipper_length_mm 48.2 1.84 26.2 1.94e-82
## 3 bill_length_mm -5.20 4.86 -1.07 2.85e- 1
## 4 sexmale 359. 41.6 8.63 2.73e-16
or from `summary()`:
##
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm + bill_length_mm +
## sex, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -878.71 -246.26 -0.71 226.67 1061.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5433.534 286.558 -18.961 < 2e-16 ***
## flipper_length_mm 48.209 1.841 26.179 < 2e-16 ***
## bill_length_mm -5.201 4.860 -1.070 0.285
## sexmale 358.631 41.572 8.627 2.73e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 355.8 on 329 degrees of freedom
## (11 observations deleted due to missingness)
## Multiple R-squared: 0.8065, Adjusted R-squared: 0.8047
## F-statistic: 457.1 on 3 and 329 DF, p-value: < 2.2e-16
9.6 Properties of OLS Estimators
As with SLR, it is useful to understand the distributional properties of $\hat\beta$. It is straightforward to show that $\hat\beta$ is unbiased:
$$E[\hat\beta] = E[(X^TX)^{-1}X^Ty] = (X^TX)^{-1}X^TE[y] = (X^TX)^{-1}X^T(X\beta + E[\epsilon]) = (X^TX)^{-1}X^TX\beta = \beta$$
This tells us that, assuming the model is correct, on average $\hat\beta$ will equal $\beta$. This doesn't mean it will be true for any particular dataset, but in repeated experiments it would be right on average.
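Unbiasedness is a statement about repeated sampling, and we can illustrate it by simulation: generate many datasets from a known model, fit each one, and average the estimates. A minimal sketch (the true $\beta$ values, design matrix, and number of repetitions are arbitrary choices):

```r
set.seed(3)
n <- 50
X <- cbind(1, runif(n), runif(n))   # a fixed design matrix
beta <- c(2, -1, 0.5)               # true coefficients (arbitrary choices)

# Repeat the "experiment" many times, storing beta-hat from each simulated dataset
B <- 2000
estimates <- replicate(B, {
  y <- X %*% beta + rnorm(n)
  drop(solve(t(X) %*% X) %*% t(X) %*% y)
})

rowMeans(estimates)   # close to the true values (2, -1, 0.5)
```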
What about the variance of $\hat\beta$? Recall that if $A$ is a $q \times n$ matrix and $y$ is an $n$-dimensional random vector, $\text{Var}[Ay] = A\,\text{Var}[y]\,A^T$ (a $q \times q$ matrix). Applying this property with $A = (X^TX)^{-1}X^T$, and using the symmetry of $(X^TX)^{-1}$, we can compute the covariance matrix of $\hat\beta$:
$$\text{Var}(\hat\beta) = \text{Var}\left((X^TX)^{-1}X^Ty\right) = (X^TX)^{-1}X^T\,\text{Var}(y)\,\left((X^TX)^{-1}X^T\right)^T = (X^TX)^{-1}X^T\,\sigma^2 I\,X(X^TX)^{-1} = \sigma^2(X^TX)^{-1}X^TX(X^TX)^{-1} = \sigma^2(X^TX)^{-1}$$
Some important facts about the matrix $\sigma^2(X^TX)^{-1}$:
- The diagonal elements of $\text{Var}(\hat\beta)$ are the variances of the $\hat\beta_j$'s.
- The off-diagonal elements are the covariances between pairs of the $\hat\beta_j$'s.
The form of $\text{Var}(\hat\beta_j)$ is a multivariate extension of $\text{Var}(\hat\beta_1) = \sigma^2/S_{xx}$ from SLR. Now, instead of the variance of $\hat\beta_j$ depending only on the spread of the $x_{ij}$'s, the variability is also impacted by the other predictor variables. This means that the correlation between two predictor variables can impact the uncertainty (and thus the confidence intervals) of both point estimates. This will be important when we return to multicollinearity in Section 15.
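The impact of correlated predictors on $\text{Var}(\hat\beta)$ can be illustrated by computing $\sigma^2(X^TX)^{-1}$ for two designs that differ only in how correlated the predictors are. A sketch with simulated predictors (all names and values here are arbitrary choices):

```r
set.seed(4)
n <- 100
sigma2 <- 1
x1 <- rnorm(n)

x2_indep <- rnorm(n)                 # roughly uncorrelated with x1
x2_corr  <- x1 + rnorm(n, sd = 0.1)  # highly correlated with x1

var_beta <- function(X) sigma2 * solve(t(X) %*% X)

# Variance of beta-hat_1 (the coefficient on x1) under each design
v_indep <- var_beta(cbind(1, x1, x2_indep))[2, 2]
v_corr  <- var_beta(cbind(1, x1, x2_corr))[2, 2]

c(independent = v_indep, correlated = v_corr)   # correlation inflates the variance
```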
9.7 Fitted values for MLR
Fitted values for the MLR model are $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}$. In matrix form, this is:
$$\hat{y} = X\hat\beta = X(X^TX)^{-1}X^Ty = \left[X(X^TX)^{-1}X^T\right]y = Hy$$
Similarly, we can write the residuals as:
$$e = y - X\hat\beta = y - Hy = (I - H)y$$
9.8 Hat Matrix H
The matrix $H = X(X^TX)^{-1}X^T$ is called the "hat" matrix. The name comes from its use in calculating fitted values (putting the "hat" on the $y$'s).
H is important because it captures the relationships between the predictor variables (x’s). The diagonal elements of H are especially useful in quantifying leverage and influence (Section 13).
Mathematically, H is a symmetric, idempotent matrix that projects onto the column space of X. These attributes make H foundational to the theory of linear models, although most of the details are outside the scope of this text.
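These properties are straightforward to verify numerically. A sketch computing $H$ for a small simulated design matrix and checking symmetry, idempotence, and the projection property $HX = X$:

```r
set.seed(5)
X <- cbind(1, rnorm(10), rnorm(10))    # a small n x 3 design matrix

H <- X %*% solve(t(X) %*% X) %*% t(X)  # the hat matrix

isSymmetric(H)          # TRUE: H = H^T
all.equal(H %*% H, H)   # TRUE: H is idempotent
all.equal(H %*% X, X)   # TRUE: H leaves the columns of X unchanged
```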
9.9 Estimating σ2
Like in SLR, the parameter $\sigma^2 = \text{Var}(\epsilon_i)$ represents the variance of the errors above and below the "line" (actually a hyperplane) in MLR. Once again, we can estimate $\sigma^2$ using $SS_{Res}$:
$$SS_{Res} = \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \sum_{i=1}^n e_i^2 = e^Te = \left[(I-H)y\right]^T(I-H)y = y^T(I-H)^T(I-H)y = y^T(I-H)y$$
where the last step uses the symmetry and idempotence of $I - H$.
There are $n - p$ degrees of freedom associated with $\hat\sigma^2$, since there are $p$ estimated $\beta$'s ($\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$). For those who are interested in additional theory, the rank of $I - H$ is $n - p$, which is where these degrees of freedom come from.
Our estimator for σ2 is then:
$$\hat\sigma^2 = \frac{SS_{Res}}{n - p}$$
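In R, $\hat\sigma$ (the square root of this estimate) is the "Residual standard error" reported by `summary()`, and the `sigma()` function extracts it from a fitted model. A sketch checking the formula against `sigma()` using `mtcars`:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

e <- residuals(fit)
n <- nrow(mtcars)
p <- length(coef(fit))   # p = k + 1 = 3 here

sigma2_hat <- sum(e^2) / (n - p)   # SS_Res / (n - p)

c(by_hand = sigma2_hat, from_R = sigma(fit)^2)   # identical
```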
9.10 Estimating Var(ˆβ)
To estimate $\text{Var}(\hat\beta)$, we simply plug $\hat\sigma^2$ in for $\sigma^2$ in the formula for $\text{Var}(\hat\beta)$. The square roots of the diagonal elements of $\widehat{\text{Var}}(\hat\beta)$ are the standard errors of the $\hat\beta_j$'s.
In R, the `vcov()` function will compute $\widehat{\text{Var}}(\hat\beta)$:
## (Intercept) flipper_length_mm bill_length_mm sexmale
## (Intercept) 82115.6880 -434.680727 105.511986 1940.88100
## flipper_length_mm -434.6807 3.391103 -5.572842 -3.27875
## bill_length_mm 105.5120 -5.572842 23.620806 -48.95908
## sexmale 1940.8810 -3.278750 -48.959084 1728.20296
You can compute standard errors by taking the square roots of the diagonal elements of $\widehat{\text{Var}}(\hat\beta)$; note that they match the `std.error` column from `tidy()`:
## (Intercept) flipper_length_mm bill_length_mm sexmale
## 286.558350 1.841495 4.860124 41.571661
## # A tibble: 4 × 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -5434. 287. -19.0 1.05e-54
## 2 flipper_length_mm 48.2 1.84 26.2 1.94e-82
## 3 bill_length_mm -5.20 4.86 -1.07 2.85e- 1
## 4 sexmale 359. 41.6 8.63 2.73e-16
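The standard errors shown above can be reproduced with `sqrt()` and `diag()`. A sketch, refitting the penguins model from the `summary()` output earlier (this assumes the `palmerpenguins` package is available; the object name `penguin_fit` is arbitrary):

```r
library(palmerpenguins)

penguin_fit <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm + sex,
                  data = penguins)

# Standard errors: square roots of the diagonal of the estimated Var(beta-hat)
sqrt(diag(vcov(penguin_fit)))
```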