Introduction to Regression
Preface
License
1 Introduction and Example Datasets
1.1 What is regression?
1.2 Regression Goals
1.2.1 Scientific Goals
1.2.2 Mathematical Goals
1.3 Example Datasets
1.3.1 Palmer Penguins (Part 2)
1.3.2 Baseball Hits
1.3.3 Housing Price
1.3.4 Bike Share Programs
1.3.5 Car fuel efficiency
2 Simple Linear Regression
2.1 The Simple Linear Regression (SLR) Model
2.1.1 Goal of SLR
2.1.2 Theoretical SLR model
2.1.3 SLR Model Assumptions
2.2 Random Variables v. Data
2.3 Parameter Interpretation
2.3.1 Regression Line Equation
2.3.2 Interpreting \(\beta_1\)
2.3.3 Interpreting \(c\beta_1\)
2.3.4 Interpreting \(\beta_0\)
2.3.5 Interpreting \(\beta_0 + \beta_1x_i\)
2.3.6 Interpretation of \(\sigma^2\)
2.3.7 Interpreting \(\beta_1\) with binary \(x\)
2.4 Putting it together: Penguin example
2.5 Exercises
3 Estimating SLR Parameters
3.1 Ordinary Least Squares Estimation of Parameters
3.2 Understanding \(\hat\beta_1\) with correlation
3.3 Fitting SLR models in R
3.3.1 OLS estimation in R
3.3.2 Plotting SLR Fit
3.3.3 OLS estimation in R with binary x
3.4 Properties of Least Squares Estimators
3.4.1 Why OLS?
3.5 Estimating \(\sigma^2\)
3.5.1 Not Regression: Estimating Sample Variance
3.5.2 Estimating \(\sigma^2\) in SLR
3.5.3 Estimating \(\sigma^2\) in R
3.6 Exercises
4 Inference on \(\beta\) in Simple Linear Regression
4.1 Inference Goals
4.2 Standard Error of \(\beta_1\)
4.2.1 Sampling Distribution of \(\hat\beta_1\)
4.2.2 Standard Errors
4.3 Not Regression: One-sample \(t\)-test
4.4 \(p\)-values
4.5 Hypothesis Testing for \(\beta_1\)
4.6 Other forms of hypothesis tests for \(\beta_1\)
4.6.1 Testing against values other than 0
4.6.2 One-sided tests
4.7 Confidence Intervals (CIs)
4.7.1 Definition and Interpretation
4.7.2 Inverting a Hypothesis Test
4.8 CIs for \(\beta_1\)
4.8.1 Confidence Intervals “by hand” in R
4.8.2 Confidence Intervals in R
4.9 Summarizing Inference for \(\beta_1\)
4.10 Inference for \(\beta_0\)
4.10.1 Hypothesis Testing for \(\beta_0\)
4.10.2 CIs for \(\beta_0\)
4.11 Exercises
5 Inference and Prediction for the Mean Response in SLR
5.1 Estimating Mean & Predicting Observations
5.1.1 Estimation or Prediction?
5.1.2 Computing \(\hat\mu_{Y|x_0}\) and \(\hat{y}_0\)
5.2 Inference for the mean (\(\mu_{Y|x_0}\))
5.2.1 CIs and Testing
5.2.2 Variance of \(\hat\mu_{y|x_0}\)
5.2.3 Computing the CI for Mean Response
5.2.4 Plotting CI for Mean Response in R
5.3 Prediction Intervals
5.3.1 PI for New Observation
5.3.2 Calculating a PI
6 Inference for the SLR Model
6.1 Sum of Squares Decomposition
6.2 Coefficient of Determination (\(R^2\))
6.3 F-Test for Regression
6.3.1 F-test in R “by hand”
6.3.2 F-test in R
6.4 ANOVA Table
7 Centering and Scaling in SLR
7.1 Regression with Centered \(x\)
7.2 Rescaling Units
7.2.1 Rescaling \(x\)
7.2.2 Rescaling \(Y\)
7.3 Exercises
8 The Multiple Linear Regression (MLR) Model
8.1 Multiple Linear Regression
8.2 MLR Model 1: One continuous and one binary predictor
8.3 MLR Model 2: Two continuous predictors
8.4 Interpreting \(\beta_j\) in the general MLR model
8.5 Linear Combinations of \(\beta_j\)’s
8.5.1 Computing Mean Values
8.5.2 Computing Differences
8.6 Exercises
9 Parameter Estimation in MLR
9.1 Estimation of \(\beta\) in MLR
9.2 Matrix form of the MLR model
9.2.1 Random Vectors
9.2.2 Matrix form of the MLR model
9.2.3 Assumptions of MLR model
9.3 Estimation of \(\beta\) in MLR (Matrix form)
9.3.1 Relationship to SLR
9.4 Why must X be full-rank?
9.5 Fitting MLR in R
9.6 Properties of OLS Estimators
9.7 Fitted values for MLR
9.8 Hat Matrix \(\mathbf{H}\)
9.9 Estimating \(\sigma^2\)
9.10 Estimating Var\((\hat{\boldsymbol{\beta}})\)
10 Inference in MLR
10.1 What kind of hypothesis test?
10.2 Photosynthesis Data
10.3 Hypothesis Tests for \(\beta_j\)
10.3.1 Scientific vs. Statistical Question
10.3.2 General Form of \(H_0\)
10.4 Confidence Intervals for \(\beta_j\)
10.5 Testing for Significance of Regression (Global F-Test)
10.5.1 Scientific vs. Statistical Question
10.5.2 Global F-test
10.5.3 ANOVA Table
10.6 Testing for Subsets of Coefficients
10.6.1 Scientific vs. Statistical Question
10.6.2 Partial F-test
10.7 Adjusted \(R^2\)
10.8 Estimation and Prediction of the Mean
10.8.1 Confidence Intervals for the Mean
10.8.2 Prediction Intervals for New Observations
10.9 Exercises
11 Indicators and Interactions
11.1 Categorical Variables
11.2 Indicator Variables
11.3 Indicators in R
11.4 Testing Categorical Variables
11.5 Interactions in Regression
11.6 Interpreting Parameters in Interactions
11.7 Interactions in R
11.8 Interactions with 2 Continuous Variables
11.9 More Complicated Interactions
12 Assessing Model Assumptions
12.1 Residuals
12.1.1 Raw Residuals
12.1.2 Scaled Residuals
12.1.3 Standardized Residuals
12.1.4 Studentized Residuals
12.1.5 Residuals Summary
12.2 MLR Model Assumptions
12.2.1 Assumption 1: \(\mathrm{E}[\boldsymbol{\epsilon}] = \mathbf{0}\)
12.2.2 Assumption 2: \(\mathrm{Var}(\epsilon_i) = \sigma^2 \text{ for } i=1, \dots, n\)
12.2.3 Assumption 3: \(\mathrm{Cov}(\epsilon_i, \epsilon_j) = 0 \text{ for } i\ne j\)
12.2.4 Assumption 4: \(\mathbf{X}\) is full-rank.
12.2.5 Assumption 5: \(\epsilon_i\) are approximately normally distributed.
12.2.6 Examples
13 Unusual Observations
13.1 Leverage
13.1.1 Impact of high leverage points
13.1.2 Quantifying Leverage
13.2 Influence
13.2.1 Quantifying Influence
13.2.2 Cook’s D
13.2.3 \(DFBETAS\)
13.3 Influence Measures in R
13.4 What to do?
14 Model Transformations
14.1 Why transform?
14.2 Outcome (\(Y\)) Transformations
14.2.1 Interpreting a Model with \(\log Y\)
14.2.2 Other transformations
14.3 Transformations on \(X\)
14.3.1 Transformations on \(X\)
14.3.2 Log Transforming \(x\)’s
14.4 Log-transforming \(x\) and \(Y\)
14.5 Polynomial Models
14.5.1 Polynomial Regression
14.5.2 Estimating Polynomial Models
14.5.3 Testing in Polynomial Models
14.5.4 Predicting Polynomial Models
14.6 Splines
14.7 Exercises
15 Multicollinearity and Shrinkage
15.1 Multicollinearity
15.2 Dealing with Multicollinearity
15.3 MSE Decomposition
15.4 Ridge Regression
15.4.1 Ridge Regression in R
15.5 LASSO
15.5.1 LASSO in R
16 Model Selection for Prediction
16.1 Variable Selection & Model Building
16.2 Prediction Error
16.3 Model Selection & Evaluation Strategy
16.3.1 Synthetic Data Example
16.3.2 Candidate Models
16.3.3 Linear Model Fit
16.3.4 Quadratic Model Fit
16.3.5 Cubic Model Fit
16.3.6 Logarithmic Model Fit
16.3.7 Comparing Model Fits
16.3.8 Test Set Evaluation
16.3.9 Model Selection for Prediction Recap
16.4 Cross-validation
16.4.1 Cross-validation
16.4.2 Example: Predicting Cereal Ratings
16.4.3 CV using caret
16.4.4 Cross-validation Summary
16.5 AIC & Information Criteria
16.6 Choosing Models to Compare
16.7 Other Types of Models
17 Model Selection for Association
17.1 Model Misspecification – Mathematical consequences
17.1.1 Correct model
17.1.2 Not including all variables
17.1.3 Including too many variables
17.2 Confounding, Colliders, and DAGs
17.2.1 Confounders
17.2.2 Directed Acyclic Graphs (DAGs)
17.2.3 Accounting for Confounders
17.2.4 Confounding Example: FEV in Children
17.2.5 Confounding & Randomized Experiments
17.2.6 Colliders
17.3 Confirmatory v. Exploratory
17.3.1 Association Study Goals
17.3.2 Confirmatory Analyses
17.3.3 Exploratory Analyses
17.3.4 Model Selection Process – Confirmatory Analysis
17.3.5 Model Selection Process – Exploratory Analysis
17.3.6 Identifying a Statistical Model
17.4 Variable Selection
17.4.1 What not to do
17.4.2 Predictor of Interest
17.4.3 Adjustment Variables
17.4.4 Things to Consider
17.5 Exercises
18 Logistic Regression: Introduction
18.1 Logistic and Logit Functions
18.1.1 Logistic Function
18.1.2 Logistic Regression
18.1.3 Odds
18.2 Logistic Regression in R
18.2.1 Fitting Logistic Regression in R
18.3 Risk Ratios and Odds Ratios
18.3.1 Calculating Quantities in Logistic Regression
18.3.2 Calculating Risk of CHD
18.3.3 Calculating the Odds
18.4 Interpreting \(\beta_1\) and \(\beta_0\)
18.4.1 Interpreting \(\beta_1\)
18.4.2 Interpreting \(\beta_1\): CHD Example
18.4.3 Interpreting \(\beta_0\)
18.4.4 Calculating Odds Ratios and Probabilities in R
18.5 Multiple Logistic Regression
18.5.1 Coefficient Interpretation
19 Inference in Logistic Regression
19.1 Maximum Likelihood
19.1.1 Likelihood function
19.1.2 Maximum Likelihood in Logistic Regression
19.2 Hypothesis Testing for \(\beta\)’s
19.2.1 Likelihood Ratio Test (LRT)
19.2.2 LRT in R
19.2.3 Hypothesis Testing for \(\beta\)’s
19.2.4 Wald Test
19.2.5 Score Test
19.3 Interval Estimation
19.3.1 Wald Confidence Intervals
19.3.2 Profile Confidence Intervals
19.4 Generalized Linear Models (GLMs)
20 Appendix: Random Variables
20.1 Expected value
20.2 Variance
References
Introduction to Regression Analysis in R
References
Horst, Allison Marie, Alison Presmanes Hill, and Kristen B Gorman. 2020. Palmerpenguins: Palmer Archipelago (Antarctica) Penguin Data. https://allisonhorst.github.io/palmerpenguins/.