Chapter 2 Simple Linear Regression
\(\newcommand{\E}{\mathrm{E}}\) \(\newcommand{\Var}{\mathrm{Var}}\)
2.1 The Simple Linear Regression (SLR) Model
2.1.1 Goal of SLR
In simple linear regression (SLR), our goal is to find the best-fitting straight line, commonly called the regression line, through a set of paired \((x, y)\) data. The line should go through the “middle” of the data and represent the “average” trend in the outcome variable as a function of the predictor variable. Throughout this chapter and the next, we will look at how to more precisely define what the line represents, how to interpret it in context, and how to estimate it.
Example 2.1 In Example 1.1, we introduced data on penguin flipper length and body mass. The following plot shows the data and the corresponding linear regression line:
2.1.2 Theoretical SLR model
The general form for the equation of a line includes a slope and intercept. For regression, we use that form but also add an error term that accounts for random deviation around the line.
There are two complementary ways to think about the SLR model. The first is as a theoretical mathematical model for random variables, and the second is as a calculated line for a specific dataset. Here we first introduce the theoretical model. The calculated line for a specific dataset is discussed in Section 2.2. Later chapters will present both together.
Let \(i\) be an index for each observation (penguins, people, units, etc.). The theoretical equation for the Simple Linear Regression (SLR) model is:
\[\begin{equation} Y_i = \beta_0 + \beta_1x_i + \epsilon_i \tag{2.1} \end{equation}\]
where:
- \(Y_i\) is a random variable representing the outcome variable (e.g., body mass)
- \(x_i\) is a fixed predictor variable (e.g., flipper length)
- \(\beta_0\) is a parameter representing the intercept
- \(\beta_1\) is a parameter representing the slope
- \(\epsilon_i\) is a random variable representing variation (the “error” in the model)
These five elements can be categorized as three different kinds of quantities:
- Random variables (\(Y_i\) and \(\epsilon_i\)): Random quantities that are not directly observed. For a given dataset, the corresponding observed values will be denoted \(y_i\) and \(e_i\) (see Section 2.2).
- Parameters (\(\beta_0\) and \(\beta_1\)): Numbers that are fixed but unknown. These values are the same for all observations in the model. Estimating the values of these parameters (the topic of Chapter 3) is the basic task in “fitting” a regression model.
- Observed data (\(x_i\)): Values that are fixed and known. Each observation can have a different value of \(x_i\), or there could be as few as two distinct values of \(x_i\). Even though these might vary between datasets, the SLR model considers them fixed. SLR refers specifically to settings where there is only one predictor variable. In Chapter 8 we will extend this to multiple predictor variables.
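To make these three kinds of quantities concrete, here is a minimal simulation sketch in Python. The parameter values are hypothetical (loosely inspired by the penguin example), and normal errors are used purely for convenience; the SLR model itself does not require normality.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

# Parameters: fixed but unknown in practice (hypothetical values here)
beta0, beta1 = -5781.0, 50.0   # intercept and slope
sigma = 400.0                  # standard deviation of the error term

# Observed data: predictor values x_i, treated as fixed and known
x = rng.uniform(170, 230, size=n)   # e.g., flipper lengths in mm

# Random variables: errors epsilon_i with mean 0 and constant variance sigma^2
# (normal errors are one convenient choice; the model does not require it)
eps = rng.normal(0, sigma, size=n)

# The SLR model, equation (2.1): Y_i = beta0 + beta1 * x_i + epsilon_i
y = beta0 + beta1 * x + eps
```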
2.1.3 SLR Model Assumptions
The assumptions corresponding to the SLR model (equation (2.1)) are:
- \(\E[\epsilon_i] = 0\)
    - The error terms have mean zero. This allows the \(\epsilon_i\) to account for variability above and below the line.
- \(\Var[\epsilon_i] = \sigma^2\)
    - The error terms have constant variance. (If the variance were different for each observation, we would instead write \(\Var[\epsilon_i] = \sigma_i^2\).)
- \(\epsilon_i\) are uncorrelated
    - Each observation gives you new information. This usually means that each observation is independent.
For now, the key assumption is \(\E[\epsilon_i] = 0\). Importantly, we do not require an assumption about normality of the error term. We will discuss these assumptions in detail, including what happens when they are violated and how to check whether they are satisfied, in Chapter 12.
2.2 Random Variables v. Data
The theoretical SLR model (2.1) is an equation for the random variables \(Y_i\). In practice, we observe data \(y_i\) and estimate a line that has an estimated slope and intercept.
In other words, real data are not generated from the theoretical model \[Y_i = \beta_0 + \beta_1x_i + \epsilon_i.\] But for a dataset with specific values \((x_i, y_i)\), we can write an estimated version of the model that describes the data:
\[\begin{equation} y_i = \hat\beta_0 + \hat\beta_1x_i + e_i \tag{2.2} \end{equation}\]
In equation (2.2):
- \(y_i\) is the observed value of the outcome for observation \(i\)
- \(e_i\) is the residual value corresponding to observation \(i\)
- \(\hat\beta_0\) is the estimated intercept for the regression line
- \(\hat\beta_1\) is the estimated slope for the regression line
- \(x_i\) is the observed value of the predictor variable for observation \(i\)
The “hats” in \(\hat\beta_0\) and \(\hat\beta_1\) indicate that the values are estimates of the true parameters \(\beta_0\) and \(\beta_1\). The difference between the theoretical model and the estimated model can be summarized in the following table:
| Theoretical/Math World | Real World Data |
|---|---|
| \(Y_i =\) outcome (a random variable) | \(y_i =\) outcome (a known number) |
| \(x_i =\) predictor (a known number) | \(x_i =\) predictor (a known number) |
| \(\epsilon_i =\) error (a random variable) | \(e_i =\) residual (a known number) |
| \(\beta_0 =\) intercept parameter (an unknown number) | \(\hat\beta_0 =\) intercept estimate (a calculated number) |
| \(\beta_1 =\) slope parameter (an unknown number) | \(\hat\beta_1 =\) slope estimate (a calculated number) |
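To see how the quantities in the right column arise in practice, here is a minimal sketch in Python using simulated data (not the penguin measurements); the least squares formulas it uses are developed in Chapter 3.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data standing in for a real dataset (hypothetical values)
x = rng.uniform(170, 230, size=100)                  # predictor (fixed, known)
y = -5781 + 50 * x + rng.normal(0, 400, size=100)    # observed outcomes

# Least squares estimates (formulas developed in Chapter 3)
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# Residuals, as in equation (2.2): e_i = y_i - (beta0_hat + beta1_hat * x_i)
e = y - (beta0_hat + beta1_hat * x)

print(beta0_hat, beta1_hat)  # calculated estimates of the unknown beta0, beta1
print(e.mean())              # approximately 0 for least squares with an intercept
```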
2.3 Parameter Interpretation
2.3.1 Regression Line Equation
A consequence of Assumption 1 (\(\E[\epsilon_i] = 0\)) is that the mean of \(Y_i\) is a straight line:
\[\begin{align} \E[Y_i] &= \E[\beta_0 + \beta_1x_i + \epsilon_i] \notag\\ &= \E[\beta_0 + \beta_1x_i] + \E[\epsilon_i]\notag\\ &= \beta_0 + \beta_1x_i + 0 \notag\\ &= \beta_0 + \beta_1x_i \tag{2.3} \end{align}\]
This allows us to interpret the parameters \(\beta_0\) and \(\beta_1\) in terms of the slope and intercept of the line for the average value of the outcome.
In this section only, we assume that \(\beta_0\), \(\beta_1\), and \(\sigma^2\) are known. Starting in Chapter 3, we will introduce estimates of these parameters and start including “estimated” in interpretations.
2.3.2 Interpreting \(\beta_1\)
The \(\beta_1\) parameter represents the slope of the regression line. In other words, \(\beta_1\) is the difference in \(\E[Y_i]\) between observations that differ in \(x_i\) by one unit.
In almost any data analysis project, an important task is providing an interpretation of the model parameters. When describing the interpretation of \(\beta_1\), it is important to include these key elements:
- An interpretation of \(\beta_1\) as an estimated difference in the average value of the outcome variable
- For linear regression, “difference in average” is the same as “average difference”. You can use either phrase.
- Specify that this is for a 1-unit difference in the predictor variable (x).
- Be cautious about referring to \(\beta_1\) as an “increase” or a “decrease”. Those words imply that an intervention was conducted, in which the value of \(x\) was directly modified. This is possible in controlled experiments, but is not the case for observational datasets. To indicate direction, it can be helpful to use words such as “greater”, “lower”, or “higher”.
- Include confidence intervals (see Section 4.8)
- If appropriate, include information about the conclusion from a hypothesis test (see Section 4.5)
- Include units for both the outcome and predictor
- If known, include a statement about the population/context.
Example 2.2 Consider two groups of penguins:
- Penguins that have flipper lengths of 201mm. Call this “Group A”
- Penguins that have flipper lengths of 200mm. Call this “Group B”
Assuming the SLR model (2.1), what is the difference in average body mass between these groups of penguins? To answer this, first write out the regression equation for the mean body mass in each group:
\[\text{Group A:} \quad \E[Y_A] = \beta_0 + \beta_1*201\] \[\text{Group B:} \quad \E[Y_B] = \beta_0 + \beta_1*200\]
Then take the difference between them: \[\begin{align*} \E[Y_A] - \E[Y_B] &= \left(\beta_0 + \beta_1*201 \right) - \left(\beta_0 + \beta_1*200\right)\\ & = 201\beta_1 - 200\beta_1\\ &= \beta_1 \end{align*}\]
Possible summarizing sentences for \(\beta_1\):
- \(\beta_1\) is the difference in average body mass (in g) for penguins that differ in flipper length by 1 mm.
- The average difference in body mass for penguins that have 1 mm longer flippers is \(\beta_1\) grams.
- The difference in average body mass for penguins that differ in flipper length by one millimeter is \(\beta_1\) grams.
We will expand upon these sentences in Examples 2.8 and 4.7.
2.3.3 Interpreting \(c\beta_1\)
Because \(\beta_1\) is the slope of the regression line, we can easily interpret multiples of \(\beta_1\), say \(c\beta_1\) for some \(c \in \mathbb{R}\), as the difference in the average value of the outcome variable corresponding to a difference of \(c\) units in the predictor variable.
Example 2.3 In Example 2.2, the value \(10\beta_1\) can be summarized as: the difference in average body mass (in g) for penguins that differ in flipper length by 10 mm.
Example 2.4 In Example 2.2, the value \(25\beta_1\) can be summarized as: the difference in average body mass (in g) for penguins that differ in flipper length by 2.5 cm.
Interpreting multiples of \(\beta_1\) is closely related to the idea of re-scaling the predictor \(x\), which is covered in Section 7.2.
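As a quick numerical sketch of that connection (simulated data with hypothetical values): re-expressing the predictor in different units rescales the slope by the corresponding constant.

```python
import numpy as np

rng = np.random.default_rng(2)
x_mm = rng.uniform(170, 230, size=100)                  # flipper length (mm)
y = -5781 + 50 * x_mm + rng.normal(0, 400, size=100)    # simulated body mass (g)

def slope(x, y):
    """Least squares slope estimate."""
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

b_per_mm = slope(x_mm, y)        # slope in g per mm
b_per_cm = slope(x_mm / 10, y)   # slope in g per cm (predictor rescaled)

print(b_per_cm, 10 * b_per_mm)   # identical: c*beta1 is the slope per c units
```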
2.3.4 Interpreting \(\beta_0\)
The \(\beta_0\) parameter represents the intercept of the regression line. In other words, \(\beta_0\) is the average value of \(Y_i\) (i.e. \(\E[Y_i]\)) for observations with an \(x\) value of 0. While mathematically useful, \(\beta_0\) might make no practical or scientific sense!
When describing the interpretation of \(\beta_0\), follow all of the same guidelines as for \(\beta_1\) above, with the following exceptions:
- Interpret it as the average value of the outcome variable (as opposed to a difference in the average value).
- Specify that it corresponds to a value of 0 for \(x\), rather than a 1-unit difference in \(x\).
Example 2.5 Consider a third group of penguins:
- Penguins with flipper length of 0 mm. Call this “Group C”
\[\text{Group C:} \quad \E[Y_C] = \beta_0 + \beta_1*0 = \beta_0\]
Equivalent interpretation statements for \(\beta_0\) in this context:
- \(\beta_0\) is the average body mass (in g) for penguins that have a flipper length of 0 mm.
- The average body mass for penguins that have 0 mm long flippers is \(\beta_0\) grams.
2.3.5 Interpreting \(\beta_0 + \beta_1x_i\)
From Equation (2.3), the points on the regression line can be interpreted as the average value of the outcome variable for units with a particular value of \(x\).
Example 2.6 Consider penguins in Group A (from Example 2.2). In terms of the parameters in the simple linear regression model, what is their average body mass?
Since penguins in Group A have flipper lengths of \(x_i = 201\) mm, their average body mass is given by:
\[\E[Y_i | x_i = 201] = \beta_0 + 201\beta_1.\]
In Chapter 5, we will see more about how to interpret this value and about the difference between estimating a mean and predicting a new observation.
2.3.6 Interpretation of \(\sigma^2\)
The parameter \(\sigma^2\) represents the variance of the data around the regression line:
\[\begin{align*} \Var(Y_i) &= \Var(\beta_0 + \beta_1x_i + \epsilon_i)\\ &= \Var(\beta_0 + \beta_1x_i) + \Var(\epsilon_i)\\ &= 0 + \sigma^2\\ &= \sigma^2 \end{align*}\]
- A large value of \(\sigma^2\) means that data are more spread out vertically around the line.
- A small value of \(\sigma^2\) means that data are vertically close to the line.
The following two plots show simulated data. The left panel shows data generated from a model with \(\sigma^2 = 10\), and the right panel shows data from a model with \(\sigma^2 = 1\).
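As a sketch of how such data could be simulated (using a hypothetical regression line \(y = 1 + 2x\); a plotting library could then draw the two panels):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)

# Same regression line, two different error variances
y_large = 1 + 2 * x + rng.normal(0, np.sqrt(10), size=x.size)  # sigma^2 = 10
y_small = 1 + 2 * x + rng.normal(0, np.sqrt(1), size=x.size)   # sigma^2 = 1

# Vertical deviations from the true line reflect sigma^2
print(np.var(y_large - (1 + 2 * x)))  # close to 10
print(np.var(y_small - (1 + 2 * x)))  # close to 1
```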
2.3.7 Interpreting \(\beta_1\) with binary \(x\)
Although the most common setting of the simple linear regression model is with a continuous predictor variable (such as flipper length), the model can equally be applied when the predictor variable is binary, meaning it takes on two values.
If \(x_i\) can take on two values, then instead of a line, the SLR model simply becomes two distinct average values for the two groups. Suppose for one group, \(x = 0\), and for the other group, \(x = 1\). A variable defined this way is called an indicator variable, since it serves as a binary marker of group membership. The two groups could be anything, such as male/female, home/away, automatic/manual, etc. (We will see how to use indicators for variables with 3 or more categories in Chapter 11.)
Then the two “lines” for the model are: \[\begin{align*} \E[Y_i | x_i = 0] &= \beta_0\\ \E[Y_i | x_i = 1] &= \beta_0 + \beta_1*1 = \beta_0 + \beta_1 \end{align*}\] The average value of the outcome for those with \(x_i=0\) is \(\beta_0\) and the average value of the outcome for those with \(x_i =1\) is \(\beta_0 + \beta_1\). Furthermore, the difference between these values is: \[\begin{align*} \E[Y_i | x_i = 1] - \E[Y_i | x_i = 0] &= (\beta_0 + \beta_1) - (\beta_0) \\ &= \beta_1 \end{align*}\] Thus, the interpretations of \(\beta_0\) and \(\beta_1\) are:
- \(\beta_0\) is the average value of \(Y\) among observations in the group with \(x=0\)
- \(\beta_1\) is the difference in average value of \(Y\) between observations with \(x=1\) and those with \(x=0\).
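These identities can be checked numerically; here is a minimal sketch with simulated data and hypothetical group means.

```python
import numpy as np

rng = np.random.default_rng(3)

# Binary indicator variable: 0 for one group, 1 for the other
x = rng.integers(0, 2, size=200)

# Hypothetical group means: 3400 (x=0) and 4000 (x=1), so beta1 = 600
y = 3400 + 600 * x + rng.normal(0, 300, size=200)

# Least squares estimates
beta1_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0_hat = y.mean() - beta1_hat * x.mean()

# With a binary x, the fitted model reproduces the two group means exactly:
print(beta0_hat, y[x == 0].mean())                     # both: group-0 mean
print(beta1_hat, y[x == 1].mean() - y[x == 0].mean())  # both: difference in means
```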
Example 2.7 Consider the penguin data again, but now we model body mass (the outcome) as a function of penguin sex (the predictor variable). Define the indicator variable
\[\begin{equation} x = \begin{cases} 0 & \text{ if } \texttt{sex} \text{ is } \texttt{"female"} \\ 1 & \text{ if } \texttt{sex} \text{ is } \texttt{"male"} \end{cases}. \tag{2.4} \end{equation}\] Then in the SLR model \(Y_i = \beta_0 + \beta_1x_i + \epsilon_i\), we have the following interpretations for the regression parameters:
- \(\beta_0\) is the average body mass for female penguins.
- \(\beta_0 + \beta_1\) is the average body mass for male penguins.
- \(\beta_1\) is the difference in average body mass between male and female penguins.
2.4 Putting it together: Penguin example
Example 2.8 In the example of penguin flipper length and body mass, the equation for the estimated regression line is: \[y = -5780.8 + 49.7x\] (We will see how these numbers are calculated in Chapter 3.) We can interpret the slope of this regression line as follows:
A one mm greater flipper length is associated with an estimated 49.7 g greater average body mass among penguins in Antarctica.
Note the key elements in this sentence:
- “average” – The linear regression model tells us about the mean, not a specific observation
- “estimated” – The number 49.7 is an estimate of the true (and unknown) parameter value.
- Units are provided for flipper length and body mass.
- “among penguins in Antarctica” – Context and/or population
- This is observational data, so the relationship is stated as an association, not causation.
The residuals for each observation can be calculated as the difference between each data point and its corresponding modeled mean: \[ e_i = y_i - (\hat\beta_0 + \hat\beta_1x_i)\] Figure 2.3 shows a graphical representation of the residuals \(e_i\).
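As a small worked sketch of this calculation (the observation below is hypothetical, not taken from the actual data):

```python
# Estimated intercept and slope from Example 2.8
beta0_hat, beta1_hat = -5780.8, 49.7

# Hypothetical observation: flipper length 200 mm, body mass 4300 g
x_i, y_i = 200, 4300

fitted = beta0_hat + beta1_hat * x_i   # modeled mean: 4159.2 g
e_i = y_i - fitted                     # residual: 140.8 g
print(fitted, e_i)
```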
2.5 Exercises
Exercise 2.1 In the context of Example 2.2, what is an interpretation of \(5\beta_1\)?
Exercise 2.2 In the context of Example 2.5, does the quantity \(10\beta_0\) make practical sense? If yes, provide an interpretation. If no, explain why not.
Exercise 2.3 Consider a variation of Example 2.2, in which a simple linear regression model is fit with body mass (in g) as the predictor variable and flipper length (in mm) as the outcome. Explain how this switch affects the interpretations of \(\beta_0\), \(\beta_1\), and \(\sigma^2\).
Exercise 2.4 Consider an SLR model with flipper length as the outcome and sex as the predictor, with \(x\) defined as in equation (2.4). What are the interpretations of \(\beta_0\) and \(\beta_1\)?
Exercise 2.5 In the SLR model of Example 2.7, what is the difference in average body mass between female and male penguins? (Note that the order of the groups is important here.)