Chapter 2 Simple Linear Regression
2.1 The Simple Linear Regression (SLR) Model
2.1.1 Goal of SLR
In simple linear regression (SLR), our goal is to find the best-fitting straight line, commonly called the regression line, through a set of paired (x,y) data. The line should go through the “middle” of the data and represent the “average” trend in the outcome variable as a function of the predictor variable. Throughout this chapter and the next, we will look at how to more precisely define what the line represents, how to interpret it in context, and how to estimate it.
Example 2.1 In Example 1.1, we introduced data on penguin flipper length and body mass. The following plot shows the data and the corresponding linear regression line:
2.1.2 Theoretical SLR model
The general form for the equation of a line includes a slope and intercept. For regression, we use that form but also add an error term that accounts for random deviation around the line.
There are two complementary ways to think about the SLR model. The first is as a theoretical mathematical model for random variables, and the second is as a calculated line for a specific dataset. Here we first introduce the theoretical model. The calculated line for a specific dataset is discussed in Section 2.2. Later chapters will present both together.
Let i be an index for each observation (penguins, people, units, etc.). The theoretical equation for the Simple Linear Regression (SLR) model is:
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i \tag{2.1}$$
where:
Yi is a random variable representing the outcome variable (e.g. body mass)
xi is a fixed predictor variable (e.g., flipper length)
β0 is a parameter representing the intercept
β1 is a parameter representing the slope
ϵi is a random variable representing variation (the “error” in the model)
These five elements can be categorized as three different kinds of quantities:
- Random variables (Yi and ϵi): Random quantities that are not directly observed. For a given dataset, the corresponding observed values will be denoted yi and ei (see Section 2.2).
- Parameters (β0 and β1): Numbers that are fixed but unknown. These values are the same for all observations in the model. Estimating the values of these parameters (the topic of Section 3) is the basic task in “fitting” a regression model.
- Observed data (xi): Values that are fixed and known. Each observation can have a different value of xi, or there could be as few as two distinct values of xi. Even though these might vary between datasets, the SLR model considers them fixed.
SLR refers specifically to settings where there is only one predictor variable. In Chapter 8 we will extend this to multiple predictor variables.
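The three kinds of quantities can be made concrete with a small simulation. The following is a minimal sketch, not part of the original text: the parameter values, the x values, and the helper name `draw_outcomes` are all hypothetical, and normally distributed errors are used only as a convenient way to generate mean-zero noise.

```python
import random

# Hypothetical parameter values, chosen only for illustration.
BETA0 = 2.0    # intercept (a parameter: fixed, unknown in practice)
BETA1 = 0.5    # slope (a parameter: fixed, unknown in practice)
SIGMA = 1.0    # standard deviation of the error term

# Observed data: predictor values that are fixed and known.
x = [1.0, 2.0, 3.0, 4.0, 5.0]


def draw_outcomes(x_values, rng):
    """Draw one realization of Y_i = BETA0 + BETA1 * x_i + eps_i for each x_i."""
    outcomes = []
    for x_i in x_values:
        eps_i = rng.gauss(0.0, SIGMA)  # random error with mean zero
        outcomes.append(BETA0 + BETA1 * x_i + eps_i)
    return outcomes


rng = random.Random(2024)
first_draw = draw_outcomes(x, rng)
second_draw = draw_outcomes(x, rng)

# Same x values and same parameters, but different outcomes in each draw.
for x_i, y1, y2 in zip(x, first_draw, second_draw):
    print(f"x = {x_i:.1f}   draw 1: {y1:6.3f}   draw 2: {y2:6.3f}")
```

The x values and the parameters are identical in both draws; only the random variables change.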
2.1.3 SLR Model Assumptions
The assumptions corresponding to the SLR model (equation (2.1)) are:
- Assumption 1: E[ϵi]=0. The error terms have mean zero. This allows the ϵi to account for variability above and below the line.
- Assumption 2: Var[ϵi]=σ². The error terms have constant variance. (If the variance were different for each observation, we would instead write Var[ϵi]=σi², with a separate variance for each i.)
- Assumption 3: The ϵi are uncorrelated. Each observation gives you new information. This usually means that the observations are independent.
For now, the key assumption is E[ϵi]=0. Importantly, we do not require an assumption about normality of the error term. We will discuss these assumptions in detail, including what happens when they are violated and how to check if they are satisfied, in Chapter 12.
2.2 Random Variables v. Data
The theoretical SLR model (2.1) is an equation for the random variables Yi. In practice, we observe data yi and estimate a line that has an estimated slope and intercept.
In other words, real data are not generated from the theoretical model Yi=β0+β1xi+ϵi. But for a dataset with specific values (xi,yi), we can use the theoretical model to describe the data:
$$y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i \tag{2.2}$$
In equation (2.2):
- yi is the observed value of the outcome for observation i
- ei is the residual value corresponding to observation i
- ˆβ0 is the estimated intercept for the regression line
- ˆβ1 is the estimated slope for the regression line
- xi is the observed value of the predictor variable for observation i
The “hats” in ˆβ0 and ˆβ1 indicate that the values are estimates of the true parameters β0 and β1. The differences between the theoretical model and the estimated model are summarized in the following table:
Theoretical/Math World | Real World Data |
---|---|
Yi= outcome (a random variable) | yi= outcome (a known number) |
xi= predictor (a known number) | xi= predictor (a known number) |
ϵi= error (a random variable) | ei= residual (a known number) |
β0= intercept parameter (an unknown number) | ˆβ0= intercept estimate (a calculated number) |
β1= slope parameter (an unknown number) | ˆβ1= slope estimate (a calculated number) |
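
The two columns of the table can be compared directly in a simulation, where the "math world" quantities are known because we choose them. This is a minimal sketch under stated assumptions: the parameter values are hypothetical, and the estimates are computed with ordinary least squares, which is one standard choice (the chapter itself defers estimation to Section 3).

```python
import random

rng = random.Random(7)

# "Math world": hypothetical true parameters, known here only because we simulate.
beta0, beta1, sigma = 10.0, 2.0, 3.0
x = [float(v) for v in range(1, 21)]                     # fixed predictor values
eps = [rng.gauss(0.0, sigma) for _ in x]                 # errors (never observed in practice)
y = [beta0 + beta1 * xi + e for xi, e in zip(x, eps)]    # observed outcomes

# "Real world": estimates computed from (x, y) alone, here by least squares.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals use the estimated line; errors use the true line.
resid = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]

print(f"beta0 = {beta0:.2f}   beta0_hat = {beta0_hat:.2f}")
print(f"beta1 = {beta1:.2f}   beta1_hat = {beta1_hat:.2f}")
print(f"first error = {eps[0]:.3f}   first residual = {resid[0]:.3f}")
```

The estimates land near the true values but do not equal them, and each residual ei is close to, but not the same as, the corresponding error ϵi.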
2.3 Parameter Interpretation
2.3.1 Regression Line Equation
A consequence of Assumption 1 (E[ϵi]=0) is that the mean of Yi is a straight line:
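$$E[Y_i] = E[\beta_0 + \beta_1 x_i + \epsilon_i] = E[\beta_0 + \beta_1 x_i] + E[\epsilon_i] = \beta_0 + \beta_1 x_i + 0 = \beta_0 + \beta_1 x_i \tag{2.3}$$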
This allows us to interpret the parameters β0 and β1 in terms of the slope and intercept of the line for the average value of the outcome.
In this section only, we are assuming that β0, β1, and σ² are known. Starting in Section 3 we will introduce estimates of these parameters and start including “estimated” in interpretations.
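The result that the mean of Yi lies on the line can be checked by simulation. The sketch below uses hypothetical parameter values and deliberately draws the errors from a uniform distribution rather than a normal one, since the derivation only needs the errors to have mean zero.

```python
import random

rng = random.Random(11)

beta0, beta1 = 1.0, 3.0   # hypothetical parameter values
x_fixed = 4.0             # a single fixed predictor value
half_width = 2.0          # errors are uniform on [-2, 2], so their mean is 0

# Many independent realizations of Y at the same x.
draws = [
    beta0 + beta1 * x_fixed + rng.uniform(-half_width, half_width)
    for _ in range(100_000)
]

line_value = beta0 + beta1 * x_fixed       # height of the line at x_fixed
simulated_mean = sum(draws) / len(draws)   # average of the simulated outcomes

print(f"value on the line: {line_value:.3f}")
print(f"simulated mean:    {simulated_mean:.3f}")
```

With this many draws, the simulated mean should sit very close to the value on the line.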
2.3.2 Interpreting β1
The β1 parameter represents the slope of the regression line. In other words, β1 is the difference in E[Yi] between observations that differ in xi by one unit.
In almost any data analysis project, an important task is providing an interpretation of the model parameters. When describing the interpretation of β1, it is important to include these key elements:
- An interpretation of β1 as an estimated difference in the average value of the outcome variable
- For linear regression, “difference in average” is the same as “average difference”. You can use either phrase.
- Specify that this is for a 1-unit difference in the predictor variable (x).
- Be cautious about referring to β1 as an “increase” or a “decrease”. Those words imply that an intervention was conducted, in which the value of x was directly modified. This is possible in controlled experiments, but is not the case for observational datasets. To indicate direction, it can be helpful to use words such as “greater”, “lower”, or “higher”.
- Include confidence intervals (see Section 4.8)
- If appropriate, include information about the conclusion from a hypothesis test (see Section 4.5)
- Include units for both the outcome and predictor
- If known, include a statement about the population/context.
Example 2.2 Consider two groups of penguins:
- Penguins that have flipper lengths of 201mm. Call this “Group A”
- Penguins that have flipper lengths of 200mm. Call this “Group B”
Assuming the SLR model (2.1), what is the difference in average body mass between these groups of penguins? To answer this, first write out the regression equation for the mean body mass in each group:
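$$\text{Group A: } E[Y_A] = \beta_0 + \beta_1 \cdot 201$$
$$\text{Group B: } E[Y_B] = \beta_0 + \beta_1 \cdot 200$$
Then take the difference between them:
$$E[Y_A] - E[Y_B] = (\beta_0 + \beta_1 \cdot 201) - (\beta_0 + \beta_1 \cdot 200) = 201\beta_1 - 200\beta_1 = \beta_1$$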
Possible summarizing sentences for β1:
- β1 is the difference in average body mass (in g) for penguins that differ in flipper length by 1 mm.
- The average difference in body mass for penguins that have 1 mm longer flippers is β1 grams.
- The difference in average body mass for penguins that differ in flipper length by one millimeter is β1 grams.
We will expand upon these sentences in Examples 2.8 and 4.7.
2.3.3 Interpreting cβ1
Because β1 is the slope of the regression line, we can easily interpret a multiple of β1, say cβ1 for some c∈R, as the difference in the average value of the outcome variable for a difference of c units in the predictor variable.
Example 2.3 In Example 2.2, the value 10β1 can be summarized as: the difference in average body mass (in g) for penguins that differ in flipper length by 10 mm.
Example 2.4 In Example 2.2, the value 25β1 can be summarized as: the difference in average body mass (in g) for penguins that differ in flipper length by 2.5 cm.
Interpreting multiples of β1 is closely related to the idea of re-scaling the predictor x, which is covered in Section 7.2.
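The differences in Examples 2.2 through 2.4 can be checked numerically. The sketch below is illustrative only: it borrows the estimates reported in Example 2.8 as stand-in values for the parameters, and the function name `mean_body_mass` is hypothetical.

```python
# Stand-in parameter values borrowed from Example 2.8 (they are estimates there;
# here they are treated as known purely for illustration).
BETA0 = -5780.8   # intercept, in g
BETA1 = 49.7      # slope, in g per mm


def mean_body_mass(flipper_mm):
    """Average body mass (g) on the regression line at a given flipper length (mm)."""
    return BETA0 + BETA1 * flipper_mm


# A difference of c mm in flipper length corresponds to c * BETA1 g in average mass.
for c in (1, 10, 25):
    diff = mean_body_mass(200 + c) - mean_body_mass(200)
    print(f"c = {c:2d} mm   difference in average mass = {diff:.1f} g   c * beta1 = {c * BETA1:.1f} g")
```

The same differences result from any starting flipper length, not just 200 mm, because the intercept cancels in the subtraction.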
2.3.4 Interpreting β0
The β0 parameter represents the intercept of the regression line. In other words, β0 is the average value of Yi (i.e. E[Yi]) for observations with an x value of 0. While mathematically useful, β0 might make no practical or scientific sense!
When describing the interpretation of β0, follow all of the same guidelines as for β1 above, with the following exceptions:
- Interpret it as the average value of the outcome variable (as opposed to a difference in the average value).
- Specify it corresponds to a value of 0 for x, rather than a 1-unit difference in x.
Example 2.5 Consider a third group of penguins:
- Penguins with flipper length of 0 mm. Call this “Group C”
$$\text{Group C: } E[Y_C] = \beta_0 + \beta_1 \cdot 0 = \beta_0$$
Equivalent interpretation statements for β0 in this context:
- β0 is the average body mass (in g) for penguins that have a flipper length of 0 mm.
- The average body mass for penguins that have 0 mm long flippers is β0 grams.
2.3.5 Interpreting β0+β1xi
From Equation (2.3), the points on the regression line can be interpreted as the average value of the outcome variable for units with a particular value of x.
Example 2.6 Consider penguins in Group A (from Example 2.2). In terms of the parameters in the simple linear regression model, what is their average body mass?
Since penguins in Group A have flipper lengths of xi=201 mm, their average body mass is given by:
$$E[Y_i \mid x_i = 201] = \beta_0 + 201\beta_1$$
In Section 5, we will see more on how to interpret this value and the difference between estimating a mean and predicting a new observation.
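As a worked illustration, suppose the parameters took the values reported as estimates in Example 2.8 (β0=−5780.8 and β1=49.7); treating them as known is an assumption made only for this calculation. For penguins with a flipper length of 201 mm (Group A as defined in Example 2.2), the point on the line is:

$$E[Y_i \mid x_i = 201] = \beta_0 + 201\beta_1 = -5780.8 + 201(49.7) = -5780.8 + 9989.7 = 4208.9 \text{ g}$$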
2.3.6 Interpretation of σ²
The parameter σ² represents the variance of the data around the regression line:
$$\mathrm{Var}(Y_i) = \mathrm{Var}(\beta_0 + \beta_1 x_i + \epsilon_i) = \mathrm{Var}(\beta_0 + \beta_1 x_i) + \mathrm{Var}(\epsilon_i) = 0 + \sigma^2 = \sigma^2$$
- A large value of σ² means that data are more spread out vertically around the line.
- A small value of σ² means that data are vertically close to the line.
The following two plots show simulated data. The left panel shows data generated from a model with σ²=10, and the right panel shows data from a model with σ²=1.

Figure 2.1: Simulated data showing the impact of different values of σ².
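A comparison like the one in Figure 2.1 can be sketched in code. This is not the code behind the figure: the line, the x grid, the sample size, and the helper name `spread_around_line` are all hypothetical, and normal errors are assumed only for convenience.

```python
import random
import statistics

BETA0, BETA1 = 0.0, 1.0   # hypothetical intercept and slope


def spread_around_line(sigma2, n, rng):
    """Simulate n points from the SLR model with error variance sigma2 and
    return the sample variance of their vertical deviations from the true line."""
    sd = sigma2 ** 0.5
    deviations = []
    for k in range(n):
        x_i = 10.0 * k / n                               # fixed grid of predictor values
        y_i = BETA0 + BETA1 * x_i + rng.gauss(0.0, sd)   # simulated outcome
        deviations.append(y_i - (BETA0 + BETA1 * x_i))   # vertical distance to the line
    return statistics.variance(deviations)


rng = random.Random(42)
var_wide = spread_around_line(10.0, 5000, rng)
var_narrow = spread_around_line(1.0, 5000, rng)

print(f"sigma^2 = 10: sample variance around the line = {var_wide:.2f}")
print(f"sigma^2 =  1: sample variance around the line = {var_narrow:.2f}")
```

The sample variance of the vertical deviations should be close to the σ² used to generate each dataset.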
2.3.7 Interpreting β1 with binary x
Although the most common setting of the simple linear regression model is with a continuous predictor variable (such as flipper length), the model can equally be applied when the predictor variable is binary, meaning it takes on two values.
If xi can take on two values, then instead of a line, the SLR model simply becomes two distinct average values for the two groups. Suppose for one group, x=0 and for the other group, x=1. A variable defined this way is called an indicator variable since it serves as a binary marker of group membership. The two groups could be anything, such as male/female, home/away, automatic/manual, etc. (We will see how to use indicators for variables with 3 or more categories in Section 11.)
Then the two “lines” for the model are:
$$E[Y_i \mid x_i = 0] = \beta_0$$
$$E[Y_i \mid x_i = 1] = \beta_0 + \beta_1 \cdot 1 = \beta_0 + \beta_1$$
The average value of the outcome for those with xi=0 is β0 and the average value of the outcome for those with xi=1 is β0+β1. Furthermore, the difference between these values is:
$$E[Y_i \mid x_i = 1] - E[Y_i \mid x_i = 0] = (\beta_0 + \beta_1) - (\beta_0) = \beta_1$$
Thus, the interpretations of β0 and β1 are:
- β0 is the average value of Y among observations in the group with x=0
- β1 is the difference in average value of Y between observations with x=1 and those with x=0.
Example 2.7 Consider the penguin data, but now we model body mass (the outcome) as a function of penguin sex (the predictor variable). Define the indicator variable
$$x = \begin{cases} 0 & \text{if sex is "female"} \\ 1 & \text{if sex is "male"} \end{cases} \tag{2.4}$$
Then in the SLR model Yi=β0+β1xi+ϵi, we have the following interpretations for the regression parameters:
- β0 is the average body mass for female penguins.
- β0+β1 is the average body mass for male penguins.
- β1 is the difference in average body mass between male and female penguins.
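The same pattern appears in estimates computed from data. The sketch below uses made-up body masses (not the real penguin data) and assumes the line is fit by ordinary least squares, which the chapter does not introduce until Section 3. With a 0/1 indicator as the predictor, the fitted intercept equals the mean of the x=0 group and the fitted slope equals the difference in group means.

```python
# Hypothetical body masses in grams, made up for illustration only.
female_mass = [3400.0, 3550.0, 3700.0, 3800.0]
male_mass = [3900.0, 4050.0, 4200.0, 4450.0]

# Indicator variable: 0 for female, 1 for male.
x = [0.0] * len(female_mass) + [1.0] * len(male_mass)
y = female_mass + male_mass

# Least squares estimates of the intercept and slope.
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = sxy / sxx
beta0_hat = y_bar - beta1_hat * x_bar

# Group means computed directly, for comparison.
female_mean = sum(female_mass) / len(female_mass)
male_mean = sum(male_mass) / len(male_mass)

print(f"intercept estimate = {beta0_hat:.1f}   female mean = {female_mean:.1f}")
print(f"slope estimate     = {beta1_hat:.1f}   male mean - female mean = {male_mean - female_mean:.1f}")
```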
2.4 Putting it together: Penguin example
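Example 2.8 In the example of penguin flipper length and body mass, the equation for the estimated regression line is:
$$\hat{y}_i = -5780.8 + 49.7\, x_i$$
where ŷi denotes the estimated average body mass (in g) for a flipper length of xi (in mm), so that ˆβ0=−5780.8 and ˆβ1=49.7. (We will see how these numbers are calculated in Section 3.) We can interpret the slope of this regression line as follows: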
A difference of 1 mm in flipper length is associated with an estimated 49.7 g greater average body mass among penguins in Antarctica.
Note the key elements in this sentence:
- “average” – The linear regression model tells us about the mean, not a specific observation.
- “estimated” – The number 49.7 is an estimate of the true (and unknown) parameter value.
- Units are provided for flipper length and body mass.
- “among penguins in Antarctica” – Context and/or population
- This is observational data, so the relationship is stated as an association, not causation.

Figure 2.2: Fitted regression line for penguin data.
The residual for each observation can be calculated as the difference between the data point and its corresponding modeled mean:
$$e_i = y_i - (\hat{\beta}_0 + \hat{\beta}_1 x_i)$$
Figure 2.3 shows a graphical representation of the residuals ei.

Figure 2.3: Residuals for penguin data.
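The residual calculation can be sketched with the estimates from Example 2.8. The three (flipper length, body mass) pairs below are made up for illustration and are not taken from the penguin dataset; the function name `residual` is likewise hypothetical.

```python
# Estimates reported in Example 2.8.
BETA0_HAT = -5780.8   # estimated intercept, in g
BETA1_HAT = 49.7      # estimated slope, in g per mm

# Hypothetical observations: (flipper length in mm, body mass in g).
penguins = [(181.0, 3750.0), (195.0, 3650.0), (210.0, 4800.0)]


def residual(flipper_mm, mass_g):
    """Observed body mass minus the modeled mean at that flipper length."""
    fitted = BETA0_HAT + BETA1_HAT * flipper_mm
    return mass_g - fitted


residuals = [residual(x_i, y_i) for x_i, y_i in penguins]

for (x_i, y_i), e_i in zip(penguins, residuals):
    print(f"x = {x_i:.0f} mm   y = {y_i:.0f} g   residual = {e_i:.1f} g")
```

A positive residual means the observation lies above the fitted line, and a negative residual means it lies below.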
2.5 Exercises
Exercise 2.1 In the context of Example 2.2, what is an interpretation of 5β1?
Exercise 2.2 In the context of Example 2.5, does the quantity 10β0 make practical sense? If yes, provide an interpretation. If no, explain why not.
Exercise 2.3 Consider a variation of Example 2.2, in which a simple linear regression model is fit with body mass (in g) as the predictor variable and flipper length (in mm) as the outcome. Explain how this switch affects the interpretations of β0, β1, and σ².
Exercise 2.4 Consider an SLR model with flipper length as the outcome and sex as the predictor, with x defined as in equation (2.4). What are the interpretations of β0 and β1?
Exercise 2.5 In the SLR model of Example 2.7, what is the difference in average body mass between female and male penguins? (Note that the order in the question is important here.)