Chapter 4 Inference on $\beta$ in Simple Linear Regression

4.1 Inference Goals

A fundamental task in statistics is inference, in which we use a sample of data to make generalizations about relationships in larger populations. Inference is a key component of an association analysis (see Section 1.2.1). Inference is usually conducted via hypothesis tests and confidence intervals.

Statistical inference is typically driven by an underlying scientific goal. A standard inferential question for a regression analysis is something like: Is there a relationship between ‘x’ and the average value of ‘y’?, where ‘x’ and ‘y’ are a predictor and outcome of interest. To answer this question using regression, we first need to translate it into a statistical question about specific model parameters.

It helps to first note that if there is no relationship between the variable $x$ and the average outcome $E[Y]$, then an appropriate linear model is the intercept-only model:

$$Y_i = \beta_0 + \epsilon_i \tag{4.1}$$

In equation (4.1), the predictor $x$ does not appear at all, and so the regression line is simply a horizontal line with intercept $\beta_0$. Figure 4.1 shows an example of data generated from a model in which there is no relationship between $x$ and $Y$.

Figure 4.1: Example data from a hypothetical regression model with $\beta_0 = 3$ and $\beta_1 = 0$. The black line shows the true mean of $y$ as a function of $x$.
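
Data like those in Figure 4.1 can be generated with a short simulation. The following is a minimal sketch: the sample size, the range of $x$, the error standard deviation, and the seed are assumptions chosen for illustration and are not taken from the figure.

```r
# Simulate data from the intercept-only model with beta0 = 3 and beta1 = 0.
# n, the range of x, the error SD, and the seed are illustrative assumptions.
set.seed(4)
n <- 100
x <- runif(n, min = 0, max = 10)    # predictor values
y <- 3 + 0 * x + rnorm(n, sd = 1)   # true mean of y does not depend on x

# Fit the usual SLR model; the slope estimate should be close to zero
sim_lm <- lm(y ~ x)
coef(sim_lm)
```

Even though the true slope is exactly zero, the estimated slope will not be exactly zero in any one dataset, which is why a formal test is needed.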

How is the intercept-only model useful? It allows us to translate the scientific question:

Is there a relationship between ‘x’ and the average value of ‘y’?

into a statistical question:

In the model $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, is $\beta_1 \neq 0$?

If the answer to this question is “No”, then $\beta_1 = 0$ and we are left with the intercept-only model. Testing $\beta_1$ against a null hypothesis of zero is by far the most common test done in regression models, since this allows us to test for an association.

Recall from Section 3.2 that the slope estimate $\hat\beta_1$ has the same sign as the correlation between $x$ and $y$. If that correlation is zero, then the estimated line is flat, just as it is when there is no relationship between $x$ and $y$.

Example 4.1 In the penguin data we might ask: Is there a relationship between flipper length and body mass in penguins? The corresponding statistical question is: In the SLR model with $x =$ flipper length and $Y =$ body mass, is $\beta_1 \neq 0$? We saw in Example 3.1 that the estimated SLR equation for the penguin data was $\hat{y}_i = -5780.8 + 49.7 x_i$. Thus, the estimated slope is 49.7 g/mm. Clearly this value is not 0. But is it meaningfully different from 0? We address that question in the remainder of this chapter.

4.2 Standard Error of $\hat\beta_1$

Before getting into the detail of how we conduct hypothesis tests for regression parameters, we first need to review a few key statistical concepts. The first is standard error.

4.2.1 Sampling Distribution of $\hat\beta_1$

The notation $\hat\beta_1$ actually refers to two different quantities. On one hand, it is the estimated slope of the linear regression line; that is, $\hat\beta_1$ is a number, or an estimate. But $\hat\beta_1$ also denotes an estimator, which is a rule for calculating an estimate from a dataset.

As an estimator, $\hat\beta_1$ has a distribution. We have previously seen that under the standard SLR assumptions:

$$E[\hat\beta_1] = \beta_1 \qquad \mathrm{Var}(\hat\beta_1) = \frac{\sigma^2}{S_{xx}}$$

The value of $\mathrm{Var}(\hat\beta_1)$ tells us how much variation there is in the values of $\hat\beta_1$ calculated from many different datasets. If we conduct repeated experiments all using the same population and settings, we obtain a different $\hat\beta_1$ each time.

The distribution of $\hat\beta_1$ values we obtain is called the sampling distribution of $\hat\beta_1$.
The standard deviation (or the variance) of this distribution tells us about the uncertainty in $\hat\beta_1$.
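
The idea of a sampling distribution can be made concrete by simulation. The sketch below repeatedly generates datasets from one SLR model and records the slope estimate from each. The parameter values (`beta0`, `beta1`, `sigma`), the sample size, and the helper `one_slope()` are hypothetical choices made for illustration.

```r
# Approximate the sampling distribution of the slope estimate by simulation.
# All parameter values below are hypothetical.
set.seed(42)
n <- 50
x <- seq(1, 10, length.out = n)   # predictor values, held fixed across datasets
beta0 <- 1
beta1 <- 2
sigma <- 3

# Generate one dataset and return its estimated slope
one_slope <- function() {
  y <- beta0 + beta1 * x + rnorm(n, sd = sigma)
  unname(coef(lm(y ~ x))[2])
}

slopes <- replicate(2000, one_slope())
sxx <- sum((x - mean(x))^2)

mean(slopes)          # should be close to beta1
sd(slopes)            # should be close to the theoretical value below
sqrt(sigma^2 / sxx)   # theoretical standard deviation of the estimator
```

A histogram of `slopes` (for example, `hist(slopes)`) gives a picture of the sampling distribution.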

4.2.2 Standard Errors

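The standard deviation of the sampling distribution of $\hat\beta_1$ is called the standard error of $\hat\beta_1$ and it is denoted:

$$\mathrm{se}(\hat\beta_1) = \sqrt{\mathrm{Var}(\hat\beta_1)} = \sqrt{\frac{\sigma^2}{S_{xx}}}$$

The value of $\mathrm{se}(\hat\beta_1)$ depends on three factors:

The variance of the error terms ($\sigma^2$)
The variation in the values of $x$ (via $S_{xx}$)
The number of observations $n$ (via $S_{xx}$)

It’s important to note that we don’t know $\sigma^2$, but we can estimate it as $\hat\sigma^2$ (see Section 3.5.3). This gives an estimate of $\mathrm{Var}(\hat\beta_1)$:

$$\widehat{\mathrm{Var}}(\hat\beta_1) = \frac{\hat\sigma^2}{S_{xx}}$$

So we calculate the estimated standard error as

$$\widehat{\mathrm{se}}(\hat\beta_1) = \sqrt{\frac{\hat\sigma^2}{S_{xx}}}$$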
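
As a check on the formula, the estimated standard error can be computed from its pieces and compared with the value that `lm()` reports. This is a sketch on simulated data; the data-generating values are made up for illustration and are not the penguin data.

```r
# Estimated standard error of the slope: by hand vs. the lm() output.
# The simulated data are hypothetical.
set.seed(1)
n <- 40
x <- rnorm(n, mean = 200, sd = 14)
y <- -5800 + 50 * x + rnorm(n, sd = 400)
fit <- lm(y ~ x)

sig2hat <- sum(residuals(fit)^2) / (n - 2)   # estimate of the error variance
sxx <- sum((x - mean(x))^2)                  # variation in x
se_hand <- sqrt(sig2hat / sxx)

se_lm <- summary(fit)$coefficients["x", "Std. Error"]
c(by_hand = se_hand, from_lm = se_lm)
```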

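4.3 Not Regression: One-sample $t$-test

To see how we incorporate $\widehat{\mathrm{se}}(\hat\beta_1)$ into a hypothesis test, let’s first go back to perhaps the most well-known test in statistics, the one-sample $t$-test.

For a one-sample $t$-test, we assume that we have an independent sample of $n$ values $Y_1, Y_2, \dots, Y_n$ from a population in which $Y_i \sim N(\mu, \sigma^2)$. Our goal is to test whether $\mu = \mu_0$ for some chosen value of $\mu_0$. We can write this as:

$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_A: \mu \neq \mu_0$$

We use the test statistic

$$t = \frac{\bar{y} - \mu_0}{s/\sqrt{n}},$$

where $\bar{y}$ is the sample mean and $s$ is the sample standard deviation. If $H_0$ is true, then $t$ follows a $T$-distribution with $n-1$ degrees of freedom, i.e. $t \sim T_{n-1}$.

If $|t|$ is “large”, then we reject $H_0$ in favor of $H_A$
If $|t|$ is not “large”, then we do not reject $H_0$

But what does a “large” value of $|t|$ mean? Large means $\bar{y}$ is far from the hypothesized value $\mu_0$. To account for the variability in the distribution of $\bar{y}$, we divide by its standard error. This is because a large difference between $\bar{y}$ and $\mu_0$ is more meaningful when $\bar{y}$ has smaller variance.

We then compare $t$ to the $T_{n-1}$ distribution and compute $P(|T| > |t|)$: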

Figure 4.2: Density curve for $T$-distribution. Shaded areas on left and right represent $P(|T| > |t|)$.

This probability is given by the shaded areas on the right and left in Figure 4.2. Since our alternative hypothesis is two-sided (as opposed to a one-sided hypothesis such as $H_A: \mu > \mu_0$), we need to consider the values in both directions.

If $P(|T| > |t|)$ (the “$p$-value”) is smaller than the chosen significance level $\alpha$, then we reject $H_0$. How to choose $\alpha$ is the subject of much debate, but a widely used value is 0.05. In some contexts, 0.01 and 0.1 are also used as significance levels for rejecting $H_0$.

If $p \geq \alpha$, then we fail to reject $H_0$. It’s important to note that failing to reject $H_0$ is not the same as proving $H_0$!
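
The one-sample $t$-test can be carried out step by step in R and checked against the built-in `t.test()` function. This is a sketch using simulated data; the sample, the hypothesized value `mu0`, and the seed are arbitrary choices for illustration.

```r
# One-sample t-test "by hand", compared with t.test().
# The data and the null value mu0 are hypothetical.
set.seed(10)
y <- rnorm(25, mean = 5.5, sd = 2)
mu0 <- 5
n <- length(y)

# Test statistic: distance of the sample mean from mu0, in standard errors
t_stat <- (mean(y) - mu0) / (sd(y) / sqrt(n))

# Two-sided p-value from the T distribution with n - 1 degrees of freedom
p_val <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)

builtin <- t.test(y, mu = mu0)
c(t_by_hand = t_stat, t_builtin = unname(builtin$statistic))
c(p_by_hand = p_val, p_builtin = builtin$p.value)
```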

Why is $t \sim T_{n-1}$ and not $N(0, 1)$? This is because we have estimated the standard error $s/\sqrt{n}$, and so $t$ has more variation than a standard normal distribution. Part of what makes tests of this form so useful is that when $n$ (the sample size) is large enough, $t$ approximately follows a $T$-distribution, even if $Y_i$ does not have a normal distribution! This is a consequence of the Central Limit Theorem (CLT).

The CLT tells us that for a sequence of independent, identically distributed random variables $Y_1, Y_2, \dots$ with finite mean $\mu$ and finite variance $\sigma^2$, the distribution of the mean of the first $n$ observations can be approximated by a normal distribution with mean $\mu$ and variance $\sigma^2/n$ when $n$ is sufficiently large.
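
One way to see the CLT at work is to apply the $t$-test to clearly non-normal data for which the null hypothesis is true, and check how often it rejects. The sketch below uses exponential data; the sample size, number of replicates, and seed are assumptions made for illustration.

```r
# Rejection rate of the one-sample t-test when H0 is true but the data are skewed.
# Exponential(rate = 1) data have true mean 1, so H0: mu = 1 holds.
set.seed(2024)
n <- 100
mu0 <- 1

reject <- replicate(5000, {
  y <- rexp(n, rate = 1)
  t_stat <- (mean(y) - mu0) / (sd(y) / sqrt(n))
  p_val <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)
  p_val < 0.05
})

mean(reject)   # proportion of datasets in which H0 was rejected
```

If the $T_{n-1}$ approximation is adequate, the rejection proportion should be in the neighborhood of 0.05. Repeating the simulation with a much smaller `n` gives a sense of when the approximation starts to degrade.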

4.4 $p$-values

While widely used, $p$-values are commonly misused. It’s important to keep in mind what a $p$-value can (and can’t!) tell you. A formal definition for a $p$-value is:

Definition 4.1 The $p$-value for a hypothesis test is the probability, if $H_0$ is true, of observing a test statistic the same or more in favor of the alternative hypothesis than the result that was obtained.

The underlying idea is that a small $p$-value means that getting data like what we observed is not very compatible with $H_0$ being true.

Incorrect interpretations of $p$-values:

Probability the null hypothesis is incorrect
Probability that the alternative hypothesis is true
Probability of these results occurring by chance

In 2016, the American Statistical Association (ASA) published a statement about $p$-values.⁵ Included in that statement were the following reminders:

$p$-values can indicate how incompatible the data are with a specified statistical model.
$p$-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a $p$-value passes a specific threshold.
Proper inference requires full reporting and transparency.
A $p$-value, or statistical significance, does not measure the size of an effect or the importance of a result.
By itself, a $p$-value does not provide a good measure of evidence regarding a model or hypothesis.

4.5 Hypothesis Testing for $\beta_1$

In simple linear regression, we can conduct a hypothesis test for $\beta_1$ just like how we conducted the one-sample $t$-test.

First, we set up the null and alternative hypotheses:

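$$H_0: \beta_1 = \beta_{10} \quad \text{vs.} \quad H_A: \beta_1 \neq \beta_{10}$$

Here $\beta_{10}$ is the hypothesized value of the slope.⁶

Most commonly, $\beta_{10} = 0$, which means we are testing whether there is a relationship between $x$ and $Y$. The $t$ statistic is:

$$t = \frac{\hat\beta_1 - \beta_{10}}{\widehat{\mathrm{se}}(\hat\beta_1)}$$

If $\epsilon_i \sim N(0, \sigma^2)$ and $H_0$ is true, then $t$ has a $T$-distribution with $n-2$ degrees of freedom. If $H_0$ is true, but $\epsilon_i$ is not normally distributed, then $t$ follows a $T_{n-2}$ distribution approximately, because of the CLT. We reject $H_0$ at the $\alpha$ level if $P(|T| > |t|) < \alpha$.

Example 4.2 Let’s return to the penguin data in Example 4.1, in which we asked: In the SLR model with $x =$ flipper length and $Y =$ body mass, is $\beta_1 \neq 0$?

We previously calculated the estimated slope as $\hat\beta_1 = 49.7$. From the question formulation, we know that the test value is $\beta_{10} = 0$. To calculate $t$, we now need to compute $\widehat{\mathrm{se}}(\hat\beta_1)$. We can obtain $\hat\sigma^2$ from R, either from the output of summary() or by calculating it directly: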

penguin_lm <- lm(body_mass_g ~ flipper_length_mm, data=penguins)
sig2hat <- summary(penguin_lm)$sigma^2 
sig2hat

## [1] 155455.3

# Alternative way "by hand"
sum(residuals(penguin_lm)^2)/(nobs(penguin_lm)-2)

## [1] 155455.3

We can then calculate $S_{xx}$ as:

# S_xx = sum(x^2) - (sum(x))^2 / n, ignoring missing values
sxx <- sum(penguins$flipper_length_mm^2, na.rm=TRUE) - 1/nobs(penguin_lm)*sum(penguins$flipper_length_mm, na.rm=TRUE)^2
sxx

## [1] 67426.54

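Now we combine these values to compute

$$t = \frac{\hat\beta_1 - \beta_{10}}{\sqrt{\hat\sigma^2 / S_{xx}}} = \frac{49.686 - 0}{\sqrt{155455.3 / 67426.54}} \approx 32.72$$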

In R, this computation is:

t <- (coef(penguin_lm)[2]- 0)/sqrt(sig2hat/sxx)
t

## flipper_length_mm 
##          32.72223

We can then compute the $p$-value by comparing $t$ to a $T_{n-2}$ distribution:

2*pt(abs(t), df=nobs(penguin_lm)-2, lower.tail=FALSE)

## flipper_length_mm 
##     4.370681e-107

This tells us that $p < 0.0001$, and so we reject $H_0$ at the $\alpha = 0.05$ level.

Although the method just described works, it can be tedious and is prone to mistakes. Since the null hypothesis $H_0: \beta_1 = 0$ is so commonly tested, R provides the results of the corresponding hypothesis test in its standard output.

Example 4.3 Compare the following output with the results calculated manually in Example 4.2.

tidy(penguin_lm)

## # A tibble: 2 × 5
##   term              estimate std.error statistic   p.value
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)        -5781.     306.       -18.9 5.59e- 55
## 2 flipper_length_mm     49.7      1.52      32.7 4.37e-107

We see columns with $\hat\beta_1$, $\widehat{\mathrm{se}}(\hat\beta_1)$, $t$, and $p$! This same information is also available from the summary() command:

summary(penguin_lm)

## 
## Call:
## lm(formula = body_mass_g ~ flipper_length_mm, data = penguins)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1058.80  -259.27   -26.88   247.33  1288.69 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -5780.831    305.815  -18.90   <2e-16 ***
## flipper_length_mm    49.686      1.518   32.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 394.3 on 340 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.759,  Adjusted R-squared:  0.7583 
## F-statistic:  1071 on 1 and 340 DF,  p-value: < 2.2e-16

4.6 Other forms of hypothesis tests for $\beta_1$

4.6.1 Testing against values other than 0

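Although less common, it is possible to conduct hypothesis tests against a null value other than zero. For example, we could set up the hypotheses

$$H_0: \beta_1 = 40 \quad \text{vs.} \quad H_A: \beta_1 \neq 40 \tag{4.2}$$

To conduct this hypothesis test, we would need to use the “by hand” procedure for computing $t$ and $p$.

Example 4.4 In the penguin data from Example 4.1, what is the conclusion of testing the null hypothesis in Equation (4.2)?

We compute $t$ as:

$$t = \frac{\hat\beta_1 - \beta_{10}}{\widehat{\mathrm{se}}(\hat\beta_1)} = \frac{49.686 - 40}{1.518} \approx 6.38$$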

In R, this can be computed as:

penguin_lm_df <- tidy(penguin_lm)
t <- (penguin_lm_df$estimate[2]- 40)/penguin_lm_df$std.error[2]
t

## [1] 6.378781

We can then compute the $p$-value by comparing $t$ to a $T_{n-2}$ distribution:

2*pt(abs(t), df=nobs(penguin_lm)-2, lower.tail=FALSE)

## [1] 5.832893e-10

We reject the null hypothesis that $\beta_1 = 40$ at the $\alpha = 0.05$ level.

4.6.2 One-sided tests

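Another alternative is a one-sided test, which involves null and alternative hypotheses of the form:

$$H_0: \beta_1 \leq \beta_{10} \quad \text{vs.} \quad H_A: \beta_1 > \beta_{10}$$

or, in the other direction, $H_0: \beta_1 \geq \beta_{10}$ vs. $H_A: \beta_1 < \beta_{10}$.

With one-sided hypotheses, the calculation of $t$ is the same as in the two-sided setting, but the calculation of the $p$-value is different. Instead of $P(|T| > |t|)$, we evaluate $P(T > t)$ or $P(T < t)$, depending on whether the alternative hypothesis is greater than ($H_A: \beta_1 > \beta_{10}$) or less than ($H_A: \beta_1 < \beta_{10}$). The direction of the inequality for calculating the $p$-value should always match the direction in the alternative hypothesis.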
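
The three $p$-value calculations can be collected in one small function. This is a sketch: `t_p_value()` is a hypothetical helper written for illustration, not a function from R or from any package used in this chapter.

```r
# Hypothetical helper: p-value for a t statistic under a two-sided or
# one-sided alternative. The name and interface are illustrative only.
t_p_value <- function(t_stat, df,
                      alternative = c("two.sided", "greater", "less")) {
  alternative <- match.arg(alternative)
  switch(alternative,
    two.sided = 2 * pt(abs(t_stat), df = df, lower.tail = FALSE),  # P(|T| > |t|)
    greater   = pt(t_stat, df = df, lower.tail = FALSE),           # P(T > t)
    less      = pt(t_stat, df = df, lower.tail = TRUE)             # P(T < t)
  )
}

# Example with an arbitrary statistic and degrees of freedom
p_two     <- t_p_value(2.1, df = 30)
p_greater <- t_p_value(2.1, df = 30, alternative = "greater")
p_less    <- t_p_value(2.1, df = 30, alternative = "less")
c(two_sided = p_two, greater = p_greater, less = p_less)
```

For a positive statistic, the "greater" $p$-value is half the two-sided one, while the "less" $p$-value is close to 1.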

Example 4.5 In the penguin data from Example 4.1, what is the conclusion from a test of $H_0: \beta_1 \geq 0$ against $H_A: \beta_1 < 0$?

We can use the value of $t$ calculated in Example 4.2. But now we evaluate the $p$-value as $P(T < t)$:

pt(penguin_lm_df$statistic[2], df=nobs(penguin_lm)-2, lower.tail=TRUE)

## [1] 1

In this example, $p \approx 1$, so we fail to reject the null hypothesis that $\beta_1$ is greater than or equal to zero. This should come as no surprise, since our estimate $\hat\beta_1$ is (much) greater than zero.

4.7 Confidence Intervals (CIs)

4.7.1 Definition and Interpretation

Hypothesis tests provide an answer to a specific question (Is there evidence to reject the null hypothesis?), but they don’t directly provide information about the uncertainty in the point estimates. In many contexts, what is often more useful than a conclusion from a hypothesis test is an estimate of a parameter and its uncertainty. Confidence intervals provide a way to describe the uncertainty in a parameter estimate.

Definition 4.2 A $(1-\alpha)100\%$ confidence interval is a random interval that, if the model is correct, would include (“cover”) the true value of the parameter with probability $1-\alpha$.

In this definition, it is important to note that the interval is random, not the parameter. The parameter is a fixed, but unknown, constant, and so it cannot have a probability distribution associated with it.⁷ A common incorrect interpretation of a CI is that the probability of the parameter being in the interval is $1-\alpha$.

For a single dataset, there is no guarantee that the true value of a parameter will be included within the confidence interval. But if the model is correct, then an interval generated by the same procedure should include the true value in $(1-\alpha)100\%$ of analyses of independently collected datasets. Figure 4.3 shows the coverage of 95% CIs calculated for 100 simulated datasets with a fixed true parameter value.

Figure 4.3: Example of coverage of 95% CIs in 100 simulated datasets.
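
The coverage property can be checked by simulation. The sketch below is in the spirit of Figure 4.3, but it is not the code behind that figure: the true slope of 2, the other parameter values, the sample size, and the number of replicates are all assumptions made for illustration.

```r
# Estimate the coverage of 95% CIs for the slope by simulation.
# The true slope (beta1 = 2) and all other settings are hypothetical.
set.seed(123)
n <- 30
x <- runif(n, min = 0, max = 10)
beta1 <- 2

covered <- replicate(1000, {
  y <- 1 + beta1 * x + rnorm(n, sd = 2)
  ci <- confint(lm(y ~ x))["x", ]      # 95% CI for the slope
  ci[1] <= beta1 && beta1 <= ci[2]     # does the interval cover the truth?
})

mean(covered)   # proportion of intervals that include the true slope
```

The proportion should be near 0.95, although any individual interval either contains the true slope or does not.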

4.7.2 Inverting a Hypothesis Test

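To create a confidence interval, we invert a hypothesis test. Recall that for testing the null hypothesis $H_0: \beta_1 = \beta_{10}$ against the alternative hypothesis $H_A: \beta_1 \neq \beta_{10}$, we computed the test statistic $t = (\hat\beta_1 - \beta_{10})/\widehat{\mathrm{se}}(\hat\beta_1)$ by plugging in $\hat\beta_1$, $\beta_{10}$, and $\widehat{\mathrm{se}}(\hat\beta_1)$. We then compared the value of $t$ to a $T_{n-2}$ distribution to compute the $p$-value $P(|T| > |t|)$. For a confidence interval, we reverse this process. That is, we plug in $\hat\beta_1$, $\widehat{\mathrm{se}}(\hat\beta_1)$, and $\alpha$, then solve for $\beta_1$ as an unknown value.

The distribution of $T = (\hat\beta_1 - \beta_1)/\widehat{\mathrm{se}}(\hat\beta_1)$ is $T_{n-2}$. This distribution has mean zero and a standardized variance (it’s close to 1, although not exactly 1). There exists a number, which we denote $t_{\alpha/2}$, such that the area under the curve between $-t_{\alpha/2}$ and $t_{\alpha/2}$ is $1-\alpha$. Mathematically, this can be written:

$$P\left(-t_{\alpha/2} \leq \frac{\hat\beta_1 - \beta_1}{\widehat{\mathrm{se}}(\hat\beta_1)} \leq t_{\alpha/2}\right) = 1 - \alpha \tag{4.3}$$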

Graphically, this looks like:

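We can rearrange equation (4.3), so that $\beta_1$ is alone in the middle:

$$P\left(\hat\beta_1 - t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1) \leq \beta_1 \leq \hat\beta_1 + t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1)\right) = 1 - \alpha$$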
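
For readers who want the intermediate algebra, the rearrangement can be written out one step at a time. This is a sketch of the standard argument, starting from the event that the pivot $(\hat\beta_1 - \beta_1)/\widehat{\mathrm{se}}(\hat\beta_1)$ lies between $-t_{\alpha/2}$ and $t_{\alpha/2}$ and using the fact that the estimated standard error is positive.

```latex
\begin{align*}
-t_{\alpha/2} &\leq \frac{\hat\beta_1 - \beta_1}{\widehat{\mathrm{se}}(\hat\beta_1)} \leq t_{\alpha/2}
  && \text{(starting event)} \\
-t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1) &\leq \hat\beta_1 - \beta_1 \leq t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1)
  && \text{(multiply by } \widehat{\mathrm{se}}(\hat\beta_1) > 0\text{)} \\
-\hat\beta_1 - t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1) &\leq -\beta_1 \leq -\hat\beta_1 + t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1)
  && \text{(subtract } \hat\beta_1\text{)} \\
\hat\beta_1 - t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1) &\leq \beta_1 \leq \hat\beta_1 + t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1)
  && \text{(multiply by } -1 \text{ and reverse the inequalities)}
\end{align*}
```

Each line describes the same event, so each has the same probability, $1-\alpha$.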

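4.8 CIs for $\beta_1$

The procedure from the previous section gives a $(1-\alpha)100\%$ confidence interval for $\beta_1$:

$$\left(\hat\beta_1 - t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1),\ \hat\beta_1 + t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_1)\right)$$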

4.8.1 Confidence Intervals “by hand” in R

To compute a confidence interval “by hand” in R, we can plug the appropriate values into the formulas. The estimates $\hat\beta_1$ and $\widehat{\mathrm{se}}(\hat\beta_1)$ can be calculated from an lm object. To compute $t_{\alpha/2}$, use the qt() command, which can be used to find $x$ such that $P(T < x) = \tilde{p}$ for a given value of $\tilde{p}$. In order to compute $t_{\alpha/2}$, we need to find $x$ such that $P(T < x) = 1 - \alpha/2$. Because of the symmetry of the $T$ distribution, this will yield an $x$ such that $P(-x < T < x) = 1 - \alpha$. This can be implemented in the following code:

alpha <- 0.05
t_alphaOver2 <- qt(1-alpha/2,
                   df = 100-2)
t_alphaOver2

## [1] 1.984467

An alternative approach is to find $x$ such that $P(T > x) = \alpha/2$. To do this using qt(), set the lower.tail=FALSE option:

t_alphaOver2 <- qt(alpha/2,
                   df = 100-2,
                   lower.tail=FALSE)
t_alphaOver2

## [1] 1.984467

Example 4.6 In the penguin data, suppose we wish to construct a 95% confidence interval for $\beta_1$ using the formulas. This can be done with the following code:

penguin_lm <- lm(body_mass_g~flipper_length_mm,
                data=penguins)
alpha <- 0.05
t_alphaOver2 <- qt(1-alpha/2,
                   df = nobs(penguin_lm)-2)
CI95Lower <- coef(penguin_lm)[2] - t_alphaOver2 * tidy(penguin_lm)$std.error[2]
CI95Upper <- coef(penguin_lm)[2] + t_alphaOver2 * tidy(penguin_lm)$std.error[2]
c(CI95Lower, CI95Upper)

## flipper_length_mm flipper_length_mm 
##          46.69892          52.67221

4.8.2 Confidence Intervals in R

In practice, it is much simpler to let R compute the confidence interval for you. Two standard options for this are:

Add conf.int=TRUE when calling tidy() on the lm output. This will add a conf.low and conf.high column to the tidy output. By default, a 95% confidence interval is constructed. To change the level, set conf.level= to a different value.
Call the confint() command directly on the lm object. This prints the confidence intervals only (no point estimates). To change the level, set the level= argument.

tidy(penguin_lm, conf.int=TRUE)

## # A tibble: 2 × 7
##   term              estimate std.error statistic   p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)        -5781.     306.       -18.9 5.59e- 55  -6382.    -5179. 
## 2 flipper_length_mm     49.7      1.52      32.7 4.37e-107     46.7      52.7

tidy(penguin_lm, conf.int=TRUE, conf.level=0.99)

## # A tibble: 2 × 7
##   term              estimate std.error statistic   p.value conf.low conf.high
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)        -5781.     306.       -18.9 5.59e- 55  -6573.    -4989. 
## 2 flipper_length_mm     49.7      1.52      32.7 4.37e-107     45.8      53.6

confint(penguin_lm)

##                         2.5 %      97.5 %
## (Intercept)       -6382.35801 -5179.30471
## flipper_length_mm    46.69892    52.67221

4.9 Summarizing Inference for $\beta_1$

When testing $H_0: \beta_1 = \beta_{10}$ vs. $H_A: \beta_1 \neq \beta_{10}$, it is best to write a complete sentence explaining your conclusion. In the sentence, clearly describe the null hypothesis and whether it was rejected or not. Report the exact $p$-value, unless it is below 0.0001, in which case writing $p < 0.0001$ is sufficient.

Confidence intervals are generally reported in parentheses after a point estimate is given. It is standard to specify the confidence level when doing so.

Example 4.7 We can update the interpretation summary from Example 2.8 by adding a second sentence so that our full conclusion is:

A difference of one mm in flipper length is associated with an estimated difference of 49.7 g (95% CI: 46.7, 52.7) greater average body mass among penguins in Antarctica. We reject the null hypothesis that there is no linear relationship between flipper length and average penguin body mass ($p < 0.0001$).

Note that we have used the phrase “no linear relationship” when describing the null hypothesis that has been rejected. The SLR model can only tell us about a linear relationship; other types of relationships might still be possible. (We will look at quadratic, cubic, and other more flexible relationships in Section 14.)

Example 4.8 (Continuation of Example 3.4.) Suppose we wish to formally test whether the average body mass is the same between male and female penguins. In the SLR model with body mass as the outcome and an indicator of sex as the predictor, this means testing $H_0: \beta_1 = 0$ against $H_A: \beta_1 \neq 0$. To do this, we can extract the necessary information from the output:

tidy(penguin_lm2, conf.int=TRUE)

## # A tibble: 2 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)    3862.      56.8     68.0  1.70e-196    3750.     3974.
## 2 sexmale         683.      80.0      8.54 4.90e- 16     526.      841.

Here, $t = 8.54$ and $p < 0.0001$. So we would reject the null hypothesis and can summarize our result as:

We reject the null hypothesis that there is no difference in body mass between female and male penguins ($p < 0.0001$). The estimated difference in average body mass of penguins, comparing males to females, is 683 grams (95% CI: 526, 841), with males having larger average mass.

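4.10 Inference for $\beta_0$

4.10.1 Hypothesis Testing for $\beta_0$

Hypothesis testing for the intercept works in a similar fashion. For the null and alternative hypotheses

$$H_0: \beta_0 = \beta_{00} \quad \text{vs.} \quad H_A: \beta_0 \neq \beta_{00}$$

the $t$-statistic is:

$$t = \frac{\hat\beta_0 - \beta_{00}}{\widehat{\mathrm{se}}(\hat\beta_0)}$$

We then compare this value to a $T_{n-2}$ distribution. The value of $t$ and its $p$-value can be computed by hand, or extracted from the standard output: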

tidy(penguin_lm)

## # A tibble: 2 × 5
##   term              estimate std.error statistic   p.value
##   <chr>                <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)        -5781.     306.       -18.9 5.59e- 55
## 2 flipper_length_mm     49.7      1.52      32.7 4.37e-107

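In this example, the test statistic for testing $H_0: \beta_0 = 0$ is $t = -18.9$ and its $p$-value is less than 0.0001, so we reject $H_0$. Of course, this may not be a meaningful test, since $\beta_0$ corresponds to the average body mass of a penguin without flippers!

4.10.2 CIs for $\beta_0$

We can construct a $(1-\alpha)100\%$ CI for $\beta_0$ in the same way as for $\beta_1$:

$$\left(\hat\beta_0 - t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_0),\ \hat\beta_0 + t_{\alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_0)\right)$$

These can be computed by R in the same manner as CIs for $\beta_1$.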
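
As a sketch of the intercept calculation, the interval can be built from the estimate, its standard error, and the $t$ quantile, then compared with `confint()`. The data below are simulated with made-up parameter values, so that the example does not depend on the penguin data being loaded.

```r
# 95% CI for the intercept: by hand vs. confint().
# The simulated data are hypothetical.
set.seed(7)
n <- 60
x <- rnorm(n, mean = 10, sd = 2)
y <- 4 + 1.5 * x + rnorm(n)
fit <- lm(y ~ x)

coefs <- summary(fit)$coefficients
est <- coefs["(Intercept)", "Estimate"]
se  <- coefs["(Intercept)", "Std. Error"]

alpha <- 0.05
t_crit <- qt(1 - alpha / 2, df = n - 2)   # t quantile with n - 2 degrees of freedom

ci_hand <- c(est - t_crit * se, est + t_crit * se)
ci_builtin <- unname(confint(fit)["(Intercept)", ])
rbind(ci_hand, ci_builtin)
```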

4.11 Exercises

Exercise 4.1 Use R to compute the two-sided -value for a test statistic . Assume this comes from a regression model fit to observations.

Exercise 4.2 Suppose we fit a simple linear regression model with celebrity income as the predictor variable and their number of social media followers as the outcome. Explain what the null hypothesis would mean in this context.

Exercise 4.3 In the penguin data with flipper length as the predictor variable and body mass as the outcome, what is the conclusion of a hypothesis test of against the alternative ?

Exercise 4.4 In the setting of Example 4.8, what would the null hypothesis mean scientifically? Perform a test against the alternative and summarize your conclusions.

Exercise 4.5 Suppose we fit a simple linear regression model with observations and . Given a 95% confidence interval of , what is the value of ?

Exercise 4.6 Write a conclusion sentence about for the model fit in Example 3.4.

⁵ https://doi.org/10.1080/00031305.2016.1154108
⁶ This quantity is called “beta-one-naught”, not “beta-ten”.
⁷ Unless one adopts a Bayesian paradigm.
