Chapter 7 Centering and Scaling in SLR
This chapter covers several different variations of the simple linear regression model. The fundamental aspects of the model do not change, but the interpretations of parameters are impacted by transformations to the predictor and outcome variables.
7.1 Regression with Centered x
One common variation of the SLR model is to center the predictor variable, which means we use $x_i - \bar{x}$ instead of $x_i$.
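We can rearrange the SLR model equation as:
$$\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \\
&= \beta_0 + \beta_1 x_i - \beta_1 \bar{x} + \beta_1 \bar{x} + \epsilon_i \\
&= \beta_0 + \beta_1 \bar{x} + \beta_1 (x_i - \bar{x}) + \epsilon_i \\
&= \beta_0' + \beta_1 (x_i - \bar{x}) + \epsilon_i
\end{aligned}$$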
Thus, we still have an SLR model, but with $\beta_0'$ in place of $\beta_0$. From the derivation above, we can see that $\beta_0' = \beta_0 + \beta_1 \bar{x}$. In words, $\beta_0'$ is the mean of $Y$ for observations whose predictor value equals the average $\bar{x}$. This is the primary reason for centering the x variable: the new intercept has a practical meaning, which the intercept from an uncentered model often lacks.
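The slope parameter $\beta_1$ is unchanged by this reparameterization. Inference about the slope (hypothesis tests and confidence intervals) is likewise unchanged by centering the x variable, as are the fitted values and residuals. Only the intercept, and inference about it, refers to a different quantity.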
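The effect of centering can be checked numerically. The following is a minimal sketch in base R using simulated data (the data, seed, and variable names are assumptions for illustration, not the penguin data):

```r
# Simulated data (an assumption for illustration; not the penguin data).
set.seed(1)
x <- rnorm(50, mean = 200, sd = 14)
y <- -5800 + 50 * x + rnorm(50, sd = 400)

# Fit the SLR model with the original and the centered predictor.
x_centered <- x - mean(x)
fit_raw <- lm(y ~ x)
fit_centered <- lm(y ~ x_centered)

# The slope estimates agree; the intercept estimates differ.
coef(fit_raw)
coef(fit_centered)

# The centered intercept equals beta0_hat + beta1_hat * mean(x), which is mean(y).
centered_intercept <- unname(coef(fit_centered)[1])
implied_intercept <- unname(coef(fit_raw)[1] + coef(fit_raw)[2] * mean(x))
all.equal(centered_intercept, implied_intercept)
all.equal(centered_intercept, mean(y))
```

Comparing `summary(fit_raw)` with `summary(fit_centered)` should also show the same standard error and t statistic for the slope.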
Graphically, centering the predictor variable moves the vertical axis to $x = \bar{x}$, so the intercept is read at the point $(\bar{x}, \bar{y})$, the average of the data (see Figure 7.1). This has no impact on the slope of the regression line, but does affect the intercept.
Figure 7.1: Shifting the origin to the average of the data.
Example 7.1 What happens to the results from the model for the penguin data when we center the flipper length variable?
First, we need to create the centered variable:
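# Center flipper length at its sample mean (missing values are ignored when computing the mean)
penguins <- penguins %>%
  mutate(flipper_length_mm_centered = flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE))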
When we fit the SLR model with this centered variable, we get point estimates $\hat{\beta}_0 = 4201.8$ and $\hat{\beta}_1 = 49.7$. As expected, the slope is the same as in the uncentered model. The estimated average body mass for penguins with “average” flipper length is 4,202 grams. Since the SLR line passes through $(\bar{x}, \bar{y})$, this is exactly the mean penguin body mass in these data.
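The last claim can be verified in one line, assuming the usual least-squares formula for the intercept, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, applied to the centered predictor, whose sample mean is zero:

```latex
\hat{\beta}_0' = \bar{y} - \hat{\beta}_1 \cdot \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})
              = \bar{y} - \hat{\beta}_1 \cdot 0
              = \bar{y}
```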
7.2 Rescaling Units
7.2.1 Rescaling x
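In addition to centering, another common transformation is to scale a variable. When the predictor variable is scaled by a factor $c$, the value of $\beta_1$ (and thus $\hat{\beta}_1$) is scaled by $1/c$. This can be verified mathematically by rearranging the SLR equation:
$$\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \\
&= \beta_0 + \frac{\beta_1}{c} (c x_i) + \epsilon_i \\
&= \beta_0 + \tilde{\beta}_1 \tilde{x}_i + \epsilon_i
\end{aligned}$$
where $\tilde{x}_i = c x_i$ and $\tilde{\beta}_1 = \beta_1 / c$.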
In principle, a predictor variable can be rescaled by any nonzero value, but the most common choice is to divide by its standard deviation. Together with centering the variable, this is known as standardization of the predictor variable. After standardization, $\beta_1$ represents the difference in the average value of the outcome for a one-standard-deviation difference in $x$. Other common scaling factors are factors of 10 or other values that change units (e.g., going from kilometers to meters) and the interquartile range (IQR) of $x$.
Example 7.2 Suppose that, in addition to centering the flipper length as in Example 7.1, we also scale the flipper lengths by their standard deviation. In R, this can be done easily using the scale() function:
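A minimal sketch of this step, using a short made-up vector as a stand-in for the penguin flipper lengths (the values and the name `flipper_length_std` are assumptions for illustration):

```r
# Hypothetical stand-in for the flipper length column (made-up values in mm).
flipper_length_mm <- c(181, 186, 195, NA, 193, 190, 210, 222)

# By default, scale() subtracts the mean and divides by the standard deviation,
# skipping NA values. It returns a one-column matrix, so convert to a vector.
flipper_length_std <- as.numeric(scale(flipper_length_mm))

# Equivalent manual computation.
manual <- (flipper_length_mm - mean(flipper_length_mm, na.rm = TRUE)) /
  sd(flipper_length_mm, na.rm = TRUE)

all.equal(flipper_length_std, manual)
```

Because scale() returns a matrix, wrapping it in as.numeric() keeps the result a plain numeric vector, which is convenient when storing it as a data frame column.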
If we fit an SLR model with this standardized version of flipper length, we obtain $\hat{\beta}_0 = 4201.8$ and $\hat{\beta}_1 = 698.7$. The scaling of x had no impact on the point estimate $\hat{\beta}_0$, but did change the value of $\hat{\beta}_1$. From this model, an interpretation statement for $\hat{\beta}_1$ is: A one standard deviation difference in flipper length is associated with a difference of 699 grams in average body mass (95% CI: 657, 741).
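As a rough consistency check (an illustration, not part of the original example), the two slope estimates are linked by the scaling rule from this section. Standardizing divides the centered flipper length by its standard deviation $s_x$, so $c = 1/s_x$ and the slope is multiplied by $s_x$. Back-calculating from the rounded estimates in Examples 7.1 and 7.2 gives an implied standard deviation of roughly 14.1 mm:

```latex
\hat{\tilde{\beta}}_1 = \frac{\hat{\beta}_1}{c} = s_x \hat{\beta}_1
\quad\Longrightarrow\quad
s_x \approx \frac{698.7}{49.7} \approx 14.1 \text{ mm}
```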
This demonstrates one advantage of scaling by the standard deviation: it can simplify interpretation of the slope parameter, particularly when the audience may not be familiar with the practical importance of a 1-unit difference in the predictor. For example, someone who is not familiar with typical penguin flipper lengths may not have a good sense of how large a 1 mm difference in flipper length is. Does it represent a small, but important, amount? Or is it a negligible difference? The same reader can still recognize a one-standard-deviation difference as meaningfully large relative to the variation in the data. However, when reporting results from a model with a standardized predictor, it is still important to report the magnitude of the standard deviation.
Rescaling the predictor variable has no impact on the inferential conclusions about the model. Any multiplicative change in $\hat{\beta}_1$ due to scaling also changes $\mathrm{se}(\hat{\beta}_1)$ by the same factor, so the test statistics for hypothesis tests remain unchanged.
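A minimal sketch of this invariance in base R, using simulated data and an arbitrary scaling factor (both are assumptions for illustration, not the penguin data):

```r
# Simulated data (an assumption for illustration).
set.seed(2)
x <- rnorm(40, mean = 200, sd = 14)
y <- -5800 + 50 * x + rnorm(40, sd = 400)

c_factor <- 2.5               # arbitrary, hypothetical scaling factor
x_scaled <- c_factor * x

# Extract the slope row (estimate, standard error, t value, p-value) from a fit.
slope_row <- function(fit) summary(fit)$coefficients[2, ]
raw <- slope_row(lm(y ~ x))
scaled <- slope_row(lm(y ~ x_scaled))

# The estimate and its standard error are both multiplied by 1 / c_factor ...
scaled["Estimate"] / raw["Estimate"]
scaled["Std. Error"] / raw["Std. Error"]

# ... so the t statistic (and hence the p-value) does not change.
all.equal(unname(raw["t value"]), unname(scaled["t value"]))
```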
7.2.2 Rescaling Y
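It is also possible to rescale the outcome variable. This is generally only done when changing units; it is uncommon to standardize the outcome variable. The impact on the regression parameters can again be computed by rearranging the SLR model equation:
$$\begin{aligned}
Y_i &= \beta_0 + \beta_1 x_i + \epsilon_i \\
c Y_i &= c \beta_0 + c \beta_1 x_i + c \epsilon_i \\
\tilde{Y}_i &= \tilde{\beta}_0 + \tilde{\beta}_1 x_i + \tilde{\epsilon}_i
\end{aligned}$$
where $\tilde{Y}_i = c Y_i$, $\tilde{\beta}_0 = c \beta_0$, $\tilde{\beta}_1 = c \beta_1$, and $\mathrm{Var}(\tilde{\epsilon}_i) = c^2 \sigma^2$.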
Unlike rescaling x, which only affected $\beta_1$, rescaling the outcome variable affects all three parameters in the SLR model ($\beta_0$, $\beta_1$, and $\sigma^2$). However, the inferential conclusions are once again unchanged, since the numerator and denominator of the test statistic are multiplied by the same factor.
Example 7.3 In the penguin data, if we rescale the body mass variable to be in kilograms instead of grams (1 kg = 1,000 g), we obtain the point estimates $\hat{\beta}_0 = -5.78$ and $\hat{\beta}_1 = 0.0497$. Since we have multiplied $y$ by a factor of $c = 1/1000$, both point estimates are multiplied by the same factor. For example, the slope of 49.7 grams per mm becomes 0.0497 kg per mm.
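The same kind of unit change can be mimicked with simulated data. This is a minimal sketch in base R; the data and variable names are assumptions for illustration, not the penguin data:

```r
# Simulated data (an assumption for illustration).
set.seed(3)
x <- rnorm(40, mean = 200, sd = 14)
y_g <- -5800 + 50 * x + rnorm(40, sd = 400)   # outcome in "grams"
y_kg <- y_g / 1000                            # rescaled outcome, c = 1/1000

fit_g <- lm(y_g ~ x)
fit_kg <- lm(y_kg ~ x)

# Both coefficient estimates are multiplied by c = 1/1000 ...
coef(fit_kg) / coef(fit_g)

# ... as is the residual standard error (the estimate of sigma).
sigma(fit_kg) / sigma(fit_g)

# The t statistics for both coefficients are unchanged.
all.equal(summary(fit_g)$coefficients[, "t value"],
          summary(fit_kg)$coefficients[, "t value"])
```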
7.3 Exercises
Exercise 7.1 Consider the penguin data from Example 7.1. What would the value of $\hat{\beta}_1$ be if the model were fit using flipper length in centimeters (1 cm = 10 mm)?