Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This material was initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.
Remember the delivery time example we introduced in Chapter 1? What if we have more than one covariate, e.g., delivery volume and delivery distance? Will that help with the prediction? To answer that question, we need multiple linear regression.
Suppose the response variable is still delivery time, and the covariates now include delivery volume (i.e., the number of packages) and delivery distance (i.e., how far each truck travels to complete the trip). We can then use a multiple linear regression to describe the relationship between the response and the covariates.
\[ y_i=\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i \]
A multiple linear regression is harder to visualize than simple linear regression. However, with two covariates, we can still plot it.
In general, the multiple linear regression model with \(k\) regressors is
\[ y_i=\beta_0 + \beta_1 x_{i1} + ... + \beta_k x_{ik} + \epsilon_i =\beta_0 + \sum_{j=1}^{k}\beta_j x_{ij} + \epsilon_i \]
Notation:
\(k\) – number of regressors
\(y_i\) – \(i\)-th observation of the response
\(x_{ij}\) – \(i\)-th observation of the \(j\)-th covariate.
The least squares estimate is obtained in the same way as in simple linear regression: it minimizes the sum of squared distances. The sum of squared distances is
\[ S(b_0,b_1,...,b_k)=\sum_{i=1}^{n}(y_i - b_0 - b_1 x_{i1} - ... - b_k x_{ik})^2 \]
and our least squares estimate is
\[ (\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_k)=\arg\min_{b_0,b_1,...,b_k}S(b_0,b_1,...,b_k) \]
A demonstration is available here: http://setosa.io/ev/ordinary-least-squares-regression/
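To make the idea concrete, here is a minimal sketch that minimizes the sum of squared distances numerically with optim() and compares the result with lm(). The simulated data and variable names (x1, x2, y) are made up for illustration; they are not the delivery data.
# simulate a small data set with two covariates (illustrative only)
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
# sum of squared distances S(b0, b1, b2)
S <- function(b) sum((y - b[1] - b[2] * x1 - b[3] * x2)^2)
# minimize S numerically and compare with the least squares fit from lm()
optim(c(0, 0, 0), S)$par
coef(lm(y ~ x1 + x2))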
Now let us turn to our delivery time example.
delivery <- read.csv("data_delivery.csv", header = TRUE)
head(delivery)
## Index DeliveryTime NumberofCases Distance
## 1 1 16.68 7 560
## 2 2 11.50 3 220
## 3 3 12.03 3 340
## 4 4 14.88 4 80
## 5 5 13.75 6 150
## 6 6 18.11 7 330
# visualize data
pairs(delivery, pch = 20)
# fit linear regression
model1 <- lm(DeliveryTime ~ NumberofCases + Distance, data=delivery)
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
# obtain coefficient estimate
model1$coefficients
## (Intercept) NumberofCases Distance
## 2.34123115 1.61590721 0.01438483
# obtain residuals
model1$residuals
## 1 2 3 4 5 6 7
## -5.0280843 1.1463854 -0.0497937 4.9243539 -0.4443983 -0.2895743 0.8446235
## 8 9 10 11 12 13 14
## 1.1566049 7.4197062 2.3764129 2.2374930 -0.5930409 1.0270093 1.0675359
## 15 16 17 18 19 20 21
## 0.6712018 -0.6629284 0.4363603 3.4486213 1.7931935 -5.7879699 -2.6141789
## 22 23 24 25
## -3.6865279 -4.6075679 -4.5728535 -0.2125839
# obtain fitted value
model1$fitted.values
## 1 2 3 4 5 6 7 8
## 21.708084 10.353615 12.079794 9.955646 14.194398 18.399574 7.155376 16.673395
## 9 10 11 12 13 14 15 16
## 71.820294 19.123587 38.092507 21.593041 12.472991 18.682464 23.328798 29.662928
## 17 18 19 20 21 22 23 24
## 14.913640 15.551379 7.706807 40.887970 20.514179 56.006528 23.357568 24.402854
## 25
## 10.962584
# obtain estimate and standard errors
summary(model1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.34123115 1.096730168 2.134738 4.417012e-02
## NumberofCases 1.61590721 0.170734918 9.464421 3.254932e-09
## Distance 0.01438483 0.003613086 3.981313 6.312469e-04
# obtain just standard errors
summary(model1)$coef[,2]
## (Intercept) NumberofCases Distance
## 1.096730168 0.170734918 0.003613086
The interpretations of the coefficients, residuals, and fitted values are the same as in simple linear regression. In addition, the least squares estimate is also unbiased in multiple linear regression. The standard error of each individual estimate can be interpreted the same way as in simple linear regression, i.e., as “wiggle room”.
\(R^2\) is calculated exactly the same way as in simple linear regression:
\[ R^2=\frac{SSR}{SST}=1-\frac{SSRes}{SST} \]
However, in the multiple linear regression setting, \(R^2\) can be inflated simply by adding more terms to the model (even insignificant terms). Therefore, we need to penalize for the added terms, which leads to the adjusted \(R^2\):
\[ R^2_{adj}=1-\frac{SSRes/(n-k-1)}{SST/(n-1)}=1-(1-R^2)\frac{n-1}{n-k-1} \]
where \(k\) is the total number of covariates in the model (not including the intercept).
#generate a random variable, add it to the model
set.seed(2)
noise <- rnorm(nrow(delivery),0,1)
model2 <- lm(DeliveryTime ~ NumberofCases + Distance + noise, data=delivery)
summary(model1)$r.square
## [1] 0.9595937
summary(model2)$r.square
## [1] 0.9596191
summary(model1)$adj.r.squared
## [1] 0.9559205
summary(model2)$adj.r.squared
## [1] 0.9538504
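As a sanity check, the adjusted \(R^2\) can be computed by hand from the formula above; here is a minimal sketch for model1, where \(n=25\) and \(k=2\).
# compute adjusted R^2 manually from R^2, n, and k
r2 <- summary(model1)$r.squared
n  <- nrow(delivery)
k  <- 2   # number of covariates in model1
1 - (1 - r2) * (n - 1) / (n - k - 1)   # should match summary(model1)$adj.r.squared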
Once we have estimated the parameters in the model, we face the following questions:
Is the linear regression as a whole effective and significant? This leads to the test for significance of regression (or F-test).
Which specific regressors seem important? This leads to the t-test.
Which groups of regressors seem important? This leads to the partial F-test.
The test for significance of regression is also called the F-test or ANOVA test (analysis of variance test). It is a test to determine whether there is a linear relationship between the response and any of the regressor variables.
The hypotheses are
\(H_0: \beta_1=\beta_2=...=\beta_k=0\)
\(H_1: \beta_j \neq 0 \text{ for at least one }j\)
To decide whether to reject or not reject the null hypothesis, we look at the p-value of the F-test.
When p-value is … | our decision is … | which implies … |
---|---|---|
smaller than 0.05 | reject \(H_0\) | regression is significant |
larger than 0.05 | do not reject \(H_0\) | regression is not significant |
To perform the F-test, we use summary(...).
# perform F test
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
Rejecting the null hypothesis means the linear regression as a whole is significant in explaining the variation in the response variable; at least one regressor is significant with a nonzero slope.
Failing to reject means the linear regression is not significant and all of the regressor slopes are zero.
In the example of delivery time, the p-value of F-test is 4.687e-16, which is smaller than 0.05. So we reject the null hypothesis and conclude there is at least one significant covariate.
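The F statistic and its p-value can also be extracted programmatically. A minimal sketch, recomputing the p-value from the F distribution:
# extract the F statistic and its degrees of freedom, then recompute the p-value
fstat <- summary(model1)$fstatistic
fstat
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)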
The t-test tests whether one particular covariate (say, the \(j\)-th covariate) has a nonzero slope. Therefore, we could in total conduct \(k\) t-tests, one for each of the \(k\) covariates or \(k\) slopes. The testing procedure is the same as what we did in simple linear regression. To test the \(j\)-th covariate, the hypotheses are
\(H_0:\beta_j=0\)
\(H_1:\beta_j \neq 0\)
To decide whether to reject or not reject the null hypothesis, we look at the p-value of the t-test.
When p-value is … | our decision is … | which implies … |
---|---|---|
smaller than 0.05 | reject \(H_0\) | \(j\)th covariate has a nonzero slope |
larger than 0.05 | do not reject \(H_0\) | \(j\)th covariate has a zero slope |
To perform the t-test, we use summary(...).
# perform t test
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
Rejecting the null hypothesis means that, after adjusting for the other regressors in the linear regression, this particular regressor is significant in explaining the variation in the response variable; its slope coefficient is nonzero.
Failing to reject means that, after adjusting for the other regressors, this regressor is not significant and its slope is zero.
For the delivery time example, the p-values are 3.25e-09 for NumberofCases and 0.000631 for Distance. Therefore, we conclude both covariates have nonzero slopes.
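These p-values are the fourth column of the coefficient table, and each t statistic is simply the estimate divided by its standard error. A minimal sketch, reproducing the t statistic and p-value for Distance by hand:
# p-values of the t-tests are the fourth column of the coefficient table
summary(model1)$coef[, 4]
# reproduce the t statistic and p-value for Distance by hand
est  <- summary(model1)$coef["Distance", "Estimate"]
se   <- summary(model1)$coef["Distance", "Std. Error"]
tval <- est / se
2 * pt(-abs(tval), df = df.residual(model1))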
The partial F-test tests multiple slopes simultaneously. Suppose we have fitted a linear regression with \(k\) covariates. To test the slopes of a subset of \(m\) covariates, where \(m<k\), we use a partial F-test.
The hypotheses are:
\(H_0: \beta_j = 0 \text{ for all }j\text{ in the subset of covariates}\)
\(H_1: \beta_j \neq 0 \text{ for at least one }j\text{ in the subset of covariates}\)
To decide whether to reject or not reject the null hypothesis, we look at the p-value of the partial F-test.
When p-value is … | our decision is … | which implies … |
---|---|---|
smaller than 0.05 | reject \(H_0\) | at least one nonzero slope in the subset |
larger than 0.05 | do not reject \(H_0\) | this subset has all zero slopes |
To perform the partial F-test, we use anova(...). Here is a new example.
d=read.csv("data_AirlinesCost.csv")
names(d)=c(
"Airline",
"LengthFlight",
"Speed",
"DailyFlight",
"PopulationServed",
"Cost",
"Revenue",
"Ton",
"Capacity",
"Assets",
"Investments",
"AdjustedAssets"
)
Suppose we want to check whether a subset of covariates, including LengthFlight, Speed, DailyFlight, and PopulationServed, is significant in the regression. Then our hypotheses are
\(H_0: \beta_{LengthFlight}=\beta_{Speed}=\beta_{DailyFlight}=\beta_{PopulationServed}= 0\)
\(H_1: \text{at least one slope is nonzero, i.e.,} \beta_{LengthFlight} \neq 0 \text{ or } \beta_{Speed} \neq 0 \text{ or } \beta_{DailyFlight} \neq 0 \text{ or } \beta_{PopulationServed} \neq 0\)
To perform such a test, we use the following function.
model3=lm(Cost ~ LengthFlight + Speed + DailyFlight + PopulationServed + Revenue + Ton + Capacity + Assets + Investments,data=d)
model4=lm(Cost ~ Revenue + Ton + Capacity + Assets + Investments,data=d)
anova(model3,model4)
## Analysis of Variance Table
##
## Model 1: Cost ~ LengthFlight + Speed + DailyFlight + PopulationServed +
## Revenue + Ton + Capacity + Assets + Investments
## Model 2: Cost ~ Revenue + Ton + Capacity + Assets + Investments
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 48091
## 2 25 71490 -4 -23399 2.5544 0.06905 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Rejecting the null hypothesis means that, after adjusting for the other regressors (those not in the partial F-test), the regressors in the partial F-test are significant in explaining the variation in the response variable; at least one of their slope coefficients is nonzero.
Failing to reject means that, after adjusting for the other regressors (those not in the partial F-test), the regressors in the partial F-test are not significant and all of their slopes are zero.
For the airline example, the p-value is 0.06905, which means we do not reject the null hypothesis and conclude that the slopes of LengthFlight, Speed, DailyFlight, and PopulationServed are all zero.
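The partial F statistic reported by anova() can be reproduced from the residual sums of squares of the two models. A minimal sketch:
# reproduce the partial F statistic from the residual sums of squares
rss_full    <- deviance(model3)   # RSS of the full model
rss_reduced <- deviance(model4)   # RSS of the reduced model
m <- 4                            # number of slopes being tested
df_full <- df.residual(model3)
Fstat <- ((rss_reduced - rss_full) / m) / (rss_full / df_full)
Fstat
pf(Fstat, m, df_full, lower.tail = FALSE)   # should match the anova() p-value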
Both the t-test and the F-test can be considered special cases of the partial F-test.
To see t-test as a special case of partial F-test, we present the following example.
# use partial F test to compare covariate individually as the t-test.
# reduced is a simpler model with one covariate, Distance
reduced = lm(DeliveryTime ~ Distance, data=delivery) # Reduced model
# full is a more complicated model with two covariates, NumberofCases and Distance
full = model1 # full model
# use F test to compare two models
anova(reduced, full)
## Analysis of Variance Table
##
## Model 1: DeliveryTime ~ Distance
## Model 2: DeliveryTime ~ NumberofCases + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 1185.39
## 2 22 233.73 1 951.66 89.575 3.255e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check p value
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
# we could repeat the above procedure for another reduced model with only NumberofCases as covariate
reduced = lm(DeliveryTime ~ NumberofCases, data=delivery) # Reduced model
full = model1 # full model
anova(reduced, full)
## Analysis of Variance Table
##
## Model 1: DeliveryTime ~ NumberofCases
## Model 2: DeliveryTime ~ NumberofCases + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 402.13
## 2 22 233.73 1 168.4 15.851 0.0006312 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check p value
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
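Note that when the reduced model drops exactly one covariate, the partial F statistic equals the square of the corresponding t statistic. A quick check:
# the partial F statistic for dropping one covariate equals the squared t statistic
summary(model1)$coef["Distance", "t value"]^2         # compare with F = 15.851 above
summary(model1)$coef["NumberofCases", "t value"]^2    # compare with F = 89.575 above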
To see F-test as a special case of partial F-test, we present the following example.
# we can even let the reduced model be a model with no covariates; the 1 here means the linear regression has only an intercept and no covariates
reduced = lm(DeliveryTime ~ 1, data=delivery) # Reduced model
full = model1 # full model
anova(reduced, full)
## Analysis of Variance Table
##
## Model 1: DeliveryTime ~ 1
## Model 2: DeliveryTime ~ NumberofCases + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 5784.5
## 2 22 233.7 2 5550.8 261.24 4.687e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check p value
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
Note that if we only use anova(model1) instead of anova(reduced, full), the results are very different; we do not cover this case in this course.
To summarize:
t-test: a hypothesis test that one slope parameter is 0.
F-test: a hypothesis test that all of the slope parameters are 0.
Partial F-test: a hypothesis test that a subset (more than one, but not all) of the slope parameters are 0.
Interval estimation of the regression coefficients in multiple linear regression is the same as in simple linear regression, except that we now have one interval for each slope. The interpretation is also the same as in simple linear regression.
confint(model1,level=0.95)
## 2.5 % 97.5 %
## (Intercept) 0.066751987 4.61571030
## NumberofCases 1.261824662 1.96998976
## Distance 0.006891745 0.02187791
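Each interval is the usual estimate plus or minus the t critical value times the standard error. A minimal sketch reproducing the interval for NumberofCases:
# reproduce the 95% confidence interval for NumberofCases by hand
est   <- summary(model1)$coef["NumberofCases", "Estimate"]
se    <- summary(model1)$coef["NumberofCases", "Std. Error"]
tcrit <- qt(0.975, df = df.residual(model1))
c(est - tcrit * se, est + tcrit * se)   # should match the confint() row above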
Prediction in multiple linear regression is the same as in simple linear regression. We still use predict(...). The interpretation is the same as in simple linear regression too.
Here is an example that makes predictions for the delivery time data set.
observations_for_pred=data.frame(NumberofCases=c(10,15,20), Distance=c(1000,1500,2000))
observations_for_pred
## NumberofCases Distance
## 1 10 1000
## 2 15 1500
## 3 20 2000
head(delivery)
## Index DeliveryTime NumberofCases Distance
## 1 1 16.68 7 560
## 2 2 11.50 3 220
## 3 3 12.03 3 340
## 4 4 14.88 4 80
## 5 5 13.75 6 150
## 6 6 18.11 7 330
Note that the column names of observations_for_pred are exactly the same as the column names of these covariates in the original data frame delivery. This is very important; otherwise, R would not be able to recognize which covariates to use for prediction. Because model1 was built using the data frame delivery, if we want to use model1 for prediction, we need to feed it a data frame in the same format with the same column names.
To make predictions based on model1, we use the following code.
# obtain prediction for new observations
predict(model1,observations_for_pred, type="response")
## 1 2 3
## 32.88513 48.15708 63.42903
# obtain prediction interval for new observations
predict(model1,observations_for_pred,interval="prediction", level=0.95, type="response")
## fit lwr upr
## 1 32.88513 24.87868 40.89158
## 2 48.15708 38.70022 57.61394
## 3 63.42903 52.14917 74.70889
# obtain confidence interval for mean response for new observations
predict(model1,observations_for_pred,interval="confidence", level=0.95, type="response")
## fit lwr upr
## 1 32.88513 28.59454 37.17572
## 2 48.15708 41.54359 54.77056
## 3 63.42903 54.39901 72.45905
Note that the confidence interval for mean response is shorter than the prediction interval.
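The point predictions are just the fitted equation evaluated at the new covariate values. A quick check by hand:
# reproduce the point predictions by hand from the coefficient estimates
b <- coef(model1)
b["(Intercept)"] + b["NumberofCases"] * observations_for_pred$NumberofCases +
  b["Distance"] * observations_for_pred$Distance   # should match predict() above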
We can also make predictions for all the existing observations.
# obtain these intervals for existing observations.
predict(model1,delivery,interval="confidence", level=0.95, type="response")
## fit lwr upr
## 1 21.708084 19.551297 23.864872
## 2 10.353615 8.556216 12.151013
## 3 12.079794 9.955744 14.203843
## 4 9.955646 7.980522 11.930770
## 5 14.194398 12.343039 16.045757
## 6 18.399574 17.000017 19.799132
## 7 7.155376 5.222061 9.088692
## 8 16.673395 14.966973 18.379818
## 9 71.820294 67.048610 76.591977
## 10 19.123587 16.128667 22.118507
## 11 38.092507 36.108636 40.076378
## 12 21.593041 19.314141 23.871941
## 13 12.472991 10.801755 14.144227
## 14 18.682464 16.791631 20.573297
## 15 23.328798 21.958209 24.699388
## 16 29.662928 26.909298 32.416559
## 17 14.913640 13.265705 16.561574
## 18 15.551379 13.454112 17.648645
## 19 7.706807 5.607492 9.806121
## 20 40.887970 38.732422 43.043518
## 21 20.514179 17.766059 23.262299
## 22 56.006528 51.776559 60.236497
## 23 23.357568 21.984492 24.730644
## 24 24.402854 22.055286 26.750421
## 25 10.962584 9.217532 12.707636
predict(model1,delivery,interval="prediction", level=0.95, type="response")
## fit lwr upr
## 1 21.708084 14.6126113 28.80356
## 2 10.353615 3.3589989 17.34823
## 3 12.079794 4.9942032 19.16538
## 4 9.955646 2.9132656 16.99803
## 5 14.194398 7.1857225 21.20307
## 6 18.399574 11.4964758 25.30267
## 7 7.155376 0.1246073 14.18615
## 8 16.673395 9.7016031 23.64519
## 9 71.820294 63.5460584 80.09453
## 10 19.123587 11.7301065 26.51707
## 11 38.092507 31.0476684 45.13735
## 12 21.593041 14.4595011 28.72658
## 13 12.472991 5.5097274 19.43625
## 14 18.682464 11.6632578 25.70167
## 15 23.328798 16.4315145 30.22608
## 16 29.662928 22.3638539 36.96200
## 17 14.913640 7.9559322 21.87135
## 18 15.551379 8.4737709 22.62899
## 19 7.706807 0.6285915 14.78502
## 20 40.887970 33.7928734 47.98307
## 21 20.514179 13.2171816 27.81118
## 22 56.006528 48.0324043 63.98065
## 23 23.357568 16.4597897 30.25535
## 24 24.402854 17.2470809 31.55863
## 25 10.962584 3.9812364 17.94393
In the delivery time example, can you identify the most influential covariate? To answer this question, we need standardized regression coefficients.
In multiple linear regression, it is often difficult to compare regression coefficients directly because the covariates are measured in different units (dimensions). It may be beneficial to work with dimensionless regression coefficients, which are often referred to as standardized regression coefficients. To obtain standardized regression coefficients, we introduce the method of unit normal scaling.
This approach employs unit normal scaling for the regressors and the response variable. That is,
\[ z_{ij}= \frac{x_{ij}-\bar{x}_j}{s_j}\\ y^*_{i}= \frac{y_{i}-\bar{y}}{s_y} \]
where
\[ s^2_j=\frac{1}{n-1}\sum_{i=1}^{n} (x_{ij}-\bar{x}_j)^2\\ s^2_y=\frac{1}{n-1}\sum_{i=1}^{n} (y_{i}-\bar{y})^2 \]
After the scaling, all of the scaled regressors and the scaled response have sample mean equal to zero and sample variance equal to 1. Furthermore, we build another new linear regression using the scaled regressors and the scaled response. The model becomes
\[ y^*_i = b_1 z_{i1} + ... + b_k z_{ik} + \epsilon^*_i \]
The coefficients from this new regression, \(b_1,...,b_k\), are called standardized regression coefficients.
For the delivery time example, we conduct unit normal scaling as follows.
# original regression coefficient
model1
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Coefficients:
## (Intercept) NumberofCases Distance
## 2.34123 1.61591 0.01438
# transform the data using unit normal scaling
delivery_unit_normal=as.data.frame(apply(delivery,2,function(x){(x-mean(x))/sd(x)}))
# redo regression
model1_unit_normal <- lm(DeliveryTime ~ NumberofCases+Distance, data=delivery_unit_normal)
# obtain standardized regression coefficients
model1_unit_normal
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery_unit_normal)
##
## Coefficients:
## (Intercept) NumberofCases Distance
## -1.118e-16 7.163e-01 3.013e-01
summary(model1_unit_normal)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery_unit_normal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37282 -0.04270 0.02811 0.07450 0.47792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.118e-16 4.199e-02 0.000 1.000000
## NumberofCases 7.163e-01 7.568e-02 9.464 3.25e-09 ***
## Distance 3.013e-01 7.568e-02 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.21 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
summary(model1) # compare with the original regression: the t values, p-values, and R-squared are unchanged
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
The interpretation is: when \(x_1\) increases by one standard deviation (i.e., one standard deviation of \(x_1\)), the mean/average/expectation of the response variable increases by \(b_1\) standard deviations of \(y\) (i.e., the mean of \(y\) increases by \(b_1 \times s_y = 0.7163 \, s_y\), where \(s_y\) is \(y\)'s standard deviation).
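The standardized coefficients are related to the original ones by \(b_j = \hat{\beta}_j s_j / s_y\), where \(s_j\) and \(s_y\) are the sample standard deviations of the \(j\)-th covariate and the response. A quick check:
# standardized coefficients equal the original coefficients times s_xj / s_y
coef(model1)["NumberofCases"] * sd(delivery$NumberofCases) / sd(delivery$DeliveryTime)
coef(model1)["Distance"] * sd(delivery$Distance) / sd(delivery$DeliveryTime)
# compare with coef(model1_unit_normal)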
Finally, we plot the contour of the leverage values of all possible data points given the observed data points on the left. The black dots are the original data points, which determine the contour; the blue dots are the new observations to be predicted. On the right, we add the four data points. As we can see, only one point falls in the middle of the contour; the other three lie outside the largest contour curve that contains the original data points.
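The leverage of a new point relative to the observed design can be computed directly as \(h_{new} = x_{new}' (X'X)^{-1} x_{new}\). A minimal sketch, assuming we use the three new observations in observations_for_pred defined earlier:
# leverage of new points relative to the original design
X    <- model.matrix(model1)     # design matrix of the fitted model
Xnew <- cbind(1, observations_for_pred$NumberofCases, observations_for_pred$Distance)
h_new <- diag(Xnew %*% solve(t(X) %*% X) %*% t(Xnew))
h_new                       # large values indicate extrapolation beyond the original data
max(hatvalues(model1))      # compare with the largest leverage among the original points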