Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This material was initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.
Remember the delivery time example we introduced in Chapter 1? What if we have more than one covariate, e.g., delivery volume and delivery distance? Will that help with the prediction? To answer that question, we need multiple linear regression.
Suppose the response variable is still delivery time, and the covariates now include delivery volume (i.e., the number of packages) and delivery distance (i.e., how far each truck travels to complete the trip). We can then use a multiple linear regression to describe the relationship between the response and the covariates.
\[ y_i=\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i \]
A multiple linear regression is harder to visualize than simple linear regression. However, with two covariates, we can still plot it.
In general, the multiple linear regression model with \(k\) regressors is
\[ y_i=\beta_0 + \beta_1 x_{i1} + ... + \beta_k x_{ik} + \epsilon_i =\beta_0 + \sum_{j=1}^{k}\beta_j x_{ij} + \epsilon_i \]
Notation:
\(k\) – number of regressors
\(y_i\) – \(i\)-th observation of the response
\(x_{ij}\) – \(i\)-th observation of the \(j\)-th covariate.
The least squares estimate is obtained in the same way as in simple linear regression: it minimizes the sum of squared distances. The sum of squared distances is
\[ S(b_0,b_1,...,b_k)=\sum_{i=1}^{n}(y_i - b_0 - b_1 x_{i1} - ... - b_k x_{ik})^2 \]
and our least squares estimate is
\[ (\hat{\beta}_0,\hat{\beta}_1,...,\hat{\beta}_k)=\arg\min_{b_0,b_1,...,b_k}S(b_0,b_1,...,b_k) \]
A demonstration is available here: http://setosa.io/ev/ordinary-least-squares-regression/
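To make the idea concrete, here is a minimal sketch that minimizes the sum of squared distances numerically with optim() and compares the result with lm(). The simulated data and variable names (x1, x2, y) are made up for illustration; they are not the delivery data.
# simulate a small data set with two covariates (illustrative only)
set.seed(1)
n  <- 50
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n)
# sum of squared distances S(b0, b1, b2)
S <- function(b) sum((y - b[1] - b[2] * x1 - b[3] * x2)^2)
# minimize S numerically and compare with the least squares fit from lm()
optim(c(0, 0, 0), S)$par
coef(lm(y ~ x1 + x2))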
Now let us turn to our delivery time example.
delivery <- read.csv("data_delivery.csv", header = TRUE)
head(delivery)
## Index DeliveryTime NumberofCases Distance
## 1 1 16.68 7 560
## 2 2 11.50 3 220
## 3 3 12.03 3 340
## 4 4 14.88 4 80
## 5 5 13.75 6 150
## 6 6 18.11 7 330
# visualize data
pairs(delivery, pch = 20)
# fit linear regression
model1 <- lm(DeliveryTime ~ NumberofCases + Distance, data=delivery)
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
# obtain coefficient estimate
model1$coefficients
## (Intercept) NumberofCases Distance
## 2.34123115 1.61590721 0.01438483
# obtain residuals
model1$residuals
## 1 2 3 4 5 6 7
## -5.0280843 1.1463854 -0.0497937 4.9243539 -0.4443983 -0.2895743 0.8446235
## 8 9 10 11 12 13 14
## 1.1566049 7.4197062 2.3764129 2.2374930 -0.5930409 1.0270093 1.0675359
## 15 16 17 18 19 20 21
## 0.6712018 -0.6629284 0.4363603 3.4486213 1.7931935 -5.7879699 -2.6141789
## 22 23 24 25
## -3.6865279 -4.6075679 -4.5728535 -0.2125839
# obtain fitted value
model1$fitted.values
## 1 2 3 4 5 6 7 8
## 21.708084 10.353615 12.079794 9.955646 14.194398 18.399574 7.155376 16.673395
## 9 10 11 12 13 14 15 16
## 71.820294 19.123587 38.092507 21.593041 12.472991 18.682464 23.328798 29.662928
## 17 18 19 20 21 22 23 24
## 14.913640 15.551379 7.706807 40.887970 20.514179 56.006528 23.357568 24.402854
## 25
## 10.962584
# obtain estimate and standard errors
summary(model1)$coef
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.34123115 1.096730168 2.134738 4.417012e-02
## NumberofCases 1.61590721 0.170734918 9.464421 3.254932e-09
## Distance 0.01438483 0.003613086 3.981313 6.312469e-04
# obtain just standard errors
summary(model1)$coef[,2]
## (Intercept) NumberofCases Distance
## 1.096730168 0.170734918 0.003613086
The interpretations of the coefficients, residuals, and fitted values are the same as in simple linear regression. In addition, the least squares estimate is also unbiased in multiple linear regression. The standard error of each individual estimate can be interpreted the same way as in simple linear regression, i.e., as “wiggle room”.
\(R^2\) is calculated exactly the same way as in simple linear regression:
\[ R^2=\frac{SSR}{SST}=1-\frac{SSRes}{SST} \]
However, in the multiple linear regression setting, \(R^2\) can be inflated simply by adding more terms to the model (even insignificant terms). Therefore, we need to penalize for the added terms, which leads to the adjusted \(R^2\):
\[ R^2_{adj}=1-\frac{SSRes/(n-k-1)}{SST/(n-1)}=1-(1-R^2)\frac{n-1}{n-k-1} \]
where \(k\) is the total number of covariates in the model (not including the intercept).
#generate a random variable, add it to the model
set.seed(2)
noise <- rnorm(nrow(delivery),0,1)
model2 <- lm(DeliveryTime ~ NumberofCases + Distance + noise, data=delivery)
summary(model1)$r.square
## [1] 0.9595937
summary(model2)$r.square
## [1] 0.9596191
summary(model1)$adj.r.squared
## [1] 0.9559205
summary(model2)$adj.r.squared
## [1] 0.9538504
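As a sanity check, the adjusted \(R^2\) can be computed by hand from the formula above; here is a minimal sketch for model1, where \(n=25\) and \(k=2\).
# compute adjusted R^2 manually from R^2, n, and k
r2 <- summary(model1)$r.squared
n  <- nrow(delivery)
k  <- 2   # number of covariates in model1
1 - (1 - r2) * (n - 1) / (n - k - 1)   # should match summary(model1)$adj.r.squared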
Once we have estimated the parameters in the model, we face the following questions:
Is the linear regression as a whole effective and significant? This leads to the test for significance of regression (or F-test).
Which specific regressors seem important? This leads to the t-test.
Which groups of regressors seem important? This leads to the partial F-test.
The test for significance of regression is also called the F-test or ANOVA test (analysis of variance test). It is a test to determine whether there is a linear relationship between the response and any of the regressor variables.
The hypotheses are
\(H_0: \beta_1=\beta_2=...=\beta_k=0\)
\(H_1: \beta_j \neq 0 \text{ for at least one }j\)
To decide whether to reject or not reject the null hypothesis, we look at the p-value of the F-test.
When p-value is … | our decision is … | which implies … |
---|---|---|
smaller than 0.05 | reject \(H_0\) | regression is significant |
larger than 0.05 | do not reject \(H_0\) | regression is not significant |
To perform the F-test, we use summary(...).
# perform F test
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
Rejecting the null hypothesis means the linear regression as a whole is significant in explaining the variation in the response variable; at least one regressor is significant with a nonzero slope.
Failing to reject means the linear regression is not significant and all of the regressor slopes are zero.
In the example of delivery time, the p-value of F-test is 4.687e-16, which is smaller than 0.05. So we reject the null hypothesis and conclude there is at least one significant covariate.
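The F statistic and its p-value can also be extracted programmatically. A minimal sketch, recomputing the p-value from the F distribution:
# extract the F statistic and its degrees of freedom, then recompute the p-value
fstat <- summary(model1)$fstatistic
fstat
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)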
The t-test tests whether one particular covariate (say, the \(j\)-th covariate) has a nonzero slope. Therefore, we could in total conduct \(k\) t-tests, one for each of the \(k\) covariates or \(k\) slopes. The testing procedure is the same as what we did in simple linear regression. To test the \(j\)-th covariate, the hypotheses are
\(H_0:\beta_j=0\)
\(H_1:\beta_j \neq 0\)
To decide whether to reject or not reject the null hypothesis, we look at the p-value of the t-test.
When p-value is … | our decision is … | which implies … |
---|---|---|
smaller than 0.05 | reject \(H_0\) | \(j\)th covariate has a nonzero slope |
larger than 0.05 | do not reject \(H_0\) | \(j\)th covariate has a zero slope |
To perform the t-test, we use summary(...).
# perform t test
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
Rejecting the null hypothesis means that, after adjusting for the other regressors in the linear regression, this particular regressor is significant in explaining the variation in the response variable; its slope coefficient is nonzero.
Failing to reject means that, after adjusting for the other regressors, this regressor is not significant and its slope is zero.
For the delivery time example, the p-values are 3.25e-09 for NumberofCases and 0.000631 for Distance. Therefore, we conclude both covariates have nonzero slopes.
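These p-values are the fourth column of the coefficient table, and each t statistic is simply the estimate divided by its standard error. A minimal sketch, reproducing the t statistic and p-value for Distance by hand:
# p-values of the t-tests are the fourth column of the coefficient table
summary(model1)$coef[, 4]
# reproduce the t statistic and p-value for Distance by hand
est  <- summary(model1)$coef["Distance", "Estimate"]
se   <- summary(model1)$coef["Distance", "Std. Error"]
tval <- est / se
2 * pt(-abs(tval), df = df.residual(model1))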
The partial F-test tests multiple slopes simultaneously. Suppose we have fitted a linear regression with \(k\) covariates. To test the slopes of a subset of \(m\) covariates, where \(m<k\), we use a partial F-test.
The hypotheses are:
\(H_0: \beta_j = 0 \text{ for all }j\text{ in the subset of covariates}\)
\(H_1: \beta_j \neq 0 \text{ for at least one }j\text{ in the subset of covariates}\)
To decide whether to reject or not reject the null hypothesis, we look at the p-value of the partial F-test.
When p-value is … | our decision is … | which implies … |
---|---|---|
smaller than 0.05 | reject \(H_0\) | at least one nonzero slope in the subset |
larger than 0.05 | do not reject \(H_0\) | this subset has all zero slopes |
To perform the partial F-test, we use anova(...). Here is a new example.
d=read.csv("data_AirlinesCost.csv")
names(d)=c(
"Airline",
"LengthFlight",
"Speed",
"DailyFlight",
"PopulationServed",
"Cost",
"Revenue",
"Ton",
"Capacity",
"Assets",
"Investments",
"AdjustedAssets"
)
Suppose we want to check whether a subset of covariates, including LengthFlight, Speed, DailyFlight, and PopulationServed, is significant in the regression. Then our hypotheses are
\(H_0: \beta_{LengthFlight}=\beta_{Speed}=\beta_{DailyFlight}=\beta_{PopulationServed}= 0\)
\(H_1: \text{at least one slope is nonzero, i.e.,} \beta_{LengthFlight} \neq 0 \text{ or } \beta_{Speed} \neq 0 \text{ or } \beta_{DailyFlight} \neq 0 \text{ or } \beta_{PopulationServed} \neq 0\)
To perform such a test, we use the following function.
model3=lm(Cost ~ LengthFlight + Speed + DailyFlight + PopulationServed + Revenue + Ton + Capacity + Assets + Investments,data=d)
model4=lm(Cost ~ Revenue + Ton + Capacity + Assets + Investments,data=d)
anova(model3,model4)
## Analysis of Variance Table
##
## Model 1: Cost ~ LengthFlight + Speed + DailyFlight + PopulationServed +
## Revenue + Ton + Capacity + Assets + Investments
## Model 2: Cost ~ Revenue + Ton + Capacity + Assets + Investments
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 21 48091
## 2 25 71490 -4 -23399 2.5544 0.06905 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Rejecting the null hypothesis means that, after adjusting for the other regressors (those not in the partial F-test), the regressors in the partial F-test are significant in explaining the variation in the response variable; at least one of their slope coefficients is nonzero.
Failing to reject means that, after adjusting for the other regressors (those not in the partial F-test), the regressors in the partial F-test are not significant and all of their slopes are zero.
For the airline example, the p-value is 0.06905, which means we do not reject the null hypothesis and conclude that the slopes of LengthFlight, Speed, DailyFlight, and PopulationServed are all zero.
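The partial F statistic reported by anova() can be reproduced from the residual sums of squares of the two models. A minimal sketch:
# reproduce the partial F statistic from the residual sums of squares
rss_full    <- deviance(model3)   # RSS of the full model
rss_reduced <- deviance(model4)   # RSS of the reduced model
m <- 4                            # number of slopes being tested
df_full <- df.residual(model3)
Fstat <- ((rss_reduced - rss_full) / m) / (rss_full / df_full)
Fstat
pf(Fstat, m, df_full, lower.tail = FALSE)   # should match the anova() p-value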
Both the t-test and the F-test can be considered special cases of the partial F-test.
To see t-test as a special case of partial F-test, we present the following example.
# use partial F test to compare covariate individually as the t-test.
# reduced is a simpler model with one covariate, Distance
reduced = lm(DeliveryTime ~ Distance, data=delivery) # Reduced model
# full is a more complicated model with two covariates, NumberofCases and Distance
full = model1 # full model
# use F test to compare two models
anova(reduced, full)
## Analysis of Variance Table
##
## Model 1: DeliveryTime ~ Distance
## Model 2: DeliveryTime ~ NumberofCases + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 1185.39
## 2 22 233.73 1 951.66 89.575 3.255e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check p value
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
# we could repeat the above procedure for another reduced model with only NumberofCases as covariate
reduced = lm(DeliveryTime ~ NumberofCases, data=delivery) # Reduced model
full = model1 # full model
anova(reduced, full)
## Analysis of Variance Table
##
## Model 1: DeliveryTime ~ NumberofCases
## Model 2: DeliveryTime ~ NumberofCases + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 23 402.13
## 2 22 233.73 1 168.4 15.851 0.0006312 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check p value
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
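Note that when the reduced model drops exactly one covariate, the partial F statistic equals the square of the corresponding t statistic. A quick check:
# the partial F statistic for dropping one covariate equals the squared t statistic
summary(model1)$coef["Distance", "t value"]^2         # compare with F = 15.851 above
summary(model1)$coef["NumberofCases", "t value"]^2    # compare with F = 89.575 above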
To see F-test as a special case of partial F-test, we present the following example.
# we can even let the reduced model be a model with no covariates; the 1 here means the linear regression has only an intercept and no covariates
reduced = lm(DeliveryTime ~ 1, data=delivery) # Reduced model
full = model1 # full model
anova(reduced, full)
## Analysis of Variance Table
##
## Model 1: DeliveryTime ~ 1
## Model 2: DeliveryTime ~ NumberofCases + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 24 5784.5
## 2 22 233.7 2 5550.8 261.24 4.687e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# check p value
summary(model1)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
Note that if we only use anova(model1) instead of anova(reduced, full), the results are very different; we do not cover this case in this course.
To summarize:
t-test: a hypothesis test that one slope parameter is 0.
F-test: a hypothesis test that all of the slope parameters are 0.
Partial F-test: a hypothesis test that a subset (more than one, but not all) of the slope parameters are 0.
Interval estimation of the regression coefficients in multiple linear regression is the same as in simple linear regression, except that we now have one interval for each slope. The interpretation is also the same as in simple linear regression.
confint(model1,level=0.95)
## 2.5 % 97.5 %
## (Intercept) 0.066751987 4.61571030
## NumberofCases 1.261824662 1.96998976
## Distance 0.006891745 0.02187791
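Each interval is the usual estimate plus or minus the t critical value times the standard error. A minimal sketch reproducing the interval for NumberofCases:
# reproduce the 95% confidence interval for NumberofCases by hand
est   <- summary(model1)$coef["NumberofCases", "Estimate"]
se    <- summary(model1)$coef["NumberofCases", "Std. Error"]
tcrit <- qt(0.975, df = df.residual(model1))
c(est - tcrit * se, est + tcrit * se)   # should match the confint() row above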
Prediction in multiple linear regression is the same as in simple linear regression. We still use predict(...). The interpretation is the same as in simple linear regression too.
Here is an example that makes predictions for the delivery time data set.
observations_for_pred=data.frame(NumberofCases=c(10,15,20), Distance=c(1000,1500,2000))
observations_for_pred
## NumberofCases Distance
## 1 10 1000
## 2 15 1500
## 3 20 2000
head(delivery)
## Index DeliveryTime NumberofCases Distance
## 1 1 16.68 7 560
## 2 2 11.50 3 220
## 3 3 12.03 3 340
## 4 4 14.88 4 80
## 5 5 13.75 6 150
## 6 6 18.11 7 330
Note that the column names of observations_for_pred are exactly the same as the column names of these covariates in the original data frame delivery. This is very important; otherwise, R would not be able to recognize which covariates to use for prediction. Because model1 was built using the data frame delivery, if we want to use model1 for prediction, we need to feed it a data frame in the same format with the same column names.
To make predictions based on model1, we use the following code.
# obtain prediction for new observations
predict(model1,observations_for_pred, type="response")
## 1 2 3
## 32.88513 48.15708 63.42903
# obtain prediction interval for new observations
predict(model1,observations_for_pred,interval="prediction", level=0.95, type="response")
## fit lwr upr
## 1 32.88513 24.87868 40.89158
## 2 48.15708 38.70022 57.61394
## 3 63.42903 52.14917 74.70889
# obtain confidence interval for mean response for new observations
predict(model1,observations_for_pred,interval="confidence", level=0.95, type="response")
## fit lwr upr
## 1 32.88513 28.59454 37.17572
## 2 48.15708 41.54359 54.77056
## 3 63.42903 54.39901 72.45905
Note that the confidence interval for mean response is shorter than the prediction interval.
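The point predictions are just the fitted equation evaluated at the new covariate values. A quick check by hand:
# reproduce the point predictions by hand from the coefficient estimates
b <- coef(model1)
b["(Intercept)"] + b["NumberofCases"] * observations_for_pred$NumberofCases +
  b["Distance"] * observations_for_pred$Distance   # should match predict() above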
We can also make predictions for all the existing observations.
# obtain these intervals for existing observations.
predict(model1,delivery,interval="confidence", level=0.95, type="response")
## fit lwr upr
## 1 21.708084 19.551297 23.864872
## 2 10.353615 8.556216 12.151013
## 3 12.079794 9.955744 14.203843
## 4 9.955646 7.980522 11.930770
## 5 14.194398 12.343039 16.045757
## 6 18.399574 17.000017 19.799132
## 7 7.155376 5.222061 9.088692
## 8 16.673395 14.966973 18.379818
## 9 71.820294 67.048610 76.591977
## 10 19.123587 16.128667 22.118507
## 11 38.092507 36.108636 40.076378
## 12 21.593041 19.314141 23.871941
## 13 12.472991 10.801755 14.144227
## 14 18.682464 16.791631 20.573297
## 15 23.328798 21.958209 24.699388
## 16 29.662928 26.909298 32.416559
## 17 14.913640 13.265705 16.561574
## 18 15.551379 13.454112 17.648645
## 19 7.706807 5.607492 9.806121
## 20 40.887970 38.732422 43.043518
## 21 20.514179 17.766059 23.262299
## 22 56.006528 51.776559 60.236497
## 23 23.357568 21.984492 24.730644
## 24 24.402854 22.055286 26.750421
## 25 10.962584 9.217532 12.707636
predict(model1,delivery,interval="prediction", level=0.95, type="response")
## fit lwr upr
## 1 21.708084 14.6126113 28.80356
## 2 10.353615 3.3589989 17.34823
## 3 12.079794 4.9942032 19.16538
## 4 9.955646 2.9132656 16.99803
## 5 14.194398 7.1857225 21.20307
## 6 18.399574 11.4964758 25.30267
## 7 7.155376 0.1246073 14.18615
## 8 16.673395 9.7016031 23.64519
## 9 71.820294 63.5460584 80.09453
## 10 19.123587 11.7301065 26.51707
## 11 38.092507 31.0476684 45.13735
## 12 21.593041 14.4595011 28.72658
## 13 12.472991 5.5097274 19.43625
## 14 18.682464 11.6632578 25.70167
## 15 23.328798 16.4315145 30.22608
## 16 29.662928 22.3638539 36.96200
## 17 14.913640 7.9559322 21.87135
## 18 15.551379 8.4737709 22.62899
## 19 7.706807 0.6285915 14.78502
## 20 40.887970 33.7928734 47.98307
## 21 20.514179 13.2171816 27.81118
## 22 56.006528 48.0324043 63.98065
## 23 23.357568 16.4597897 30.25535
## 24 24.402854 17.2470809 31.55863
## 25 10.962584 3.9812364 17.94393
In the delivery time example, can you identify the most influential covariate? To answer this question, we need standardized regression coefficients.
In multiple linear regression, it is often difficult to compare regression coefficients directly because the covariates are measured in different units (dimensions). It may be beneficial to work with dimensionless regression coefficients, which are often referred to as standardized regression coefficients. To obtain standardized regression coefficients, we introduce the method of unit normal scaling.
This approach employs unit normal scaling for the regressors and the response variable. That is,
\[ z_{ij}= \frac{x_{ij}-\bar{x}_j}{s_j}\\ y^*_{i}= \frac{y_{i}-\bar{y}}{s_y} \]
where
\[ s^2_j=\frac{1}{n-1}\sum_{i=1}^{n} (x_{ij}-\bar{x}_j)^2\\ s^2_y=\frac{1}{n-1}\sum_{i=1}^{n} (y_{i}-\bar{y})^2 \]
After the scaling, all of the scaled regressors and the scaled response have sample mean equal to zero and sample variance equal to 1. Furthermore, we build another new linear regression using the scaled regressors and the scaled response. The model becomes
\[ y^*_i = b_1 z_{i1} + ... + b_k z_{ik} + \epsilon^*_i \]
The coefficients from this new regression, \(b_1,...,b_k\), are called standardized regression coefficients.
For the delivery time example, we conduct unit normal scaling as follows.
# original regression coefficient
model1
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Coefficients:
## (Intercept) NumberofCases Distance
## 2.34123 1.61591 0.01438
# transform the data using unit normal scaling
delivery_unit_normal=as.data.frame(apply(delivery,2,function(x){(x-mean(x))/sd(x)}))
# redo regression
model1_unit_normal <- lm(DeliveryTime ~ NumberofCases+Distance, data=delivery_unit_normal)
# obtain standardized regression coefficients
model1_unit_normal
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery_unit_normal)
##
## Coefficients:
## (Intercept) NumberofCases Distance
## -1.118e-16 7.163e-01 3.013e-01
summary(model1_unit_normal)
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery_unit_normal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37282 -0.04270 0.02811 0.07450 0.47792
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.118e-16 4.199e-02 0.000 1.000000
## NumberofCases 7.163e-01 7.568e-02 9.464 3.25e-09 ***
## Distance 3.013e-01 7.568e-02 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.21 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
summary(model1) # compare with the original regression: the t values, p-values, and R-squared are unchanged
##
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = delivery)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7880 -0.6629 0.4364 1.1566 7.4197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.341231 1.096730 2.135 0.044170 *
## NumberofCases 1.615907 0.170735 9.464 3.25e-09 ***
## Distance 0.014385 0.003613 3.981 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared: 0.9596, Adjusted R-squared: 0.9559
## F-statistic: 261.2 on 2 and 22 DF, p-value: 4.687e-16
The interpretation is: when \(x_1\) increases by one standard deviation (i.e., one standard deviation of \(x_1\)), the mean/average/expectation of the response variable increases by \(b_1\) standard deviations of \(y\) (i.e., the mean of \(y\) increases by \(b_1 \times s_y = 0.7163 \, s_y\), where \(s_y\) is \(y\)'s standard deviation).
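The standardized coefficients are related to the original ones by \(b_j = \hat{\beta}_j s_j / s_y\), where \(s_j\) and \(s_y\) are the sample standard deviations of the \(j\)-th covariate and the response. A quick check:
# standardized coefficients equal the original coefficients times s_xj / s_y
coef(model1)["NumberofCases"] * sd(delivery$NumberofCases) / sd(delivery$DeliveryTime)
coef(model1)["Distance"] * sd(delivery$Distance) / sd(delivery$DeliveryTime)
# compare with coef(model1_unit_normal)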
Finally, we plot the contour of the leverage values of all possible data points given the observed data points on the left. The black dots are the original data points, which determine the contour; the blue dots are the new observations to be predicted. On the right, we add the four data points. As we can see, only one point falls in the middle of the contour; the other three lie outside the largest contour curve that contains the original data points.
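The leverage of a new point relative to the observed design can be computed directly as \(h_{new} = x_{new}' (X'X)^{-1} x_{new}\). A minimal sketch, assuming we use the three new observations in observations_for_pred defined earlier:
# leverage of new points relative to the original design
X    <- model.matrix(model1)     # design matrix of the fitted model
Xnew <- cbind(1, observations_for_pred$NumberofCases, observations_for_pred$Distance)
h_new <- diag(Xnew %*% solve(t(X) %*% X) %*% t(Xnew))
h_new                       # large values indicate extrapolation beyond the original data
max(hatvalues(model1))      # compare with the largest leverage among the original points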