Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. These materials were initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.

1 Variable selection in general

Let us start with a data set that concerns the hardening of cement. In particular, the researchers were interested in learning how the composition of the cement affected the heat evolved during the hardening of the cement. Therefore, they measured and recorded the following data on 13 batches of cement:

Response y: heat evolved in calories during hardening of cement on a per gram basis
Predictor x1: % of tricalcium aluminate
Predictor x2: % of tricalcium silicate
Predictor x3: % of tetracalcium alumino ferrite
Predictor x4: % of dicalcium silicate

rm(list=ls())
hald <- read.csv("data_HaldCement.csv",header=TRUE)
head(hald)
##       y x1 x2 x3 x4
## 1  78.5  7 26  6 60
## 2  74.3  1 29 15 52
## 3 104.3 11 56  8 20
## 4  87.6 11 31  8 47
## 5  95.9  7 52  6 33
## 6 109.2 11 55  9 22
pairs(hald,pch=20)

For this data set, if you want to build a model to predict y, which variables do you want to include in the regression?

summary(lm(y~x1,data=hald))
## 
## Call:
## lm(formula = y ~ x1, data = hald)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.061  -9.048   1.339   7.883  15.614 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  81.4793     4.9273   16.54 4.07e-09 ***
## x1            1.8687     0.5264    3.55  0.00455 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.73 on 11 degrees of freedom
## Multiple R-squared:  0.5339, Adjusted R-squared:  0.4916 
## F-statistic:  12.6 on 1 and 11 DF,  p-value: 0.004552
summary(lm(y~x2,data=hald))
## 
## Call:
## lm(formula = y ~ x2, data = hald)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.752  -6.008  -1.684   3.794  21.387 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.4237     8.4906   6.763  3.1e-05 ***
## x2            0.7891     0.1684   4.686 0.000665 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.077 on 11 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6359 
## F-statistic: 21.96 on 1 and 11 DF,  p-value: 0.0006648
summary(lm(y~x3,data=hald))
## 
## Call:
## lm(formula = y ~ x3, data = hald)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.168 -10.075   4.144  10.299  14.399 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 110.2027     7.9478  13.866  2.6e-08 ***
## x3           -1.2558     0.5984  -2.098   0.0598 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.28 on 11 degrees of freedom
## Multiple R-squared:  0.2859, Adjusted R-squared:  0.221 
## F-statistic: 4.403 on 1 and 11 DF,  p-value: 0.05976
summary(lm(y~x4,data=hald))
## 
## Call:
## lm(formula = y ~ x4, data = hald)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.589  -8.228   1.495   4.726  17.524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 117.5679     5.2622  22.342 1.62e-10 ***
## x4           -0.7382     0.1546  -4.775 0.000576 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.964 on 11 degrees of freedom
## Multiple R-squared:  0.6745, Adjusted R-squared:  0.645 
## F-statistic:  22.8 on 1 and 11 DF,  p-value: 0.0005762
summary(lm(y~x1+x2+x3+x4,data=hald))
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = hald)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1750 -1.6709  0.2508  1.3783  3.9254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  62.4054    70.0710   0.891   0.3991  
## x1            1.5511     0.7448   2.083   0.0708 .
## x2            0.5102     0.7238   0.705   0.5009  
## x3            0.1019     0.7547   0.135   0.8959  
## x4           -0.1441     0.7091  -0.203   0.8441  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9736 
## F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07

Clearly, models with only one covariate are too weak for prediction, but in the model with all covariates included, every coefficient has an insignificant p-value. So what is the best model?

Model-Building Problem

Notes on Variable selection techniques.

Consequences of Deleting Variables

What if transformation is needed for some of the covariates? Here is the strategy for regression model building.

All Possible Regressions

Criteria for variable selection

2 Select by Adjusted R squared

The coefficient of determination \(R^2\) is defined as

\[R^2= 1 - \frac{SSRes}{SST}\]

For a model with k covariates, the adjusted R squared is

\[R^2_{adj}= 1- \frac{n-1}{n-k-1}\bigg(\frac{SSRes}{SST}\bigg)=1 - \frac{n-1}{n-k-1}(1-R^2)\]
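As a quick sanity check, both quantities can be computed by hand and compared with the values R reports. This is a sketch using the hald data frame loaded above and the model y ~ x1 + x2:

```r
# R^2 and adjusted R^2 by hand for the model y ~ x1 + x2
fit <- lm(y ~ x1 + x2, data = hald)
n <- nrow(hald)
k <- 2                                    # number of covariates
SSRes <- sum(resid(fit)^2)
SST   <- sum((hald$y - mean(hald$y))^2)
R2    <- 1 - SSRes / SST
R2adj <- 1 - (n - 1) / (n - k - 1) * (1 - R2)
c(R2, R2adj)
# compare with the built-in values from summary()
c(summary(fit)$r.squared, summary(fit)$adj.r.squared)
```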

Example

library(leaps)
hald_allmodels_r2 <- leaps( x=hald[,-1], y=hald[,1], method="r2")
hald_allmodels_r2
## $which
##       1     2     3     4
## 1 FALSE FALSE FALSE  TRUE
## 1 FALSE  TRUE FALSE FALSE
## 1  TRUE FALSE FALSE FALSE
## 1 FALSE FALSE  TRUE FALSE
## 2  TRUE  TRUE FALSE FALSE
## 2  TRUE FALSE FALSE  TRUE
## 2 FALSE FALSE  TRUE  TRUE
## 2 FALSE  TRUE  TRUE FALSE
## 2 FALSE  TRUE FALSE  TRUE
## 2  TRUE FALSE  TRUE FALSE
## 3  TRUE  TRUE FALSE  TRUE
## 3  TRUE  TRUE  TRUE FALSE
## 3  TRUE FALSE  TRUE  TRUE
## 3 FALSE  TRUE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE  TRUE
## 
## $label
## [1] "(Intercept)" "1"           "2"           "3"           "4"          
## 
## $size
##  [1] 2 2 2 2 3 3 3 3 3 3 4 4 4 4 5
## 
## $r2
##  [1] 0.6745420 0.6662683 0.5339480 0.2858727 0.9786784 0.9724710 0.9352896
##  [8] 0.8470254 0.6800604 0.5481667 0.9823355 0.9822847 0.9812811 0.9728200
## [15] 0.9823756
cbind(hald_allmodels_r2$which, Rsq = hald_allmodels_r2$r2)
##   1 2 3 4       Rsq
## 1 0 0 0 1 0.6745420
## 1 0 1 0 0 0.6662683
## 1 1 0 0 0 0.5339480
## 1 0 0 1 0 0.2858727
## 2 1 1 0 0 0.9786784
## 2 1 0 0 1 0.9724710
## 2 0 0 1 1 0.9352896
## 2 0 1 1 0 0.8470254
## 2 0 1 0 1 0.6800604
## 2 1 0 1 0 0.5481667
## 3 1 1 0 1 0.9823355
## 3 1 1 1 0 0.9822847
## 3 1 0 1 1 0.9812811
## 3 0 1 1 1 0.9728200
## 4 1 1 1 1 0.9823756
hald_allmodels_r2adj <- leaps( x=hald[,-1], y=hald[,1], method="adjr2")
hald_allmodels_r2adj
## $which
##       1     2     3     4
## 1 FALSE FALSE FALSE  TRUE
## 1 FALSE  TRUE FALSE FALSE
## 1  TRUE FALSE FALSE FALSE
## 1 FALSE FALSE  TRUE FALSE
## 2  TRUE  TRUE FALSE FALSE
## 2  TRUE FALSE FALSE  TRUE
## 2 FALSE FALSE  TRUE  TRUE
## 2 FALSE  TRUE  TRUE FALSE
## 2 FALSE  TRUE FALSE  TRUE
## 2  TRUE FALSE  TRUE FALSE
## 3  TRUE  TRUE FALSE  TRUE
## 3  TRUE  TRUE  TRUE FALSE
## 3  TRUE FALSE  TRUE  TRUE
## 3 FALSE  TRUE  TRUE  TRUE
## 4  TRUE  TRUE  TRUE  TRUE
## 
## $label
## [1] "(Intercept)" "1"           "2"           "3"           "4"          
## 
## $size
##  [1] 2 2 2 2 3 3 3 3 3 3 4 4 4 4 5
## 
## $adjr2
##  [1] 0.6449549 0.6359290 0.4915797 0.2209521 0.9744140 0.9669653 0.9223476
##  [8] 0.8164305 0.6160725 0.4578001 0.9764473 0.9763796 0.9750415 0.9637599
## [15] 0.9735634
cbind(hald_allmodels_r2adj$which, adjRsq = hald_allmodels_r2adj$adjr2)
##   1 2 3 4    adjRsq
## 1 0 0 0 1 0.6449549
## 1 0 1 0 0 0.6359290
## 1 1 0 0 0 0.4915797
## 1 0 0 1 0 0.2209521
## 2 1 1 0 0 0.9744140
## 2 1 0 0 1 0.9669653
## 2 0 0 1 1 0.9223476
## 2 0 1 1 0 0.8164305
## 2 0 1 0 1 0.6160725
## 2 1 0 1 0 0.4578001
## 3 1 1 0 1 0.9764473
## 3 1 1 1 0 0.9763796
## 3 1 0 1 1 0.9750415
## 3 0 1 1 1 0.9637599
## 4 1 1 1 1 0.9735634
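Rather than scanning the table by eye, the row with the largest adjusted R squared can be located programmatically:

```r
# index of the model with the largest adjusted R squared
best <- which.max(hald_allmodels_r2adj$adjr2)
hald_allmodels_r2adj$which[best, ]   # which covariates are included
hald_allmodels_r2adj$adjr2[best]     # the corresponding adjusted R squared
```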

According to the adjusted R squared, the row "3 1 1 0 1" (adjusted R squared 0.9764) corresponds to the best model. That is, x1, x2, and x4 should be included in the regression.

Another example:

asphalt <- read.csv("data_Asphalt.csv",header=TRUE)
head(asphalt)
##   Observation     y  x1   x2   x3 x4  x5    x6
## 1           1  6.75 2.8 4.68 4.87  0 8.4 4.916
## 2           2 13.00 1.4 5.19 4.50  0 6.5 4.563
## 3           3 14.75 1.4 4.82 4.73  0 7.9 5.321
## 4           4 12.60 3.3 4.85 4.76  0 8.3 4.865
## 5           5  8.25 1.7 4.86 4.95  0 8.4 3.776
## 6           6 10.67 2.9 5.16 4.45  0 7.4 4.397
pairs(asphalt,pch=20)

asphalt_allmodels_r2 <- leaps(x=asphalt[,3:8],y=asphalt[,2], method="r2")
asphalt_allmodels_r2
## $which
##       1     2     3     4     5     6
## 1 FALSE FALSE FALSE  TRUE FALSE FALSE
## 1  TRUE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE  TRUE
## 1 FALSE FALSE FALSE FALSE  TRUE FALSE
## 1 FALSE  TRUE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE  TRUE FALSE FALSE FALSE
## 2 FALSE  TRUE FALSE  TRUE FALSE FALSE
## 2 FALSE FALSE  TRUE  TRUE FALSE FALSE
## 2 FALSE FALSE FALSE  TRUE FALSE  TRUE
## 2 FALSE FALSE FALSE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE FALSE FALSE
## 2  TRUE  TRUE FALSE FALSE FALSE FALSE
## 2  TRUE FALSE FALSE FALSE FALSE  TRUE
## 2  TRUE FALSE FALSE FALSE  TRUE FALSE
## 2  TRUE FALSE  TRUE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE  TRUE  TRUE
## 3 FALSE  TRUE  TRUE  TRUE FALSE FALSE
## 3 FALSE  TRUE FALSE  TRUE  TRUE FALSE
## 3 FALSE  TRUE FALSE  TRUE FALSE  TRUE
## 3 FALSE FALSE  TRUE  TRUE FALSE  TRUE
## 3  TRUE  TRUE FALSE  TRUE FALSE FALSE
## 3 FALSE FALSE FALSE  TRUE  TRUE  TRUE
## 3 FALSE FALSE  TRUE  TRUE  TRUE FALSE
## 3  TRUE FALSE  TRUE  TRUE FALSE FALSE
## 3  TRUE FALSE FALSE  TRUE FALSE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE FALSE
## 4 FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## 4 FALSE  TRUE  TRUE  TRUE FALSE  TRUE
## 4  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## 4 FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## 4  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## 4 FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## 4  TRUE  TRUE FALSE  TRUE FALSE  TRUE
## 4  TRUE FALSE  TRUE  TRUE FALSE  TRUE
## 4  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 4  TRUE FALSE  TRUE  TRUE  TRUE FALSE
## 5 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## 5  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## 5  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 5  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 
## $label
## [1] "(Intercept)" "1"           "2"           "3"           "4"          
## [6] "5"           "6"          
## 
## $size
##  [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6
## [39] 6 6 6 6 7
## 
## $r2
##  [1] 0.65755827 0.31095511 0.17970996 0.07244760 0.05871311 0.04418471
##  [7] 0.67988500 0.67054203 0.66837794 0.66422324 0.65895778 0.34851619
## [13] 0.34222182 0.33205132 0.32205656 0.28241387 0.70161258 0.69572645
## [19] 0.68288526 0.68193516 0.68110172 0.68035604 0.67189893 0.67096105
## [25] 0.66840245 0.66507011 0.70709701 0.70382967 0.70175221 0.70146515
## [31] 0.69616264 0.68625513 0.68320263 0.68215000 0.68062906 0.67225423
## [37] 0.71103759 0.70715638 0.70384365 0.70150575 0.68682460 0.41732348
## [43] 0.71125649
cbind(asphalt_allmodels_r2$which,Rsq=asphalt_allmodels_r2$r2)
##   1 2 3 4 5 6        Rsq
## 1 0 0 0 1 0 0 0.65755827
## 1 1 0 0 0 0 0 0.31095511
## 1 0 0 0 0 0 1 0.17970996
## 1 0 0 0 0 1 0 0.07244760
## 1 0 1 0 0 0 0 0.05871311
## 1 0 0 1 0 0 0 0.04418471
## 2 0 1 0 1 0 0 0.67988500
## 2 0 0 1 1 0 0 0.67054203
## 2 0 0 0 1 0 1 0.66837794
## 2 0 0 0 1 1 0 0.66422324
## 2 1 0 0 1 0 0 0.65895778
## 2 1 1 0 0 0 0 0.34851619
## 2 1 0 0 0 0 1 0.34222182
## 2 1 0 0 0 1 0 0.33205132
## 2 1 0 1 0 0 0 0.32205656
## 2 0 0 0 0 1 1 0.28241387
## 3 0 1 1 1 0 0 0.70161258
## 3 0 1 0 1 1 0 0.69572645
## 3 0 1 0 1 0 1 0.68288526
## 3 0 0 1 1 0 1 0.68193516
## 3 1 1 0 1 0 0 0.68110172
## 3 0 0 0 1 1 1 0.68035604
## 3 0 0 1 1 1 0 0.67189893
## 3 1 0 1 1 0 0 0.67096105
## 3 1 0 0 1 0 1 0.66840245
## 3 1 0 0 1 1 0 0.66507011
## 4 0 1 1 1 1 0 0.70709701
## 4 0 1 1 1 0 1 0.70382967
## 4 1 1 1 1 0 0 0.70175221
## 4 0 1 0 1 1 1 0.70146515
## 4 1 1 0 1 1 0 0.69616264
## 4 0 0 1 1 1 1 0.68625513
## 4 1 1 0 1 0 1 0.68320263
## 4 1 0 1 1 0 1 0.68215000
## 4 1 0 0 1 1 1 0.68062906
## 4 1 0 1 1 1 0 0.67225423
## 5 0 1 1 1 1 1 0.71103759
## 5 1 1 1 1 1 0 0.70715638
## 5 1 1 1 1 0 1 0.70384365
## 5 1 1 0 1 1 1 0.70150575
## 5 1 0 1 1 1 1 0.68682460
## 5 1 1 1 0 1 1 0.41732348
## 6 1 1 1 1 1 1 0.71125649
asphalt_allmodels_r2adj <- leaps(x=asphalt[,3:8],y=asphalt[,2], method="adjr2")
asphalt_allmodels_r2adj
## $which
##       1     2     3     4     5     6
## 1 FALSE FALSE FALSE  TRUE FALSE FALSE
## 1  TRUE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE  TRUE
## 1 FALSE FALSE FALSE FALSE  TRUE FALSE
## 1 FALSE  TRUE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE  TRUE FALSE FALSE FALSE
## 2 FALSE  TRUE FALSE  TRUE FALSE FALSE
## 2 FALSE FALSE  TRUE  TRUE FALSE FALSE
## 2 FALSE FALSE FALSE  TRUE FALSE  TRUE
## 2 FALSE FALSE FALSE  TRUE  TRUE FALSE
## 2  TRUE FALSE FALSE  TRUE FALSE FALSE
## 2  TRUE  TRUE FALSE FALSE FALSE FALSE
## 2  TRUE FALSE FALSE FALSE FALSE  TRUE
## 2  TRUE FALSE FALSE FALSE  TRUE FALSE
## 2  TRUE FALSE  TRUE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE  TRUE  TRUE
## 3 FALSE  TRUE  TRUE  TRUE FALSE FALSE
## 3 FALSE  TRUE FALSE  TRUE  TRUE FALSE
## 3 FALSE  TRUE FALSE  TRUE FALSE  TRUE
## 3 FALSE FALSE  TRUE  TRUE FALSE  TRUE
## 3  TRUE  TRUE FALSE  TRUE FALSE FALSE
## 3 FALSE FALSE FALSE  TRUE  TRUE  TRUE
## 3 FALSE FALSE  TRUE  TRUE  TRUE FALSE
## 3  TRUE FALSE  TRUE  TRUE FALSE FALSE
## 3  TRUE FALSE FALSE  TRUE FALSE  TRUE
## 3  TRUE FALSE FALSE  TRUE  TRUE FALSE
## 4 FALSE  TRUE  TRUE  TRUE  TRUE FALSE
## 4 FALSE  TRUE  TRUE  TRUE FALSE  TRUE
## 4  TRUE  TRUE  TRUE  TRUE FALSE FALSE
## 4 FALSE  TRUE FALSE  TRUE  TRUE  TRUE
## 4  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## 4 FALSE FALSE  TRUE  TRUE  TRUE  TRUE
## 4  TRUE  TRUE FALSE  TRUE FALSE  TRUE
## 4  TRUE FALSE  TRUE  TRUE FALSE  TRUE
## 4  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## 4  TRUE FALSE  TRUE  TRUE  TRUE FALSE
## 5 FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
## 5  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
## 5  TRUE  TRUE FALSE  TRUE  TRUE  TRUE
## 5  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
## 5  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
## 6  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## 
## $label
## [1] "(Intercept)" "1"           "2"           "3"           "4"          
## [6] "5"           "6"          
## 
## $size
##  [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6
## [39] 6 6 6 6 7
## 
## $adjr2
##  [1] 0.64574994 0.28719494 0.15142410 0.04046303 0.02625494 0.01122557
##  [7] 0.65701964 0.64700932 0.64469065 0.64023918 0.63459762 0.30198163
## [13] 0.29523766 0.28434070 0.27363203 0.23115772 0.66845842 0.66191828
## [19] 0.64765029 0.64659462 0.64566858 0.64484005 0.63544326 0.63440116
## [25] 0.63155828 0.62785567 0.66203501 0.65826501 0.65586793 0.65553671
## [31] 0.64941843 0.63798669 0.63446457 0.63325000 0.63149507 0.62183181
## [37] 0.65324511 0.64858766 0.64461237 0.64180690 0.62418952 0.30078818
## [43] 0.63907062
cbind(asphalt_allmodels_r2adj$which,adjRsq=asphalt_allmodels_r2adj$adjr2)
##   1 2 3 4 5 6     adjRsq
## 1 0 0 0 1 0 0 0.64574994
## 1 1 0 0 0 0 0 0.28719494
## 1 0 0 0 0 0 1 0.15142410
## 1 0 0 0 0 1 0 0.04046303
## 1 0 1 0 0 0 0 0.02625494
## 1 0 0 1 0 0 0 0.01122557
## 2 0 1 0 1 0 0 0.65701964
## 2 0 0 1 1 0 0 0.64700932
## 2 0 0 0 1 0 1 0.64469065
## 2 0 0 0 1 1 0 0.64023918
## 2 1 0 0 1 0 0 0.63459762
## 2 1 1 0 0 0 0 0.30198163
## 2 1 0 0 0 0 1 0.29523766
## 2 1 0 0 0 1 0 0.28434070
## 2 1 0 1 0 0 0 0.27363203
## 2 0 0 0 0 1 1 0.23115772
## 3 0 1 1 1 0 0 0.66845842
## 3 0 1 0 1 1 0 0.66191828
## 3 0 1 0 1 0 1 0.64765029
## 3 0 0 1 1 0 1 0.64659462
## 3 1 1 0 1 0 0 0.64566858
## 3 0 0 0 1 1 1 0.64484005
## 3 0 0 1 1 1 0 0.63544326
## 3 1 0 1 1 0 0 0.63440116
## 3 1 0 0 1 0 1 0.63155828
## 3 1 0 0 1 1 0 0.62785567
## 4 0 1 1 1 1 0 0.66203501
## 4 0 1 1 1 0 1 0.65826501
## 4 1 1 1 1 0 0 0.65586793
## 4 0 1 0 1 1 1 0.65553671
## 4 1 1 0 1 1 0 0.64941843
## 4 0 0 1 1 1 1 0.63798669
## 4 1 1 0 1 0 1 0.63446457
## 4 1 0 1 1 0 1 0.63325000
## 4 1 0 0 1 1 1 0.63149507
## 4 1 0 1 1 1 0 0.62183181
## 5 0 1 1 1 1 1 0.65324511
## 5 1 1 1 1 1 0 0.64858766
## 5 1 1 1 1 0 1 0.64461237
## 5 1 1 0 1 1 1 0.64180690
## 5 1 0 1 1 1 1 0.62418952
## 5 1 1 1 0 1 1 0.30078818
## 6 1 1 1 1 1 1 0.63907062

Here the largest adjusted R squared, 0.6685, belongs to the three-covariate model with x2, x3, and x4.

A third example:

n=1000
x1=rnorm(n)
x2=rnorm(n)
x3=rnorm(n)
x4=rnorm(n)
y=1+2*x1+3*x2+rnorm(n)
results=leaps( x=cbind(x1,x2,x3,x4), y=y, method="r2")
cbind(results$which, Rsq=results$r2)
##   1 2 3 4          Rsq
## 1 0 1 0 0 0.6097956074
## 1 1 0 0 0 0.2954778659
## 1 0 0 0 1 0.0012368128
## 1 0 0 1 0 0.0003920018
## 2 1 1 0 0 0.9261756045
## 2 0 1 1 0 0.6103199511
## 2 0 1 0 1 0.6098169039
## 2 1 0 0 1 0.2962985076
## 2 1 0 1 0 0.2955210672
## 2 0 0 1 1 0.0016020574
## 3 1 1 1 0 0.9261944546
## 3 1 1 0 1 0.9261827416
## 3 0 1 1 1 0.6103372699
## 3 1 0 1 1 0.2963492196
## 4 1 1 1 1 0.9262011552
results2=leaps( x=cbind(x1,x2,x3,x4), y=y, method="adjr2")
cbind(results2$which, adjRsq=results2$adjr2)
##   1 2 3 4        adjRsq
## 1 0 1 0 0  0.6094046210
## 1 1 0 0 0  0.2947719319
## 1 0 0 0 1  0.0002360481
## 1 0 0 1 0 -0.0006096094
## 2 1 1 0 0  0.9260275114
## 2 0 1 1 0  0.6095382459
## 2 0 1 0 1  0.6090341895
## 2 1 0 0 1  0.2948868697
## 2 1 0 1 0  0.2941078697
## 2 0 0 1 1 -0.0004007469
## 3 1 1 1 0  0.9259721487
## 3 1 1 0 1  0.9259604005
## 3 0 1 1 1  0.6091635869
## 3 1 0 1 1  0.2942297895
## 4 1 1 1 1  0.9259044764

3 Select by AIC and BIC

AIC: Akaike’s Information Criterion

\[ AIC = n \log\bigg(\frac{SSRes}{n}\bigg) + 2p\]

BIC: Bayesian Information Criterion

\[ BIC = n \log\bigg(\frac{SSRes}{n}\bigg) + p \log(n)\]
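As a sketch, both criteria can be computed from a fitted model and compared with R's built-in AIC() and BIC(). R's versions are based on the full Gaussian log-likelihood, so they differ from the formulas above by an additive constant that is the same for every model fit to the same data; the model rankings therefore agree.

```r
# AIC and BIC from the formulas above, for y ~ x1 + x2 on the hald data
fit <- lm(y ~ x1 + x2, data = hald)
n <- nrow(hald)
p <- length(coef(fit))            # parameters in the mean model
SSRes <- sum(resid(fit)^2)
n * log(SSRes / n) + 2 * p        # AIC (up to an additive constant)
n * log(SSRes / n) + p * log(n)   # BIC (up to an additive constant)
# built-in versions, shifted by the same constant for every model
AIC(fit)
BIC(fit)
```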

Example

hald_allmodels=regsubsets(y~x1+x2+x3+x4, data=hald, nbest=4)
summary(hald_allmodels)
## Subset selection object
## Call: regsubsets.formula(y ~ x1 + x2 + x3 + x4, data = hald, nbest = 4)
## 4 Variables  (and intercept)
##    Forced in Forced out
## x1     FALSE      FALSE
## x2     FALSE      FALSE
## x3     FALSE      FALSE
## x4     FALSE      FALSE
## 4 subsets of each size up to 4
## Selection Algorithm: exhaustive
##          x1  x2  x3  x4 
## 1  ( 1 ) " " " " " " "*"
## 1  ( 2 ) " " "*" " " " "
## 1  ( 3 ) "*" " " " " " "
## 1  ( 4 ) " " " " "*" " "
## 2  ( 1 ) "*" "*" " " " "
## 2  ( 2 ) "*" " " " " "*"
## 2  ( 3 ) " " " " "*" "*"
## 2  ( 4 ) " " "*" "*" " "
## 3  ( 1 ) "*" "*" " " "*"
## 3  ( 2 ) "*" "*" "*" " "
## 3  ( 3 ) "*" " " "*" "*"
## 3  ( 4 ) " " "*" "*" "*"
## 4  ( 1 ) "*" "*" "*" "*"
plot(hald_allmodels, scale="bic")

plot(hald_allmodels, scale="adjr2")
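The criterion values behind these plots can also be extracted directly; calling summary() on a regsubsets object returns, among others, the components adjr2 and bic:

```r
# criterion values for each candidate model from regsubsets
hald_models_sum <- summary(hald_allmodels)
cbind(hald_models_sum$which,
      adjr2 = hald_models_sum$adjr2,
      bic   = hald_models_sum$bic)
```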

Another example

asphalt_allmodels <- regsubsets(y~x1+x2+x3+x4+x5+x6,data=asphalt,nbest=3)
summary(asphalt_allmodels)
## Subset selection object
## Call: regsubsets.formula(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = asphalt, 
##     nbest = 3)
## 6 Variables  (and intercept)
##    Forced in Forced out
## x1     FALSE      FALSE
## x2     FALSE      FALSE
## x3     FALSE      FALSE
## x4     FALSE      FALSE
## x5     FALSE      FALSE
## x6     FALSE      FALSE
## 3 subsets of each size up to 6
## Selection Algorithm: exhaustive
##          x1  x2  x3  x4  x5  x6 
## 1  ( 1 ) " " " " " " "*" " " " "
## 1  ( 2 ) "*" " " " " " " " " " "
## 1  ( 3 ) " " " " " " " " " " "*"
## 2  ( 1 ) " " "*" " " "*" " " " "
## 2  ( 2 ) " " " " "*" "*" " " " "
## 2  ( 3 ) " " " " " " "*" " " "*"
## 3  ( 1 ) " " "*" "*" "*" " " " "
## 3  ( 2 ) " " "*" " " "*" "*" " "
## 3  ( 3 ) " " "*" " " "*" " " "*"
## 4  ( 1 ) " " "*" "*" "*" "*" " "
## 4  ( 2 ) " " "*" "*" "*" " " "*"
## 4  ( 3 ) "*" "*" "*" "*" " " " "
## 5  ( 1 ) " " "*" "*" "*" "*" "*"
## 5  ( 2 ) "*" "*" "*" "*" "*" " "
## 5  ( 3 ) "*" "*" "*" "*" " " "*"
## 6  ( 1 ) "*" "*" "*" "*" "*" "*"
plot(asphalt_allmodels)

A third example

n=1000
x1=rnorm(n)
x2=rnorm(n)
x3=rnorm(n)
x4=rnorm(n)
y=1+2*x1+3*x2+rnorm(n)
m1=lm(y~x1)
m2=lm(y~x1+x2)
m3=lm(y~x1+x2+x3)
m4=lm(y~x1+x2+x3+x4)
BIC(m1)
## [1] 5190.934
BIC(m2)
## [1] 2842.267
BIC(m3)
## [1] 2847.547
BIC(m4)
## [1] 2854.439
AIC(m1)
## [1] 5176.211
AIC(m2)
## [1] 2822.636
AIC(m3)
## [1] 2823.008
AIC(m4)
## [1] 2824.992

4 Stepwise regression

Stepwise regression methods include:

Forward Selection

Example Hald data set

# forward selection using F
add1(lm(y~1,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
## 
## Model:
## y ~ 1
##        Df Sum of Sq     RSS    AIC F value    Pr(>F)    
## <none>              2715.76 71.444                      
## x1      1   1450.08 1265.69 63.519 12.6025 0.0045520 ** 
## x2      1   1809.43  906.34 59.178 21.9606 0.0006648 ***
## x3      1    776.36 1939.40 69.067  4.4034 0.0597623 .  
## x4      1   1831.90  883.87 58.852 22.7985 0.0005762 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
## 
## Model:
## y ~ x4
##        Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## <none>              883.87 58.852                       
## x1      1    809.10  74.76 28.742 108.2239 1.105e-06 ***
## x2      1     14.99 868.88 60.629   0.1725    0.6867    
## x3      1    708.13 175.74 39.853  40.2946 8.375e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x1+x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
## 
## Model:
## y ~ x1 + x4
##        Df Sum of Sq    RSS    AIC F value  Pr(>F)  
## <none>              74.762 28.742                  
## x2      1    26.789 47.973 24.974  5.0259 0.05169 .
## x3      1    23.926 50.836 25.728  4.2358 0.06969 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

End of Example

Backward Elimination

Example Hald data set

# backward selection using F
drop1(lm(y~x1+x2+x3+x4,data=hald), test="F")
## Single term deletions
## 
## Model:
## y ~ x1 + x2 + x3 + x4
##        Df Sum of Sq    RSS    AIC F value  Pr(>F)  
## <none>              47.864 26.944                  
## x1      1   25.9509 73.815 30.576  4.3375 0.07082 .
## x2      1    2.9725 50.836 25.728  0.4968 0.50090  
## x3      1    0.1091 47.973 24.974  0.0182 0.89592  
## x4      1    0.2470 48.111 25.011  0.0413 0.84407  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lm(y~x1+x2+x4,data=hald), test="F")
## Single term deletions
## 
## Model:
## y ~ x1 + x2 + x4
##        Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## <none>               47.97 24.974                       
## x1      1    820.91 868.88 60.629 154.0076 5.781e-07 ***
## x2      1     26.79  74.76 28.742   5.0259   0.05169 .  
## x4      1      9.93  57.90 25.420   1.8633   0.20540    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lm(y~x1+x2,data=hald), test="F")
## Single term deletions
## 
## Model:
## y ~ x1 + x2
##        Df Sum of Sq     RSS    AIC F value    Pr(>F)    
## <none>                57.90 25.420                      
## x1      1    848.43  906.34 59.178  146.52 2.692e-07 ***
## x2      1   1207.78 1265.69 63.519  208.58 5.029e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

End of Example

Stepwise Regression

Stepwise Regression Methods

Example Hald data set

# stepwise selection using F
add1(lm(y~1,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
## 
## Model:
## y ~ 1
##        Df Sum of Sq     RSS    AIC F value    Pr(>F)    
## <none>              2715.76 71.444                      
## x1      1   1450.08 1265.69 63.519 12.6025 0.0045520 ** 
## x2      1   1809.43  906.34 59.178 21.9606 0.0006648 ***
## x3      1    776.36 1939.40 69.067  4.4034 0.0597623 .  
## x4      1   1831.90  883.87 58.852 22.7985 0.0005762 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
## 
## Model:
## y ~ x4
##        Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## <none>              883.87 58.852                       
## x1      1    809.10  74.76 28.742 108.2239 1.105e-06 ***
## x2      1     14.99 868.88 60.629   0.1725    0.6867    
## x3      1    708.13 175.74 39.853  40.2946 8.375e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x1+x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
## 
## Model:
## y ~ x1 + x4
##        Df Sum of Sq    RSS    AIC F value  Pr(>F)  
## <none>              74.762 28.742                  
## x2      1    26.789 47.973 24.974  5.0259 0.05169 .
## x3      1    23.926 50.836 25.728  4.2358 0.06969 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x1+x2+x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
## 
## Model:
## y ~ x1 + x2 + x4
##        Df Sum of Sq    RSS    AIC F value Pr(>F)
## <none>              47.973 24.974               
## x3      1   0.10909 47.864 26.944  0.0182 0.8959
drop1(lm(y~x1+x2+x4,data=hald), test="F")
## Single term deletions
## 
## Model:
## y ~ x1 + x2 + x4
##        Df Sum of Sq    RSS    AIC  F value    Pr(>F)    
## <none>               47.97 24.974                       
## x1      1    820.91 868.88 60.629 154.0076 5.781e-07 ***
## x2      1     26.79  74.76 28.742   5.0259   0.05169 .  
## x4      1      9.93  57.90 25.420   1.8633   0.20540    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lm(y~x1+x2,data=hald), test="F")
## Single term deletions
## 
## Model:
## y ~ x1 + x2
##        Df Sum of Sq     RSS    AIC F value    Pr(>F)    
## <none>                57.90 25.420                      
## x1      1    848.43  906.34 59.178  146.52 2.692e-07 ***
## x2      1   1207.78 1265.69 63.519  208.58 5.029e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Similarly, we could use AIC for stepwise regression.

# stepwise selection using AIC
null=lm(y~1, data=hald)
full=lm(y~., data=hald)
step(null, scope=list(lower=null, upper=full), direction="forward")
## Start:  AIC=71.44
## y ~ 1
## 
##        Df Sum of Sq     RSS    AIC
## + x4    1   1831.90  883.87 58.852
## + x2    1   1809.43  906.34 59.178
## + x1    1   1450.08 1265.69 63.519
## + x3    1    776.36 1939.40 69.067
## <none>              2715.76 71.444
## 
## Step:  AIC=58.85
## y ~ x4
## 
##        Df Sum of Sq    RSS    AIC
## + x1    1    809.10  74.76 28.742
## + x3    1    708.13 175.74 39.853
## <none>              883.87 58.852
## + x2    1     14.99 868.88 60.629
## 
## Step:  AIC=28.74
## y ~ x4 + x1
## 
##        Df Sum of Sq    RSS    AIC
## + x2    1    26.789 47.973 24.974
## + x3    1    23.926 50.836 25.728
## <none>              74.762 28.742
## 
## Step:  AIC=24.97
## y ~ x4 + x1 + x2
## 
##        Df Sum of Sq    RSS    AIC
## <none>              47.973 24.974
## + x3    1   0.10909 47.864 26.944
## 
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = hald)
## 
## Coefficients:
## (Intercept)           x4           x1           x2  
##     71.6483      -0.2365       1.4519       0.4161
step(full, scope=list(lower=null, upper=full), direction="backward")
## Start:  AIC=26.94
## y ~ x1 + x2 + x3 + x4
## 
##        Df Sum of Sq    RSS    AIC
## - x3    1    0.1091 47.973 24.974
## - x4    1    0.2470 48.111 25.011
## - x2    1    2.9725 50.836 25.728
## <none>              47.864 26.944
## - x1    1   25.9509 73.815 30.576
## 
## Step:  AIC=24.97
## y ~ x1 + x2 + x4
## 
##        Df Sum of Sq    RSS    AIC
## <none>               47.97 24.974
## - x4    1      9.93  57.90 25.420
## - x2    1     26.79  74.76 28.742
## - x1    1    820.91 868.88 60.629
## 
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = hald)
## 
## Coefficients:
## (Intercept)           x1           x2           x4  
##     71.6483       1.4519       0.4161      -0.2365
step(full, scope=list(lower=null, upper=full), direction="both")
## Start:  AIC=26.94
## y ~ x1 + x2 + x3 + x4
## 
##        Df Sum of Sq    RSS    AIC
## - x3    1    0.1091 47.973 24.974
## - x4    1    0.2470 48.111 25.011
## - x2    1    2.9725 50.836 25.728
## <none>              47.864 26.944
## - x1    1   25.9509 73.815 30.576
## 
## Step:  AIC=24.97
## y ~ x1 + x2 + x4
## 
##        Df Sum of Sq    RSS    AIC
## <none>               47.97 24.974
## - x4    1      9.93  57.90 25.420
## + x3    1      0.11  47.86 26.944
## - x2    1     26.79  74.76 28.742
## - x1    1    820.91 868.88 60.629
## 
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = hald)
## 
## Coefficients:
## (Intercept)           x1           x2           x4  
##     71.6483       1.4519       0.4161      -0.2365
step(null, scope=list(lower=null, upper=full), direction="both")
## Start:  AIC=71.44
## y ~ 1
## 
##        Df Sum of Sq     RSS    AIC
## + x4    1   1831.90  883.87 58.852
## + x2    1   1809.43  906.34 59.178
## + x1    1   1450.08 1265.69 63.519
## + x3    1    776.36 1939.40 69.067
## <none>              2715.76 71.444
## 
## Step:  AIC=58.85
## y ~ x4
## 
##        Df Sum of Sq     RSS    AIC
## + x1    1    809.10   74.76 28.742
## + x3    1    708.13  175.74 39.853
## <none>               883.87 58.852
## + x2    1     14.99  868.88 60.629
## - x4    1   1831.90 2715.76 71.444
## 
## Step:  AIC=28.74
## y ~ x4 + x1
## 
##        Df Sum of Sq     RSS    AIC
## + x2    1     26.79   47.97 24.974
## + x3    1     23.93   50.84 25.728
## <none>                74.76 28.742
## - x1    1    809.10  883.87 58.852
## - x4    1   1190.92 1265.69 63.519
## 
## Step:  AIC=24.97
## y ~ x4 + x1 + x2
## 
##        Df Sum of Sq    RSS    AIC
## <none>               47.97 24.974
## - x4    1      9.93  57.90 25.420
## + x3    1      0.11  47.86 26.944
## - x2    1     26.79  74.76 28.742
## - x1    1    820.91 868.88 60.629
## 
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = hald)
## 
## Coefficients:
## (Intercept)           x4           x1           x2  
##     71.6483      -0.2365       1.4519       0.4161

Similarly, we could use BIC for stepwise regression.

# stepwise selection using BIC
n=dim(hald)[1]
step(null, scope=list(lower=null, upper=full), k = log(n), direction="forward")
## Start:  AIC=72.01
## y ~ 1
## 
##        Df Sum of Sq     RSS    AIC
## + x4    1   1831.90  883.87 59.982
## + x2    1   1809.43  906.34 60.308
## + x1    1   1450.08 1265.69 64.649
## + x3    1    776.36 1939.40 70.197
## <none>              2715.76 72.009
## 
## Step:  AIC=59.98
## y ~ x4
## 
##        Df Sum of Sq    RSS    AIC
## + x1    1    809.10  74.76 30.437
## + x3    1    708.13 175.74 41.547
## <none>              883.87 59.982
## + x2    1     14.99 868.88 62.324
## 
## Step:  AIC=30.44
## y ~ x4 + x1
## 
##        Df Sum of Sq    RSS    AIC
## + x2    1    26.789 47.973 27.234
## + x3    1    23.926 50.836 27.987
## <none>              74.762 30.437
## 
## Step:  AIC=27.23
## y ~ x4 + x1 + x2
## 
##        Df Sum of Sq    RSS    AIC
## <none>              47.973 27.234
## + x3    1   0.10909 47.864 29.769
## 
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = hald)
## 
## Coefficients:
## (Intercept)           x4           x1           x2  
##     71.6483      -0.2365       1.4519       0.4161
step(full, scope=list(lower=null, upper=full), k = log(n), direction="backward")
## Start:  AIC=29.77
## y ~ x1 + x2 + x3 + x4
## 
##        Df Sum of Sq    RSS    AIC
## - x3    1    0.1091 47.973 27.234
## - x4    1    0.2470 48.111 27.271
## - x2    1    2.9725 50.836 27.987
## <none>              47.864 29.769
## - x1    1   25.9509 73.815 32.836
## 
## Step:  AIC=27.23
## y ~ x1 + x2 + x4
## 
##        Df Sum of Sq    RSS    AIC
## - x4    1      9.93  57.90 27.115
## <none>               47.97 27.234
## - x2    1     26.79  74.76 30.437
## - x1    1    820.91 868.88 62.324
## 
## Step:  AIC=27.11
## y ~ x1 + x2
## 
##        Df Sum of Sq     RSS    AIC
## <none>                57.90 27.115
## - x1    1    848.43  906.34 60.308
## - x2    1   1207.78 1265.69 64.649
## 
## Call:
## lm(formula = y ~ x1 + x2, data = hald)
## 
## Coefficients:
## (Intercept)           x1           x2  
##     52.5773       1.4683       0.6623
step(full, scope=list(lower=null, upper=full), k = log(n), direction="both")
## Start:  AIC=29.77
## y ~ x1 + x2 + x3 + x4
## 
##        Df Sum of Sq    RSS    AIC
## - x3    1    0.1091 47.973 27.234
## - x4    1    0.2470 48.111 27.271
## - x2    1    2.9725 50.836 27.987
## <none>              47.864 29.769
## - x1    1   25.9509 73.815 32.836
## 
## Step:  AIC=27.23
## y ~ x1 + x2 + x4
## 
##        Df Sum of Sq    RSS    AIC
## - x4    1      9.93  57.90 27.115
## <none>               47.97 27.234
## + x3    1      0.11  47.86 29.769
## - x2    1     26.79  74.76 30.437
## - x1    1    820.91 868.88 62.324
## 
## Step:  AIC=27.11
## y ~ x1 + x2
## 
##        Df Sum of Sq     RSS    AIC
## <none>                57.90 27.115
## + x4    1      9.93   47.97 27.234
## + x3    1      9.79   48.11 27.271
## - x1    1    848.43  906.34 60.308
## - x2    1   1207.78 1265.69 64.649
## 
## Call:
## lm(formula = y ~ x1 + x2, data = hald)
## 
## Coefficients:
## (Intercept)           x1           x2  
##     52.5773       1.4683       0.6623
step(null, scope=list(lower=null, upper=full), k = log(n), direction="both")
## Start:  AIC=72.01
## y ~ 1
## 
##        Df Sum of Sq     RSS    AIC
## + x4    1   1831.90  883.87 59.982
## + x2    1   1809.43  906.34 60.308
## + x1    1   1450.08 1265.69 64.649
## + x3    1    776.36 1939.40 70.197
## <none>              2715.76 72.009
## 
## Step:  AIC=59.98
## y ~ x4
## 
##        Df Sum of Sq     RSS    AIC
## + x1    1    809.10   74.76 30.437
## + x3    1    708.13  175.74 41.547
## <none>               883.87 59.982
## + x2    1     14.99  868.88 62.324
## - x4    1   1831.90 2715.76 72.009
## 
## Step:  AIC=30.44
## y ~ x4 + x1
## 
##        Df Sum of Sq     RSS    AIC
## + x2    1     26.79   47.97 27.234
## + x3    1     23.93   50.84 27.987
## <none>                74.76 30.437
## - x1    1    809.10  883.87 59.982
## - x4    1   1190.92 1265.69 64.649
## 
## Step:  AIC=27.23
## y ~ x4 + x1 + x2
## 
##        Df Sum of Sq    RSS    AIC
## - x4    1      9.93  57.90 27.115
## <none>               47.97 27.234
## + x3    1      0.11  47.86 29.769
## - x2    1     26.79  74.76 30.437
## - x1    1    820.91 868.88 62.324
## 
## Step:  AIC=27.11
## y ~ x1 + x2
## 
##        Df Sum of Sq     RSS    AIC
## <none>                57.90 27.115
## + x4    1      9.93   47.97 27.234
## + x3    1      9.79   48.11 27.271
## - x1    1    848.43  906.34 60.308
## - x2    1   1207.78 1265.69 64.649
## 
## Call:
## lm(formula = y ~ x1 + x2, data = hald)
## 
## Coefficients:
## (Intercept)           x1           x2  
##     52.5773       1.4683       0.6623
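Note that forward selection with BIC stops at y ~ x4 + x1 + x2, while backward elimination and both stepwise directions stop at y ~ x1 + x2: stepwise searches only explore neighbors of the current model, so different starting points can end at different models. Candidate finalists can always be compared head-to-head with BIC() (smaller is better). The sketch below uses the built-in mtcars data purely as a self-contained stand-in; with the cement data loaded, the same pattern applies, e.g. BIC(lm(y ~ x1 + x2, data = hald)).

```r
# compare two candidate models directly by BIC (smaller is better);
# mtcars is used here only as a built-in stand-in data set
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + disp, data = mtcars)
c(small = BIC(small), large = BIC(large))
# BIC(fit) is the same criterion as AIC with penalty k = log(n):
n <- nrow(mtcars)
all.equal(BIC(small), AIC(small, k = log(n)))  # TRUE
```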

5 Lasso regression

LASSO stands for Least Absolute Shrinkage and Selection Operator. Like ridge regression, it is a penalized variation of linear regression; lasso is especially useful when we have a large number of predictor variables.

Lasso regression selects variables and estimates coefficients simultaneously by applying L1 regularization. The L1 penalty is proportional to the sum of the absolute values of the regression coefficients, shrinking them toward zero. The lasso estimation problem is similar to that of ridge regression, as shown below.

Traditional least squares estimation: \[ \hat{\beta} = \arg\min \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\Big)^2 \]

Ridge regression estimation: \[ \hat{\beta}_\lambda = \arg\min \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \]

LASSO regression estimation: \[ \hat{\beta}_\lambda = \arg\min \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \]
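The absolute value in the penalty is exactly what turns the lasso into a variable selector. With a single standardized predictor (more generally, an orthonormal design), both penalized problems have closed-form solutions in terms of the OLS estimate:

\[ \hat{\beta}_j^{\,ridge} = \frac{\hat{\beta}_j^{\,OLS}}{1+\lambda}, \qquad \hat{\beta}_j^{\,lasso} = \operatorname{sign}\big(\hat{\beta}_j^{\,OLS}\big)\left(\big|\hat{\beta}_j^{\,OLS}\big| - \frac{\lambda}{2}\right)_{+} \]

Ridge shrinks every coefficient proportionally but never to exactly zero, whereas the lasso subtracts a constant threshold and sets any coefficient with \(|\hat{\beta}_j^{\,OLS}| \le \lambda/2\) exactly to zero; this soft-thresholding is why the lasso performs selection.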

The following diagram gives a visual interpretation comparing OLS and lasso regression:

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
lreg <- glmnet(x=as.matrix(hald[,-1]), y=hald[,1],alpha=1)
#rbind(lreg$lambda,lreg$beta)
matplot(lreg$lambda,t(lreg$beta),type="l",ylab="coefficient",xlab="lambda")
abline(h=0)

plot(lreg)
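Internally, glmnet fits the lasso by cyclical coordinate descent: it cycles through the coefficients, soft-thresholding each one against the partial residual from the others. Below is a minimal base-R sketch of that idea on simulated data; it is a teaching illustration only, not glmnet's actual implementation (glmnet scales the loss by 1/(2n), so its lambda is on a different scale from the penalty written above, and it adds intercept handling, warm starts, and other refinements).

```r
set.seed(1)
n <- 100; p <- 4
X <- scale(matrix(rnorm(n * p), n, p))       # standardized predictors
y <- drop(X %*% c(3, 0, 0, -2) + rnorm(n))   # sparse true coefficients
y <- y - mean(y)                             # centered response: no intercept

soft <- function(z, t) sign(z) * pmax(abs(z) - t, 0)  # soft-threshold operator

# cyclical coordinate descent for (1/(2n))||y - Xb||^2 + lambda * sum(|b|)
lasso_cd <- function(X, y, lambda, iters = 100) {
  b <- rep(0, ncol(X))
  for (it in seq_len(iters)) {
    for (j in seq_len(ncol(X))) {
      r <- y - X[, -j, drop = FALSE] %*% b[-j]       # partial residual
      b[j] <- soft(sum(X[, j] * r) / nrow(X), lambda) /
        (sum(X[, j]^2) / nrow(X))
    }
  }
  b
}

b <- lasso_cd(X, y, lambda = 1)
round(b, 3)
# the two truly-zero coefficients are estimated as exactly 0;
# the nonzero ones are shrunk toward 0
```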

We can apply cross-validation to find the best tuning parameter lambda:

cv_fit <- cv.glmnet(x=as.matrix(hald[,-1]), y=hald[,1],alpha=1,nfolds = 4)
plot(cv_fit)

opt_lambda <- cv_fit$lambda.min
opt_lambda
## [1] 0.07810171

We can also predict with Lasso regression:

#glmnet accepts the predictors only in matrix form
old_x <- as.matrix(hald[,-1])
y_predicted <- predict(lreg, s = opt_lambda, newx = old_x)

newx <- matrix(c(x1=15,x2=50,x3=6,x4=70),ncol=4)
predict(lreg, s = opt_lambda, newx = newx)
##             1
## [1,] 97.64065

And the in-sample MSE:

mse <- mean((hald[,1] - y_predicted)^2)
mse
## [1] 3.700099
#your turn: what's the in-sample MSE from the best model selected by R-squared?
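The pattern for computing an in-sample MSE from any fitted lm() is the same: average the squared residuals. A self-contained illustration on the built-in mtcars data (a stand-in, so these numbers are not the answer to the exercise above):

```r
# in-sample MSE of a fitted linear model = mean of squared residuals
fit <- lm(mpg ~ wt + hp, data = mtcars)
mse_in <- mean(residuals(fit)^2)
mse_in
# equivalently, compare observed with fitted values:
all.equal(mse_in, mean((mtcars$mpg - fitted(fit))^2))  # TRUE
```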

Another example: the asphalt data set. Try it!

add1(lm(y~1,data=asphalt[,-1]), y~x1+x2+x3+x4+x5+x6, test="F")
## Single term additions
## 
## Model:
## y ~ 1
##        Df Sum of Sq     RSS     AIC F value    Pr(>F)    
## <none>              1366.52 119.367                      
## x1      1    424.93  941.59 109.821 13.0872  0.001117 ** 
## x2      1     80.23 1286.29 119.491  1.8089  0.189067    
## x3      1     60.38 1306.14 119.966  1.3406  0.256377    
## x4      1    898.57  467.95  88.146 55.6859 3.181e-08 ***
## x5      1     99.00 1267.52 119.036  2.2651  0.143134    
## x6      1    245.58 1120.94 115.226  6.3533  0.017473 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x4,data=asphalt[,-1]), y~x1+x2+x3+x4+x5+x6, test="F")
## Single term additions
## 
## Model:
## y ~ x4
##        Df Sum of Sq    RSS    AIC F value Pr(>F)
## <none>              467.95 88.146               
## x1      1    1.9125 466.04 90.019  0.1149 0.7372
## x2      1   30.5099 437.44 88.056  1.9529 0.1733
## x3      1   17.7425 450.21 88.948  1.1035 0.3025
## x5      1    9.1078 458.85 89.536  0.5558 0.4622
## x6      1   14.7853 453.17 89.151  0.9135 0.3474
drop1(lm(y~x4,data=asphalt[,-1]), test="F")
## Single term deletions
## 
## Model:
## y ~ x4
##        Df Sum of Sq     RSS     AIC F value    Pr(>F)    
## <none>               467.95  88.146                      
## x4      1    898.57 1366.52 119.367  55.686 3.181e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# stepwise with AIC
null=lm(y~1, data=asphalt[,-1])
full=lm(y~., data=asphalt[,-1])
step(null, scope=list(lower=null, upper=full), direction="forward")
## Start:  AIC=119.37
## y ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + x4    1    898.57  467.95  88.146
## + x1    1    424.93  941.59 109.821
## + x6    1    245.58 1120.94 115.226
## + x5    1     99.00 1267.52 119.036
## <none>              1366.52 119.367
## + x2    1     80.23 1286.29 119.491
## + x3    1     60.38 1306.14 119.966
## 
## Step:  AIC=88.15
## y ~ x4
## 
##        Df Sum of Sq    RSS    AIC
## + x2    1   30.5099 437.44 88.056
## <none>              467.95 88.146
## + x3    1   17.7425 450.21 88.948
## + x6    1   14.7853 453.17 89.151
## + x5    1    9.1078 458.85 89.536
## + x1    1    1.9125 466.04 90.019
## 
## Step:  AIC=88.06
## y ~ x4 + x2
## 
##        Df Sum of Sq    RSS    AIC
## + x3    1   29.6911 407.75 87.877
## <none>              437.44 88.056
## + x5    1   21.6476 415.80 88.482
## + x6    1    4.0999 433.34 89.764
## + x1    1    1.6627 435.78 89.938
## 
## Step:  AIC=87.88
## y ~ x4 + x2 + x3
## 
##        Df Sum of Sq    RSS    AIC
## <none>              407.75 87.877
## + x5    1    7.4946 400.26 89.302
## + x6    1    3.0297 404.72 89.646
## + x1    1    0.1908 407.56 89.862
## 
## Call:
## lm(formula = y ~ x4 + x2 + x3, data = asphalt[, -1])
## 
## Coefficients:
## (Intercept)           x4           x2           x3  
##     -38.229      -10.253        4.534        5.800
step(full, scope=list(lower=null, upper=full), direction="backward")
## Start:  AIC=92.86
## y ~ x1 + x2 + x3 + x4 + x5 + x6
## 
##        Df Sum of Sq    RSS     AIC
## - x1    1      0.30 394.87  90.882
## - x6    1      5.60 400.18  91.295
## - x5    1     10.13 404.70  91.644
## - x3    1     13.32 407.90  91.888
## <none>              394.57  92.858
## - x2    1     33.39 427.96  93.376
## - x4    1    401.66 796.24 112.623
## 
## Step:  AIC=90.88
## y ~ x2 + x3 + x4 + x5 + x6
## 
##        Df Sum of Sq    RSS     AIC
## - x6    1      5.38 400.26  89.302
## - x5    1      9.85 404.72  89.646
## - x3    1     13.08 407.95  89.892
## <none>              394.87  90.882
## - x2    1     33.87 428.74  91.433
## - x4    1    540.02 934.89 115.600
## 
## Step:  AIC=89.3
## y ~ x2 + x3 + x4 + x5
## 
##        Df Sum of Sq     RSS     AIC
## - x5    1      7.49  407.75  87.877
## - x3    1     15.54  415.80  88.482
## <none>               400.26  89.302
## - x2    1     48.10  448.36  90.820
## - x4    1    709.20 1109.46 118.907
## 
## Step:  AIC=87.88
## y ~ x2 + x3 + x4
## 
##        Df Sum of Sq     RSS     AIC
## <none>               407.75  87.877
## - x3    1     29.69  437.44  88.056
## - x2    1     42.46  450.21  88.948
## - x4    1    786.16 1193.92 119.181
## 
## Call:
## lm(formula = y ~ x2 + x3 + x4, data = asphalt[, -1])
## 
## Coefficients:
## (Intercept)           x2           x3           x4  
##     -38.229        4.534        5.800      -10.253
step(full, scope=list(lower=null, upper=full), direction="both")
## Start:  AIC=92.86
## y ~ x1 + x2 + x3 + x4 + x5 + x6
## 
##        Df Sum of Sq    RSS     AIC
## - x1    1      0.30 394.87  90.882
## - x6    1      5.60 400.18  91.295
## - x5    1     10.13 404.70  91.644
## - x3    1     13.32 407.90  91.888
## <none>              394.57  92.858
## - x2    1     33.39 427.96  93.376
## - x4    1    401.66 796.24 112.623
## 
## Step:  AIC=90.88
## y ~ x2 + x3 + x4 + x5 + x6
## 
##        Df Sum of Sq    RSS     AIC
## - x6    1      5.38 400.26  89.302
## - x5    1      9.85 404.72  89.646
## - x3    1     13.08 407.95  89.892
## <none>              394.87  90.882
## - x2    1     33.87 428.74  91.433
## + x1    1      0.30 394.57  92.858
## - x4    1    540.02 934.89 115.600
## 
## Step:  AIC=89.3
## y ~ x2 + x3 + x4 + x5
## 
##        Df Sum of Sq     RSS     AIC
## - x5    1      7.49  407.75  87.877
## - x3    1     15.54  415.80  88.482
## <none>               400.26  89.302
## - x2    1     48.10  448.36  90.820
## + x6    1      5.38  394.87  90.882
## + x1    1      0.08  400.18  91.295
## - x4    1    709.20 1109.46 118.907
## 
## Step:  AIC=87.88
## y ~ x2 + x3 + x4
## 
##        Df Sum of Sq     RSS     AIC
## <none>               407.75  87.877
## - x3    1     29.69  437.44  88.056
## - x2    1     42.46  450.21  88.948
## + x5    1      7.49  400.26  89.302
## + x6    1      3.03  404.72  89.646
## + x1    1      0.19  407.56  89.862
## - x4    1    786.16 1193.92 119.181
## 
## Call:
## lm(formula = y ~ x2 + x3 + x4, data = asphalt[, -1])
## 
## Coefficients:
## (Intercept)           x2           x3           x4  
##     -38.229        4.534        5.800      -10.253
step(null, scope=list(lower=null, upper=full), direction="both")
## Start:  AIC=119.37
## y ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + x4    1    898.57  467.95  88.146
## + x1    1    424.93  941.59 109.821
## + x6    1    245.58 1120.94 115.226
## + x5    1     99.00 1267.52 119.036
## <none>              1366.52 119.367
## + x2    1     80.23 1286.29 119.491
## + x3    1     60.38 1306.14 119.966
## 
## Step:  AIC=88.15
## y ~ x4
## 
##        Df Sum of Sq     RSS     AIC
## + x2    1     30.51  437.44  88.056
## <none>               467.95  88.146
## + x3    1     17.74  450.21  88.948
## + x6    1     14.79  453.17  89.151
## + x5    1      9.11  458.85  89.536
## + x1    1      1.91  466.04  90.019
## - x4    1    898.57 1366.52 119.367
## 
## Step:  AIC=88.06
## y ~ x4 + x2
## 
##        Df Sum of Sq     RSS     AIC
## + x3    1     29.69  407.75  87.877
## <none>               437.44  88.056
## - x2    1     30.51  467.95  88.146
## + x5    1     21.65  415.80  88.482
## + x6    1      4.10  433.34  89.764
## + x1    1      1.66  435.78  89.938
## - x4    1    848.84 1286.29 119.491
## 
## Step:  AIC=87.88
## y ~ x4 + x2 + x3
## 
##        Df Sum of Sq     RSS     AIC
## <none>               407.75  87.877
## - x3    1     29.69  437.44  88.056
## - x2    1     42.46  450.21  88.948
## + x5    1      7.49  400.26  89.302
## + x6    1      3.03  404.72  89.646
## + x1    1      0.19  407.56  89.862
## - x4    1    786.16 1193.92 119.181
## 
## Call:
## lm(formula = y ~ x4 + x2 + x3, data = asphalt[, -1])
## 
## Coefficients:
## (Intercept)           x4           x2           x3  
##     -38.229      -10.253        4.534        5.800
# stepwise with BIC
n=dim(asphalt)[1]
step(null, scope=list(lower=null, upper=full), k = log(n), direction="forward")
## Start:  AIC=120.8
## y ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + x4    1    898.57  467.95  91.014
## + x1    1    424.93  941.59 112.689
## + x6    1    245.58 1120.94 118.094
## <none>              1366.52 120.801
## + x5    1     99.00 1267.52 121.904
## + x2    1     80.23 1286.29 122.359
## + x3    1     60.38 1306.14 122.834
## 
## Step:  AIC=91.01
## y ~ x4
## 
##        Df Sum of Sq    RSS    AIC
## <none>              467.95 91.014
## + x2    1   30.5099 437.44 92.358
## + x3    1   17.7425 450.21 93.250
## + x6    1   14.7853 453.17 93.452
## + x5    1    9.1078 458.85 93.838
## + x1    1    1.9125 466.04 94.321
## 
## Call:
## lm(formula = y ~ x4, data = asphalt[, -1])
## 
## Coefficients:
## (Intercept)           x4  
##       11.72       -10.77
step(full, scope=list(lower=null, upper=full), k = log(n), direction="backward")
## Start:  AIC=102.9
## y ~ x1 + x2 + x3 + x4 + x5 + x6
## 
##        Df Sum of Sq    RSS     AIC
## - x1    1      0.30 394.87  99.486
## - x6    1      5.60 400.18  99.899
## - x5    1     10.13 404.70 100.248
## - x3    1     13.32 407.90 100.492
## - x2    1     33.39 427.96 101.980
## <none>              394.57 102.896
## - x4    1    401.66 796.24 121.227
## 
## Step:  AIC=99.49
## y ~ x2 + x3 + x4 + x5 + x6
## 
##        Df Sum of Sq    RSS     AIC
## - x6    1      5.38 400.26  96.472
## - x5    1      9.85 404.72  96.816
## - x3    1     13.08 407.95  97.062
## - x2    1     33.87 428.74  98.603
## <none>              394.87  99.486
## - x4    1    540.02 934.89 122.770
## 
## Step:  AIC=96.47
## y ~ x2 + x3 + x4 + x5
## 
##        Df Sum of Sq     RSS     AIC
## - x5    1      7.49  407.75  93.613
## - x3    1     15.54  415.80  94.218
## <none>               400.26  96.472
## - x2    1     48.10  448.36  96.556
## - x4    1    709.20 1109.46 124.643
## 
## Step:  AIC=93.61
## y ~ x2 + x3 + x4
## 
##        Df Sum of Sq     RSS     AIC
## - x3    1     29.69  437.44  92.358
## - x2    1     42.46  450.21  93.250
## <none>               407.75  93.613
## - x4    1    786.16 1193.92 123.483
## 
## Step:  AIC=92.36
## y ~ x2 + x4
## 
##        Df Sum of Sq     RSS     AIC
## - x2    1     30.51  467.95  91.014
## <none>               437.44  92.358
## - x4    1    848.84 1286.29 122.359
## 
## Step:  AIC=91.01
## y ~ x4
## 
##        Df Sum of Sq     RSS     AIC
## <none>               467.95  91.014
## - x4    1    898.57 1366.52 120.801
## 
## Call:
## lm(formula = y ~ x4, data = asphalt[, -1])
## 
## Coefficients:
## (Intercept)           x4  
##       11.72       -10.77
step(full, scope=list(lower=null, upper=full), k = log(n), direction="both")
## Start:  AIC=102.9
## y ~ x1 + x2 + x3 + x4 + x5 + x6
## 
##        Df Sum of Sq    RSS     AIC
## - x1    1      0.30 394.87  99.486
## - x6    1      5.60 400.18  99.899
## - x5    1     10.13 404.70 100.248
## - x3    1     13.32 407.90 100.492
## - x2    1     33.39 427.96 101.980
## <none>              394.57 102.896
## - x4    1    401.66 796.24 121.227
## 
## Step:  AIC=99.49
## y ~ x2 + x3 + x4 + x5 + x6
## 
##        Df Sum of Sq    RSS     AIC
## - x6    1      5.38 400.26  96.472
## - x5    1      9.85 404.72  96.816
## - x3    1     13.08 407.95  97.062
## - x2    1     33.87 428.74  98.603
## <none>              394.87  99.486
## + x1    1      0.30 394.57 102.896
## - x4    1    540.02 934.89 122.770
## 
## Step:  AIC=96.47
## y ~ x2 + x3 + x4 + x5
## 
##        Df Sum of Sq     RSS     AIC
## - x5    1      7.49  407.75  93.613
## - x3    1     15.54  415.80  94.218
## <none>               400.26  96.472
## - x2    1     48.10  448.36  96.556
## + x6    1      5.38  394.87  99.486
## + x1    1      0.08  400.18  99.899
## - x4    1    709.20 1109.46 124.643
## 
## Step:  AIC=93.61
## y ~ x2 + x3 + x4
## 
##        Df Sum of Sq     RSS     AIC
## - x3    1     29.69  437.44  92.358
## - x2    1     42.46  450.21  93.250
## <none>               407.75  93.613
## + x5    1      7.49  400.26  96.472
## + x6    1      3.03  404.72  96.816
## + x1    1      0.19  407.56  97.032
## - x4    1    786.16 1193.92 123.483
## 
## Step:  AIC=92.36
## y ~ x2 + x4
## 
##        Df Sum of Sq     RSS     AIC
## - x2    1     30.51  467.95  91.014
## <none>               437.44  92.358
## + x3    1     29.69  407.75  93.613
## + x5    1     21.65  415.80  94.218
## + x6    1      4.10  433.34  95.500
## + x1    1      1.66  435.78  95.674
## - x4    1    848.84 1286.29 122.359
## 
## Step:  AIC=91.01
## y ~ x4
## 
##        Df Sum of Sq     RSS     AIC
## <none>               467.95  91.014
## + x2    1     30.51  437.44  92.358
## + x3    1     17.74  450.21  93.250
## + x6    1     14.79  453.17  93.452
## + x5    1      9.11  458.85  93.838
## + x1    1      1.91  466.04  94.321
## - x4    1    898.57 1366.52 120.801
## 
## Call:
## lm(formula = y ~ x4, data = asphalt[, -1])
## 
## Coefficients:
## (Intercept)           x4  
##       11.72       -10.77
step(null, scope=list(lower=null, upper=full), k = log(n), direction="both")
## Start:  AIC=120.8
## y ~ 1
## 
##        Df Sum of Sq     RSS     AIC
## + x4    1    898.57  467.95  91.014
## + x1    1    424.93  941.59 112.689
## + x6    1    245.58 1120.94 118.094
## <none>              1366.52 120.801
## + x5    1     99.00 1267.52 121.904
## + x2    1     80.23 1286.29 122.359
## + x3    1     60.38 1306.14 122.834
## 
## Step:  AIC=91.01
## y ~ x4
## 
##        Df Sum of Sq     RSS     AIC
## <none>               467.95  91.014
## + x2    1     30.51  437.44  92.358
## + x3    1     17.74  450.21  93.250
## + x6    1     14.79  453.17  93.452
## + x5    1      9.11  458.85  93.838
## + x1    1      1.91  466.04  94.321
## - x4    1    898.57 1366.52 120.801
## 
## Call:
## lm(formula = y ~ x4, data = asphalt[, -1])
## 
## Coefficients:
## (Intercept)           x4  
##       11.72       -10.77
# then lasso?

# plot lasso?

# mse?

End of Example