Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This material was initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.
Let us start with a data set that concerns the hardening of cement. In particular, the researchers were interested in learning how the composition of the cement affected the heat evolved during the hardening of the cement. Therefore, they measured and recorded the following data on 13 batches of cement:
Response y: heat evolved in calories during hardening of cement on a per gram basis
Predictor x1: % of tricalcium aluminate
Predictor x2: % of tricalcium silicate
Predictor x3: % of tetracalcium alumino ferrite
Predictor x4: % of dicalcium silicate
rm(list=ls())
hald <- read.csv("data_HaldCement.csv",h=TRUE)
head(hald)
## y x1 x2 x3 x4
## 1 78.5 7 26 6 60
## 2 74.3 1 29 15 52
## 3 104.3 11 56 8 20
## 4 87.6 11 31 8 47
## 5 95.9 7 52 6 33
## 6 109.2 11 55 9 22
pairs(hald,pch=20)
For this data set, if you want to build a model to predict y, which variables do you want to include in the regression?
summary(lm(y~x1,data=hald))
##
## Call:
## lm(formula = y ~ x1, data = hald)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.061 -9.048 1.339 7.883 15.614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.4793 4.9273 16.54 4.07e-09 ***
## x1 1.8687 0.5264 3.55 0.00455 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.73 on 11 degrees of freedom
## Multiple R-squared: 0.5339, Adjusted R-squared: 0.4916
## F-statistic: 12.6 on 1 and 11 DF, p-value: 0.004552
summary(lm(y~x2,data=hald))
##
## Call:
## lm(formula = y ~ x2, data = hald)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.752 -6.008 -1.684 3.794 21.387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.4237 8.4906 6.763 3.1e-05 ***
## x2 0.7891 0.1684 4.686 0.000665 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.077 on 11 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6359
## F-statistic: 21.96 on 1 and 11 DF, p-value: 0.0006648
summary(lm(y~x3,data=hald))
##
## Call:
## lm(formula = y ~ x3, data = hald)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.168 -10.075 4.144 10.299 14.399
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 110.2027 7.9478 13.866 2.6e-08 ***
## x3 -1.2558 0.5984 -2.098 0.0598 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.28 on 11 degrees of freedom
## Multiple R-squared: 0.2859, Adjusted R-squared: 0.221
## F-statistic: 4.403 on 1 and 11 DF, p-value: 0.05976
summary(lm(y~x4,data=hald))
##
## Call:
## lm(formula = y ~ x4, data = hald)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.589 -8.228 1.495 4.726 17.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.5679 5.2622 22.342 1.62e-10 ***
## x4 -0.7382 0.1546 -4.775 0.000576 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.964 on 11 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.645
## F-statistic: 22.8 on 1 and 11 DF, p-value: 0.0005762
summary(lm(y~x1+x2+x3+x4,data=hald))
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = hald)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.1750 -1.6709 0.2508 1.3783 3.9254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.4054 70.0710 0.891 0.3991
## x1 1.5511 0.7448 2.083 0.0708 .
## x2 0.5102 0.7238 0.705 0.5009
## x3 0.1019 0.7547 0.135 0.8959
## x4 -0.1441 0.7091 -0.203 0.8441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
## F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
Obviously, a model with only one covariate is too weak for prediction, yet in the model with all covariates included, every coefficient has an insignificant p-value. So what is the best model?
Model-Building Problem
“Conflicting” goals in regression model building:
Want as many regressors as possible so that the “information content” in the variables will influence the fitted values
Want as few regressors as necessary because the variance of parameter estimates and fitted values will increase as the number of regressors increases and more regressors can cost more money in data collection/model maintenance
Need to find a compromise that leads to the best regression equation
Notes on Variable Selection Techniques
None of the variable selection techniques can guarantee the best regression equation for the dataset of interest
The techniques may very well give different results
Complete reliance on the algorithm for results is to be avoided
Other valuable information such as experience with and knowledge of the data and problem should be utilized
Consequences of Deleting Variables
Improves the precision of the parameter estimates of retained variables
Reduces the variance of the predicted response
Can induce bias into the estimates of coefficients and variance of predicted response
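To make these consequences concrete, here is a minimal simulation sketch (not from the original notes; the coefficients and correlation structure are made up for illustration): deleting a relevant regressor x2 that is correlated with x1 shrinks the standard error of the retained coefficient but biases its estimate.
# Simulate two correlated covariates, both of which affect y
set.seed(1)
n <- 50
x1 <- rnorm(n)
x2 <- 0.8*x1 + rnorm(n, sd=0.6)   # x2 is correlated with x1
y <- 1 + 2*x1 + 1.5*x2 + rnorm(n)
full <- lm(y ~ x1 + x2)           # correctly specified model
reduced <- lm(y ~ x1)             # x2 deleted
summary(full)$coefficients["x1",]     # estimate near 2, larger std. error
summary(reduced)$coefficients["x1",]  # smaller std. error, but biased
# omitted-variable bias: roughly 2 + 1.5*0.8 = 3.2 on average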
What if some of the covariates need transformation? The sections below lay out a strategy for regression model building.
All Possible Regressions
Assume the intercept term is in all equations
If there are \(k\) regressors, we would investigate \(2^k\) possible regression equations
Use the criteria discussed to determine some candidate models
Complete regression analysis on candidate models
Criteria for variable selection
Adjusted R squared (or Prediction R squared or PRESS)
AIC or BIC
Stepwise regression
The coefficient of determination \(R^2\) is defined as
\[R^2= 1 - \frac{SSRes}{SST}\]
Large values of \(R^2\) are preferred; however, adding covariates can only increase this value.
It is therefore not a good criterion for variable selection on its own.
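A quick sketch of this point on the Hald data (the noise variable is made up): adding a pure-noise covariate can only increase \(R^2\).
noise <- rnorm(nrow(hald))
summary(lm(y ~ x1, data=hald))$r.squared          # 0.5339, as above
summary(lm(y ~ x1 + noise, data=hald))$r.squared  # slightly larger, despite noise being irrelevant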
For a model with \(k\) covariates, the adjusted R squared is
\[R^2_{adj}= 1- \frac{n-1}{n-k-1}\bigg(\frac{SSRes}{SST}\bigg)=1 - \frac{n-1}{n-k-1}(1-R^2)\]
This value will not necessarily increase as additional terms are introduced into the model
Large values of adjusted \(R^2\) are preferred
Adjusted \(R^2\) is therefore a good criterion for variable selection.
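As a check, here is a sketch that computes adjusted \(R^2\) by hand for the full Hald model, along with the PRESS statistic mentioned above (computed from the leverages, so no explicit refitting is needed):
fit <- lm(y ~ x1 + x2 + x3 + x4, data=hald)
n <- nrow(hald); k <- 4
SSRes <- sum(resid(fit)^2)
SST <- sum((hald$y - mean(hald$y))^2)
1 - (n-1)/(n-k-1)*(SSRes/SST)   # matches summary(fit)$adj.r.squared
PRESS <- sum((resid(fit)/(1 - hatvalues(fit)))^2)
1 - PRESS/SST                   # prediction R squared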
Example
library(leaps)
hald_allmodels_r2 <- leaps( x=hald[,-1], y=hald[,1], method="r2")
hald_allmodels_r2
## $which
## 1 2 3 4
## 1 FALSE FALSE FALSE TRUE
## 1 FALSE TRUE FALSE FALSE
## 1 TRUE FALSE FALSE FALSE
## 1 FALSE FALSE TRUE FALSE
## 2 TRUE TRUE FALSE FALSE
## 2 TRUE FALSE FALSE TRUE
## 2 FALSE FALSE TRUE TRUE
## 2 FALSE TRUE TRUE FALSE
## 2 FALSE TRUE FALSE TRUE
## 2 TRUE FALSE TRUE FALSE
## 3 TRUE TRUE FALSE TRUE
## 3 TRUE TRUE TRUE FALSE
## 3 TRUE FALSE TRUE TRUE
## 3 FALSE TRUE TRUE TRUE
## 4 TRUE TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3" "4"
##
## $size
## [1] 2 2 2 2 3 3 3 3 3 3 4 4 4 4 5
##
## $r2
## [1] 0.6745420 0.6662683 0.5339480 0.2858727 0.9786784 0.9724710 0.9352896
## [8] 0.8470254 0.6800604 0.5481667 0.9823355 0.9822847 0.9812811 0.9728200
## [15] 0.9823756
cbind(hald_allmodels_r2$which, Rsq = hald_allmodels_r2$r2)
## 1 2 3 4 Rsq
## 1 0 0 0 1 0.6745420
## 1 0 1 0 0 0.6662683
## 1 1 0 0 0 0.5339480
## 1 0 0 1 0 0.2858727
## 2 1 1 0 0 0.9786784
## 2 1 0 0 1 0.9724710
## 2 0 0 1 1 0.9352896
## 2 0 1 1 0 0.8470254
## 2 0 1 0 1 0.6800604
## 2 1 0 1 0 0.5481667
## 3 1 1 0 1 0.9823355
## 3 1 1 1 0 0.9822847
## 3 1 0 1 1 0.9812811
## 3 0 1 1 1 0.9728200
## 4 1 1 1 1 0.9823756
hald_allmodels_r2adj <- leaps( x=hald[,-1], y=hald[,1], method="adjr2")
hald_allmodels_r2adj
## $which
## 1 2 3 4
## 1 FALSE FALSE FALSE TRUE
## 1 FALSE TRUE FALSE FALSE
## 1 TRUE FALSE FALSE FALSE
## 1 FALSE FALSE TRUE FALSE
## 2 TRUE TRUE FALSE FALSE
## 2 TRUE FALSE FALSE TRUE
## 2 FALSE FALSE TRUE TRUE
## 2 FALSE TRUE TRUE FALSE
## 2 FALSE TRUE FALSE TRUE
## 2 TRUE FALSE TRUE FALSE
## 3 TRUE TRUE FALSE TRUE
## 3 TRUE TRUE TRUE FALSE
## 3 TRUE FALSE TRUE TRUE
## 3 FALSE TRUE TRUE TRUE
## 4 TRUE TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3" "4"
##
## $size
## [1] 2 2 2 2 3 3 3 3 3 3 4 4 4 4 5
##
## $adjr2
## [1] 0.6449549 0.6359290 0.4915797 0.2209521 0.9744140 0.9669653 0.9223476
## [8] 0.8164305 0.6160725 0.4578001 0.9764473 0.9763796 0.9750415 0.9637599
## [15] 0.9735634
cbind(hald_allmodels_r2adj$which, adjRsq = hald_allmodels_r2adj$adjr2)
## 1 2 3 4 adjRsq
## 1 0 0 0 1 0.6449549
## 1 0 1 0 0 0.6359290
## 1 1 0 0 0 0.4915797
## 1 0 0 1 0 0.2209521
## 2 1 1 0 0 0.9744140
## 2 1 0 0 1 0.9669653
## 2 0 0 1 1 0.9223476
## 2 0 1 1 0 0.8164305
## 2 0 1 0 1 0.6160725
## 2 1 0 1 0 0.4578001
## 3 1 1 0 1 0.9764473
## 3 1 1 1 0 0.9763796
## 3 1 0 1 1 0.9750415
## 3 0 1 1 1 0.9637599
## 4 1 1 1 1 0.9735634
According to adjusted R squared, the model in row "3 1 1 0 1" (adjusted \(R^2 = 0.9764\)) is the best. That is, x1, x2, and x4 should be included in the regression.
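Following the advice above, we would then complete a regression analysis on this candidate model (output omitted here):
summary(lm(y ~ x1 + x2 + x4, data=hald))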
Another example:
asphalt <- read.csv("data_Asphalt.csv",header=TRUE)
head(asphalt)
## Observation y x1 x2 x3 x4 x5 x6
## 1 1 6.75 2.8 4.68 4.87 0 8.4 4.916
## 2 2 13.00 1.4 5.19 4.50 0 6.5 4.563
## 3 3 14.75 1.4 4.82 4.73 0 7.9 5.321
## 4 4 12.60 3.3 4.85 4.76 0 8.3 4.865
## 5 5 8.25 1.7 4.86 4.95 0 8.4 3.776
## 6 6 10.67 2.9 5.16 4.45 0 7.4 4.397
pairs(asphalt,pch=20)
asphalt_allmodels_r2 <- leaps(x=asphalt[,3:8],y=asphalt[,2], method="r2")
asphalt_allmodels_r2
## $which
## 1 2 3 4 5 6
## 1 FALSE FALSE FALSE TRUE FALSE FALSE
## 1 TRUE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE TRUE
## 1 FALSE FALSE FALSE FALSE TRUE FALSE
## 1 FALSE TRUE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE TRUE FALSE FALSE FALSE
## 2 FALSE TRUE FALSE TRUE FALSE FALSE
## 2 FALSE FALSE TRUE TRUE FALSE FALSE
## 2 FALSE FALSE FALSE TRUE FALSE TRUE
## 2 FALSE FALSE FALSE TRUE TRUE FALSE
## 2 TRUE FALSE FALSE TRUE FALSE FALSE
## 2 TRUE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE FALSE TRUE
## 2 TRUE FALSE FALSE FALSE TRUE FALSE
## 2 TRUE FALSE TRUE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE TRUE TRUE
## 3 FALSE TRUE TRUE TRUE FALSE FALSE
## 3 FALSE TRUE FALSE TRUE TRUE FALSE
## 3 FALSE TRUE FALSE TRUE FALSE TRUE
## 3 FALSE FALSE TRUE TRUE FALSE TRUE
## 3 TRUE TRUE FALSE TRUE FALSE FALSE
## 3 FALSE FALSE FALSE TRUE TRUE TRUE
## 3 FALSE FALSE TRUE TRUE TRUE FALSE
## 3 TRUE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE FALSE FALSE TRUE FALSE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE FALSE
## 4 FALSE TRUE TRUE TRUE TRUE FALSE
## 4 FALSE TRUE TRUE TRUE FALSE TRUE
## 4 TRUE TRUE TRUE TRUE FALSE FALSE
## 4 FALSE TRUE FALSE TRUE TRUE TRUE
## 4 TRUE TRUE FALSE TRUE TRUE FALSE
## 4 FALSE FALSE TRUE TRUE TRUE TRUE
## 4 TRUE TRUE FALSE TRUE FALSE TRUE
## 4 TRUE FALSE TRUE TRUE FALSE TRUE
## 4 TRUE FALSE FALSE TRUE TRUE TRUE
## 4 TRUE FALSE TRUE TRUE TRUE FALSE
## 5 FALSE TRUE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE TRUE FALSE
## 5 TRUE TRUE TRUE TRUE FALSE TRUE
## 5 TRUE TRUE FALSE TRUE TRUE TRUE
## 5 TRUE FALSE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE FALSE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3" "4"
## [6] "5" "6"
##
## $size
## [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6
## [39] 6 6 6 6 7
##
## $r2
## [1] 0.65755827 0.31095511 0.17970996 0.07244760 0.05871311 0.04418471
## [7] 0.67988500 0.67054203 0.66837794 0.66422324 0.65895778 0.34851619
## [13] 0.34222182 0.33205132 0.32205656 0.28241387 0.70161258 0.69572645
## [19] 0.68288526 0.68193516 0.68110172 0.68035604 0.67189893 0.67096105
## [25] 0.66840245 0.66507011 0.70709701 0.70382967 0.70175221 0.70146515
## [31] 0.69616264 0.68625513 0.68320263 0.68215000 0.68062906 0.67225423
## [37] 0.71103759 0.70715638 0.70384365 0.70150575 0.68682460 0.41732348
## [43] 0.71125649
cbind(asphalt_allmodels_r2$which,Rsq=asphalt_allmodels_r2$r2)
## 1 2 3 4 5 6 Rsq
## 1 0 0 0 1 0 0 0.65755827
## 1 1 0 0 0 0 0 0.31095511
## 1 0 0 0 0 0 1 0.17970996
## 1 0 0 0 0 1 0 0.07244760
## 1 0 1 0 0 0 0 0.05871311
## 1 0 0 1 0 0 0 0.04418471
## 2 0 1 0 1 0 0 0.67988500
## 2 0 0 1 1 0 0 0.67054203
## 2 0 0 0 1 0 1 0.66837794
## 2 0 0 0 1 1 0 0.66422324
## 2 1 0 0 1 0 0 0.65895778
## 2 1 1 0 0 0 0 0.34851619
## 2 1 0 0 0 0 1 0.34222182
## 2 1 0 0 0 1 0 0.33205132
## 2 1 0 1 0 0 0 0.32205656
## 2 0 0 0 0 1 1 0.28241387
## 3 0 1 1 1 0 0 0.70161258
## 3 0 1 0 1 1 0 0.69572645
## 3 0 1 0 1 0 1 0.68288526
## 3 0 0 1 1 0 1 0.68193516
## 3 1 1 0 1 0 0 0.68110172
## 3 0 0 0 1 1 1 0.68035604
## 3 0 0 1 1 1 0 0.67189893
## 3 1 0 1 1 0 0 0.67096105
## 3 1 0 0 1 0 1 0.66840245
## 3 1 0 0 1 1 0 0.66507011
## 4 0 1 1 1 1 0 0.70709701
## 4 0 1 1 1 0 1 0.70382967
## 4 1 1 1 1 0 0 0.70175221
## 4 0 1 0 1 1 1 0.70146515
## 4 1 1 0 1 1 0 0.69616264
## 4 0 0 1 1 1 1 0.68625513
## 4 1 1 0 1 0 1 0.68320263
## 4 1 0 1 1 0 1 0.68215000
## 4 1 0 0 1 1 1 0.68062906
## 4 1 0 1 1 1 0 0.67225423
## 5 0 1 1 1 1 1 0.71103759
## 5 1 1 1 1 1 0 0.70715638
## 5 1 1 1 1 0 1 0.70384365
## 5 1 1 0 1 1 1 0.70150575
## 5 1 0 1 1 1 1 0.68682460
## 5 1 1 1 0 1 1 0.41732348
## 6 1 1 1 1 1 1 0.71125649
asphalt_allmodels_r2adj <- leaps(x=asphalt[,3:8],y=asphalt[,2], method="adjr2")
asphalt_allmodels_r2adj
## $which
## 1 2 3 4 5 6
## 1 FALSE FALSE FALSE TRUE FALSE FALSE
## 1 TRUE FALSE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE FALSE FALSE FALSE TRUE
## 1 FALSE FALSE FALSE FALSE TRUE FALSE
## 1 FALSE TRUE FALSE FALSE FALSE FALSE
## 1 FALSE FALSE TRUE FALSE FALSE FALSE
## 2 FALSE TRUE FALSE TRUE FALSE FALSE
## 2 FALSE FALSE TRUE TRUE FALSE FALSE
## 2 FALSE FALSE FALSE TRUE FALSE TRUE
## 2 FALSE FALSE FALSE TRUE TRUE FALSE
## 2 TRUE FALSE FALSE TRUE FALSE FALSE
## 2 TRUE TRUE FALSE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE FALSE TRUE
## 2 TRUE FALSE FALSE FALSE TRUE FALSE
## 2 TRUE FALSE TRUE FALSE FALSE FALSE
## 2 FALSE FALSE FALSE FALSE TRUE TRUE
## 3 FALSE TRUE TRUE TRUE FALSE FALSE
## 3 FALSE TRUE FALSE TRUE TRUE FALSE
## 3 FALSE TRUE FALSE TRUE FALSE TRUE
## 3 FALSE FALSE TRUE TRUE FALSE TRUE
## 3 TRUE TRUE FALSE TRUE FALSE FALSE
## 3 FALSE FALSE FALSE TRUE TRUE TRUE
## 3 FALSE FALSE TRUE TRUE TRUE FALSE
## 3 TRUE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE FALSE FALSE TRUE FALSE TRUE
## 3 TRUE FALSE FALSE TRUE TRUE FALSE
## 4 FALSE TRUE TRUE TRUE TRUE FALSE
## 4 FALSE TRUE TRUE TRUE FALSE TRUE
## 4 TRUE TRUE TRUE TRUE FALSE FALSE
## 4 FALSE TRUE FALSE TRUE TRUE TRUE
## 4 TRUE TRUE FALSE TRUE TRUE FALSE
## 4 FALSE FALSE TRUE TRUE TRUE TRUE
## 4 TRUE TRUE FALSE TRUE FALSE TRUE
## 4 TRUE FALSE TRUE TRUE FALSE TRUE
## 4 TRUE FALSE FALSE TRUE TRUE TRUE
## 4 TRUE FALSE TRUE TRUE TRUE FALSE
## 5 FALSE TRUE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE TRUE TRUE FALSE
## 5 TRUE TRUE TRUE TRUE FALSE TRUE
## 5 TRUE TRUE FALSE TRUE TRUE TRUE
## 5 TRUE FALSE TRUE TRUE TRUE TRUE
## 5 TRUE TRUE TRUE FALSE TRUE TRUE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE
##
## $label
## [1] "(Intercept)" "1" "2" "3" "4"
## [6] "5" "6"
##
## $size
## [1] 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 4 4 5 5 5 5 5 5 5 5 5 5 6 6
## [39] 6 6 6 6 7
##
## $adjr2
## [1] 0.64574994 0.28719494 0.15142410 0.04046303 0.02625494 0.01122557
## [7] 0.65701964 0.64700932 0.64469065 0.64023918 0.63459762 0.30198163
## [13] 0.29523766 0.28434070 0.27363203 0.23115772 0.66845842 0.66191828
## [19] 0.64765029 0.64659462 0.64566858 0.64484005 0.63544326 0.63440116
## [25] 0.63155828 0.62785567 0.66203501 0.65826501 0.65586793 0.65553671
## [31] 0.64941843 0.63798669 0.63446457 0.63325000 0.63149507 0.62183181
## [37] 0.65324511 0.64858766 0.64461237 0.64180690 0.62418952 0.30078818
## [43] 0.63907062
cbind(asphalt_allmodels_r2adj$which,adjRsq=asphalt_allmodels_r2adj$adjr2)
## 1 2 3 4 5 6 adjRsq
## 1 0 0 0 1 0 0 0.64574994
## 1 1 0 0 0 0 0 0.28719494
## 1 0 0 0 0 0 1 0.15142410
## 1 0 0 0 0 1 0 0.04046303
## 1 0 1 0 0 0 0 0.02625494
## 1 0 0 1 0 0 0 0.01122557
## 2 0 1 0 1 0 0 0.65701964
## 2 0 0 1 1 0 0 0.64700932
## 2 0 0 0 1 0 1 0.64469065
## 2 0 0 0 1 1 0 0.64023918
## 2 1 0 0 1 0 0 0.63459762
## 2 1 1 0 0 0 0 0.30198163
## 2 1 0 0 0 0 1 0.29523766
## 2 1 0 0 0 1 0 0.28434070
## 2 1 0 1 0 0 0 0.27363203
## 2 0 0 0 0 1 1 0.23115772
## 3 0 1 1 1 0 0 0.66845842
## 3 0 1 0 1 1 0 0.66191828
## 3 0 1 0 1 0 1 0.64765029
## 3 0 0 1 1 0 1 0.64659462
## 3 1 1 0 1 0 0 0.64566858
## 3 0 0 0 1 1 1 0.64484005
## 3 0 0 1 1 1 0 0.63544326
## 3 1 0 1 1 0 0 0.63440116
## 3 1 0 0 1 0 1 0.63155828
## 3 1 0 0 1 1 0 0.62785567
## 4 0 1 1 1 1 0 0.66203501
## 4 0 1 1 1 0 1 0.65826501
## 4 1 1 1 1 0 0 0.65586793
## 4 0 1 0 1 1 1 0.65553671
## 4 1 1 0 1 1 0 0.64941843
## 4 0 0 1 1 1 1 0.63798669
## 4 1 1 0 1 0 1 0.63446457
## 4 1 0 1 1 0 1 0.63325000
## 4 1 0 0 1 1 1 0.63149507
## 4 1 0 1 1 1 0 0.62183181
## 5 0 1 1 1 1 1 0.65324511
## 5 1 1 1 1 1 0 0.64858766
## 5 1 1 1 1 0 1 0.64461237
## 5 1 1 0 1 1 1 0.64180690
## 5 1 0 1 1 1 1 0.62418952
## 5 1 1 1 0 1 1 0.30078818
## 6 1 1 1 1 1 1 0.63907062
By the same criterion, the best asphalt model contains x2, x3, and x4 (adjusted \(R^2 = 0.6685\)).
A third example:
n=1000
x1=rnorm(n)
x2=rnorm(n)
x3=rnorm(n)
x4=rnorm(n)
y=1+2*x1+3*x2+rnorm(n)
results=leaps( x=cbind(x1,x2,x3,x4), y=y, method="r2")
cbind(results$which, Rsq=results$r2)
## 1 2 3 4 Rsq
## 1 0 1 0 0 0.6097956074
## 1 1 0 0 0 0.2954778659
## 1 0 0 0 1 0.0012368128
## 1 0 0 1 0 0.0003920018
## 2 1 1 0 0 0.9261756045
## 2 0 1 1 0 0.6103199511
## 2 0 1 0 1 0.6098169039
## 2 1 0 0 1 0.2962985076
## 2 1 0 1 0 0.2955210672
## 2 0 0 1 1 0.0016020574
## 3 1 1 1 0 0.9261944546
## 3 1 1 0 1 0.9261827416
## 3 0 1 1 1 0.6103372699
## 3 1 0 1 1 0.2963492196
## 4 1 1 1 1 0.9262011552
results2=leaps( x=cbind(x1,x2,x3,x4), y=y, method="adjr2")
cbind(results2$which, adjRsq=results2$adjr2)
## 1 2 3 4 adjRsq
## 1 0 1 0 0 0.6094046210
## 1 1 0 0 0 0.2947719319
## 1 0 0 0 1 0.0002360481
## 1 0 0 1 0 -0.0006096094
## 2 1 1 0 0 0.9260275114
## 2 0 1 1 0 0.6095382459
## 2 0 1 0 1 0.6090341895
## 2 1 0 0 1 0.2948868697
## 2 1 0 1 0 0.2941078697
## 2 0 0 1 1 -0.0004007469
## 3 1 1 1 0 0.9259721487
## 3 1 1 0 1 0.9259604005
## 3 0 1 1 1 0.6091635869
## 3 1 0 1 1 0.2942297895
## 4 1 1 1 1 0.9259044764
AIC: Akaike’s Information Criterion
\[ AIC = n \log\bigg(\frac{SSRes}{n}\bigg) + 2p\]
BIC: Bayesian Information Criterion
\[ BIC = n \log\bigg(\frac{SSRes}{n}\bigg) + p \log(n)\]
\(p\) is the number of parameters in the model. In linear regression, \(p=k+1\), i.e., intercept plus slopes.
The models with small values of AIC or BIC are preferred.
Commonly used for more complicated modeling situations
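As a sanity check, the following sketch computes AIC by hand for one candidate model. Note that R has two conventions: extractAIC() (used internally by step()) implements \(n\log(SSRes/n) + 2p\) exactly, while AIC() adds a likelihood constant and counts the error variance as an extra parameter, so the two differ by a constant for a fixed data set.
fit <- lm(y ~ x1 + x2 + x4, data=hald)
n <- nrow(hald)
SSRes <- sum(resid(fit)^2)
p <- length(coef(fit))      # intercept plus slopes
n*log(SSRes/n) + 2*p        # 24.974, the value step() reports below
extractAIC(fit)             # same value (first element is p)
n*log(SSRes/n) + p*log(n)   # the BIC analogue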
Example
hald_allmodels=regsubsets(y~x1+x2+x3+x4, data=hald, nbest=4)
summary(hald_allmodels)
## Subset selection object
## Call: regsubsets.formula(y ~ x1 + x2 + x3 + x4, data = hald, nbest = 4)
## 4 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## x4 FALSE FALSE
## 4 subsets of each size up to 4
## Selection Algorithm: exhaustive
## x1 x2 x3 x4
## 1 ( 1 ) " " " " " " "*"
## 1 ( 2 ) " " "*" " " " "
## 1 ( 3 ) "*" " " " " " "
## 1 ( 4 ) " " " " "*" " "
## 2 ( 1 ) "*" "*" " " " "
## 2 ( 2 ) "*" " " " " "*"
## 2 ( 3 ) " " " " "*" "*"
## 2 ( 4 ) " " "*" "*" " "
## 3 ( 1 ) "*" "*" " " "*"
## 3 ( 2 ) "*" "*" "*" " "
## 3 ( 3 ) "*" " " "*" "*"
## 3 ( 4 ) " " "*" "*" "*"
## 4 ( 1 ) "*" "*" "*" "*"
plot(hald_allmodels, scale="bic")
plot(hald_allmodels, scale="adjr2")
Another example
asphalt_allmodels <- regsubsets(y~x1+x2+x3+x4+x5+x6,data=asphalt,nbest=3)
summary(asphalt_allmodels)
## Subset selection object
## Call: regsubsets.formula(y ~ x1 + x2 + x3 + x4 + x5 + x6, data = asphalt,
## nbest = 3)
## 6 Variables (and intercept)
## Forced in Forced out
## x1 FALSE FALSE
## x2 FALSE FALSE
## x3 FALSE FALSE
## x4 FALSE FALSE
## x5 FALSE FALSE
## x6 FALSE FALSE
## 3 subsets of each size up to 6
## Selection Algorithm: exhaustive
## x1 x2 x3 x4 x5 x6
## 1 ( 1 ) " " " " " " "*" " " " "
## 1 ( 2 ) "*" " " " " " " " " " "
## 1 ( 3 ) " " " " " " " " " " "*"
## 2 ( 1 ) " " "*" " " "*" " " " "
## 2 ( 2 ) " " " " "*" "*" " " " "
## 2 ( 3 ) " " " " " " "*" " " "*"
## 3 ( 1 ) " " "*" "*" "*" " " " "
## 3 ( 2 ) " " "*" " " "*" "*" " "
## 3 ( 3 ) " " "*" " " "*" " " "*"
## 4 ( 1 ) " " "*" "*" "*" "*" " "
## 4 ( 2 ) " " "*" "*" "*" " " "*"
## 4 ( 3 ) "*" "*" "*" "*" " " " "
## 5 ( 1 ) " " "*" "*" "*" "*" "*"
## 5 ( 2 ) "*" "*" "*" "*" "*" " "
## 5 ( 3 ) "*" "*" "*" "*" " " "*"
## 6 ( 1 ) "*" "*" "*" "*" "*" "*"
plot(asphalt_allmodels)
A third example
n=1000
x1=rnorm(n)
x2=rnorm(n)
x3=rnorm(n)
x4=rnorm(n)
y=1+2*x1+3*x2+rnorm(n)
m1=lm(y~x1)
m2=lm(y~x1+x2)
m3=lm(y~x1+x2+x3)
m4=lm(y~x1+x2+x3+x4)
BIC(m1)
## [1] 5190.934
BIC(m2)
## [1] 2842.267
BIC(m3)
## [1] 2847.547
BIC(m4)
## [1] 2854.439
AIC(m1)
## [1] 5176.211
AIC(m2)
## [1] 2822.636
AIC(m3)
## [1] 2823.008
AIC(m4)
## [1] 2824.992
Stepwise regression methods include:
Forward Selection
Backward Elimination
Stepwise Regression (combination of forward and backward)
Forward Selection
Procedure is based on the idea that no variables are in the model originally; regressors are then added one at a time
The first regressor selected to be entered into the model is the one with the highest correlation with the response
If the F statistic corresponding to the model containing this variable is significant (larger than some predetermined value \(F_{in}\)), then that regressor is left in the model
The second regressor examined is the one with the largest partial correlation with the response. If the F-statistic corresponding to the addition of this variable is significant, the regressor is retained
This process continues until all regressors are examined
Example Hald data set
# forward selection using F
add1(lm(y~1,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
##
## Model:
## y ~ 1
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2715.76 71.444
## x1 1 1450.08 1265.69 63.519 12.6025 0.0045520 **
## x2 1 1809.43 906.34 59.178 21.9606 0.0006648 ***
## x3 1 776.36 1939.40 69.067 4.4034 0.0597623 .
## x4 1 1831.90 883.87 58.852 22.7985 0.0005762 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
##
## Model:
## y ~ x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 883.87 58.852
## x1 1 809.10 74.76 28.742 108.2239 1.105e-06 ***
## x2 1 14.99 868.88 60.629 0.1725 0.6867
## x3 1 708.13 175.74 39.853 40.2946 8.375e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x1+x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
##
## Model:
## y ~ x1 + x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 74.762 28.742
## x2 1 26.789 47.973 24.974 5.0259 0.05169 .
## x3 1 23.926 50.836 25.728 4.2358 0.06969 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
End of Example
Backward Elimination
Procedure is based on the idea that all variables are in the model originally; they are then examined one at a time and removed if not significant
The partial F statistic is calculated for each variable as if it were the last one added to the model
The regressor with the smallest F statistic is examined first and will be removed if this value is less than some predetermined value \(F_{out}\)
If this regressor is removed, then the model is refit with the remaining regressor variables and the partial F statistics calculated again
The regressor with the smallest partial F statistic will be removed if that value is less than \(F_{out}\).
The process continues until all regressors are examined
Example Hald data set
# backward selection using F
drop1(lm(y~x1+x2+x3+x4,data=hald), test="F")
## Single term deletions
##
## Model:
## y ~ x1 + x2 + x3 + x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 47.864 26.944
## x1 1 25.9509 73.815 30.576 4.3375 0.07082 .
## x2 1 2.9725 50.836 25.728 0.4968 0.50090
## x3 1 0.1091 47.973 24.974 0.0182 0.89592
## x4 1 0.2470 48.111 25.011 0.0413 0.84407
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lm(y~x1+x2+x4,data=hald), test="F")
## Single term deletions
##
## Model:
## y ~ x1 + x2 + x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 47.97 24.974
## x1 1 820.91 868.88 60.629 154.0076 5.781e-07 ***
## x2 1 26.79 74.76 28.742 5.0259 0.05169 .
## x4 1 9.93 57.90 25.420 1.8633 0.20540
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lm(y~x1+x2,data=hald), test="F")
## Single term deletions
##
## Model:
## y ~ x1 + x2
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 57.90 25.420
## x1 1 848.43 906.34 59.178 146.52 2.692e-07 ***
## x2 1 1207.78 1265.69 63.519 208.58 5.029e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
End of Example
Stepwise Regression
This procedure is a modification of forward selection
The contribution of each regressor variable that is put into the model is reassessed by way of its partial F statistic
A regressor that makes it into the model may later be removed if it is found to be insignificant once other variables have been added
If the partial F statistic is less than \(F_{out}\), the variable will be removed.
Stepwise regression requires both an \(F_{in}\) value and an \(F_{out}\) value
Stepwise Regression Methods
No one model may be the “best”
The three stepwise techniques could result in different models
Inexperienced analysts may use the final model simply because the procedure spit it out.
Example Hald data set
# stepwise selection using F
add1(lm(y~1,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
##
## Model:
## y ~ 1
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 2715.76 71.444
## x1 1 1450.08 1265.69 63.519 12.6025 0.0045520 **
## x2 1 1809.43 906.34 59.178 21.9606 0.0006648 ***
## x3 1 776.36 1939.40 69.067 4.4034 0.0597623 .
## x4 1 1831.90 883.87 58.852 22.7985 0.0005762 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
##
## Model:
## y ~ x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 883.87 58.852
## x1 1 809.10 74.76 28.742 108.2239 1.105e-06 ***
## x2 1 14.99 868.88 60.629 0.1725 0.6867
## x3 1 708.13 175.74 39.853 40.2946 8.375e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x1+x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
##
## Model:
## y ~ x1 + x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 74.762 28.742
## x2 1 26.789 47.973 24.974 5.0259 0.05169 .
## x3 1 23.926 50.836 25.728 4.2358 0.06969 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x1+x2+x4,data=hald), y~x1+x2+x3+x4, test="F")
## Single term additions
##
## Model:
## y ~ x1 + x2 + x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 47.973 24.974
## x3 1 0.10909 47.864 26.944 0.0182 0.8959
drop1(lm(y~x1+x2+x4,data=hald), test="F")
## Single term deletions
##
## Model:
## y ~ x1 + x2 + x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 47.97 24.974
## x1 1 820.91 868.88 60.629 154.0076 5.781e-07 ***
## x2 1 26.79 74.76 28.742 5.0259 0.05169 .
## x4 1 9.93 57.90 25.420 1.8633 0.20540
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
drop1(lm(y~x1+x2,data=hald), test="F")
## Single term deletions
##
## Model:
## y ~ x1 + x2
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 57.90 25.420
## x1 1 848.43 906.34 59.178 146.52 2.692e-07 ***
## x2 1 1207.78 1265.69 63.519 208.58 5.029e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Similarly, we could use AIC for stepwise regression.
# stepwise selection using AIC
null=lm(y~1, data=hald)
full=lm(y~., data=hald)
step(null, scope=list(lower=null, upper=full), direction="forward")
## Start: AIC=71.44
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 58.852
## + x2 1 1809.43 906.34 59.178
## + x1 1 1450.08 1265.69 63.519
## + x3 1 776.36 1939.40 69.067
## <none> 2715.76 71.444
##
## Step: AIC=58.85
## y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 28.742
## + x3 1 708.13 175.74 39.853
## <none> 883.87 58.852
## + x2 1 14.99 868.88 60.629
##
## Step: AIC=28.74
## y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.789 47.973 24.974
## + x3 1 23.926 50.836 25.728
## <none> 74.762 28.742
##
## Step: AIC=24.97
## y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 47.973 24.974
## + x3 1 0.10909 47.864 26.944
##
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = hald)
##
## Coefficients:
## (Intercept) x4 x1 x2
## 71.6483 -0.2365 1.4519 0.4161
step(full, scope=list(lower=null, upper=full), direction="backward")
## Start: AIC=26.94
## y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 24.974
## - x4 1 0.2470 48.111 25.011
## - x2 1 2.9725 50.836 25.728
## <none> 47.864 26.944
## - x1 1 25.9509 73.815 30.576
##
## Step: AIC=24.97
## y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## <none> 47.97 24.974
## - x4 1 9.93 57.90 25.420
## - x2 1 26.79 74.76 28.742
## - x1 1 820.91 868.88 60.629
##
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = hald)
##
## Coefficients:
## (Intercept) x1 x2 x4
## 71.6483 1.4519 0.4161 -0.2365
step(full, scope=list(lower=null, upper=full), direction="both")
## Start: AIC=26.94
## y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 24.974
## - x4 1 0.2470 48.111 25.011
## - x2 1 2.9725 50.836 25.728
## <none> 47.864 26.944
## - x1 1 25.9509 73.815 30.576
##
## Step: AIC=24.97
## y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## <none> 47.97 24.974
## - x4 1 9.93 57.90 25.420
## + x3 1 0.11 47.86 26.944
## - x2 1 26.79 74.76 28.742
## - x1 1 820.91 868.88 60.629
##
## Call:
## lm(formula = y ~ x1 + x2 + x4, data = hald)
##
## Coefficients:
## (Intercept) x1 x2 x4
## 71.6483 1.4519 0.4161 -0.2365
step(null, scope=list(lower=null, upper=full), direction="both")
## Start: AIC=71.44
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 58.852
## + x2 1 1809.43 906.34 59.178
## + x1 1 1450.08 1265.69 63.519
## + x3 1 776.36 1939.40 69.067
## <none> 2715.76 71.444
##
## Step: AIC=58.85
## y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 28.742
## + x3 1 708.13 175.74 39.853
## <none> 883.87 58.852
## + x2 1 14.99 868.88 60.629
## - x4 1 1831.90 2715.76 71.444
##
## Step: AIC=28.74
## y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.79 47.97 24.974
## + x3 1 23.93 50.84 25.728
## <none> 74.76 28.742
## - x1 1 809.10 883.87 58.852
## - x4 1 1190.92 1265.69 63.519
##
## Step: AIC=24.97
## y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 47.97 24.974
## - x4 1 9.93 57.90 25.420
## + x3 1 0.11 47.86 26.944
## - x2 1 26.79 74.76 28.742
## - x1 1 820.91 868.88 60.629
##
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = hald)
##
## Coefficients:
## (Intercept) x4 x1 x2
## 71.6483 -0.2365 1.4519 0.4161
Similarly, we could use BIC for stepwise regression.
# stepwise selection using BIC
n=dim(hald)[1]
step(null, scope=list(lower=null, upper=full), k = log(n), direction="forward")
## Start: AIC=72.01
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 59.982
## + x2 1 1809.43 906.34 60.308
## + x1 1 1450.08 1265.69 64.649
## + x3 1 776.36 1939.40 70.197
## <none> 2715.76 72.009
##
## Step: AIC=59.98
## y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 30.437
## + x3 1 708.13 175.74 41.547
## <none> 883.87 59.982
## + x2 1 14.99 868.88 62.324
##
## Step: AIC=30.44
## y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.789 47.973 27.234
## + x3 1 23.926 50.836 27.987
## <none> 74.762 30.437
##
## Step: AIC=27.23
## y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 47.973 27.234
## + x3 1 0.10909 47.864 29.769
##
## Call:
## lm(formula = y ~ x4 + x1 + x2, data = hald)
##
## Coefficients:
## (Intercept) x4 x1 x2
## 71.6483 -0.2365 1.4519 0.4161
step(full, scope=list(lower=null, upper=full), k = log(n), direction="backward")
## Start: AIC=29.77
## y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 27.234
## - x4 1 0.2470 48.111 27.271
## - x2 1 2.9725 50.836 27.987
## <none> 47.864 29.769
## - x1 1 25.9509 73.815 32.836
##
## Step: AIC=27.23
## y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 9.93 57.90 27.115
## <none> 47.97 27.234
## - x2 1 26.79 74.76 30.437
## - x1 1 820.91 868.88 62.324
##
## Step: AIC=27.11
## y ~ x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 57.90 27.115
## - x1 1 848.43 906.34 60.308
## - x2 1 1207.78 1265.69 64.649
##
## Call:
## lm(formula = y ~ x1 + x2, data = hald)
##
## Coefficients:
## (Intercept) x1 x2
## 52.5773 1.4683 0.6623
step(full, scope=list(lower=null, upper=full), k = log(n), direction="both")
## Start: AIC=29.77
## y ~ x1 + x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 0.1091 47.973 27.234
## - x4 1 0.2470 48.111 27.271
## - x2 1 2.9725 50.836 27.987
## <none> 47.864 29.769
## - x1 1 25.9509 73.815 32.836
##
## Step: AIC=27.23
## y ~ x1 + x2 + x4
##
## Df Sum of Sq RSS AIC
## - x4 1 9.93 57.90 27.115
## <none> 47.97 27.234
## + x3 1 0.11 47.86 29.769
## - x2 1 26.79 74.76 30.437
## - x1 1 820.91 868.88 62.324
##
## Step: AIC=27.11
## y ~ x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 57.90 27.115
## + x4 1 9.93 47.97 27.234
## + x3 1 9.79 48.11 27.271
## - x1 1 848.43 906.34 60.308
## - x2 1 1207.78 1265.69 64.649
##
## Call:
## lm(formula = y ~ x1 + x2, data = hald)
##
## Coefficients:
## (Intercept) x1 x2
## 52.5773 1.4683 0.6623
step(null, scope=list(lower=null, upper=full), k = log(n), direction="both")
## Start: AIC=72.01
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 1831.90 883.87 59.982
## + x2 1 1809.43 906.34 60.308
## + x1 1 1450.08 1265.69 64.649
## + x3 1 776.36 1939.40 70.197
## <none> 2715.76 72.009
##
## Step: AIC=59.98
## y ~ x4
##
## Df Sum of Sq RSS AIC
## + x1 1 809.10 74.76 30.437
## + x3 1 708.13 175.74 41.547
## <none> 883.87 59.982
## + x2 1 14.99 868.88 62.324
## - x4 1 1831.90 2715.76 72.009
##
## Step: AIC=30.44
## y ~ x4 + x1
##
## Df Sum of Sq RSS AIC
## + x2 1 26.79 47.97 27.234
## + x3 1 23.93 50.84 27.987
## <none> 74.76 30.437
## - x1 1 809.10 883.87 59.982
## - x4 1 1190.92 1265.69 64.649
##
## Step: AIC=27.23
## y ~ x4 + x1 + x2
##
## Df Sum of Sq RSS AIC
## - x4 1 9.93 57.90 27.115
## <none> 47.97 27.234
## + x3 1 0.11 47.86 29.769
## - x2 1 26.79 74.76 30.437
## - x1 1 820.91 868.88 62.324
##
## Step: AIC=27.11
## y ~ x1 + x2
##
## Df Sum of Sq RSS AIC
## <none> 57.90 27.115
## + x4 1 9.93 47.97 27.234
## + x3 1 9.79 48.11 27.271
## - x1 1 848.43 906.34 60.308
## - x2 1 1207.78 1265.69 64.649
##
## Call:
## lm(formula = y ~ x1 + x2, data = hald)
##
## Coefficients:
## (Intercept) x1 x2
## 52.5773 1.4683 0.6623
LASSO stands for Least Absolute Shrinkage and Selection Operator. The algorithm is another variation of linear regression, like ridge regression. We use lasso regression when we have a large number of predictor variables.
Lasso regression selects variables and estimates coefficients simultaneously by applying L1 regularization. The L1 regularization adds a penalty proportional to the sum of the absolute values of the regression coefficients and tries to shrink them. The lasso objective is similar to that of ridge regression and is given below.
Traditional least squares estimation: \[ \hat{\beta} = \arg\min \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\Big)^2 \]
Ridge regression estimation: \[ \hat{\beta}_\lambda = \arg\min \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{k} \beta_j^2 \]
LASSO regression estimation: \[ \hat{\beta}_\lambda = \arg\min \sum_{i=1}^{n} \Big(y_i - \beta_0 - \sum_{j=1}^{k} \beta_j x_{ij}\Big)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \]
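Why can the L1 penalty set coefficients exactly to zero while the L2 penalty cannot? For an orthonormal design the lasso solution has a closed form: soft-thresholding of the least squares estimates at a threshold proportional to \(\lambda\). A minimal sketch of that operator (a standard result, not specific to these notes):
# Soft-thresholding: shrink toward zero by t, and snap to exactly 0 once |b| <= t
soft_threshold <- function(b, t) sign(b) * pmax(abs(b) - t, 0)
soft_threshold(c(-3, -0.5, 0.2, 2), t = 1)   # -2 0 0 1: small coefficients become exactly zero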
The following diagram is the visual interpretation comparing OLS and lasso regression:
Think about \(\beta = (5,5)\) versus \(\beta = (10,0)\) in ridge regression and lasso regression: the ridge penalty is \(5^2+5^2=50\) versus \(10^2+0^2=100\), so ridge strongly prefers spreading the coefficients, while the lasso penalty is \(10\) in both cases, so the lasso has no aversion to solutions with some coefficients exactly zero.
If \(\lambda = 0\), we get the same coefficients as linear regression. If \(\lambda\) is very large, all coefficients are shrunk towards zero. \(\lambda\) controls the level of sparsity.
The LASSO does not handle highly correlated variables well and can sometimes behave erratically in their presence.
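A small simulated sketch of this instability (the data generating process here is made up for illustration): with two nearly identical predictors, the lasso tends to keep one essentially arbitrarily and zero out the other.
library(glmnet)
set.seed(2)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd=0.01)   # x2 is almost a copy of x1
y <- 1 + x1 + x2 + rnorm(n)
coef(glmnet(x=cbind(x1,x2), y=y, alpha=1), s=0.1)  # typically one coefficient near 2, the other exactly 0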
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-2
lreg <- glmnet(x=as.matrix(hald[,-1]), y=hald[,1],alpha=1)
#rbind(lreg$lambda,lreg$beta)
matplot(lreg$lambda,t(lreg$beta),type="l",ylab="coefficient",xlab="lambda")
abline(h=0)
plot(lreg)
We can apply cross-validation to find the best \(\lambda\):
cv_fit <- cv.glmnet(x=as.matrix(hald[,-1]), y=hald[,1],alpha=1,nfolds = 4)
plot(cv_fit)
opt_lambda <- cv_fit$lambda.min
opt_lambda
## [1] 0.07810171
We can also predict with Lasso regression:
#only matrix form is acceptable in glmnet (for both ridge and lasso)
old_x <- as.matrix(hald[,-1])
y_predicted <- predict(lreg, s = opt_lambda, newx = old_x)
newx <- matrix(c(x1=15,x2=50,x3=6,x4=70),ncol=4)
predict(lreg, s = opt_lambda, newx = newx)
## 1
## [1,] 97.64065
And the in-sample mse:
mse <- mean((hald[,1] - y_predicted)^2)
mse
## [1] 3.700099
#your turn, what's the in-sample mse from the best model selected by R-square?
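One possible sketch of an answer, using the x1, x2, x4 model favored by adjusted R squared earlier:
best_fit <- lm(y ~ x1 + x2 + x4, data=hald)
mean(resid(best_fit)^2)   # in-sample mse = SSRes/n; compare with the lasso's 3.70 above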
add1(lm(y~1,data=asphalt[,-1]), y~x1+x2+x3+x4+x5+x6, test="F")
## Single term additions
##
## Model:
## y ~ 1
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 1366.52 119.367
## x1 1 424.93 941.59 109.821 13.0872 0.001117 **
## x2 1 80.23 1286.29 119.491 1.8089 0.189067
## x3 1 60.38 1306.14 119.966 1.3406 0.256377
## x4 1 898.57 467.95 88.146 55.6859 3.181e-08 ***
## x5 1 99.00 1267.52 119.036 2.2651 0.143134
## x6 1 245.58 1120.94 115.226 6.3533 0.017473 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
add1(lm(y~x4,data=asphalt[,-1]), y~x1+x2+x3+x4+x5+x6, test="F")
## Single term additions
##
## Model:
## y ~ x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 467.95 88.146
## x1 1 1.9125 466.04 90.019 0.1149 0.7372
## x2 1 30.5099 437.44 88.056 1.9529 0.1733
## x3 1 17.7425 450.21 88.948 1.1035 0.3025
## x5 1 9.1078 458.85 89.536 0.5558 0.4622
## x6 1 14.7853 453.17 89.151 0.9135 0.3474
drop1(lm(y~x4,data=asphalt[,-1]), test="F")
## Single term deletions
##
## Model:
## y ~ x4
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 467.95 88.146
## x4 1 898.57 1366.52 119.367 55.686 3.181e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# stepwise with AIC
null=lm(y~1, data=asphalt[,-1])
full=lm(y~., data=asphalt[,-1])
step(null, scope=list(lower=null, upper=full), direction="forward")
## Start: AIC=119.37
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 898.57 467.95 88.146
## + x1 1 424.93 941.59 109.821
## + x6 1 245.58 1120.94 115.226
## + x5 1 99.00 1267.52 119.036
## <none> 1366.52 119.367
## + x2 1 80.23 1286.29 119.491
## + x3 1 60.38 1306.14 119.966
##
## Step: AIC=88.15
## y ~ x4
##
## Df Sum of Sq RSS AIC
## + x2 1 30.5099 437.44 88.056
## <none> 467.95 88.146
## + x3 1 17.7425 450.21 88.948
## + x6 1 14.7853 453.17 89.151
## + x5 1 9.1078 458.85 89.536
## + x1 1 1.9125 466.04 90.019
##
## Step: AIC=88.06
## y ~ x4 + x2
##
## Df Sum of Sq RSS AIC
## + x3 1 29.6911 407.75 87.877
## <none> 437.44 88.056
## + x5 1 21.6476 415.80 88.482
## + x6 1 4.0999 433.34 89.764
## + x1 1 1.6627 435.78 89.938
##
## Step: AIC=87.88
## y ~ x4 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 407.75 87.877
## + x5 1 7.4946 400.26 89.302
## + x6 1 3.0297 404.72 89.646
## + x1 1 0.1908 407.56 89.862
##
## Call:
## lm(formula = y ~ x4 + x2 + x3, data = asphalt[, -1])
##
## Coefficients:
## (Intercept) x4 x2 x3
## -38.229 -10.253 4.534 5.800
step(full, scope=list(lower=null, upper=full), direction="backward")
## Start: AIC=92.86
## y ~ x1 + x2 + x3 + x4 + x5 + x6
##
## Df Sum of Sq RSS AIC
## - x1 1 0.30 394.87 90.882
## - x6 1 5.60 400.18 91.295
## - x5 1 10.13 404.70 91.644
## - x3 1 13.32 407.90 91.888
## <none> 394.57 92.858
## - x2 1 33.39 427.96 93.376
## - x4 1 401.66 796.24 112.623
##
## Step: AIC=90.88
## y ~ x2 + x3 + x4 + x5 + x6
##
## Df Sum of Sq RSS AIC
## - x6 1 5.38 400.26 89.302
## - x5 1 9.85 404.72 89.646
## - x3 1 13.08 407.95 89.892
## <none> 394.87 90.882
## - x2 1 33.87 428.74 91.433
## - x4 1 540.02 934.89 115.600
##
## Step: AIC=89.3
## y ~ x2 + x3 + x4 + x5
##
## Df Sum of Sq RSS AIC
## - x5 1 7.49 407.75 87.877
## - x3 1 15.54 415.80 88.482
## <none> 400.26 89.302
## - x2 1 48.10 448.36 90.820
## - x4 1 709.20 1109.46 118.907
##
## Step: AIC=87.88
## y ~ x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## <none> 407.75 87.877
## - x3 1 29.69 437.44 88.056
## - x2 1 42.46 450.21 88.948
## - x4 1 786.16 1193.92 119.181
##
## Call:
## lm(formula = y ~ x2 + x3 + x4, data = asphalt[, -1])
##
## Coefficients:
## (Intercept) x2 x3 x4
## -38.229 4.534 5.800 -10.253
step(full, scope=list(lower=null, upper=full), direction="both")
## Start: AIC=92.86
## y ~ x1 + x2 + x3 + x4 + x5 + x6
##
## Df Sum of Sq RSS AIC
## - x1 1 0.30 394.87 90.882
## - x6 1 5.60 400.18 91.295
## - x5 1 10.13 404.70 91.644
## - x3 1 13.32 407.90 91.888
## <none> 394.57 92.858
## - x2 1 33.39 427.96 93.376
## - x4 1 401.66 796.24 112.623
##
## Step: AIC=90.88
## y ~ x2 + x3 + x4 + x5 + x6
##
## Df Sum of Sq RSS AIC
## - x6 1 5.38 400.26 89.302
## - x5 1 9.85 404.72 89.646
## - x3 1 13.08 407.95 89.892
## <none> 394.87 90.882
## - x2 1 33.87 428.74 91.433
## + x1 1 0.30 394.57 92.858
## - x4 1 540.02 934.89 115.600
##
## Step: AIC=89.3
## y ~ x2 + x3 + x4 + x5
##
## Df Sum of Sq RSS AIC
## - x5 1 7.49 407.75 87.877
## - x3 1 15.54 415.80 88.482
## <none> 400.26 89.302
## - x2 1 48.10 448.36 90.820
## + x6 1 5.38 394.87 90.882
## + x1 1 0.08 400.18 91.295
## - x4 1 709.20 1109.46 118.907
##
## Step: AIC=87.88
## y ~ x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## <none> 407.75 87.877
## - x3 1 29.69 437.44 88.056
## - x2 1 42.46 450.21 88.948
## + x5 1 7.49 400.26 89.302
## + x6 1 3.03 404.72 89.646
## + x1 1 0.19 407.56 89.862
## - x4 1 786.16 1193.92 119.181
##
## Call:
## lm(formula = y ~ x2 + x3 + x4, data = asphalt[, -1])
##
## Coefficients:
## (Intercept) x2 x3 x4
## -38.229 4.534 5.800 -10.253
step(null, scope=list(lower=null, upper=full), direction="both")
## Start: AIC=119.37
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 898.57 467.95 88.146
## + x1 1 424.93 941.59 109.821
## + x6 1 245.58 1120.94 115.226
## + x5 1 99.00 1267.52 119.036
## <none> 1366.52 119.367
## + x2 1 80.23 1286.29 119.491
## + x3 1 60.38 1306.14 119.966
##
## Step: AIC=88.15
## y ~ x4
##
## Df Sum of Sq RSS AIC
## + x2 1 30.51 437.44 88.056
## <none> 467.95 88.146
## + x3 1 17.74 450.21 88.948
## + x6 1 14.79 453.17 89.151
## + x5 1 9.11 458.85 89.536
## + x1 1 1.91 466.04 90.019
## - x4 1 898.57 1366.52 119.367
##
## Step: AIC=88.06
## y ~ x4 + x2
##
## Df Sum of Sq RSS AIC
## + x3 1 29.69 407.75 87.877
## <none> 437.44 88.056
## - x2 1 30.51 467.95 88.146
## + x5 1 21.65 415.80 88.482
## + x6 1 4.10 433.34 89.764
## + x1 1 1.66 435.78 89.938
## - x4 1 848.84 1286.29 119.491
##
## Step: AIC=87.88
## y ~ x4 + x2 + x3
##
## Df Sum of Sq RSS AIC
## <none> 407.75 87.877
## - x3 1 29.69 437.44 88.056
## - x2 1 42.46 450.21 88.948
## + x5 1 7.49 400.26 89.302
## + x6 1 3.03 404.72 89.646
## + x1 1 0.19 407.56 89.862
## - x4 1 786.16 1193.92 119.181
##
## Call:
## lm(formula = y ~ x4 + x2 + x3, data = asphalt[, -1])
##
## Coefficients:
## (Intercept) x4 x2 x3
## -38.229 -10.253 4.534 5.800
# stepwise with BIC
n=dim(asphalt)[1]
step(null, scope=list(lower=null, upper=full), k = log(n), direction="forward")
## Start: AIC=120.8
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 898.57 467.95 91.014
## + x1 1 424.93 941.59 112.689
## + x6 1 245.58 1120.94 118.094
## <none> 1366.52 120.801
## + x5 1 99.00 1267.52 121.904
## + x2 1 80.23 1286.29 122.359
## + x3 1 60.38 1306.14 122.834
##
## Step: AIC=91.01
## y ~ x4
##
## Df Sum of Sq RSS AIC
## <none> 467.95 91.014
## + x2 1 30.5099 437.44 92.358
## + x3 1 17.7425 450.21 93.250
## + x6 1 14.7853 453.17 93.452
## + x5 1 9.1078 458.85 93.838
## + x1 1 1.9125 466.04 94.321
##
## Call:
## lm(formula = y ~ x4, data = asphalt[, -1])
##
## Coefficients:
## (Intercept) x4
## 11.72 -10.77
step(full, scope=list(lower=null, upper=full), k = log(n), direction="backward")
## Start: AIC=102.9
## y ~ x1 + x2 + x3 + x4 + x5 + x6
##
## Df Sum of Sq RSS AIC
## - x1 1 0.30 394.87 99.486
## - x6 1 5.60 400.18 99.899
## - x5 1 10.13 404.70 100.248
## - x3 1 13.32 407.90 100.492
## - x2 1 33.39 427.96 101.980
## <none> 394.57 102.896
## - x4 1 401.66 796.24 121.227
##
## Step: AIC=99.49
## y ~ x2 + x3 + x4 + x5 + x6
##
## Df Sum of Sq RSS AIC
## - x6 1 5.38 400.26 96.472
## - x5 1 9.85 404.72 96.816
## - x3 1 13.08 407.95 97.062
## - x2 1 33.87 428.74 98.603
## <none> 394.87 99.486
## - x4 1 540.02 934.89 122.770
##
## Step: AIC=96.47
## y ~ x2 + x3 + x4 + x5
##
## Df Sum of Sq RSS AIC
## - x5 1 7.49 407.75 93.613
## - x3 1 15.54 415.80 94.218
## <none> 400.26 96.472
## - x2 1 48.10 448.36 96.556
## - x4 1 709.20 1109.46 124.643
##
## Step: AIC=93.61
## y ~ x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 29.69 437.44 92.358
## - x2 1 42.46 450.21 93.250
## <none> 407.75 93.613
## - x4 1 786.16 1193.92 123.483
##
## Step: AIC=92.36
## y ~ x2 + x4
##
## Df Sum of Sq RSS AIC
## - x2 1 30.51 467.95 91.014
## <none> 437.44 92.358
## - x4 1 848.84 1286.29 122.359
##
## Step: AIC=91.01
## y ~ x4
##
## Df Sum of Sq RSS AIC
## <none> 467.95 91.014
## - x4 1 898.57 1366.52 120.801
##
## Call:
## lm(formula = y ~ x4, data = asphalt[, -1])
##
## Coefficients:
## (Intercept) x4
## 11.72 -10.77
step(full, scope=list(lower=null, upper=full), k = log(n), direction="both")
## Start: AIC=102.9
## y ~ x1 + x2 + x3 + x4 + x5 + x6
##
## Df Sum of Sq RSS AIC
## - x1 1 0.30 394.87 99.486
## - x6 1 5.60 400.18 99.899
## - x5 1 10.13 404.70 100.248
## - x3 1 13.32 407.90 100.492
## - x2 1 33.39 427.96 101.980
## <none> 394.57 102.896
## - x4 1 401.66 796.24 121.227
##
## Step: AIC=99.49
## y ~ x2 + x3 + x4 + x5 + x6
##
## Df Sum of Sq RSS AIC
## - x6 1 5.38 400.26 96.472
## - x5 1 9.85 404.72 96.816
## - x3 1 13.08 407.95 97.062
## - x2 1 33.87 428.74 98.603
## <none> 394.87 99.486
## + x1 1 0.30 394.57 102.896
## - x4 1 540.02 934.89 122.770
##
## Step: AIC=96.47
## y ~ x2 + x3 + x4 + x5
##
## Df Sum of Sq RSS AIC
## - x5 1 7.49 407.75 93.613
## - x3 1 15.54 415.80 94.218
## <none> 400.26 96.472
## - x2 1 48.10 448.36 96.556
## + x6 1 5.38 394.87 99.486
## + x1 1 0.08 400.18 99.899
## - x4 1 709.20 1109.46 124.643
##
## Step: AIC=93.61
## y ~ x2 + x3 + x4
##
## Df Sum of Sq RSS AIC
## - x3 1 29.69 437.44 92.358
## - x2 1 42.46 450.21 93.250
## <none> 407.75 93.613
## + x5 1 7.49 400.26 96.472
## + x6 1 3.03 404.72 96.816
## + x1 1 0.19 407.56 97.032
## - x4 1 786.16 1193.92 123.483
##
## Step: AIC=92.36
## y ~ x2 + x4
##
## Df Sum of Sq RSS AIC
## - x2 1 30.51 467.95 91.014
## <none> 437.44 92.358
## + x3 1 29.69 407.75 93.613
## + x5 1 21.65 415.80 94.218
## + x6 1 4.10 433.34 95.500
## + x1 1 1.66 435.78 95.674
## - x4 1 848.84 1286.29 122.359
##
## Step: AIC=91.01
## y ~ x4
##
## Df Sum of Sq RSS AIC
## <none> 467.95 91.014
## + x2 1 30.51 437.44 92.358
## + x3 1 17.74 450.21 93.250
## + x6 1 14.79 453.17 93.452
## + x5 1 9.11 458.85 93.838
## + x1 1 1.91 466.04 94.321
## - x4 1 898.57 1366.52 120.801
##
## Call:
## lm(formula = y ~ x4, data = asphalt[, -1])
##
## Coefficients:
## (Intercept) x4
## 11.72 -10.77
step(null, scope=list(lower=null, upper=full), k = log(n), direction="both")
## Start: AIC=120.8
## y ~ 1
##
## Df Sum of Sq RSS AIC
## + x4 1 898.57 467.95 91.014
## + x1 1 424.93 941.59 112.689
## + x6 1 245.58 1120.94 118.094
## <none> 1366.52 120.801
## + x5 1 99.00 1267.52 121.904
## + x2 1 80.23 1286.29 122.359
## + x3 1 60.38 1306.14 122.834
##
## Step: AIC=91.01
## y ~ x4
##
## Df Sum of Sq RSS AIC
## <none> 467.95 91.014
## + x2 1 30.51 437.44 92.358
## + x3 1 17.74 450.21 93.250
## + x6 1 14.79 453.17 93.452
## + x5 1 9.11 458.85 93.838
## + x1 1 1.91 466.04 94.321
## - x4 1 898.57 1366.52 120.801
##
## Call:
## lm(formula = y ~ x4, data = asphalt[, -1])
##
## Coefficients:
## (Intercept) x4
## 11.72 -10.77
# then lasso?
# plot lasso?
# mse?
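The prompts above could be answered along the following lines (a sketch; it assumes, as in the earlier leaps() calls, that y is in column 2 of asphalt and x1-x6 are in columns 3:8):
X <- as.matrix(asphalt[,3:8])
y_asph <- asphalt[,2]
lreg_asph <- glmnet(x=X, y=y_asph, alpha=1)
plot(lreg_asph)   # lasso coefficient paths
cv_asph <- cv.glmnet(x=X, y=y_asph, alpha=1, nfolds=4)
y_hat <- predict(lreg_asph, s=cv_asph$lambda.min, newx=X)
mean((y_asph - y_hat)^2)   # in-sample mse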
End of Example