Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. These materials were initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.

Model validation in general

Let us start with the general model building process.

As a quick review, we have discussed what model adequacy checking means.

Model Adequacy Checking: examining residuals, checking the model assumptions (normality, constant variance, independence), and assessing goodness of fit, all using the same data that were used to fit the model.

On the other hand, let us define what model validation means.

Model Validation: assessing whether the model will perform well in its intended use, i.e., how well it predicts observations that were not used to fit it.

Some elementary tools for model validation: checking whether the estimated coefficients and predicted values are reasonable and stable, and comparing them with prior experience and theory.

To formally validate models, we could either collect fresh data and compare the model's predictions against the new observations, or split the existing data into training and testing sets.

Validation technique: cross validation

However, in practice, collecting new data is often expensive, slow, or infeasible, so we usually rely on splitting the data we already have.

In this section, we will focus on the following cross validation techniques: K-fold cross validation and leave-one-out cross validation.

K-fold cross validation

The idea of K-fold CV is summarized in the figure below.

K-fold cross-validation

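The procedure in the figure can be sketched directly in base R. This is a minimal illustration of the idea, not the DAAG implementation used below; the built-in mtcars data and the model mpg ~ wt + hp are arbitrary stand-ins.

```r
# Minimal K-fold cross-validation sketch (illustrative data and model)
set.seed(123)
d0 <- mtcars                                       # stand-in data set
K  <- 5                                            # number of folds
folds <- sample(rep(1:K, length.out = nrow(d0)))   # random fold label per row
cvpred <- numeric(nrow(d0))
for (k in 1:K) {
  test <- folds == k
  fit  <- lm(mpg ~ wt + hp, data = d0[!test, ])        # fit on the other K-1 folds
  cvpred[test] <- predict(fit, newdata = d0[test, ])   # predict the held-out fold
}
kcv_mspe <- mean((d0$mpg - cvpred)^2)              # cross-validated MSPE
```

Note that each observation is predicted exactly once, by a model that never saw it during fitting.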

Leave one out cross validation

To split data into training and testing sets
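A single random split can be done with sample(); the 70/30 proportion here is a common but arbitrary choice, and mtcars again stands in for a real data set.

```r
# Random 70/30 train-test split (sketch)
set.seed(123)
d0  <- mtcars
idx <- sample(nrow(d0), size = floor(0.7 * nrow(d0)))  # training row indices
train <- d0[idx, ]
test  <- d0[-idx, ]
fit <- lm(mpg ~ wt + hp, data = train)                 # fit on the training set only
test_mspe <- mean((test$mpg - predict(fit, test))^2)   # evaluate on the test set
```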

Criteria for comparing models in cross validation
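The criteria used throughout this section can be written as small helper functions of the actual responses y and the cross-validated predictions cvpred. This is a sketch; the lowercase names are mine, chosen to avoid clashing with the PRESS() and MSPE() functions defined at the end of this section.

```r
# PRESS: predicted residual sum of squares (lower is better)
press <- function(y, cvpred) sum((y - cvpred)^2)
# MSPE: mean squared prediction error = PRESS / n (lower is better)
mspe <- function(y, cvpred) mean((y - cvpred)^2)
# Prediction R squared (higher is better): fraction of total variation
# explained by the out-of-sample predictions
pred_rsq <- function(y, cvpred) 1 - press(y, cvpred) / sum((y - mean(y))^2)
```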

Example: the delivery time data set. Now we perform K-fold cross validation.

#install DAAG from archived source
if(!is.element("DAAG", installed.packages()[,1])){
  packageurl <- "https://cran.r-project.org/src/contrib/Archive/DAAG/DAAG_1.22.tar.gz"
  install.packages("latticeExtra")
  install.packages(packageurl, repos=NULL, type="source")
}

#load DAAG package
library(DAAG)
## Loading required package: lattice
d <- read.csv("data_delivery.csv",header=TRUE)
head(d)
##   Index DeliveryTime NumberofCases Distance
## 1     1        16.68             7      560
## 2     2        11.50             3      220
## 3     3        12.03             3      340
## 4     4        14.88             4       80
## 5     5        13.75             6      150
## 6     6        18.11             7      330
n = dim(d)[1]
n
## [1] 25
model1 = lm(DeliveryTime ~ NumberofCases + Distance, data=d)
summary(model1)
## 
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7880 -0.6629  0.4364  1.1566  7.4197 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.341231   1.096730   2.135 0.044170 *  
## NumberofCases 1.615907   0.170735   9.464 3.25e-09 ***
## Distance      0.014385   0.003613   3.981 0.000631 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 22 degrees of freedom
## Multiple R-squared:  0.9596, Adjusted R-squared:  0.9559 
## F-statistic: 261.2 on 2 and 22 DF,  p-value: 4.687e-16
KCV=cv.lm(data=d, model1, m=3, seed=123)
## Analysis of Variance Table
## 
## Response: DeliveryTime
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## NumberofCases  1   5382    5382   506.6 < 2e-16 ***
## Distance       1    168     168    15.8 0.00063 ***
## Residuals     22    234      11                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in cv.lm(data = d, model1, m = 3, seed = 123): 
## 
##  As there is >1 explanatory variable, cross-validation
##  predicted values for a fold are not a linear function
##  of corresponding overall predicted values.  Lines that
##  are shown for the different folds are approximate

## 
## fold 1 
## Observations in test set: 8 
##                  1     4      6    10    11     12   13   19
## Predicted    21.71  9.96 18.400 19.12 38.09 21.593 12.5 7.71
## cvpred       21.69  9.15 17.933 19.29 37.89 20.759 12.0 6.86
## DeliveryTime 16.68 14.88 18.110 21.50 40.33 21.000 13.5 9.50
## CV residual  -5.01  5.73  0.177  2.21  2.44  0.241  1.5 2.64
## 
## Sum of squares = 78.1    Mean square = 9.76    n = 8 
## 
## fold 2 
## Observations in test set: 9 
##                 2      5     15    16    18    20    22    23     25
## Predicted    10.4 14.194 23.329 29.66 15.55 40.89 56.01 23.36 10.963
## cvpred       10.0 14.624 24.141 30.35 16.24 43.24 60.45 24.17 10.918
## DeliveryTime 11.5 13.750 24.000 29.00 19.00 35.10 52.32 18.75 10.750
## CV residual   1.5 -0.874 -0.141 -1.35  2.76 -8.14 -8.13 -5.42 -0.168
## 
## Sum of squares = 174    Mean square = 19.4    n = 9 
## 
## fold 3 
## Observations in test set: 8 
##                   3     7      8    9    14     17    21    24
## Predicted    12.080  7.16 16.673 71.8 18.68 14.914 20.51 24.40
## cvpred       12.685  8.89 17.442 64.6 18.43 15.831 21.28 23.17
## DeliveryTime 12.030  8.00 17.830 79.2 19.75 15.350 17.90 19.83
## CV residual  -0.655 -0.89  0.388 14.6  1.32 -0.481 -3.38 -3.34
## 
## Sum of squares = 240    Mean square = 29.9    n = 8 
## 
## Overall (Sum over all 8 folds) 
##   ms 
## 19.7
# MSPE can be alternatively calculated as
sum((d$DeliveryTime-KCV$cvpred)^2)/n
## [1] 19.7
# PRESS can be calculated as 
sum((d$DeliveryTime-KCV$cvpred)^2)
## [1] 492
# Prediction R squared can be calculated as
1-sum((d$DeliveryTime-KCV$cvpred)^2)/sum((d$DeliveryTime-mean(d$DeliveryTime))^2)
## [1] 0.915
summary(model1)$r.squared
## [1] 0.96

Let us take the 9th observation as an example. The actual observed delivery time is d$DeliveryTime[9]=79.24. The fitted value is model1$fitted.values[9]=71.82. The cross-validated prediction is KCV$cvpred[9]=64.624.

We further perform leave one out cross validation.

LOOCV=cv.lm(data=d, model1, m=n,seed=123)
## Analysis of Variance Table
## 
## Response: DeliveryTime
##               Df Sum Sq Mean Sq F value  Pr(>F)    
## NumberofCases  1   5382    5382   506.6 < 2e-16 ***
## Distance       1    168     168    15.8 0.00063 ***
## Residuals     22    234      11                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Warning in cv.lm(data = d, model1, m = n, seed = 123): 
## 
##  As there is >1 explanatory variable, cross-validation
##  predicted values for a fold are not a linear function
##  of corresponding overall predicted values.  Lines that
##  are shown for the different folds are approximate
## Warning in plot.xy(xy, type, ...): unimplemented pch value '26'
## 
## fold 1 
## Observations in test set: 1 
##                  16
## Predicted    29.663
## cvpred       29.795
## DeliveryTime 29.000
## CV residual  -0.795
## 
## Sum of squares = 0.63    Mean square = 0.63    n = 1
## 
## fold 2 
## Observations in test set: 1 
##                 18
## Predicted    15.55
## cvpred       15.18
## DeliveryTime 19.00
## CV residual   3.82
## 
## Sum of squares = 14.6    Mean square = 14.6    n = 1
## 
## fold 3 
## Observations in test set: 1 
##                 21
## Predicted    20.51
## cvpred       21.03
## DeliveryTime 17.90
## CV residual  -3.13
## 
## Sum of squares = 9.81    Mean square = 9.81    n = 1
## 
## fold 4 
## Observations in test set: 1 
##                  4
## Predicted     9.96
## cvpred        9.50
## DeliveryTime 14.88
## CV residual   5.38
## 
## Sum of squares = 29    Mean square = 29    n = 1
## 
## fold 5 
## Observations in test set: 1 
##                  25
## Predicted    10.963
## cvpred       10.978
## DeliveryTime 10.750
## CV residual  -0.228
## 
## Sum of squares = 0.05    Mean square = 0.05    n = 1
## 
## fold 6 
## Observations in test set: 1 
##                  8
## Predicted    16.67
## cvpred       16.59
## DeliveryTime 17.83
## CV residual   1.24
## 
## Sum of squares = 1.53    Mean square = 1.53    n = 1
## 
## fold 7 
## Observations in test set: 1 
##                 10
## Predicted    19.12
## cvpred       18.54
## DeliveryTime 21.50
## CV residual   2.96
## 
## Sum of squares = 8.74    Mean square = 8.74    n = 1
## 
## fold 8 
## Observations in test set: 1 
##                 23
## Predicted    23.36
## cvpred       23.56
## DeliveryTime 18.75
## CV residual  -4.81
## 
## Sum of squares = 23.1    Mean square = 23.1    n = 1
## 
## fold 9 
## Observations in test set: 1 
##                24
## Predicted    24.4
## cvpred       25.0
## DeliveryTime 19.8
## CV residual  -5.2
## 
## Sum of squares = 27    Mean square = 27    n = 1
## 
## fold 10 
## Observations in test set: 1 
##                 11
## Predicted    38.09
## cvpred       37.88
## DeliveryTime 40.33
## CV residual   2.45
## 
## Sum of squares = 5.99    Mean square = 5.99    n = 1
## 
## fold 11 
## Observations in test set: 1 
##                  5
## Predicted    14.19
## cvpred       14.23
## DeliveryTime 13.75
## CV residual  -0.48
## 
## Sum of squares = 0.23    Mean square = 0.23    n = 1
## 
## fold 12 
## Observations in test set: 1 
##                 7
## Predicted    7.16
## cvpred       7.08
## DeliveryTime 8.00
## CV residual  0.92
## 
## Sum of squares = 0.85    Mean square = 0.85    n = 1
## 
## fold 13 
## Observations in test set: 1 
##                19
## Predicted    7.71
## cvpred       7.52
## DeliveryTime 9.50
## CV residual  1.98
## 
## Sum of squares = 3.94    Mean square = 3.94    n = 1
## 
## fold 14 
## Observations in test set: 1 
##                 20
## Predicted    40.89
## cvpred       41.54
## DeliveryTime 35.10
## CV residual  -6.44
## 
## Sum of squares = 41.5    Mean square = 41.5    n = 1
## 
## fold 15 
## Observations in test set: 1 
##                    3
## Predicted    12.0798
## cvpred       12.0852
## DeliveryTime 12.0300
## CV residual  -0.0552
## 
## Sum of squares = 0    Mean square = 0    n = 1
## 
## fold 16 
## Observations in test set: 1 
##                 1
## Predicted    21.7
## cvpred       22.3
## DeliveryTime 16.7
## CV residual  -5.6
## 
## Sum of squares = 31.3    Mean square = 31.3    n = 1
## 
## fold 17 
## Observations in test set: 1 
##                 22
## Predicted    56.01
## cvpred       58.38
## DeliveryTime 52.32
## CV residual  -6.06
## 
## Sum of squares = 36.7    Mean square = 36.7    n = 1
## 
## fold 18 
## Observations in test set: 1 
##                  17
## Predicted    14.914
## cvpred       14.886
## DeliveryTime 15.350
## CV residual   0.464
## 
## Sum of squares = 0.22    Mean square = 0.22    n = 1
## 
## fold 19 
## Observations in test set: 1 
##                   6
## Predicted    18.400
## cvpred       18.413
## DeliveryTime 18.110
## CV residual  -0.303
## 
## Sum of squares = 0.09    Mean square = 0.09    n = 1
## 
## fold 20 
## Observations in test set: 1 
##                  2
## Predicted    10.35
## cvpred       10.27
## DeliveryTime 11.50
## CV residual   1.23
## 
## Sum of squares = 1.52    Mean square = 1.52    n = 1
## 
## fold 21 
## Observations in test set: 1 
##                 14
## Predicted    18.68
## cvpred       18.59
## DeliveryTime 19.75
## CV residual   1.16
## 
## Sum of squares = 1.34    Mean square = 1.34    n = 1
## 
## fold 22 
## Observations in test set: 1 
##                  12
## Predicted    21.593
## cvpred       21.669
## DeliveryTime 21.000
## CV residual  -0.669
## 
## Sum of squares = 0.45    Mean square = 0.45    n = 1
## 
## fold 23 
## Observations in test set: 1 
##                15
## Predicted    23.3
## cvpred       23.3
## DeliveryTime 24.0
## CV residual   0.7
## 
## Sum of squares = 0.49    Mean square = 0.49    n = 1
## 
## fold 24 
## Observations in test set: 1 
##                 9
## Predicted    71.8
## cvpred       64.5
## DeliveryTime 79.2
## CV residual  14.8
## 
## Sum of squares = 219    Mean square = 219    n = 1
## 
## fold 25 
## Observations in test set: 1 
##                 13
## Predicted    12.47
## cvpred       12.41
## DeliveryTime 13.50
## CV residual   1.09
## 
## Sum of squares = 1.2    Mean square = 1.2    n = 1

## 
## Overall (Sum over all 1 folds) 
##   ms 
## 18.4
# MSPE can be alternatively calculated as
sum((d$DeliveryTime-LOOCV$cvpred)^2)/n
## [1] 18.4
# PRESS can be calculated as 
sum((d$DeliveryTime-LOOCV$cvpred)^2)
## [1] 459
# Prediction R squared can be calculated as
1-sum((d$DeliveryTime-LOOCV$cvpred)^2)/sum((d$DeliveryTime-mean(d$DeliveryTime))^2)
## [1] 0.921
summary(model1)$r.squared
## [1] 0.96

Finally, I have included three functions that quickly compute the PRESS, MSPE, and prediction R squared.

### This calculates the PRESS (predictive residual sum of squares); the lower, the better
#' @title PRESS
#' @author Thomas Hopper
#' @description Returns the PRESS statistic (predictive residual sum of squares).
#'              Useful for evaluating predictive power of regression models.
#' @param linear.model A linear regression model (class 'lm'). Required.
PRESS <- function(linear.model) {
  # calculate the predictive (leave-one-out) residuals using the hat values
  pr <- residuals(linear.model)/(1 - lm.influence(linear.model)$hat)
  # calculate the PRESS
  PRESS <- sum(pr^2)
  return(PRESS)
}
### This calculates the MSPE (mean square prediction error); the lower, the better
#' @title MSPE
#' @author Yichen Qin
#' @description Returns the MSPE statistic (mean square prediction error).
#' @param linear.model A linear regression model (class 'lm'). Required.
MSPE <- function(linear.model) {
  # calculate the MSPE = PRESS / sample size
  return(PRESS(linear.model)/length(residuals(linear.model)))
}
### This calculates the prediction R-squared
#' @title Predictive R-squared
#' @author Thomas Hopper
#' @description Returns the prediction R-squared. Requires the function PRESS(), which returns
#'              the PRESS statistic.
#' @param linear.model A linear regression model (class 'lm'). Required.
pred_r_squared <- function(linear.model) {
  # Use anova() to get the sums of squares for the linear model
  lm.anova <- anova(linear.model)
  # Calculate the total sum of squares
  tss <- sum(lm.anova$'Sum Sq')
  # Calculate the predictive R^2
  pred.r.squared <- 1 - PRESS(linear.model)/tss
  return(pred.r.squared)
}
MSPE(model1)
## [1] 18.4
PRESS(model1)
## [1] 459
pred_r_squared(model1)
## [1] 0.921
summary(model1)$r.squared
## [1] 0.96
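As a sanity check, the hat-value shortcut e_i/(1 - h_ii) used inside PRESS() can be verified against an explicit leave-one-out loop. This sketch recomputes the shortcut inline and uses the built-in mtcars data so it is self-contained.

```r
# LOO prediction residual shortcut for lm: e_i / (1 - h_ii), no refitting needed
fit <- lm(mpg ~ wt + hp, data = mtcars)
pr  <- residuals(fit) / (1 - lm.influence(fit)$hat)
# Explicit leave-one-out loop for comparison: refit n times, predict the
# held-out observation each time
loo <- sapply(seq_len(nrow(mtcars)), function(i) {
  f <- lm(mpg ~ wt + hp, data = mtcars[-i, ])
  mtcars$mpg[i] - predict(f, mtcars[i, , drop = FALSE])
})
all.equal(sum(loo^2), sum(pr^2))  # the two PRESS values agree
```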

End of example

Optional: Validation by collecting new observations

If we could obtain new data, we could use new data to validate the model.

Example: the delivery time data. Suppose we use the original data set to build a model.

d <- read.csv("data_delivery.csv",header=TRUE)
head(d)
##   Index DeliveryTime NumberofCases Distance
## 1     1         16.7             7      560
## 2     2         11.5             3      220
## 3     3         12.0             3      340
## 4     4         14.9             4       80
## 5     5         13.8             6      150
## 6     6         18.1             7      330
n = dim(d)[1]
n
## [1] 25

In this example, we have 25 observations in the original data set data_delivery.csv, from which we build a multiple linear regression model as follows.

model1 = lm(DeliveryTime ~ NumberofCases + Distance, data=d)
summary(model1)
## 
## Call:
## lm(formula = DeliveryTime ~ NumberofCases + Distance, data = d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.788 -0.663  0.436  1.157  7.420 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.34123    1.09673    2.13  0.04417 *  
## NumberofCases  1.61591    0.17073    9.46  3.3e-09 ***
## Distance       0.01438    0.00361    3.98  0.00063 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.26 on 22 degrees of freedom
## Multiple R-squared:  0.96,   Adjusted R-squared:  0.956 
## F-statistic:  261 on 2 and 22 DF,  p-value: 4.69e-16

It seems model1 works well for the observed data set data_delivery.csv. However, how do we validate model1? In other words, how will model1 perform for a newly collected data set? Or how well does model1 generalize beyond the observed data?

To validate the performance of model1, we can collect more observations. Suppose the newly collected observations are stored in data_delivery_additional.csv. Then we could validate model1 as follows.

dnew <- read.csv("data_delivery_additional.csv",header=TRUE)

We first make predictions for these new observations using the fitted model model1 and calculate the prediction errors.

prediction = predict(model1, dnew, interval = c("confidence"), level = 0.95, type="response")
head(prediction)
##    fit  lwr  upr
## 1 50.9 47.9 53.9
## 2 21.1 19.2 23.1
## 3 30.8 27.5 34.0
## 4 17.6 15.3 20.0
## 5 26.4 21.0 31.9
## 6 15.3 13.7 16.8
prediction_error = dnew$DeliveryTime-prediction[,1]
head(prediction_error)
##       1       2       3       4       5       6 
##  0.0905 -4.3327 -4.5914  2.2868 -2.4215  3.2767
# putting new observations, prediction and prediction error side by side.
head(cbind(dnew,prediction,prediction_error))
##   Index DeliveryTime NumberofCases Distance  fit  lwr  upr prediction_error
## 1    26         51.0            22      905 50.9 47.9 53.9           0.0905
## 2    27         16.8             7      520 21.1 19.2 23.1          -4.3327
## 3    28         26.2            15      290 30.8 27.5 34.0          -4.5914
## 4    29         19.9             5      500 17.6 15.3 20.0           2.2868
## 5    30         24.0             6     1000 26.4 21.0 31.9          -2.4215
## 6    31         18.6             6      225 15.3 13.7 16.8           3.2767

The whole idea is summarized in the figure below.

End of example

Once the predictions for the new observations are obtained, we can use the following criteria to validate the model.

Example: For the delivery time data set, we can calculate the mean squared prediction error (MSPE) as follows.

MSPE = sum( (dnew$DeliveryTime - prediction[,1])^2 ) / dim(dnew)[1]
MSPE
## [1] 22.1

We can calculate the predicted residual error sum of squares (PRESS) statistic as follows.

PRESS = sum((dnew$DeliveryTime-prediction[,1])^2)
PRESS
## [1] 332

We can calculate prediction R squared as follows.

pred_Rsq = 1-PRESS/sum((dnew$DeliveryTime-mean(dnew$DeliveryTime))^2)
pred_Rsq
## [1] 0.897

Once we have the prediction R squared pred_Rsq = 0.897, we can compare it with the traditional R squared summary(model1)$r.squared = 0.96. Once again, we see that the least-squares model model1 does not predict the new observations dnew quite as well as it fits the original data d. However, the loss in R squared is slight. Collecting new data thus indicates that the least-squares fit for the delivery time data yields a reasonably good prediction equation.

In addition, we can compare PRESS with the residual sum of squares (SSRes) from model1. To obtain the SSRes of model1, we use the following code:

sum(model1$residuals^2)
## [1] 234

Clearly, SSRes = 233.732 is lower than PRESS = 331.832, which means model1 predicts the new observations dnew less accurately than it fits the original observations d, but this is expected.

End of example