Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.

Model Adequacy Checking

What is model adequacy checking? Remember we made a few assumptions in Chapter 1 and 2? Model adequacy checking is essentially to make sure these assumptions are satisfied, therefore, our subsequent analysis is based on a correct model. More importantly, model adequacy checking will check to see if the model fits the data well.

Recall all the assumptions we have made for linear regression — LINE assumptions:

To check these assumptions, we mainly use residual analysis.

Residual and Residual Analysis

What is residual? We have \[e_i = y_i - \hat{y}_i\] Essentially, residual means: residual = data - fit. So the larger residual is, the poorer the fit is. Most of the LINE assumptions are checked by examing the residual through graphical tools (i.e., residual plots), hence residual analysis.

delivery <- read.csv("data_delivery.csv",h=T)
# build linear regression
model1 <- lm(DeliveryTime ~ NumberofCases+Distance, data=delivery)
# obtain residual
model1$residuals
##          1          2          3          4          5          6          7 
## -5.0280843  1.1463854 -0.0497937  4.9243539 -0.4443983 -0.2895743  0.8446235 
##          8          9         10         11         12         13         14 
##  1.1566049  7.4197062  2.3764129  2.2374930 -0.5930409  1.0270093  1.0675359 
##         15         16         17         18         19         20         21 
##  0.6712018 -0.6629284  0.4363603  3.4486213  1.7931935 -5.7879699 -2.6141789 
##         22         23         24         25 
## -3.6865279 -4.6075679 -4.5728535 -0.2125839
# obtain MSRes
MSRes=summary(model1)$sigma^2
MSRes
## [1] 10.62417
# obtain standardized residuals
standardized_res=model1$residuals/summary(model1)$sigma
standardized_res
##           1           2           3           4           5           6 
## -1.54260631  0.35170879 -0.01527661  1.51078203 -0.13634053 -0.08884082 
##           7           8           9          10          11          12 
##  0.25912883  0.35484408  2.27635117  0.72907878  0.68645843 -0.18194377 
##          13          14          15          16          17          18 
##  0.31508443  0.32751789  0.20592338 -0.20338513  0.13387449  1.05803019 
##          19          20          21          22          23          24 
##  0.55014821 -1.77573772 -0.80202492 -1.13101946 -1.41359270 -1.40294240 
##          25 
## -0.06522033
rocket <- read.csv("data_RocketProp.csv",h=T)
# build linear regression
model2 <- lm(y ~ x, data=rocket)

Summary of Residual Plots

QQ-Plot

QQ-plot is used to check for normality assumption. We first take all the residuals and calculate their percentiles (1%,…,99%) and plot these percentiles against these percentiles of standard normal distribution. The resulting plot is called QQ-plot. If the residuals are normally distributed, the QQ-plot should be around the 45 degree line.

par(mfrow=c(1,2))
# generate QQ plot
qqnorm(model1$residuals,main="model1")
qqline(model1$residuals)
qqnorm(model2$residuals,main="model2")
qqline(model2$residuals)

Scatter Plot of Residual Against Fitted Values or Regressors

This residual plot is used to check for linearity and equal variance assumption. We plot all the residuals against the fitted values or one of the regressors. If the linearity and equal variance assumptions are satisfied, we would expect the dots are evenly distributed around zero with constant variance across the x-axis.

# residual plot for delivery time data set
par(mfrow=c(1,3))
# generate residual plot, NumberofCases vs residuals
plot(delivery$NumberofCases,model1$residuals,pch=20)
abline(h=0,col="grey")
# generate residual plot, Distance vs residuals
plot(delivery$Distance,model1$residuals,pch=20)
abline(h=0,col="grey")
# generate residual plot, fitted values vs residual
plot(model1$fitted.values,model1$residuals,pch=20)
abline(h=0,col="grey")

# residual plot for rocket propellant data set
par(mfrow=c(1,2))
# generate residual plot, x vs residuals
plot(rocket$x,model2$residuals,pch=20)
abline(h=0,col="grey")
# generate residual plot, fitted values vs residual
plot(model2$fitted.values,model2$residuals,pch=20)
abline(h=0,col="grey")

Scatter Plot of Residual Against Time Order

In the case of time series regression, we can plot residuals against time order to check for independence assumption. If the residuals are independent, we would expect to see the dots are randomly distributed around zero with equal variance.

Other Types of Residuals

In addition to the traditional residuals we introduced above, there are many other types of residuals. For example

Standardized Residuals

For residuals, we can have its approximate variance as:

\[ \frac{ \sum_{i=1}^{n} (e_i -\bar e)^2} {n - k-1} = \frac {\sum_{i=1}^{n} e_i^2} {n - k-1}= \frac {SSRse} {n-k-1} = MSRes\]

This is called mean residual sum of squares (MSRes). So we can obtain standardized residuals as:

\[ d_i = \frac {e_i}{\sqrt{MSRes}}= \frac {e_i}{SD(e_i)}\]

To obtain standardized residuals, we do the following.

MSRes=summary(model1)$sigma^2
MSRes
## [1] 10.62417
# obtain standardized residuals
standardized_res=model1$residuals/summary(model1)$sigma
standardized_res
##           1           2           3           4           5           6 
## -1.54260631  0.35170879 -0.01527661  1.51078203 -0.13634053 -0.08884082 
##           7           8           9          10          11          12 
##  0.25912883  0.35484408  2.27635117  0.72907878  0.68645843 -0.18194377 
##          13          14          15          16          17          18 
##  0.31508443  0.32751789  0.20592338 -0.20338513  0.13387449  1.05803019 
##          19          20          21          22          23          24 
##  0.55014821 -1.77573772 -0.80202492 -1.13101946 -1.41359270 -1.40294240 
##          25 
## -0.06522033

PRESS Residual

Other than the traditional residual and standardized residuals, there are many other types. One of the important types is PRESS residual.

PRESS residuals logic

Example

set.seed(123)
n=5
x=runif(n)
y=1+2*x+rnorm(n)
plot(x,y,pch=c(15,20,20,20,20),cex=1.5)
m_full=lm(y~x)
m_rm1=lm(y[-1]~x[-1])
lines(c(x[1],x[1]),c(y[1],coef(m_rm1)[1]+coef(m_rm1)[2]*x[1]),lwd=4,col="grey")
lines(c(x[1],x[1]),c(y[1],m_full$fitted.values[1]))
abline(m_full,col="blue")
abline(m_rm1,col="red")

The blue line is the fitted linear regression using all the data points (both solid dots and square). The red line is the fitted linear regression using only solid dots, i.e., removing the square. The vertical distance between the circle and the blue line is the residual, \(e_i\), indicated by the black line. The vertical distance between the circle and the red line is the PRESS residual, \(e_{(i)}\), indicated by the grey line. Both of them are negative.

Calculate PRESS Residual

\[ e_{(i)} = \frac{e_i}{1-h_{ii}} \]

What is \(h_{ii}\)?

\[ h_{ii} = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{i=1}^{n} (x_i-\bar{x})^2} \]

PRESS Statstics

\[ e_{(i)} = y_i - \hat{y}_{(i)}\]

\[ e_{i} = y_i - \hat{y}_{i}\]

\[ PRESS = \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n} \Big( \frac{e_i}{1-h_{ii}} \Big)^2\]

\[ SSRes = \sum_{i=1}^{n} e_{i}^2 \]

\[R^2_{prediction}= 1- \frac {PRESS}{SS_T}\]

Examples

Rocket propellant example

# example
rocket <- read.csv("data_RocketProp.csv",h=T)
plot(rocket$x,rocket$y,pch=20)
model2 <- lm(y ~ x, data=rocket)
abline(model2,col="blue")

# residual
model2$residuals
##           1           2           3           4           5           6 
##  106.758301  -67.274574  -14.593631   65.088687 -215.977609 -213.604131 
##           7           8           9          10          11          12 
##   48.563824   40.061618    8.729573   37.567141   20.374323  -88.946393 
##          13          14          15          16          17          18 
##   80.817415   71.175153  -45.143358   94.442278    9.499187   37.097528 
##          19          20 
##  100.684823  -75.320154
# standardized residuals
standardized_res=model2$residuals/summary(model2)$sigma
# PRESS residuals
PRESS_res=model2$residuals/(1 - hatvalues(model2))
# plot all residual and leverage
# partition the canvas into 6 columns.
par(mfrow=c(1,4))
plot(rocket$x,model2$residuals,pch=20,ylab="residual",xlab="fitted value")
abline(h=0,col="grey")
plot(rocket$x,standardized_res,pch=20,ylab="standardized residual",xlab="fitted value")
abline(h=0,col="grey")
plot(rocket$x,PRESS_res,pch=20,ylab="PRESS residual",xlab="fitted value")
abline(h=0,col="grey")
plot(rocket$x,hatvalues(model2),pch=20,ylab="leverage",xlab="age")