Chapter 4 Model Adequacy Checking

Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.

Model Adequacy Checking

What is model adequacy checking? Remember we made a few assumptions in Chapter 1 and 2? Model adequacy checking is essentially to make sure these assumptions are satisfied, therefore, our subsequent analysis is based on a correct model. More importantly, model adequacy checking will check to see if the model fits the data well.

Recall all the assumptions we have made for linear regression — LINE assumptions:

Relationship between response and regressors is Linear (at least approximately).
Errors are Independent (i.e., uncorrelated)
Errors are Normally distributed.
Error term,has Equal variance (i.e., constant variance).

To check these assumptions, we mainly use residual analysis.

Residual and Residual Analysis

What is residual? We have \[e_i = y_i - \hat{y}_i\] Essentially, residual means: residual = data - fit. So the larger residual is, the poorer the fit is. Most of the LINE assumptions are checked by examing the residual through graphical tools (i.e., residual plots), hence residual analysis.

delivery <- read.csv("data_delivery.csv",h=T)
# build linear regression
model1 <- lm(DeliveryTime ~ NumberofCases+Distance, data=delivery)
# obtain residual
model1$residuals

##          1          2          3          4          5          6          7 
## -5.0280843  1.1463854 -0.0497937  4.9243539 -0.4443983 -0.2895743  0.8446235 
##          8          9         10         11         12         13         14 
##  1.1566049  7.4197062  2.3764129  2.2374930 -0.5930409  1.0270093  1.0675359 
##         15         16         17         18         19         20         21 
##  0.6712018 -0.6629284  0.4363603  3.4486213  1.7931935 -5.7879699 -2.6141789 
##         22         23         24         25 
## -3.6865279 -4.6075679 -4.5728535 -0.2125839

# obtain MSRes
MSRes=summary(model1)$sigma^2
MSRes

## [1] 10.62417

# obtain standardized residuals
standardized_res=model1$residuals/summary(model1)$sigma
standardized_res

##           1           2           3           4           5           6 
## -1.54260631  0.35170879 -0.01527661  1.51078203 -0.13634053 -0.08884082 
##           7           8           9          10          11          12 
##  0.25912883  0.35484408  2.27635117  0.72907878  0.68645843 -0.18194377 
##          13          14          15          16          17          18 
##  0.31508443  0.32751789  0.20592338 -0.20338513  0.13387449  1.05803019 
##          19          20          21          22          23          24 
##  0.55014821 -1.77573772 -0.80202492 -1.13101946 -1.41359270 -1.40294240 
##          25 
## -0.06522033

rocket <- read.csv("data_RocketProp.csv",h=T)
# build linear regression
model2 <- lm(y ~ x, data=rocket)

Summary of Residual Plots

QQ-plot of residuals:
- Checks the Normality assumption
- Look for potential outliers (part of normality assumption)
Scatter plot of residuals against fitted values or against regressors:
- Checks for Equal variance
- Checks for Linearity
- Look for potential outliers (part of normality assumption)
Scatter plot of residuals against time order
- Check for Independent errors

QQ-Plot

QQ-plot is used to check for normality assumption. We first take all the residuals and calculate their percentiles (1%,…,99%) and plot these percentiles against these percentiles of standard normal distribution. The resulting plot is called QQ-plot. If the residuals are normally distributed, the QQ-plot should be around the 45 degree line.

par(mfrow=c(1,2))
# generate QQ plot
qqnorm(model1$residuals,main="model1")
qqline(model1$residuals)
qqnorm(model2$residuals,main="model2")
qqline(model2$residuals)

Scatter Plot of Residual Against Fitted Values or Regressors

This residual plot is used to check for linearity and equal variance assumption. We plot all the residuals against the fitted values or one of the regressors. If the linearity and equal variance assumptions are satisfied, we would expect the dots are evenly distributed around zero with constant variance across the x-axis.

# residual plot for delivery time data set
par(mfrow=c(1,3))
# generate residual plot, NumberofCases vs residuals
plot(delivery$NumberofCases,model1$residuals,pch=20)
abline(h=0,col="grey")
# generate residual plot, Distance vs residuals
plot(delivery$Distance,model1$residuals,pch=20)
abline(h=0,col="grey")
# generate residual plot, fitted values vs residual
plot(model1$fitted.values,model1$residuals,pch=20)
abline(h=0,col="grey")

# residual plot for rocket propellant data set
par(mfrow=c(1,2))
# generate residual plot, x vs residuals
plot(rocket$x,model2$residuals,pch=20)
abline(h=0,col="grey")
# generate residual plot, fitted values vs residual
plot(model2$fitted.values,model2$residuals,pch=20)
abline(h=0,col="grey")

Scatter Plot of Residual Against Time Order

In the case of time series regression, we can plot residuals against time order to check for independence assumption. If the residuals are independent, we would expect to see the dots are randomly distributed around zero with equal variance.

Other Types of Residuals

In addition to the traditional residuals we introduced above, there are many other types of residuals. For example

Standardized residuals
PRESS residuals

Standardized Residuals

For residuals, we can have its approximate variance as:

\[ \frac{ \sum_{i=1}^{n} (e_i -\bar e)^2} {n - k-1} = \frac {\sum_{i=1}^{n} e_i^2} {n - k-1}= \frac {SSRse} {n-k-1} = MSRes\]

This is called mean residual sum of squares (MSRes). So we can obtain standardized residuals as:

\[ d_i = \frac {e_i}{\sqrt{MSRes}}= \frac {e_i}{SD(e_i)}\]

Standardized residuals have mean zero and variance approximately equal to 1.
Large values of standardized residuals ($|d_i|>3$) may indicate outliers.
In R, to obtain $MSRes$, use summary(model1)$sigma^2. To obtain $SD(e_i)$, use summary(model1)$sigma.

To obtain standardized residuals, we do the following.

MSRes=summary(model1)$sigma^2
MSRes

## [1] 10.62417

# obtain standardized residuals
standardized_res=model1$residuals/summary(model1)$sigma
standardized_res

##           1           2           3           4           5           6 
## -1.54260631  0.35170879 -0.01527661  1.51078203 -0.13634053 -0.08884082 
##           7           8           9          10          11          12 
##  0.25912883  0.35484408  2.27635117  0.72907878  0.68645843 -0.18194377 
##          13          14          15          16          17          18 
##  0.31508443  0.32751789  0.20592338 -0.20338513  0.13387449  1.05803019 
##          19          20          21          22          23          24 
##  0.55014821 -1.77573772 -0.80202492 -1.13101946 -1.41359270 -1.40294240 
##          25 
## -0.06522033

PRESS Residual

Other than the traditional residual and standardized residuals, there are many other types. One of the important types is PRESS residual.

PRESS residuals (predicted residual sum of squares) is the differences between $y_i$ and $\hat{y}_{(i)}$, i.e., \[ e_{(i)} = y_i - \hat{y}_{(i)} \]
These are the differences between the actual response for the $i$-th data point and the fitted value of the response for the $i$-th data point, using all observations except the $i$-th one.

PRESS residuals logic

If the ith point is unusual, then it can “overly” influence the regression model.
If the ith point is used in fitting the model, then the residual for the ith point will be impacted by the ith point.
If the ith point is not used in fitting the model, then the residual will better reflect how unusual that point is.

Example

set.seed(123)
n=5
x=runif(n)
y=1+2*x+rnorm(n)
plot(x,y,pch=c(15,20,20,20,20),cex=1.5)
m_full=lm(y~x)
m_rm1=lm(y[-1]~x[-1])
lines(c(x[1],x[1]),c(y[1],coef(m_rm1)[1]+coef(m_rm1)[2]*x[1]),lwd=4,col="grey")
lines(c(x[1],x[1]),c(y[1],m_full$fitted.values[1]))
abline(m_full,col="blue")
abline(m_rm1,col="red")

The blue line is the fitted linear regression using all the data points (both solid dots and square). The red line is the fitted linear regression using only solid dots, i.e., removing the square. The vertical distance between the circle and the blue line is the residual, $e_i$, indicated by the black line. The vertical distance between the circle and the red line is the PRESS residual, $e_{(i)}$, indicated by the grey line. Both of them are negative.

Calculate PRESS Residual

\[ e_{(i)} = \frac{e_i}{1-h_{ii}} \]

What is $h_{ii}$?

$h_{ii}$ is called leverage.
In simple linear regression case:

\[ h_{ii} = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{i=1}^{n} (x_i-\bar{x})^2} \]

The second part of right hand side of the above equation represents how far the data point is away from the center of the data set.
If $h_{ii}$ is large, a high leverage point (potential to affect to linear regression).
If $h_{ii}$ is small, a low leverage point.
In R, use hatvalues(model).

PRESS Statstics

PRESS Residual: Prediction Error Sum of Squares (PRESS) Statistic:

\[ e_{(i)} = y_i - \hat{y}_{(i)}\]

Compares to residual:

\[ e_{i} = y_i - \hat{y}_{i}\]

\[ PRESS = \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n} \Big( \frac{e_i}{1-h_{ii}} \Big)^2\]

\[ SSRes = \sum_{i=1}^{n} e_{i}^2 \]

A small value of the PRESS Statistic is desired.
R-squared for Prediction Based on PRESS

\[R^2_{prediction}= 1- \frac {PRESS}{SS_T}\]

Interpretation: We expect the model to explain about R2% of the variability in prediction of a new observation.
PRESS is a valuable statistic for comparison of models.
Recall the traditional \[ R^2 = 1- \frac{SSRes}{SST}\]

Examples

Rocket propellant example

# example
rocket <- read.csv("data_RocketProp.csv",h=T)
plot(rocket$x,rocket$y,pch=20)
model2 <- lm(y ~ x, data=rocket)
abline(model2,col="blue")

# residual
model2$residuals

##           1           2           3           4           5           6 
##  106.758301  -67.274574  -14.593631   65.088687 -215.977609 -213.604131 
##           7           8           9          10          11          12 
##   48.563824   40.061618    8.729573   37.567141   20.374323  -88.946393 
##          13          14          15          16          17          18 
##   80.817415   71.175153  -45.143358   94.442278    9.499187   37.097528 
##          19          20 
##  100.684823  -75.320154

# standardized residuals
standardized_res=model2$residuals/summary(model2)$sigma
# PRESS residuals
PRESS_res=model2$residuals/(1 - hatvalues(model2))
# plot all residual and leverage
# partition the canvas into 6 columns.
par(mfrow=c(1,4))
plot(rocket$x,model2$residuals,pch=20,ylab="residual",xlab="fitted value")
abline(h=0,col="grey")
plot(rocket$x,standardized_res,pch=20,ylab="standardized residual",xlab="fitted value")
abline(h=0,col="grey")
plot(rocket$x,PRESS_res,pch=20,ylab="PRESS residual",xlab="fitted value")
abline(h=0,col="grey")
plot(rocket$x,hatvalues(model2),pch=20,ylab="leverage",xlab="age")