Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.
What is model adequacy checking? Remember we made a few assumptions in Chapter 1 and 2? Model adequacy checking is essentially to make sure these assumptions are satisfied, therefore, our subsequent analysis is based on a correct model. More importantly, model adequacy checking will check to see if the model fits the data well.
Recall all the assumptions we have made for linear regression — LINE assumptions:
Relationship between response and regressors is Linear (at least approximately).
Errors are Independent (i.e., uncorrelated)
Errors are Normally distributed.
Error term,has Equal variance (i.e., constant variance).
To check these assumptions, we mainly use residual analysis.
What is residual? We have \[e_i = y_i - \hat{y}_i\] Essentially, residual means: residual = data - fit. So the larger residual is, the poorer the fit is. Most of the LINE assumptions are checked by examing the residual through graphical tools (i.e., residual plots), hence residual analysis.
delivery <- read.csv("data_delivery.csv",h=T)
# build linear regression
model1 <- lm(DeliveryTime ~ NumberofCases+Distance, data=delivery)
# obtain residual
model1$residuals
## 1 2 3 4 5 6 7
## -5.0280843 1.1463854 -0.0497937 4.9243539 -0.4443983 -0.2895743 0.8446235
## 8 9 10 11 12 13 14
## 1.1566049 7.4197062 2.3764129 2.2374930 -0.5930409 1.0270093 1.0675359
## 15 16 17 18 19 20 21
## 0.6712018 -0.6629284 0.4363603 3.4486213 1.7931935 -5.7879699 -2.6141789
## 22 23 24 25
## -3.6865279 -4.6075679 -4.5728535 -0.2125839
# obtain MSRes
MSRes=summary(model1)$sigma^2
MSRes
## [1] 10.62417
# obtain standardized residuals
standardized_res=model1$residuals/summary(model1)$sigma
standardized_res
## 1 2 3 4 5 6
## -1.54260631 0.35170879 -0.01527661 1.51078203 -0.13634053 -0.08884082
## 7 8 9 10 11 12
## 0.25912883 0.35484408 2.27635117 0.72907878 0.68645843 -0.18194377
## 13 14 15 16 17 18
## 0.31508443 0.32751789 0.20592338 -0.20338513 0.13387449 1.05803019
## 19 20 21 22 23 24
## 0.55014821 -1.77573772 -0.80202492 -1.13101946 -1.41359270 -1.40294240
## 25
## -0.06522033
rocket <- read.csv("data_RocketProp.csv",h=T)
# build linear regression
model2 <- lm(y ~ x, data=rocket)
QQ-plot is used to check for normality assumption. We first take all the residuals and calculate their percentiles (1%,…,99%) and plot these percentiles against these percentiles of standard normal distribution. The resulting plot is called QQ-plot. If the residuals are normally distributed, the QQ-plot should be around the 45 degree line.
par(mfrow=c(1,2))
# generate QQ plot
qqnorm(model1$residuals,main="model1")
qqline(model1$residuals)
qqnorm(model2$residuals,main="model2")
qqline(model2$residuals)
This residual plot is used to check for linearity and equal variance assumption. We plot all the residuals against the fitted values or one of the regressors. If the linearity and equal variance assumptions are satisfied, we would expect the dots are evenly distributed around zero with constant variance across the x-axis.
# residual plot for delivery time data set
par(mfrow=c(1,3))
# generate residual plot, NumberofCases vs residuals
plot(delivery$NumberofCases,model1$residuals,pch=20)
abline(h=0,col="grey")
# generate residual plot, Distance vs residuals
plot(delivery$Distance,model1$residuals,pch=20)
abline(h=0,col="grey")
# generate residual plot, fitted values vs residual
plot(model1$fitted.values,model1$residuals,pch=20)
abline(h=0,col="grey")
# residual plot for rocket propellant data set
par(mfrow=c(1,2))
# generate residual plot, x vs residuals
plot(rocket$x,model2$residuals,pch=20)
abline(h=0,col="grey")
# generate residual plot, fitted values vs residual
plot(model2$fitted.values,model2$residuals,pch=20)
abline(h=0,col="grey")
In the case of time series regression, we can plot residuals against time order to check for independence assumption. If the residuals are independent, we would expect to see the dots are randomly distributed around zero with equal variance.
In addition to the traditional residuals we introduced above, there are many other types of residuals. For example
Standardized residuals
PRESS residuals
For residuals, we can have its approximate variance as:
\[ \frac{ \sum_{i=1}^{n} (e_i -\bar e)^2} {n - k-1} = \frac {\sum_{i=1}^{n} e_i^2} {n - k-1}= \frac {SSRse} {n-k-1} = MSRes\]
This is called mean residual sum of squares (MSRes). So we can obtain standardized residuals as:
\[ d_i = \frac {e_i}{\sqrt{MSRes}}= \frac {e_i}{SD(e_i)}\]
summary(model1)$sigma^2
. To obtain \(SD(e_i)\), use summary(model1)$sigma
.To obtain standardized residuals, we do the following.
MSRes=summary(model1)$sigma^2
MSRes
## [1] 10.62417
# obtain standardized residuals
standardized_res=model1$residuals/summary(model1)$sigma
standardized_res
## 1 2 3 4 5 6
## -1.54260631 0.35170879 -0.01527661 1.51078203 -0.13634053 -0.08884082
## 7 8 9 10 11 12
## 0.25912883 0.35484408 2.27635117 0.72907878 0.68645843 -0.18194377
## 13 14 15 16 17 18
## 0.31508443 0.32751789 0.20592338 -0.20338513 0.13387449 1.05803019
## 19 20 21 22 23 24
## 0.55014821 -1.77573772 -0.80202492 -1.13101946 -1.41359270 -1.40294240
## 25
## -0.06522033
Other than the traditional residual and standardized residuals, there are many other types. One of the important types is PRESS residual.
If the ith point is unusual, then it can “overly” influence the regression model.
If the ith point is used in fitting the model, then the residual for the ith point will be impacted by the ith point.
If the ith point is not used in fitting the model, then the residual will better reflect how unusual that point is.
set.seed(123)
n=5
x=runif(n)
y=1+2*x+rnorm(n)
plot(x,y,pch=c(15,20,20,20,20),cex=1.5)
m_full=lm(y~x)
m_rm1=lm(y[-1]~x[-1])
lines(c(x[1],x[1]),c(y[1],coef(m_rm1)[1]+coef(m_rm1)[2]*x[1]),lwd=4,col="grey")
lines(c(x[1],x[1]),c(y[1],m_full$fitted.values[1]))
abline(m_full,col="blue")
abline(m_rm1,col="red")
The blue line is the fitted linear regression using all the data points (both solid dots and square). The red line is the fitted linear regression using only solid dots, i.e., removing the square. The vertical distance between the circle and the blue line is the residual, \(e_i\), indicated by the black line. The vertical distance between the circle and the red line is the PRESS residual, \(e_{(i)}\), indicated by the grey line. Both of them are negative.
\[ e_{(i)} = \frac{e_i}{1-h_{ii}} \]
\(h_{ii}\) is called leverage.
In simple linear regression case:
\[ h_{ii} = \frac{1}{n} + \frac{(x_i-\bar{x})^2}{\sum_{i=1}^{n} (x_i-\bar{x})^2} \]
The second part of right hand side of the above equation represents how far the data point is away from the center of the data set.
In R, use hatvalues(model)
.
\[ e_{(i)} = y_i - \hat{y}_{(i)}\]
\[ e_{i} = y_i - \hat{y}_{i}\]
\[ PRESS = \sum_{i=1}^{n} e_{(i)}^2 = \sum_{i=1}^{n} \Big( \frac{e_i}{1-h_{ii}} \Big)^2\]
\[ SSRes = \sum_{i=1}^{n} e_{i}^2 \]
A small value of the PRESS Statistic is desired.
R-squared for Prediction Based on PRESS
\[R^2_{prediction}= 1- \frac {PRESS}{SS_T}\]
Rocket propellant example
# example
rocket <- read.csv("data_RocketProp.csv",h=T)
plot(rocket$x,rocket$y,pch=20)
model2 <- lm(y ~ x, data=rocket)
abline(model2,col="blue")
# residual
model2$residuals
## 1 2 3 4 5 6
## 106.758301 -67.274574 -14.593631 65.088687 -215.977609 -213.604131
## 7 8 9 10 11 12
## 48.563824 40.061618 8.729573 37.567141 20.374323 -88.946393
## 13 14 15 16 17 18
## 80.817415 71.175153 -45.143358 94.442278 9.499187 37.097528
## 19 20
## 100.684823 -75.320154
# standardized residuals
standardized_res=model2$residuals/summary(model2)$sigma
# PRESS residuals
PRESS_res=model2$residuals/(1 - hatvalues(model2))
# plot all residual and leverage
# partition the canvas into 6 columns.
par(mfrow=c(1,4))
plot(rocket$x,model2$residuals,pch=20,ylab="residual",xlab="fitted value")
abline(h=0,col="grey")
plot(rocket$x,standardized_res,pch=20,ylab="standardized residual",xlab="fitted value")
abline(h=0,col="grey")
plot(rocket$x,PRESS_res,pch=20,ylab="PRESS residual",xlab="fitted value")
abline(h=0,col="grey")
plot(rocket$x,hatvalues(model2),pch=20,ylab="leverage",xlab="age")