Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. These materials were initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.
In the previous chapters, we assumed that \(y\) and \(x\) have a linear relationship. What if the relationship between \(y\) and \(x\) is nonlinear? To address this, we use polynomial regression.
Suppose we want to study the relationship between the hardwood concentration in pulp and the tensile strength of Kraft paper.
\(y\): tensile strength
\(x\): hardwood concentration
rm(list=ls())
Paper <- read.csv("data_Hardwood.csv", header=TRUE)
plot(Paper$HwdCon,Paper$TsStr,pch=20)
What do we do in this case? Polynomial regression.
Polynomial regression models the response variable \(y\) as a \(k\)th-degree polynomial function of the covariate \(x\).
For example:
A \(k\)th-order polynomial in one variable: \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k x_i^k + \epsilon_i \]
A second-order polynomial in one variable: \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \]
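In R, a second-order model like this can be fit with lm() using I() terms (as done throughout this section) or with the poly() helper. A minimal sketch on simulated data (the variables x and y here are made up purely for illustration):
# simulated data for illustration only
set.seed(1)
x <- runif(50, 0, 10)
y <- 2 + 1.5 * x - 0.2 * x^2 + rnorm(50, sd = 1)
# two equivalent ways to fit the second-order polynomial
fit_I    <- lm(y ~ x + I(x^2))              # explicit powers wrapped in I()
fit_poly <- lm(y ~ poly(x, 2, raw = TRUE))  # same model via poly()
# poly(x, 2) without raw = TRUE uses orthogonal polynomials:
# identical fitted values, but less correlated coefficient estimates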
Keep the order of the polynomial as low as possible. In other words, try to fit the data using the simplest model possible.
One approach is to fit the lowest-order polynomial first and build up from there (forward selection). To decide whether a polynomial of a given order fits the data adequately, examine the residual plot.
Extrapolation can be dangerous when the model is a higher-order polynomial: outside the observed range, the nature of the true underlying relationship may change or be completely different from the system that produced the data used to fit the model.
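A small illustration of that danger, using simulated data rather than the paper data: a high-order polynomial that fits well inside the observed range can behave very differently just outside it.
# simulated data for illustration only (not the Kraft paper data)
set.seed(2)
x <- seq(1, 10, length.out = 30)
y <- 10 + 2 * x - 0.15 * x^2 + rnorm(30, sd = 0.5)
fit2 <- lm(y ~ x + I(x^2))
fit4 <- lm(y ~ x + I(x^2) + I(x^3) + I(x^4))
# both fit well for 1 <= x <= 10, but their extrapolations at x = 13 can disagree sharply
predict(fit2, newdata = data.frame(x = 13))
predict(fit4, newdata = data.frame(x = 13))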
We first build a second-order polynomial regression for the hardwood concentration example.
# fit polynomial regression with order 2, note the function I()
model1 <- lm(TsStr ~ HwdCon+ I(HwdCon^2), data=Paper )
summary(model1)
##
## Call:
## lm(formula = TsStr ~ HwdCon + I(HwdCon^2), data = Paper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8503 -3.2482 -0.7267 4.1350 6.5506
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.67419 3.39971 -1.963 0.0673 .
## HwdCon 11.76401 1.00278 11.731 2.85e-09 ***
## I(HwdCon^2) -0.63455 0.06179 -10.270 1.89e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.42 on 16 degrees of freedom
## Multiple R-squared: 0.9085, Adjusted R-squared: 0.8971
## F-statistic: 79.43 on 2 and 16 DF, p-value: 4.912e-09
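As a quick use of this fit: the estimated quadratic opens downward, so the predicted strength peaks at \(-\hat{\beta}_1 / (2 \hat{\beta}_2)\). A sketch using the coefficients above:
# concentration at which the fitted quadratic peaks: -b1 / (2 * b2)
b <- coef(model1)
-b["HwdCon"] / (2 * b["I(HwdCon^2)"])   # about 9.3, in the units of HwdCon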
par(mfrow=c(1,2))
# plot data
plot(Paper$HwdCon,Paper$TsStr,pch=20)
# plot the fitted values using dots
points(Paper$HwdCon,model1$fitted.values,pch=20,col="blue")
# plot the fitted values using a line
points(Paper$HwdCon,model1$fitted.values,type="l",col="blue")
# plot the residuals
plot(Paper$HwdCon,model1$residuals,pch=20)
abline(h=0,col="grey")
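One caveat about the line drawn above: points(..., type="l") connects the fitted values in row order, so the curve looks smooth only if the rows happen to be sorted by HwdCon. A sketch that avoids this by evaluating the fit on a fine grid:
# draw the fitted curve on a fine grid of concentrations
newx <- data.frame(HwdCon = seq(min(Paper$HwdCon), max(Paper$HwdCon), length.out = 200))
plot(Paper$HwdCon, Paper$TsStr, pch=20)
lines(newx$HwdCon, predict(model1, newdata=newx), col="blue")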
# fit polynomial regression with order 3, note the function I()
model1b <- lm(TsStr ~ HwdCon + I(HwdCon^2) + I(HwdCon^3), data=Paper )
summary(model1b)
##
## Call:
## lm(formula = TsStr ~ HwdCon + I(HwdCon^2) + I(HwdCon^3), data = Paper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.6250 -1.6109 0.0413 1.5892 5.0216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.648395 2.954663 1.912 0.0752 .
## HwdCon 3.578489 1.565854 2.285 0.0373 *
## I(HwdCon^2) 0.653635 0.231330 2.826 0.0128 *
## I(HwdCon^3) -0.055188 0.009789 -5.638 4.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.585 on 15 degrees of freedom
## Multiple R-squared: 0.9707, Adjusted R-squared: 0.9648
## F-statistic: 165.4 on 3 and 15 DF, p-value: 1.025e-11
par(mfrow=c(1,2))
# plot data
plot(Paper$HwdCon,Paper$TsStr,pch=20)
# plot the fitted values using dots
points(Paper$HwdCon,model1b$fitted.values,pch=20,col="blue")
# plot the fitted values using a line
points(Paper$HwdCon,model1b$fitted.values,type="l",col="blue")
# plot the residuals
plot(Paper$HwdCon,model1b$residuals,pch=20)
abline(h=0,col="grey")
# fit polynomial regression with order 4, note the function I()
model1c <- lm(TsStr ~ HwdCon + I(HwdCon^2) + I(HwdCon^3) + I(HwdCon^4), data=Paper )
summary(model1c)
##
## Call:
## lm(formula = TsStr ~ HwdCon + I(HwdCon^2) + I(HwdCon^3) + I(HwdCon^4),
## data = Paper)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1384 -1.0550 -0.3203 1.0779 4.5030
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.070958 4.679866 0.229 0.8223
## HwdCon 8.049481 3.902170 2.063 0.0582 .
## I(HwdCon^2) -0.517352 0.966390 -0.535 0.6008
## I(HwdCon^3) 0.056575 0.090164 0.627 0.5405
## I(HwdCon^4) -0.003505 0.002812 -1.247 0.2330
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.539 on 14 degrees of freedom
## Multiple R-squared: 0.9736, Adjusted R-squared: 0.9661
## F-statistic: 129.1 on 4 and 14 DF, p-value: 6.994e-11
par(mfrow=c(1,2))
# plot data
plot(Paper$HwdCon,Paper$TsStr,pch=20)
# plot the fitted values using dots
points(Paper$HwdCon,model1c$fitted.values,pch=20,col="blue")
# plot the fitted values using a line
points(Paper$HwdCon,model1c$fitted.values,type="l",col="blue")
# plot the residuals
plot(Paper$HwdCon,model1c$residuals,pch=20)
abline(h=0,col="grey")
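To make the forward-selection step explicit, the nested fits can also be compared with partial F-tests; a sketch (output not shown here):
# partial F-tests for adding the cubic and then the quartic term
anova(model1, model1b)   # second-order vs. third-order
anova(model1b, model1c)  # third-order vs. fourth-order
Consistent with the summaries above, the added cubic term is highly significant, while the added quartic term is not (p ≈ 0.23), so the third-order model is a reasonable stopping point.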
Suppose we want to study the relationship between voltage drop and time.
\(y\): voltage drop
\(x\): time
Voltage <- read.csv("data_VoltageDrop.csv", header=TRUE)
# visualize data
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)
# build polynomial regression
model2 <- lm(VoltageDrop ~ Time + I(Time^2), data=Voltage)
summary(model2)
##
## Call:
## lm(formula = VoltageDrop ~ Time + I(Time^2), data = Voltage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92769 -0.83047 -0.05291 0.63444 3.06431
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.265695 0.480701 10.95 2.55e-13 ***
## Time 1.487169 0.111192 13.38 5.97e-16 ***
## I(Time^2) -0.065198 0.005375 -12.13 1.23e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.076 on 38 degrees of freedom
## Multiple R-squared: 0.8324, Adjusted R-squared: 0.8236
## F-statistic: 94.35 on 2 and 38 DF, p-value: 1.829e-15
# plot fitted values
par(mfrow=c(1,2))
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)
points(Voltage$Time,model2$fitted.values,pch=20,col="blue")
points(Voltage$Time,model2$fitted.values,type="l",col="blue")
plot(Voltage$Time,model2$residuals)
abline(h=0,col="grey")
# build polynomial regression
model2b <- lm(VoltageDrop ~ Time + I(Time^2) + I(Time^3), data=Voltage)
summary(model2b)
##
## Call:
## lm(formula = VoltageDrop ~ Time + I(Time^2) + I(Time^3), data = Voltage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.3503 -0.7340 -0.1859 0.6440 1.8390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.4910163 0.5336473 12.163 1.71e-14 ***
## Time 0.7031952 0.2339552 3.006 0.004738 **
## I(Time^2) 0.0340179 0.0273762 1.243 0.221829
## I(Time^3) -0.0033072 0.0008992 -3.678 0.000743 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9335 on 37 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8673
## F-statistic: 88.14 on 3 and 37 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)
points(Voltage$Time,model2b$fitted.values,pch=20,col="blue")
points(Voltage$Time,model2b$fitted.values,type="l",col="blue")
plot(Voltage$Time,model2b$residuals)
abline(h=0,col="grey")
# build polynomial regression
model2c <- lm(VoltageDrop ~ Time + I(Time^2) + I(Time^3) + I(Time^4), data=Voltage)
summary(model2c)
##
## Call:
## lm(formula = VoltageDrop ~ Time + I(Time^2) + I(Time^3) + I(Time^4),
## data = Voltage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.45087 -0.17792 0.00618 0.16770 0.55380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.496e+00 1.750e-01 48.55 < 2e-16 ***
## Time -1.553e+00 1.244e-01 -12.48 1.23e-14 ***
## I(Time^2) 5.563e-01 2.576e-02 21.59 < 2e-16 ***
## I(Time^3) -4.426e-02 1.947e-03 -22.73 < 2e-16 ***
## I(Time^4) 1.024e-03 4.827e-05 21.21 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2576 on 36 degrees of freedom
## Multiple R-squared: 0.9909, Adjusted R-squared: 0.9899
## F-statistic: 980.3 on 4 and 36 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)
points(Voltage$Time,model2c$fitted.values,pch=20,col="blue")
points(Voltage$Time,model2c$fitted.values,type="l",col="blue")
plot(Voltage$Time,model2c$residuals)
abline(h=0,col="grey")
# build polynomial regression
model2d <- lm(VoltageDrop ~ Time + I(Time^2) + I(Time^3) + I(Time^4) + I(Time^5), data=Voltage)
summary(model2d)
##
## Call:
## lm(formula = VoltageDrop ~ Time + I(Time^2) + I(Time^3) + I(Time^4) +
## I(Time^5), data = Voltage)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4508 -0.1892 -0.0095 0.1718 0.5617
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.465e+00 2.004e-01 42.234 < 2e-16 ***
## Time -1.496e+00 2.112e-01 -7.084 2.98e-08 ***
## I(Time^2) 5.356e-01 6.751e-02 7.933 2.49e-09 ***
## I(Time^3) -4.144e-02 8.677e-03 -4.776 3.15e-05 ***
## I(Time^4) 8.644e-04 4.806e-04 1.799 0.0807 .
## I(Time^5) 3.188e-06 9.561e-06 0.333 0.7408
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2609 on 35 degrees of freedom
## Multiple R-squared: 0.9909, Adjusted R-squared: 0.9896
## F-statistic: 764.9 on 5 and 35 DF, p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)
points(Voltage$Time,model2d$fitted.values,pch=20,col="blue")
points(Voltage$Time,model2d$fitted.values,type="l",col="blue")
plot(Voltage$Time,model2d$residuals)
abline(h=0,col="grey")
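The same kind of comparison applies to the voltage-drop fits; a sketch (output not shown here):
# sequential partial F-tests across the nested voltage-drop models
anova(model2, model2b, model2c, model2d)
Judging from the summaries above, the quartic term added in model2c is strongly significant, while the fifth-order term added in model2d is not (p ≈ 0.74), so the fourth-order model is a reasonable choice.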