Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.

Nonlinear Relationship

In the previous chapters, we are assuming that \(y\) and \(x\) are in a linear relationship. What if \(y\) and \(x\) are in a nonlinear relationship? To address this issue, we need polynomial regressions.

Example

Suppose we want to study the relationship between the hardwood concentration in pulp and tensile strength of Kraft Paper.

\(y\): tensile strength

\(x\): hardwood concentration

rm(list=ls())
Paper <- read.csv("data_Hardwood.csv",h=T)
plot(Paper$HwdCon,Paper$TsStr,pch=20)

What do we do in this case? Polynomial regressions.

Polynomial Regression

polynomial regression models the response variable \(y\) as a function of \(k\)th degree polynomial of the covariate \(x\).

For example:

A \(k\)th-order polynomial in one variable: \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 ... + \beta_k x_i^k + \epsilon_i \]

A second-order polynomial in one variable: \[ y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \epsilon_i \]

Keep the Order Low

Keep the order of the regression as low as possible. In other words, try to fit the data using the simplest model possible.

One approach is fitting the lowest order polynomial possible and build up (forward selection). To decide whether the polynomial regression fits the data sufficiently, examine the residual plot.

Extrapolation

Extrapolation can be dangerous when the model is a higher-order polynomial. The nature of the true underlying relationship may change or be completely different than the system that produced the data used to fit the model.

Example

We build a 2nd-order polynomial regression for this hardwood concentration example.

# fit polynomial regression with order 2, note the function I()
model1 <- lm(TsStr ~ HwdCon+ I(HwdCon^2), data=Paper ) 
summary(model1)
## 
## Call:
## lm(formula = TsStr ~ HwdCon + I(HwdCon^2), data = Paper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8503 -3.2482 -0.7267  4.1350  6.5506 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -6.67419    3.39971  -1.963   0.0673 .  
## HwdCon      11.76401    1.00278  11.731 2.85e-09 ***
## I(HwdCon^2) -0.63455    0.06179 -10.270 1.89e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.42 on 16 degrees of freedom
## Multiple R-squared:  0.9085, Adjusted R-squared:  0.8971 
## F-statistic: 79.43 on 2 and 16 DF,  p-value: 4.912e-09
par(mfrow=c(1,2))
# plot data
plot(Paper$HwdCon,Paper$TsStr,pch=20)
# plot the fitted values using dots
points(Paper$HwdCon,model1$fitted.values,pch=20,col="blue")
# plot the fitted values using a line
points(Paper$HwdCon,model1$fitted.values,type="l",col="blue")
# plot the fitted values using a line
plot(Paper$HwdCon,model1$residuals,pch=20)
abline(h=0,col="grey")

# fit polynomial regression with order 2, note the function I()
model1b <- lm(TsStr ~ HwdCon + I(HwdCon^2) + I(HwdCon^3), data=Paper ) 
summary(model1b)
## 
## Call:
## lm(formula = TsStr ~ HwdCon + I(HwdCon^2) + I(HwdCon^3), data = Paper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.6250 -1.6109  0.0413  1.5892  5.0216 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.648395   2.954663   1.912   0.0752 .  
## HwdCon       3.578489   1.565854   2.285   0.0373 *  
## I(HwdCon^2)  0.653635   0.231330   2.826   0.0128 *  
## I(HwdCon^3) -0.055188   0.009789  -5.638 4.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.585 on 15 degrees of freedom
## Multiple R-squared:  0.9707, Adjusted R-squared:  0.9648 
## F-statistic: 165.4 on 3 and 15 DF,  p-value: 1.025e-11
par(mfrow=c(1,2))
# plot data
plot(Paper$HwdCon,Paper$TsStr,pch=20)
# plot the fitted values using dots
points(Paper$HwdCon,model1b$fitted.values,pch=20,col="blue")
# plot the fitted values using a line
points(Paper$HwdCon,model1b$fitted.values,type="l",col="blue")
# plot the fitted values using a line
plot(Paper$HwdCon,model1b$residuals,pch=20)
abline(h=0,col="grey")

# fit polynomial regression with order 2, note the function I()
model1c <- lm(TsStr ~ HwdCon + I(HwdCon^2) + I(HwdCon^3) + I(HwdCon^4), data=Paper ) 
summary(model1c)
## 
## Call:
## lm(formula = TsStr ~ HwdCon + I(HwdCon^2) + I(HwdCon^3) + I(HwdCon^4), 
##     data = Paper)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1384 -1.0550 -0.3203  1.0779  4.5030 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  1.070958   4.679866   0.229   0.8223  
## HwdCon       8.049481   3.902170   2.063   0.0582 .
## I(HwdCon^2) -0.517352   0.966390  -0.535   0.6008  
## I(HwdCon^3)  0.056575   0.090164   0.627   0.5405  
## I(HwdCon^4) -0.003505   0.002812  -1.247   0.2330  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.539 on 14 degrees of freedom
## Multiple R-squared:  0.9736, Adjusted R-squared:  0.9661 
## F-statistic: 129.1 on 4 and 14 DF,  p-value: 6.994e-11
par(mfrow=c(1,2))
# plot data
plot(Paper$HwdCon,Paper$TsStr,pch=20)
# plot the fitted values using dots
points(Paper$HwdCon,model1c$fitted.values,pch=20,col="blue")
# plot the fitted values using a line
points(Paper$HwdCon,model1c$fitted.values,type="l",col="blue")
# plot the fitted values using a line
plot(Paper$HwdCon,model1c$residuals,pch=20)
abline(h=0,col="grey")

Another Example

Suppose we want to study the relationship between the voltage drop and and time.

\(y\): voltage drop

\(x\): time

Voltage <- read.csv("data_VoltageDrop.csv",h=T)
# visualize data
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)

# build polynomial regression
model2=lm(VoltageDrop~Time+I(Time^2),data=Voltage)
summary(model2)
## 
## Call:
## lm(formula = VoltageDrop ~ Time + I(Time^2), data = Voltage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.92769 -0.83047 -0.05291  0.63444  3.06431 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.265695   0.480701   10.95 2.55e-13 ***
## Time         1.487169   0.111192   13.38 5.97e-16 ***
## I(Time^2)   -0.065198   0.005375  -12.13 1.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.076 on 38 degrees of freedom
## Multiple R-squared:  0.8324, Adjusted R-squared:  0.8236 
## F-statistic: 94.35 on 2 and 38 DF,  p-value: 1.829e-15
# plot fitted values
par(mfrow=c(1,2))
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)
points(Voltage$Time,model2$fitted.values,pch=20,col="blue")
points(Voltage$Time,model2$fitted.values,type="l",col="blue")
plot(Voltage$Time,model2$residuals)
abline(h=0,col="grey")

# build polynomial regression
model2b=lm(VoltageDrop~Time+I(Time^2)+I(Time^3),data=Voltage)
summary(model2b)
## 
## Call:
## lm(formula = VoltageDrop ~ Time + I(Time^2) + I(Time^3), data = Voltage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3503 -0.7340 -0.1859  0.6440  1.8390 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.4910163  0.5336473  12.163 1.71e-14 ***
## Time         0.7031952  0.2339552   3.006 0.004738 ** 
## I(Time^2)    0.0340179  0.0273762   1.243 0.221829    
## I(Time^3)   -0.0033072  0.0008992  -3.678 0.000743 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9335 on 37 degrees of freedom
## Multiple R-squared:  0.8773, Adjusted R-squared:  0.8673 
## F-statistic: 88.14 on 3 and 37 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)
points(Voltage$Time,model2b$fitted.values,pch=20,col="blue")
points(Voltage$Time,model2b$fitted.values,type="l",col="blue")
plot(Voltage$Time,model2b$residuals)
abline(h=0,col="grey")

# build polynomial regression
model2c=lm(VoltageDrop~Time+I(Time^2)+I(Time^3)+I(Time^4),data=Voltage)
summary(model2c)
## 
## Call:
## lm(formula = VoltageDrop ~ Time + I(Time^2) + I(Time^3) + I(Time^4), 
##     data = Voltage)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.45087 -0.17792  0.00618  0.16770  0.55380 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.496e+00  1.750e-01   48.55  < 2e-16 ***
## Time        -1.553e+00  1.244e-01  -12.48 1.23e-14 ***
## I(Time^2)    5.563e-01  2.576e-02   21.59  < 2e-16 ***
## I(Time^3)   -4.426e-02  1.947e-03  -22.73  < 2e-16 ***
## I(Time^4)    1.024e-03  4.827e-05   21.21  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2576 on 36 degrees of freedom
## Multiple R-squared:  0.9909, Adjusted R-squared:  0.9899 
## F-statistic: 980.3 on 4 and 36 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)
points(Voltage$Time,model2c$fitted.values,pch=20,col="blue")
points(Voltage$Time,model2c$fitted.values,type="l",col="blue")
plot(Voltage$Time,model2c$residuals)
abline(h=0,col="grey")

# build polynomial regression
model2d=lm(VoltageDrop~Time+I(Time^2)+I(Time^3)+I(Time^4)+I(Time^5),data=Voltage)
summary(model2d)
## 
## Call:
## lm(formula = VoltageDrop ~ Time + I(Time^2) + I(Time^3) + I(Time^4) + 
##     I(Time^5), data = Voltage)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4508 -0.1892 -0.0095  0.1718  0.5617 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.465e+00  2.004e-01  42.234  < 2e-16 ***
## Time        -1.496e+00  2.112e-01  -7.084 2.98e-08 ***
## I(Time^2)    5.356e-01  6.751e-02   7.933 2.49e-09 ***
## I(Time^3)   -4.144e-02  8.677e-03  -4.776 3.15e-05 ***
## I(Time^4)    8.644e-04  4.806e-04   1.799   0.0807 .  
## I(Time^5)    3.188e-06  9.561e-06   0.333   0.7408    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2609 on 35 degrees of freedom
## Multiple R-squared:  0.9909, Adjusted R-squared:  0.9896 
## F-statistic: 764.9 on 5 and 35 DF,  p-value: < 2.2e-16
par(mfrow=c(1,2))
plot(Voltage$Time,Voltage$VoltageDrop,pch=20)
points(Voltage$Time,model2d$fitted.values,pch=20,col="blue")
points(Voltage$Time,model2d$fitted.values,type="l",col="blue")
plot(Voltage$Time,model2d$residuals)
abline(h=0,col="grey")