Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.
In the previous chapters, we mostly deal with continuous variables. What if the covariates are categorical variables, e.g., gender and race? In this case, we would need indicator variables.
Suppose we want to study the relationship between the effective life of a cutting tool used on a lathe (\(y\)) and covariates such as the speed in revolutions per minute (\(x_1\)), type of cutting tool used (categorical), and type of cutting oil used (categorical).
response:
covariates:
rm(list=ls())
Life <- read.csv("data_Life.csv",h=T)
Life
## Hour rpm ToolType Oil
## 1 18.73 610 A Low
## 2 14.52 950 A Low
## 3 17.43 720 A Low
## 4 14.54 840 A Low
## 5 13.44 980 A Low
## 6 24.39 530 A Medium
## 7 13.34 680 A Medium
## 8 22.71 540 A Medium
## 9 12.68 890 A Medium
## 10 19.32 730 A Medium
## 11 30.16 670 B Low
## 12 27.09 770 B Low
## 13 25.40 880 B Low
## 14 26.05 1000 B Low
## 15 33.49 760 B Low
## 16 35.62 590 B Medium
## 17 26.07 910 B Medium
## 18 36.78 650 B Medium
## 19 34.95 810 B Medium
## 20 43.67 500 B Medium
Categorical variables - A categorical variable do not have a scale of measurement.
Indicator variables (also called dummy variables) - A variable that assigns levels to the categorical variable.
Let us first focus on effective life of a cutting tool used on a lathe \(y\), the speed in revolutions per minute (rpm) \(x_1\), and type of cutting tool used (type A or B) \(x_2\).
We could first define the indicator variable for the categorical variable cutting tool type (A or B) as follows.
\[ x_2 = \begin{cases} 0,\text{ Type A}\\ 1,\text{ Type B} \end{cases} \]
Then we could consider the following model:
\[ y= \beta_0+\beta_1x_1+\beta_2x_2+\epsilon \]
For Tool type A , this model becomes:
\[ y= \beta_0+\beta_1x_1+\epsilon \]
For Tool type B , this model becomes:
\[ y= \beta_0+\beta_1x_1+\beta_2 * 1 +\epsilon \]
\[ y= (\beta_0+\beta_2)+\beta_1x_1+\epsilon \]
Changing from A to B induces a change in the intercept (slope is unchanged and identical)
pairs(Life,pch=20)
# creat color vector for plotting
col_vec=ifelse(Life$ToolType=="B","blue","red")
pch_vec=ifelse(Life$ToolType=="B",15,18)
plot(Life$rpm,Life$Hour,col=col_vec,pch=pch_vec)
# fit regression
model1 <- lm(Hour ~ rpm + ToolType, data=Life)
summary(model1)
##
## Call:
## lm(formula = Hour ~ rpm + ToolType, data = Life)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5527 -1.7868 -0.0016 1.8395 4.9838
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.98560 3.51038 10.536 7.16e-09 ***
## rpm -0.02661 0.00452 -5.887 1.79e-05 ***
## ToolTypeB 15.00425 1.35967 11.035 3.59e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.039 on 17 degrees of freedom
## Multiple R-squared: 0.9003, Adjusted R-squared: 0.8886
## F-statistic: 76.75 on 2 and 17 DF, p-value: 3.086e-09
Note that covariate name changes from ToolType
to ToolTypeB
with B
in it, meaning this is the indicator variable which equals to 1 when type is B and 0 otherwise.
We can also visualize the fitted linear regressions.
plot(Life$rpm,Life$Hour,col=col_vec,pch=pch_vec)
abline(a=model1$coef[1],b=model1$coef[2],col="red")
abline(a=model1$coef[1]+model1$coef[3],b=model1$coef[2],col="blue")
Note that:
Two separate models could have been fit to the data.
The single-model approach is preferred because the analyst has only one final equation
Since the slope is assumed to be the same, it makes sense to use the data from both tool types to produce a single estimate of this common parameter
If the slopes are expected differ, an interaction term can included in the model.
Considering the tool life example, the model to account for the change in slope is:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \]
If tool type A is used:
\[ y=\beta_0+ \beta_1 x_1 + \epsilon \]
If tool Type B is used:
\[ y = (\beta_0 + \beta_2 )+ (\beta_1 +\beta_3) x_1 +\epsilon \]
\(\beta_2\) represents change in the intercept caused by changing from type A to type B.
\(\beta_3\) represents change in the slope caused by changing from type A to type B.
# fit regression
model2 <- lm(Hour ~ rpm+ToolType+rpm:ToolType, data=Life)
summary(model2)
##
## Call:
## lm(formula = Hour ~ rpm + ToolType + rpm:ToolType, data = Life)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1750 -1.4999 0.4849 1.7830 4.8652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32.774760 4.633472 7.073 2.63e-06 ***
## rpm -0.020970 0.006074 -3.452 0.00328 **
## ToolTypeB 23.970593 6.768973 3.541 0.00272 **
## rpm:ToolTypeB -0.011944 0.008842 -1.351 0.19553
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.968 on 16 degrees of freedom
## Multiple R-squared: 0.9105, Adjusted R-squared: 0.8937
## F-statistic: 54.25 on 3 and 16 DF, p-value: 1.319e-08
# fit two separate regressions for type A and B
model2A = lm(Hour ~ rpm, data=Life[Life$ToolType=="A",])
model2B = lm(Hour ~ rpm, data=Life[Life$ToolType=="B",])
model2A
##
## Call:
## lm(formula = Hour ~ rpm, data = Life[Life$ToolType == "A", ])
##
## Coefficients:
## (Intercept) rpm
## 32.77476 -0.02097
model2B
##
## Call:
## lm(formula = Hour ~ rpm, data = Life[Life$ToolType == "B", ])
##
## Coefficients:
## (Intercept) rpm
## 56.74535 -0.03291
For categorical variables with \(d\) levels, we would need \(d-1\) indicator variables
Suppose there were three tool types : A, B, and C. Then two indicator variables (called \(x_2\) and \(x_3\)) will be needed:
\(x_2\) | \(x_3\) | represent |
---|---|---|
0 | 0 | observation is from type A |
1 | 0 | observation is from type B |
0 | 1 | observation is from type C |
The regression model is
\[ y=\beta_0 + \beta_1 x_{1} +\beta_2 x_{2} +\beta_3 x_{3} + \epsilon\]
City <- read.csv("data_City.csv",h=T)
City
## y x1 x2
## 1 16.68 7 San Diego
## 2 11.50 3 San Diego
## 3 12.03 3 San Diego
## 4 14.88 4 San Diego
## 5 13.75 6 San Diego
## 6 18.11 7 San Diego
## 7 8.00 2 San Diego
## 8 17.83 7 Boston
## 9 79.24 30 Boston
## 10 21.50 5 Boston
## 11 40.33 16 Boston
## 12 21.00 10 Boston
## 13 13.50 4 Boston
## 14 19.75 6 Boston
## 15 24.00 9 Boston
## 16 29.00 10 Boston
## 17 15.35 6 Boston
## 18 19.00 7 Austin
## 19 9.50 3 Austin
## 20 35.10 17 Austin
## 21 17.90 10 Austin
## 22 52.32 26 Austin
## 23 18.75 9 Austin
## 24 19.83 8 Minneapolis
## 25 10.75 4 Minneapolis
model3=lm(y~x1+x2, data=City)
summary(model3)
##
## Call:
## lm(formula = y ~ x1 + x2, data = City)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.4829 -2.2847 -0.0926 1.5171 7.2829
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2562 1.9838 -0.633 0.53375
## x1 2.2237 0.1153 19.285 2.16e-14 ***
## x2Boston 6.5020 1.8087 3.595 0.00181 **
## x2Minneapolis 3.2039 2.9260 1.095 0.28653
## x2San Diego 4.6550 2.1181 2.198 0.03992 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.482 on 20 degrees of freedom
## Multiple R-squared: 0.9581, Adjusted R-squared: 0.9497
## F-statistic: 114.3 on 4 and 20 DF, p-value: 1.772e-13
Note that we only include one categorical variable \(x_2\) which has four levels (Austin, Boston, Minneapolis, and San Diego), r
automatically creates three indicator variables x2Boston
,x2Minneapolis
,x2San Diego
for you.
The whole linear regression is: \[y=\beta_0 + \beta_1 x_1 + \beta_2 x_{Bos} + \beta_3 x_{Minn} + \beta_4 x_{SD} + \epsilon\]
Based on the output, the equation for Boston is
\[y=-1.2562 + 2.2237 x_1 + 6.5020 * 1 + \epsilon=(-1.2562 + 6.5020 ) + 2.2237 x_1 + \epsilon\]
Appears unrealistic to assume that the slope of the regression function relating \(y\) to \(x_1\) does not depend on the city.
Would expect the mean electricity consumption to increase with the size of the house, but the rate of increase would be different depending on type
There should be an interaction between the size of the house and the type of air conditioning system.
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1 x_4 + \epsilon\]
The four regression models corresponding to the four types of air conditioning systems are as follows:
If the city is Austin, the model is \[ y= \beta_0 + \beta_1 x_1 + \epsilon\]
If the city is Boston, the model is \[ y= (\beta_0 + \beta_2) + (\beta_1 + \beta_5)x_1 + \epsilon\]
If the city is Minneapolis, the model is
\[ y= (\beta_0 + \beta_3) + (\beta_1 + \beta_6)x_1 + \epsilon \]
If the city is San Diego, the model is
\[ y= (\beta_0 + \beta_4) + (\beta_1 + \beta_7)x_1 + \epsilon \]
model4=lm(y~x1+x2+x1:x2, data=City)
summary(model4)
##
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2, data = City)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.405 -1.756 0.263 0.991 6.508
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.0014 2.2773 1.318 0.20501
## x1 1.8689 0.1608 11.625 1.63e-09 ***
## x2Boston -0.4232 2.7924 -0.152 0.88133
## x2Minneapolis -1.3314 7.0087 -0.190 0.85159
## x2San Diego 3.7926 3.6855 1.029 0.31787
## x1:x2Boston 0.6138 0.2054 2.989 0.00825 **
## x1:x2Minneapolis 0.4011 1.0603 0.378 0.70991
## x1:x2San Diego -0.3879 0.6063 -0.640 0.53081
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 17 degrees of freedom
## Multiple R-squared: 0.9742, Adjusted R-squared: 0.9635
## F-statistic: 91.61 on 7 and 17 DF, p-value: 3.019e-12
For Austin, the regression is \[ y= 3.0014 + 1.8689 x_1 +\epsilon\]
For Boston, the regression is \[ y= (3.0014 + -0.4232) + (1.8689 +0.6138)x_1 +\epsilon\]
Suppose that in cutting tool example, a second qualitative factor, the type of cutting oil used, must be considered.
Assuming that this factor has two levels, we may define a second indicator variable, \(x_3\), as follows:
\[ x_3 = \begin{cases} 0,\text{ if low-viscosity oil used}\\ 1,\text{ if medium-viscosity oil used} \end{cases} \]
A regression model relating tool life (\(y\)) to cutting speed (\(x_1\)), tool type (\(x_2\)), and type of oil (\(x_3\)) is
\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \]
The intercept of the regression depends on the two indicators in a additive fashion.
The slope of the regression does not depend on the two indicators.
Various types of interaction effects may be added to the model. For example,
\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \epsilon \] This implies
Tool Type | Oil Type | Regression Model |
---|---|---|
A | low | \(y=\beta_0 + \beta_1 x_1 +\epsilon\) |
B | low | \(y=(\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 +\epsilon\) |
A | medium | \(y=(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) x_1 +\epsilon\) |
B | medium | \(y=(\beta_0 + \beta_2 + \beta_3 ) + (\beta_1 + \beta_4 + \beta_5) x_1 +\epsilon\) |
\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \epsilon \] This implies
Tool Type | Oil Type | Regression Model |
---|---|---|
A | low | \(y=\beta_0 + \beta_1 x_1 +\epsilon\) |
B | low | \(y=(\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 +\epsilon\) |
A | medium | \(y=(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) x_1 +\epsilon\) |
B | medium | \(y=(\beta_0 + \beta_2 + \beta_3 + \beta_6) + (\beta_1 + \beta_4 + \beta_5) x_1 +\epsilon\) |
\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon \] This implies
Tool Type | Oil Type | Regression Model |
---|---|---|
A | low | \(y=\beta_0 + \beta_1 x_1 +\epsilon\) |
B | low | \(y=(\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 +\epsilon\) |
A | medium | \(y=(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) x_1 +\epsilon\) |
B | medium | \(y=(\beta_0 + \beta_2 + \beta_3 + \beta_6) + (\beta_1 + \beta_4 + \beta_5 + \beta_7) x_1 +\epsilon\) |
model5=lm(Hour ~ rpm+ToolType+Oil+rpm:ToolType+rpm:Oil, data=Life)
summary(model5)
##
## Call:
## lm(formula = Hour ~ rpm + ToolType + Oil + rpm:ToolType + rpm:Oil,
## data = Life)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8554 -1.3415 0.4952 1.4381 4.6931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.851332 6.317961 3.775 0.00205 **
## rpm -0.010955 0.007741 -1.415 0.17886
## ToolTypeB 22.824228 6.449096 3.539 0.00327 **
## OilMedium 13.078954 7.316050 1.788 0.09548 .
## rpm:ToolTypeB -0.010336 0.008435 -1.225 0.24063
## rpm:OilMedium -0.015126 0.009480 -1.595 0.13293
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.808 on 14 degrees of freedom
## Multiple R-squared: 0.9299, Adjusted R-squared: 0.9049
## F-statistic: 37.16 on 5 and 14 DF, p-value: 1.344e-07
model6=lm(Hour ~ rpm+ToolType+Oil+rpm:ToolType+rpm:Oil+ToolType:Oil, data=Life)
summary(model6)
##
## Call:
## lm(formula = Hour ~ rpm + ToolType + Oil + rpm:ToolType + rpm:Oil +
## ToolType:Oil, data = Life)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9670 -1.1693 0.1475 0.9990 4.0778
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.985677 6.269634 4.304 0.000857 ***
## rpm -0.013724 0.007508 -1.828 0.090584 .
## ToolTypeB 15.647742 7.507264 2.084 0.057419 .
## OilMedium 11.837542 6.954661 1.702 0.112513
## rpm:ToolTypeB -0.003672 0.008950 -0.410 0.688245
## rpm:OilMedium -0.016447 0.008995 -1.829 0.090499 .
## ToolTypeB:OilMedium 4.366602 2.667702 1.637 0.125631
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.653 on 13 degrees of freedom
## Multiple R-squared: 0.9419, Adjusted R-squared: 0.9151
## F-statistic: 35.12 on 6 and 13 DF, p-value: 2.666e-07
model7=lm(Hour ~ rpm+ToolType+Oil+rpm:ToolType+rpm:Oil+rpm:ToolType:Oil, data=Life)
summary(model7)
##
## Call:
## lm(formula = Hour ~ rpm + ToolType + Oil + rpm:ToolType + rpm:Oil +
## rpm:ToolType:Oil, data = Life)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9958 -1.3189 0.2251 1.1989 3.9290
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25.844028 6.127839 4.217 0.00101 **
## rpm -0.012375 0.007403 -1.672 0.11849
## ToolTypeB 18.470517 6.706680 2.754 0.01641 *
## OilMedium 13.811638 6.961331 1.984 0.06877 .
## rpm:ToolTypeB -0.007037 0.008272 -0.851 0.41034
## rpm:OilMedium -0.018977 0.009321 -2.036 0.06266 .
## rpm:ToolTypeB:OilMedium 0.005523 0.003471 1.591 0.13564
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.666 on 13 degrees of freedom
## Multiple R-squared: 0.9413, Adjusted R-squared: 0.9143
## F-statistic: 34.77 on 6 and 13 DF, p-value: 2.833e-07