Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.

Deal with Categorial Variable?

In the previous chapters, we mostly deal with continuous variables. What if the covariates are categorical variables, e.g., gender and race? In this case, we would need indicator variables.

Example

Suppose we want to study the relationship between the effective life of a cutting tool used on a lathe (\(y\)) and covariates such as the speed in revolutions per minute (\(x_1\)), type of cutting tool used (categorical), and type of cutting oil used (categorical).

response:

covariates:

rm(list=ls())
Life <- read.csv("data_Life.csv",h=T)
Life
##     Hour  rpm ToolType    Oil
## 1  18.73  610        A    Low
## 2  14.52  950        A    Low
## 3  17.43  720        A    Low
## 4  14.54  840        A    Low
## 5  13.44  980        A    Low
## 6  24.39  530        A Medium
## 7  13.34  680        A Medium
## 8  22.71  540        A Medium
## 9  12.68  890        A Medium
## 10 19.32  730        A Medium
## 11 30.16  670        B    Low
## 12 27.09  770        B    Low
## 13 25.40  880        B    Low
## 14 26.05 1000        B    Low
## 15 33.49  760        B    Low
## 16 35.62  590        B Medium
## 17 26.07  910        B Medium
## 18 36.78  650        B Medium
## 19 34.95  810        B Medium
## 20 43.67  500        B Medium

General Concept of Indicator Variables

Categorical variables - A categorical variable do not have a scale of measurement.

Indicator variables (also called dummy variables) - A variable that assigns levels to the categorical variable.

Let us first focus on effective life of a cutting tool used on a lathe \(y\), the speed in revolutions per minute (rpm) \(x_1\), and type of cutting tool used (type A or B) \(x_2\).

We could first define the indicator variable for the categorical variable cutting tool type (A or B) as follows.

\[ x_2 = \begin{cases} 0,\text{ Type A}\\ 1,\text{ Type B} \end{cases} \]

Then we could consider the following model:

\[ y= \beta_0+\beta_1x_1+\beta_2x_2+\epsilon \]

For Tool type A , this model becomes:

\[ y= \beta_0+\beta_1x_1+\epsilon \]

For Tool type B , this model becomes:

\[ y= \beta_0+\beta_1x_1+\beta_2 * 1 +\epsilon \]

\[ y= (\beta_0+\beta_2)+\beta_1x_1+\epsilon \]

Changing from A to B induces a change in the intercept (slope is unchanged and identical)

Example of Cutting Tool Life

pairs(Life,pch=20)

# creat color vector for plotting
col_vec=ifelse(Life$ToolType=="B","blue","red")
pch_vec=ifelse(Life$ToolType=="B",15,18)
plot(Life$rpm,Life$Hour,col=col_vec,pch=pch_vec)

# fit regression
model1 <- lm(Hour ~ rpm + ToolType, data=Life)
summary(model1)
## 
## Call:
## lm(formula = Hour ~ rpm + ToolType, data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5527 -1.7868 -0.0016  1.8395  4.9838 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.98560    3.51038  10.536 7.16e-09 ***
## rpm         -0.02661    0.00452  -5.887 1.79e-05 ***
## ToolTypeB   15.00425    1.35967  11.035 3.59e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.039 on 17 degrees of freedom
## Multiple R-squared:  0.9003, Adjusted R-squared:  0.8886 
## F-statistic: 76.75 on 2 and 17 DF,  p-value: 3.086e-09

Note that covariate name changes from ToolType to ToolTypeB with B in it, meaning this is the indicator variable which equals to 1 when type is B and 0 otherwise.

We can also visualize the fitted linear regressions.

plot(Life$rpm,Life$Hour,col=col_vec,pch=pch_vec)
abline(a=model1$coef[1],b=model1$coef[2],col="red")
abline(a=model1$coef[1]+model1$coef[3],b=model1$coef[2],col="blue")

Note that:

Interaction Between Continuous and Indicator Variable

Considering the tool life example, the model to account for the change in slope is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \]

If tool type A is used:

\[ y=\beta_0+ \beta_1 x_1 + \epsilon \]

If tool Type B is used:

\[ y = (\beta_0 + \beta_2 )+ (\beta_1 +\beta_3) x_1 +\epsilon \]

\(\beta_2\) represents change in the intercept caused by changing from type A to type B.

\(\beta_3\) represents change in the slope caused by changing from type A to type B.

# fit regression
model2 <- lm(Hour ~ rpm+ToolType+rpm:ToolType, data=Life)
summary(model2)
## 
## Call:
## lm(formula = Hour ~ rpm + ToolType + rpm:ToolType, data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1750 -1.4999  0.4849  1.7830  4.8652 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   32.774760   4.633472   7.073 2.63e-06 ***
## rpm           -0.020970   0.006074  -3.452  0.00328 ** 
## ToolTypeB     23.970593   6.768973   3.541  0.00272 ** 
## rpm:ToolTypeB -0.011944   0.008842  -1.351  0.19553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.968 on 16 degrees of freedom
## Multiple R-squared:  0.9105, Adjusted R-squared:  0.8937 
## F-statistic: 54.25 on 3 and 16 DF,  p-value: 1.319e-08
# fit two separate regressions for type A and B
model2A = lm(Hour ~ rpm, data=Life[Life$ToolType=="A",])
model2B = lm(Hour ~ rpm, data=Life[Life$ToolType=="B",])
model2A
## 
## Call:
## lm(formula = Hour ~ rpm, data = Life[Life$ToolType == "A", ])
## 
## Coefficients:
## (Intercept)          rpm  
##    32.77476     -0.02097
model2B
## 
## Call:
## lm(formula = Hour ~ rpm, data = Life[Life$ToolType == "B", ])
## 
## Coefficients:
## (Intercept)          rpm  
##    56.74535     -0.03291

One Categorical Variable with More Than Two levels

For categorical variables with \(d\) levels, we would need \(d-1\) indicator variables

Suppose there were three tool types : A, B, and C. Then two indicator variables (called \(x_2\) and \(x_3\)) will be needed:

\(x_2\) \(x_3\) represent
0 0 observation is from type A
1 0 observation is from type B
0 1 observation is from type C

The regression model is

\[ y=\beta_0 + \beta_1 x_{1} +\beta_2 x_{2} +\beta_3 x_{3} + \epsilon\]

Example of City Expense

City <- read.csv("data_City.csv",h=T)
City
##        y x1          x2
## 1  16.68  7   San Diego
## 2  11.50  3   San Diego
## 3  12.03  3   San Diego
## 4  14.88  4   San Diego
## 5  13.75  6   San Diego
## 6  18.11  7   San Diego
## 7   8.00  2   San Diego
## 8  17.83  7      Boston
## 9  79.24 30      Boston
## 10 21.50  5      Boston
## 11 40.33 16      Boston
## 12 21.00 10      Boston
## 13 13.50  4      Boston
## 14 19.75  6      Boston
## 15 24.00  9      Boston
## 16 29.00 10      Boston
## 17 15.35  6      Boston
## 18 19.00  7      Austin
## 19  9.50  3      Austin
## 20 35.10 17      Austin
## 21 17.90 10      Austin
## 22 52.32 26      Austin
## 23 18.75  9      Austin
## 24 19.83  8 Minneapolis
## 25 10.75  4 Minneapolis
model3=lm(y~x1+x2, data=City)
summary(model3)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = City)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4829 -2.2847 -0.0926  1.5171  7.2829 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.2562     1.9838  -0.633  0.53375    
## x1              2.2237     0.1153  19.285 2.16e-14 ***
## x2Boston        6.5020     1.8087   3.595  0.00181 ** 
## x2Minneapolis   3.2039     2.9260   1.095  0.28653    
## x2San Diego     4.6550     2.1181   2.198  0.03992 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.482 on 20 degrees of freedom
## Multiple R-squared:  0.9581, Adjusted R-squared:  0.9497 
## F-statistic: 114.3 on 4 and 20 DF,  p-value: 1.772e-13

Note that we only include one categorical variable \(x_2\) which has four levels (Austin, Boston, Minneapolis, and San Diego), r automatically creates three indicator variables x2Boston,x2Minneapolis,x2San Diego for you.

The whole linear regression is: \[y=\beta_0 + \beta_1 x_1 + \beta_2 x_{Bos} + \beta_3 x_{Minn} + \beta_4 x_{SD} + \epsilon\]

Based on the output, the equation for Boston is

\[y=-1.2562 + 2.2237 x_1 + 6.5020 * 1 + \epsilon=(-1.2562 + 6.5020 ) + 2.2237 x_1 + \epsilon\]

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1 x_4 + \epsilon\]

The four regression models corresponding to the four types of air conditioning systems are as follows:

If the city is Austin, the model is \[ y= \beta_0 + \beta_1 x_1 + \epsilon\]

If the city is Boston, the model is \[ y= (\beta_0 + \beta_2) + (\beta_1 + \beta_5)x_1 + \epsilon\]

If the city is Minneapolis, the model is

\[ y= (\beta_0 + \beta_3) + (\beta_1 + \beta_6)x_1 + \epsilon \]

If the city is San Diego, the model is

\[ y= (\beta_0 + \beta_4) + (\beta_1 + \beta_7)x_1 + \epsilon \]

model4=lm(y~x1+x2+x1:x2, data=City)
summary(model4)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2, data = City)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.405 -1.756  0.263  0.991  6.508 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.0014     2.2773   1.318  0.20501    
## x1                 1.8689     0.1608  11.625 1.63e-09 ***
## x2Boston          -0.4232     2.7924  -0.152  0.88133    
## x2Minneapolis     -1.3314     7.0087  -0.190  0.85159    
## x2San Diego        3.7926     3.6855   1.029  0.31787    
## x1:x2Boston        0.6138     0.2054   2.989  0.00825 ** 
## x1:x2Minneapolis   0.4011     1.0603   0.378  0.70991    
## x1:x2San Diego    -0.3879     0.6063  -0.640  0.53081    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.964 on 17 degrees of freedom
## Multiple R-squared:  0.9742, Adjusted R-squared:  0.9635 
## F-statistic: 91.61 on 7 and 17 DF,  p-value: 3.019e-12

For Austin, the regression is \[ y= 3.0014 + 1.8689 x_1 +\epsilon\]

For Boston, the regression is \[ y= (3.0014 + -0.4232) + (1.8689 +0.6138)x_1 +\epsilon\]

Some Comments on Indicator Variables

Optional: More Than Two Categorical Variables and Interaction Between Indicator Variables

Suppose that in cutting tool example, a second qualitative factor, the type of cutting oil used, must be considered.

Assuming that this factor has two levels, we may define a second indicator variable, \(x_3\), as follows:

\[ x_3 = \begin{cases} 0,\text{ if low-viscosity oil used}\\ 1,\text{ if medium-viscosity oil used} \end{cases} \]

A regression model relating tool life (\(y\)) to cutting speed (\(x_1\)), tool type (\(x_2\)), and type of oil (\(x_3\)) is

\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \]

The intercept of the regression depends on the two indicators in a additive fashion.

The slope of the regression does not depend on the two indicators.

Various types of interaction effects may be added to the model. For example,

\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \epsilon \] This implies

Tool Type Oil Type Regression Model
A low \(y=\beta_0 + \beta_1 x_1 +\epsilon\)
B low \(y=(\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 +\epsilon\)
A medium \(y=(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) x_1 +\epsilon\)
B medium \(y=(\beta_0 + \beta_2 + \beta_3 ) + (\beta_1 + \beta_4 + \beta_5) x_1 +\epsilon\)

\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \epsilon \] This implies

Tool Type Oil Type Regression Model
A low \(y=\beta_0 + \beta_1 x_1 +\epsilon\)
B low \(y=(\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 +\epsilon\)
A medium \(y=(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) x_1 +\epsilon\)
B medium \(y=(\beta_0 + \beta_2 + \beta_3 + \beta_6) + (\beta_1 + \beta_4 + \beta_5) x_1 +\epsilon\)

\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon \] This implies

Tool Type Oil Type Regression Model
A low \(y=\beta_0 + \beta_1 x_1 +\epsilon\)
B low \(y=(\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 +\epsilon\)
A medium \(y=(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) x_1 +\epsilon\)
B medium \(y=(\beta_0 + \beta_2 + \beta_3 + \beta_6) + (\beta_1 + \beta_4 + \beta_5 + \beta_7) x_1 +\epsilon\)

Example

model5=lm(Hour ~ rpm+ToolType+Oil+rpm:ToolType+rpm:Oil, data=Life)
summary(model5)
## 
## Call:
## lm(formula = Hour ~ rpm + ToolType + Oil + rpm:ToolType + rpm:Oil, 
##     data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8554 -1.3415  0.4952  1.4381  4.6931 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   23.851332   6.317961   3.775  0.00205 **
## rpm           -0.010955   0.007741  -1.415  0.17886   
## ToolTypeB     22.824228   6.449096   3.539  0.00327 **
## OilMedium     13.078954   7.316050   1.788  0.09548 . 
## rpm:ToolTypeB -0.010336   0.008435  -1.225  0.24063   
## rpm:OilMedium -0.015126   0.009480  -1.595  0.13293   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.808 on 14 degrees of freedom
## Multiple R-squared:  0.9299, Adjusted R-squared:  0.9049 
## F-statistic: 37.16 on 5 and 14 DF,  p-value: 1.344e-07
model6=lm(Hour ~ rpm+ToolType+Oil+rpm:ToolType+rpm:Oil+ToolType:Oil, data=Life)
summary(model6)
## 
## Call:
## lm(formula = Hour ~ rpm + ToolType + Oil + rpm:ToolType + rpm:Oil + 
##     ToolType:Oil, data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9670 -1.1693  0.1475  0.9990  4.0778 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         26.985677   6.269634   4.304 0.000857 ***
## rpm                 -0.013724   0.007508  -1.828 0.090584 .  
## ToolTypeB           15.647742   7.507264   2.084 0.057419 .  
## OilMedium           11.837542   6.954661   1.702 0.112513    
## rpm:ToolTypeB       -0.003672   0.008950  -0.410 0.688245    
## rpm:OilMedium       -0.016447   0.008995  -1.829 0.090499 .  
## ToolTypeB:OilMedium  4.366602   2.667702   1.637 0.125631    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.653 on 13 degrees of freedom
## Multiple R-squared:  0.9419, Adjusted R-squared:  0.9151 
## F-statistic: 35.12 on 6 and 13 DF,  p-value: 2.666e-07
model7=lm(Hour ~ rpm+ToolType+Oil+rpm:ToolType+rpm:Oil+rpm:ToolType:Oil, data=Life)
summary(model7)
## 
## Call:
## lm(formula = Hour ~ rpm + ToolType + Oil + rpm:ToolType + rpm:Oil + 
##     rpm:ToolType:Oil, data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9958 -1.3189  0.2251  1.1989  3.9290 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             25.844028   6.127839   4.217  0.00101 **
## rpm                     -0.012375   0.007403  -1.672  0.11849   
## ToolTypeB               18.470517   6.706680   2.754  0.01641 * 
## OilMedium               13.811638   6.961331   1.984  0.06877 . 
## rpm:ToolTypeB           -0.007037   0.008272  -0.851  0.41034   
## rpm:OilMedium           -0.018977   0.009321  -2.036  0.06266 . 
## rpm:ToolTypeB:OilMedium  0.005523   0.003471   1.591  0.13564   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.666 on 13 degrees of freedom
## Multiple R-squared:  0.9413, Adjusted R-squared:  0.9143 
## F-statistic: 34.77 on 6 and 13 DF,  p-value: 2.833e-07