Chapter 10 Indicator Variables

Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.

Deal with Categorial Variable?

In the previous chapters, we mostly deal with continuous variables. What if the covariates are categorical variables, e.g., gender and race? In this case, we would need indicator variables.

Example

Suppose we want to study the relationship between the effective life of a cutting tool used on a lathe (\(y\)) and covariates such as the speed in revolutions per minute (\(x_1\)), type of cutting tool used (categorical), and type of cutting oil used (categorical).

response:

effective life of a cutting tool used on a lathe

covariates:

the speed in revolutions per minute (rpm)
type of cutting tool used (type A or B)
type of cutting tool used (low or medium)

rm(list=ls())
Life <- read.csv("data_Life.csv",h=T)
Life

##     Hour  rpm ToolType    Oil
## 1  18.73  610        A    Low
## 2  14.52  950        A    Low
## 3  17.43  720        A    Low
## 4  14.54  840        A    Low
## 5  13.44  980        A    Low
## 6  24.39  530        A Medium
## 7  13.34  680        A Medium
## 8  22.71  540        A Medium
## 9  12.68  890        A Medium
## 10 19.32  730        A Medium
## 11 30.16  670        B    Low
## 12 27.09  770        B    Low
## 13 25.40  880        B    Low
## 14 26.05 1000        B    Low
## 15 33.49  760        B    Low
## 16 35.62  590        B Medium
## 17 26.07  910        B Medium
## 18 36.78  650        B Medium
## 19 34.95  810        B Medium
## 20 43.67  500        B Medium

General Concept of Indicator Variables

Categorical variables - A categorical variable do not have a scale of measurement.

Indicator variables (also called dummy variables) - A variable that assigns levels to the categorical variable.

Let us first focus on effective life of a cutting tool used on a lathe \(y\), the speed in revolutions per minute (rpm) \(x_1\), and type of cutting tool used (type A or B) \(x_2\).

We could first define the indicator variable for the categorical variable cutting tool type (A or B) as follows.

\[ x_2 = \begin{cases} 0,\text{ Type A}\\ 1,\text{ Type B} \end{cases} \]

Then we could consider the following model:

\[ y= \beta_0+\beta_1x_1+\beta_2x_2+\epsilon \]

For Tool type A , this model becomes:

\[ y= \beta_0+\beta_1x_1+\epsilon \]

For Tool type B , this model becomes:

\[ y= \beta_0+\beta_1x_1+\beta_2 * 1 +\epsilon \]

\[ y= (\beta_0+\beta_2)+\beta_1x_1+\epsilon \]

Changing from A to B induces a change in the intercept (slope is unchanged and identical)

Example of Cutting Tool Life

pairs(Life,pch=20)

# creat color vector for plotting
col_vec=ifelse(Life$ToolType=="B","blue","red")
pch_vec=ifelse(Life$ToolType=="B",15,18)
plot(Life$rpm,Life$Hour,col=col_vec,pch=pch_vec)

# fit regression
model1 <- lm(Hour ~ rpm + ToolType, data=Life)
summary(model1)

## 
## Call:
## lm(formula = Hour ~ rpm + ToolType, data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5527 -1.7868 -0.0016  1.8395  4.9838 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.98560    3.51038  10.536 7.16e-09 ***
## rpm         -0.02661    0.00452  -5.887 1.79e-05 ***
## ToolTypeB   15.00425    1.35967  11.035 3.59e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.039 on 17 degrees of freedom
## Multiple R-squared:  0.9003, Adjusted R-squared:  0.8886 
## F-statistic: 76.75 on 2 and 17 DF,  p-value: 3.086e-09

Note that covariate name changes from ToolType to ToolTypeB with B in it, meaning this is the indicator variable which equals to 1 when type is B and 0 otherwise.

We can also visualize the fitted linear regressions.

plot(Life$rpm,Life$Hour,col=col_vec,pch=pch_vec)
abline(a=model1$coef[1],b=model1$coef[2],col="red")
abline(a=model1$coef[1]+model1$coef[3],b=model1$coef[2],col="blue")

Note that:

Two separate models could have been fit to the data.
The single-model approach is preferred because the analyst has only one final equation
Since the slope is assumed to be the same, it makes sense to use the data from both tool types to produce a single estimate of this common parameter
If the slopes are expected differ, an interaction term can included in the model.

Interaction Between Continuous and Indicator Variable

Considering the tool life example, the model to account for the change in slope is:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \epsilon \]

If tool type A is used:

\[ y=\beta_0+ \beta_1 x_1 + \epsilon \]

If tool Type B is used:

\[ y = (\beta_0 + \beta_2 )+ (\beta_1 +\beta_3) x_1 +\epsilon \]

\(\beta_2\) represents change in the intercept caused by changing from type A to type B.

\(\beta_3\) represents change in the slope caused by changing from type A to type B.

# fit regression
model2 <- lm(Hour ~ rpm+ToolType+rpm:ToolType, data=Life)
summary(model2)

## 
## Call:
## lm(formula = Hour ~ rpm + ToolType + rpm:ToolType, data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1750 -1.4999  0.4849  1.7830  4.8652 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   32.774760   4.633472   7.073 2.63e-06 ***
## rpm           -0.020970   0.006074  -3.452  0.00328 ** 
## ToolTypeB     23.970593   6.768973   3.541  0.00272 ** 
## rpm:ToolTypeB -0.011944   0.008842  -1.351  0.19553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.968 on 16 degrees of freedom
## Multiple R-squared:  0.9105, Adjusted R-squared:  0.8937 
## F-statistic: 54.25 on 3 and 16 DF,  p-value: 1.319e-08

# fit two separate regressions for type A and B
model2A = lm(Hour ~ rpm, data=Life[Life$ToolType=="A",])
model2B = lm(Hour ~ rpm, data=Life[Life$ToolType=="B",])
model2A

## 
## Call:
## lm(formula = Hour ~ rpm, data = Life[Life$ToolType == "A", ])
## 
## Coefficients:
## (Intercept)          rpm  
##    32.77476     -0.02097

model2B

## 
## Call:
## lm(formula = Hour ~ rpm, data = Life[Life$ToolType == "B", ])
## 
## Coefficients:
## (Intercept)          rpm  
##    56.74535     -0.03291

One Categorical Variable with More Than Two levels

For categorical variables with \(d\) levels, we would need \(d-1\) indicator variables

Suppose there were three tool types : A, B, and C. Then two indicator variables (called \(x_2\) and \(x_3\)) will be needed:

\(x_2\)	\(x_3\)	represent
0	0	observation is from type A
1	0	observation is from type B
0	1	observation is from type C

The regression model is

\[ y=\beta_0 + \beta_1 x_{1} +\beta_2 x_{2} +\beta_3 x_{3} + \epsilon\]

Example of City Expense

City <- read.csv("data_City.csv",h=T)
City

##        y x1          x2
## 1  16.68  7   San Diego
## 2  11.50  3   San Diego
## 3  12.03  3   San Diego
## 4  14.88  4   San Diego
## 5  13.75  6   San Diego
## 6  18.11  7   San Diego
## 7   8.00  2   San Diego
## 8  17.83  7      Boston
## 9  79.24 30      Boston
## 10 21.50  5      Boston
## 11 40.33 16      Boston
## 12 21.00 10      Boston
## 13 13.50  4      Boston
## 14 19.75  6      Boston
## 15 24.00  9      Boston
## 16 29.00 10      Boston
## 17 15.35  6      Boston
## 18 19.00  7      Austin
## 19  9.50  3      Austin
## 20 35.10 17      Austin
## 21 17.90 10      Austin
## 22 52.32 26      Austin
## 23 18.75  9      Austin
## 24 19.83  8 Minneapolis
## 25 10.75  4 Minneapolis

model3=lm(y~x1+x2, data=City)
summary(model3)

## 
## Call:
## lm(formula = y ~ x1 + x2, data = City)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4829 -2.2847 -0.0926  1.5171  7.2829 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.2562     1.9838  -0.633  0.53375    
## x1              2.2237     0.1153  19.285 2.16e-14 ***
## x2Boston        6.5020     1.8087   3.595  0.00181 ** 
## x2Minneapolis   3.2039     2.9260   1.095  0.28653    
## x2San Diego     4.6550     2.1181   2.198  0.03992 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.482 on 20 degrees of freedom
## Multiple R-squared:  0.9581, Adjusted R-squared:  0.9497 
## F-statistic: 114.3 on 4 and 20 DF,  p-value: 1.772e-13

Note that we only include one categorical variable \(x_2\) which has four levels (Austin, Boston, Minneapolis, and San Diego), r automatically creates three indicator variables x2Boston,x2Minneapolis,x2San Diego for you.

The whole linear regression is: \[y=\beta_0 + \beta_1 x_1 + \beta_2 x_{Bos} + \beta_3 x_{Minn} + \beta_4 x_{SD} + \epsilon\]

Based on the output, the equation for Boston is

\[y=-1.2562 + 2.2237 x_1 + 6.5020 * 1 + \epsilon=(-1.2562 + 6.5020 ) + 2.2237 x_1 + \epsilon\]

Appears unrealistic to assume that the slope of the regression function relating \(y\) to \(x_1\) does not depend on the city.
Would expect the mean electricity consumption to increase with the size of the house, but the rate of increase would be different depending on type
There should be an interaction between the size of the house and the type of air conditioning system.

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_1 x_2 + \beta_6 x_1 x_3 + \beta_7 x_1 x_4 + \epsilon\]

The four regression models corresponding to the four types of air conditioning systems are as follows:

If the city is Austin, the model is \[ y= \beta_0 + \beta_1 x_1 + \epsilon\]

If the city is Boston, the model is \[ y= (\beta_0 + \beta_2) + (\beta_1 + \beta_5)x_1 + \epsilon\]

If the city is Minneapolis, the model is

\[ y= (\beta_0 + \beta_3) + (\beta_1 + \beta_6)x_1 + \epsilon \]

If the city is San Diego, the model is

\[ y= (\beta_0 + \beta_4) + (\beta_1 + \beta_7)x_1 + \epsilon \]

model4=lm(y~x1+x2+x1:x2, data=City)
summary(model4)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x1:x2, data = City)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.405 -1.756  0.263  0.991  6.508 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.0014     2.2773   1.318  0.20501    
## x1                 1.8689     0.1608  11.625 1.63e-09 ***
## x2Boston          -0.4232     2.7924  -0.152  0.88133    
## x2Minneapolis     -1.3314     7.0087  -0.190  0.85159    
## x2San Diego        3.7926     3.6855   1.029  0.31787    
## x1:x2Boston        0.6138     0.2054   2.989  0.00825 ** 
## x1:x2Minneapolis   0.4011     1.0603   0.378  0.70991    
## x1:x2San Diego    -0.3879     0.6063  -0.640  0.53081    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.964 on 17 degrees of freedom
## Multiple R-squared:  0.9742, Adjusted R-squared:  0.9635 
## F-statistic: 91.61 on 7 and 17 DF,  p-value: 3.019e-12

For Austin, the regression is \[ y= 3.0014 + 1.8689 x_1 +\epsilon\]

For Boston, the regression is \[ y= (3.0014 + -0.4232) + (1.8689 +0.6138)x_1 +\epsilon\]

Some Comments on Indicator Variables

Can substitute indicator variables for quantitative regressors, e.g., salaries to salary bins. This is useful if accurate data cannot be readily attained.
Group the data into classes or intervals and assign indicator variables

Optional: More Than Two Categorical Variables and Interaction Between Indicator Variables

Suppose that in cutting tool example, a second qualitative factor, the type of cutting oil used, must be considered.

Assuming that this factor has two levels, we may define a second indicator variable, \(x_3\), as follows:

\[ x_3 = \begin{cases} 0,\text{ if low-viscosity oil used}\\ 1,\text{ if medium-viscosity oil used} \end{cases} \]

A regression model relating tool life (\(y\)) to cutting speed (\(x_1\)), tool type (\(x_2\)), and type of oil (\(x_3\)) is

\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon \]

The intercept of the regression depends on the two indicators in a additive fashion.

The slope of the regression does not depend on the two indicators.

Various types of interaction effects may be added to the model. For example,

\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \epsilon \] This implies

Tool Type	Oil Type	Regression Model
A	low	\(y=\beta_0 + \beta_1 x_1 +\epsilon\)
B	low	\(y=(\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 +\epsilon\)
A	medium	\(y=(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) x_1 +\epsilon\)
B	medium	\(y=(\beta_0 + \beta_2 + \beta_3 ) + (\beta_1 + \beta_4 + \beta_5) x_1 +\epsilon\)

\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \epsilon \] This implies

Tool Type	Oil Type	Regression Model
A	low	\(y=\beta_0 + \beta_1 x_1 +\epsilon\)
B	low	\(y=(\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 +\epsilon\)
A	medium	\(y=(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) x_1 +\epsilon\)
B	medium	\(y=(\beta_0 + \beta_2 + \beta_3 + \beta_6) + (\beta_1 + \beta_4 + \beta_5) x_1 +\epsilon\)

\[ y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \beta_7 x_1 x_2 x_3 + \epsilon \] This implies

Tool Type	Oil Type	Regression Model
A	low	\(y=\beta_0 + \beta_1 x_1 +\epsilon\)
B	low	\(y=(\beta_0 + \beta_2) + (\beta_1 + \beta_4) x_1 +\epsilon\)
A	medium	\(y=(\beta_0 + \beta_3 ) + (\beta_1 + \beta_5) x_1 +\epsilon\)
B	medium	\(y=(\beta_0 + \beta_2 + \beta_3 + \beta_6) + (\beta_1 + \beta_4 + \beta_5 + \beta_7) x_1 +\epsilon\)

Example

model5=lm(Hour ~ rpm+ToolType+Oil+rpm:ToolType+rpm:Oil, data=Life)
summary(model5)

## 
## Call:
## lm(formula = Hour ~ rpm + ToolType + Oil + rpm:ToolType + rpm:Oil, 
##     data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8554 -1.3415  0.4952  1.4381  4.6931 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   23.851332   6.317961   3.775  0.00205 **
## rpm           -0.010955   0.007741  -1.415  0.17886   
## ToolTypeB     22.824228   6.449096   3.539  0.00327 **
## OilMedium     13.078954   7.316050   1.788  0.09548 . 
## rpm:ToolTypeB -0.010336   0.008435  -1.225  0.24063   
## rpm:OilMedium -0.015126   0.009480  -1.595  0.13293   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.808 on 14 degrees of freedom
## Multiple R-squared:  0.9299, Adjusted R-squared:  0.9049 
## F-statistic: 37.16 on 5 and 14 DF,  p-value: 1.344e-07

model6=lm(Hour ~ rpm+ToolType+Oil+rpm:ToolType+rpm:Oil+ToolType:Oil, data=Life)
summary(model6)

## 
## Call:
## lm(formula = Hour ~ rpm + ToolType + Oil + rpm:ToolType + rpm:Oil + 
##     ToolType:Oil, data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9670 -1.1693  0.1475  0.9990  4.0778 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         26.985677   6.269634   4.304 0.000857 ***
## rpm                 -0.013724   0.007508  -1.828 0.090584 .  
## ToolTypeB           15.647742   7.507264   2.084 0.057419 .  
## OilMedium           11.837542   6.954661   1.702 0.112513    
## rpm:ToolTypeB       -0.003672   0.008950  -0.410 0.688245    
## rpm:OilMedium       -0.016447   0.008995  -1.829 0.090499 .  
## ToolTypeB:OilMedium  4.366602   2.667702   1.637 0.125631    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.653 on 13 degrees of freedom
## Multiple R-squared:  0.9419, Adjusted R-squared:  0.9151 
## F-statistic: 35.12 on 6 and 13 DF,  p-value: 2.666e-07

model7=lm(Hour ~ rpm+ToolType+Oil+rpm:ToolType+rpm:Oil+rpm:ToolType:Oil, data=Life)
summary(model7)

## 
## Call:
## lm(formula = Hour ~ rpm + ToolType + Oil + rpm:ToolType + rpm:Oil + 
##     rpm:ToolType:Oil, data = Life)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9958 -1.3189  0.2251  1.1989  3.9290 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)   
## (Intercept)             25.844028   6.127839   4.217  0.00101 **
## rpm                     -0.012375   0.007403  -1.672  0.11849   
## ToolTypeB               18.470517   6.706680   2.754  0.01641 * 
## OilMedium               13.811638   6.961331   1.984  0.06877 . 
## rpm:ToolTypeB           -0.007037   0.008272  -0.851  0.41034   
## rpm:OilMedium           -0.018977   0.009321  -2.036  0.06266 . 
## rpm:ToolTypeB:OilMedium  0.005523   0.003471   1.591  0.13564   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.666 on 13 degrees of freedom
## Multiple R-squared:  0.9413, Adjusted R-squared:  0.9143 
## F-statistic: 34.77 on 6 and 13 DF,  p-value: 2.833e-07

Chapter 10 Indicator Variables

Tianhai Zu (zuti@mail.uc.edu)

Deal with Categorial Variable?

Example

General Concept of Indicator Variables

Example of Cutting Tool Life

Interaction Between Continuous and Indicator Variable

One Categorical Variable with More Than Two levels

Example of City Expense

Some Comments on Indicator Variables

Optional: More Than Two Categorical Variables and Interaction Between Indicator Variables

Example