Chapter 6 Transformation

Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This materials was initilated by Yichen Qin and modified by Tianhai Zu for teaching purpose.

Why Transformation?

Chapter 4 presents several methods for checking the model adequacy for linear regressions, i.e., LINE assumptions. What if these assumptions are violated? Then consider transformation to fix them.

Example

Suppose we want to study the relationship between demand \(y\) and energy usage \(x\) for 53 residential customers.

\(y\) demand

\(x\) energy usage

rm(list=ls())
par(mfrow=c(1,2))
Energy=read.csv("data_ElectricUtility.csv")
head(Energy)

##   Customer Usage Demand
## 1        1   679   0.79
## 2        2   292   0.44
## 3        3  1012   0.56
## 4        4   493   0.79
## 5        5   582   2.70
## 6        6  1156   3.64

plot(Energy$Usage,Energy$Demand,pch=20)
model1=lm(Demand~Usage, data=Energy)
plot(Energy$Usage,model1$residuals,pch=20,col="blue")
abline(h=0)

Transformation

We can transform the predictor \(x\) values only: Transforming the x values is appropriate when non-linearity is the only problem, while the independence, normality and equal variance conditions are met. i.e., L

We can transform the response \(y\) values only: Transforming the y values should be considered when non-normality and/or unequal variances are the problems with the model. As an added bonus, the transformation on y may also help to “straighten out” a curved relationship. i.e., INE and also a little L.

We can transform both the predictor \(x\) values and response \(y\) values: Do this when everything seems wrong, i.e., when the regression function is not linear and the error terms are not normal and have unequal variances. i.e., LINE

Trial and Error

Data transformation definitely requires a “trial and error” approach.
In building the model, we try a transformation and then check to see if the transformation eliminated the problems with the model. If it does not help, we try another transformation and so on.
We continue this cyclical process until we have built a model that is appropriate and we can use. That is, the process of model building includes model formulation, model estimation, and model adequacy check.
We do not leave the model building process until we have convinced ourselves that the model meets the four conditions (“LINE”) of the linear regression model.

Artful Science

One important thing to remember is that there is often more than one viable model (or transformation). The model you choose and the model a colleague chooses may be different and yet both equally appropriate. What’s important is that the model you choose:
- is not overly complicated
- meets the four conditions of the linear regression model, and
- allows you to answer your research question of interest.
Data analysis is an artful science! It involves making subjective decisions using very objective tools!

Transformations on \(y\)

We usually assume constant variance of the response variable. However it is often violated when the variance is functionally related to the mean of response. We can usually eliminate this problem by transformation on the response. Note that the strength of the transformation depends on the amount of curvature that is induced.

Here is a list of transformations on \(y\)

Relationship of \(\sigma^2\) to \(E[y]\)	Transformation
\(\sigma^2 \propto \text{constant}\)	\(y^* = y\), no transformation
\(\sigma^2 \propto E[y]\)	\(y^* = \sqrt{y}\), sqaure root transformation
\(\sigma^2 \propto E[y]^2\)	\(y^* = \log y\), log transformation
\(\sigma^2 \propto E[y]^3\)	\(y^* = y^{-1/2}\), reciprocal square root
\(\sigma^2 \propto E[y]^4\)	\(y^* = y^{-1}\), reciprocal
\(\sigma^2 \propto E[y](1-E[y])\)	\(y^* = \sin^{-1}(\sqrt{y})\), arcsin

Note that the last one is for \(0 \leq y \leq 1\).

Example

library(MASS)
# read data
Energy=read.csv("data_ElectricUtility.csv")
head(Energy)

##   Customer Usage Demand
## 1        1   679   0.79
## 2        2   292   0.44
## 3        3  1012   0.56
## 4        4   493   0.79
## 5        5   582   2.70
## 6        6  1156   3.64

# visualize data
par(mfrow=c(1,2))
plot(Energy$Usage,Energy$Demand,pch=20)
# fit regression
model1=lm(Demand~Usage, data=Energy)
# residual plots based on fitted values
plot(model1$fitted.values,model1$residuals,pch=20,col="blue")
abline(h=0)

Apparently, there is unequal variance issue.

Energy$sqrtDemand=sqrt(Energy$Demand)
# visualize data
model1b=lm(sqrtDemand~Usage, data=Energy)
par(mfrow=c(1,2))
plot(Energy$Usage,Energy$sqrtDemand,pch=20)
abline(model1b)
plot(model1b$fitted.values,model1b$residuals,pch=20,col="blue")
abline(h=0)

Energy$logDemand=log(Energy$Demand)
model1c=lm(logDemand~Usage, data=Energy)
plot(Energy$Usage,Energy$logDemand,pch=20)
abline(model1c)
plot(model1c$fitted.values,model1c$residuals,pch=20,col="blue")
abline(h=0)

Transformations on \(x\)

Transformations on \(x\) (and sometimes on \(x\) and \(y\)) can often fix the nonlinearity issues, they essentialy “linearize the model”. * If a transformation of a nonlinear function can result in a linear function - we say it is intrinsically or transformably linear.

For Example:

\[ y= \beta_0x^\beta_1\epsilon\] \[ \log y = \log \beta_0 + \beta_1 \log x + \log \epsilon\]

\[y^* = \beta_0^* + \beta_1^* x^* + \epsilon^*\]

Here is a list of transformations on \(x\) and \(y\) to linearize the model.

Example of Windmill

# example 5.2
Wind=read.csv("data_Windmill.csv")
head(Wind)

##   Observation.Number Velocity Output
## 1                  1      5.0  1.582
## 2                  2      6.0  1.822
## 3                  3      3.4  1.057
## 4                  4      2.7  0.500
## 5                  5     10.0  2.236
## 6                  6      9.7  2.386

# visualize data
par(mfrow=c(1,2))
plot(Wind$Velocity,Wind$Output,pch=20)
model2=lm(Wind$Output~Wind$Velocity)
summary(model2)

## 
## Call:
## lm(formula = Wind$Output ~ Wind$Velocity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59869 -0.14099  0.06059  0.17262  0.32184 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    0.13088    0.12599   1.039     0.31    
## Wind$Velocity  0.24115    0.01905  12.659 7.55e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2361 on 23 degrees of freedom
## Multiple R-squared:  0.8745, Adjusted R-squared:  0.869 
## F-statistic: 160.3 on 1 and 23 DF,  p-value: 7.546e-12

# residual plots based on fitted values
plot(model2$fitted.values,model2$residuals,pch=20,col="blue")
abline(h=0)

So we will need transformation to fix nonlinearity.

# transform velocity
Wind$InvVelo=1/Wind$Velocity
# visualize data
par(mfrow=c(1,2))
plot(Wind$InvVelo,Wind$Output,pch=20)
model2b=lm(Output~InvVelo, data=Wind)
summary(model2b)

## 
## Call:
## lm(formula = Output ~ InvVelo, data = Wind)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20547 -0.04940  0.01100  0.08352  0.12204 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.9789     0.0449   66.34   <2e-16 ***
## InvVelo      -6.9345     0.2064  -33.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09417 on 23 degrees of freedom
## Multiple R-squared:   0.98,  Adjusted R-squared:  0.9792 
## F-statistic:  1128 on 1 and 23 DF,  p-value: < 2.2e-16

# residual plots based on fitted values
plot(model2b$fitted.values,model2b$residuals,pch=20,col="blue")
abline(h=0)

Let’s try log transformation.

# transform velocity
Wind$logVelo=log(Wind$Velocity)
# visualize data
par(mfrow=c(1,2))
plot(Wind$logVelo,Wind$Output,pch=20)
model2c=lm(Output~logVelo, data=Wind)
summary(model2c)

## 
## Call:
## lm(formula = Output ~ logVelo, data = Wind)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.31619 -0.07685  0.02395  0.11139  0.23029 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.83036    0.11083  -7.493  1.3e-07 ***
## logVelo      1.41677    0.06234  22.728  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1376 on 23 degrees of freedom
## Multiple R-squared:  0.9574, Adjusted R-squared:  0.9555 
## F-statistic: 516.6 on 1 and 23 DF,  p-value: < 2.2e-16

# residual plots based on fitted values
plot(model2c$fitted.values,model2c$residuals,pch=20,col="blue")
abline(h=0)

Box-Cox Transformation

Suppose that we wish to transform \(y\) to correct nonnormality and/or nonconstant variance. A useful class of transformations is the Box-Cox transformation, \(y^{(\lambda)}\) where \(\lambda\) is a parameter to be determined.

The Box-Cox Transformation is defined as

\[ y^{(\lambda)}= \begin{cases} \frac{y^\lambda - 1}{\lambda} \text{, if }\lambda \neq 0\\ \log y \text{, if }\lambda = 0 \end{cases} \] where \(\lambda\) is a tuning parameter. To select \(\lambda\), we maximize the log-likelihood of the transformed data, which is equivalent to minimize the sum of square distances.

\(\lambda\)	Transformation
2	square transformation
1	no transformation
1/2	square root transformation
0	log transformation
-1/2	reciprocal square root transformation
-1	reciprocal transformation
-2	reciprocal square transformation

# Box-Cox
library(MASS)
boxcox(model1)

boxcox(model1b)

boxcox(model2)

boxcox(model2b)

Chapter 6 Transformation

Tianhai Zu (zuti@mail.uc.edu)

Why Transformation?

Example

Transformation

Trial and Error

Artful Science

Transformations on \(y\)

Example

Transformations on \(x\)

Example of Windmill

Box-Cox Transformation