Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This material was initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.
A statistical methodology that utilizes the relationship between two or more quantitative variables so that a response variable or outcome variable can be predicted from the other or others.
Sales Volume
Child Development
Hospital Stay
Sports Performance
Simple Linear Regression (SLR): A single predictor (independent) variable is used for predicting the response or outcome (dependent) variable.
Let us start with a simple example that will run through the entire course. Suppose we own a delivery company, such as FedEx or UPS, and we want to study the relationship between the delivery time of each truck and the number of packages each truck carries. The data we observe consist of:
y: delivery time
x: number of packages to be delivered
# Read in the delivery data (file includes a header row)
delivery <- read.csv("data_delivery.csv", header = TRUE)
# Scatter plot of delivery time against number of cases
plot(delivery$NumberofCases, delivery$DeliveryTime, pch = 20)
# Overlay a candidate straight line y = 5 + 2x (dashed)
abline(a = 5, b = 2, lty = 2)
Not all data points fall on a straight line. Therefore, x and y are not in a perfect linear relationship \(y = a x + b\).
Therefore, to analyze data, we need to relax the perfect linear relationship to something more forgiving.
A simple linear regression model is \[ y_i=\beta_0+\beta_1 x_i + \epsilon_i \] for \(i=1,\dots,n\).
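To make this model concrete, here is a short simulation sketch in R: we generate data from a known simple linear regression model and then recover the parameters with `lm()`. The parameter values (\(\beta_0 = 5\), \(\beta_1 = 2\), \(\sigma = 1\)) are illustrative choices, not values estimated from the delivery data.

```r
set.seed(1)                               # reproducibility
n     <- 100                              # sample size (illustrative)
beta0 <- 5                                # true intercept (assumed value)
beta1 <- 2                                # true slope (assumed value)
sigma <- 1                                # error standard deviation
x   <- runif(n, 0, 10)                    # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)     # errors: epsilon_i ~ N(0, sigma^2)
y   <- beta0 + beta1 * x + eps            # the simple linear regression model
fit <- lm(y ~ x)                          # fit SLR by least squares
coef(fit)                                 # estimates should be close to (5, 2)
```

Because the data are simulated from the model, the fitted intercept and slope land near the true values; rerunning with a different seed gives slightly different estimates, which previews the idea of sampling variability of \(\hat{\beta}_0\) and \(\hat{\beta}_1\).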
\(y_i\) dependent variable (response/outcome).
\(x_i\) independent variable (covariate/regressor/predictor).
\(\beta_0\) - intercept
\(\beta_1\) - slope
\(\epsilon_i\) is normally distributed with a mean of zero and a variance of \(\sigma^2\), \(\epsilon_i \sim N(0,\sigma^2)\)
\(y_i\) are independent normal variables with a mean of \(\beta_0 + \beta_1 x_i\) and a variance of \(\sigma^2\).
Normality assumption for the error term is justifiable in many situations since the error represents the effects of many factors omitted from the model.
Historical Origins: Developed by Sir Francis Galton in late 1800s to study the relationship between the heights of parents and their children.
The response variable, \(y_i\), varies with the predictor variable, \(x_i\), in a systematic fashion. The mean response at any value, x, of the regressor variable is
\(E[y]=\mu=E[\beta_0+\beta_1 x + \epsilon] = \beta_0+\beta_1 x\)
For a given value of \(x\), there is variance in the value of \(y\). The variance of y at any given x is
\(\text{Var}[y]=\text{Var}[\beta_0+\beta_1 x + \epsilon] = \sigma^2\)
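These two facts can be checked by simulation: holding \(x\) fixed, repeated draws of \(y\) should average to \(\beta_0 + \beta_1 x\) and have variance \(\sigma^2\). A minimal sketch in R, with illustrative parameter values:

```r
set.seed(2)
beta0 <- 5; beta1 <- 2; sigma <- 1.5          # illustrative parameter values
x0 <- 4                                       # fix the predictor at one value
y  <- beta0 + beta1 * x0 + rnorm(1e5, 0, sigma)  # many draws of y at x = x0
mean(y)   # close to beta0 + beta1 * x0 = 13
var(y)    # close to sigma^2 = 2.25
```

Note that the variance of \(y\) does not depend on the chosen \(x_0\): changing `x0` shifts the mean but leaves the spread unchanged, which is the constant-variance part of the model assumptions.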
library(asbio)
see.regression.tck()
The true relationship between x and y may be different from what we assume in simple linear regression. For example, it could be nonlinear. See figure below.
However, a simple linear regression requires only two parameters (slope and intercept), whereas a curved relationship requires a much larger number of parameters. Therefore, as long as the true relationship is close to linear, we use simple linear regression anyway.
Lastly, I would like to bring everyone’s attention to the model building process.
When completing the final project, or any linear regression project, we should follow this flow chart to make sure every aspect of the model is safe and sound.