Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This material was initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.
A statistical methodology that utilizes the relationship between two or more quantitative variables so that a response variable or outcome variable can be predicted from the other or others.
Sales Volume
Child Development
Hospital Stay
Sports Performance
Simple Linear Regression (SLR): A single predictor (independent) variable is used for predicting the response or outcome (dependent) variable.
Let us start with a simple example that will run through the entire course. Suppose we own a delivery company, such as FedEx or UPS, and we want to study the relationship between the delivery time of each truck and the number of packages each truck carries. The data we observe consist of:
y: delivery time
x: number of packages to be delivered
# Read in the delivery data (file includes a header row)
delivery <- read.csv("data_delivery.csv", header = TRUE)
# Scatter plot of delivery time against number of cases
plot(delivery$NumberofCases, delivery$DeliveryTime, pch = 20)
# Overlay a candidate straight line y = 5 + 2x (dashed)
abline(a = 5, b = 2, lty = 2)
Not all data points fall on a straight line. Therefore, x and y are not in a perfect linear relationship \(y = a x + b\).
Therefore, to analyze data, we need to relax the perfect linear relationship to something more forgiving.
A simple linear regression model is \[ y_i=\beta_0+\beta_1 x_i + \epsilon_i \] for \(i=1,\dots,n\).
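To make this model concrete, here is a short simulation sketch in R: we generate data from a known simple linear regression model and then recover the parameters with `lm()`. The parameter values (\(\beta_0 = 5\), \(\beta_1 = 2\), \(\sigma = 1\)) are illustrative choices, not values estimated from the delivery data.

```r
set.seed(1)                               # reproducibility
n     <- 100                              # sample size (illustrative)
beta0 <- 5                                # true intercept (assumed value)
beta1 <- 2                                # true slope (assumed value)
sigma <- 1                                # error standard deviation
x   <- runif(n, 0, 10)                    # predictor values
eps <- rnorm(n, mean = 0, sd = sigma)     # errors: epsilon_i ~ N(0, sigma^2)
y   <- beta0 + beta1 * x + eps            # the simple linear regression model
fit <- lm(y ~ x)                          # fit SLR by least squares
coef(fit)                                 # estimates should be close to (5, 2)
```

Because the data are simulated from the model, the fitted intercept and slope land near the true values; rerunning with a different seed gives slightly different estimates, which previews the idea of sampling variability of \(\hat{\beta}_0\) and \(\hat{\beta}_1\).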
\(y_i\) dependent variable (response/outcome).
\(x_i\) independent variable (covariate/regressor/predictor).
\(\beta_0\) - intercept
\(\beta_1\) - slope
\(\epsilon_i\) is normally distributed with a mean of zero and a variance of \(\sigma^2\), \(\epsilon_i \sim N(0,\sigma^2)\)
\(y_i\) are independent normal variables with a mean of \(\beta_0 + \beta_1 x_i\) and a variance of \(\sigma^2\).
Normality assumption for the error term is justifiable in many situations since the error represents the effects of many factors omitted from the model.
Historical Origins: Developed by Sir Francis Galton in late 1800s to study the relationship between the heights of parents and their children.
The response variable, \(y_i\), varies with the predictor variable, \(x_i\), in a systematic fashion. The mean response at any value, x, of the regressor variable is
\(E[y]=\mu=E[\beta_0+\beta_1 x + \epsilon] = \beta_0+\beta_1 x\)
For a given value of \(x\), there is variance in the value of \(y\). The variance of y at any given x is
\(\text{Var}[y]=\text{Var}[\beta_0+\beta_1 x + \epsilon] = \sigma^2\)
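These two facts can be checked by simulation: holding \(x\) fixed, repeated draws of \(y\) should average to \(\beta_0 + \beta_1 x\) and have variance \(\sigma^2\). A minimal sketch in R, with illustrative parameter values:

```r
set.seed(2)
beta0 <- 5; beta1 <- 2; sigma <- 1.5          # illustrative parameter values
x0 <- 4                                       # fix the predictor at one value
y  <- beta0 + beta1 * x0 + rnorm(1e5, 0, sigma)  # many draws of y at x = x0
mean(y)   # close to beta0 + beta1 * x0 = 13
var(y)    # close to sigma^2 = 2.25
```

Note that the variance of \(y\) does not depend on the chosen \(x_0\): changing `x0` shifts the mean but leaves the spread unchanged, which is the constant-variance part of the model assumptions.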
library(asbio)
see.regression.tck()
The true relationship between x and y may be different from what we assume in simple linear regression. For example, it could be nonlinear. See figure below.
However, a simple linear regression requires only two parameters (slope and intercept), whereas a curved relationship requires a much larger number of parameters. Therefore, as long as the true relationship is close to linear, we use simple linear regression anyway.
Lastly, I would like to bring everyone’s attention to the model building process.
When completing the final project, or any linear regression project, we should follow this flow chart to make sure every aspect of the model is safe and sound.