Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. These materials were initially developed by Yichen Qin and modified by Tianhai Zu for teaching purposes.

Simple Linear Regression

A simple linear regression model assumes \[ y_i=\beta_0+\beta_1 x_i + \epsilon_i \] for \(i=1,...,n\).

\(y_i\): dependent variable (response/outcome).

\(x_i\): independent variable (covariate/regressor/predictor).

\(\beta_0\): intercept. It is the mean of the response \(y\) when \(x = 0\), provided that \(x = 0\) lies within the range of the observed \(x_i\)’s; if \(x = 0\) is not in that range, then \(\beta_0\) has no natural interpretation.

\(\beta_1\): slope. It is the change in the mean of the response \(y\) produced by a unit increase in \(x\).

\(\epsilon_i\): random error. It is normally distributed with mean zero and variance \(\sigma^2\), i.e., \(\epsilon_i \sim N(0,\sigma^2)\).

The response variable \(y_i\) varies with the predictor variable \(x_i\) in a systematic fashion. The mean response, at any value \(x\) of the regressor variable, is

\(E[y]=\mu=E[\beta_0+\beta_1 x + \epsilon] = \beta_0+\beta_1 x\)

For a given value of \(x\), there is variance in the value of \(y\). The variance of y at any given x is

\(\text{Var}[y]=\text{Var}[\beta_0+\beta_1 x + \epsilon] = \sigma^2\)

There are infinitely many normal distributions in the previous figure; I only plot two of them. If you were to plot all of them, you would see a “ridge”.

library(asbio)       # package providing interactive teaching demonstrations
see.regression.tck() # interactive demonstration of the regression model
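As a static alternative, here is a minimal simulation sketch (the values \(\beta_0=1\), \(\beta_1=2\), \(\sigma=1\) are assumptions chosen only for illustration): at each fixed \(x\), the simulated responses are centered at \(\beta_0+\beta_1 x\) and share the same variance \(\sigma^2\).

beta0 <- 1; beta1 <- 2; sigma <- 1                                # assumed true values for this sketch
y.at.x1 <- beta0 + beta1 * 1 + rnorm(10000, mean = 0, sd = sigma) # many responses at x = 1
y.at.x2 <- beta0 + beta1 * 2 + rnorm(10000, mean = 0, sd = sigma) # many responses at x = 2
c(mean(y.at.x1), mean(y.at.x2))                                   # close to beta0 + beta1 * x, i.e., 3 and 5
c(var(y.at.x1), var(y.at.x2))                                     # both close to sigma^2 = 1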

Estimation of \(\beta_0\) and \(\beta_1\)

In this course, we assume that this simple linear regression, \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\), is the true model governing the relationship between \(x_i\) and \(y_i\). For example, suppose \(x_i\) is height and \(y_i\) is weight. For any person with a given height \(x_i\), their weight follows a normal distribution with mean \(\beta_0+\beta_1 x_i\) and variance \(\sigma^2\). However, \(\beta_0\) and \(\beta_1\) are unknown in practice and must be estimated from data.

Hypothetically, if we could afford the cost of collecting the height and weight of everyone in the world, we would see that, for all the people with height \(x\), the average weight is exactly \(\beta_0+\beta_1 x\) and each person’s weight exactly follows a normal distribution with mean \(\beta_0+\beta_1 x\) and variance \(\sigma^2\).

However, we cannot afford to observe everyone in the world.

In practice, we are able to observe only some of the people in the world (a subset of all people, i.e., a sample) in our study, and we use this observed sample to guess (i.e., estimate) the true model parameters \(\beta_0\) and \(\beta_1\). So the question becomes: given \(n\) observed data points \((x_1, y_1)\),…,\((x_n,y_n)\), what are the estimates of \(\beta_0\) and \(\beta_1\) based on these data points?

Suppose we observe the following data points. Which one of the following straight lines seems to be the best estimate of \(\beta_0\) and \(\beta_1\), and why?

The last one seems to be the most reasonable one.

Least Squares Estimates

Least squares estimation seeks to minimize the sum of squares of the differences between the observed response, \(y_i\), and the straight line.

\[ S(a,b)=\sum_{i=1}^{n} (y_i-a-b x_i)^2 \]

In other words, for a candidate straight line with intercept \(a\) and slope \(b\), we first obtain the vertical distances between the data points and the line and sum over all the squared distances to obtain \(S(a,b)\); we then find, among all possible straight lines, the one that minimizes this sum of squared distances. This straight line corresponds to our estimates of the intercept \(\beta_0\) and slope \(\beta_1\). Let us refer back to the previous three lines.

Here is another demonstration: (http://www.dangoldstein.com/regression.html)

Let \(\hat{\beta}_0\) and \(\hat{\beta}_1\) represent the intercept and slope of the straight line that minimizes the sum of squared distances \(S\). We call \(\hat{\beta}_0\) and \(\hat{\beta}_1\) the least squares estimates of \(\beta_0\) and \(\beta_1\). In other words,

\[ (\hat{\beta}_0,\hat{\beta}_1)=\arg\min_{a,b} S(a,b) \]

In fact, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) can be succinctly expressed as

\[ \hat{\beta}_1=\frac{\sum_{i=1}^{n} (y_i-\bar{y})(x_i-\bar{x})}{\sum_{i=1}^{n} (x_i-\bar{x})^2} \\ \hat{\beta}_0= \bar{y} - \hat{\beta}_1 \bar{x}, \]
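These closed-form formulas are exactly what R computes when fitting a linear model. As a quick check on simulated data (a minimal sketch; the names and the assumed true values \(\beta_0=1\) and \(\beta_1=2\) below are chosen only for this illustration):

set.seed(1)                        # for reproducibility of this sketch
x <- runif(30, min = 0, max = 3)   # simulated predictor
y <- 1 + 2 * x + rnorm(30)         # simulated response with assumed beta0 = 1, beta1 = 2
b1.hat <- sum((y - mean(y)) * (x - mean(x))) / sum((x - mean(x))^2) # slope formula
b0.hat <- mean(y) - b1.hat * mean(x)                                # intercept formula
c(b0.hat, b1.hat)
coef(lm(y ~ x))                    # identical values from lm()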

Example of Rocket Propellant

Let us take a look at a real data example. Suppose we want to study, for each barrel of rocket propellant, the relationship between the rocket propellant strength (in psi) and the age of the propellant (in weeks), and eventually make predictions of strength for various ages. The response and covariate are

y: propellant strength

x: age of the propellant

rocket <- read.csv("data_RocketProp.csv", header=TRUE)
rocket
##          y     x
## 1  2158.70 15.50
## 2  1678.15 23.75
## 3  2316.00  8.00
## 4  2061.30 17.00
## 5  2207.50  5.50
## 6  1708.30 19.00
## 7  1784.70 24.00
## 8  2575.00  2.50
## 9  2357.90  7.50
## 10 2256.70 11.00
## 11 2165.20 13.00
## 12 2399.55  3.75
## 13 1779.80 25.00
## 14 2336.75  9.75
## 15 1765.30 22.00
## 16 2053.50 18.00
## 17 2414.40  6.00
## 18 2200.50 12.50
## 19 2654.20  2.00
## 20 1753.70 21.50

We assume \(x\) and \(y\) are in a simple linear regression relationship, but we still do not know the slope and intercept of the simple linear regression. Hence, we need to estimate them.

model1 <- lm(y ~ x, data=rocket) # obtain least square estimate
plot(rocket$x,rocket$y,pch=20) 
abline(model1)

summary(model1)
## 
## Call:
## lm(formula = y ~ x, data = rocket)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -215.98  -50.68   28.74   66.61  106.76 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2627.822     44.184   59.48  < 2e-16 ***
## x            -37.154      2.889  -12.86 1.64e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 96.11 on 18 degrees of freedom
## Multiple R-squared:  0.9018, Adjusted R-squared:  0.8964 
## F-statistic: 165.4 on 1 and 18 DF,  p-value: 1.643e-10

Interpretation of Least Square Estimates

\(\hat{\beta}_0\), the estimated intercept, is the estimated mean of the response y, when \(x = 0\).

\(\hat{\beta}_1\), the estimated slope, is the estimated change in the mean of the response variable \(y\) produced by a unit change in \(x\).

\(\hat{\beta}_0\) is 2627.822. When age=0, the mean strength of the rocket propellant is estimated to be 2627.822 psi.

\(\hat{\beta}_1\) is -37.154. The mean strength is estimated to decrease by 37.154 psi for each one-week increase in the propellant age.
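The estimated coefficients can also be extracted directly from the fitted model object, rather than read off the summary() output:

coef(model1)        # (Intercept) 2627.822, x -37.154, matching summary(model1)
model1$coefficients # the same estimates stored inside the model object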

Fitted Values (Or Predicted Values)

For each data point in the rocket propellant data set, at a given age \(x_i\), what is the estimated mean propellant strength? To answer this question, we need fitted values.

The straight line with estimated intercept \(\hat{\beta}_0\) and estimated slope \(\hat{\beta}_1\) is called the fitted linear regression line. Using these estimates, we can further obtain the fitted value \(\hat{y}_i\) (or predicted value), which is defined as

\[ \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i \]

The fitted value represents the estimated mean of the response \(y_i\) at a given value of \(x_i\). It can also be considered as the prediction of the response \(y_i\) at a given \(x_i\).

To obtain the fitted values in R, we use

model1$fitted.values
##        1        2        3        4        5        6        7        8 
## 2051.942 1745.425 2330.594 1996.211 2423.478 1921.904 1736.136 2534.938 
##        9       10       11       12       13       14       15       16 
## 2349.170 2219.133 2144.826 2488.496 1698.983 2265.575 1810.443 1959.058 
##       17       18       19       20 
## 2404.901 2163.402 2553.515 1829.020
plot(rocket$x,rocket$y,pch=20)
points(rocket$x,model1$fitted.values,pch=4)
abline(model1)

Interpretation of Fitted Values

For the first observation, \(y_1\) is 2158.70 and \(x_1\) is 15.50.

The fitted value of the first observation is 2051.942, which means that, given x=15.50, according to our fitted linear regression, the mean of the response variable y is estimated to be 2051.942. In other words, the estimated mean response of y at x=15.50 is 2051.942.

So the observed value \(y_1=2158.70\) is a little above the estimated mean response.

If we are asked to predict the strength for age=15.50, the prediction is 2051.942.
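The same prediction can be obtained with R’s predict() function; x = 15.50 reproduces the first fitted value, and a new age (10 weeks here, an arbitrary value used only for illustration) is handled the same way.

predict(model1, newdata = data.frame(x = 15.50)) # 2051.942, the estimated mean strength at age 15.50
predict(model1, newdata = data.frame(x = 10))    # predicted mean strength at an age not in the data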

Residuals

For each data point in the rocket propellant data set, at a given age \(x_i\), how far is the observed propellant strength \(y_i\) from the estimated mean propellant strength? In other words, how unusual is this observed propellant strength? To answer this question, we need residuals.

Using the fitted values, we can further define residuals, \(e_i\), as

\[ e_i = y_i - \hat{y}_i \]

Residuals are used to determine the adequacy of the model. Residuals also represent prediction errors. Note that fitted values are essentially predictions.

model1$residuals
##           1           2           3           4           5           6 
##  106.758301  -67.274574  -14.593631   65.088687 -215.977609 -213.604131 
##           7           8           9          10          11          12 
##   48.563824   40.061618    8.729573   37.567141   20.374323  -88.946393 
##          13          14          15          16          17          18 
##   80.817415   71.175153  -45.143358   94.442278    9.499187   37.097528 
##          19          20 
##  100.684823  -75.320154
plot(rocket$x,rocket$y,pch=20)
points(rocket$x,model1$fitted.values,pch=4)
# draw a vertical segment from each fitted value to its observed value;
# the length of each segment is the absolute value of the residual
for (i in 1:dim(rocket)[1])
{
  lines(c(rocket$x[i],rocket$x[i]),
        c(model1$fitted.values[i],rocket$y[i]))
}

Interpretation of Residuals

The first observation’s residual is 106.758301. It means the observed value of \(y_1\) is 106.758301 above the estimated mean response.
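We can verify this directly from the definition \(e_i = y_i - \hat{y}_i\), and also confirm that every observation decomposes as \(y_i = \hat{y}_i + e_i\):

rocket$y[1] - model1$fitted.values[1]                                # 2158.70 - 2051.942 = 106.758
all.equal(rocket$y, unname(model1$fitted.values + model1$residuals)) # TRUE: y_i = y.hat_i + e_i for every i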

Summary of Concepts

\(i\) is the index of data point, \(i=1,...,n\).

\(n\) is the sample size (i.e., how many data points are in the sample).

\(y_i\) is the response/outcome/dependent variable (observed and known).

\(x_i\) is the predictor/regressor/covariate/independent variable (observed and known).

\(\epsilon_i\) is the random error (unknown).

\(\beta_0\) is the intercept (unknown).

\(\beta_1\) is the slope (unknown).

\(\beta_0 + \beta_1 x_i\) is the mean response, i.e., the mean value of response variable \(y_i\) at a given \(x_i\), (unknown).

\(\hat{\beta}_0\) is the estimate of \(\beta_0\) (calculated from data).

\(\hat{\beta}_1\) is the estimate of \(\beta_1\) (calculated from data).

\(\hat{y}_i=\hat{\beta}_0 + \hat{\beta}_1 x_i\) is the fitted value (calculated from data). It is also the estimate of the mean response of \(y_i\) at \(x_i\).

\(e_i\) is the residual (calculated from data). It is also the estimate of \(\epsilon_i\).

True linear regression model: \(y_i=\beta_0 +\beta_1 x_i +\epsilon_i\) (dash line in the next figure).

Fitted linear regression model: \(y_i=\hat{\beta}_0 +\hat{\beta}_1 x_i +e_i = \hat{y}_i + e_i\) (solid line in the next figure).

In practice, when we observe real data and plot the regression line, is that line the dashed one or solid one in the previous figure?

Ans: Solid. The solid line is the fitted linear regression line; the dashed line is the true model, which we never know.

n = 20 # sample size
beta0 = 1 # true intercept
beta1 = 2 # true slope
x = runif(n, min=0,max=3) # simulate predictor x from a uniform distribution
epsilon = rnorm(n) # simulate random error epsilon
y = beta0 + beta1 * x + epsilon # simulate y according to the true model
plot(x, y, pch=20, xlim=c(0,3), ylim=c(0,9))
abline(a=beta0, b=beta1, lwd=3,col="gray",lty=2) # plot the true model
model1 = lm(y~x)
abline(model1, col="blue") # plot the estimated model
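Because the fitted line is computed from a random sample, it changes from sample to sample while the true (dashed) line stays fixed. The short extension below (a sketch reusing the same assumed true model and sample size from the code above) overlays the fitted lines from ten independent simulated samples:

plot(NULL, xlim=c(0,3), ylim=c(0,9), xlab="x", ylab="y")
abline(a=beta0, b=beta1, lwd=3, col="gray", lty=2) # true model, fixed across samples
for (r in 1:10) {                                  # ten independent samples of size n
  x.new <- runif(n, min=0, max=3)
  y.new <- beta0 + beta1 * x.new + rnorm(n)
  abline(lm(y.new ~ x.new), col="blue")            # fitted line varies with the sample
}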