Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. These materials were initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.

Unbiasedness of Least Square Estimates

There are many possible estimates besides the least-squares estimates; for example, a naively simple estimate is to choose two points from the data set, connect them with a line, and call that line an estimate. Why, then, do we use the least-squares estimates? What justifies them? To answer this question, we need unbiasedness.

The least-squares estimates are unbiased estimates of their respective parameters. In other words,

\[E[\hat{\beta}_1]=\beta_1, \quad \quad E[\hat{\beta}_0]=\beta_0 \]

Here, what do we mean by \(E[\hat{\beta}_1]\)? Since \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are calculated from data, they depend on the data. Hypothetically, if we could collect a new data set, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) would be different. Therefore, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) can be considered random variables. Unbiasedness essentially says that the means of these random variables are exactly the true values, \(\beta_0\) and \(\beta_1\). This property justifies the use of the least-squares estimates.

Below we repeat the data collecting process six times, obtain the least-squares estimates (blue line), and compare them with the true parameters (grey line). Note that repeating the data collecting process is impossible in practice, but here we are running a simulation, so it is fine for demonstration purposes.

par(mfrow=c(2,3))   # 2 x 3 grid of plots
n=20                # sample size per data set
for (i in 1:6)
{
  x=runif(n)*3
  y=1+2*x+rnorm(n)  # true model: beta0 = 1, beta1 = 2
  plot(x,y,xlim=c(0,3),ylim=c(0,10),pch=20)
  abline(lm(y~x),col="blue")                  # estimated line
  abline(a=1,b=2,col="gray",lwd=3, lty=3)     # true line
}

As we can see, in these six scenarios, the estimated blue lines are sometimes above and sometimes below the true grey line. If we repeat this process 100 times, the results are summarized in the following figure.

n=20
plot(0,0,xlim=c(0,3),ylim=c(0,10),pch=20,type="n",xlab="x",ylab="y")  # empty canvas
for (i in 1:100)
{
  x=runif(n)*3
  y=1+2*x+rnorm(n)
  abline(lm(y~x),col=adjustcolor("blue", alpha.f = 0.3))  # one estimated line per data set
}
abline(a=1,b=2,col="gray",lwd=3, lty=3)  # true line
legend(0,10,c("true model","estimated model"),lwd=c(3,1),col=c("gray","blue"),lty=c(3,1))

Therefore, we can see that in each scenario we may overestimate or underestimate the true parameter, but on average we are right on target, which demonstrates unbiasedness.
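The "right on target on average" claim can also be checked numerically. The sketch below (variable names are illustrative) repeats the simulation many times and averages the slope estimates; the average should land close to the true slope of 2:

```r
set.seed(123)                      # for reproducibility
n <- 20                            # sample size per data set
B <- 2000                          # number of repetitions
slopes <- replicate(B, {
  x <- runif(n) * 3
  y <- 1 + 2 * x + rnorm(n)        # true model: beta0 = 1, beta1 = 2
  coef(lm(y ~ x))[2]               # least-squares slope estimate
})
mean(slopes)                       # close to the true slope 2
```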

Standard Error of Least Square Estimates

Even though we know the least-squares estimates are unbiased, what is their variability like? To answer this question, we need standard errors.

Similar to the previous section, since \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are calculated from data, they depend on data. Hypothetically, if we could collect a new data set, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) would be different. Therefore, \(\hat{\beta}_0\) and \(\hat{\beta}_1\) can be considered as random variables. The standard errors (SE) are just the standard deviations of these random variables.

The standard errors of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) are \[ \text{SE}(\hat{\beta}_1)=\text{SD}(\hat{\beta}_1)=\sqrt{\text{Var}(\hat{\beta}_1)}=\sqrt{\frac{\sigma^2}{S_{xx}}}, \quad \quad \text{SE}(\hat{\beta}_0)=\text{SD}(\hat{\beta}_0)=\sqrt{\text{Var}(\hat{\beta}_0)}=\sqrt{\sigma^2\Big(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}\Big)} \]

In R, use summary(...) to obtain standard errors of estimates.

Here we present two simulated data sets and compare their standard errors. The second data set has a smaller error standard deviation, so its estimates have smaller standard errors.

par(mfrow=c(1,2))
n=100
x=runif(n)*3
y1=1+2*x+rnorm(n)       # error standard deviation 1
plot(x,y1,xlim=c(0,3),ylim=c(0,10),pch=20)
model_se1=lm(y1~x)
abline(model_se1,col="blue")
y2=1+2*x+rnorm(n)*0.3   # error standard deviation 0.3: less noise
plot(x,y2,xlim=c(0,3),ylim=c(0,10),pch=20)
model_se2=lm(y2~x)
abline(model_se2,col="blue")

summary(model_se1)$coef
##              Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 0.6883245  0.2058671  3.343538 1.172669e-03
## x           2.0424659  0.1190986 17.149364 2.882895e-31
summary(model_se2)$coef
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 0.9581051 0.05514746 17.37351 1.103027e-31
## x           2.0412824 0.03190402 63.98198 9.621453e-82
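To connect the SE formulas with R's output, we can compute the standard errors by hand. The sketch below uses freshly simulated data (variable names are illustrative), with \(\sigma^2\) estimated by the residual mean square \(\hat{\sigma}^2 = SS_{Res}/(n-2)\):

```r
set.seed(1)
n <- 100
x <- runif(n) * 3
y <- 1 + 2 * x + rnorm(n)
m <- lm(y ~ x)

sigma2_hat <- sum(resid(m)^2) / (n - 2)   # estimate of sigma^2
Sxx <- sum((x - mean(x))^2)

se_b1 <- sqrt(sigma2_hat / Sxx)                       # SE of slope
se_b0 <- sqrt(sigma2_hat * (1/n + mean(x)^2 / Sxx))   # SE of intercept

cbind(manual = c(se_b0, se_b1), from_summary = summary(m)$coef[, 2])
```

The two columns agree exactly, because summary() uses these same formulas internally.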

Interpretation of Standard Errors

SE is one of the most important concepts in linear regression. The SE measures the variability, and hence the accuracy, of an estimate: a small SE means the estimate is less variable and more accurate. The SE can be viewed as the “wiggle room” or “leeway” around an estimate within which the fit has approximately the same quality.

Let us get back to the rocket propellant example.

rocket <- read.csv("data_RocketProp.csv",h=T)
model1 <- lm(y ~ x, data=rocket)
summary(model1)$coef
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 2627.82236  44.183912  59.47464 4.063559e-22
## x            -37.15359   2.889107 -12.85989 1.643344e-10
summary(model1)$coef[,2]
## (Intercept)           x 
##   44.183912    2.889107

For intercept estimate \(\hat{\beta}_0\), the standard error is 44.184. For slope estimate \(\hat{\beta}_1\), the standard error is 2.889.

If we think of \(\hat{\beta}_0\) as a random variable, then its standard deviation is 44.184. Note that \(\hat{\beta}_0=2627.822\), far more than twice its SE. Similarly, if we think of \(\hat{\beta}_1\) as a random variable, then its standard deviation is 2.889. Note that \(\hat{\beta}_1=-37.154\), whose magnitude is also more than twice its SE.
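The "more than twice the SE" observation is exactly the t value reported by summary(): the estimate divided by its standard error. We can check this with the printed coefficients:

```r
# t-ratio = estimate / SE; |t| > 2 roughly indicates significance
b0 <- 2627.82236; se0 <- 44.183912   # intercept estimate and SE from the output above
b1 <- -37.15359;  se1 <- 2.889107    # slope estimate and SE
b0 / se0   # about  59.47, matching the t value column
b1 / se1   # about -12.86
```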

Hypothesis Testing for Slope (t-Test)

For the rocket propellant data example, suppose your friend is very skeptical about your results and tells you that, based on his theory, the age of the propellant should not influence the propellant strength at all. In other words, in his opinion, the slope of the simple linear regression, \(\beta_1\), should be 0. Do you believe it? Suppose you obtain the data, run the regression, and get a slope estimate of -37; now do you believe him? What if your estimate is -5 instead of -37? Obviously, the decision depends on the data. To answer this question, we need hypothesis testing.

The null hypothesis is \(H_0:\beta_1=0\) (i.e., what your friend claims) — versus — the alternative hypothesis is \(H_1:\beta_1 \neq 0\) (i.e., the opposite of what your friend claims).

Our decision can be only one of the following two: reject \(H_0\), or do not reject \(H_0\).

Since the truth can be only one of the two hypotheses (i.e., your friend is either right or wrong), we have in total four scenarios when making a decision for this hypothesis testing problem.

All four scenarios are summarized in the following table. Two of these four scenarios lead to mistakes (i.e., type I and II errors).

Two decisions below When \(H_0\) is true When \(H_0\) is not true
Rejecting \(H_0\) leads to: Scenario 1: type I error Scenario 2: no error
Not rejecting \(H_0\) leads to: Scenario 3: no error Scenario 4: type II error

Obviously, we want to reduce both the type I and type II errors. However, the two errors are like the two ends of a seesaw: when one goes down, the other goes up. Therefore, we can only limit one type of error. We choose to limit the type I error, because it has the more serious consequences. To understand this point, think of \(H_0\) as “the person is innocent” and \(H_1\) as “the person is guilty,” and think of rejecting \(H_0\) as a guilty verdict and not rejecting \(H_0\) as an innocent verdict. Giving an innocent person a guilty verdict is more serious than giving a guilty person an innocent verdict.

We usually control the type I error to be no larger than 0.05 = 5%. To implement this, we use the p-value approach to help us make a decision. The procedure is summarized below.

When the p-value is …, our decision is …, which implies …
p-value smaller than or equal to 0.05: reject \(H_0\), which implies the slope is nonzero
p-value larger than 0.05: do not reject \(H_0\), which implies we have no evidence of a nonzero slope

Rejecting \(H_0\) implies that the corresponding slope coefficient is significantly different from zero, which means the corresponding regressor is significant in explaining the response variable. Therefore, this test is essentially testing the significance of the regression; that is, is there a linear relationship between the response and the covariate? If we do not reject \(H_0\), we have no evidence of a linear relationship between \(y\) and \(x\).

To calculate the p-value, we rely on the sample size \(n\) and the t value, \(t=\hat{\beta}_1/\text{SE}(\hat{\beta}_1)\). If the t value is large in absolute value, the p-value is small. A useful rule of thumb: if \(|t|>2\), the p-value is approximately smaller than 0.05.
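The t value and p-value can be reproduced by hand from the estimate and its SE. A sketch on simulated data (the p-value comes from the t distribution with \(n-2\) degrees of freedom):

```r
set.seed(2)
n <- 25
x <- runif(n) * 3
y <- 1 + 2 * x + rnorm(n)
m <- lm(y ~ x)

b1  <- summary(m)$coef[2, 1]              # slope estimate
se1 <- summary(m)$coef[2, 2]              # its standard error
t_val <- b1 / se1                         # t statistic
p_val <- 2 * pt(-abs(t_val), df = n - 2)  # two-sided p-value

c(t_val, p_val)   # match summary(m)$coef[2, 3:4]
```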

Now let us go back to our rocket propellant example.

# summary of hypothesis testing results
summary(model1)$coef
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 2627.82236  44.183912  59.47464 4.063559e-22
## x            -37.15359   2.889107 -12.85989 1.643344e-10
# p value
summary(model1)$coef[2,4]
## [1] 1.643344e-10

Since the p-value \(=1.64\times 10^{-10}\) is smaller than 0.05, the result is significant. Therefore, we conclude that we have a significantly nonzero slope and there is a significant linear relationship between \(y\) and \(x\).

Hypothesis Testing for Intercept (t-Test)

Everything is the same as in the previous section. Using the output above, since the p-value of the intercept is smaller than 0.05, we conclude that we have a significantly nonzero intercept. See below for more details.

# summary of hypothesis testing results
summary(model1)$coef
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 2627.82236  44.183912  59.47464 4.063559e-22
## x            -37.15359   2.889107 -12.85989 1.643344e-10
# p value
summary(model1)$coef[1,4]
## [1] 4.063559e-22

Interval Estimation for Slope and Intercept

For the rocket propellant data set, we have the slope estimate \(\hat{\beta}_1\), which is just one value. However, what if we want to know a possible range of values for the slope? To answer this question, we need the interval estimation.

Instead of giving ONE estimate, i.e., point estimate, we can give a window estimate. This window estimate will contain the true parameter with a certain probability (i.e., a confidence level). Usually, we set this confidence level to 0.95.

Pro: can capture uncertainty of the estimate

Con: more complex.

Interval estimation has a similar flavor as standard error.

confint(model1,level=0.95)
##                  2.5 %    97.5 %
## (Intercept) 2534.99540 2720.6493
## x            -43.22338  -31.0838
confint(model1,level=0.90)
##                    5 %      95 %
## (Intercept) 2551.20465 2704.4401
## x            -42.16349  -32.1437
confint(model1,level=0.99)
##                  0.5 %     99.5 %
## (Intercept) 2500.64160 2755.00312
## x            -45.46971  -28.83747
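The intervals confint() reports follow the formula \(\hat{\beta}_1 \pm t_{\alpha/2,\,n-2}\,\text{SE}(\hat{\beta}_1)\), using the critical value of the t distribution with \(n-2\) degrees of freedom. A sketch on simulated data:

```r
set.seed(3)
n <- 30
x <- runif(n) * 3
y <- 1 + 2 * x + rnorm(n)
m <- lm(y ~ x)

est <- coef(m)[2]                      # slope estimate
se  <- summary(m)$coef[2, 2]           # its standard error
tcrit <- qt(0.975, df = n - 2)         # critical value for a 95% interval

c(est - tcrit * se, est + tcrit * se)  # matches confint(m, "x", level = 0.95)
```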

Oftentimes, we set level=0.95. However, we should also be aware of the extreme cases below: at level=1 the interval is the entire real line, and at level=0 it collapses to the point estimate.

confint(model1,level=1)
##              0 % 100 %
## (Intercept) -Inf   Inf
## x           -Inf   Inf
confint(model1,level=0)
##                   50 %       50 %
## (Intercept) 2627.82236 2627.82236
## x            -37.15359  -37.15359

Interpretation of Interval Estimation

At level=0.95, the interval estimate for the intercept is [2534.99540, 2720.6493], which means [2534.99540, 2720.6493] contains the true intercept with 95% confidence.

The interval estimate for the slope is [-43.22338, -31.0838], which means [-43.22338, -31.0838] contains the true slope with 95% confidence.

Note that this interval does not contain zero, which is why the hypothesis testing result for \(\beta_1\) is significant (i.e., rejecting \(H_0: \beta_1=0\)). This is called the duality of hypothesis testing and interval estimation. If the interval estimation contains zero, the hypothesis testing will fail to reject the null hypothesis with p-value larger than 0.05. On the other hand, if the interval estimation does not contain zero, the hypothesis testing will indeed reject the null hypothesis with p-value smaller than 0.05.
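This duality can be verified directly in R: for any fitted model, the 95% interval excludes zero exactly when the p-value is below 0.05, because both are built from the same t statistic. A sketch on simulated data (a weak true slope is used so either outcome is plausible):

```r
set.seed(4)
n <- 40
x <- runif(n) * 3
y <- 1 + 0.3 * x + rnorm(n)   # weak true slope
m <- lm(y ~ x)

p_val <- summary(m)$coef[2, 4]
ci <- confint(m, "x", level = 0.95)

# these two statements are always both TRUE or both FALSE
p_val <= 0.05
(ci[1] > 0 | ci[2] < 0)
```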