Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This material was initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.
For the rocket propellant data set, we have built a simple linear regression between the strength \(y\) and the age \(x\). However, how strong is the relationship between \(x\) and \(y\)? To answer this question, we need the coefficient of determination, \(R^2\).
Before we start, let us first go over a few notations:
\(y_i\) is the observed response value.
\(\bar{y}=\frac{1}{n}\sum_{i=1}^{n} y_i\) is the mean of all \(y_i\)’s.
\(\hat{y}_i\) is the fitted value at \(x_i\); it also represents the estimated mean response at \(x_i\).
With the help of these notations, we can start by partitioning the total variability of \(y_i\) into two components.
\[ y_i-\bar{y} = (y_i - \hat{y}_i) + (\hat{y}_i -\bar{y}) \]
\[ \underbrace{\sum_{i=1}^{n} (y_i-\bar{y})^2}_{SST} = \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{SSRes} + \underbrace{\sum_{i=1}^{n} (\hat{y}_i -\bar{y})^2}_{SSR} \]
Here is a figure illustrating this decomposition.
\(R^2\) is defined as \[ R^2=\frac{SSR}{SST} = \frac{\sum_{i=1}^{n} (\hat{y}_i -\bar{y})^2}{\sum_{i=1}^{n} (y_i-\bar{y})^2} = 1-\frac{SSRes}{SST} \] \(R^2\) represents the proportion of variation in response \(y\) that can be explained by the regressor \(x\). It measures how strong the linear relationship between the response \(y\) and covariate \(x\) is. Here are two examples.
The black dots are the actual data points, i.e., \((x_i,y_i)\). The black “x”s are the fitted values, i.e., \((x_i, \hat{y}_i)\). We can project all the data points and their fitted values onto the y-axis as blue dots and blue “x”s, i.e., \((0,y_i)\) and \((0,\hat{y}_i)\). The red circle indicates the center of the data points, i.e., \((\bar{x},\bar{y})\). We also project the center of the data onto the y-axis as a blue circle, i.e., \((0,\bar{y})\). The variation in the vertical direction of the blue dots is therefore represented by \(SST=\sum_{i=1}^{n}(y_i-\bar{y})^2\), and the variation in the vertical direction of the blue “x”s is represented by \(SSR=\sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2\). \(R^2\) is essentially the ratio \(SSR/SST\), which, loosely speaking, represents how much of the variation of the blue dots can be explained by the variation of the blue “x”s.
To understand the magnitude of \(R^2\): if the data points are very close to the fitted regression line, the blue “x”s will be close to the blue dots, which means \(R^2\) will be high (the right figure above). On the other hand, if the data points are far from the fitted regression line, the blue “x”s will be far from the blue dots, which means \(R^2\) will be low (the left figure above).
Since \(R^2\) is a proportion, it is always a number between 0 and 1.
If \(R^2\) = 1, all of the data points fall perfectly on the regression line. The predictor x accounts for all of the variation in y!
If \(R^2\) = 0, the estimated regression line is perfectly horizontal. The predictor x accounts for none of the variation in y!
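To see these two extremes concretely, here is a small simulated sketch (made-up data, not the rocket propellant set): when \(y\) is essentially an exact linear function of \(x\), \(R^2\) is essentially 1; when \(y\) is pure noise unrelated to \(x\), \(R^2\) is close to 0.
# Simulated illustration of the two extremes of R^2 (hypothetical data).
set.seed(1)
x_sim   <- 1:50
y_line  <- 3 + 2 * x_sim + rnorm(50, sd = 0.01)   # essentially a perfect line
y_noise <- rnorm(50)                              # no relationship with x_sim
summary(lm(y_line ~ x_sim))$r.squared    # very close to 1
summary(lm(y_noise ~ x_sim))$r.squared   # very close to 0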
When interpreting \(R^2\), either of the two descriptions given above (the proportion of variation in \(y\) explained by \(x\), or the strength of the linear relationship between them) is fine.
To obtain \(R^2\), we use the summary(...) function. For the rocket propellant data, we have
# Fit the simple linear regression of strength (y) on age (x).
rocket <- read.csv("data_RocketProp.csv", header = TRUE)
model1 <- lm(y ~ x, data = rocket)
# Extract R^2 from the model summary.
summary(model1)$r.squared
## [1] 0.9018414
This means the \(R^2\) for the linear regression fitted to the rocket propellant data set is 0.9018, i.e., approximately 90% of the variation in \(y\) can be explained by the variation in \(x\).
Meanwhile, we can also calculate \(R^2\) manually by first computing SST, SSR, and SSRes and then taking the ratio. Here is an example for the rocket propellant data set.
# manual way to calculate R^2, SST, SSR, and SSRes.
SST = sum((rocket$y-mean(rocket$y))^2)
SST
## [1] 1693738
SSR = sum((model1$fitted.values-mean(rocket$y))^2)
SSR
## [1] 1527483
SSRes = sum((rocket$y-model1$fitted.values)^2)
SSRes
## [1] 166254.9
SSR/SST # R^2 = SSR/SST.
## [1] 0.9018414
1-SSRes/SST # R^2 = 1- SSRes/SST, equivalent to the previous one.
## [1] 0.9018414
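As a quick sanity check, the three sums of squares computed above satisfy the identity SST = SSR + SSRes, and the resulting ratio matches the value reported by summary().
# Verify the decomposition and compare with the R^2 reported by summary().
all.equal(SST, SSR + SSRes)
all.equal(SSR/SST, summary(model1)$r.squared)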
One of the most frequently asked questions is “Is my \(R^2\) large enough or good enough?”. The answer really depends on the research area you are in.
Social science: relatively low \(R^2\) values (sometimes well below 0.5) are often considered acceptable, because individual human behavior is inherently hard to predict.
Engineering: much higher \(R^2\) values are typically expected, because controlled physical processes can be measured with much less noise.
Conclusion: read the literature to learn what typical \(R^2\) values are for your research area!
The \(R^2\) quantifies the strength of a linear relationship. It is possible that \(R^2 = 0\), suggesting there is no linear relationship between \(x\) and \(y\), and yet a perfect curved (or “curvilinear”) relationship exists.
A large \(R^2\) value should not be interpreted as meaning that the estimated regression line fits the data well; for example, the data may follow a slightly curved relationship almost perfectly, yet a straight line can still produce a large \(R^2\). What should we use to evaluate the goodness of fit? Residuals (Chapter 4).
The \(R^2\) can be greatly affected by just one data point (or a few data points).
High \(R^2\) or high correlation (or association) does not imply causation.
\(R^2\) obtained based on rates or averages tends to overstate the strength of an association, e.g., practical.txt.
A large \(R^2\) value does not necessarily mean that a useful prediction of the response, or estimation of the mean response, can be made. It is still possible to get prediction intervals or confidence intervals that are too wide to be useful.
In short, \(R^2\) can be misleading if these caveats are ignored.
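To illustrate the first caveat above, here is a small simulated sketch (hypothetical data): \(y\) is an exact quadratic function of \(x\), a perfect curved relationship, yet the simple linear regression has an \(R^2\) of essentially zero because there is no linear trend.
# Hypothetical example: a perfect curved relationship with R^2 near 0.
x_curve <- -10:10
y_curve <- x_curve^2                         # y is exactly determined by x, but not linearly
summary(lm(y_curve ~ x_curve))$r.squared     # essentially 0; the fitted line is flat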
For the rocket propellant data set, suppose we have a new barrel of propellant and we only know it has been stored for 14 weeks, i.e., age = 14. How can we estimate the strength of this barrel of rocket propellant? To answer this question, we need prediction.
The ultimate objective of simple linear regression is to make predictions. Let \(x^{new}\) be the value of the covariate at which we want to make a prediction; our prediction for \(y\) is simply
\[ \hat{\beta}_0+\hat{\beta}_1 x^{new} \]
If we make a prediction at one of the observed covariate values \(x_i\), the prediction is the same as the fitted value for that data point, \(\hat{y}_i\).
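Before turning to R's built-in function, note that this prediction can be computed directly from the estimated coefficients. For example, at age = 14 (a small sketch using the model1 object fitted above):
# Manual prediction at x = 14 from the estimated intercept and slope.
b <- coef(model1)      # b[1] is the estimated intercept, b[2] the estimated slope
b[1] + b[2] * 14       # matches the predict() result below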
To make predictions, we use the predict(...) function. Suppose we want to predict the rocket propellant strength at age = 14.
newx <- c(14)
pred <- predict(model1, newdata=data.frame(x=newx),type="response")
pred
## 1
## 2107.672
plot(rocket$x, rocket$y, pch = 20)        # observed data
abline(model1, col = "blue")              # fitted regression line
points(newx, pred, pch = 1, col = "red")  # prediction at age = 14
For the rocket propellant data set, our prediction is 2107.672. How confident are we about this prediction? What is the likelihood that it is close to the actual strength? To answer these questions, we need prediction intervals.
In addition to the point prediction, we can also obtain an interval for our prediction using the predict(...) function. A prediction interval provides more information: with a given confidence level, such as 95%, the interval will contain the new response value. It can be obtained as follows.
pred <- predict(model1,newdata=data.frame(x=newx),interval = c("prediction"),level = 0.95, type="response")
pred
## fit lwr upr
## 1 2107.672 1900.738 2314.606
Therefore, the 95% prediction interval is from 1900.738 to 2314.606, meaning that the new observation's strength falls in this interval with 95% probability.
In addition, there is another type of interval, called a confidence interval, which focuses on the mean of the response. It can be obtained as follows.
conf <- predict(model1,newdata=data.frame(x=newx),interval = c("confidence"),level = 0.95, type="response")
conf
## fit lwr upr
## 1 2107.672 2062.358 2152.986
This interval, from 2062.358 to 2152.986, is much narrower. It means that the mean rocket propellant strength at age = 14 is within this interval with 95% confidence. Note the difference between the two intervals: the first is for a new observed strength (which is a random variable), whereas the second is for the mean strength (which is a constant). The first is usually called a prediction interval, whereas the second is called a confidence interval for the mean response.
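For reference, both intervals can also be reproduced from the standard formulas: at a new value \(x^{new}\), the half-width of the prediction interval is \(t_{\alpha/2, n-2}\sqrt{MS_{Res}\left(1+\frac{1}{n}+\frac{(x^{new}-\bar{x})^2}{S_{xx}}\right)}\), and the confidence interval uses the same expression without the leading 1, where \(MS_{Res}=SS_{Res}/(n-2)\) and \(S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2\). Here is a sketch of this manual calculation at age = 14.
# Manual 95% prediction and confidence intervals at x0 = 14.
x0    <- 14
n     <- nrow(rocket)
xbar  <- mean(rocket$x)
Sxx   <- sum((rocket$x - xbar)^2)
MSRes <- sum(residuals(model1)^2) / (n - 2)
yhat  <- coef(model1)[1] + coef(model1)[2] * x0
tval  <- qt(0.975, df = n - 2)
# Prediction interval: the extra "1" accounts for the variability of a new observation.
yhat + c(-1, 1) * tval * sqrt(MSRes * (1 + 1/n + (x0 - xbar)^2 / Sxx))
# Confidence interval for the mean response: no extra "1".
yhat + c(-1, 1) * tval * sqrt(MSRes * (1/n + (x0 - xbar)^2 / Sxx))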
Finally, we make predictions for age = 0, 1, 2, …, 30, and plot the predictions and the two types of intervals in one figure.
newx <- seq(0,30)
pred <- predict(model1,newdata=data.frame(x=newx),interval = c("prediction"),level = 0.95, type="response")
head(pred)
## fit lwr upr
## 1 2627.822 2405.595 2850.050
## 2 2590.669 2370.584 2810.754
## 3 2553.515 2335.425 2771.605
## 4 2516.362 2300.114 2732.609
## 5 2479.208 2264.648 2693.768
## 6 2442.054 2229.021 2655.087
conf <- predict(model1,newdata=data.frame(x=newx),interval = c("confidence"),level = 0.95, type="response")
head(conf)
## fit lwr upr
## 1 2627.822 2534.995 2720.649
## 2 2590.669 2503.096 2678.242
## 3 2553.515 2471.083 2635.947
## 4 2516.362 2438.937 2593.786
## 5 2479.208 2406.628 2551.788
## 6 2442.054 2374.122 2509.987
plot(rocket$x,rocket$y,pch=20)
abline(model1,col="blue")
points(newx,conf[,1],pch=1,col="red")
lines(newx,conf[,2],col="red",lty=2)
lines(newx,conf[,3],col="red",lty=2)
lines(newx,pred[,2],col="red",lty=3)
lines(newx,pred[,3],col="red",lty=3)
We see that the intervals are wider at the two ends and narrower in the middle. We can calculate the interval widths.
pred[,3]-pred[,2]
## 1 2 3 4 5 6 7 8
## 444.4550 440.1695 436.1798 432.4942 429.1205 426.0660 423.3378 420.9421
## 9 10 11 12 13 14 15 16
## 418.8846 417.1704 415.8037 414.7879 414.1257 413.8186 413.8676 414.2725
## 17 18 19 20 21 22 23 24
## 415.0321 416.1447 417.6073 419.4164 421.5674 424.0551 426.8738 430.0167
## 25 26 27 28 29 30 31
## 433.4770 437.2470 441.3189 445.6843 450.3347 455.2614 460.4555
conf[,3]-conf[,2]
## 1 2 3 4 5 6 7 8
## 185.65391 175.14651 164.86347 154.84949 145.16027 135.86532 127.05118 118.82491
## 9 10 11 12 13 14 15 16
## 111.31691 104.68184 99.09521 94.74266 91.79989 90.40469 90.62856 92.45972
## 17 18 19 20 21 22 23 24
## 95.80607 100.51640 106.40971 113.30157 121.02150 129.42141 138.37751 147.78873
## 25 26 27 28 29 30 31
## 157.57355 167.66655 178.01534 188.57781 199.31999 210.21432 221.23835
The reason the intervals are wider at the two ends is that the estimated regression line has much larger variability far from the center of the data, where there are relatively fewer data points.
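As a quick check, the prediction intervals computed above are indeed narrowest at the age closest to the mean of the observed ages.
# The interval width is smallest near the mean of the observed x values.
mean(rocket$x)                              # center of the observed ages
newx[which.min(pred[, 3] - pred[, 2])]      # age with the narrowest prediction interval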
For the rocket propellant data set, our observed data points contain only propellant stored between 0 and 25 weeks, i.e., \(x_i\) is between 0 and 25. Suppose there is a new barrel of propellant that has been stored for 50 weeks, i.e., way beyond 25 weeks. Can we still make a valid prediction for the strength of this barrel of propellant? To answer this question, we need to know about extrapolation.
Lastly, when making predictions, be careful about extrapolation, which is usually unreliable and unstable. It should be avoided.
If we make a prediction at an even more extreme age, say age = 100 (far beyond the observed range), then below are the results.
newx=c(100)
pred <- predict(model1,newdata=data.frame(x=newx),interval = c("prediction"),level = 0.95, type="response")
pred
## fit lwr upr
## 1 -1087.537 -1652.645 -522.4286
newx=seq(0,100)
pred <- predict(model1,newdata=data.frame(x=newx),interval = c("prediction"),level = 0.95, type="response")
plot(rocket$x,rocket$y,pch=20,xlim=c(0,110),ylim=c(-1200,2700))
abline(model1,col="blue")
points(newx,pred[,1],pch=1,col="red")
lines(newx,pred[,2],col="red",lty=3)
lines(newx,pred[,3],col="red",lty=3)
Obviously, this prediction is less trustworthy because we do not know the relationship outside the range of the observed data. Therefore, please avoid extrapolation by only making predictions within the range of the observed covariates, i.e., from the minimum of the covariate to the maximum of the covariate.
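If you want to guard against accidental extrapolation in code, one option is a small wrapper function (a hypothetical helper, not part of the standard workflow) that warns whenever a requested age falls outside the observed range.
# Hypothetical helper: warn when a requested prediction would be an extrapolation.
predict_safe <- function(model, xvar, newx, ...) {
  rng <- range(xvar)
  if (any(newx < rng[1] | newx > rng[2])) {
    warning("Some values of newx are outside the observed range [",
            rng[1], ", ", rng[2], "]; predictions there are extrapolations.")
  }
  predict(model, newdata = data.frame(x = newx), ...)
}
predict_safe(model1, rocket$x, newx = c(14, 50))   # warns because 50 is beyond the observed ages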