Acknowledgment: the materials below are partially based on Montgomery, D. C., Peck, E. A., Vining, G. G., Introduction to Linear Regression Analysis (5th Edition), Wiley Series in Probability and Statistics, 2012. This material was initiated by Yichen Qin and modified by Tianhai Zu for teaching purposes.
Motivation
Why logistic regression?
Why not existing methods such as linear regression?
Linear regression assumes that the response variable Y is continuous, normally distributed, etc.
There are many situations where these assumptions are inappropriate
The response variable is binary (0,1).
E.g., win/lose, success/failure, approved/not approved.
Binary Response Variables
Outcome values 1 and 0 can represent “success” and “failure”.
Occurs often in the biopharmaceutical field, clinical trials.
Industrial applications include failure analysis, fatigue testing, reliability testing.
Other examples: buying a product, using a coupon, attending a university, applying for a credit card, etc.
Example: Space Shuttle Challenger’s O-ring (Wikipedia link)
On January 28, 1986, the NASA Space Shuttle Challenger broke apart 73 seconds into its flight.
An O-ring seal in its right solid rocket booster failed at liftoff.
The failure was attributed to the low temperature at launch (31 °F).
We illustrate the usage of logistic regression through this data set on the experiments on the O-ring.
d = read.csv("data_challenger.csv", header = TRUE)
d
## Temperature Success
## 1 53.0 0
## 2 56.0 0
## 3 57.0 0
## 4 63.0 1
## 5 66.0 1
## 6 67.0 1
## 7 67.3 1
## 8 67.7 1
## 9 68.0 1
## 10 69.0 1
## 11 70.0 1
## 12 70.6 0
## 13 71.2 0
## 14 71.7 0
## 15 72.0 1
## 16 73.0 1
## 17 75.0 1
## 18 74.5 0
## 19 76.0 1
## 20 76.5 1
## 21 78.0 1
## 22 79.0 1
## 23 80.0 1
## 24 81.0 1
plot(d$Temperature,d$Success,pch=20,xlim=c(30,90),ylim=c(-1,2),xlab="Temperature",ylab="O-ring status (1 or 0)", main="Success 1; Failure 0")
If we fit a linear regression to this data set, we have
plot(d$Temperature,d$Success,pch=20,xlim=c(30,90),ylim=c(-1,2),xlab="Temperature",ylab="O-ring status (1 or 0)", main="Success 1; Failure 0")
abline(lm(Success~Temperature,data=d),col="red")
Obviously, linear regression does not make sense in this situation because some predictions are below 0 or above 1. If we could have something as follows, it would be very sensible.
That is the idea of logistic regression. Now let us discuss logistic regression formally.
\[ \color{blue}{P_x(Y=1)=\frac{1}{1+\exp{[-(\beta_0+\beta_1 x)]}}}=\frac{\exp{[\beta_0+\beta_1 x]}}{1+\exp{[\beta_0+\beta_1 x]}} \]
Or equivalently,
\[ \log\bigg(\frac{P(Y=1)}{P(Y=0)}\bigg)=\log\bigg(\frac{P(Y=1)}{1-P(Y=1)}\bigg)=\beta_0+\beta_1 x \]
We focus on the probability of Y=1, i.e., the probability of success. This function of x is called the “logistic response function”. There is empirical evidence that the response function should be nonlinear, and an “S” shape is quite logical.
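To visualize this “S” shape, we can plot the logistic response function for some illustrative coefficient values (the values of b0 and b1 below are made up for illustration, not estimated from any data):

# plot the logistic response function P(Y=1) = 1/(1+exp(-(b0+b1*x)))
b0 = -10    # illustrative intercept (not an estimate)
b1 = 0.15   # illustrative slope (not an estimate)
curve(1/(1+exp(-(b0+b1*x))), from=30, to=110, xlab="x", ylab="P(Y=1)", main="Logistic response function")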
Having observed \(y_i\) and \(x_i\) for \(i=1,...,n\), we need to figure out \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
We want the best values of \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that are consistent with observed values \(y_i\) and \(x_i\).
Done by software, e.g., glm() in R.
Find the best model that is consistent with the data.
The fitting process is done by maximum likelihood.
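To give a sense of what maximum likelihood means here, the sketch below computes the Bernoulli log-likelihood of the data for candidate values of \(\beta_0\) and \(\beta_1\); glm() searches for the values that maximize this quantity. The helper function and the candidate values are ours, for illustration only.

# Bernoulli log-likelihood for candidate coefficients (illustrative helper)
loglik = function(b0, b1, x, y){
  p = 1/(1+exp(-(b0+b1*x)))        # P(Y=1) under the candidate coefficients
  sum(y*log(p) + (1-y)*log(1-p))   # log-likelihood of the observed 0/1 responses
}
# compare two candidate coefficient pairs on the O-ring data
loglik(-10, 0.15, d$Temperature, d$Success)
loglik(0, 0, d$Temperature, d$Success)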
To fit the logistic regression to the data, we use glm() as follows.
model2 = glm(Success ~ Temperature, data = d, family = "binomial")
summary(model2)
##
## Call:
## glm(formula = Success ~ Temperature, family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9615 -0.6351 0.4941 0.8406 1.1977
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.98233 5.41893 -1.842 0.0655 .
## Temperature 0.15769 0.07892 1.998 0.0457 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.975 on 23 degrees of freedom
## Residual deviance: 23.640 on 22 degrees of freedom
## AIC: 27.64
##
## Number of Fisher Scoring iterations: 4
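The estimated coefficients can also be extracted directly from the fitted object, for example:

coef(model2)   # estimated beta_0 (Intercept) and beta_1 (Temperature)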
The output suggests that the fitted regression model is \[ \log\bigg(\frac{P(Y=1)}{P(Y=0)}\bigg)=\hat{\beta}_0+\hat{\beta}_1 * \text{Temperature} \]
where \(\hat{\beta}_0\)=-9.9823253 and \(\hat{\beta}_1\)=0.1576942. The slope coefficient for Temperature is 0.1576942. What does it mean? Since 0.1576942 is positive, we know that as Temperature increases, a success becomes more likely. But how much more likely?
Since we know, at temperature = 40, we have \[ \log\bigg(\frac{P_{40}(Y=1)}{P_{40}(Y=0)}\bigg)=\hat{\beta}_0+\hat{\beta}_1 * 40 \] Since we know, at temperature = 41, we have \[ \log\bigg(\frac{P_{41}(Y=1)}{P_{41}(Y=0)}\bigg)=\hat{\beta}_0+\hat{\beta}_1 * 41 \] Therefore, \[\log\bigg(\frac{P_{41}(Y=1)}{P_{41}(Y=0)}\bigg)-\log\bigg(\frac{P_{40}(Y=1)}{P_{40}(Y=0)}\bigg)=\hat{\beta}_1\]
Or equivalently, \[\frac{\bigg[\frac{P_{41}(Y=1)}{P_{41}(Y=0)} \bigg]}{ \bigg[\frac{P_{40}(Y=1)}{P_{40}(Y=0)} \bigg] }=\exp(\hat{\beta}_1)\] Here we call
\[\text{odds}=\frac{P_{41}(Y=1)}{P_{41}(Y=0)}\]
and we call \[\text{odds ratio} = \frac{\bigg[\frac{P_{41}(Y=1)}{P_{41}(Y=0)} \bigg]}{ \bigg[\frac{P_{40}(Y=1)}{P_{40}(Y=0)} \bigg] }\].
When Temperature increases by 1 unit, the odds ratio is \(\exp(\hat{\beta}_1)\)=1.1708081. In other words, the odds increase by 17.0808126%.
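This odds ratio can be computed directly from the fitted model:

exp(coef(model2)["Temperature"])   # odds ratio for a 1-unit increase in Temperature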
To make a prediction of the new observation (i.e., a new temperature), we need to first obtain predicted probability and then predicted outcome.
The predicted probability of success for a new observation at Temperature=\(x\) is \[\hat{P}_x = \frac{1}{1+\exp{[-(\hat{\beta}_0+\hat{\beta}_1 x)]}}\].
The predicted outcome \(\hat{y}\) for the new observation at Temperature=\(x\) is \[\hat{y} = \begin{cases} 0, \quad \hat{P}_x \leq 0.5 \\ 1, \quad \hat{P}_x > 0.5 \end{cases}\]
For the space shuttle example, suppose we want to make a prediction at temperature=31 (i.e., the actual launching temperature), we obtain the following.
pred_prob_31=predict(model2, list(Temperature=31), type = "response")
pred_prob_31
## 1
## 0.006097231
As we can see, the probability of success at temperature=31 is 0.0060972. We can further obtain the predicted outcome.
pred_value_31 = 1*(pred_prob_31>0.5)
pred_value_31
## 1
## 0
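As a check, the same predicted probability can be obtained by plugging the fitted coefficients into the logistic response function by hand:

b = coef(model2)
1/(1+exp(-(b[1]+b[2]*31)))   # should match pred_prob_31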
To evaluate the prediction performance on all the observed data (\(n\)=24), we can similarly have
pred_prob = predict(model2, d, type = "response")
pred_value = 1*(pred_prob>0.5)
cbind(d, pred_prob, pred_value)
## Temperature Success pred_prob pred_value
## 1 53.0 0 0.1645808 0
## 2 56.0 0 0.2402245 0
## 3 57.0 0 0.2701710 0
## 4 63.0 1 0.4881048 0
## 5 66.0 1 0.6047969 1
## 6 67.0 1 0.6418004 1
## 7 67.3 1 0.6526018 1
## 8 67.7 1 0.6667613 1
## 9 68.0 1 0.6771886 1
## 10 69.0 1 0.7106567 1
## 11 70.0 1 0.7419771 1
## 12 70.6 0 0.7596728 1
## 13 71.2 0 0.7765204 1
## 14 71.7 0 0.7899045 1
## 15 72.0 1 0.7976479 1
## 16 73.0 1 0.8219115 1
## 17 75.0 1 0.8635084 1
## 18 74.5 0 0.8539463 1
## 19 76.0 1 0.8810525 1
## 20 76.5 1 0.8890706 1
## 21 78.0 1 0.9103425 1
## 22 79.0 1 0.9224076 1
## 23 80.0 1 0.9329687 1
## 24 81.0 1 0.9421824 1
Now, we plot everything in one figure.
plot(d$Temperature,d$Success,xlim=c(50,85),ylim=c(-1,2),pch=20,xlab="Temperature",ylab="O-ring status (1 or 0)", main="Success 1; Failure 0",type="n")
points(d$Temperature,d$Success,pch=20)
points(d$Temperature,pred_prob,type="h",lty=3)
curve(1/(1+exp(-(model2$coefficients[1]+model2$coefficients[2]*x))),add=TRUE,col="blue")
abline(h=0.5,col="grey")
points(d$Temperature,pred_value,pch=1,cex=1.3)
To evaluate the prediction performance, we use the confusion matrix and misclassification error rate.
| Confusion matrix | Predicted Negative (0) | Predicted Positive (1) |
|---|---|---|
| Actual Negative (0) | Number of True Negatives | Number of False Positives |
| Actual Positive (1) | Number of False Negatives | Number of True Positives |
misclassification error rate = (Number of False Negatives+Number of False Positives)/Total.
We can obtain the confusion matrix as follows.
actual_value=d$Success
confusion_matrix=table(actual_value,pred_value)
confusion_matrix
## pred_value
## actual_value 0 1
## 0 3 4
## 1 1 16
We can compare this table with the figure above. There is one hollow circle (a false negative) at 0 and there are four hollow circles (false positives) at 1.
We can further obtain the misclassification error rate as follows.
misclassification_error_rate=1-sum(diag(confusion_matrix))/sum(confusion_matrix)
misclassification_error_rate
## [1] 0.2083333
In fact, 0.5 is not the only choice of threshold in:
\[\hat{y} = \begin{cases} 0, \quad \hat{P}_x \leq 0.5 \\ 1, \quad \hat{P}_x > 0.5 \end{cases}\]
You can use other thresholds. For example, we can define a cost function and search over a grid of candidate cutoffs as follows.
# define a cost function with input "obs" being observed response
# and "pi" being predicted probability, and "pcut" being the threshold.
costfunc = function(obs, pred.p, pcut){
weight1 = 1 # define the weight for "true=1 but pred=0" (FN)
weight0 = 1 # define the weight for "true=0 but pred=1" (FP)
c1 = (obs==1)&(pred.p<pcut) # count for "true=1 but pred=0" (FN)
c0 = (obs==0)&(pred.p>=pcut) # count for "true=0 but pred=1" (FP)
cost = mean(weight1*c1 + weight0*c0) # misclassification with weight
return(cost) # you have to return a value when you write R functions
} # end of the function
# define a sequence from 0.01 to 1 by 0.01
p.seq = seq(0.01, 1, 0.01)
mean_error = rep(0, length(p.seq))
for(i in 1:length(p.seq)){
mean_error[i] = costfunc(obs = d$Success, pred.p = pred_prob, pcut = p.seq[i])
} # end of the loop
# draw a plot with X axis being all pcut and Y axis being associated cost
plot(p.seq, mean_error)
Let’s confirm that when p-cut is 0.5, we get the previous mean error:
mean_error[which(p.seq==0.5)]
## [1] 0.2083333
What’s the best p-cut and lowest mean error we can get?
optimal.pcut = p.seq[which(mean_error==min(mean_error))]
min(mean_error)
## [1] 0.1666667
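We can also print the corresponding cutoff(s); note that several cutoffs may tie for the minimum, in which case optimal.pcut contains more than one value.

optimal.pcut   # cutoff value(s) achieving the lowest mean error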
Use the data_delivery data, the first data set we used in the first lecture. Cut the variable “Distance” into two levels with a threshold of 300. That is, any value higher than 300 becomes 1 and any value lower than or equal to 300 becomes 0. Then we can perform logistic regression on this data. The predictors are still “DeliveryTime” and “NumberofCases”. The final goals are 1) to find out which variables are significant and 2) to determine the best p-cut and the lowest mean error we can get.
data <- read.csv("data_delivery.csv")
data$Distance <- as.numeric(data$Distance>300)
head(data)
## Index DeliveryTime NumberofCases Distance
## 1 1 16.68 7 1
## 2 2 11.50 3 0
## 3 3 12.03 3 1
## 4 4 14.88 4 0
## 5 5 13.75 6 0
## 6 6 18.11 7 1
#Your turn!
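A minimal skeleton for this exercise, following the same steps as above, is sketched below. It reuses the costfunc() defined earlier and the column names shown in head(data); the details are left for you to fill in and check.

# 1) fit a logistic regression of the binary Distance on both predictors
model3 = glm(Distance ~ DeliveryTime + NumberofCases, data = data, family = "binomial")
summary(model3)   # check which variables are significant

# 2) search over cutoffs for the lowest mean misclassification error
pred_prob3 = predict(model3, data, type = "response")
p.seq = seq(0.01, 1, 0.01)
mean_error3 = sapply(p.seq, function(p) costfunc(obs = data$Distance, pred.p = pred_prob3, pcut = p))
p.seq[which(mean_error3 == min(mean_error3))]   # best p-cut(s)
min(mean_error3)                                # lowest mean error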