In this lab we will go through model building, validation, and interpretation of tree models. The focus will be on the rpart package.

1 Regression Tree vs. Classification Tree

CART stands for classification and regression tree.

For the regression tree example, we will use the Boston Housing data. Recall the response variable is the housing price. For the classification tree example, we will use the credit scoring data. The response variable is whether the loan went to default.

Note that unlike logistic regression, the response variable does not have to be binary in the case of a classification tree. We can use a classification tree on classification problems with more than 2 outcomes.


1.1 Regression Tree (Boston Housing Data)

Let us load the data set. Because the training and test data sets are randomly sampled, your results may differ from those shown below.

library(MASS) #this data is in MASS package
data(Boston) #load the Boston data frame into the workspace
sample_index <- sample(nrow(Boston),nrow(Boston)*0.90)
boston.train <- Boston[sample_index,]
boston.test <- Boston[-sample_index,]

We will use the ‘rpart’ library for model building and ‘rpart.plot’ for plotting.

install.packages('rpart')
install.packages('rpart.plot') 
library(rpart)
library(rpart.plot)

The simple form of the rpart function is similar to lm and glm. It takes a formula argument in which you specify the response and predictor variables, and a data argument in which you specify the data frame.

boston.rpart <- rpart(formula = medv ~ ., data = boston.train)

1.1.1 Printing and plotting the tree

boston.rpart
## n= 455 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 455 38961.9200 22.82703  
##    2) rm< 6.797 370 14198.5800 19.77459  
##      4) lstat>=14.4 151  2622.3790 14.99139  
##        8) crim>=5.7819 64   684.5444 12.03438 *
##        9) crim< 5.7819 87   966.5533 17.16667 *
##      5) lstat< 14.4 219  5739.4360 23.07260  
##       10) dis>=1.5511 212  2699.1230 22.57972  
##         20) rm< 6.543 176  1467.3120 21.66250 *
##         21) rm>=6.543 36   359.8631 27.06389 *
##       11) dis< 1.5511 7  1429.0200 38.00000 *
##    3) rm>=6.797 85  6309.4430 36.11412  
##      6) rm< 7.437 56  1816.2800 31.55000  
##       12) lstat>=9.65 9   380.6356 24.82222 *
##       13) lstat< 9.65 47   950.2711 32.83830 *
##      7) rm>=7.437 29  1073.9780 44.92759 *
prp(boston.rpart,digits = 4, extra = 1)

Make sure you know how to interpret this tree model!

Exercise: What is the predicted median housing price (in thousands) given the following information:

crim zn indus chas nox rm age dis rad tax ptratio black lstat medv
2.73 0 19.58 0 0.87 5.6 94.9 1.53 5 403 14.7 351.85 21.45 15.4
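
To check your answer in R, one possible approach (a sketch) is to put this observation into a one-row data frame and pass it to predict(); the exact prediction depends on your random training split.

#hypothetical one-row data frame holding the observation from the table above
new.obs <- data.frame(crim=2.73, zn=0, indus=19.58, chas=0, nox=0.87, rm=5.6,
                      age=94.9, dis=1.53, rad=5, tax=403, ptratio=14.7,
                      black=351.85, lstat=21.45)
#predicted median housing price (in thousands) for this observation
predict(boston.rpart, newdata=new.obs)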

1.1.2 Prediction using regression tree

In-sample and out-of-sample prediction for a regression tree works in a similar way as for lm and glm models.

  • In-sample prediction
boston.train.pred.tree = predict(boston.rpart)
  • Out-of-sample prediction
boston.test.pred.tree = predict(boston.rpart,boston.test)

We often refer to the MSE as the training error, and the MSPE as the testing error, when the sample size is large. Exercise: Calculate the mean squared error (MSE) and mean squared prediction error (MSPE) for this tree model.

MSE.tree<- 
MSPE.tree <- 
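
A minimal sketch of one possible solution, using the predictions computed above:

MSE.tree <- mean((boston.train.pred.tree - boston.train$medv)^2)   #in-sample (training) error
MSPE.tree <- mean((boston.test.pred.tree - boston.test$medv)^2)    #out-of-sample (testing) error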

We can compare this model’s out-of-sample performance with the linear regression model with all variables in it.

boston.reg = lm(medv~., data = boston.train)
boston.test.pred.reg = predict(boston.reg, boston.test)
mean((boston.test.pred.reg - boston.test$medv)^2)
## [1] 22.27053

1.1.3 Comparing the performance of regression tree with linear regression model in terms of prediction error (Exercise)

Calculate the mean squared error (MSE) and mean squared prediction error (MSPE) for the linear regression model using all variables. Then compare the results. What is your conclusion? Further, try to compare the regression tree with the best linear regression model obtained from a variable selection procedure.

boston.lm<- 
boston.train.pred.lm<- 
boston.test.pred.lm<- 
MSE.lm<- 
MSPE.lm<-
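
A possible sketch for the full linear model (a variable selection step, for example step(), could be used in place of the full model):

boston.lm <- lm(medv~., data = boston.train)
boston.train.pred.lm <- predict(boston.lm)              #in-sample predictions
boston.test.pred.lm <- predict(boston.lm, boston.test)  #out-of-sample predictions
MSE.lm <- mean((boston.train.pred.lm - boston.train$medv)^2)
MSPE.lm <- mean((boston.test.pred.lm - boston.test$medv)^2)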

1.2 Classification Tree (Credit Card Default data)

The classification tree is slightly more complicated to specify. What makes it more complicated is that we often have an asymmetric cost function. In the credit scoring case it means that false negatives (predicting 0 when the truth is 1, i.e. giving out loans that end up in default) cost more than false positives (predicting 1 when the truth is 0, i.e. rejecting loans that you should not reject).

Here we make the assumption that a false negative costs 10 times as much as a false positive. In real life the cost structure should be carefully researched.

credit.data <- read.csv(file = "data/credit_default.csv", header=T)
# rename
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
credit.data<- rename(credit.data, default=default.payment.next.month)
# convert categorical data to factor
credit.data$SEX<- as.factor(credit.data$SEX)
credit.data$EDUCATION<- as.factor(credit.data$EDUCATION)
credit.data$MARRIAGE<- as.factor(credit.data$MARRIAGE)

index <- sample(nrow(credit.data),nrow(credit.data)*0.80)
credit.train = credit.data[index,]
credit.test = credit.data[-index,]

credit.rpart0 <- rpart(formula = default ~ ., data = credit.train, method = "class")

credit.rpart <- rpart(formula = default ~ . , data = credit.train, method = "class", parms = list(loss=matrix(c(0,10,1,0), nrow = 2)))

Note the following important differences from the regression tree:

  • The method = “class” argument is required if the response is not declared as a factor.

  • The parms argument is a list whose most important element is the loss matrix. The diagonal elements are 0, and the off-diagonal elements tell you the loss (cost) of classifying something wrong. For binary classification, the numbers in c() specify the cost in this sequence: c(0, False Negative, False Positive, 0). If you have a symmetric cost, you can ignore the parms argument. (See the printed loss matrix below.)
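
To see the loss matrix used above, you can simply print it. Consistent with the c(0, False Negative, False Positive, 0) ordering, the rows correspond to the true class (0, 1) and the columns to the predicted class (0, 1):

matrix(c(0,10,1,0), nrow = 2) #filled column-wise
##      [,1] [,2]
## [1,]    0    1
## [2,]   10    0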

However, the tree fitted with the default cost, credit.rpart0, minimizes the symmetric cost, which is the misclassification rate. We can take a look at its confusion matrix.

pred0<- predict(credit.rpart0, type="class")
table(credit.train$default, pred0, dnn = c("True", "Pred"))
##     Pred
## True    0    1
##    0 7168  331
##    1 1415  686

Note that in the predict() function, we need type="class" in order to get binary prediction.

Look at the confusion matrix. Is it what we expected? Think about why it looks the way it does.
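
As a quick sketch, you can also compute the symmetric misclassification rate that this default-cost tree is implicitly minimizing:

mean(credit.train$default != as.numeric(as.character(pred0))) #training misclassification rate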

Therefore, for most applications with very unbalanced data, we often use an asymmetric cost. Recall the example in logistic regression. In the credit scoring case it means that false negatives (predicting 0 when the truth is 1, i.e. giving out loans that end up in default) cost more than false positives (predicting 1 when the truth is 0, i.e. rejecting loans that you should not reject).

Here we make the assumption that a false negative costs 5 times as much as a false positive. In real life the cost structure should be carefully researched.

credit.rpart <- rpart(formula = default ~ . , data = credit.train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)))

For more advanced controls, you should carefully read the help document for the rpart function.

1.2.1 Printing and plotting the tree

credit.rpart
## n= 9600 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 9600 7499 1 (0.78114583 0.21885417)  
##    2) PAY_0< 0.5 7436 5035 0 (0.86457773 0.13542227)  
##      4) PAY_AMT6>=1500.5 3879 1905 0 (0.90177881 0.09822119)  
##        8) PAY_5< 1 3755 1715 0 (0.90865513 0.09134487) *
##        9) PAY_5>=1 124   86 1 (0.69354839 0.30645161) *
##      5) PAY_AMT6< 1500.5 3557 2931 1 (0.82400900 0.17599100)  
##       10) PAY_4< 1 3349 2755 0 (0.83547328 0.16452672)  
##         20) PAY_AMT3>=810.5 1967 1320 0 (0.86578546 0.13421454)  
##           40) LIMIT_BAL>=35000 1583  910 0 (0.88502843 0.11497157) *
##           41) LIMIT_BAL< 35000 384  302 1 (0.78645833 0.21354167) *
##         21) PAY_AMT3< 810.5 1382 1095 1 (0.79232996 0.20767004) *
##       11) PAY_4>=1 208  133 1 (0.63942308 0.36057692) *
##    3) PAY_0>=0.5 2164 1070 1 (0.49445471 0.50554529) *
prp(credit.rpart, extra = 1)

1.2.2 Prediction using classification tree

For a binary classification problem, as you learned in logistic regression there are 2 types of predictions. One is the predicted class of response (0 or 1), and the second type is the probability of response being 1. We use an additional argument type=“class” or type=“prob” to get these:

In-sample prediction

credit.train.pred.tree1<- predict(credit.rpart, credit.train, type="class")
table(credit.train$default, credit.train.pred.tree1, dnn=c("Truth","Predicted"))
##      Predicted
## Truth    0    1
##     0 4813 2686
##     1  525 1576

Exercise: Out-of-sample prediction

#Predicted Class
credit.test.pred.tree1<- 
table()
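
One possible solution (a sketch), mirroring the in-sample code above:

credit.test.pred.tree1 <- predict(credit.rpart, credit.test, type="class")
table(credit.test$default, credit.test.pred.tree1, dnn=c("Truth","Predicted"))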

Usually if you want a hassle-free model, using type=“class” is enough given that you specified the loss matrix correctly in rpart.

Exercise: Try type="prob" in prediction. What can you say about these predicted probabilities?
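
A sketch of how to obtain these probabilities; each observation gets two columns, the estimated P(default = 0) and P(default = 1):

credit.train.prob.tree1 <- predict(credit.rpart, credit.train, type="prob")
head(credit.train.prob.tree1)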

1.2.3 Calculate the actual cost using a self-defined cost function.

We can get the expected loss for this tree model by defining a cost function that has the correct weights:

cost <- function(r, pi){
  weight1 = 5
  weight0 = 1
  c1 = (r==1)&(pi==0) #logical vector - true if actual 1 but predict 0
  c0 = (r==0)&(pi==1) #logical vector - true if actual 0 but predict 1
  return(mean(weight1*c1+weight0*c0))
}

Calculate the cost for training sample using above cost function

cost(credit.train$default,credit.train.pred.tree1)
## [1] 0.5532292

Exercise: Calculate the cost for testing sample.
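
A sketch, assuming credit.test.pred.tree1 was obtained as in the out-of-sample prediction exercise above:

cost(credit.test$default, credit.test.pred.tree1)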

1.2.4 Comparing this classification tree with logistic regression

We can compare this model’s out-of-sample performance with the logistic regression model with all variables in it. Recall that when we search for the optimal cut-off using the same cost function we get optimal cut-off at about 0.21.

#Fit logistic regression model
credit.glm<- glm(default~., data = credit.train, family=binomial)
#Get binary prediction
credit.test.pred.glm<- as.numeric(predict(credit.glm, credit.test, type="response")>0.21)
#Calculate cost using test set
cost(credit.test$default,credit.test.pred.glm)
## [1] 0.6245833
#Confusion matrix
table(credit.test$default, credit.test.pred.glm, dnn=c("Truth","Predicted"))
##      Predicted
## Truth    0    1
##     0 1230  639
##     1  172  359

Exercise: Comparison for in-sample performance.
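
A possible sketch of the in-sample comparison for logistic regression, using the same 0.21 cut-off; compare this with the tree's in-sample cost computed earlier:

credit.train.pred.glm <- as.numeric(predict(credit.glm, credit.train, type="response")>0.21)
cost(credit.train$default, credit.train.pred.glm)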

Which model do you think is better?


1.2.5 ROC Curve and Cut-off Probability for Classification Tree

Recall that the ROC curve gives you the trade-off between the hit rate (true positive rate) and the false positive rate, and the area under the curve (AUC) can be used as a measure of how well a binary classification model performs when you do not know the cost function.

To get the ROC curve, we need the predicted probability of Y being 1 from the fitted tree. Here we fit the tree with the asymmetric loss matrix so that it grows beyond a trivial tree; with purely default settings (symmetric cost and cp = 0.01) the tree can be very small or even classify everything as 0. The cp (complexity parameter) argument, which controls the complexity of the tree, is discussed in the next section.

credit.rpart <- rpart(formula = default ~ ., data = credit.train, method = "class", parms = list(loss=matrix(c(0,10,1,0), nrow = 2)))
#Probability of getting 1
credit.test.prob.rpart = predict(credit.rpart,credit.test, type="prob")

credit.test.prob.rpart has 2 columns: the first is P(Y = 0) and the second is P(Y = 1). We only need the second column, because the two columns add up to 1 for binary classification.

To get the ROC curve we use:

install.packages('ROCR')
library(ROCR)
pred = prediction(credit.test.prob.rpart[,2], credit.test$default)
perf = performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

Area under the curve is given by (do not worry about the syntax here):

slot(performance(pred, "auc"), "y.values")[[1]]
## [1] 0.7470243

For a given cut-off probability, the 0/1 prediction result can be calculated similarly to what you did in logistic regression. Here the cut-off 1/11 = 1/(1 + 10) corresponds to the 10:1 cost ratio in the loss matrix above:

credit.test.pred.rpart = as.numeric(credit.test.prob.rpart[,2] > 1/11)
table(credit.test$default, credit.test.pred.rpart, dnn=c("Truth","Predicted"))
##      Predicted
## Truth    0    1
##     0  602 1267
##     1   48  483

If you know the cost structure of misclassification, there is usually no need to search for an optimal cut-off probability as we did in logistic regression. You can refer to the earlier section on specifying a loss matrix; rpart will automatically generate decision rules with your cost structure taken into consideration.

Exercise: Draw the ROC curve for training sample.
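
A sketch for the training-sample ROC curve, analogous to the test-sample code above:

credit.train.prob.rpart = predict(credit.rpart, credit.train, type="prob")
pred.train = prediction(credit.train.prob.rpart[,2], credit.train$default)
perf.train = performance(pred.train, "tpr", "fpr")
plot(perf.train, colorize=TRUE)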

1.3 Pruning

In rpart(), the cp (complexity parameter) argument is one of the parameters used to control the complexity of the tree. The help document for rpart tells you that “any split that does not decrease the overall lack of fit by a factor of cp is not attempted”. For a regression tree, the overall R-squared must increase by cp at each step. Basically, the smaller the cp value, the larger (more complex) a tree rpart will attempt to fit. The default value for cp is 0.01.

What happens when you have a large tree? The following tree has 27 splits.

boston.largetree <- rpart(formula = medv ~ ., data = boston.train, cp = 0.001)

Try plotting it yourself to see its structure.

prp(boston.largetree)

The plotcp() function shows the relationship between the 10-fold cross-validation error on the training set and the size of the tree.

plotcp(boston.largetree)

You can observe from the above graph that the cross-validation error (xerror) does not always go down as the tree becomes more complex. The analogy is that adding more variables to a regression model does not necessarily improve its ability to predict future observations. A good choice of cp for pruning is often the leftmost value for which the mean error lies below the horizontal line. In the Boston housing example, you may conclude that a tree model with more than 10 splits is not helpful.

To look at the error vs size of tree more carefully, you can look at the following table:

printcp(boston.largetree)
## 
## Regression tree:
## rpart(formula = medv ~ ., data = boston.train, cp = 0.001)
## 
## Variables actually used in tree construction:
## [1] age     black   crim    dis     lstat   nox     ptratio rm      tax    
## 
## Root node error: 38962/455 = 85.631
## 
## n= 455 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.4736392      0   1.00000 1.00940 0.086734
## 2  0.1498070      1   0.52636 0.57523 0.058795
## 3  0.0877571      2   0.37655 0.42137 0.049388
## 4  0.0413556      3   0.28880 0.33304 0.045521
## 5  0.0249290      4   0.24744 0.32229 0.045704
## 6  0.0223795      5   0.22251 0.28082 0.037810
## 7  0.0124576      6   0.20013 0.25960 0.037869
## 8  0.0086452      7   0.18767 0.26733 0.040807
## 9  0.0069296      8   0.17903 0.26021 0.038443
## 10 0.0063204      9   0.17210 0.25767 0.038439
## 11 0.0055260     10   0.16578 0.24373 0.037292
## 12 0.0053722     11   0.16025 0.24438 0.037496
## 13 0.0044016     13   0.14951 0.23746 0.037348
## 14 0.0040514     14   0.14511 0.23481 0.037329
## 15 0.0038419     15   0.14106 0.23442 0.037422
## 16 0.0032478     16   0.13721 0.22903 0.036613
## 17 0.0031528     17   0.13397 0.23020 0.036630
## 18 0.0022961     18   0.13081 0.22756 0.036747
## 19 0.0015176     19   0.12852 0.22082 0.035255
## 20 0.0014949     20   0.12700 0.22009 0.035095
## 21 0.0014255     21   0.12551 0.22003 0.035094
## 22 0.0014195     23   0.12265 0.21907 0.035099
## 23 0.0012704     24   0.12123 0.21892 0.035099
## 24 0.0012370     25   0.11996 0.21810 0.035008
## 25 0.0011061     26   0.11873 0.21777 0.035003
## 26 0.0010000     27   0.11762 0.21581 0.035026

The root node error is the error you get without doing anything smart in prediction; in the regression case, it is the mean squared error (MSE) when you use the average of medv as the prediction. Note that it is the same as

sum((boston.train$medv - mean(boston.train$medv))^2)/nrow(boston.train)
## [1] 85.63059

The first 2 columns, CP and nsplit, tell you how large the tree is. rel error \(\times\) root node error gives you the in-sample error. For example, for the last row, (rel error)*(root node error) = 0.11762*85.631 = 10.072, which is the same as the in-sample MSE if you calculate it using predict:

mean((predict(boston.largetree) - boston.train$medv)^2)
## [1] 10.07199

xerror gives you the cross-validation (default is 10-fold) error. You can see that the rel error (in-sample error) always decreases as the model becomes more complex, while the cross-validation error (a measure of performance on future observations) does not. That is why we prune the tree: to avoid overfitting the training data.

By default, rpart() uses several control parameters to avoid fitting a large tree; the main reason is to save computation time. For example, by default rpart sets cp = 0.01 and requires at least 20 observations in a node before attempting a split. Use ?rpart.control to view these parameters. Sometimes we wish to change these parameters to see how more complex trees perform, as we did above. If we end up with a larger than necessary tree, we can use the prune() function and specify a new cp:

prune(boston.largetree, cp = 0.008)
## n= 455 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 455 38961.9200 22.82703  
##    2) rm< 6.797 370 14198.5800 19.77459  
##      4) lstat>=14.4 151  2622.3790 14.99139  
##        8) crim>=5.7819 64   684.5444 12.03438 *
##        9) crim< 5.7819 87   966.5533 17.16667 *
##      5) lstat< 14.4 219  5739.4360 23.07260  
##       10) dis>=1.5511 212  2699.1230 22.57972  
##         20) rm< 6.543 176  1467.3120 21.66250 *
##         21) rm>=6.543 36   359.8631 27.06389 *
##       11) dis< 1.5511 7  1429.0200 38.00000 *
##    3) rm>=6.797 85  6309.4430 36.11412  
##      6) rm< 7.437 56  1816.2800 31.55000  
##       12) lstat>=9.65 9   380.6356 24.82222 *
##       13) lstat< 9.65 47   950.2711 32.83830 *
##      7) rm>=7.437 29  1073.9780 44.92759  
##       14) ptratio>=17.6 7   465.9686 38.88571 *
##       15) ptratio< 17.6 22   271.1750 46.85000 *
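
The prune() call above only prints the pruned tree. In practice you would typically store the result and plot it, for example:

boston.prunedtree <- prune(boston.largetree, cp = 0.008)
prp(boston.prunedtree, digits = 4, extra = 1)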

Exercise: Prune a classification tree. Start with “cp=0.001”, and find a reasonable cp value, then obtain the pruned tree.
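
A sketch of one possible workflow; the cp value used for pruning below is just a placeholder and should be chosen from your own CP table/plot:

credit.largetree <- rpart(formula = default ~ ., data = credit.train, method = "class", parms = list(loss=matrix(c(0,5,1,0), nrow = 2)), cp = 0.001)
plotcp(credit.largetree)
printcp(credit.largetree)
credit.prunedtree <- prune(credit.largetree, cp = 0.005) #placeholder cp; pick based on xerror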

Some software/packages can automatically prune the tree.


1.4 Things to remember

  • Use rpart() to fit regression and classification tree.

  • Know how to interpret a tree.

  • Use predict() for prediction, and know how to assess model performance.

  • Know how to use Cp plot/table to prune a large tree.
