Explanation of the Decision Tree Model


This chapter describes the decision tree model and when you should use it.

What is a Decision Tree?

A decision tree is a machine learning algorithm that partitions the data into subsets. The partitioning process starts with a binary split and continues until no further splits can be made. The result is a tree whose branches vary in length.

The goal of a decision tree is to encapsulate the training data in the smallest possible tree. The rationale for minimizing tree size is the principle that the simplest explanation for a set of phenomena is preferred over more complex ones. In addition, small trees produce decisions faster than large trees, and they are much easier to inspect and understand. Various methods and techniques are available to control the depth of the tree, or to prune it.

How Do Decision Trees Work?

There are several steps involved in building a decision tree.
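The splitting process can be sketched in code. The following is a minimal, hypothetical illustration of recursive binary partitioning using Gini impurity; RStat's actual implementation (based on rpart) differs in many details, and all function names here are invented for illustration:

```python
# Hypothetical sketch of recursive binary partitioning, not RStat's code.
# Each node finds the split that most reduces Gini impurity, then
# recurses until no useful split remains or the maximum depth is reached.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature_index, threshold) minimizing weighted child impurity,
    or None if no split improves on the parent node."""
    best = None
    best_impurity = gini(labels)
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            if not left or not right:
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if w < best_impurity:
                best_impurity, best = w, (j, threshold)
    return best

def grow(rows, labels, depth=0, max_depth=3):
    """Recursively grow a tree; returns nested dicts, or a class label at a leaf."""
    split = best_split(rows, labels)
    if split is None or depth >= max_depth:
        return max(set(labels), key=labels.count)  # majority class
    j, t = split
    lr = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    rr = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {"feature": j, "threshold": t,
            "left": grow([r for r, _ in lr], [y for _, y in lr], depth + 1, max_depth),
            "right": grow([r for r, _ in rr], [y for _, y in rr], depth + 1, max_depth)}
```

The `max_depth` parameter plays the same role as the Max Depth setting mentioned later in this chapter: it is one way of controlling how far the partitioning continues.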

Practical Applications of Decision Tree Analysis

Decision trees can be used either for classification, for example, to determine the category of an observation, or for prediction, for example, to estimate a numeric value. Using a decision tree for classification is an alternative to logistic regression; using one for prediction is an alternative to linear regression. See those methods for additional industry examples.


Procedure: How to Create a Decision Tree Model

The following example uses the credit scoring data set that was explained and used for the scoring application example in Creating a Scoring Application.

To execute the tree model:

  1. Load the credit scoring data set into RStat.

    For more information on loading data into RStat, see Getting Started With RStat.

  2. Use the default sample percentage of 70%.
  3. Select the data roles as shown in the following image:
    1. ID as Ident.
    2. CREDIT_APPROVAL as Target.
    3. All other variables as Input.

    Data roles selection window

    Note: This list has been truncated for display purposes.

  4. Click Execute.
  5. On the Model tab, select Tree for the Type.

    Note: Do not change any of the default parameters. For more information on the default values, see User-Defined Parameters.

  6. Click Execute.

    The Summary of the Tree model for Classification appears, as shown in the following image.

    Summary of Tree model for Classification


Reference: User-Defined Parameters

User-defined parameters include:


Reference: Output From Decision Trees

The model output is described line by line. For illustration purposes, we have pruned the tree by lowering the Max Depth from the default to 3.

Model output window

This section describes the decision tree output.

Node Numbering. Nodes are labeled with unique numbers generated by the following rule: the child nodes of node x are numbered 2x (left child) and 2x+1 (right child), and the root node is 1. The following tree diagram, generated by clicking the Draw button, shows the node numbers in color for the tree described previously. Only the terminal node numbers are displayed; for example, the labels for nodes 2 and 3 are not shown. Node 2 (left child of the root) is derived as 1*2, and node 3 (right child) as (1*2)+1. Node 4 (left child of node 2) is derived as 2*2. Terminal node 10 is derived from node 5 (right child of node 2) as 5*2, and terminal node 11, the right child of node 5, as (5*2)+1.
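The numbering rule can be expressed directly in code (a trivial sketch for illustration; the function names are invented):

```python
# Children of node n are 2n and 2n+1; the parent of any node n > 1 is n // 2.

def children(n):
    """Return the (left, right) child numbers of node n."""
    return 2 * n, 2 * n + 1

def parent(n):
    """Return the parent number of node n (n > 1)."""
    return n // 2

# children(1) -> (2, 3); children(5) -> (10, 11); parent(11) -> 5
```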

Decision Tree graph

Primary Split. Income is the predictor variable used for the primary split. The same predictor variable can be used to split many nodes. For example, node 2 is further split using Income. Age is the primary split for node 3.

Split Point. Nodes 2 and 3 were formed by splitting node 1 on the predictor variable Income. The split point is 33270.53. If the splitting variable is continuous (numeric), as in this split, the values going into the left and right child nodes will be shown as values less than or greater than some split point (33270.53 in this example). Node 2 consists of all rows with the value of Income greater than 33270.53, whereas node 3 consists of all rows with Income less than 33270.53.

Number of Node Cases. This is the number shown after the split point. For example, for node 2 the first number is 1074, which is the total number of rows in the data that belong to that node. For node 4 the number is 879, and for node 7 it is 89.

Expected Loss. This is the total number of rows that would be misclassified if the predicted class for the node were applied to all of its rows. In the case of node 4, all cases are correctly classified, so the number is 0. In the case of node 7, 6 of the 89 cases will be misclassified. This can also be inferred from the probability of the winning class (see the description that follows): for node 7, the probability of the winning class is 93%. (The winning class itself, 1 or 0, is a placeholder for a category of the target variable.) Based on this probability, 83 of the 89 cases will be classified correctly, leaving 6 misclassified.

Predicted Class for Node. This is the predicted class for the node. For example, for node 7, this is 1. In the sample data, 1 indicates good credit risk and 0 indicates bad credit risk; each person is classified into one of the two categories.

Probability of Winning Class. The numbers after the predicted class for the node, for example, for node 7, indicate the probabilities of each class and allow the user to see the probability of the winning class, that is, the factor that determines the final classification. In this particular case, the predicted class for node 7 is 1 and the probability is 0.89. For node 4, the winning class is 0 and the probability is 1.00.
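The relationships among these per-node numbers (node cases, expected loss, predicted class, and class probabilities) can be illustrated with a small sketch (hypothetical code, not RStat output; the function name is invented):

```python
# Given the class counts at a node, derive the predicted class, the class
# probabilities, and the expected loss (number of misclassified rows).

def node_stats(counts):
    """counts: dict mapping class label -> number of rows at the node."""
    n = sum(counts.values())
    predicted = max(counts, key=counts.get)       # winning (majority) class
    probs = {c: k / n for c, k in counts.items()} # class probabilities
    expected_loss = n - counts[predicted]         # rows not in the winning class
    return predicted, probs, expected_loss

# e.g. a node with 89 rows, 83 of class 1 and 6 of class 0:
# predicted class 1, winning-class probability 83/89 (about 0.93), expected loss 6
```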

Terminal Nodes. An asterisk (*) indicates a terminal node. As the preceding diagram shows, nodes 4, 7, 10, 11, 12, and 13 are terminal. Other input variables that were specified on the Data tab, for example, Gender, were omitted from the model. The algorithm has determined that they did not contribute to the predictive power of the model.

Variables actually used in tree construction: Age, Education, Income. This line shows the variables that are actually used to construct the tree. If you look at the decision tree image and at the node descriptions, you will notice that splits occur on the variables Age, Education, and Income.

Root node error: 204/1348 = 0.15134. This is the error rate for a single node tree, that is, if the tree was pruned to node 1. It is useful when comparing different decision tree models.

Complexity Table: The complexity table provides information about all of the trees considered for the final model. It lists their complexity parameter, the number of splits, the resubstitution error rate, the cross-validated error rate, and the associated standard error. See the following for an explanation of the items in the complexity table.

        CP nsplit rel error  xerror     xstd
1 0.441176      0   1.00000 1.00000 0.064499
2 0.036765      1   0.55882 0.56373 0.050275
3 0.031863      3   0.48529 0.57353 0.050670
4 0.010000      5   0.42157 0.52451 0.048652

Complexity Parameter. The complexity parameter was explained in the section User-Defined Parameters.

Number of Splits. The number of splits for the tree. For tree 4 there are 5 splits, which you can count in the preceding diagram.

Resubstitution Error Rate (rel error). The resubstitution error rate is a measure of error: the proportion of original observations that are misclassified by the various subtrees of the original tree. In this example, tree number 4 yields the minimum resubstitution error rate. The resubstitution rate decreases as you go down the list of trees, and the largest tree always yields the lowest resubstitution error rate. However, choosing the tree with the lowest resubstitution rate is not the optimal choice, as that tree is biased: large trees overfit outliers and so introduce random variation into the predictions.

Cross-Validated Error Rate (xerror). Instead of selecting a tree based on the resubstitution error rate, X-fold cross-validation is used to obtain a cross-validated error rate, from which the optimal tree is selected. X-fold cross-validation involves creating X random subsets of the original data, setting one portion aside as a test set, constructing a tree from the remaining X-1 portions, and evaluating that tree on the test portion. This is repeated for all portions, and the errors from the X portions are combined into the cross-validated error rate. The tree yielding the lowest cross-validated error rate (xerror) is selected as the tree that best fits the data. In this case, that is tree number 4, which has 5 splits.
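The cross-validation procedure described above can be sketched as follows (a hypothetical illustration, not RStat's code; `fit` and `predict` stand in for any model-building and scoring functions):

```python
# X-fold cross-validation sketch: hold each fold out once as a test set,
# fit on the remaining folds, and combine the per-fold errors into a
# single error estimate.

def cross_validated_error(rows, labels, fit, predict, x=10):
    """Return the fraction of rows misclassified across all X held-out folds."""
    folds = [list(range(i, len(rows), x)) for i in range(x)]
    errors = 0
    for test_idx in folds:
        test = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in test]
        train_labels = [y for i, y in enumerate(labels) if i not in test]
        model = fit(train_rows, train_labels)
        errors += sum(predict(model, rows[i]) != labels[i] for i in test_idx)
    return errors / len(rows)
```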

Standard Error (xstd). This is the standard deviation of error across the cross-validation sets.


Procedure: How to Evaluate a Decision Tree Model

In this procedure, you will produce the error matrix to evaluate how many records are correctly classified.

  1. Go to the Data tab and uncheck the Sample box.
  2. Go to the Model tab and execute the model.
  3. On the Evaluate tab, select Error Matrix for the Type, load the data set for the Data, then click Execute.

    Two matrices are produced. The first gives the counts of correctly and incorrectly classified records. For example, out of 77 records with good credit, 70 were classified correctly and 7 were misclassified. Out of 501 records with bad credit, 473 were classified correctly and 28 were misclassified. The second table gives the percentages of correctly and incorrectly classified records. Summing the correctly classified percentages gives 82+12=94 percent correctly classified cases. However, it is important to see whether the model better classifies the positive or the negative cases, for example, whether it predicts good credit more accurately than bad credit. Those assessments are made by the modeler. See the evaluation techniques and examples in Building a Logistic Model.

    Error matrix for the Tree model on ab_credit_training.csv [test] (counts):
             Actual
    Predicted   0   1
            0 473   7
            1  28  70
    Error matrix for the Tree model on ab_credit_training.csv [test] (%):
             Actual
    Predicted  0  1
            0 82  1
            1  5 12
    Overall error: [1] 0.06055363
    Generated by RStat 2009-03-04 19:55:49 
    ======================================================================
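For illustration, an error matrix like the ones above can be computed from actual and predicted labels as follows (a hypothetical sketch; RStat produces this output itself, and the function name is invented). Rows are predicted classes and columns are actual classes:

```python
# Build a counts matrix keyed by (predicted, actual) and compute the
# overall error as the fraction of off-diagonal (misclassified) rows.

def error_matrix(actual, predicted, classes=(0, 1)):
    counts = {(p, a): 0 for p in classes for a in classes}
    for a, p in zip(actual, predicted):
        counts[(p, a)] += 1
    n = len(actual)
    overall_error = sum(v for (p, a), v in counts.items() if p != a) / n
    return counts, overall_error
```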
