Explanation of the Decision Tree Model


This chapter describes the decision tree model and when you should use it.

What is a Decision Tree?

A decision tree is a machine learning algorithm that partitions the data into subsets. The partitioning process starts with a binary split and continues until no further splits can be made. The result is a tree whose branches vary in length.

The goal of a decision tree is to encapsulate the training data in the smallest possible tree. The rationale for minimizing tree size is the principle that the simplest explanation for a set of phenomena is preferred over more complex ones. In addition, small trees produce decisions faster than large trees, and they are much easier to inspect and understand. Various methods and techniques are available to control the depth of the tree, or to prune it.

How Do Decision Trees Work?

There are several steps involved in building a decision tree.
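The splitting process can be sketched in code. The following is a minimal, hypothetical illustration of recursive binary partitioning using Gini impurity; RStat's actual implementation (based on rpart) differs in many details, and all function names here are invented for illustration:

```python
# Hypothetical sketch of recursive binary partitioning, not RStat's code.
# Each node finds the split that most reduces Gini impurity, then
# recurses until no useful split remains or the maximum depth is reached.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, labels):
    """Return (feature_index, threshold) minimizing weighted child impurity,
    or None if no split improves on the parent node."""
    best = None
    best_impurity = gini(labels)
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            if not left or not right:
                continue
            w = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if w < best_impurity:
                best_impurity, best = w, (j, threshold)
    return best

def grow(rows, labels, depth=0, max_depth=3):
    """Recursively grow a tree; returns nested dicts, or a class label at a leaf."""
    split = best_split(rows, labels)
    if split is None or depth >= max_depth:
        return max(set(labels), key=labels.count)  # majority class
    j, t = split
    lr = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    rr = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {"feature": j, "threshold": t,
            "left": grow([r for r, _ in lr], [y for _, y in lr], depth + 1, max_depth),
            "right": grow([r for r, _ in rr], [y for _, y in rr], depth + 1, max_depth)}
```

The `max_depth` parameter plays the same role as the Max Depth setting mentioned later in this chapter: it is one way of controlling how far the partitioning continues.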

Practical Applications of Decision Tree Analysis

Decision trees can be used either for classification, for example, to determine the category of an observation, or for prediction, for example, to estimate a numeric value. Using a decision tree for classification is an alternative to logistic regression; using one for prediction is an alternative to linear regression. See those methods for additional industry examples.


Procedure: How to Create a Decision Tree Model

The following example uses the credit scoring data set that was explained and used for the scoring application example in Creating a Scoring Application.

To execute the tree model:

  1. Load the credit scoring data set into RStat.

    For more information on loading data into RStat, see Getting Started With RStat.

  2. Use the default sample percentage of 70%.
  3. Select the data roles as shown in the following image:
    1. ID as Ident.
    2. CREDIT_APPROVAL as Target.
    3. All other variables as Input.

    Data roles selection window

    Note: This list has been truncated for display purposes.

  4. Click Execute.
  5. On the Model tab, select Tree for the Type.

    Note: Do not change any of the default parameters. For more information on the default values, see User-Defined Parameters.

  6. Click Execute.

    The Summary of the Tree model for Classification appears, as shown in the following image.

    Summary of Tree model for Classification


Reference: User-Defined Parameters

User-defined parameters include:


Reference: Output From Decision Trees

The model output is described line by line. For illustration purposes, we have pruned the tree by lowering the Max Depth from the default to 3.

Model output window

This section describes the decision tree output.

Node Numbering. Nodes are labeled with unique numbers generated by the following rule: the child nodes of node x are numbered 2x (left child) and 2x+1 (right child), and the root node is 1. The following tree diagram, generated by clicking the Draw button, shows the node numbers in color for the tree described previously. Only the terminal node numbers are displayed; for example, the labels for nodes 2 and 3 are not shown. Node 2 (left child of the root) is derived as 1*2, and node 3 (right child) as (1*2)+1. Node 4 (left child of node 2) is derived as 2*2. Terminal node 10 is derived from node 5 (right child of node 2) as 5*2, and terminal node 11, the right child of node 5, as (5*2)+1.
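The numbering rule can be expressed directly in code (a trivial sketch for illustration; the function names are invented):

```python
# Children of node n are 2n and 2n+1; the parent of any node n > 1 is n // 2.

def children(n):
    """Return the (left, right) child numbers of node n."""
    return 2 * n, 2 * n + 1

def parent(n):
    """Return the parent number of node n (n > 1)."""
    return n // 2

# children(1) -> (2, 3); children(5) -> (10, 11); parent(11) -> 5
```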

Decision Tree graph

Primary Split. Income is the predictor variable used for the primary split. The same predictor variable can be used to split many nodes. For example, node 2 is further split using Income. Age is the primary split for node 3.

Split Point. Nodes 2 and 3 were formed by splitting node 1 on the predictor variable Income. The split point is 33270.53. If the splitting variable is continuous (numeric), as in this split, the values going into the left and right child nodes will be shown as values less than or greater than some split point (33270.53 in this example). Node 2 consists of all rows with the value of Income greater than 33270.53, whereas node 3 consists of all rows with Income less than 33270.53.

Number of Node Cases. This is the number shown after the split point. For example, for node 2 the first number is 1074, which is the total number of rows in the data that belong to that node. For node 4 the number is 879, and for node 7 it is 89.

Expected Loss. This is the total number of rows that would be misclassified if the predicted class for the node were applied to all of its rows. In the case of node 4, all cases are correctly classified, so the number is 0. In the case of node 7, 6 of the 89 cases will be misclassified. This can also be inferred from the probability of the winning class (see the description that follows): for node 7, the probability of the winning class is 93%. (The winning class itself, 1 or 0, is a placeholder for a category of the target variable.) Based on this probability, 83 of the 89 cases will be classified correctly, leaving 6 misclassified.

Predicted Class for Node. This is the predicted class for the node. For example, for node 7, this is 1. In the sample data, 1 indicates good credit risk and 0 indicates bad credit risk; each person is classified into one of the two categories.

Probability of Winning Class. The numbers after the predicted class for the node, for example, for node 7, indicate the probabilities of each class and allow the user to see the probability of the winning class, that is, the factor that determines the final classification. In this particular case, the predicted class for node 7 is 1 and the probability is 0.89. For node 4, the winning class is 0 and the probability is 1.00.
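The relationships among these per-node numbers (node cases, expected loss, predicted class, and class probabilities) can be illustrated with a small sketch (hypothetical code, not RStat output; the function name is invented):

```python
# Given the class counts at a node, derive the predicted class, the class
# probabilities, and the expected loss (number of misclassified rows).

def node_stats(counts):
    """counts: dict mapping class label -> number of rows at the node."""
    n = sum(counts.values())
    predicted = max(counts, key=counts.get)       # winning (majority) class
    probs = {c: k / n for c, k in counts.items()} # class probabilities
    expected_loss = n - counts[predicted]         # rows not in the winning class
    return predicted, probs, expected_loss

# e.g. a node with 89 rows, 83 of class 1 and 6 of class 0:
# predicted class 1, winning-class probability 83/89 (about 0.93), expected loss 6
```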

Terminal Nodes. An asterisk (*) indicates a terminal node. As the preceding diagram shows, nodes 4, 7, 10, 11, 12, and 13 are terminal. Other input variables that were specified on the Data tab, for example, Gender, were omitted from the model. The algorithm has determined that they did not contribute to the predictive power of the model.

Variables actually used in tree construction: Age, Education, Income. This line shows the variables that are actually used to construct the tree. If you look at the decision tree image and at the node descriptions, you will notice that splits occur on the variables Age, Education, and Income.

Root node error: 204/1348 = 0.15134. This is the error rate for a single node tree, that is, if the tree was pruned to node 1. It is useful when comparing different decision tree models.

Complexity Table: The complexity table provides information about all of the trees considered for the final model. It lists their complexity parameter, the number of splits, the resubstitution error rate, the cross-validated error rate, and the associated standard error. See the following for an explanation of the items in the complexity table.

        CP nsplit rel error  xerror     xstd
1 0.441176      0   1.00000 1.00000 0.064499
2 0.036765      1   0.55882 0.56373 0.050275
3 0.031863      3   0.48529 0.57353 0.050670
4 0.010000      5   0.42157 0.52451 0.048652

Complexity Parameter. The complexity parameter was explained in the section User-Defined Parameters.

Number of Splits. The number of splits for the tree. For tree 4 there are 5 splits, which you can count in the preceding diagram.

Resubstitution Error Rate (rel error). The resubstitution error rate is a measure of error: the proportion of original observations that are misclassified by the various subtrees of the original tree. In this example, tree number 4 yields the minimum resubstitution error rate. The resubstitution rate decreases as you go down the list of trees, and the largest tree always yields the lowest resubstitution error rate. However, choosing the tree with the lowest resubstitution rate is not the optimal choice, as that tree is biased: large trees overfit outliers and so introduce random variation into the predictions.

Cross-Validated Error Rate (xerror). Instead of selecting a tree based on the resubstitution error rate, X-fold cross-validation is used to obtain a cross-validated error rate, from which the optimal tree is selected. X-fold cross-validation involves creating X random subsets of the original data, setting one portion aside as a test set, constructing a tree from the remaining X-1 portions, and evaluating that tree on the test portion. This is repeated for all portions, and the errors from the X portions are combined into the cross-validated error rate. The tree yielding the lowest cross-validated error rate (xerror) is selected as the tree that best fits the data. In this case, that is tree number 4, which has 5 splits.
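The cross-validation procedure described above can be sketched as follows (a hypothetical illustration, not RStat's code; `fit` and `predict` stand in for any model-building and scoring functions):

```python
# X-fold cross-validation sketch: hold each fold out once as a test set,
# fit on the remaining folds, and combine the per-fold errors into a
# single error estimate.

def cross_validated_error(rows, labels, fit, predict, x=10):
    """Return the fraction of rows misclassified across all X held-out folds."""
    folds = [list(range(i, len(rows), x)) for i in range(x)]
    errors = 0
    for test_idx in folds:
        test = set(test_idx)
        train_rows = [r for i, r in enumerate(rows) if i not in test]
        train_labels = [y for i, y in enumerate(labels) if i not in test]
        model = fit(train_rows, train_labels)
        errors += sum(predict(model, rows[i]) != labels[i] for i in test_idx)
    return errors / len(rows)
```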

Standard Error (xstd). This is the standard deviation of error across the cross-validation sets.


Procedure: How to Evaluate a Decision Tree Model

In this procedure, you will produce the error matrix to evaluate how many records are correctly classified.

  1. Go to the Data tab and uncheck the Sample box.
  2. Go to the Model tab and execute the model.
  3. On the Evaluate tab, select Error Matrix for the Type, load the data set for the Data, then click Execute.

    Two matrices are produced. The first gives the counts of correctly and incorrectly classified records. For example, out of 77 records with good credit, 70 were classified correctly and 7 were misclassified. Out of 501 records with bad credit, 473 were classified correctly and 28 were misclassified. The second table gives the percentages of correctly and incorrectly classified records. Summing the correctly classified percentages gives 82+12=94 percent correctly classified cases. However, it is important to see whether the model better classifies the positive or the negative cases, for example, whether it predicts good credit more accurately than bad credit. Those assessments are made by the modeler. See the evaluation techniques and examples in Building a Logistic Model.

    Error matrix for the Tree model on ab_credit_training.csv [test] (counts):
             Actual
    Predicted   0   1
            0 473   7
            1  28  70
    Error matrix for the Tree model on ab_credit_training.csv [test] (%):
             Actual
    Predicted  0  1
            0 82  1
            1  5 12
    Overall error: [1] 0.06055363
    Generated by RStat 2009-03-04 19:55:49 
    ======================================================================
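For illustration, an error matrix like the ones above can be computed from actual and predicted labels as follows (a hypothetical sketch; RStat produces this output itself, and the function name is invented). Rows are predicted classes and columns are actual classes:

```python
# Build a counts matrix keyed by (predicted, actual) and compute the
# overall error as the fraction of off-diagonal (misclassified) rows.

def error_matrix(actual, predicted, classes=(0, 1)):
    counts = {(p, a): 0 for p in classes for a in classes}
    for a, p in zip(actual, predicted):
        counts[(p, a)] += 1
    n = len(actual)
    overall_error = sum(v for (p, a), v in counts.items() if p != a) / n
    return counts, overall_error
```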
