Evaluating the Model

Exporting Rpart Rules

The Rpart Rule Export functionality allows you to create and store unique versions of scoring data for a selected database on a local drive or directory. The basic functionality involves choosing a database, creating a model, and then scoring the data based on the values generated and stored for that modeling scenario.

Note:

At each step of this process, click the Execute button to apply your changes. After you select the Tree model type, you must click Execute so that the model is built with the Rpart model builder. When you then score the data, RStat uses the model built for the selected database to produce the scoring results.
The Tree model type must be selected during modeling. The Tree model type uses rpart as its model builder, as shown in the following image.

From the Model tab, you can model the scenario, clicking Execute after you make a selection. You can then export the rules. For more information, see How to Execute the Rpart Rule Export.

From the Evaluate tab, if the Score option is selected, RStat appends the rules after the data set within the resulting .csv file. The data specific to this functionality is identified with a column heading of Rpart, allowing you to locate the newly generated Rpart data easily. This is illustrated in the following image.

RStat has a unique method for naming the files associated with the Rpart Rule Export functionality. The file name for the output of the scoring routine is derived from the original database file name. RStat appends _train_score_all to the file name. For example, if the originating database is database.csv, the resulting filename is:

database_train_score_all.csv

Alternatively, you can provide your own file name or append information to the default file name assigned. For example, you may run the same scenario on different dates. You might use a date convention (for example, _mmddyyyy) to archive your files. In general, this file naming convention ensures that the original name of the database is preserved while marking the new output file as one that contains scoring data.
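
The naming convention described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not RStat's own code: the function name default_score_filename and the sample date are hypothetical, and the optional suffix follows the _mmddyyyy example from the text.

```python
from datetime import date
from pathlib import Path
from typing import Optional


def default_score_filename(database_file: str, run_date: Optional[date] = None) -> str:
    """Build a score file name from the original database file name."""
    # Keep the original database name and mark the file as scoring output.
    name = f"{Path(database_file).stem}_train_score_all"
    if run_date is not None:
        # Optional archive suffix in the _mmddyyyy form (an assumed convention).
        name += run_date.strftime("_%m%d%Y")
    return name + ".csv"


print(default_score_filename("database.csv"))
print(default_score_filename("database.csv", date(2024, 3, 15)))
```

The first call reproduces the default name shown above. The second shows how a date suffix keeps separate runs of the same scenario from overwriting each other.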

You can store these files locally, using either the default file names or names that you specify, for future analysis. For example, you can use WebFOCUS to create reports and graphs or to perform other BI-related tasks. Specifically, you can use the following resources:

Creating Reports with WebFOCUS Language
Creating Reports with Graphical Tools

Note: After scoring and saving the resulting data, RStat displays the path and file name of the file that you saved when the application returns to the Evaluate tab, as shown in the following image.

Procedure: How to Execute the Rpart Rule Export

1. Open RStat.
2. Click the folder adjacent to the Filename field and select a database file.
   Note: The Rpart Rule Export does not support a sampling or testing data set. To execute the Rpart Rule Export, you must clear the Sample check box on the Data tab. This converts the data into training data from which rules can be extracted.
3. Click Open to confirm the database selection and return to the RStat interface.
4. Click Execute.
5. Click the Model tab and for Type, select Tree.
6. Click Execute.
7. Click the Evaluate tab and for Type, click Score.
   Note: Clear the check boxes for the Neural Net and SVM models, if necessary.
8. For Data, select Training.
9. For Include, select All.
10. Click Execute to score the Rpart data and save it to a local file, as shown in the following image.
Note:
- On the Evaluate tab, the Execute and Export buttons produce the same results.
- You can save the output file to the default directory or, optionally, create a new folder for archiving purposes. You can also rename the file as required.

Evaluating the Decision Tree

Evaluation techniques in RStat allow you to investigate how well your model will make predictions. The available evaluation techniques are determined by the type of model you have generated.

Select the Evaluate tab.

Notice that the Model Type is defined as Tree and a variety of evaluation techniques are presented.
Select the evaluation data.

You can use the following data sets to evaluate the current model.

From the current data:
- Training. The initial data used to build the model. If you are using sampling, this is a randomly selected percentage of your data set, as defined on the Data tab.
- Testing. Only available when sampling has been used to build the model. This contains the data not used in the training data set. If the sample has been set to 70%, the testing data set contains the remaining 30%.
From new data sources:
- CSV
- R Dataset
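
As a rough illustration of how a sampled training set and its complementary testing set relate, the following Python sketch splits a data set 70/30. It is not RStat's sampling routine: the function name split_rows, the fixed seed, and the 1,000 placeholder rows are assumptions made for the example.

```python
import random


def split_rows(rows, sample_pct=70, seed=42):
    """Randomly assign sample_pct percent of rows to training, the rest to testing."""
    shuffled = list(rows)
    # A fixed seed (an arbitrary choice here) makes the split repeatable.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * sample_pct / 100)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 1,000 row numbers.
training, testing = split_rows(range(1000), sample_pct=70)
print(len(training), len(testing))
```

With a 70% sample, 700 rows land in the training set and the remaining 300 form the testing set. No row appears in both.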

Error Matrix

An error matrix shows the relationship between the actual data and the predicted values.

With Error Matrix selected, click Execute.

Error Matrix window

Two error matrices are displayed. The first matrix shows the count of cases and the second shows the percentage of cases.

Looking at the second matrix, you can see that the model predicts the following:

In 83% of the cases (cell (0,0)), the actual value of bad credit was matched by the predicted value.
In 13% of the cases (cell (1,1)), people with good credit were correctly classified.
The remaining 4% were misclassified.
In total, 83% + 13% = 96% of the cases were classified correctly.
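
The arithmetic behind the two matrices can be reproduced with a short Python sketch. The data here is hypothetical: it uses 100 cases chosen to match the 83% and 13% figures above, and the way the 4% of misclassified cases is divided between the two off-diagonal cells (1 and 3) is an assumption, since the text does not give that breakdown. The function names are also illustrative.

```python
from collections import Counter


def error_matrix(actual, predicted):
    """Count the cases in each (actual, predicted) cell."""
    return Counter(zip(actual, predicted))


def as_percentages(counts):
    """Convert cell counts to percentages of all cases."""
    total = sum(counts.values())
    return {cell: 100 * n / total for cell, n in counts.items()}


# Hypothetical data: 0 = bad credit, 1 = good credit.
# 83 cases in cell (0,0), 1 in (0,1), 3 in (1,0), 13 in (1,1).
actual = [0] * 83 + [0] * 1 + [1] * 3 + [1] * 13
predicted = [0] * 83 + [1] * 1 + [0] * 3 + [1] * 13

counts = error_matrix(actual, predicted)
pct = as_percentages(counts)
# Correctly classified cases lie on the diagonal: (0,0) and (1,1).
accuracy = pct[(0, 0)] + pct[(1, 1)]
print(counts)
print(accuracy)
```

The first matrix corresponds to counts and the second to pct. The diagonal cells sum to the overall share of correctly classified cases.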

Scoring New Data

In RStat, you can score new data to see how well your model predicts. The Score data option will create a new CSV file with the scored values.

Select Score as the Evaluation type.

New Score options appear at the bottom of the tab panel.

Report Options

Report options are available only for Binary Trees and Logistic Regressions, where your target is binary (two unique values). For other models, the Report options will be grayed out.

The Report options define the type of score to be returned.
- Class. A categorical value derived from a 0 to 1 scale, where values from 0 through .5 map to 0 and values from .5 through 1 map to 1.
- Probability. A numeric value between 0 and 1 representing the likelihood that the result will be the higher of the two target values. For character-based targets, the higher value is determined alphabetically. For example, if your target is Gender with Male and Female as the values, the probability is the likelihood that the outcome will be Male.
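
The relationship between the two report options can be sketched as follows. This is an illustration only: the function names are hypothetical, and the handling of a probability of exactly .5 (mapped to 0 here) is an assumption, because the description above leaves that boundary case open.

```python
def higher_value(target_values):
    """Return the target value treated as 'higher'.

    For same-case strings, Python's ordering matches alphabetical order.
    """
    return max(target_values)


def to_class(probability, threshold=0.5):
    """Map a probability on a 0 to 1 scale to a class of 0 or 1."""
    # Assumption: exactly 0.5 maps to 0; the source does not specify this.
    return 1 if probability > threshold else 0


print(higher_value(["Male", "Female"]))
print([to_class(p) for p in (0.2, 0.5, 0.8)])
```

For the Gender example, Male sorts after Female, so a Probability score is read as the likelihood of Male, and a Class score of 1 corresponds to Male.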
Include Options

Include options allow you to define which fields should be included in the scored file.
- Identifiers. Includes the identifier, the target, and the score value.
- All. Includes all variables in the data set plus the score value.
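
The effect of the two Include options on the scored file's columns can be sketched as below. The function name and the data set columns (customer_id, age, income, credit) are hypothetical. The score column name rpart is the one used for Tree models, and this sketch does not model the additional rules column that the Rpart Rule Export writes.

```python
def scored_columns(all_columns, include, identifier, target, score="rpart"):
    """Return the column names written to the scored file for an Include choice."""
    if include == "All":
        # Every variable in the data set, followed by the score value.
        return list(all_columns) + [score]
    if include == "Identifiers":
        # Only the identifier, the target, and the score value.
        return [identifier, target, score]
    raise ValueError(f"unknown Include option: {include!r}")


columns = ["customer_id", "age", "income", "credit"]  # hypothetical data set
print(scored_columns(columns, "Identifiers", "customer_id", "credit"))
print(scored_columns(columns, "All", "customer_id", "credit"))
```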
Once you have defined the Scoring options, click Execute in the RStat toolbar.
The Score Files dialog box opens.
Define the file name and location where the scored data will be saved.
Note: The file name that you define will be the exact name used, so be sure that the file name contains a .csv extension.

In the example below, the Include option is set to All instead of Identifiers. The output file contains all fields (variables) plus the scored value (column name rpart) and the rules column (column name RpartRules).

Note: For each data line, the RpartRules column contains the rule details. Check this column and verify that no data line is missing its rules, as shown in the following image.
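
For large score files, the check for missing rules can also be scripted outside RStat. The sketch below uses Python's standard csv module and assumes the rules column is named RpartRules, as described above. The function name, the sample rows, and the rule text are made up for illustration and do not reflect the exact rule format that RStat writes.

```python
import csv
import io


def rows_missing_rules(csv_file, rules_column="RpartRules"):
    """Return the 1-based data-line numbers whose rules column is empty."""
    reader = csv.DictReader(csv_file)
    if rules_column not in (reader.fieldnames or []):
        raise KeyError(f"column {rules_column!r} not found")
    return [
        line_no
        for line_no, row in enumerate(reader, start=1)
        # Treat a missing or whitespace-only cell as a missing rule.
        if not (row[rules_column] or "").strip()
    ]


# Hypothetical scored output with one data line that lacks a rule.
sample = io.StringIO(
    "id,age,rpart,RpartRules\n"
    "1,34,0,age < 40\n"
    "2,51,1,\n"
)
print(rows_missing_rules(sample))
```

To check a saved file, pass an open file object instead of the in-memory sample, for example open("database_train_score_all.csv", newline=""). An empty list means every data line has its rule details.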