In this section, we will review the different model types that are available in RStat and build a Decision Tree model. For more information, see Building the Decision Tree Model.
Using RStat, you can define and execute various models against a selected database.
The supported model types and their exportability are listed in the following table.
Model Type | Exportability |
---|---|
Ada Boost | C Exportable (.c); PMML Exportable (.xml) |
Binomial Regression | C Exportable (.c); PMML Exportable (.xml) |
Decision Tree | C Exportable (.c); PMML Exportable (.xml) |
Gamma Regression | C Exportable (.c); PMML Exportable (.xml) |
Gaussian Regression | C Exportable (.c); PMML Exportable (.xml) |
Inverse Gaussian Regression | C Exportable (.c); PMML Exportable (.xml) |
Linear Regression | C Exportable (.c); PMML Exportable (.xml) |
Logistic Regression | C Exportable (.c); PMML Exportable (.xml) |
Multinomial Regression | C Exportable (.c); PMML Exportable (.xml) |
Negative Binomial | C Exportable (.c); PMML Exportable (.xml) |
Neural Net | C Exportable (.c); PMML Exportable (.xml) |
Poisson Regression | C Exportable (.c); PMML Exportable (.xml) |
Random Forest | C Exportable (.c); PMML Exportable (.xml) |
Survival | C Exportable (.c); PMML Exportable (.xml) |
SVM | PMML Exportable (.xml) |
The following procedure provides basic guidance for executing a model type option.
For example, the following image illustrates the results of modeling using SVM with the default Kernel value selected (Radial Basis (rbfdot)). The novelty detection: one-svc option was selected.
The following sections present key functionality for each of the different model types.
The Ada Boost model uses ada as its underlying algorithm (model builder). Boosting builds multiple, but generally simple, models. These models might be decision trees that have just one split, which are commonly referred to as decision stumps.
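To make the idea of a decision stump concrete, the following minimal Python sketch fits a single-split tree to a toy data set. It is only an illustration of the concept: RStat builds its boosted models with the ada model builder, and the function name `fit_stump` and the sample data here are hypothetical.

```python
def fit_stump(xs, ys):
    """Find the threshold on one numeric input that best separates 0/1 labels.

    Returns (threshold, label_below, errors): rows with x <= threshold are
    predicted as label_below, and all other rows as the opposite label.
    """
    best = None
    for threshold in sorted(set(xs)):
        for label_below in (0, 1):
            errors = sum(
                (label_below if x <= threshold else 1 - label_below) != y
                for x, y in zip(xs, ys)
            )
            if best is None or errors < best[2]:
                best = (threshold, label_below, errors)
    return best


# Hypothetical toy data: one input variable and a binary outcome.
ages = [22, 25, 31, 38, 44, 52]
good_risk = [0, 0, 0, 1, 1, 1]

threshold, label_below, errors = fit_stump(ages, good_risk)
```

A boosted model combines many such stumps, each one concentrating on the rows that the earlier stumps predicted poorly.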
The Boost model allows you to specify the number of trees, in addition to other criteria, as shown in the following image.
The following table lists and describes the fields that are used to adjust the Boost model.
Field Name | Description |
---|---|
Number of Trees | The number of trees to build. Note: In order to ensure that every input row is predicted at least a few times, this value should not be set to a number that is too low. The default value is 50. |
Max Depth | Allows you to set the maximum depth of any node of the final tree. The root node is counted as depth 0. The default value is 30. Note: Values greater than 30 will generate invalid results on 32-bit machines. |
Stumps | If the Stumps check box is selected, you can build stumps using the Boost model. If the Stumps check box is not selected, the results in the default values are deactivated. |
Min Split | The minimum number of entities that must exist in a data set at any node for a split of that node to be attempted. The default value is 20. |
Complexity | Also known as the complexity parameter (cp), this value allows you to control the size of the decision tree and select the optimal size tree. If the cost of adding another variable to the decision tree from the current node is above the value of the cp, then tree building does not continue. The default value is 0.0100. Note: The main role of this parameter is to save computing time by pruning unnecessary splits. |
X Val | The number of cross-validations to perform. The default value is 10. |
Once you have defined your model criteria, you must click the Execute button to review the results, as shown in the following image.
The Decision Tree option is used to generate a decision tree, which is the prototypical data mining technique. It is widely used because of its ease of interpretation. The Decision Tree model uses an underlying algorithm (model builder) of rpart, as shown in the following image.
The following table lists and describes the fields that are used to adjust a Decision Tree model.
Field Name | Description |
---|---|
Min Split | The minimum number of entities that must exist in a data set at any node for a split of that node to be attempted. The default value is 20. |
Min Bucket | The minimum number of entities allowed in any leaf node of the decision tree. The default value is one third of the min split. |
Max Depth | Allows you to set the maximum depth of any node of the final tree. The root node is counted as depth 0. The default value is 30. Note: Values greater than 30 will generate invalid results on 32-bit machines. |
Complexity | Also known as the complexity parameter (cp), this value allows you to control the size of the decision tree and select the optimal size tree. If the cost of adding another variable to the decision tree from the current node is above the value of the cp, then tree building does not continue. The default value is 0.0100. Note: The main role of this parameter is to save computing time by pruning unnecessary splits. |
Priors | Allows you to set the prior probabilities for each class. |
Loss Matrix | Allows you to weight the outcome classes differently. |
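The way Min Split and Min Bucket work together can be sketched as follows. This Python fragment is an illustration rather than RStat code: the function names are hypothetical, and rounding one third of Min Split to the nearest whole number is an assumption about how the default is derived.

```python
def default_min_bucket(min_split=20):
    """Default Min Bucket: one third of Min Split (rounding is an assumption)."""
    return round(min_split / 3)


def split_allowed(n_node, n_left, n_right, min_split=20, min_bucket=None):
    """Return True if a node with n_node rows may be split into two children."""
    if min_bucket is None:
        min_bucket = default_min_bucket(min_split)
    if n_node < min_split:
        return False  # too few entities at the node to attempt a split
    # Every resulting leaf must hold at least min_bucket entities.
    return n_left >= min_bucket and n_right >= min_bucket
```

With the default values, a node of 20 rows can be split 13/7 but not 14/6, because the smaller child would fall below the Min Bucket of 7.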
Once you have specified your model criteria, you must click the Execute button to view the result. Since Tree was selected, the Summary of the Decision Tree model displays, as shown in the following image.
Note: The default values for all fields were used to create this example.
You can view the summary or access other features of the application, including Rules, Draw, and FancyPlots. For more information, see:
Regression is a traditional approach to modeling. The Regression model uses the glm model builder (logit). Logistic regression (using the binomial family) is used to model binary outcomes. Linear regression is used to model a linear numeric outcome. When the outcome to be predicted is a count, the Poisson family is used. Generalized regression is a generalization of standard linear regression that allows for response variables that do not follow a normal distribution. Multinomial regression generalizes logistic regression in that it allows more than two discrete outcomes. For more information, see Building a Logistic Model and Building a Linear Regression Model.
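The difference between these regression families can be illustrated by how each one turns the same linear combination of inputs into a prediction. The following Python sketch assumes the usual link for each family (identity, logit, and log); the function name and the coefficients are hypothetical and are not taken from RStat.

```python
import math


def predict_mean(family, linear_predictor):
    """Map a linear predictor to the predicted outcome for a model family."""
    if family == "linear":
        return linear_predictor  # numeric outcome, used as is
    if family == "logistic":
        return 1.0 / (1.0 + math.exp(-linear_predictor))  # probability in (0, 1)
    if family == "poisson":
        return math.exp(linear_predictor)  # expected count, always positive
    raise ValueError(f"unsupported family: {family}")


# Hypothetical fitted coefficients: an intercept and one slope.
intercept, slope = -1.0, 0.5
eta = intercept + slope * 2.0  # linear predictor for an input value of 2.0
```

For this input the linear predictor is 0, which corresponds to a probability of 0.5 under logistic regression and an expected count of 1 under Poisson regression.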
RStat supports regression and advanced regression, as shown in the following image.
The following techniques related to regression can be performed:
Neural Network (Neural Net) is an older approach to modeling. The Neural Net model uses a structure that resembles the neural network of a human being. When applied to modeling, the concept is to build a network of neurons that are connected by synapses. Rather than generate electrical signals, however, the network propagates numbers.
When using the Neural Net model, you can include or exclude the interval of networks from your model. The interval of networks is used to define and describe the relationships within your model. Display of these intervals is accomplished using the Skip field. When set to True (default), the interval of networks displays. If the Skip field is set to False, the interval of networks does not display.
The default value of True is shown in the following image.
The following table lists the options that are available on the Neural Net model screen.
Field Name | Description |
---|---|
Hidden Layer Nodes | The number of nodes in the hidden layer. The default is 10. |
Skip | Set to a value of TRUE by default, this field is a switch that adds skip-layer connections from the input to the output. |
In the following diagram, the relationships between the different layers of a neural network are illustrated. The bottom portion represents the input layer, the middle nodes form the hidden layer, and the top components constitute the output layer.
Note: The number of nodes varies in different applications. In addition, you must have a data set selected and loaded in order to use this functionality.
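How numbers propagate from the input layer through the hidden layer to the output can be sketched in a few lines of Python. This is a simplified illustration that assumes a single hidden layer with logistic activations, a single linear output, and optional skip-layer connections from the inputs directly to the output. The function name and all weights are hypothetical.

```python
import math


def forward(inputs, hidden_weights, output_weights, skip_weights=None):
    """Propagate numbers through one hidden layer to a single output.

    hidden_weights: one weight list per hidden node (bias first).
    output_weights: bias followed by one weight per hidden node.
    skip_weights:   optional weights linking inputs directly to the output.
    """
    hidden = [
        1.0 / (1.0 + math.exp(-(w[0] + sum(wi * x for wi, x in zip(w[1:], inputs)))))
        for w in hidden_weights
    ]
    total = output_weights[0] + sum(w * h for w, h in zip(output_weights[1:], hidden))
    if skip_weights is not None:
        # Skip-layer connections bypass the hidden layer entirely.
        total += sum(w * x for w, x in zip(skip_weights, inputs))
    return total
```

Calling `forward` with and without `skip_weights` on the same inputs shows how much of the output comes from the direct input-to-output connections.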
The following example shows the results of a Neural Net model with the Skip field set to True. Accordingly, RStat displays the interval of networks.
The Random Forest model uses randomForest as its underlying algorithm, building multiple decision trees from different samples of the data set. While building each tree, random subsets of the available variables are considered for splitting the data at each node of the tree.
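The two sources of randomness described above can be sketched in Python as follows. The sketch assumes that each tree is built from a sample of rows drawn with replacement, which is a common choice but is not stated here. The function names and variable names are hypothetical.

```python
import random


def draw_tree_sample(n_rows, rng):
    """Row indices for one tree, drawn with replacement (an assumption)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def candidate_variables(variables, n_candidates, rng):
    """Random subset of variables considered for the split at a single node."""
    return rng.sample(variables, n_candidates)


rng = random.Random(42)  # fixed seed so that the sketch is repeatable
variables = ["age", "income", "education", "tenure", "region", "balance"]
rows_for_tree = draw_tree_sample(10, rng)
split_candidates = candidate_variables(variables, 4, rng)
```

Because every tree sees a different sample and every node sees a different subset of variables, the individual trees differ from one another, and their combined prediction is more stable than that of any single tree.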
You can specify a number of trees (the default is 500) and a number of variables (the default is 4), as shown in the following image.
The following table lists and describes the fields that are used to adjust the Random Forest model.
Field Name | Description |
---|---|
Number of Trees | The number of trees to build. Note: In order to ensure that every input row is predicted at least a few times, this value should not be set to a number that is too low. The default value is 500. |
Number of Variables | The number of variables to be considered at any time in deciding how to partition the data set. At each split, this number of variables is randomly sampled as candidates. |
Sample Size | Size(s) of a sample to draw. |
Impute | This applies to missing values. If this check box is selected, the variables are transformed by replacing any NA values with one of the following (depending on the current selection): |
Once you have indicated your criteria, you must click the Execute button to see the results, as shown in the following image.
The Survival Model is used to model time-to-event data. When using this option, you can select Cox Proportional Hazards (coxph) or Parametric (survreg) to perform your Survival Analysis, as shown in the following image. For more information on the Survival model, see Building a Survival Model.
The following table lists and describes the fields that are used to adjust a Survival model.
Field Name | Description |
---|---|
Time | The variable that you selected (time) on the Data tab. |
Status | The variable that you selected (status) on the Data tab. |
Model Builder | The name of the model builder (coxph or survreg) related to the selection you made: Cox Proportional Hazards or Parametric. |
Cox Proportional Hazards | A general regression model that predicts individual risk relative to the population. |
Parametric | Also known as the accelerated failure time model, this regression model predicts the expected time to the event of interest. |
Survival | This option enables you to view the results of the Cox Proportional Hazards model. For more information, see Building a Survival Model. |
Residuals | This button enables the testing of the assumption of proportional hazards. |
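The phrase "individual risk relative to the population" can be illustrated with the standard proportional hazards form, in which the relative risk is the exponential of the weighted differences between an individual's values and the population means. The following Python sketch uses hypothetical coefficients and is not output from RStat.

```python
import math


def relative_risk(coefficients, individual, population_mean):
    """Risk of an individual relative to an average member of the population.

    Under proportional hazards this is exp(sum(b * (x - mean))).
    """
    score = sum(
        b * (x - m) for b, x, m in zip(coefficients, individual, population_mean)
    )
    return math.exp(score)
```

An individual whose values equal the population means has a relative risk of 1. Values above 1 indicate higher risk than average, and values below 1 indicate lower risk.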
The Support Vector Machine (SVM) model is a modern approach to modeling in which the data is mapped to a higher-dimensional space. This increases the possibility that vectors separating the classes will be found.
You can select the SVM option to specify a kernel and related options in support of the model.
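A kernel measures the similarity of two rows as if they had already been mapped to the higher-dimensional space. As a minimal sketch, the radial basis kernel can be written in Python as follows. The function name and the default `sigma` value of 1.0 are assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Radial basis kernel: exp(-sigma * squared distance between x and y)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sigma * squared_distance)
```

Identical rows produce a value of 1, and the value approaches 0 as the rows move farther apart. A larger `sigma` makes the similarity fall off more quickly.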
The Options drop-down list defaults to C classification: C-svc, as shown in the following image. If you do not make a selection from the Options drop-down list, the default value is used.
The following table lists the options that are available on the SVM model screen.
Field Name | Description |
---|---|
Kernel | The kernel function is used in training and predicting. You can select one of the following kernels: |
Options | The ksvm model builder can be used for classification, regression, or novelty detection. When using the ksvm model builder, you can select one of the following options from the drop-down list: |
This section reviews the basic procedure for building a Decision Tree model.
The model metadata or output appears in the model textview area, as shown in the following image.
The Decision Tree generates rules that predict the score and divides the sample data into multiple segments (branches). Each branch terminates in a node that associates a subset of the customers with a predicted score. The rules describe the criteria that qualify for each node. The predicted score is a probability value between 0 and 1. Customers with a probability of 0.5 or greater are predicted as a good risk, and those with a probability of less than 0.5 are predicted as a bad risk.
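The scoring rule can be expressed as a short Python sketch. The function name and the `cutoff` parameter are hypothetical; only the 0.5 threshold comes from the description above.

```python
def classify_risk(probability, cutoff=0.5):
    """Label a predicted score as a good or bad risk using the 0.5 cutoff."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return "good" if probability >= cutoff else "bad"
```

For example, a node with a predicted score of 0.58 is labeled a good risk, while a node with a score of 0.42 is labeled a bad risk.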
You can display the rules or diagram the nodes.
The Rpart Rules GUI displays, as shown in the following image.
The Export Rpart Rules dialog displays, as shown in the following image. This enables you to specify a file name and a folder location.
Note: You can save the file in the current (default) folder or specify a new folder in which to place the file by clicking Browse for other folders.
The diagram displays in the RGui.
The colored numbers at the end of each node correspond to the rules, as shown in the following image.
Accessed from the Model tab in the RStat application, the FancyPlot functionality enables you to plot data from the database with which you are working. You can also split the plot or collapse nodes in the tree. FancyPlots provide unique charting options that can be customized for each data set.
The FancyPlot functionality displays, as shown in the following image.
You can use varying combinations in the primary sections: SplitType and BranchType. Upon selecting different combinations of type and branch, click Execute in the toolbar to update the tree image in the drawing area.
The following table defines these options.
Group | Field | Description |
---|---|---|
SplitType | Default | The default. Draw a split label at each split and a node label at each leaf. |
SplitType | List All Nodes | Label all nodes, not just the leaves. Similar to all=TRUE in text.rpart. |
SplitType | Labels Under Nodes | Similar to List All Nodes, but draws the split labels below the node labels. Similar to the plots in the CART book. |
SplitType | Label Both Directions | Draw separate split labels for the left and right directions. |
SplitType | List All Nodes and Both Directions | Similar to Label Both Directions, but labels all nodes, not just leaves. |
BranchType | Default | The default. The branch lines are drawn conventionally. |
BranchType | Deviance | Deviance |
BranchType | SQRT(Deviance) | Square root of the deviance |
BranchType | Deviance/Obs | Deviance / nobs |
BranchType | SQRT(Deviance/Obs) | The standard deviation when method=anova |
BranchType | weight | Also known as frame$wt, this is the number of observations at the node, unless the rpart weight argument was used. |
BranchType | Complexity | |
BranchType | Abs | The absolute value of the predicted value. |
BranchType | Pred.Val.-min(Pred.Val.) | |
BranchType | Constant | For checking visual perception of the relative width of branches. |
The GUI also has a toolbar with options that enable some of the basic functionality of charting. From left to right, these options are: Save current output to file, Execute current selections, Clear output area, Collapse nodes and re-plot model, and Export the current tree model.
You can use any combination of Split Type and Branch Type to build your chart. You can clear the output area before creating a new chart or work with the initial chart that displays by default. You can perform the following tasks when charting with FancyPlots:
Some additional concepts for charting with FancyPlots include:
When working with a decision tree in a FancyPlot, you can edit it using the scissors icon. This enables you to determine which branches of the tree will be included in your decision tree.
When snipping the decision tree, you must snip at the intersection of a branch, as shown in the red circle in the following image.
Once you have made a snip, you can click on the Quit icon to redraw the FancyPlot without the section that you just snipped. The Quit icon displays in the upper-left corner of the screen.
Within the decision tree, the lines of the branches are black until a snip is made. Snipping changes the color to grey, indicating that this portion of the decision tree will be removed based on your snip. This is shown in the following image.
Note: If you snip at the top of the branch (where subordinate or lower branches are included), fewer nodes display, reducing the size of the chart (when the intersection and lower branches are removed).
Once you click Quit, the updated FancyPlot displays (showing the area that was snipped).
Note: In the diagram above, under the Education = Mst section, compare the node with YES = 100% of 8 before and after snipping. The value of the No node is 58% of 184 both before and after snipping; however, it is now the end node.
As you are editing the decision tree, different versions of the tree are saved for each change you make, as shown in the following image.
This process creates a history for the chart with which you are working. You can create historical snapshots of your files and routines (PDFs or C Routines), as described in the following sections.
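Conceptually, each snip produces a new version of the tree while leaving the earlier versions intact. The following Python sketch models that behavior with a nested dictionary. The tree structure, the node numbers, and the function name are hypothetical and do not reflect how RStat stores its models.

```python
import copy


def snip(tree, node_id):
    """Return a copy of tree in which node_id has become an end node."""
    snipped = copy.deepcopy(tree)

    def visit(node):
        if node["id"] == node_id:
            node["children"] = []  # drop every branch below the snip point
            return True
        return any(visit(child) for child in node["children"])

    if not visit(snipped):
        raise KeyError(f"node {node_id} not found")
    return snipped


# Hypothetical tree: node 1 is the root, and node 3 has two subordinate nodes.
original = {"id": 1, "children": [
    {"id": 2, "children": []},
    {"id": 3, "children": [{"id": 6, "children": []}, {"id": 7, "children": []}]},
]}
history = [original]  # one entry per saved version
history.append(snip(history[-1], 3))
```

Because `snip` works on a copy, the original tree remains available in `history`, and any earlier version can be selected and redrawn.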
Each original and snipped tree model can be saved as a unique .pdf.
Note: You can specify a different name for the report in the Name field.
Each original and snipped tree model can be saved as a C routine.
The Export C or PMML dialog box opens.
Note: You can specify a different name for the report in the Name field.
This section presents a series of screenshots that further illustrate the FancyPlot charting functionality. You can perform the following tasks with the FancyPlot GUI:
The first screen shows a basic plotted chart, through which the current tree model can be exported. Using the SplitType and BranchType categories on the left, you can vary the output of your plot.
Note: Some combinations are not allowed. In those cases, an error message will display.
When working with a plotted chart, you can clear the chart from the user interface, leaving a blank canvas with which to work.
Once you have an output, you can save it to a file.
Next, you can open the plot and use the edit (scissors) icon to collapse the nodes in the tree.
Note: You will see a Quit icon in the left corner of the diagram. In addition, all nodes are connected by a solid black line.
When you click one of the nodes, the sub-tree will be colored.
Click Quit and another tree displays (without the selected sub-trees).
After each snipping procedure, the new tree model name is added to the drop-down list.
Select any portion of the new model (other than the original tree model) and perform a snip.
Select any model in the drop-down list and click Execute. The selected model will be drawn in the drawing area.
Note: It is recommended that you try at least three different models before making a determination.
Save the snipped tree and check the saved image.