Glossary

This is a glossary of key terms and concepts in this manual as they relate to RStat.

advanced regression

A feature that supports more advanced regression techniques. Depending on the underlying data set, the types of advanced regression options available may include Normal (Linear), Gaussian, Poisson, Logistic, Multinomial, Binomial, Negative Binomial, Gamma, and Inverse Gaussian.

algorithm

A procedure for solving a mathematical problem (such as finding the greatest common divisor) in a finite number of steps, that frequently involves repetition of an operation. In RStat, algorithms form the basis of the model builder.

association rule mining

A popular technique for determining variable commonalities and associations when working with large databases. Association rule mining can also be used as a classification method.

Auto Target Data Type

The default data type.

binomial distribution

A discrete probability distribution of the number of successes in a sequence of a fixed number (n) of independent Yes or No experiments, each with the same probability of success.
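
As an illustration only (this is not RStat code), the probability of exactly k successes in n independent trials can be sketched in Python. The function name `binomial_pmf` and the example values n = 10 and p = 0.5 are assumptions chosen for the example.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent Yes/No trials,
    each with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical example: 10 independent trials, each with p = 0.5.
probabilities = [binomial_pmf(k, 10, 0.5) for k in range(11)]
```

The probabilities over all possible counts (0 through n) sum to 1.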

Boost Model

Uses the Ada (Adaptive) Boost Model, which generates and calls the classifiers in a series of rounds to achieve better classification.

C

A general purpose programming language. C is used by WebFOCUS as a means for creating functions to be used in data analysis.

Categoric Data Type

Data organized into groups of categories. Categoric Data Type could be any character data or any numeric data with 10 or fewer unique values.

class

A component of multinomial regression.

clustering

A method of organizing objects into distinct groups or clusters based on their similarities.

coefficient

A constant that represents the rate of change in the dependent variable as a function of changes in the independent variable. It is the slope of the regression line. For example, it shows how prices go up as the number of vintage years increases, that is, how wines become more expensive the longer they mature. For other data sets, the trend can be the inverse, that is, the slope can be negative.

Comma-Separated Value (CSV)

A common file format that enables plain text storage of data (typically tabular data values separated by commas).

correlation

A measure of relation between two or more variables.

correlation analysis

Determines if there is a linear relationship between two variables. It also measures the strength and direction of the relationship. Correlation analysis does not test whether two samples are different.

correlation coefficient

A measure of the degree to which the movements of two variables are associated. A correlation coefficient close to 0 indicates little or no linear relationship, while a correlation coefficient close to 1 or -1 indicates a strong positive or negative relationship.

Cox Proportional Hazards

A general regression model that predicts individual risk relative to the population.

cross tabulation

A method used to summarize data that is grouped into categories. This process creates contingency tables, which illustrate the summarized categorical data.
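
A minimal sketch of building a contingency table in Python follows. It is not RStat output; the `records` data and the field meanings (region, purchased) are hypothetical.

```python
from collections import Counter

# Hypothetical records: (region, purchased) pairs.
records = [
    ("East", "Yes"), ("East", "No"), ("West", "Yes"),
    ("East", "Yes"), ("West", "No"), ("West", "No"),
]

# Count how often each combination of categories occurs.
counts = Counter(records)
rows = sorted({region for region, _ in records})
cols = sorted({answer for _, answer in records})

# Contingency table: one row per region, one column per answer.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
```

Each cell of `table` holds the number of observations that fall into that combination of categories.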

data extract

An extraction of data based on a set of parameters.

data frame

Collections of individual observations (rows of data) across many variables (fields).

data set

The underlying data used for modeling.

data type

Determines the type of modeling available and the specific algorithms that will be used within the modeling process. It is defined based on the type of data RStat identifies and the quantity of unique values found in the actual data.

decision stump

A decision tree with only one split.
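
The idea of a single split can be sketched in Python as follows. This is an illustration under simple assumptions (one numeric input, labels 0 and 1, at least two distinct input values), not the algorithm RStat uses; `fit_stump` and `predict_stump` are hypothetical names.

```python
def fit_stump(xs, ys):
    """Return (threshold, left_label, right_label) with the fewest errors.

    Candidate thresholds are midpoints between sorted distinct x values.
    """
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        for left, right in ((0, 1), (1, 0)):
            errors = sum(
                (left if x <= t else right) != y for x, y in zip(xs, ys)
            )
            if best is None or errors < best[0]:
                best = (errors, t, left, right)
    return best[1:]

def predict_stump(stump, x):
    """Apply the single split to one value."""
    threshold, left, right = stump
    return left if x <= threshold else right

# Hypothetical data: small values belong to class 0, large values to class 1.
xs = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 1, 1]
stump = fit_stump(xs, ys)
```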

decision tree

A predictive data mining tool that predicts a categorical or continuous response using a tree-like model.

Decision Tree Model

The prototypical data mining technique and default model in RStat. The Decision Tree Model is widely used because of its ease of interpretation.

dendrogram

A tree diagram that illustrates the arrangement or categorization of clusters.

dependent (target) variable

In experimental conditions, the dependent variable is a value whose measure is driven by some independent, manipulated variable. This variable is typically the target variable in regression analysis.

DLL

Dynamic-Link Library.

end node

The final or end node on a FancyPlots diagram, from which no new branches derive.

estimation (predictive modeling)

One of two types of statistical inferences supported by RStat, estimation is the process of deriving expected and predicted values from observations. Decision trees, regression, and other models are used to generate estimates. For example, the user can estimate whether a prospect is a good target for a particular marketing campaign, or the expected sales revenues for different stores in order to determine whether store layout and product mix has an impact on sales.

FancyPlots

A technique within R to interactively and graphically represent decision trees, whereby users are able to prune the trees and output the results in C. FancyPlots is an advanced plot function for a decision tree.

FEX

A WebFOCUS executable report procedure (FOCUS executable, also known as FOCEXEC).

F-test

Used to determine if the standard deviations of two samples are the same. If the standard deviations are not the same, then the bell-shaped curves will be different for the two samples. If the samples have the same standard deviations, then a T-test can be conducted to test if the means are equal. The test is also referred to as a test on the variance of two samples and is used in analysis of variance (ANOVA).

Generalized Regression

Regression technique that generalizes standard linear regression, allowing for response variables that fall outside of a normal distribution.

GLM

Generalized Linear Models, specific to regression.

GUI

Graphical User Interface.

hidden layer nodes

The number of hidden layers to display in a Neural Network (NNet) model. In RStat, NNet is, by default, a single layer neural net model, with the option to display the hidden layer.

hypothesis testing

Provides the user with the ability to use samples to test whether or not the null hypotheses are likely to be true.

Kendall Correlation Coefficient

A non-parametric measure of the association between two variables. As a method for computing the correlation coefficient, it does not assume that the data is normally distributed. The Kendall Correlation Coefficient is used to measure correlation between rankings and cross-tabulations.
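
One common form of this coefficient, often called tau-a, counts concordant and discordant pairs of observations. The Python sketch below assumes no special handling of ties and is not necessarily the variant RStat computes; `kendall_tau_a` is a hypothetical name.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1   # both variables move in the same direction
        elif direction < 0:
            discordant += 1   # the variables move in opposite directions
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical rankings given by two judges to the same five items.
tau = kendall_tau_a([1, 2, 3, 4, 5], [1, 3, 2, 5, 4])
```

In this example, 8 of the 10 pairs are concordant and 2 are discordant, so the coefficient is (8 - 2) / 10 = 0.6.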

Kernel Value

Also known as a Kernel Function, the Kernel Value is a parameter under the Model Tab (specific to SVM) that is used in training and prediction. It can be set to any function of class kernel. The options are Radial Basis (rbfdot), Polynomial (polydot), Linear (vanilladot), Hyperbolic Tangent (tanhdot), Laplacian (laplacedot), Bessel (besseldot), ANOVA RBF (anovadot), and Spline (splinedot).

Kolmogorov-Smirnov

A non-parametric test for quantifying the distance of continuous, one-dimensional probability distributions. Kolmogorov-Smirnov can be a one sample test, which compares a sample with a reference probability distribution, or a two sample test, which compares the relationship between two empirical distribution functions.

leaf node

Also known as the terminal node, the leaf node contains a small set of observations. In Decision Tree modeling, branch nodes are typically pruned into leaf nodes.

linear regression

A regression technique used to model a linear numeric outcome.

logarithm

The power (exponent) to which a base number must be raised in order to get the original number.
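
For example, the base-10 logarithm of 1000 is 3, because 10 raised to the power 3 is 1000. A short Python illustration using the standard `math` module:

```python
import math

# 10 raised to the power 3 gives 1000, so the base-10 logarithm of 1000 is 3.
log_base_10 = math.log10(1000)

# 2 raised to the power 3 gives 8, so the base-2 logarithm of 8 is 3.
log_base_2 = math.log2(8)
```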

logistic regression

A form of regression analysis used to model binary outcomes.

modeling

The ability to model different scenarios and develop models that can be deployed as scoring applications. Modeling is part of the process of developing predictive outcomes.

multinomial regression

A regression technique that generalizes logistic regression, in that it allows for more than two discrete outcomes.

negative binomial distribution

A discrete distribution of the number of successes in a sequence of Bernoulli trials before a specified number of failures occur.

Neural Network (NNet) model

Uses a structure that resembles the neural network of a human being. When applied to modeling, the concept is to build a network of nodes that are linked by connections. Once in place, the network propagates numbers.

node

Represents a branch or leaf in a Decision tree.

non-parametric test

A test that makes no assumption about the underlying distributions. It is useful when the data does not follow the normal distribution, as with ranked and cross-tabulated data. (Kolmogorov-Smirnov and Wilcoxon Rank-Sum fall into this category.)

Numeric Data Type

Any numeric data with more than 10 unique values.

outliers

Observations in a regression model that do not closely fit the line in any statistical plot. Outliers deviate from the majority of the data.

paired difference tests

Location test used to assess whether population means differ between two sets of measurements. (Specific types of T-tests and the Wilcoxon Signed Rank fall into this category.)

parametric tests

Hypothesis tests that make strong assumptions that the underlying data belongs to a certain type of distribution, which is defined by several parameters. (T-tests and F-tests fall into this category.)

Pearson Correlation Coefficient

The most commonly used method for the computation of the correlation coefficient. Measures the degree and strength of a linear relationship between two variables with bivariate normal distributions on a scale of -1 to 1.
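
The computation can be sketched in Python as the covariance of the two variables divided by the product of their standard deviations. This is an illustration, not RStat code; `pearson_r` and the sample values are hypothetical.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Square roots of summed squared deviations (proportional to the
    # standard deviations); the shared 1/n factors cancel in the ratio.
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical data with a perfect positive and a perfect negative relationship.
r_positive = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_negative = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```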

plot

A graphical representation of data, such as a scatter plot.

Poisson

Regression technique used to model and predict in cases where count data and contingency tables are used.

Predictive Model Markup Language (PMML)

An XML-based markup language developed by the Data Mining Group to provide a way for applications to define models related to predictive analytics and data mining, and to share those models between the PMML-compliant applications.

pruning

The process of reducing the size of the decision tree by converting some branch nodes into leaf nodes, and removing the leaf nodes under the original branch.

quartiles

The three values of a variable that divide its distribution into four groups of equal frequency.
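
As a sketch, Python's standard `statistics.quantiles` function returns the three cut points for n=4. The data below is hypothetical, and other tools (including R) offer several interpolation methods, so results on small data sets can differ slightly.

```python
import statistics

# Hypothetical data: the integers 1 through 11.
data = list(range(1, 12))

# n=4 requests quartiles: the three cut points between the four groups.
cuts = statistics.quantiles(data, n=4)
```

With the default method, the cut points for this data are 3.0, 6.0, and 9.0, where 6.0 is also the median.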

R

A powerful scripting environment designed for technical users. It is widely regarded as a powerful and flexible statistical programming language.

Random Forest

Builds a series of un-pruned decision tree models for a data set. While building each tree, random subsets of the available variables are considered for splitting the data at each node of the tree.

regression

A traditional approach to modeling used to estimate relationships between variables. The common types of regression are linear, logistic, generalized regression, Poisson, and multinomial regression. Advanced Regression types include Gaussian, Binomial, Negative Binomial, Gamma, and Inverse Gaussian.

Report Painter

A powerful WebFOCUS reporting tool that enables inclusion of calculated values in reports, control of report formatting, and other robust features.

root node

The starting point of the Decision Tree.

Rpart

A base algorithm, or model builder, in RStat.

Rpart Rule Export Functionality

Enables the exporting of Rpart rules.

R-Script

Script that can run plots, charts, summaries, model techniques, or even be used to execute scoring functionality using R.

sampling

A common practice to test models on new data. Splits the single data set into two data sets: a training data set used for analysis and a test data set used to evaluate how well a model performs.
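
A minimal Python sketch of such a split follows. The 70 percent training fraction, the seed value, and the name `train_test_split` are assumptions made for the example, not RStat defaults.

```python
import random

def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle rows reproducibly, then cut them into training and test sets."""
    rng = random.Random(seed)      # the seed makes the shuffle repeatable
    shuffled = list(rows)          # copy so the input is left unchanged
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical data set of 10 observations.
rows = list(range(10))
train, test = train_test_split(rows)
```

Using the same seed again yields the same partition, which makes model evaluations repeatable.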

scoring application

Deploys analytic models for repeated use on new data sets by non-technical users to support decision-making. In simple terms, the scoring application labels a prospect as either good or bad.

scoring routine

Stores model information, including when the model was created, the parameters of the model, the model meta data, and the PMML.

seed

A numerical value used to initialize a random sampling algorithm or to establish a starting point in a table of random numbers.

snip

The process of trimming nodes in a tree.

Spearman's Rank Correlation Coefficient

A non-parametric measure of the dependence between two variables, computed as the correlation coefficient of their ranks. That is, it does not assume that the data is normally distributed.
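
The coefficient can be sketched in Python by ranking each variable and then computing the ordinary (Pearson) correlation of the ranks. This is an illustration, not RStat code; the function names and sample data are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1     # average rank for the tied group
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman_rho(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (spread_x * spread_y)

# Hypothetical data: y grows with x, but not linearly (y = x cubed).
rho = spearman_rho([1, 2, 3, 4, 5], [1, 8, 27, 64, 125])
```

Because only the ordering matters, a relationship that is monotonic but not linear still produces a coefficient of 1 (up to floating-point rounding).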

split

The process of splitting a node in a tree into multiple branches or leaves.

standard deviation

A measure that illustrates how far values deviate from the average (mean) or expected value. It is the square root of the variance.

Support Vector Model (SVM)

A modern approach to modeling where the data is mapped to a higher dimensional space, increasing the possibility that vectors separating the classes will be found.

Survival Data Type

Allows the user to run a Survival Model, that is, Cox Proportional Hazards or Parametric.

Survival Model

Allows the user to run Time-to-Event analysis. When using this model in RStat, the user can select Cox Proportional Hazards or Parametric tests to perform their analysis.

T-test

Used to determine if two sets of data are significantly different from each other. It commonly assumes that the underlying populations are normally distributed, in which case the test statistic follows a Student's t distribution under the null hypothesis. There are four common usages:
  1. A one-sample location test, which determines if the mean of a normally distributed population has the value specified in the null hypothesis.
  2. A two-sample location test, which determines, under the null hypothesis, if the means of two normally distributed populations are equal.
  3. A paired difference test of the null hypothesis, which determines if the difference of two responses measured on the same statistical unit has mean zero.
  4. A test of regression slope, which determines if the regression slope is 0.

variable grid

A grid that displays the available variables in the current data set. The variable grid appears once the data set has been loaded.

variance

The variance measures the dispersion between numbers in a set of numbers.
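
As a brief illustration, Python's standard `statistics` module computes both the population variance (dividing by n) and the sample variance (dividing by n - 1). The data values are hypothetical.

```python
import statistics

# Hypothetical data set with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Population variance: average squared deviation from the mean (divide by n).
population_variance = statistics.pvariance(data)

# Sample variance: divides by n - 1 instead of n.
sample_variance = statistics.variance(data)

# The standard deviation is the square root of the variance.
population_std_dev = statistics.pstdev(data)
```

For this data the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the population standard deviation is 2.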

Wilcoxon Rank-Sum

Also known as the Mann-Whitney-Wilcoxon test, Wilcoxon Rank-Sum is analogous to the two-sample T-test, but is performed on the rankings of the combined data sets instead of on the actual measure. If the observation rankings are not different, then the samples are not different. Because it is performed on the rankings, it is more sensitive about the location of the distribution, that is, to the median (not the mean, as in the T-test).

Wilcoxon Signed Rank

A non-parametric hypothesis test used to identify the difference in the means amongst sample populations, which can be two related samples, matched samples, or repeated measurements. Analogous to the T-test, Wilcoxon Signed Rank is also a paired difference test.

WebFOCUS