The following options are available
on the Data tab in RStat. You may also access the Data tab through
the Tools menu.
Source
Different options may be available depending
on the data type.
-
FEX. Loads
the data from a WebFOCUS procedure (FEX).
-
CSV File. Loads
data from a comma-separated value (CSV) file. When CSV File is selected,
the following options become available:
-
Filename. Indicates
the data file that is currently in use.
-
Separator. Specifies
the character that separates fields in the file. For example, "," or "|".
-
Decimal. Specifies
the decimal character. For example, “,” or “.”. This is the character
used in the file for decimal points, which accommodates international
number formats. For example, 30.00 versus 30,00.
-
Header. Allows
you to indicate whether the first row contains column headings.
-
RData File. Loads data from an
RData file (usually binary). When RData File is selected, the following
options become available:
-
Filename. Enables
you to select a data file.
-
Data Name. Allows
you to choose which data frame to load from the R data file. A data frame
is a collection of individual observations (rows) across many
variables (fields, or columns). It is analogous to a SAS or SPSS data set,
which organizes data for statistical analysis in a cases-by-variables
matrix. One R data file can contain multiple data frames.
-
Library. Enables
you to select a supplied data set from the R library. When Library
is selected, the following option becomes available:
-
Data Name. Opens
a list, which you can use to select a library file.
-
RScript. Runs
an R script, which can produce plots, charts, summaries, and models.
It can also be used to run scoring in R.
-
Filename. Enables
you to select a data file.
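As an illustration of how the CSV File options above (Separator, Decimal, and Header) affect the way a file is read, here is a minimal Python sketch. It is not RStat's implementation. The function name read_delimited, the placeholder column names V1, V2, and the sample data are assumptions made for this example.

```python
import csv
import io

def read_delimited(text, separator=",", decimal=".", header=True):
    """Parse delimited text into (column_names, rows), converting numeric fields."""
    reader = csv.reader(io.StringIO(text), delimiter=separator)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return [], []
    if header:
        # Header on: the first row supplies the column headings.
        names, body = rows[0], rows[1:]
    else:
        # Header off: generate placeholder names (an assumption of this sketch).
        names = [f"V{i + 1}" for i in range(len(rows[0]))]
        body = rows

    def convert(field):
        # Normalize the decimal character, then try a numeric conversion.
        candidate = field.strip()
        if decimal != ".":
            candidate = candidate.replace(decimal, ".")
        try:
            return float(candidate)
        except ValueError:
            return field  # not numeric: keep the original text

    return names, [[convert(f) for f in row] for row in body]

# Hypothetical file content that uses ";" as separator and "," as decimal.
sample = "item;price\nwidget;30,00\ngadget;12,50\n"
names, rows = read_delimited(sample, separator=";", decimal=",", header=True)
```

With these settings, the value 30,00 in the file is read as the number 30.0 rather than as text.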
Partition
Partitioning splits a single data set
into two data sets: a training data set used for analysis and modeling,
and a test data set used to evaluate how well a model performs.
It is common practice to test a model on new data that was not
used to create the model.
You can define the partition size either
as a percentage of the total records or as an exact number of records.
Changing the percentage will automatically change the count and vice
versa.
-
Percentage. The
default is 70%. With the default, 70% of the records in the
selected data set are randomly chosen for the sample.
-
Count. Displays
the number of records that will be included in the sample based
on the selected percentage. You can manually specify the number
of records, which in turn will change the percentage.
-
Seed. Numerical
value used to initialize a random sampling algorithm or to establish
a starting point in a table of random numbers. By using the same
seed number, you will generate exactly the same sample. You can
click Seeds to update the seed with a random
number.
-
View. Opens
the Data Viewer window, enabling you to view data. An alternate
way to view data is to run the WebFOCUS procedure (FEX) that you
use to load the data.
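The relationship between Percentage, Count, and Seed described above can be sketched in Python as follows. This is only an illustration of seeded random partitioning, not the algorithm RStat uses. The function names and the seed value 42 are assumptions made for this example.

```python
import random

def count_from_percentage(total, percentage):
    """Number of records in the sample for a given percentage."""
    return round(total * percentage / 100)

def percentage_from_count(total, count):
    """Percentage that corresponds to an exact record count."""
    return 100 * count / total

def partition(records, percentage=70, seed=42):
    """Randomly split records into (training, test) lists."""
    rng = random.Random(seed)  # the same seed reproduces the same sample
    indices = list(range(len(records)))
    rng.shuffle(indices)
    n_train = count_from_percentage(len(records), percentage)
    train_idx = set(indices[:n_train])
    # Preserve the original record order within each partition.
    train = [r for i, r in enumerate(records) if i in train_idx]
    test = [r for i, r in enumerate(records) if i not in train_idx]
    return train, test

records = list(range(1000))  # hypothetical data set of 1000 records
train, test = partition(records, percentage=70, seed=42)
train_again, test_again = partition(records, percentage=70, seed=42)
```

Because both calls use the same seed, train and train_again contain exactly the same records. Changing the seed produces a different random sample of the same size.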
-
Input and Ignore Buttons
You
can set a group of variables to a single role using the Input and
Ignore buttons by:
- Selecting
the variables to be set from the Variable Grid. To select multiple variables,
hold down the Ctrl key while clicking each variable, or the Shift
key to define a range. You can select all variables within the grid
by clicking one of the variables and then pressing Ctrl+A.
- Clicking
the green Input button to define all selected variables as input,
or the red Ignore button to define all selected variables as ignored.
This defines a portion to be used as training data and opens with
the selected data set loaded. RStat presents nine tabs that reflect
the standard modeling workflow. The Data tab shows the variables
and the roles each will play in building the model.
-
Target Data Type
The data type of the target variable
determines the type of modeling available and the specific algorithms
that will be used within the modeling process. The data type is defined
based on the type of data RStat identifies and the quantity of unique
values found in the actual data. In RStat, data types are defined
as:
-
Auto. This option is selected
by default and uses the rules that were most recently defined.
-
Categoric. Any
character data or any numeric data with 10 or fewer unique values.
-
Numeric. Any numeric data with
more than 10 unique values.
-
Survival. Allows
the user to run a Survival Model (that is, Cox Proportional Hazards
or Parametric). A Time variable and a Status variable must be selected
from the list of variables.
-
The Target option
allows you to override these heuristic settings:
- Auto will
use the previously defined rules.
- Categoric
will handle character and numeric values as categories.
- Numeric will
assign a unique numeric value to each categoric value.
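The Categoric and Numeric rules described above can be expressed as a short Python sketch. It illustrates only the stated heuristic (character data, or numeric data with 10 or fewer unique values, is Categoric). The function name infer_target_type and the treatment of None as a missing value are assumptions made for this example, and RStat's actual detection may differ.

```python
def infer_target_type(values, max_categories=10):
    """Classify a target variable as "Categoric" or "Numeric"."""
    # Assumption of this sketch: None marks a missing value and is skipped.
    present = [v for v in values if v is not None]
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in present
    )
    if not all_numeric:
        return "Categoric"  # any character data is treated as categories
    unique_count = len(set(present))
    # Numeric data counts as Categoric when it has few distinct values.
    return "Categoric" if unique_count <= max_categories else "Numeric"

# Hypothetical target columns.
flag_type = infer_target_type([0, 1, 0, 1, 1])
region_type = infer_target_type(["East", "West", "East"])
amount_type = infer_target_type([10.5, 11.0, 12.25, 13.0, 14.5, 15.0,
                                 16.75, 17.0, 18.5, 19.0, 20.25])
```

Here a 0/1 flag and a character column are both classified as Categoric, while a column with 11 distinct numeric values is classified as Numeric.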
-
Variable Roles. Each
variable can have only one role.
-
Input. These
are the explanatory (independent) variables, whose presence or degree
determines the change in the dependent variable.
-
Target. This
is the dependent variable. In modeling, it is assumed that the dependent
variable is influenced by the input variables. The model shows the
degree to which the dependent variable is influenced by the input
variables.
-
Risk. Special
variable in the data set that measures the amount of risk associated
with each record in the data set.
-
Ident. Identifies
the variable as containing the ID for each record in the data set.
-
Ignore. Ignores
the variable for any analysis or modeling.
-
Weight. Used
to identify some observations as more important than others. There
is no standard method for calculating this type of weight. This
should be NULL or a numeric vector.
-
Comment. Provides
information on the type of data and the values found within the
data, including the count of missing values and unique values.
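To make the one-role-per-variable rule and the Comment information concrete, the following Python sketch keeps roles in a dictionary and computes missing and unique counts for a column. The function names set_role and summarize, the variable names, and the use of None for missing values are assumptions made for this example. This is not how RStat stores roles internally.

```python
ROLES = {"Input", "Target", "Risk", "Ident", "Ignore", "Weight"}

def set_role(roles, variables, role):
    """Assign one role to a group of variables, replacing any earlier role."""
    if role not in ROLES:
        raise ValueError(f"Unknown role: {role}")
    for name in variables:
        # A dictionary holds one value per key, so each variable
        # always has exactly one role.
        roles[name] = role
    return roles

def summarize(values):
    """Count unique and missing values, as the Comment column reports."""
    missing = sum(1 for v in values if v is None)
    unique = len({v for v in values if v is not None})
    return {"unique": unique, "missing": missing}

# Hypothetical variables, all starting as Input.
roles = {"id": "Input", "age": "Input", "income": "Input", "churn": "Input"}
set_role(roles, ["id"], "Ident")
set_role(roles, ["churn"], "Target")
set_role(roles, ["age", "income"], "Ignore")  # like selecting both and clicking Ignore

age_summary = summarize([34, None, 51, 34])
```

Setting a new role on a variable overwrites its previous role, which mirrors the rule that each variable can have only one role.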