Data Exploration

The following options are available on the Explore tab in RStat. You may also access the Explore tab through the Tools menu.

The screen that follows shows the summary statistics being executed and displayed.

Explore tab

Summary

Summarizes the data set and provides descriptive statistics on each variable in the data set.

Distributions

Displays various distribution plots for numeric and categoric variables. The chart options for numeric and categoric variables are displayed separately. The chart types for numeric and categoric variables are also different, as shown in the following image.

Distribution window

For numeric variables, the available charts are:

For categoric variables, the following chart types are available:

You have a few controls that apply to all charts within the Distributions section.

In the following image, we have chosen to display four plots per page, and we have selected all four plots for the Income variable. When multiple plots per variable are selected, the Target variable information is ignored and only the total information for the variable is plotted.

Box Plot, Histogram, Cumlative Histogram and Benfords Law graph

Latticist

Latticist is an interactive visualization package. We will not cover all the functions and capabilities of the package here but will give you a general overview. Select the Latticist radio button and click Execute. This loads all selected variables into a matrix plot, as displayed in the following image. The target variable will be used as a grouping variable. Both numeric and categoric variables can be used as grouping variables.

Latticist matrix plot

To remove a grouping variable in the Groups/Color drop-down list, select the top null row, as shown in the following image.

Latticist matrix plot

To remove any variables from the plot, select the plot type button, for example, marginals, and uncheck the variables to be removed.

Choose variables to plot check box

Click OK to display only the selected variables.

Latticist graph

To display a scatter matrix plot, select splom (pairs). Conditioning will display a separate scatter matrix for each value in the variable selected in the Condition drop-down list, as shown in the following image.

Latticist scatter matrix plot

GGobi

Runs the R package GGobi for interactive data visualization. This section will not cover all the functions of the GGobi package but will show you how to initialize and interact with the data.

Displaying a Matrix Scatter Plot. A matrix scatter plot is a display of all two-by-two scatter plots for all variables in a single panel. It allows you to quickly assess all relationships in a data set.

To generate the matrix scatter plot shown here, select Display from the toolbar menu on the GGobi floating panel and select New Scatterplot Matrix.

New GGobi Scatterplot Matrix

GGobi Parallel Coordinates Chart. A parallel coordinates chart, also known as a profile plot, is a useful way to compare several sets of observations as a combination of different factors. It is useful to detect patterns in the data.

To generate the parallel coordinates chart shown below, select Display from the toolbar menu on the GGobi floating panel and select New Parallel Coordinates Display.

GGobi Parallel Coordinates Chart

GGobi Interactions. From the toolbar menu, you can select Interactions and specify the type, for example, Brush. The brush allows you to select points on the graph and the selection will propagate to all other GGobi graphs. You can see the relationship of the selected items and all other items.

GGobi Interactions graph

For more information on GGobi, see http://www.ggobi.org/.

Correlation

Correlation indicates the strength and the direction of the relationship between two variables. Correlations should be interpreted carefully, as they depend on the context. In simple terms, a correlation coefficient closer to 0 indicates no relationship, and a coefficient closer to 1 indicates a strong relationship. Positive correlation indicates that as one variable increases, so does the other. Negative correlation indicates that as one variable increases, the other decreases. Correlations are displayed in a table and a graph, storing the pairwise correlations between all numeric variables.

The following image displays the correlations in a table within the RStat output window.

Correlations option button in RStat output window

Chart Displaying the Correlations. The color and the shape of the circles indicate the strength of the correlation. For example, Income and Age have a correlation near 0, which is represented by a full white circle. Credit_Approval and Income have a correlation of -0.38, which is represented by a pink ellipse. The dispersion of the data becomes more narrow and closer to a straight line, that is, the perfect linear relationship.

Chart Displaying Correlations

Explore Missing. Click to display the correlation between missing values. If the file does not contain missing values, those correlations will not be displayed and a dialog with a warning message will be displayed.

Ordered. Select the check box to display the variables ordered by the strength of the correlations.

Method. Select the method for the computation of the correlation coefficient.

Hierarchical

A correlation between the numeric data is calculated, and then a hierarchical cluster is generated based on the correlations. The hierarchical cluster is then visualized through a dendrogram to give an idea of the groupings of the numeric variables. The length of the lines in the dendrogram provide a visual indication of the degree of correlation. Shorter lines indicate more tightly correlated variables. Once you have identified the groups of variables that are correlated, you may want to reduce the number of variables that you are including in your modeling. For instance, you can compare the Credit_Approval and Income on the dendrogram with the correlations from the prior section, and see that shorter lines correspond to higher correlations.

You can use any of the three methods, Pearson, Kendall, or Spearman, to calculate the correlation coefficients.

Hierarchical cluster graph

Principal Components

Principal components analysis is used for variable reduction, that is, to analyze the numeric variables in the data set and indicate whether a smaller set of new uncorrelated variables can be generated and used for modeling. The potential new variables are called principal components, and usually the first two account for most of the variation in the target variable. Hence, only the components that account for most of the variation can be used for modeling. RStat does not actually generate the new PC variables. Instead, it is used to analyze which of the input variables contribute most to the components. This information helps decide which variables to include in or exclude from the analysis.

The two following images are displayed to exhibit the relationships between the principal components. The following bar chart presents the relative comparison of how much of the variation in the data is accounted for by each of the principal components. The first will account for the most, then the second, and so on.

Principal components bar chart

The plot chart plots the principal component 1 against the principal component 2 (for example, the two principal components that account for most of the variation), also displaying the strength of the component variables.

Principal components plot chart


WebFOCUS