Data Exploration

The following options are available on the Explore tab in RStat. You may also access the Explore tab through the Tools menu.

The screen that follows shows the summary statistics being executed and displayed.

Explore tab

Summary

Summarizes the data set and provides descriptive statistics on each variable in the data set.

Summary. Includes Min, Quartiles, Mean, and Max for numeric variables, and the top levels for factors.
Describe. Includes a concise description including missing, unique, sum, mean, and the lowest and highest values, frequencies, and percentages.
Basics. Includes various basic measures of numeric data, including missing, min, max, quartiles, mean, sum, skewness, and Kurtosis.
Kurtosis. Summarizes just the kurtosis, which is useful for comparing all numeric variables at once. Kurtosis measures the peakedness of the distribution. The higher the kurtosis, the more of the data variation is due to infrequent extreme observations. In the example below, Income has a kurtosis of 1.9290161, while Age has a kurtosis of -0.3990564. The difference in the peaks is noticeable. The example is taken from the AB_Demo data set, which was used for the credit scoring application.
Note: To generate this chart, click Latticist, and then click Execute. Click Marginals in the Chart interactive panel, and then deselect all variables except Age and Income. From the Groups/Color drop-down box, select the top row (Empty/Null) to remove any grouping. For more details, see the section in this chapter on Latticist.
Skewness. Measures the asymmetry of the distribution, that is, how far the distribution deviates from the symmetric, bell-shaped normal distribution. A negative skew indicates that the left tail is longer, so most of the observations are concentrated on the right side. A positive skew indicates that the right tail is longer, so most of the observations are concentrated on the left side. Both Age and Income from the preceding example have a positive skew. The skewness affects the relative position of the mean, median, and mode. If the distribution is normal, that is, bell shaped, those three values are equal.
Show Missing. Displays a summary of the missing values for each variable.
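
The measures listed above can be sketched outside RStat as follows. This is a minimal illustration in Python, not the code RStat runs. The function name describe and the sample values are made up for the example. The sketch assumes moment-based skewness and excess kurtosis (0 for a normal distribution), which is consistent with the negative kurtosis reported for Age, but the exact estimators RStat uses are not stated in this section.

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for one numeric variable.

    None entries are treated as missing values and excluded.
    """
    present = [v for v in values if v is not None]
    n = len(present)
    mean = statistics.fmean(present)
    q1, median, q3 = statistics.quantiles(present, n=4, method="inclusive")
    # Central moments in population form (dividing by n); an assumption.
    m2 = sum((v - mean) ** 2 for v in present) / n
    m3 = sum((v - mean) ** 3 for v in present) / n
    m4 = sum((v - mean) ** 4 for v in present) / n
    return {
        "missing": len(values) - n,
        "min": min(present),
        "q1": q1,
        "median": median,
        "mean": mean,
        "q3": q3,
        "max": max(present),
        "sum": sum(present),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3,  # excess kurtosis: 0 for a normal distribution
    }

# Made-up values, not taken from AB_Demo. The single large value creates a long right tail.
sample = [1, 2, 3, 4, 10, None]
stats = describe(sample)
for name, value in stats.items():
    print(f"{name:>9}: {value}")
```

In this sample the skewness is positive and the mean lies above the median, which is the effect of a long right tail on the relative position of the two values.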

Distributions

Displays various distribution plots for numeric and categoric variables. The chart options for numeric and categoric variables are displayed separately. The chart types for numeric and categoric variables are also different, as shown in the following image.

Distribution window

For numeric variables, the available charts are:

Box plot. Also known as a box-and-whisker diagram or plot, it displays groups of numerical data through their five-number summaries: the smallest observation, lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation. If you have selected a target variable, it will be displayed on the x-axis, as shown in the following example, where Credit_Approval is a target variable. The x-axis can take only categoric or count data.
Histogram. Displays how frequently observations fall within each range of values. The frequencies are organized in non-overlapping categories (bars). The curve shows the shape of the distribution. The histogram describes the data by displaying five characteristics: the center (location) of the data, the spread (scale) of the data, the skewness of the data, the presence of outliers, and the presence of multiple modes in the data (if any exist).
Cumulative Histogram. The cumulative histogram is a variation of the histogram in which the vertical axis gives not just the counts for each group, but rather gives the counts for that group plus all other groups prior to it. So, for income, it indicates what percent of total population is below or above any income level. If a categoric target variable has been selected, separate cumulative curves will be displayed for each category.
Benford Bars. According to Benford's Law, also referred to as the First Digit Law, if you draw a random number from a list of real-life numbers, such as income, invoice amounts, and so on, the probability of drawing a number starting with 1 is about 30 percent, or almost one-third. Larger leading digits occur with lower and lower frequency, to the point where 9 as a first digit occurs less than one time in twenty. It is an empirical law.

The practical implication is that most people are unaware of this law and therefore cannot fabricate data convincingly. Hence, the law is used in fraud detection. For example, a tax authority can use Benford’s Law to check whether the cash disbursements on company returns follow the law. If the distribution of disbursements does not follow Benford’s Law, an investigation is triggered. Insurance claims are another typical use case.

Generate a bar plot as shown in the following image. Make sure that no dependent variable is selected on the Data tab. If a dependent variable is selected, its values will be shown on the Benford chart to compare against the other variable. The graph indicates that Income follows the Benford distribution. That is, the proportions of incomes starting with 1, 2, 3, and so on, match the proportions that Benford's Law predicts for real-life numbers.
Benford Digit. The digit position for which to plot the distribution. The 1st through 9th positions are allowed.
abs. Plots the distribution of the absolute values.
+ve. Plots only the distribution of positive values.
-ve. Plots only the distribution of negative values.
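
The first-digit frequencies that Benford's Law predicts are given by log10(1 + 1/d) for each leading digit d. The following Python sketch compares those expected shares with observed ones. It is an illustration only and covers just the first digit position. The function names and the sign parameter (which mimics the abs, +ve, and -ve options) are hypothetical, and the powers of two are stand-in data known to follow the law, not values from AB_Demo.

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected share of values whose first significant digit is `digit` (1-9)."""
    return math.log10(1 + 1 / digit)

def leading_digit(x):
    """First significant digit of a non-zero number, ignoring its sign."""
    x = abs(x)
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def observed_shares(values, sign="abs"):
    """Observed first-digit shares; `sign` mimics the abs / +ve / -ve options."""
    if sign == "+ve":
        values = [v for v in values if v > 0]
    elif sign == "-ve":
        values = [v for v in values if v < 0]
    else:
        values = [v for v in values if v != 0]  # zero has no leading digit
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

# Stand-in data: the first 100 powers of two.
data = [2 ** n for n in range(100)]
shares = observed_shares(data)
for d in range(1, 10):
    print(d, f"expected {benford_expected(d):.3f}", f"observed {shares[d]:.3f}")
```

A large gap between the expected and observed columns for real data is the kind of deviation that would prompt a closer look.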

For categoric variables, the following chart types are available:

Bar Plot. Shows the counts of observations within each category. If a target variable is selected, it will plot additional bars for each category in the target. In the following example, for each Occupational Value, there are three bars: the total number of people in this occupation, the number of people with good credit, and the number of people with bad credit.
Dot Plot. A dot plot is an alternative to a bar chart. According to some analysts, dot plots are simpler and easier to interpret, as they clearly show the distribution of data. The frequencies are displayed on the x-axis and labels are displayed on the y-axis. If a categoric target is selected, separate dots will be plotted for each category.
Mosaic Chart. The mosaic chart represents the distribution of observations within categories by area. Each category value is displayed as a 100 percent bar. The width of the bar indicates the frequency of observations. The wider the bar, the higher the count in this category. If a categoric target variable is selected, the height of each bar will be split in proportion to the counts of the observations that fall in each category of the target.
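
The numbers behind a mosaic chart can be sketched as follows. This Python example is an illustration only: the function name mosaic_layout and the occupation and credit values are made up. It computes the width of each category bar (its share of all observations) and how the height of each bar is split across the target categories.

```python
from collections import Counter

def mosaic_layout(categories, targets):
    """Return bar widths and height splits for a mosaic chart.

    width   = share of all observations that fall in the category
    heights = share of that category's observations in each target value
    """
    total = len(categories)
    per_category = Counter(categories)
    joint = Counter(zip(categories, targets))
    target_values = sorted(set(targets))
    layout = {}
    for category, count in per_category.items():
        layout[category] = {
            "width": count / total,
            "heights": {t: joint[(category, t)] / count for t in target_values},
        }
    return layout

# Made-up observations: one occupation and one credit outcome per person.
occupation = ["Clerk", "Clerk", "Clerk", "Manager"]
credit = ["Good", "Bad", "Good", "Good"]

layout = mosaic_layout(occupation, credit)
for category, shape in layout.items():
    print(category, shape)
```

The same joint counts, without the division, are what a bar plot or dot plot with a categoric target displays.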

You have a few controls that apply to all charts within the Distributions section.

Clear. Clears all check boxes of selected rows in the Numeric or Categoric table.
Plots per Page. The number of plots to draw per page.
Annotate. Includes numeric values within the plots.

In the following image, we have chosen to display four plots per page, and we have selected all four plots for the Income variable. When multiple plots per variable are selected, the Target variable information is ignored and only the total information for the variable is plotted.

Box Plot, Histogram, Cumulative Histogram, and Benford's Law graph

Latticist

Latticist is an interactive visualization package. We will not cover all the functions and capabilities of the package here but will give you a general overview. Select the Latticist radio button and click Execute. This loads all selected variables into a matrix plot, as displayed in the following image. The target variable will be used as a grouping variable. Both numeric and categoric variables can be used as grouping variables.

Latticist matrix plot

To remove a grouping variable in the Groups/Color drop-down list, select the top null row, as shown in the following image.

Latticist matrix plot

To remove any variables from the plot, select the plot type button, for example, marginals, and uncheck the variables to be removed.

Choose variables to plot check box

Click OK to display only the selected variables.

Latticist graph

To display a scatter matrix plot, select splom (pairs). Conditioning will display a separate scatter matrix for each value in the variable selected in the Condition drop-down list, as shown in the following image.

Latticist scatter matrix plot

GGobi

Runs the R package GGobi for interactive data visualization. This section will not cover all the functions of the GGobi package but will show you how to initialize and interact with the data.

With the GGobi radio button selected, click Execute. All variables that are selected in the Data tab will be loaded, and a scatter plot will be displayed, as shown in the following image.
All variables are displayed in the following GGobi panel. Users can interactively change the X and Y variables.

Displaying a Matrix Scatter Plot. A matrix scatter plot is a display of all two-by-two scatter plots for all variables in a single panel. It allows you to quickly assess all relationships in a data set.

To generate the matrix scatter plot shown here, select Display from the toolbar menu on the GGobi floating panel and select New Scatterplot Matrix.

New GGobi Scatterplot Matrix

GGobi Parallel Coordinates Chart. A parallel coordinates chart, also known as a profile plot, is a useful way to compare several sets of observations as a combination of different factors. It is useful to detect patterns in the data.

To generate the parallel coordinates chart shown below, select Display from the toolbar menu on the GGobi floating panel and select New Parallel Coordinates Display.

GGobi Parallel Coordinates Chart

GGobi Interactions. From the toolbar menu, you can select Interactions and specify the type, for example, Brush. The brush allows you to select points on the graph and the selection will propagate to all other GGobi graphs. You can see the relationship of the selected items and all other items.

GGobi Interactions graph

For more information on GGobi, see http://www.ggobi.org/.

Correlation

Correlation indicates the strength and the direction of the relationship between two variables. Correlations should be interpreted carefully, as they depend on the context. In simple terms, a correlation coefficient close to 0 indicates no linear relationship, and a coefficient close to 1 or -1 indicates a strong relationship. Positive correlation indicates that as one variable increases, so does the other. Negative correlation indicates that as one variable increases, the other decreases. Correlations are displayed in a table and a graph, which show the pairwise correlations between all numeric variables.

The following image displays the correlations in a table within the RStat output window.

Correlations option button in RStat output window

Chart Displaying the Correlations. The color and the shape of the circles indicate the strength of the correlation. For example, Income and Age have a correlation near 0, which is represented by a full white circle. Credit_Approval and Income have a correlation of -0.38, which is represented by a pink ellipse. As the correlation strengthens, the ellipse becomes narrower and closer to a straight line, which represents a perfect linear relationship.

Chart Displaying Correlations

Explore Missing. Click to display the correlation between missing values. If the file does not contain missing values, those correlations will not be displayed and a dialog with a warning message will be displayed.

Ordered. Select the check box to display the variables ordered by the strength of the correlations.

Method. Select the method for the computation of the correlation coefficient.

Pearson. The Pearson correlation coefficient measures the strength and direction of a linear relationship between two variables with bivariate normal distributions on a scale of -1 to 1. It is the most commonly used method and is therefore set as the default.
Kendall. The Kendall correlation coefficient is a non-parametric measure of the association of two variables; that is, it does not assume that the data is normally distributed. It is used to measure correlation between rankings and cross tabulations. For example, an analyst may create ranks on income and on job qualifications for a set of individuals in a data set. The correlation test will determine whether people in the higher income ranks are likely to rank higher in qualifications.
Spearman. The Spearman correlation coefficient, like the Kendall correlation coefficient, is a non-parametric (distribution-free) measure of the association of two variables. It is like the Pearson correlation coefficient, but is computed on the rankings of the original variables. Compared to the Kendall correlation coefficient, the Spearman correlation coefficient is more sensitive to observations that depart from the perfect ranking order in the data set.
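
The difference between the three methods can be sketched in Python as follows. This is an illustration, not RStat's implementation. The function names and the sample values are made up, and the Kendall function is the simple tau-a form, which assumes there are no tied values.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation: strength of the linear relationship."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def kendall(x, y):
    """Kendall tau-a: (concordant - discordant) / number of pairs; assumes no ties."""
    score = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        score += (product > 0) - (product < 0)
    return score / (len(x) * (len(x) - 1) / 2)

# Made-up data: y grows with x, but not along a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print("pearson ", pearson(x, y))
print("spearman", spearman(x, y))
print("kendall ", kendall(x, y))
```

Because y always increases with x, both rank-based coefficients report a perfect association, while the Pearson coefficient falls short of 1 because the relationship is not linear.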

Hierarchical

The correlations between the numeric variables are calculated, and then a hierarchical cluster is generated based on the correlations. The hierarchical cluster is then visualized through a dendrogram to give an idea of the groupings of the numeric variables. The length of the lines in the dendrogram provides a visual indication of the degree of correlation. Shorter lines indicate more tightly correlated variables. Once you have identified the groups of variables that are correlated, you may want to reduce the number of variables that you are including in your modeling. For instance, you can compare Credit_Approval and Income on the dendrogram with the correlations from the prior section, and see that shorter lines correspond to higher correlations.

You can use any of the three methods, Pearson, Kendall, or Spearman, to calculate the correlation coefficients.

Hierarchical cluster graph
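
The idea of clustering variables by their correlations can be sketched in Python as follows. This is an illustration under stated assumptions, not RStat's procedure: the distance between two variables is taken to be 1 - |r|, clusters are joined by average linkage, and the variable names and correlation values are made up. The distance measure and linkage method that RStat actually uses are not stated in this section.

```python
def cluster_variables(names, corr):
    """Agglomerative clustering of variables from a correlation matrix.

    Assumes distance = 1 - |r| and average linkage.
    Returns the merges in order as (left members, right members, distance).
    """
    clusters = [[i] for i in range(len(names))]
    merges = []

    def distance(a, b):
        # Average distance over all pairs with one variable in each cluster.
        pairs = [(i, j) for i in a for j in b]
        return sum(1 - abs(corr[i][j]) for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        d, x, y = min(
            (distance(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters)
            if x < y
        )
        merges.append(
            ([names[i] for i in clusters[x]], [names[i] for i in clusters[y]], d)
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

# Made-up variables and correlations: A and B are strongly correlated, C is not.
names = ["A", "B", "C"]
corr = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
merges = cluster_variables(names, corr)
for left, right, d in merges:
    print(left, "+", right, f"joined at distance {d:.2f}")
```

The merge distance plays the role of the line length in the dendrogram: the strongly correlated pair is joined first, at a small distance, and the weakly correlated variable is joined last, at a large one.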

Principal Components

Principal components analysis is used for variable reduction, that is, to analyze the numeric variables in the data set and indicate whether a smaller set of new uncorrelated variables can be generated and used for modeling. The potential new variables are called principal components, and usually the first two account for most of the variation in the data. Hence, only the components that account for most of the variation need to be used for modeling. RStat does not actually generate the new PC variables. Instead, the analysis is used to identify which of the input variables contribute most to the components. This information helps decide which variables to include in or exclude from the analysis.

The two following images are displayed to exhibit the relationships between the principal components. The following bar chart presents the relative comparison of how much of the variation in the data is accounted for by each of the principal components. The first will account for the most, then the second, and so on.

Principal components bar chart

The plot chart plots principal component 1 against principal component 2 (that is, the two principal components that account for most of the variation), and also displays the strength of the component variables.

Principal components plot chart
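
How a principal component relates to the input variables can be sketched in Python as follows. This is an illustration, not the routine RStat runs. It finds only the first principal component of a correlation matrix by power iteration, which assumes that the largest eigenvalue is unique and that the start vector is not orthogonal to its eigenvector. The function name and the correlation values are made up.

```python
import math

def first_principal_component(corr, iterations=200):
    """Return (eigenvalue, loadings) of the first principal component.

    Uses power iteration on the correlation matrix `corr`.
    """
    p = len(corr)
    v = [float(i + 1) for i in range(p)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    # Rayleigh quotient v' * corr * v (v has unit length).
    eigenvalue = sum(
        v[i] * sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)
    )
    return eigenvalue, v

# Made-up correlations: the first two variables are correlated, the third is unrelated.
corr = [
    [1.0, 0.8, 0.0],
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
eigenvalue, loadings = first_principal_component(corr)
# The total variance of standardized variables equals the number of variables.
proportion = eigenvalue / len(corr)
print(f"variance explained by component 1: {proportion:.2f}")
print("loadings:", [round(value, 3) for value in loadings])
```

The proportion corresponds to the height of the first bar in the bar chart, and the loadings indicate which input variables contribute most to the component. Here the two correlated variables carry the first component, and the unrelated variable contributes almost nothing to it.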
