Explanation of the Regression Model

What Is Regression Analysis?

Regression analysis is the method of using observations (data records) to quantify the relationship between a target variable (a field in the record set), also referred to as a dependent variable, and a set of independent variables, also referred to as covariates. For example, regression analysis can be used to determine whether the dollar value of grocery shopping baskets (the target variable) is different for male and female shoppers (gender being the independent variable). The regression equation estimates a coefficient for each gender that corresponds to the difference in value.

The value of quantifying the relationship between a dependent variable and a set of independent variables is that the contribution of each independent variable to the value of the dependent variable becomes known. Once this is known, you need to know only the values of the independent variables in order to be able to make predictions about the value of the dependent variable. For example, when the coefficients for male and female shoppers are known, you can make precise revenue estimates for different distributions of shoppers. Specifically, you can predict the revenues for 80/20 females/males versus 60/40 females/males.
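
The revenue comparison described above can be sketched in a few lines of Python. The basket values below are hypothetical figures chosen only for illustration; they are not estimates from any data set in this document.

```python
# Hypothetical predicted basket values (in dollars) for each gender.
# These numbers are assumptions made up for this sketch.
PREDICTED_BASKET_VALUE = {"female": 50.0, "male": 40.0}


def expected_revenue(n_female, n_male):
    """Predicted total revenue for a given mix of shoppers."""
    return (n_female * PREDICTED_BASKET_VALUE["female"]
            + n_male * PREDICTED_BASKET_VALUE["male"])


# Compare an 80/20 and a 60/40 female/male mix of 1,000 shoppers.
revenue_80_20 = expected_revenue(800, 200)
revenue_60_40 = expected_revenue(600, 400)
print(revenue_80_20, revenue_60_40)
```

Only the shopper counts change between the two scenarios. The predicted basket values stay fixed because they come from the fitted regression.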

The goal of regression analysis is to generate the line that best fits the observations (the recorded data). The rationale for this is that the observations vary and thus will never fit precisely on a line. However, the best-fitting line leaves the least amount of unexplained variation, that is, the least dispersion of observed points around the line. Stated differently, a relationship that explains 90% of the variation in the observations is better than one that explains 75%. Accordingly, a relationship with a better fit has better predictive power.
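
The share of variation that a line explains is commonly measured by R-squared, which is one minus the ratio of unexplained variation to total variation. The following sketch compares two candidate lines on a small, made-up data set. The data points and both lines are assumptions chosen for illustration.

```python
def r_squared(xs, ys, intercept, slope):
    """Share of the variation in ys explained by the line intercept + slope * x."""
    mean_y = sum(ys) / len(ys)
    # Total variation: dispersion of the observations around their mean.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    # Unexplained variation: dispersion of the observations around the line.
    ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_residual / ss_total


# Made-up observations for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

good_fit = r_squared(xs, ys, intercept=0.09, slope=1.97)  # least-squares line for this data
poor_fit = r_squared(xs, ys, intercept=3.0, slope=1.0)    # an arbitrary alternative line
print(round(good_fit, 4), round(poor_fit, 4))
```

The line with the smaller dispersion of points around it has the higher R-squared.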

How Does Linear Regression Work?

To illustrate how linear regression works, examine the relationship between the prices of vintage wines and the number of years since vintage. Each year, many vintage wine buyers gather in France and buy wines that will mature in 10 years. There are many stories and speculations on how the buyers determine the future prices of wine. Is the wine going to be good 10 years from now, and how much would it be worth? Imagine an application that could assist buyers in making those decisions by forecasting the expected future value of the wines. This is exactly what economists have done. They have collected data and created a regression model that estimates this future price. The current explanation of the regression is based on this model.

The provided sample data set contains 60 observations of prices for vintage wines that were sold at a wine auction. There are two columns, one for the price of each wine and another for the number of years since vintage. The data set, referred to as the training data set, is used to estimate the parameters in the equation.

The estimated equation can be used to predict wine prices when the number of years since vintage is known. In other words, it can be applied to another data set, referred to as the test data set, which contains only vintage years. Prices are derived by applying the equation to the vintage years in the test data set.
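
As a minimal sketch of this workflow, the code below estimates an intercept and slope from a training data set using the standard least-squares formulas, then applies the equation to a test data set that contains only years since vintage. The numbers are made up for illustration and are not the 60 observations in the sample data set.

```python
def fit_simple_regression(xs, ys):
    """Return (intercept, slope) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical training data: years since vintage and observed prices.
train_years = [10, 12, 14, 16, 18]
train_prices = [20.0, 125.0, 240.0, 345.0, 460.0]

intercept, slope = fit_simple_regression(train_years, train_prices)

# Hypothetical test data: only years since vintage are known.
test_years = [11, 15, 20]
predicted_prices = [intercept + slope * years for years in test_years]
print(intercept, slope, predicted_prices)
```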

The following formula describes the linear relationship between price and years:

y = β0 + β1 x + error

The following image visually examines the nature of the relationship by plotting Y and X on a scatter plot with the trend line.

Linear relationship graph

The following regression equation shows estimated parameters for the trend line:

y = -533 + 55 * x

To predict the price for a 15-year-old vintage wine, substitute x=15:

y = -533 + 55 * 15
y = -533 + 825
y = 292
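
The same substitution can be wrapped in a small function so that it can be repeated for several wines. This is only a sketch of the arithmetic above; the function name and the list of ages are illustrative.

```python
# Estimated parameters of the trend line y = -533 + 55 * x.
INTERCEPT = -533
SLOPE = 55


def predict_price(years_since_vintage):
    """Predicted price from the estimated simple regression equation."""
    return INTERCEPT + SLOPE * years_since_vintage


# Apply the equation to a few illustrative ages.
ages = [15, 20, 25]
predictions = [predict_price(age) for age in ages]
print(predictions)
```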

How Does Multiple Linear Regression Work?

A simple linear equation rarely explains much of the variation in the data and, for that reason, can be a poor predictor. In the case of vintage wine, time since vintage provides very little explanation for the prices of wines. The regression equation described in the simple linear regression section will poorly predict the future prices of vintage wines. Multiple linear regression enables you to add variables to improve the predictive power of the regression equation. On a very intuitive level, the producer of the wine matters. Thus, including the chateau as another independent variable is likely to increase the predictive power of the equation.

The following equation shows a multiple linear regression equation. The equation is conceptually similar to the simple regression equation, except for parameters β2 through βn, which represent the additional independent variables.

y = β0 + β1x1 + β2x2 + ... + βnxn + error

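The independent variables in the vintage wine data set, as named in the regression output, are:

    • WRAIN (numeric, WR in the equation)
    • DEGREES_IN_C (numeric, T in the equation)
    • HRAIN (numeric, HR in the equation)
    • TIME_SINCE_VINTAGE (numeric, TSV in the equation)
    • CHATEAU (categorical)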

The regression equation estimates a single parameter for the numeric variables and separate parameters for each unique value in the categorical variable. For example, there are six chateaus in the data set, and five coefficients. One chateau is used as a base against which all other chateaus are compared, and thus, no coefficient will be estimated for it. In other words, if prices have to be predicted for the reference chateau, 0 is entered in the equation.
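
The reference-level coding described above can be sketched as follows. Each chateau becomes five 0/1 indicator values, one for each non-reference chateau, and the reference chateau is represented by all zeros. The chateau names come from the regression output in this section; the function name is an assumption made for this illustration.

```python
CHATEAUS = [
    "Cheval Blanc",
    "Cos d'Estournel",
    "Lafite",
    "Latour",
    "Montrose",
    "Pichon Lalande",
]
REFERENCE_CHATEAU = "Cheval Blanc"

# One indicator column per non-reference chateau.
INDICATOR_LEVELS = [name for name in CHATEAUS if name != REFERENCE_CHATEAU]


def dummy_code(chateau):
    """Return the 0/1 indicator values for a chateau (all zeros for the reference)."""
    if chateau not in CHATEAUS:
        raise ValueError(f"Unknown chateau: {chateau}")
    return [1 if chateau == level else 0 for level in INDICATOR_LEVELS]


print(dummy_code("Latour"))
print(dummy_code("Cheval Blanc"))
```

Because every indicator is 0 for the reference chateau, its prediction uses only the intercept and the numeric variables.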

The equation with the estimated parameters is:

Price = - 4744 + 1.03 * WR + 295.84 * T - 2.09 * HR + 8.06 * TSV + Chateau

The coefficients for each chateau are given in the output, as shown in the following image.

                         Estimate Std. Error t value   Pr(>|t|)    
(Intercept)            -4743.7592  1027.7697  -4.616 0.00002762 ***
WRAIN                      1.0287     0.2513   4.093   0.000155 ***
DEGREES_IN_C             295.8356    58.8231   5.029 0.00000672 ***
HRAIN                     -2.0854     0.3686  -5.658 0.00000074 ***
TIME_SINCE_VINTAGE         8.0633    10.9301   0.738   0.464133    
CHATEAUCos d'Estournel  -280.2000    99.3542  -2.820   0.006863 ** 
CHATEAULafite              7.8000    99.3542   0.079   0.937738    
CHATEAULatour            177.9000    99.3542   1.791   0.079418 .  
CHATEAUMontrose         -320.8000    99.3542  -3.229   0.002198 ** 
CHATEAUPichon Lalande   -228.4000    99.3542  -2.299   0.025730 *  
---

Tip: The image below shows how the estimated parameter values are used as the coefficient values in the regression equation.

Estimated parameter values chart

Note: Cheval Blanc is the reference chateau with a coefficient of 0. Therefore, it is not listed in the output. However, when scoring data, the predicted prices for Cheval Blanc will be in the scoring output. For more information about the output, see Output From Linear Regression.

Vintage wine prices can be predicted by substituting the values for the independent variables, as in the simple regression equation used earlier.

A scoring application automatically generates the predicted prices, either for a single record or for a batch of records.
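
The following sketch shows what such scoring amounts to, using the estimated coefficients from the output above. The input records are hypothetical values invented for illustration, and the function is not the scoring application itself.

```python
# Estimated coefficients, copied from the regression output.
INTERCEPT = -4743.7592
NUMERIC_COEFFICIENTS = {
    "WRAIN": 1.0287,
    "DEGREES_IN_C": 295.8356,
    "HRAIN": -2.0854,
    "TIME_SINCE_VINTAGE": 8.0633,
}
CHATEAU_COEFFICIENTS = {
    "Cheval Blanc": 0.0,  # reference chateau
    "Cos d'Estournel": -280.2,
    "Lafite": 7.8,
    "Latour": 177.9,
    "Montrose": -320.8,
    "Pichon Lalande": -228.4,
}


def predict_price(record):
    """Predicted price for one record (a dict of independent variable values)."""
    price = INTERCEPT + CHATEAU_COEFFICIENTS[record["CHATEAU"]]
    for name, coefficient in NUMERIC_COEFFICIENTS.items():
        price += coefficient * record[name]
    return price


# Hypothetical records; the values are assumptions, not data from the sample set.
records = [
    {"WRAIN": 600, "DEGREES_IN_C": 17.0, "HRAIN": 150,
     "TIME_SINCE_VINTAGE": 20, "CHATEAU": "Latour"},
    {"WRAIN": 600, "DEGREES_IN_C": 17.0, "HRAIN": 150,
     "TIME_SINCE_VINTAGE": 20, "CHATEAU": "Cheval Blanc"},
]

# Batch scoring is the single-record prediction applied to every record.
predicted_prices = [predict_price(record) for record in records]
print(predicted_prices)
```

The two records differ only in the chateau, so their predictions differ by exactly the Latour coefficient.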

Practical Applications of Regression Analysis

Regression analysis is used for:


Procedure: How to Create a Multiple Linear Regression (lm)

Note: The following example creates a multiple linear regression, using the wine_train data source, and targets PRICE_1991.

  1. Define the Model Data.
    • Load the Wine data into RStat. For more information on loading data into RStat, see Getting Started With RStat.
    • Select Ident for the ID variable.
    • Select Ignore for the Vintage_Year variable.
    • Select Target for the Price_1991 variable.
    • Keep all the other variables as Input.
    • Click Execute to set up the Model Data.

    The status bar confirms your data settings, as shown in the following image.

    Status bar output window

  2. Select the Model tab to build the model.
    • Select Regression as the Type of model.
    • Select Linear as the Model Builder. This is the default option when Regression is selected.

      Note: Other available regression models are Generalized (Gaussian Regression model) and Poisson Regression models. Both models are built using Regression Generalized (glm).

    • Click Execute to run the model.

    The model data output appears in the bottom window of the Model tab, as shown in the following image.

    Model tab

    For details about the regression output, see Output From Linear Regression.


Reference: Output From Linear Regression

This section describes the linear regression output. Note that output may vary slightly due to sampling.

Signif. codes:

0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The significance codes indicate how certain we can be that the coefficient has an impact on the dependent variable. For example, a significance level of 0.001 indicates that there is less than a 0.1% chance that the coefficient might be equal to 0 and thus be insignificant. Stated differently, we can be 99.9% sure that it is significant. The significance codes (shown by asterisks) are intended for quickly ranking the significance of each variable.
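
The legend above can be read as a lookup from a p-value to a code. The sketch below applies the cutoffs from the legend to a few p-values taken from the coefficient table. Treating a p-value exactly on a cutoff as belonging to the more significant band is an assumption of this sketch.

```python
def significance_code(p_value):
    """Map a p-value to the significance code used in the legend."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if p_value <= 0.001:
        return "***"
    if p_value <= 0.01:
        return "**"
    if p_value <= 0.05:
        return "*"
    if p_value <= 0.1:
        return "."
    return " "


# p-values copied from the coefficient table.
p_values = {
    "(Intercept)": 0.00002762,
    "CHATEAUCos d'Estournel": 0.006863,
    "CHATEAUPichon Lalande": 0.025730,
    "CHATEAULatour": 0.079418,
    "TIME_SINCE_VINTAGE": 0.464133,
}
codes = {name: significance_code(p) for name, p in p_values.items()}
print(codes)
```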

Residual standard error: 237.4 on 32 degrees of freedom.

Multiple R-squared: 0.7958

Adjusted R-squared: 0.7384

F-statistic: 13.86 on 9 and 32 DF, p-value: 0.000000009629.
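
The adjusted R-squared and the F-statistic can both be derived from the multiple R-squared and the degrees of freedom by using the standard formulas. The sketch below assumes 9 estimated coefficients (excluding the intercept) and 32 residual degrees of freedom, as reported for the F-statistic, which implies 42 observations in the sample used to fit the model.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """R-squared penalized for the number of predictors in the model."""
    residual_df = n_observations - n_predictors - 1
    return 1 - (1 - r_squared) * (n_observations - 1) / residual_df


def f_statistic(r_squared, n_observations, n_predictors):
    """Overall F-statistic: explained variation per predictor divided by
    unexplained variation per residual degree of freedom."""
    residual_df = n_observations - n_predictors - 1
    return (r_squared / n_predictors) / ((1 - r_squared) / residual_df)


R_SQUARED = 0.7958      # Multiple R-squared from the output
N_PREDICTORS = 9        # first degrees-of-freedom value of the F-statistic
N_OBSERVATIONS = 42     # assumption: 32 residual DF + 9 predictors + 1 intercept

adjusted = adjusted_r_squared(R_SQUARED, N_OBSERVATIONS, N_PREDICTORS)
f_value = f_statistic(R_SQUARED, N_OBSERVATIONS, N_PREDICTORS)
print(round(adjusted, 4), round(f_value, 2))
```

Under these assumptions, the computed values agree with the reported adjusted R-squared and F-statistic after rounding.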


Reference: Analysis of Variance (ANOVA) From Linear Regression

The following image shows the Model tab with the ANOVA table for the regression output. The ANOVA table provides statistics on each variable used in the regression equation.

Model tab with ANOVA table


WebFOCUS