Calculating Trends and Predicting Values With Multivariate REGRESS
The REGRESS method derives
a linear equation that best fits a set of numeric data points, and
uses this equation to create a new column in the report output. The
equation can be based on one to three independent variables.
This method estimates values by assuming
that the dependent variable (y, the new calculated values)
and the independent variables (x1, x2, x3) are
related by the following linear equation:
y = a1*x1 [+ a2*x2 [+ a3*x3]] + b
When there is one independent variable, the equation represents
a straight line. This produces the same values as FORECAST using
the REGRESS method. When there are two independent variables, the
equation represents a plane, and with three independent variables,
it represents a hyperplane. You should use this technique when you
have reason to believe that the dependent variable can be approximated
by a linear combination of the independent variables.
REGRESS uses a technique called Ordinary Least Squares to calculate
values for the coefficients (a1, a2, a3, and b)
that minimize the sum of the squared differences between the data
and the resulting line, plane, or hyperplane.
x
Syntax: How to Create a Multivariate Linear Regression Column
ON {sortfield} RECAP y[/fmt] = REGRESS(n, x1, [x2, [x3,]] z);
where:
- sortfield
- Is a field in the data source. It cannot be the same field as
any of the parameters to REGRESS. A new linear regression equation
is derived each time the sort field value changes.
- y
- Is the new numeric column calculated by applying the regression
equation. You cannot DEFINE or COMPUTE a field with this name.
- fmt
- Is the display format for y. If it is omitted, the default format
is D12.2.
- n
- Is a whole number from 1 to 3 indicating the number of independent
variables.
- x1, x2, x3
- Are the field names to be used as the independent variables.
All of these variables must be numeric and be independent of each
other.
- z
- Is an existing numeric field that is assumed to be approximately
linearly dependent on the independent variables and is used to derive
the regression equation.
x
Reference: Usage Notes for REGRESS
- The (By) sort field
used with REGRESS must be in a numeric or date format.
- REGRESS cannot operate
on an ACROSS field.
- If any of the independent
variables are also sort fields, they cannot be referenced in the
request prior to the REGRESS sort field.
- FORECAST and REGRESS
cannot be used in the same request, and only one REGRESS is supported
in a request. Non-REGRESS RECAP commands are supported.
- The RECAP command
used with REGRESS can contain only the REGRESS syntax. REGRESS does
not recognize any syntax after the closing semicolon (;).
- Although you pass
parameters to REGRESS using an argument list in parentheses, REGRESS
is not a function. It can coexist with a user-written subroutine
of the same name, as long as the user-written subroutine is not
specified in a RECAP command.
- BY TOTAL is not supported.
- MORE, MATCH, FOR,
and OVER are not supported.
- The process of generating
the REGRESS values creates extra columns that are not printed in
the report output. The number and placement of these additional
columns varies depending on the specific request. Therefore, use
of column notation is not supported in a request that includes REGRESS.
- SUMMARIZE and RECOMPUTE
are not supported for the same sort field used for REGRESS.
- REGRESS is not supported
for the FOCUS GRAPH facility.
- The left side of
a RECAP command used for REGRESS supports the CURR attribute for
creating a currency-denominated field.
- Fields with missing
values cannot be used in the regression.
- Larger amounts of
data produce more useful results.
Example: Creating a Multivariate Linear Regression Column
The following request uses the GGSALES
data source to calculate an estimated DOLLARS column. The BUDUNITS,
UNITS, and BUDDOLLARS fields are the independent variables. The
DOLLARS field provides the actual values to be estimated:
DEFINE FILE GGSALES
YEAR/Y = DATE;
MONTH/M = DATE;
PERIOD/I2 = MONTH;
END
TABLE FILE GGSALES
PRINT BUDUNITS UNITS BUDDOLLARS DOLLARS
BY PERIOD
ON PERIOD
RECAP EST_DOLLARS/F8 = REGRESS(3, BUDUNITS, UNITS, BUDDOLLARS, DOLLARS);
WHERE CATEGORY EQ 'Coffee'
WHERE REGION EQ 'West'
WHERE UNITS GT 1600 AND UNITS LT 1700
END
The output is: