Working With the New Data Model
As of PMF 5.3.2, you no longer need to load each individual measure
as a separate, schedulable entity. Instead, PMF allows you to group
your loadable data by its source, and to schedule loads at the source level.
You can specify how that data should be harvested from individual
physical data sources. You can:
- Harvest field values
directly, aggregating the values you acquire by dimensionality.
You can either aggregate all matching values or filter only particular
sets of values.
- Count occurrences
of particular flags on one or more fields per record in your source.
These counts are automatically kept at the dimensional levels, as
indicated on the loadable source.
- Distinctly count
occurrences of particular flags on one or more fields per record
in your source. These counts are automatically kept at the dimensional
levels, as indicated on the loadable source. Additionally, a distinct
count is applied on the field you indicate.
- Create custom formulas
to describe how data should be harvested, for more complex harvesting
techniques.
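The first three harvesting methods can be illustrated with a short sketch. The records, field names, and dimensional keys below are hypothetical, not PMF structures; the sketch only shows the difference between direct aggregation, flag counting, and distinct counting at a dimensional level:

```python
from collections import defaultdict

# Hypothetical source records; field names are illustrative only.
records = [
    {"plant": "P1", "item": "A", "cost": 10.0, "defect_flag": "Y", "order_id": 101},
    {"plant": "P1", "item": "A", "cost": 12.0, "defect_flag": "N", "order_id": 101},
    {"plant": "P2", "item": "A", "cost": 9.0,  "defect_flag": "Y", "order_id": 102},
]

# 1. Harvest field values directly, aggregating by dimensionality (plant, item).
totals = defaultdict(float)
for r in records:
    totals[(r["plant"], r["item"])] += r["cost"]

# 2. Count occurrences of a particular flag, kept at the same dimensional level.
flag_counts = defaultdict(int)
for r in records:
    if r["defect_flag"] == "Y":
        flag_counts[(r["plant"], r["item"])] += 1

# 3. Distinct count: the same flag count, but each value of the indicated
#    field (here, order_id) is counted only once per intersection.
distinct_orders = defaultdict(set)
for r in records:
    if r["defect_flag"] == "Y":
        distinct_orders[(r["plant"], r["item"])].add(r["order_id"])
distinct_counts = {k: len(v) for k, v in distinct_orders.items()}
```

A custom formula would replace the aggregation step with an arbitrary expression over the harvested fields.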
You can also specify that some datapoints are acquired from end
users. Typically, you do this if you do not already have a physical
source for this data, but want PMF to become the system of record
for the data and to track it as it is captured.
PMF provides user-input features that allow it to capture, update,
validate, and store the data in datapoints for use downstream.
When setting up user-entered sources, specify the level of dimensionality
common to all datapoints in a single source group. At capture time,
end users can then input all datapoints at the same time, which is
faster and more convenient.
Once data has been harvested from physical sources or data has
been entered by the user, PMF can regularly recalculate any derived
datapoints. Recalculation is performed in lineage order. This means
that PMF itself determines which derived datapoints have dependencies,
and waits to perform recalculation on any datapoint until all of
its precursor dependencies have been resolved.
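Recalculating in lineage order is, in effect, a topological sort of the dependency graph. A minimal sketch, using hypothetical datapoint names and the standard-library `graphlib` module (not a PMF API), assuming each derived datapoint maps to the set of its precursors:

```python
from graphlib import TopologicalSorter

# Hypothetical lineage: each derived datapoint maps to its precursor datapoints.
deps = {
    "Net COGS": {"Total Product Cost", "Total Sale Cost"},
    "Profit":   {"Total Sale", "Net COGS"},
    "Margin":   {"Profit", "Total Sale"},
}

order = list(TopologicalSorter(deps).static_order())

# Every precursor appears before the datapoints that depend on it, so each
# datapoint is recalculated only after its dependencies have been resolved.
assert order.index("Net COGS") < order.index("Profit") < order.index("Margin")
```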
After all datapoints are reloaded, and derived datapoints are
fully recalculated, PMF checks for measure dependencies of these
datapoints. It then copies the data, as you designed it, to the
measures.
The new metrics model in PMF allows you to think differently
about measure loads than in previous versions of PMF:
- You can now time a single source load to harvest all values needed
from any particular external RDBMS. You no longer have to think about timing
multiple separate measure loads to events in external systems. PMF
determines the timing for you, harvesting the data at the time you
set, and then pushing the data downstream to derivations and then
to each dependent measure in the system.
- PMF now knows the
lineage of each calculation and the order in which each dependency
must be calculated, so you do not have to determine dependency order
for calculated fields.
- PMF can now mix user-input
data with loadable data. You do not have to create special interim
data tables and user interface schemas to accept user input that
needs to be combined on measures or subcalculations (either as separate
measure components or as dependent components in complex calculations).
- PMF can now synthesize
data at will, using rules that allow you to determine dimensional
depth and method of generation. This means that creating demonstrations of
new metric models does not require spending time generating data
in Excel and figuring out how to load that data into PMF.
- Determining dimensional
dependencies is now more automatic. Since dimensional dependencies
are attributes of sources and datapoints, just connecting datapoints
to a measure automatically determines the dimensions and levels
that will be available to that measure.
Data lineage refers to the entire path of data through the PMF
load architecture.
Step 1
For PMF, data lineage starts at the source. Datapoints that are
linked to a source are left-side endpoints in the lineage. This
means they are harvested directly from:
- Physical database
tables or views, or various flat file formats.
- User entry of data,
in situations where there is no physical representation of the data.
- Generated data, in the exceptional case where no data exists
and data is being created for demonstration purposes.
Example 1: In the case of a manufacturing company, there
might be sources defined to harvest data from systems in Warehousing,
Production Line, Quality Control, Shipping and Logistics, Supply
Chain/Purchasing, Prospect Management, and Wholesale Sales. PMF
harvests data as follows:
- Warehousing. Average cost of storage per item, per warehouse.
- Production Line. Average cost of labor per item, per plant.
- Quality Control. Average cost of testing, per item, per plant.
- Shipping and Logistics. Cost to ship, per item, per plant, per customer.
- Supply Chain/Purchasing. Cost of supply, per item, per plant.
- Prospect Management. Average Cost of Customer Acquisition, per item, per customer.
- Wholesale Sales. Sale Amount, per item, per customer.
Step 2
Lineage then proceeds through each generation of derived datapoints.
There is no limit to the number of phases possible.
Example 2: Continuing from Example 1, you can derive the
following datapoints from those you loaded:
- Total Product Cost
= Average cost of storage per item + Average cost of labor per item
+ Average cost of testing, per item
- Total Sale Cost =
Average Cost of Customer Acquisition + Shipping Cost
- Net COGS = Total
Product Cost + Total Sale Cost
- Profit = Total Sale
– Net COGS
- Margin = Profit/Total
Sale (as a %)
These datapoints need to be calculated in the following order:
- Total Product Cost,
Total Sale Cost
- Net COGS
- Profit
- Margin
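The calculation order above can be traced with illustrative numbers for a single intersection. The values below are invented for the example, not taken from any PMF source:

```python
# Illustrative per-item inputs for one intersection (not real data).
storage, labor, testing = 2.0, 5.0, 1.0
acquisition, shipping = 3.0, 2.0
total_sale = 20.0  # aggregated Sale Amount

# Generation 1: depends only on loaded datapoints.
total_product_cost = storage + labor + testing  # 8.0
total_sale_cost = acquisition + shipping        # 5.0

# Generation 2: depends on generation 1.
net_cogs = total_product_cost + total_sale_cost  # 13.0

# Generation 3: depends on generation 2.
profit = total_sale - net_cogs  # 7.0

# Generation 4: depends on generation 3.
margin = profit / total_sale * 100  # expressed as a %
```

Each generation can be computed only after the one before it, which is exactly the lineage-order constraint PMF enforces.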
Step 3
Lineage then ends at measures.
Load, Recalculate, and Copy (LRC) Loads
To load measures, the PMF load architecture puts data through
three phases:
- Load. All sources indicated for Load are loaded, and data is fed
into the datapoints for each source. A load can run in any of the
following ways:
- Volitional load. You click the Load button on a source. In this
case, only the source you indicated is loaded.
- Scheduled direct load. A source load is scheduled to run at a
specified time.
- Scheduled optional cascaded load. If any of the dimensions that
are linked to the source are reloaded, a source load can be forced
(cascaded), depending on how the dimension Cascade Load settings
are configured.
- Scheduled forced cascaded load. If any of the dimensions that
are linked to the source are reorganized, a source load is always
forced (cascaded), regardless of how the dimension Cascade Load
settings are configured.
Note: During schedule processing, if more than one source has to
be loaded during the scheduled run, all scheduled sources are loaded
before the next step runs. This prevents inefficiently repeating
the recalculation step.
- Recalculate. PMF
looks at all the sources that were reloaded, and analyzes all derived
datapoints with dependencies on the sources that were loaded. PMF
then analyzes the lineage of all derived datapoints to determine
the correct order to recalculate them, respecting their dependencies.
Finally, PMF performs the recalculation step, in phases, with the
number of phases determined by the generations of lineage of the
derived datapoints.
- Copy. PMF
analyzes all measures that use the datapoints that were recalculated
in Step 2. It then copies the data for the linked datapoints into
the measures cube, making the data ready for reporting and dashboard publication.
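The three phases can be sketched as a minimal pipeline over toy data. Every name and structure here is illustrative, not a PMF API; the point is only the ordering of Load, Recalculate, and Copy:

```python
# Toy input: one scheduled source with two rows of (plant, cost).
sources = {"warehousing": [("P1", 10.0), ("P1", 12.0)]}
datapoints, measures = {}, {}

# Phase 1 - Load: every scheduled source is loaded before recalculation
# starts, so the recalculation step runs only once.
for name, rows in sources.items():
    datapoints[f"avg_cost_{name}"] = sum(v for _, v in rows) / len(rows)

# Phase 2 - Recalculate: derived datapoints, processed in lineage order
# (here a single derivation, so the order is trivial).
datapoints["cost_with_overhead"] = datapoints["avg_cost_warehousing"] * 1.1

# Phase 3 - Copy: linked datapoints are copied into the measures cube,
# making the data ready for reporting.
measures["Storage Cost"] = datapoints["cost_with_overhead"]
```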
Reference: Checking the Administrative Log Reports
PMF
logs all activity that involves load, recalculation, or copy actions.
Logged data is stored as peer data in the PMF Data Mart.
PMF captures the following data in its
source load logs:
- Earliest and latest
effective dates for the source data.
- What dimensions were
processed on the load.
- Status of the load:
- Running: The load
is currently running.
- Success: The load
completed successfully.
- Error: The load operation
ended with an error.
- How many source records
were retrieved to process the load.
- How many datapoint
records were inserted, updated, and deleted.
- How many dimensional
linkage mismatches were processed.
- Gaps in dimensional
intersections.
- Datapoint sparsity
count (how much of the potential total Cartesian cross product was
not loaded).
- LRC load type:
- Load a source.
- Load a datapoint.
- Recalculate.
- Copy to measure.
- Owner ID under which
the load ran.
- Start and end time
of the load.
- Messages that were
returned from the server during the load.
- Compare mode that
was flagged for the load:
- NO_DATA: There is
no incoming/parent data ready to process.
- NEW: All ready parent
data is new (outside the range of the target data), and all parent
data is in the lineage for all incoming datapoints for at least
one common time period.
- NEW_W_MISSING: All
ready parent data is new (outside the range of the target data),
but data is missing for at least one required parent data point
in the lineage.
- POSS_UPD: All incoming
ready data falls within the range of current data, but there is
no ready data for at least one incoming data point. There is no
possibility of new rows being generated.
- For Copy to Measure,
there are possible updates.
- For Recalculation,
there are also possible updates if there is available data for all
other parents (this has not yet been coded).
- It is possible that
there is some ready data that is new for some parents, but not for
all parents. Since we are lacking new data for at least one parent,
there can be no new processing.
- POSS_UPD_NEW: Similar
to POSS_UPD, except that there is ready data for all parents in
the lineage, so it is possible that there is some new data.
The
most likely scenario for this would be that you source loaded EAST
for 2012/01 and then separately source loaded WEST for 2012/01.
In order to process the POSS_UPD_NEW mode, you need to split out
those incoming or parent rows that match the dimensionality of
existing/child rows versus those that are new. For those that are
new, you have to check for gaps.
- MIXED: Similar to
NEW, except there is also some ready parent data which overlaps
existing data. For incoming data that falls within the range of
existing data, it is possible that:
- The data is not in the current child data, in which case the data
is treated as new, provided that there is already data in all required
parents, or
- The child data already exists, in which case PMF can do Copy to
Measure (C2M) update processing or recalculation updates (if there
is available data for all other parents).
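The compare modes above can be summarized as a decision over four conditions. This is a simplified sketch of the classification described in this section, not PMF's actual implementation; the function name, parameters, and the reduction of each mode to four boolean flags are all assumptions made for illustration:

```python
def compare_mode(any_ready, all_parents_ready, any_new, any_overlap):
    """Simplified sketch of compare-mode selection (illustrative only).

    any_ready:         some incoming/parent data is ready to process
    all_parents_ready: ready data exists for every parent in the lineage
    any_new:           some ready data falls outside the target data's range
    any_overlap:       some ready data falls within the target data's range
    """
    if not any_ready:
        return "NO_DATA"
    if any_new and any_overlap:
        return "MIXED"
    if any_new:
        return "NEW" if all_parents_ready else "NEW_W_MISSING"
    return "POSS_UPD_NEW" if all_parents_ready else "POSS_UPD"
```

For example, ready data for all parents that falls entirely outside the target range classifies as NEW, while overlapping data with a missing parent classifies as POSS_UPD.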