Data flow metrics
When you encounter a decision strategy performance issue, you can usually see it on the data flow run page. The strategy component typically accounts for a large share of the data flow execution time. You can interpret its cost through the following two metrics:
- Percentage of the overall data flow execution time
  - In a typical decision management scenario, the strategy is the most time-consuming (CPU-intensive) component, and this metric can reach 90% to 95% of the total data flow execution time. Such a high percentage shows the strategy execution as the performance bottleneck, which is expected, but it does not by itself signify a performance problem. However, if the percentage is relatively low, it might indicate that other parts of the system (for example, the database or Decision Data Store) can be tuned better.
- Average time taken by the Strategy shape for every record
  - This metric records the time spent on the Strategy shape in the data flow per record. If the strategy execution time exceeds the threshold, the PEGA0063 alert (decision strategy execution time above threshold) is triggered. For more information, see Pega alerts for Cassandra. This metric consists mainly of three parts, which are visible after you enable detailed metrics for a data flow run:
    - Preprocessing, such as loading interaction history or interaction history summary caches for a batch run.
    - Strategy execution, which should account for most of this metric in a typical healthy scenario.
    - Postprocessing, which synchronously invokes a data flow to save strategy results and monitoring information to the pxDecisionResults data set or another built-in destination, depending on your configuration.
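As a rough illustration of how the two metrics relate, the following Python sketch computes the strategy's share of the data flow time, the average strategy time per record, and the share of that per-record time spent in strategy execution. The shape names, timings, and the thresholds passed to the hypothetical `exceeds_threshold` helper are illustrative assumptions; in practice, you read these values from the data flow run page, and the PEGA0063 threshold is configured in Pega, not in code.

```python
from dataclasses import dataclass


@dataclass
class ShapeTiming:
    """Hypothetical per-shape timing, as read from a data flow run page."""
    name: str
    total_ms: float  # time spent in this shape over the whole run


def time_percentage(shapes: list[ShapeTiming], shape_name: str) -> float:
    """Share of the overall data flow time spent in one shape, in percent."""
    total = sum(s.total_ms for s in shapes)
    spent = sum(s.total_ms for s in shapes if s.name == shape_name)
    return 100.0 * spent / total


def average_ms_per_record(total_ms: float, records: int) -> float:
    """Average time that a shape spends on each record."""
    return total_ms / records


def exceeds_threshold(avg_ms: float, threshold_ms: float) -> bool:
    """Illustrates the idea behind PEGA0063; the real threshold lives in Pega."""
    return avg_ms > threshold_ms


# Illustrative numbers only, not measurements.
shapes = [
    ShapeTiming("Source", 300.0),
    ShapeTiming("Strategy", 9200.0),
    ShapeTiming("Destination", 500.0),
]
records = 1000

pct = time_percentage(shapes, "Strategy")                  # 92.0
avg = average_ms_per_record(shapes[1].total_ms, records)   # 9.2

# With detailed metrics enabled, the per-record strategy time splits into
# preprocessing, strategy execution, and postprocessing.
breakdown_ms = {"preprocessing": 0.8, "execution": 7.9, "postprocessing": 0.5}
execution_share = 100.0 * breakdown_ms["execution"] / sum(breakdown_ms.values())
```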
Troubleshooting decision strategy performance
To troubleshoot decision strategy performance:
- Look at the data flow metrics and alerts in Pega Predictive Diagnostic Cloud (PDC) for guidance on the performance challenge. If performance degrades suddenly in a staging or production environment, identify which metric is affected.
- Analyze the alerts for typical strategy-related issues:
  - PEGA0063 (strategy execution time)
  - PEGA0064 (maximum number of strategy results processed per strategy component)
  - PEGA0075 (Decision Data Store interaction time)
  - PEGA0058 and PEGA0059 (interaction history reading and writing time)
- Enable detailed data flow metrics by setting the pyUseDetailedMetrics property on the RunOptions page. This property belongs to the Data-Decision-DDF-RunOptions class. When it is set to true, detailed metrics are calculated for the execution of each shape and made available on the progress page.
Perform test runs and simulations
You can find performance issues in strategies by performing strategy test runs. Statistics such as the processing speed of records or decisions, the time spent in each component, throughput, and the number of processed decisions or records can help you assess the health of a strategy. For example, the Time spent statistics show how much time each strategy component spends processing data, so you can judge whether that time is justified or whether the component uses complex processing logic that you can optimize.
To test a strategy, you provide input data to the strategy components and then run a single case or a batch of cases. Data transforms, data sets, and data flows support generating the data objects that contain input data for test runs. The data processing power of data sets and data flows makes them best suited for validating your design against sample data from one or more data sources.
To understand the impact of strategy changes on the overall strategy execution time, run a simulation or a batch run to check the differences between versions of strategies.
To run a simulation, use the Revision Management performance check tool. This simulation runs on the same audience and top-level strategy, so you can collect the average processing speed for each record in each revision, compare the results with the previous revision, and report on the changes in performance. For more information on Revision Manager, see Simulating your revision changes. You can replicate this approach in a batch run, or manually by using a simulation test from the landing page in the Pega Customer Decision Hub portal, and monitor the performance of your strategy through the data flow metrics. Because each run uses the same strategy and the same audience, you can track changes in strategy performance by comparing metrics from run to run.
Keep in mind the following limitations:
- It shows the result for the current strategy only, so you need to run a performance test on each substrategy to collect its metrics.
- It provides detailed metrics only for legacy components when running in optimized mode.
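The run-to-run comparison can be sketched in a few lines of Python, assuming you have recorded the number of processed records and the total strategy time for each revision on the same audience and top-level strategy. The run data and the 5% regression tolerance below are hypothetical values chosen for illustration.

```python
# Hypothetical results from two simulation runs on the same audience and
# top-level strategy; the numbers are illustrative only.
runs = {
    "revision_1": {"records": 50_000, "strategy_ms": 400_000.0},
    "revision_2": {"records": 50_000, "strategy_ms": 460_000.0},
}

REGRESSION_TOLERANCE_PCT = 5.0  # hypothetical team-defined tolerance


def avg_ms_per_record(run: dict) -> float:
    """Average strategy time per record for one run."""
    return run["strategy_ms"] / run["records"]


def percent_change(before: float, after: float) -> float:
    """Relative change from one revision to the next, in percent."""
    return 100.0 * (after - before) / before


before = avg_ms_per_record(runs["revision_1"])  # 8.0
after = avg_ms_per_record(runs["revision_2"])   # 9.2
change = percent_change(before, after)          # roughly +15%
regressed = change > REGRESSION_TOLERANCE_PCT
```

Comparing per-record averages rather than total run times keeps the comparison meaningful even if a later run processes a slightly different number of records.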
Use the strategy execution profile
The strategy execution profile uses the traditional Pega Platform test run page, which accepts a data transform that initializes the customer page for strategy execution. In Pega Customer Decision Hub, you can reuse the data transform rules for persona testing here to generate a report that you can download for offline analysis.
To use the strategy execution profile:
- Open the strategy rule under test (normally, this would be the top next best action strategy).
- Click .
- Initialize the primary page context with a data transform or copy it from another page, whichever is appropriate based on your setup.
- Click Run.
- Download the strategy execution profile report.
The report includes the total strategy execution time and a breakdown by strategy and component. When the strategy is executed with the new SSA engine, only unoptimized components are measured directly; each one is listed under its own component name with pages in, pages out, and execution time. The row with the component name All represents the total time spent within that particular strategy execution, including substrategy executions, when applicable. The Optimized row includes all components included in All, minus the unoptimized components. For a substrategy component, the execution time is cumulative.
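The relationship between the All, Optimized, and unoptimized component rows comes down to simple arithmetic. In this Python sketch, the row layout and the component names `ComponentA` and `ComponentB` are hypothetical and do not reflect the actual report format:

```python
# Hypothetical rows from a downloaded strategy execution profile report.
# Only unoptimized components appear under their own names.
report = [
    {"component": "All", "time_ms": 120.0},
    {"component": "ComponentA", "time_ms": 45.0},
    {"component": "ComponentB", "time_ms": 25.0},
]

SUMMARY_ROWS = {"All", "Optimized"}

all_ms = next(r["time_ms"] for r in report if r["component"] == "All")
unoptimized_ms = sum(
    r["time_ms"] for r in report if r["component"] not in SUMMARY_ROWS
)

# Optimized covers everything in All except the unoptimized components.
optimized_ms = all_ms - unoptimized_ms  # 50.0

# Rank unoptimized components by their share of the total, largest first.
shares = sorted(
    ((r["component"], 100.0 * r["time_ms"] / all_ms)
     for r in report if r["component"] not in SUMMARY_ROWS),
    key=lambda item: item[1],
    reverse=True,
)
```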
Apply filters early
Apply filters in a strategy as early as possible to eliminate data from the strategy flow that is not required to issue a decision. This reduces the amount of memory that is needed to process the strategy and decreases the processing time.
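The effect can be illustrated outside Pega with a small Python sketch, in which `CountingScorer` is a hypothetical stand-in for an expensive component. Placing the filter first produces the same results while invoking the expensive step only for the records that survive the filter:

```python
def run_pipeline(records, steps):
    """Apply each step in order; a step maps a list of records to a new list."""
    for step in steps:
        records = step(records)
    return records


class CountingScorer:
    """Hypothetical stand-in for an expensive component; counts invocations."""

    def __init__(self):
        self.calls = 0

    def __call__(self, records):
        scored = []
        for record in records:
            self.calls += 1
            scored.append({**record, "score": record["value"] * 2})
        return scored


def eligibility_filter(records):
    """Keep only the records that are needed to issue a decision."""
    return [r for r in records if r["eligible"]]


records = [{"id": i, "value": i, "eligible": i % 4 == 0} for i in range(100)]

late_scorer = CountingScorer()
late = run_pipeline(records, [late_scorer, eligibility_filter])

early_scorer = CountingScorer()
early = run_pipeline(records, [eligibility_filter, early_scorer])

# Same decisions, but the expensive step runs 25 times instead of 100.
print(late == early, late_scorer.calls, early_scorer.calls)  # True 100 25
```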
Avoid computing inactive paths in data joins
In complex decision strategies that contain multiple layers of substrategies, you can encounter Data Join components that are always triggered, regardless of their validity in the decision path. This type of design can needlessly extend the strategy processing time and is not recommended.
To illustrate this problem, see the following example strategy:
In the preceding strategy, the condition that is configured in the Data Join shape states that the data is matched only if the value of the SubjectID property of the input records is the same. However, even if the processing of the Filter shape results in no output records, the substrategy is still processed, which results in the unnecessary addition of 1.56 seconds to the total processing time.
To process the substrategy only when required, use the Switch and Group By components. The Group By component counts the customer records that pass through the Filter component. If at least one customer record passes through the Filter component, the substrategy is processed; otherwise, it is skipped.
For more information, see Strategy rule form - Completing the Strategy tab.
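The gating logic that the Group By and Switch components implement can be sketched in Python as follows. The functions and the `segment` and `offer` fields are hypothetical; only the idea of counting the filtered records and skipping the substrategy when the count is zero comes from the example above.

```python
def filter_component(customers):
    """Hypothetical Filter shape: keep premium customers only."""
    return [c for c in customers if c["segment"] == "premium"]


class Substrategy:
    """Hypothetical expensive substrategy; counts how often it is processed."""

    def __init__(self):
        self.runs = 0

    def __call__(self, customers):
        self.runs += 1
        return [{"SubjectID": c["SubjectID"], "offer": "Offer-A"} for c in customers]


def gated_join(customers, substrategy):
    passed = filter_component(customers)
    count = len(passed)   # Group By: count records that passed the filter
    if count == 0:        # Switch: skip the substrategy branch entirely
        return []
    results = substrategy(passed)
    by_id = {r["SubjectID"]: r for r in results}
    # Data Join: match on SubjectID, as in the example strategy.
    return [{**c, **by_id[c["SubjectID"]]} for c in passed if c["SubjectID"] in by_id]


sub = Substrategy()
none_pass = [{"SubjectID": "C1", "segment": "basic"}]
some_pass = [{"SubjectID": "C1", "segment": "basic"},
             {"SubjectID": "C2", "segment": "premium"}]

print(gated_join(none_pass, sub), sub.runs)       # [] 0
print(len(gated_join(some_pass, sub)), sub.runs)  # 1 1
```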
Cache time-consuming expressions
You can cache the global parts of an expression that do not need to be computed for each decision. For example, the following Set Property component takes 525.76 milliseconds to compute, which is 12 percent of the total strategy processing time. To a strategy designer, this amount of time might indicate that the component requires optimization.
This Set Property component, named Set Valid, sets properties according to the following expression:
.D_Date_Start <= DateTime.D_Today && .D_Date_End >= DateTime.D_Today && .D_Time_Start <= DateTime.D_TimeOfDay && .D_Time_End >= DateTime.D_TimeOfDay
Based on the preceding expression, the DateTime.D_Today and DateTime.D_TimeOfDay properties are retrieved from the clipboard page for each decision. This time-consuming process can be optimized by caching the two properties through an additional Set Property component.
The new DataCache Set Property component sets temporary D_Today and D_TimeOfDay properties. The Set Valid component then uses the following expression, which reduces its processing time from 12 percent of the total strategy processing time to 1 percent:
.D_Date_Start <= DataCache.D_Today && .D_Date_End >= DataCache.D_Today && .D_Time_Start <= DataCache.D_TimeOfDay && .D_Time_End >= DataCache.D_TimeOfDay
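The same caching idea can be expressed in Python. Here a hypothetical `Clock` class stands in for the DateTime page and counts every read. The uncached version reads the current date and time of day inside the per-decision expression, while the cached version reads them once and reuses the values, much like the DataCache component:

```python
from datetime import date, datetime, time


class Clock:
    """Hypothetical stand-in for the DateTime page; counts every read."""

    def __init__(self, now: datetime):
        self.now = now
        self.reads = 0

    def today(self) -> date:
        self.reads += 1
        return self.now.date()

    def time_of_day(self) -> time:
        self.reads += 1
        return self.now.time()


def valid_uncached(offers, clock):
    # Reads the clock inside the per-decision expression, like Set Valid.
    return [
        o for o in offers
        if o["date_start"] <= clock.today() and o["date_end"] >= clock.today()
        and o["time_start"] <= clock.time_of_day() and o["time_end"] >= clock.time_of_day()
    ]


def valid_cached(offers, clock):
    # Reads the clock once, like the DataCache component, then reuses the values.
    today, time_of_day = clock.today(), clock.time_of_day()
    return [
        o for o in offers
        if o["date_start"] <= today and o["date_end"] >= today
        and o["time_start"] <= time_of_day and o["time_end"] >= time_of_day
    ]


offers = [
    {"date_start": date(2024, 1, 1), "date_end": date(2024, 12, 31),
     "time_start": time(0, 0), "time_end": time(23, 59)}
    for _ in range(10)
]
now = datetime(2024, 6, 15, 12, 0)

uncached_clock, cached_clock = Clock(now), Clock(now)
assert valid_uncached(offers, uncached_clock) == valid_cached(offers, cached_clock)
print(uncached_clock.reads, cached_clock.reads)  # 40 2
```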
Frequently asked questions when troubleshooting strategy performance
- Which node type does a strategy run on when it runs through a data flow? Can it be controlled to run on specific node types?
  - There is no dedicated node type for a strategy, because it normally runs as part of a data flow.
- Which components are most likely to degrade performance when used in a strategy design?
  - Unoptimized components are typically the ones that need more attention when debugging. Typical components that might run into performance issues include:
    - Adaptive Model, which is by nature a relatively expensive component. If it uses interaction history predictors, an adaptive model can be extremely expensive when there is an issue with Decision Data Store (for example, a PEGA0075 alert).
    - Interaction history, when there are many records to load.
    - Data import or decision data, when importing a large list of pages.
    - Embedded Strategy, when iterating over a large page list.
    - Data Join, when a misconfigured component leads to an explosion in the number of result pages (for example, a Cartesian product).
    - MarkerContainer, an internal technical representation of all data that needs to be propagated along with the service request page, for example, adaptive decision manager model results and monitoring data. It is transparent to the strategy designer, but if there are too many service request pages or the strategy logic is incorrectly configured, it might cause long garbage collection (GC) pauses. In this case, when applicable, select Exclude model results from Assisted Channel Eligibility in the Data Join shape properties.
- What are the important things to look at while tracing strategies?
  - Tracing with the built-in tracer is not recommended with the optimized decision engine, because its optimizations mean that the order of execution is not guaranteed.
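To see why a misconfigured Data Join can be so costly, consider this Python sketch of a join between two lists of records. The `data_join` helper and the record fields are hypothetical; with no join condition, every left record matches every right record, so the result size is the product of the input sizes rather than the number of matching pairs:

```python
def data_join(left, right, key=None):
    """Hypothetical join helper. With key=None, every pair matches."""
    return [
        {**l_row, **r_row}
        for l_row in left
        for r_row in right
        if key is None or l_row[key] == r_row[key]
    ]


customers = [{"SubjectID": i} for i in range(100)]
# Two hypothetical history rows per customer.
history = [{"SubjectID": i % 100, "outcome": "Accepted"} for i in range(200)]

keyed = data_join(customers, history, key="SubjectID")
unkeyed = data_join(customers, history)  # missing join condition

print(len(keyed), len(unkeyed))  # 200 20000
```

Both calls compare the same number of pairs in this naive sketch; the difference that matters is the number of result records that every downstream component then has to process.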