Creating a batch run for data flows
Create batch runs for your data flows to make simultaneous decisions for large groups of customers. You can also create a batch run for data flows with a non-streamable primary input.
- In the header of Dev Studio, click .
- On the Batch processing tab, click New.
- On the New: Data Flow Work Item tab, associate a Data Flow rule with the data flow run:
- In the Applies to field, press the Down arrow key, and then select the class to which the Data Flow rule applies.
- In the Access group field, press the Down arrow key, and then select an access group context for the data flow run.
- In the Number of threads field, enter the number of threads to use per data flow node.
- In the Data flow field, press the Down arrow key, and then select the Data Flow rule that you want to run. The class that you select in the Applies to field limits the available rules.
- In the Service instance name field, select Batch.
- In the Priority list, select the importance level for the run. For more information, see Data flow run priorities.
- Optional: To run activities before the data flow run starts and after it completes, in the Additional processing section, specify the pre-processing and post-processing activities. For more information, see Adding pre- and post-activities to data flows.
- In the Resilience section, specify the error threshold for the data flow run. In the Fail the run after more than x failed records field, enter an integer greater than 0.
- In the Node failure section, specify how you want the run to proceed if a node becomes unreachable:
- To resume processing records on the remaining active nodes, from the last processed record that is captured by a snapshot, select Resume on other nodes from the last snapshot. If you enable this option, the run can process each record more than once.
This option is available only for resumable data flow runs. For more information about resumable and non-resumable data flow runs and their resilience, see the Data flow service overview article on Pega Community.
- To resume processing records on the remaining active nodes from the first record in the data partition, select Restart the partitions on other nodes. If you enable this option, the run can process each record more than once.
This option is available only for non-resumable data flow runs. For more information about resumable and non-resumable data flow runs and their resilience, see the Data flow service overview article on Pega Community.
- To skip processing the data on the failed node, select Skip partitions on the failed node. If you enable this option, the run completes without processing all records. Records that are processed successfully are processed only once.
- To terminate the data flow run and change the run status to Failed, select Fail the entire run.
This option provides backward compatibility with previous versions of Pega Platform.
- For resumable data flow runs, in the Snapshot management section, specify how often you want the Data Flow service to take snapshots of the last processed record from the data flow source. If you set the Data Flow service to take snapshots more frequently, you reduce the chance of repeating record processing, but you can also lower system performance.
- If your data flow references an Event Strategy rule, configure the state management settings:
- Expand the Event strategy section.
- Optional: To specify how you want the incomplete tumbling windows to behave when the data flow run stops, in the Event emitting section, select one of the available options. By default, when the data flow run stops, all incomplete tumbling windows in the Event Strategy rule emit the collected events. For more information, see Event Strategy rule form - Completing the Event Strategy tab.
- In the State management section, specify how you want the Data Flow service to process data from event strategies:
- To keep the event strategy state in running memory and write the output to a destination when the data flow finishes its run, select Memory.
If you select this option, the Data Flow service processes records faster, but you can lose data in the event of a system failure.
- To periodically replicate the state of an event strategy in the form of key values to the Cassandra database that is located in the Decision Data Store, select Database.
If you select this option, you can fully restore the state of an event strategy after a system failure, and continue processing data.
- In the Target cache size field, specify the maximum size of the cache for state management data. The default value is 10 megabytes.
- Click Done.
- Click Start.
- Optional: To analyze the life cycle of a run, during or after the run, and troubleshoot potential issues, review the life cycle events:
- On the Data flow run tab, click Run details.
- On the Run details tab, click View Lifecycle Events.
- Optional: To export the life cycle events to a single file, click Actions, and then select a file type.
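The settings in this procedure, and the constraints between them, can be summarized in a short Python sketch. This is only an illustration: the class, enum, and function names are hypothetical and mirror the form fields, they are not part of any Pega Platform API, and the minimum of one thread per node is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class NodeFailurePolicy(Enum):
    # Values are the option labels from the Node failure section.
    RESUME_FROM_LAST_SNAPSHOT = "Resume on other nodes from the last snapshot"
    RESTART_PARTITIONS = "Restart the partitions on other nodes"
    SKIP_PARTITIONS = "Skip partitions on the failed node"
    FAIL_RUN = "Fail the entire run"


@dataclass
class BatchRunSettings:
    """Hypothetical model of the run form; not a Pega Platform API."""
    threads_per_node: int
    failed_record_threshold: int
    resumable: bool
    node_failure_policy: NodeFailurePolicy
    target_cache_size_mb: int = 10  # default value stated in the procedure

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the settings are consistent."""
        problems = []
        if self.threads_per_node < 1:  # assumption: at least one thread per node
            problems.append("Number of threads must be at least 1.")
        if self.failed_record_threshold <= 0:
            problems.append("Error threshold must be an integer greater than 0.")
        policy = self.node_failure_policy
        if policy is NodeFailurePolicy.RESUME_FROM_LAST_SNAPSHOT and not self.resumable:
            problems.append(f"'{policy.value}' is available only for resumable runs.")
        if policy is NodeFailurePolicy.RESTART_PARTITIONS and self.resumable:
            problems.append(f"'{policy.value}' is available only for non-resumable runs.")
        return problems


def run_should_fail(failed_records: int, threshold: int) -> bool:
    """Mirror 'Fail the run after more than x failed records'."""
    return failed_records > threshold


# Example: a resumable run paired with an option meant for non-resumable runs.
settings = BatchRunSettings(
    threads_per_node=5,
    failed_record_threshold=1000,
    resumable=True,
    node_failure_policy=NodeFailurePolicy.RESTART_PARTITIONS,
)
print(settings.validate())
```

In this sketch, the example settings produce one validation problem because Restart the partitions on other nodes is paired with a resumable run. The threshold check uses a strict comparison, so a run with exactly as many failed records as the threshold does not fail.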
- Reprocessing failed records in batch data flow runs
When a batch data flow run finishes with failures, you can identify all the records that failed during the run. After you fix all the issues that are related to the failed records, you can reprocess the failures to complete the run by resubmitting the partitions with failed records. This option saves time when your data flow run processes millions of records and you do not want to start the run from the beginning.
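As a rough illustration of why resubmitting only the affected partitions saves time, the following Python sketch groups failed records by partition and returns the partitions that would need to be reprocessed. The function name and the (partition, record) pair format are hypothetical; the sketch does not reflect how Pega Platform stores or exposes failed records.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def partitions_to_resubmit(failed_records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Map each partition that has failures to its number of failed records.

    failed_records is a hypothetical iterable of (partition_id, record_id) pairs.
    Partitions without failures do not appear in the result, so they are not rerun.
    """
    counts = Counter(partition_id for partition_id, _record_id in failed_records)
    return dict(sorted(counts.items()))


# Example: three failures spread over two partitions.
failures = [("partition-2", "rec-17"), ("partition-0", "rec-3"), ("partition-2", "rec-41")]
print(partitions_to_resubmit(failures))
```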