Creating a batch run for data flows
Create batch runs for your data flows to make simultaneous decisions for large groups of customers. You can also create a batch run for data flows with a non-streamable primary input, for example, a Facebook data set.
- Start the Data Flow service.
For more information, see Configuring the Data Flow service.
- Check in the data flow that you want to run.
For more information, see Rule check-in process.
- In the header of Dev Studio, click Configure > Decisioning > Decisions > Data Flows > Batch Processing.
- On the Batch processing tab, click New.
- On the New: Data Flow Work Item tab, associate a Data Flow rule with the data flow run:
- In the Applies to field, press the Down arrow key, and then select the class to which the Data Flow rule applies.
- In the Access group field, press the Down arrow key, and then select an access group context for the data flow run.
- In the Data flow field, press the Down arrow key, and then select the Data Flow rule that you want to run.
The class that you select in the Applies to field limits the available rules.
- In the Service instance name field, select Batch.
- Optional: To run activities before and after the data flow run, in the Additional processing section, specify the pre-processing and post-processing activities.
For more information, see Adding pre- and post- activities to data flows.
- Specify the error threshold for the data flow run:
- Expand the Resilience section.
- In the Fail the run after more than x failed records field, enter an integer greater than 0.
Result: After the number of failed records reaches or exceeds the threshold that you specify, the run stops processing data and the run status changes to Failed. If the number of failed records does not reach or exceed the threshold, the run continues to process data, and the run status then changes to Completed with failures.
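The threshold behavior described in the Result above can be sketched as a small function. This is an illustrative model only, not Pega Platform code: the function name `run_status` is hypothetical, it follows the wording of the Result text (a run fails when the failed-record count reaches or exceeds the threshold), and the plain `Completed` status for a run with no failed records is an assumption.

```python
def run_status(failed_records: int, threshold: int) -> str:
    """Model the final run status from the failed-record count (illustrative only)."""
    if threshold <= 0:
        # The field accepts only an integer greater than 0.
        raise ValueError("threshold must be an integer greater than 0")
    if failed_records >= threshold:
        # The run stops processing data once the threshold is reached or exceeded.
        return "Failed"
    if failed_records > 0:
        # Some records failed, but the run still processed all the data.
        return "Completed with failures"
    # Assumption: a run without failed records ends as "Completed".
    return "Completed"


print(run_status(failed_records=3, threshold=5))
print(run_status(failed_records=5, threshold=5))
```

With a threshold of 5, three failed records leave the run in the Completed with failures status, and the fifth failed record moves it to Failed.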
- In the Node failure section, specify how you want the run to proceed if a node becomes unreachable:
- To resume processing records on the remaining active nodes from the last processed record that is captured by a snapshot, select Resume on other nodes from the last snapshot. If you enable this option, the run can process each record more than once.
This option is available only for resumable data flow runs. For more information about resumable and non-resumable data flow runs and their resilience, see the Data flow service overview article on Pega Community.
- To resume processing records on the remaining active nodes from the first record in the data partition, select Restart the partitions on other nodes. If you enable this option, the run can process each record more than once.
This option is available only for non-resumable data flow runs. For more information about resumable and non-resumable data flow runs and their resilience, see the Data flow service overview article on Pega Community.
- To skip processing the data on the failed node, select Skip partitions on the failed node. If you enable this option, the run completes without processing all records. Records that are processed successfully are processed only once.
- To terminate the data flow run and change the run status to Failed, select Fail the entire run.
This option provides backward compatibility with previous versions of Pega Platform.
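The trade-offs between the node failure options can be summarized in a small lookup table. The sketch below is a reading aid, not part of any Pega Platform API: the names `NodeFailureOption`, `Behavior`, and `available_options` are hypothetical. It also assumes that the Skip partitions on the failed node and Fail the entire run options are available for both resumable and non-resumable runs, which the text above does not state explicitly.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class NodeFailureOption(Enum):
    RESUME_FROM_SNAPSHOT = "Resume on other nodes from the last snapshot"
    RESTART_PARTITIONS = "Restart the partitions on other nodes"
    SKIP_PARTITIONS = "Skip partitions on the failed node"
    FAIL_RUN = "Fail the entire run"


@dataclass(frozen=True)
class Behavior:
    may_process_record_twice: bool  # a record can be processed more than once
    may_skip_records: bool          # the run can complete without all records
    run_fails: bool                 # the run status changes to Failed


BEHAVIOR = {
    NodeFailureOption.RESUME_FROM_SNAPSHOT: Behavior(True, False, False),
    NodeFailureOption.RESTART_PARTITIONS: Behavior(True, False, False),
    NodeFailureOption.SKIP_PARTITIONS: Behavior(False, True, False),
    NodeFailureOption.FAIL_RUN: Behavior(False, False, True),
}


def available_options(resumable: bool) -> List[NodeFailureOption]:
    """List the options for a run type (skip and fail for both is an assumption)."""
    first = (NodeFailureOption.RESUME_FROM_SNAPSHOT if resumable
             else NodeFailureOption.RESTART_PARTITIONS)
    return [first, NodeFailureOption.SKIP_PARTITIONS, NodeFailureOption.FAIL_RUN]


for option in available_options(resumable=True):
    print(option.value, BEHAVIOR[option])
```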
- For resumable data flow runs, in the Snapshot management section, specify how often you want the Data Flow service to take snapshots of the last processed record from the data flow source.
If you set the Data Flow service to take snapshots more frequently, you reduce the chance of processing records more than once, but you can also lower system performance.
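The trade-off between snapshot frequency and repeated processing can be illustrated with a rough calculation. The sketch below is a simplified model, not the Data Flow service implementation: it assumes that a snapshot is taken after every fixed number of records (the hypothetical `snapshot_interval` parameter), whereas the actual setting might be time-based, and both function names are made up for this example.

```python
def reprocessed_records(processed_before_failure: int, snapshot_interval: int) -> int:
    """Records processed again when the run resumes from the last snapshot."""
    if snapshot_interval <= 0:
        raise ValueError("snapshot_interval must be greater than 0")
    if processed_before_failure < 0:
        raise ValueError("processed_before_failure must not be negative")
    # Position of the last record that a snapshot captured before the failure.
    last_snapshot = (processed_before_failure // snapshot_interval) * snapshot_interval
    return processed_before_failure - last_snapshot


def snapshots_taken(processed_before_failure: int, snapshot_interval: int) -> int:
    """Number of snapshots written before the failure (the performance cost)."""
    return processed_before_failure // snapshot_interval


# A node fails after processing 1950 records.
for interval in (100, 1000):
    print(interval,
          reprocessed_records(1950, interval),
          snapshots_taken(1950, interval))
```

In this model, an interval of 100 records repeats 50 records at the cost of 19 snapshots, while an interval of 1000 records repeats 950 records but takes only one snapshot.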
- If your data flow references an Event Strategy rule, configure the state management settings:
- Expand the Event strategy section.
- Optional: To specify how incomplete tumbling windows behave when the data flow run stops, in the Event emitting section, select one of the available options.
By default, when the data flow run stops, all incomplete tumbling windows in the Event Strategy rule emit the collected events. For more information, see Event Strategy rule form - Completing the Event Strategy tab.
- In the State management section, specify how you want the Data Flow service to process data from event strategies:
- To keep the event strategy state in running memory and write the output to a destination when the data flow finishes its run, select Memory.
If you select this option, the Data Flow service processes records faster, but you can lose data in the event of a system failure.
- To periodically replicate the state of an event strategy in the form of key values to the Cassandra database that is located in the Decision Data Store, select Database.
If you select this option, you can fully restore the state of an event strategy after a system failure, and continue processing data.
- In the Target cache size field, specify the maximum size of the cache for state management data.
The default value is 10 megabytes.
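The difference between the Memory and Database state management options can be pictured with a toy model. The sketch below is not how Pega Platform or Cassandra stores event strategy state: the class `EventStrategyState` and its methods are hypothetical, and a plain dictionary stands in for the key-value store in the Decision Data Store.

```python
class EventStrategyState:
    """Toy model of event strategy state with optional replication."""

    def __init__(self, persist: bool):
        self.persist = persist  # True models Database, False models Memory
        self.memory = {}        # state kept in running memory
        self._store = {}        # stand-in for the external key-value store

    def update(self, key, value):
        self.memory[key] = value

    def replicate(self):
        # Copy the in-memory state to the store only in Database mode.
        if self.persist:
            self._store = dict(self.memory)

    def recover_after_failure(self):
        # A failure wipes running memory; only replicated state comes back.
        self.memory = dict(self._store)
        return self.memory


memory_mode = EventStrategyState(persist=False)
memory_mode.update("customer-1", 3)
memory_mode.replicate()
print(memory_mode.recover_after_failure())

database_mode = EventStrategyState(persist=True)
database_mode.update("customer-1", 3)
database_mode.replicate()
print(database_mode.recover_after_failure())
```

In this model, the Memory instance comes back empty after a failure, while the Database instance recovers the state that was replicated before the failure.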
- Click Done.
Result: The system creates a batch run for your data flow and opens a new tab with details about the run. The run does not start yet.
- Click Start.
Result: The batch data flow run starts.
- Optional: To analyze the life cycle of a run during or after the run and troubleshoot potential issues, review the life cycle events:
- On the Data flow run tab, click Run details.
- On the Run details tab, click View Lifecycle Events.
Result: The system opens a new window with a list of life cycle events. Each event has a list of assigned details, for example, reason. For more information, see Event details in data flow runs on Pega Community.
Note: By default, Pega Platform displays events from the last 10 days. You can change this value by editing the dataflow/run/lifecycleEventsRetentionDays dynamic data setting.
- Optional: To export the life cycle events to a single file, click Actions, and then select a file type.
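The effect of the retention setting for life cycle events can be pictured as a date filter. The sketch below is a hypothetical illustration, not Pega Platform code: the function `visible_events` and the dictionary shape of an event (`reason`, `timestamp`) are assumptions, and the `retention_days` parameter only mirrors the 10-day default of the dataflow/run/lifecycleEventsRetentionDays setting.

```python
from datetime import datetime, timedelta


def visible_events(events, now, retention_days=10):
    """Keep only the events that fall within the retention period."""
    cutoff = now - timedelta(days=retention_days)
    return [event for event in events if event["timestamp"] >= cutoff]


# Hypothetical events; the reasons and dates are made up for this example.
now = datetime(2024, 1, 31)
events = [
    {"reason": "Run started", "timestamp": datetime(2024, 1, 30)},
    {"reason": "Node unreachable", "timestamp": datetime(2024, 1, 1)},
]

for event in visible_events(events, now):
    print(event["reason"])
```

With the default of 10 days, only the event from January 30 remains in the list. A larger `retention_days` value, such as 60, keeps both events.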