Creating a real-time run for data flows
Provide your decision strategies with the latest data by creating real-time runs for data flows with a streamable data set source, for example, a Kafka data set.
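For context, the following minimal sketch (not part of Pega Platform) uses the Apache Kafka Java client to publish JSON events to a hypothetical customer-events topic; a Kafka data set that maps to such a topic can then serve as the streamable source for the real-time run. The broker address, topic name, and event properties are assumptions for illustration only.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class CustomerEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Hypothetical event; a Kafka data set maps JSON properties like these to a Pega class.
            String event = "{\"CustomerID\":\"C-1001\",\"Action\":\"page_view\",\"Timestamp\":\"2024-01-01T12:00:00Z\"}";
            producer.send(new ProducerRecord<>("customer-events", "C-1001", event));
            producer.flush();
        }
    }
}
```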
- In the header of Dev Studio, click Configure > Decisioning > Decisions > Data Flows > Real-time processing.
- On the Real-time processing tab, click New.
- On the New: Data Flow Work Item tab, associate a Data Flow rule
with the data flow run:
- In the Applies to field, press the Down arrow key, and then select the class to which the Data Flow rule applies.
- In the Access group field, press the Down arrow key, and then select an access group context for the data flow run.
- In the Number of threads field, enter the number of threads to use per data flow node.
- In the Data flow field, press the Down arrow key, and then
select the Data Flow rule that you want to run. The class that you select in the Applies to field limits the available rules.
- In the Service instance name field, select Real time.
- In the Priority list, select the importance level for the
run. For more information, see Data flow run priorities.
- Optional: To keep the run active and to restart the run automatically after every modification,
specify the following settings:
- Select the Manage the run and include it in the application check box.
- In the Ruleset field, press the Down arrow key, and then select a ruleset that you want to associate with the run.
- In the Run ID field, enter a meaningful ID to identify the data flow run.
- Optional: In the Additional processing section, specify any activities that
you want to run before and after the data flow run. For more information, see Adding pre- and post-activities to data flows.
- In the Resilience section, specify an error threshold for the data
flow run. In the Fail the run after more than x failed records
field, enter an integer greater than 0. After the number of failed records reaches or exceeds the threshold that you specify, the run stops processing data and the run status changes to Failed. If the number of failed records does not reach or exceed the threshold, the run continues to process data, and the run status then changes to Completed with failures.
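The threshold works as a simple counter check against the configured limit. The following sketch only illustrates that decision logic with hypothetical names; it is not Pega Platform code, and the exact boundary behavior follows the product, not this example.

```java
/** Illustrative only: how a failed-record threshold decides the final run status. */
enum RunStatus { IN_PROGRESS, FAILED, COMPLETED, COMPLETED_WITH_FAILURES }

class ResiliencePolicy {
    private final long failureThreshold;   // the "Fail the run after more than x failed records" value
    private long failedRecords;

    ResiliencePolicy(long failureThreshold) {
        this.failureThreshold = failureThreshold;
    }

    /** Called whenever a record fails; the run stops once the count passes the configured limit. */
    RunStatus onRecordFailed() {
        failedRecords++;
        return failedRecords > failureThreshold ? RunStatus.FAILED : RunStatus.IN_PROGRESS;
    }

    /** Called when the source is exhausted or the run is stopped below the threshold. */
    RunStatus onRunFinished() {
        return failedRecords > 0 ? RunStatus.COMPLETED_WITH_FAILURES : RunStatus.COMPLETED;
    }
}
```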
- In the Node failure section, specify how you want the run to
proceed in case the node becomes unreachable:
- To resume processing records on the remaining active nodes, from the last
processed record that is captured by a snapshot, select Resume on other nodes
from the last snapshot. If you enable this option, the run can process
each record more than once.
This option is available only for resumable data flow runs.
- To resume processing records on the remaining active nodes from the first record
in the data partition, select Restart the partitions on other
nodes. If you enable this option, the run can process each record more
than once.
This option is available only for non-resumable data flow runs.
- To terminate the data flow run and change the run status to
Failed, select Fail the entire
run.
This option provides backward compatibility with previous Pega Platform versions.
The available options depend on the type of data flow run. For more information about resumable and non-resumable data flow runs and their resilience, see the Data flow service overview article on Pega Community.
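For a Kafka-backed source, the difference between the two resume options can be pictured with the standard Kafka consumer API: resuming from the last snapshot is analogous to seeking to a previously saved offset, while restarting a partition is analogous to seeking to its beginning. The sketch below is an analogy only, not the implementation of the Data Flow service.

```java
import java.util.Collection;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

/** Analogy only: where processing resumes after a partition is reassigned to another node. */
class ResumeStrategies {
    /** "Resume on other nodes from the last snapshot": records after the snapshot are reprocessed. */
    static void resumeFromSnapshot(KafkaConsumer<String, String> consumer,
                                   Map<TopicPartition, Long> snapshotOffsets) {
        snapshotOffsets.forEach(consumer::seek);
    }

    /** "Restart the partitions on other nodes": every record in the partition is reprocessed. */
    static void restartPartitions(KafkaConsumer<String, String> consumer,
                                  Collection<TopicPartition> partitions) {
        consumer.seekToBeginning(partitions);
    }
}
```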
- For resumable data flow runs, in the Snapshot management section,
specify how often you want the Data Flow service to take snapshots of the last processed
record from the data flow source. If you set the Data Flow service to take snapshots more frequently, fewer records are reprocessed after a failure, but frequent snapshots can also lower system performance.
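As an illustration of that trade-off, the hypothetical scheduler below records a snapshot of the last processed position at a fixed interval; a shorter interval means fewer replayed records after a failure but more frequent pauses to persist the position. It does not reflect the internal snapshot mechanism of the Data Flow service.

```java
/** Illustrative only: periodic snapshots of the last processed position. */
class SnapshotScheduler {
    private final long snapshotIntervalMs;   // smaller interval = fewer replayed records, more overhead
    private long lastSnapshotTime;
    private long lastProcessedPosition;
    private long lastSnapshottedPosition;

    SnapshotScheduler(long snapshotIntervalMs) {
        this.snapshotIntervalMs = snapshotIntervalMs;
    }

    void onRecordProcessed(long position) {
        lastProcessedPosition = position;
        long now = System.currentTimeMillis();
        if (now - lastSnapshotTime >= snapshotIntervalMs) {
            lastSnapshottedPosition = lastProcessedPosition;   // persist the position (simplified)
            lastSnapshotTime = now;
        }
    }

    /** After a node failure, processing resumes here; records after this position are replayed. */
    long resumePosition() {
        return lastSnapshottedPosition;
    }
}
```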
- If your data flow references an Event Strategy rule, configure the state management
settings:
- Expand the Event strategy section.
- Optional: To specify how you want the incomplete tumbling windows to act when the data flow
run stops, in the Event emitting section, select one of the
available options. By default, when the data flow run stops, all the incomplete tumbling windows in the Event Strategy rule emit the collected events. For more information, see Event Strategy rule form - Completing the Event Strategy tab.
- In the State management section, specify how you want the Data
Flow service to process data from event strategies:
- To keep the event strategy state in running memory and write the output to a
destination when the data flow finishes its run, select
Memory.
If you select this option, the Data Flow service processes records faster, but you can lose data in the event of a system failure.
- To periodically replicate the state of an event strategy in the form of key
values to the Cassandra database that is located in the Decision Data Store,
select Database.
If you select this option, you can fully restore the state of an event strategy after a system failure, and continue processing data.
- In the Target cache size field, specify the maximum size of
the cache for state management data. The default value is 10 megabytes.
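Conceptually, the two state management options differ in whether the key-value state of the event strategy lives only in memory or is also replicated to the Decision Data Store (Cassandra). The sketch below uses hypothetical interfaces to show that distinction; it is not the Pega Platform API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical interface: where the event strategy keeps its per-key state. */
interface StateStore {
    void put(String key, byte[] state);
    byte[] get(String key);
}

/** "Memory": fastest, but the state is lost if the node fails. */
class MemoryStateStore implements StateStore {
    private final Map<String, byte[]> state = new ConcurrentHashMap<>();
    public void put(String key, byte[] value) { state.put(key, value); }
    public byte[] get(String key) { return state.get(key); }
}

/** "Database": the in-memory cache is also replicated to the Decision Data Store (a Cassandra
 *  keyspace), so the state survives a node failure. The persistence call is a stub. */
class ReplicatedStateStore implements StateStore {
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();  // bounded by Target cache size
    public void put(String key, byte[] value) {
        cache.put(key, value);
        replicateToDecisionDataStore(key, value);   // simplified: real replication is periodic, not per write
    }
    public byte[] get(String key) { return cache.get(key); }
    private void replicateToDecisionDataStore(String key, byte[] value) { /* write key-value pair to DDS */ }
}
```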
- Click Done.
- Click Start.
- Optional: To analyze the life cycle during or after a run and to troubleshoot potential issues, review the life cycle events:
- On the Data flow run tab, click Run details.
- On the Run details tab, click View Lifecycle Events.
- Optional: To export the life cycle events to a single file, click Actions, and then select a file type.