The use of streaming data sets in data flows
The configuration of a streaming data set, such as a Kafka, Kinesis, or Stream data set, affects the life cycle of the records that a data flow run consumes from it. You can use the following information to prevent duplicate processing or loss of records during data flow runs that use these data sets.
When configuring a streaming data set as the data flow source, users can set the Read options to either Only read new records or Read existing and new records.
These options determine how the streaming data set behaves whenever a data flow run starts from the beginning, whether by creating a new data flow run or by restarting an existing one.
When users select Read existing and new records, the data flow run starts by reading all records that already exist in the data set, and then continues with new ones.
When users select Only read new records, the data flow run skips any existing records and reads only the records that appear in the data set after the run enters the In-progress status.
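To make the distinction concrete, the two Read options map closely to the auto.offset.reset policy of a plain Apache Kafka consumer: earliest roughly corresponds to Read existing and new records, and latest to Only read new records. The sketch below uses the standard Kafka Java client with a hypothetical topic name and group id; it illustrates the semantics only and is not the data flow engine's actual implementation.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReadOptionsDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "data-flow-run-1"); // hypothetical group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // "Read existing and new records" ~ "earliest": start from the oldest available offset.
        // "Only read new records"        ~ "latest":   start from the end of the topic.
        // Note: auto.offset.reset applies only when the group has no committed offsets yet.
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("customer-events")); // hypothetical topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
                }
            }
        }
    }
}
```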
Figure: An example of a Source configurations window.
Users can choose to Start, Stop, Continue, or Restart a data flow run. For more information, see Managing data flow runs.
Regardless of the selected Read options, if a user chooses to Continue a stopped run, the data flow run resumes processing with the records received after the run was stopped.
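In Kafka terms, Continue behaves like resuming a consumer group from its last committed offsets, which take precedence over the auto.offset.reset policy. The sketch below assumes a consumer configured as in the previous example but with enable.auto.commit set to false; the handle method is a hypothetical stand-in for record processing.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ContinueDemo {
    // Assumes a consumer configured as in the previous sketch, with
    // enable.auto.commit=false so offsets are committed explicitly.
    static void runUntilStopped(KafkaConsumer<String, String> consumer, AtomicBoolean stopped) {
        while (!stopped.get()) {
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                handle(record);
            }
            // Committing after processing persists the read position. A later
            // "Continue" (a new consumer with the same group.id) resumes from
            // this committed offset, so records published while the run was
            // stopped are still consumed, regardless of auto.offset.reset.
            consumer.commitSync();
        }
    }

    static void handle(ConsumerRecord<String, String> record) {
        // Hypothetical per-record processing.
        System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
    }
}
```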
If a user chooses to Restart the run, the outcome depends on the selected Read options, as illustrated in the sketch after the following list:
- If a user selected Only read new records, the records sent between Stop and Restart are ignored. For example, a new data flow run on an existing Kafka topic with this setting completely disregards the records already on that topic.
- If a user selected Read existing and new records, all existing records in the data set are reprocessed, which can result in duplicates. For example, a new run on an existing Kafka topic with this setting processes all existing records on that topic.
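A restart from scratch is analogous to explicitly seeking a Kafka consumer to one end of its assigned partitions. The sketch below is illustrative only; the readExistingRecords flag and the preliminary poll used to force partition assignment are assumptions of this example, not part of the product's configuration.

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class RestartDemo {
    // Illustrative restart-from-scratch semantics for a consumer that has
    // already subscribed to a topic (configuration as in the first sketch).
    static void restart(KafkaConsumer<String, String> consumer, boolean readExistingRecords) {
        // A short poll gives the group coordinator a chance to assign
        // partitions, so the seek calls below have partitions to act on.
        consumer.poll(Duration.ofMillis(100));
        if (readExistingRecords) {
            // "Read existing and new records": rewind to the earliest offset.
            // Every record already on the topic is read again, which is why
            // a Restart with this setting can produce duplicates.
            consumer.seekToBeginning(consumer.assignment());
        } else {
            // "Only read new records": jump past everything already on the
            // topic, including records sent between Stop and Restart.
            consumer.seekToEnd(consumer.assignment());
        }
    }
}
```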