Review these best practices for configuring data ingestion processes in your application.
Add data flow run details to the case review screen to help debug issues
Provide reports and access for the client's ETL and production support teams
Provide insight into the execution status of the ingestion process (which is normally set up as an agent) by following these best practices:
- Configure a case progress and monitoring report to identify execution statistics and status.
- Provide access to the Case Manager portal to both the client's ETL and production support teams.
- Schedule the report to be sent as an email attachment to the ETL and production support teams.
Provide data to identify and resolve errors
In error reports, include the CaseID, the DataFlowID, and the details that are available from the Batch processing landing page. This information provides the key details needed to identify and resolve processing errors.
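As a sketch, an error-report entry could bundle these identifiers into one record. The field names and the `build_error_entry` helper below are illustrative assumptions, not Pega properties or APIs:

```python
# Sketch: build an error-report entry that carries the identifiers needed
# to trace a failed ingestion run back to its source. Field names and the
# helper are illustrative, not Pega APIs.

def build_error_entry(case_id, data_flow_id, status,
                      records_processed, records_failed, last_error):
    """Bundle the key identifiers from the Batch processing landing page."""
    return {
        "CaseID": case_id,                  # ties the failure to a case
        "DataFlowID": data_flow_id,         # identifies the data flow run
        "Status": status,                   # e.g. "Failed"
        "RecordsProcessed": records_processed,
        "RecordsFailed": records_failed,
        "LastError": last_error,            # message for the support team
    }

entry = build_error_entry("C-1001", "DF-42", "Failed",
                          9500, 500, "Timeout writing to data set")
print(entry["CaseID"], entry["DataFlowID"])
```

Keeping all of these fields in one record means a support engineer can jump from the report straight to the failing run without cross-referencing screens.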
Clean up the Batch data flow landing page
If you run multiple ingestion jobs daily, the number of entries on the Batch processing landing page increases. This increase can slow down reporting and make manual review of the processing cumbersome. Configure an agent to delete older data flow instances, typically after two weeks.
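The retention logic such an agent applies can be sketched as follows. The in-memory list of runs stands in for the landing-page entries; the two-week window and field names are assumptions:

```python
# Sketch of cleanup-agent logic: keep only data flow runs newer than a
# retention window (two weeks here). The run records are illustrative
# stand-ins for Batch processing landing page entries.
from datetime import datetime, timedelta

RETENTION = timedelta(weeks=2)

def purge_old_runs(runs, now=None):
    """Return only the runs that completed within the retention window."""
    now = now or datetime.utcnow()
    cutoff = now - RETENTION
    return [run for run in runs if run["completed"] >= cutoff]

now = datetime(2024, 6, 1)
runs = [
    {"id": "DF-1", "completed": datetime(2024, 5, 10)},  # older than 2 weeks: purged
    {"id": "DF-2", "completed": datetime(2024, 5, 25)},  # within window: kept
]
kept = purge_old_runs(runs, now=now)
print([r["id"] for r in kept])  # ['DF-2']
```

Passing `now` explicitly keeps the cutoff testable; a scheduled agent would simply use the current time.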
Check the S3 file count and correct file location regularly
The processing of the files in the S3 SFTP folders uses pattern matching to identify the files to be processed. The production support team must regularly verify that the files are being sent to the appropriate S3 locations and that the files in those locations match the pattern for the intended processing. The client's ETL teams regularly update their processing as data attributes are added to or removed from the transferred data sets, and mistakes can be made when the processing jobs that support those changes are updated.
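A periodic check along these lines can flag files that will silently be skipped. The prefixes, glob patterns, and key listing below are illustrative assumptions; in production you would list the keys with an S3 client such as boto3:

```python
# Sketch: verify that files in each S3 prefix match the glob pattern the
# ingestion job expects. Prefixes and patterns here are assumptions.
import fnmatch

EXPECTED = {
    "inbound/members/": "members_*.csv",
    "inbound/claims/": "claims_*.csv",
}

def find_mismatches(keys_by_prefix):
    """Return files that will NOT be picked up by the configured pattern."""
    mismatches = []
    for prefix, pattern in EXPECTED.items():
        for key in keys_by_prefix.get(prefix, []):
            if not fnmatch.fnmatch(key, pattern):
                mismatches.append(prefix + key)
    return mismatches

keys = {
    # second file name drifted from the agreed pattern
    "inbound/members/": ["members_20240601.csv", "member-20240601.csv"],
    "inbound/claims/": ["claims_20240601.csv"],
}
print(find_mismatches(keys))  # ['inbound/members/member-20240601.csv']
```

Running such a check on a schedule catches naming drift from ETL-side changes before files pile up unprocessed.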
Archive files before processing
Ingestion SFTP and file processing might encounter errors. To minimize reprocessing time when errors occur, ensure that the process copies files to an archive at each stage of processing (archive rather than delete).
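A minimal sketch of the archive-before-process pattern, assuming local folders as stand-ins for the staging locations (paths and the `archive_then_process` helper are illustrative):

```python
# Sketch: copy each inbound file to an archive folder before processing it,
# so a failed run can be replayed without asking the source to resend.
import shutil
import tempfile
from pathlib import Path

def archive_then_process(inbound, archive, process):
    """Copy every inbound file to the archive, then process it."""
    archive.mkdir(parents=True, exist_ok=True)
    for path in sorted(inbound.iterdir()):
        shutil.copy2(path, archive / path.name)  # keep a replayable copy
        process(path)

# usage with temporary directories
base = Path(tempfile.mkdtemp())
inbound = base / "inbound"
inbound.mkdir()
(inbound / "members_1.csv").write_text("id\n1\n")
processed = []
archive_then_process(inbound, base / "archive", processed.append)
print((base / "archive" / "members_1.csv").exists(), len(processed))
```

Because the copy happens before processing starts, an error mid-run never leaves you without the original file.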
Use parallel processing
File processing in Pega Platform provides several opportunities to process data in multiple parallel streams, which can substantially reduce processing time. To take advantage of the parallel processing of the data flows, divide large files into multiple smaller files, each with approximately the same number of records.
The number of concurrent partitions is determined by the Number of threads parameter that you set when creating a new data flow. One thread is associated with each input file. If the input files are not approximately equal in size, the process takes more time because the entire run must wait until the last file is processed.
You can also manage the thread count setting on the Services landing page: select the Batch service type to edit the settings for the data flows.
For example, with three nodes, you can process 15 files of 1 million records each faster than three files of 5 million records each, or one file of 15 million records.
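The splitting step described above can be sketched as follows. The file layout (one header line plus one record per line) and the output naming are assumptions for illustration:

```python
# Sketch: split one large delimited file into N smaller files with roughly
# equal record counts, so the data flow's parallel threads finish at about
# the same time.
import tempfile
from pathlib import Path

def split_file(source, parts, out_dir):
    """Split source (header + records) into `parts` near-equal files."""
    lines = source.read_text().splitlines()
    header, records = lines[0], lines[1:]
    per_part = -(-len(records) // parts)  # ceiling division
    out_dir.mkdir(parents=True, exist_ok=True)
    outputs = []
    for i in range(parts):
        chunk = records[i * per_part:(i + 1) * per_part]
        if not chunk:
            break
        out = out_dir / f"{source.stem}_part{i + 1}.csv"
        out.write_text("\n".join([header] + chunk) + "\n")
        outputs.append(out)
    return outputs

base = Path(tempfile.mkdtemp())
big = base / "claims.csv"
big.write_text("id\n" + "\n".join(str(n) for n in range(15)) + "\n")
parts = split_file(big, 3, base / "split")
print([len(p.read_text().splitlines()) - 1 for p in parts])  # [5, 5, 5]
```

Keeping the record counts near-equal is the point: the run finishes when the slowest file does, so a lopsided split wastes the idle threads.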