Data ingestion best practices

Updated on August 3, 2022

This content applies only to Pega Cloud environments

Review these best practices for configuring data ingestion processes in your application.


Add data flow run details to the case review screen to help debug issues

Figure: Ingestion case review screen with data flow details
The ingestion case review screen includes details of the staging and xCAR data flows, such as the data flow ID, status, and updated record count.

Provide reports and access for the client's ETL and production support teams

Provide insight into the execution status of the ingestion process (which is normally set up as an agent) by following these best practices:

  • Configure a case progress and monitoring report to identify execution statistics and status.
  • Provide access to the Case Manager portal to both the client's ETL and production support teams.
  • Schedule the report to be sent as an email attachment to the ETL and production support teams.

Provide data to identify and resolve errors

Include the CaseID, the DataFlowID, and the details that are available from the Batch processing landing page in error reports. This information gives teams the key details they need to identify and resolve processing errors.
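For illustration, an error-report entry could capture these identifiers alongside the failure details. The sketch below writes such entries to a CSV file; the field names are assumptions for the example, not Pega property names.

import csv
from datetime import datetime, timezone

# Hypothetical error-report row; the field names are illustrative,
# not actual Pega property names.
ERROR_REPORT_FIELDS = [
    "case_id", "data_flow_id", "run_status",
    "failed_record_count", "last_error_message", "reported_at",
]

def append_error_report_row(path, case_id, data_flow_id, run_status,
                            failed_record_count, last_error_message):
    """Append one row so that support teams can trace the failing run."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=ERROR_REPORT_FIELDS)
        if f.tell() == 0:  # new file: write the header first
            writer.writeheader()
        writer.writerow({
            "case_id": case_id,
            "data_flow_id": data_flow_id,
            "run_status": run_status,
            "failed_record_count": failed_record_count,
            "last_error_message": last_error_message,
            "reported_at": datetime.now(timezone.utc).isoformat(),
        })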

Clean up the Batch data flow landing page

If you run multiple ingestion jobs daily, the number of entries on the Batch processing landing page increases. This increase can slow down reporting and make manual review of the processing cumbersome. Configure an agent to delete the older data flow instances, typically after two weeks.
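The retention rule itself is simple. The sketch below shows the cutoff that such a cleanup job could apply, assuming the run entries are available as records with a completion timestamp; the record shape is hypothetical, and the actual deletion is performed by the agent against the Batch processing landing page entries.

from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=14)  # keep roughly two weeks of run history

def runs_to_delete(run_records, now=None):
    """Return the data flow run entries that are older than the retention
    window. Each record is assumed to be a dict with a 'completed_at'
    datetime; the real cleanup runs as a scheduled agent in the platform."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - RETENTION
    return [r for r in run_records if r["completed_at"] < cutoff]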

Check the S3 file count and correct file location regularly

The processing of the files in the S3 SFTP folders uses pattern matching to identify the files to be processed. The production support team must regularly verify that the files are sent to the appropriate S3 locations and that the files in those locations match the pattern for the intended processing. The client's ETL teams regularly update their processing as data attributes are added to or removed from the transferred data sets, and mistakes can be made when the processing jobs that support those changes are updated.
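Such a check can also be scripted outside the platform. The sketch below lists the objects under an expected S3 prefix and flags any file names that do not match the ingestion pattern; the bucket name, prefix, and pattern are placeholder assumptions.

import fnmatch
import boto3

def check_s3_files(bucket, prefix, pattern):
    """List the files under the expected prefix and flag any names that
    do not match the pattern that the ingestion processing expects."""
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    matched, unmatched = [], []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            name = obj["Key"].rsplit("/", 1)[-1]
            if fnmatch.fnmatch(name, pattern):
                matched.append(obj["Key"])
            else:
                unmatched.append(obj["Key"])
    return matched, unmatched

# Placeholder values; use the client's actual bucket, folder, and pattern.
matched, unmatched = check_s3_files("my-ingestion-bucket", "inbound/customer/", "customer_*.csv")
print(f"{len(matched)} files match the pattern; {len(unmatched)} would be skipped: {unmatched}")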

Archive files before processing

Ingestion SFTP and file processing might encounter errors. To minimize reprocessing time when errors occur, ensure that the process copies files during the different stages of processing (archive rather than delete), so that the original files remain available for reprocessing.
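One way to keep a recoverable copy is to copy each inbound file to an archive location before the data flow consumes it, so that a failed run can be replayed without asking the ETL team to resend the file. The sketch below does this for S3; the bucket and archive prefix are assumptions.

import boto3

def archive_before_processing(bucket, key, archive_prefix="archive/"):
    """Copy an inbound file to an archive prefix before it is processed,
    so that it can be reprocessed if the ingestion run fails."""
    s3 = boto3.client("s3")
    file_name = key.rsplit("/", 1)[-1]
    archive_key = archive_prefix + file_name
    s3.copy_object(
        Bucket=bucket,
        Key=archive_key,
        CopySource={"Bucket": bucket, "Key": key},
    )
    return archive_key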

Use parallel processing

File processing in Pega Platform provides several opportunities to process data in multiple parallel streams, which can substantially reduce processing time. To take advantage of the parallel processing of the data flows, divide large files into multiple smaller files, each with approximately the same number of records.
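For illustration, the sketch below splits a large delimited file into a chosen number of smaller files with roughly equal record counts. It assumes a single header row followed by one record per line; the output naming is arbitrary.

def split_file(path, parts):
    """Split a delimited file into roughly equal parts so that each part
    can feed a separate data flow thread. Assumes one header row and one
    record per line."""
    with open(path, "r", encoding="utf-8") as f:
        header = f.readline()
        records = f.readlines()
    per_part = -(-len(records) // parts)  # ceiling division
    stem = path.rsplit(".", 1)[0]
    outputs = []
    for n in range(parts):
        chunk = records[n * per_part:(n + 1) * per_part]
        if not chunk:
            break
        out_path = f"{stem}_part{n + 1}.csv"
        with open(out_path, "w", encoding="utf-8") as out:
            out.write(header)
            out.writelines(chunk)
        outputs.append(out_path)
    return outputs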

The number of concurrent partitions is determined by the Number of threads parameter that you set when creating a new data flow. One thread is associated with each input file. If the input files are not approximately equal in size, processing takes more time because the entire process must wait until the last file is processed.

You can also manage the thread count setting on the Services landing page. Go to Configure > Decisioning > Infrastructure > Services > Data flows, and then select the Batch service type to edit the settings for the data flows.

For example, with three nodes, you can process 15 files with 1 million records each faster than three files with 5 million records each, or one file with 15 million records.
