Best practices for data ingestion
This content applies only to Pega Cloud environments
Review these best practices for configuring data ingestion processes in your application.
Add data flow run details to the case review screen to help debug issues
Provide reports and access for the client's ETL and production support teams
Provide insight into the execution status of the ingestion process, which is normally set up to run as an agent, by following these best practices:
- Configure a case progress and monitoring report to identify execution statistics and status.
- Provide access to the Case Manager portal to both the client's ETL and production support teams.
- Schedule the report to be sent as an email attachment to the ETL and production support teams (a minimal sketch of this delivery follows below).
For more information, see Creating a report.
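Pega's report scheduling handles this delivery for you, but conceptually the scheduled job amounts to the following minimal Python sketch; the SMTP host, addresses, and report file name are illustrative assumptions rather than product values.

```python
# Minimal sketch: send the run-status report as an email attachment to the
# ETL and production support distribution lists. Host, sender, recipients,
# and file name are illustrative assumptions.
import smtplib
from email.message import EmailMessage

def mail_status_report(report_path="run_status.csv"):
    msg = EmailMessage()
    msg["Subject"] = "Data ingestion run status"
    msg["From"] = "noreply@example.com"
    msg["To"] = "etl-team@example.com, prod-support@example.com"
    msg.set_content("Attached is the latest ingestion run-status report.")
    with open(report_path, "rb") as f:
        msg.add_attachment(f.read(), maintype="text",
                           subtype="csv", filename=report_path)
    with smtplib.SMTP("smtp.example.com") as smtp:
        smtp.send_message(msg)
```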
Provide data to identify and resolve errors
Include the CaseID, the DataFlowID, and the details that are available on the Batch processing landing page in your error reports. This information gives support teams the key details that they need to identify and resolve processing errors.
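As a rough illustration of the shape of one error-report row, consider the following sketch; beyond CaseID and DataFlowID, the field names are assumptions standing in for the details that the Batch processing landing page exposes.

```python
# Illustrative shape of one error-report row. CaseID and DataFlowID come
# from the ingestion case; the remaining fields are assumed stand-ins for
# details from the Batch processing landing page.
from dataclasses import dataclass

@dataclass
class IngestionErrorRow:
    case_id: str           # CaseID of the ingestion case
    data_flow_id: str      # DataFlowID of the failing run
    records_processed: int
    records_failed: int
    last_error_message: str
```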
Regularly check the S3 file count and file locations
File processing in the S3 SFTP folders uses pattern matching to identify the files to be processed. The production support team must regularly verify that files are delivered to the appropriate S3 locations and that the files in those locations match the pattern for the intended processing. Because the client's ETL teams regularly update their jobs as data attributes are added to or removed from the transferred data sets, mistakes can occur when the processing jobs that support those changes are updated.
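If the support team has scripted access to the S3 location, a periodic check along the lines of the following boto3 sketch can automate that verification; the bucket, prefix, file pattern, and expected count are illustrative assumptions.

```python
# Minimal sketch: list an inbound S3 location, flag files that the
# ingestion pattern will not pick up, and confirm the expected count.
# Bucket, prefix, pattern, and expected_count are illustrative assumptions.
import fnmatch
import boto3

def check_inbound(bucket="example-ingestion-bucket",
                  prefix="inbound/customer/",
                  pattern="CUSTOMER_*.csv",
                  expected_count=15):
    s3 = boto3.client("s3")
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=bucket,
                                                         Prefix=prefix)
    keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]
    matched = [k for k in keys
               if fnmatch.fnmatch(k.rsplit("/", 1)[-1], pattern)]
    for key in sorted(set(keys) - set(matched)):
        print(f"Will NOT be picked up by pattern matching: {key}")
    if len(matched) != expected_count:
        print(f"Expected {expected_count} matching files, found {len(matched)}")
```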
Use parallel processing
File processing in Pega Platform provides several opportunities to process data in multiple parallel streams, which can substantially reduce processing time. To take advantage of the parallel processing of the data flows, divide large files into multiple smaller files, each with approximately the same number of records.
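As one way to produce evenly sized inputs, the following sketch splits a large CSV into a fixed number of parts with roughly equal record counts; the file layout and part count are illustrative assumptions, and your ETL tooling may offer an equivalent split step.

```python
# Sketch: split one large CSV into `parts` smaller files with roughly
# equal record counts so that parallel data flow threads finish at about
# the same time. Path and part count are illustrative assumptions.
import csv
import itertools

def split_csv(path, parts=15):
    with open(path, newline="") as f:
        total = sum(1 for _ in f) - 1      # record count, excluding header
    per_file = -(-total // parts)          # ceiling division
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for i in range(parts):
            rows = list(itertools.islice(reader, per_file))
            if not rows:
                break
            with open(f"{path}.part{i:02d}.csv", "w", newline="") as out:
                writer = csv.writer(out)
                writer.writerow(header)
                writer.writerows(rows)
```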
The number of concurrent partitions is determined by the Number of threads parameter that you set when creating a data flow run.
One thread is associated with each input file. If the input files are not approximately equal in size, processing takes longer because the entire run must wait until the last file is processed. For more information, see Creating a batch run for data flows.
You can also manage the thread count setting on the Services landing page.
For more information, see Configuring the Data Flow service.
For example, in a scenario with three nodes, you can process 15 files of 1 million records each faster than three files of 5 million records each, or one file of 15 million records.
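A back-of-envelope model makes the arithmetic concrete. Assuming, for illustration, that the three nodes run two data flow threads each (six parallel streams) and that processing time is proportional to record count:

```python
# Toy model: each thread processes one file at a time, and a file's
# processing time is proportional to its record count (in millions).
# The makespan is how long the busiest thread runs. Six threads
# (3 nodes x 2 threads each) is an illustrative assumption.
def makespan(file_sizes_millions, threads=6):
    loads = [0] * threads
    for size in sorted(file_sizes_millions, reverse=True):
        loads[loads.index(min(loads))] += size  # assign to least-busy thread
    return max(loads)

print(makespan([1] * 15))  # 15 files x 1M records -> 3 time units
print(makespan([5] * 3))   # 3 files x 5M records  -> 5 time units
print(makespan([15]))      # 1 file x 15M records  -> 15 time units
```

Even though all three scenarios move the same 15 million records, the evenly split small files finish first because no thread sits idle waiting for one oversized file.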
Archive files after processing
Archive the files in your Pega Cloud File Storage repository at the end of an ingestion run. If errors occur during a run and you have to reprocess the files, the archived copies save you the time of retransferring the files to the repository.
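If your process has scripted access to the repository's S3 bucket, the archive step can be as simple as the following boto3 sketch; the bucket name and prefixes are illustrative assumptions.

```python
# Sketch: move processed files to an "archive/" prefix in the same bucket
# at the end of an ingestion run, so a failed run can be replayed without
# retransferring files. Bucket and prefixes are illustrative assumptions.
import boto3

def archive_processed(bucket="example-ingestion-bucket",
                      inbound_prefix="inbound/",
                      archive_prefix="archive/"):
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=inbound_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            dest = archive_prefix + key[len(inbound_prefix):]
            s3.copy_object(Bucket=bucket, Key=dest,
                           CopySource={"Bucket": bucket, "Key": key})
            s3.delete_object(Bucket=bucket, Key=key)
```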