File types used in data ingestion

Updated on August 3, 2022

This content applies only to Pega Cloud environments

Ingesting customer data into Pega Customer Decision Hub on Pega Cloud involves three types of files: data files, manifest files, and token files.

Before you configure the ingestion process, familiarize yourself with these file types, their functions, and formats. Establish the recommended file naming conventions and folder structures.

Data files

These files contain the actual customer data and are generated by the client's extract, transform, load (ETL) team. The files might contain new and updated customer data, as well as data for purging or deleting existing records. One ingestion run can include one or more data files. As a best practice, divide large files into multiple smaller files to allow parallel transfer and faster processing. Files can be in CSV or JSON format and can be compressed; the .gzip and .zip compression formats are supported. File encryption is supported, but requires custom Java coding.

The following table shows sample content of a CSV file for adding or updating customer data:

customer_id	first_name	last_name	middle_initial	clv_value	customer_type	billing_city	billing_state	billing_zip
811162399	Bill	Smith	O	3	I	TAYLOR	MI	48180
443470562	John	Kennel	F	11	I	PORTLAND	OR	97266

The following table shows sample content of a CSV file for deleting or purging customer data:

customer_id
811162389
443470534
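
The best practice of dividing a large file into smaller files can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `split_csv`, the three-digit sequence suffix, and the `.csv.gz` extension are hypothetical choices, not a Pega requirement, so match them to whatever naming convention you agree on with the client.

```python
import csv
import gzip
from pathlib import Path


def split_csv(source: Path, out_dir: Path, rows_per_file: int, prefix: str) -> list[Path]:
    """Split one CSV file into gzip-compressed parts.

    The header row is repeated at the top of every part. Part names use the
    assumed pattern <prefix>_NNN.csv.gz (a hypothetical convention).
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    out = None
    writer = None
    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)
        for index, row in enumerate(reader):
            if index % rows_per_file == 0:
                # Close the previous part and start a new one.
                if out is not None:
                    out.close()
                part = out_dir / f"{prefix}_{len(parts):03d}.csv.gz"
                out = gzip.open(part, "wt", newline="", encoding="utf-8")
                writer = csv.writer(out)
                writer.writerow(header)
                parts.append(part)
            writer.writerow(row)
    if out is not None:
        out.close()
    return parts
```

For example, `split_csv(Path("customers.csv"), Path("out"), 500000, "CustomerDataIngest_04182020")` would write parts named `CustomerDataIngest_04182020_000.csv.gz`, `CustomerDataIngest_04182020_001.csv.gz`, and so on.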

Manifest files

The manifest file contains metadata about the data files being transferred. There is one manifest record for each ingestion or purge to be transferred, for example, a customer data ingestion, customer data purge, account data ingestion, or account data purge. During processing, the file listeners listen for manifest files. Manifest files are in XML format.

The manifest file is backed by a data model that is part of the Data-Ingestion-Manifest class.

The following example shows a typical manifest file:

<?xml version="1.0" ?>
<manifest>
  <processType>CustomerDataIngest</processType>
  <totalRecordCount>1300</totalRecordCount>
  <files>
    <file>
      <name>CustomerDataIngest_MMDDYYYY_000.csv</name>
      <size>149613</size>
      <recordCount>700</recordCount>
    </file>
    <file>
      <name>CustomerDataIngest_MMDDYYYY_001.csv</name>
      <size>125613</size>
      <recordCount>600</recordCount>
    </file>
  </files>
</manifest>

Field	Description	Example
processType	Type of data being loaded. This field also identifies the data flow to run.	CustomerDataIngest, AccountDataIngest
totalRecordCount	Total number of records across all data files.	1300
recordCount	Number of records in one file.	700, 600
name	Name of the data file. The name must be agreed on with the client and configured in the file data set. The name can have suffix substitution.	CustomerDataIngest_MMDDYYYY_000.csv, CustomerDataIngest_MMDDYYYY_001.csv
size	Size of one file, in bytes. This field is optional and is used only if file size validation is needed. Performing size validation requires additional process work.	149613, 125613
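
As a rough illustration of how an ETL team might produce a manifest with the fields described above, the following Python sketch builds the XML from a list of CSV data files. The helper names (`count_records`, `build_manifest`, `write_manifest`) are hypothetical, and the sketch assumes that recordCount excludes the CSV header row; confirm that assumption against your own ingestion configuration.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def count_records(data_file: Path) -> int:
    """Count the data rows in a CSV file, excluding the header row (assumption)."""
    with data_file.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row
        return sum(1 for _ in reader)


def build_manifest(process_type: str, data_files: list[Path]) -> ET.Element:
    """Build a <manifest> element using the field names from the table above."""
    manifest = ET.Element("manifest")
    ET.SubElement(manifest, "processType").text = process_type
    total = ET.SubElement(manifest, "totalRecordCount")
    files = ET.SubElement(manifest, "files")
    total_count = 0
    for data_file in data_files:
        record_count = count_records(data_file)
        total_count += record_count
        entry = ET.SubElement(files, "file")
        ET.SubElement(entry, "name").text = data_file.name
        ET.SubElement(entry, "size").text = str(data_file.stat().st_size)  # bytes
        ET.SubElement(entry, "recordCount").text = str(record_count)
    # totalRecordCount is the sum of the per-file record counts.
    total.text = str(total_count)
    return manifest


def write_manifest(manifest: ET.Element, path: Path) -> None:
    """Write the manifest with an XML declaration (ET.indent needs Python 3.9+)."""
    ET.indent(manifest)
    ET.ElementTree(manifest).write(path, encoding="utf-8", xml_declaration=True)
```

Deriving totalRecordCount from the per-file counts, rather than entering it by hand, keeps the two values consistent by construction.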

Token files

Token files signal that the data file transfer is complete. Token files are optional if your SFTP client application can be configured to send the manifest file after successfully transmitting all data files. Otherwise, token files are required. Token files are typically generated by the SFTP client application when it has completed the transfer of a file. There is one token file for each data file transferred. A token file does not need to contain any data; only the presence or absence of the file is checked.

The manifest file and the data files are transferred to the Amazon S3 location through SFTP. Manifest files are typically very small (a few kilobytes). The actual data files are typically quite large (multiple gigabytes). As a best practice, large files are divided into smaller files that can be transferred in parallel by the SFTP client application. Because of this size difference, the manifest file transfer typically finishes before the data file transfers.

Pega Platform file listeners are configured to listen for and process the manifest files. If a manifest file arrives ahead of the data files, its processing completes almost immediately and starts the data flow. The data flow then fails because the data file transmission is not yet complete.

For this reason, it is critical that the manifest file is sent last. If your SFTP client application cannot guarantee this order, the SFTP team must do additional work to create a token file and send it after each data file is successfully transferred. The ingestion case is configured to wait until all the token files have arrived before invoking the data flow to start loading the data.

For example, after DataIngestion_04182020_1.csv is successfully transmitted, the SFTP application creates a token file named DataIngestion_04182020_1.csv.tok or DataIngestion_04182020_1.csv.done.
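
The "wait until all token files have arrived" logic can be illustrated with a short Python sketch. It is not the Pega ingestion case implementation; it only shows the check in isolation. The function name `missing_tokens` and the assumption that token files sit in the same folder as the data files are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def missing_tokens(manifest_path: Path, data_dir: Path, token_suffix: str = ".tok") -> list[str]:
    """Return the data file names from the manifest whose token file is absent.

    Only the presence of each token file is checked; its content is ignored.
    Assumes tokens are named <data file name><token_suffix> and are located
    in data_dir (a hypothetical layout).
    """
    root = ET.parse(manifest_path).getroot()
    names = [el.text.strip() for el in root.findall("./files/file/name") if el.text]
    return [name for name in names if not (data_dir / (name + token_suffix)).exists()]
```

An empty result means that every data file listed in the manifest has a matching token, so loading can start. A non-empty result lists the transfers that are still pending.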

File naming convention and folder structure

Both the file listeners and the file data sets use pattern matching based on file names. As a result, the following file types, naming conventions, and folder structures must be agreed on in advance:

- Naming convention of manifest files
- XML as the format of manifest files
- Naming convention of data files
- Format of data files: CSV or JSON
  - For CSV:
    - Header row (field names to be mapped to the Spine tables and xCAR data sets)
    - Delimiter
  - For JSON: Ensure that property names match the names of the fields of the Spine tables and xCAR data sets.
- Folder location for manifest and data files
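
Because matching is driven by file names, it can help to validate names against the agreed convention before sending anything. The Python sketch below assumes a convention of the form `<processType>_MMDDYYYY_NNN.csv` or `.json`, modeled on the sample manifest; the pattern and the function name `parse_data_file_name` are illustrative assumptions, not the patterns that Pega file listeners or file data sets use internally.

```python
import re
from datetime import datetime

# Assumed convention: <processType>_MMDDYYYY_NNN.<csv|json>
DATA_FILE_PATTERN = re.compile(
    r"^(?P<process>[A-Za-z]+)_(?P<date>\d{8})_(?P<seq>\d{3})\.(?P<ext>csv|json)$"
)


def parse_data_file_name(name: str):
    """Return the parts of a data file name, or None if it breaks the convention."""
    match = DATA_FILE_PATTERN.match(name)
    if match is None:
        return None
    try:
        # Reject strings that look like dates but are not valid MMDDYYYY values.
        file_date = datetime.strptime(match["date"], "%m%d%Y").date()
    except ValueError:
        return None
    return {
        "process": match["process"],
        "date": file_date,
        "seq": int(match["seq"]),
        "ext": match["ext"],
    }
```

Running a check like this on the sending side catches misnamed files before they reach the target folder, where a name that does not match the configured pattern would not be picked up.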
