File types used in data ingestion
This content applies only to Pega Cloud environments
Ingesting customer data into Pega Customer Decision Hub on Pega Cloud involves three types of files: data files, manifest files, and token files.
Before you configure the ingestion process, familiarize yourself with these file types, their functions, and formats. Establish the recommended file naming conventions and folder structures.
Data files
These files contain the actual customer data and are generated by the client's
extraction-transformation-load (ETL) team. The files might contain new and updated
customer data as well as data for purging or deleting existing records. One
ingestion run can support one or more data files. A best practice is to divide large
files into multiple smaller files to allow parallel transfer and faster processing.
Files can be in CSV or JSON format, and can be compressed. The .gzip and .zip
compression formats are supported. File encryption is supported, but requires
custom Java coding.
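The practice of dividing a large file into smaller compressed files can be sketched in Python as follows. This is a minimal illustration, not part of the product: the function name split_csv, the _000/_001 part suffix, and the .csv.gz extension are assumptions made for the example, and the real naming must follow the convention agreed with the client.

```python
import csv
import gzip
from pathlib import Path


def split_csv(source: Path, out_dir: Path, rows_per_file: int) -> list[Path]:
    """Split a CSV file into gzip-compressed parts, repeating the header in each part."""
    out_dir.mkdir(parents=True, exist_ok=True)
    parts: list[Path] = []
    handle = None
    writer = None
    with source.open(newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader)  # assumes the source file has a header row
        try:
            for index, row in enumerate(reader):
                if index % rows_per_file == 0:
                    # Start a new part; close the previous one first.
                    if handle is not None:
                        handle.close()
                    # Hypothetical naming: <stem>_000.csv.gz, <stem>_001.csv.gz, ...
                    part = out_dir / f"{source.stem}_{len(parts):03d}.csv.gz"
                    handle = gzip.open(part, "wt", newline="", encoding="utf-8")
                    writer = csv.writer(handle)
                    writer.writerow(header)
                    parts.append(part)
                writer.writerow(row)
        finally:
            if handle is not None:
                handle.close()
    return parts
```

Each part carries its own header row, so every part can be processed independently of the others.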
The following table shows sample content of a CSV file for adding or updating customer data:
customer_id | first_name | last_name | middle_initial | clv_value | customer_type | billing_city | billing_state | billing_zip |
811162399 | Bill | Smith | O | 3 | I | TAYLOR | MI | 48180 |
443470562 | John | Kennel | F | 11 | I | PORTLAND | OR | 97266 |
The following table shows sample content of a CSV file for deleting or purging customer data:
customer_id |
811162389 |
443470534 |
Manifest files
The manifest file contains metadata about the data files being transferred. There is one manifest record for each ingestion or purge to be transferred, for example, customer data ingestion, customer data purge, account data ingestion, or account data purge. During processing, the file listeners listen for manifest files. Manifest files are in XML format.
The manifest file is backed by a data model that is part of the Data-Ingestion-Manifest class.
The following example is a manifest file that is typically used:
<?xml version="1.0" ?>
<manifest>
  <processType>CustomerDataIngest</processType>
  <totalRecordCount>1300</totalRecordCount>
  <files>
    <file>
      <name>CustomerDataIngest_MMDDYYYY_000.csv</name>
      <size>149613</size>
      <recordCount>700</recordCount>
    </file>
    <file>
      <name>CustomerDataIngest_MMDDYYYY_001.csv</name>
      <size>125613</size>
      <recordCount>600</recordCount>
    </file>
  </files>
</manifest>
Field | Description | Example |
processType | Type of data being loaded. This field also identifies the data flow to run. | CustomerDataIngest, AccountDataIngest |
totalRecordCount | Total number of records across all data files. | 1300 |
recordCount | Record count for one file. | 700, 600 |
name | Name of the data file. The name must be agreed with the client and configured in the file data set. The name can have suffix substitution. | CustomerDataIngest_MMDDYYYY_000.csv, CustomerDataIngest_MMDDYYYY_001.csv |
size | Size of one file in bytes. This field is optional and is used only if file size validation is needed. Size validation requires additional process work. | 149613, 125613 |
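A manifest with these fields could be generated by the ETL process along the following lines. This is a hedged sketch: the helper names count_records and build_manifest are hypothetical, and it assumes uncompressed CSV data files whose header row is not counted as a record. Confirm the counting rule with the team that configures the ingestion.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path


def count_records(data_file: Path) -> int:
    """Count the data rows in an uncompressed CSV file, excluding the header row."""
    with data_file.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row (assumption: it is not a record)
        return sum(1 for _ in reader)


def build_manifest(process_type: str, data_files: list[Path]) -> str:
    """Return manifest XML containing the fields described in the table above."""
    manifest = ET.Element("manifest")
    ET.SubElement(manifest, "processType").text = process_type
    total = ET.SubElement(manifest, "totalRecordCount")
    files = ET.SubElement(manifest, "files")
    total_count = 0
    for data_file in data_files:
        records = count_records(data_file)
        total_count += records
        entry = ET.SubElement(files, "file")
        ET.SubElement(entry, "name").text = data_file.name
        ET.SubElement(entry, "size").text = str(data_file.stat().st_size)  # bytes
        ET.SubElement(entry, "recordCount").text = str(records)
    total.text = str(total_count)
    return '<?xml version="1.0" ?>\n' + ET.tostring(manifest, encoding="unicode")
```

Deriving totalRecordCount from the per-file counts keeps the two values consistent by construction.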
Token files
Token files signal that the data file transfer is complete. Token files are optional if your SFTP client application can be configured to send the manifest file after successfully transmitting all data files. Otherwise, token files are required. Token files are typically generated by the SFTP client application when the transfer of a file is complete. There is one token file for each data file transferred. A token file does not need to contain any data, because only the presence or absence of the file is checked.
The manifest file and the data files are transferred to the Amazon S3 location through SFTP. Manifest files are typically very small (a few kilobytes). The data files are typically quite large (multiple gigabytes). As a best practice, large files are divided into smaller files that the SFTP client application can transfer in parallel. Because the manifest file is so much smaller, its transfer typically completes before the transfer of the data files.
Pega Platform file listeners are configured to listen for and process the manifest files. If a manifest file arrives ahead of the data files, its processing completes almost immediately, which in turn starts the data flow. The data flow then fails because the data file transmission has not yet completed.
Hence, it is critical that the manifest file is sent last. If this cannot be guaranteed by your SFTP client application, then additional work by the SFTP team is required to create a token file and send it after every data file is successfully transferred. The ingestion case is configured to wait until all the token files have arrived before invoking the data flow to start loading the data.
For example, after DataIngestion_04182020_1.csv is successfully transmitted, the SFTP application creates a token file named DataIngestion_04182020_1.csv.tok or DataIngestion_04182020_1.csv.done.
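The token mechanism amounts to a presence check on empty marker files. The sketch below illustrates it against a local folder. The function names write_token and missing_tokens are hypothetical, and in a real deployment the tokens are created by the SFTP client application and checked by the ingestion case, not by a script like this one.

```python
from pathlib import Path


def write_token(data_file: Path, suffix: str = ".tok") -> Path:
    """Create an empty token file next to a data file (x.csv -> x.csv.tok)."""
    token = data_file.with_name(data_file.name + suffix)
    token.touch()  # the content is irrelevant; only the file's presence matters
    return token


def missing_tokens(folder: Path, data_file_names: list[str], suffix: str = ".tok") -> list[str]:
    """Return the names of data files whose token file has not arrived yet."""
    return [
        name
        for name in data_file_names
        if not (folder / (name + suffix)).exists()
    ]
```

Loading would start only when missing_tokens returns an empty list for the files named in the manifest.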
File naming convention and folder structure
Both the file listeners and file data sets use pattern matching based on file names. As a result, the following file types, naming conventions, and folder structures must be agreed in advance:
- Naming convention of manifest files
- XML as the format of manifest files
- Naming convention of data files
- Format of data files: CSV or JSON
  - For CSV:
    - Header row (field names to be mapped to the Spine tables and xCAR data sets)
    - Delimiter
  - For JSON: Ensure that property names match the names of the fields of the Spine tables and xCAR data sets.
- Folder location for manifest and data files
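Because matching is driven by file names, it can help to validate names against the agreed convention before sending files. The following sketch assumes a hypothetical convention of process type, eight-digit date, and three-digit sequence number, modeled on the CustomerDataIngest_MMDDYYYY_000.csv example. The pattern and the function name parse_data_file_name are illustrative only and do not reflect how the file listeners or file data sets are configured.

```python
import re
from typing import Optional

# Hypothetical convention: <processType>_<MMDDYYYY>_<NNN>.<csv|json>
DATA_FILE_PATTERN = re.compile(
    r"^(?P<process_type>[A-Za-z]+)_(?P<date>\d{8})_(?P<sequence>\d{3})"
    r"\.(?P<extension>csv|json)$"
)


def parse_data_file_name(name: str) -> Optional[dict[str, str]]:
    """Return the parts of a conforming file name, or None if it does not conform."""
    match = DATA_FILE_PATTERN.match(name)
    return match.groupdict() if match else None
```

A name that returns None would not follow the assumed convention and can be rejected before transfer.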