Creating a File data set record for files on repositories

Updated on July 5, 2022

To import and export CSV or JSON files to and from Pega Platform, create a File data set that references a repository, and then use that data set in the source or destination shape in a data flow.

You can perform the following operations for File data sets referencing a remote repository:

Browse
Retrieves records in an undefined order.
Save
Saves records to multiple files, along with a meta file that contains the name, size, and the number of records for every file. The Save operation is not available for manifest files.
Truncate
Removes all configured files and their meta files, except for the manifest file.
GetNumberOfRecords
Estimates the number of records based on the average size of the first few records and the total size of the data set files.
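The GetNumberOfRecords estimate can be illustrated with simple arithmetic. The following Python sketch is an assumption about how such an estimate could be computed, not the Pega Platform implementation; the function name `estimate_record_count` and the sample data are hypothetical.

```python
def estimate_record_count(sample_records, total_size_bytes):
    """Estimate a record count from a sample of leading records.

    sample_records: the first few records as bytes, including line endings.
    total_size_bytes: combined size of all data set files, in bytes.
    """
    if not sample_records:
        return 0
    # Average size of the sampled records, in bytes.
    average_size = sum(len(record) for record in sample_records) / len(sample_records)
    # Total size divided by average record size gives the estimated count.
    return round(total_size_bytes / average_size)


# Hypothetical sample: three CSV records read from the start of a file.
sample = [b"1,Alice,2022-01-03\n", b"2,Bob,2022-01-04\n", b"3,Carol,2022-01-05\n"]
estimate = estimate_record_count(sample, total_size_bytes=18_000)
print(estimate)
```

Because the result depends on how representative the first records are, treat it as an approximation rather than an exact count.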
Before you begin: Create a File data set rule instance. See Creating a File data set rule.
  1. On the New tab, in the File location section, click Files on repository.
  2. In the Configuration section, select the source repository:
    • To select one of the predefined repositories, click the Repository configuration field, press the Down Arrow key, and choose a repository.
    • To create a repository, click the Target icon to the right of the Repository configuration field, and then follow the steps in Creating a repository.
  3. In the Path section, select how you want to define the files to import or export:
    Choices and actions:
    Define a single file or a range of files
    1. Select Use a file path to import or export data.
    2. In the File path field, enter the file location.

      When importing data, you can match multiple files in a folder by using an asterisk (*) as a wild card character.

      For example: /folder/part-r-*
      When exporting data, you can define a file name that consists of a prefix and an optional date and time pattern by adding a Java SimpleDateFormat string to the file path. The pattern cannot contain the following characters: " ? * < > | :
      For example: Folder/Prefix-%{yyyy-MM-dd-HH-mm-ss}.csv
      When a file is created, a unique ID is appended to the file name to ensure file uniqueness.
      For example: Export/Customer-2022-01-03-10-30-45-123456789.csv
      For more information, see File Data Set file path pattern.
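As a rough illustration of the file path options above, the following Python sketch shows how an asterisk wildcard selects files and how an export name can be assembled from a prefix, a date stamp, and a unique ID. It is only an analogy: `match_files`, `export_file_name`, and the sample paths are hypothetical, the strftime pattern stands in for a SimpleDateFormat pattern, and the matching rules of Pega Platform may differ from those of Python's `fnmatch` module.

```python
from datetime import datetime
from fnmatch import fnmatchcase


def match_files(pattern, paths):
    """Return the paths that match a pattern in which * is a wildcard.

    Note: fnmatch also treats ? and [] as special characters, and * can
    match across folder separators; the product may behave differently.
    """
    return [path for path in paths if fnmatchcase(path, pattern)]


def export_file_name(prefix, timestamp, unique_id, extension=".csv"):
    """Build a name from a prefix, a date stamp, and a unique ID.

    The strftime pattern %Y-%m-%d stands in for the SimpleDateFormat
    pattern yyyy-MM-dd.
    """
    return f"{prefix}-{timestamp.strftime('%Y-%m-%d')}-{unique_id}{extension}"


# Hypothetical repository listing.
repository_files = [
    "/folder/part-r-00000",
    "/folder/part-r-00001",
    "/folder/_SUCCESS",
]
matched = match_files("/folder/part-r-*", repository_files)
name = export_file_name("Export/Customer", datetime(2022, 1, 3), 123456789)
print(matched)
print(name)
```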
    Define multiple files that you list in a manifest file
    Note: You can use a manifest file to define files only for read operations.
    1. Select Use a manifest file to import data.

      For manifest files, use the following XML format:

      <manifest>
          <files>
              <file>
                  <name>file0001.csv</name>
              </file>
              <file>
                  <name>file0002.csv</name>
              </file>
          </files>
      </manifest>
      
    2. In the Manifest file path field, enter the location of the manifest file.
      Note: Each file name in the manifest file must be a path relative to the repository that is configured in the data set. For example, if a file is in the root folder of the repository, enter only the file name. Otherwise, include the folder structure as part of the name.
    Repository and file path configuration example
    A file data set references the default store repository. The file path matches a specific file name pattern.
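To check a manifest file before you reference it in the data set, you can list the file names that it declares. The following Python sketch parses the manifest structure shown in this step with the standard library; the function name `list_manifest_files` and the `folder/` path are hypothetical, and the sketch does not reproduce how Pega Platform itself reads the manifest.

```python
import xml.etree.ElementTree as ET

# Sample manifest in the format described in this topic.
# The second entry includes a folder, relative to the repository root.
MANIFEST = """\
<manifest>
    <files>
        <file>
            <name>file0001.csv</name>
        </file>
        <file>
            <name>folder/file0002.csv</name>
        </file>
    </files>
</manifest>
"""


def list_manifest_files(manifest_xml):
    """Return the file names declared in a manifest, in document order."""
    root = ET.fromstring(manifest_xml)
    return [
        name.text.strip()
        for name in root.findall("./files/file/name")
        if name.text
    ]


files = list_manifest_files(MANIFEST)
print(files)
```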
  4. Optional: Click Preview file.
    Result: For a file path configuration, the preview contains the file name and, after applying decompression and decryption, the first 100 lines of the file. For a manifest file configuration, the preview shows the manifest file and the contents of the first file that is listed in the manifest.
  5. In the Data protection section, select the Enable data protection checkbox, and then provide the Pretty Good Privacy (PGP) keys to encrypt and decrypt files:
    1. In the Public key reference field, enter the public key required to encrypt files, for example when exporting data.
    2. In the Private key reference field, enter the private key required to decrypt files, for example when importing data.
    3. Optional: In the Passphrase reference field, enter the passphrase to decrypt files.
    While providing the PGP keys, use the global resource settings syntax:
    • =D_PageName.PropertyName
    • =Declare_PageName.PropertyName
    Note: The content of your private and public keys must be encoded in base64 format. The supported version of PGP is gpg (GnuPG/MacGPG2) 2.2.x.
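The base64 requirement in the note above can be met with any standard encoder. The following Python sketch encodes key text and confirms that it decodes back unchanged. It assumes that the key is available as text; the source does not state which export form of the key Pega Platform expects, and `encode_key` and the placeholder key are hypothetical.

```python
import base64


def encode_key(key_text):
    """Return the base64 encoding of key material as an ASCII string."""
    return base64.b64encode(key_text.encode("utf-8")).decode("ascii")


# Placeholder text only; this is not a real PGP key.
placeholder_key = (
    "-----BEGIN PGP PUBLIC KEY BLOCK-----\n"
    "(placeholder, not a real key)\n"
    "-----END PGP PUBLIC KEY BLOCK-----\n"
)
encoded = encode_key(placeholder_key)

# Round trip: decoding returns the original key text.
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded == placeholder_key)
```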
  6. Optional: If the file is compressed, in the File configuration section, select Enable file compression, and then select the Compression type.
    The supported compression types are .zip and .gz (gzip).
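To see what a compressed file looks like after decompression, similar to the 100-line preview described earlier, you can inspect it outside Pega Platform. The following Python sketch reads the first lines of gzip-compressed data with the standard library; `preview_lines` and the generated sample are hypothetical, and the sketch does not cover .zip archives or PGP decryption.

```python
import gzip
import io
from itertools import islice


def preview_lines(compressed_bytes, limit=100):
    """Decompress gzip data and return at most `limit` leading lines."""
    with gzip.open(io.BytesIO(compressed_bytes), mode="rt", encoding="utf-8") as stream:
        # islice stops after `limit` lines, so the rest is never decompressed.
        return [line.rstrip("\n") for line in islice(stream, limit)]


# Hypothetical sample: 250 CSV lines, compressed in memory.
raw = "".join(f"{i},customer-{i}\n" for i in range(250)).encode("utf-8")
compressed = gzip.compress(raw)
preview = preview_lines(compressed)
print(len(preview), preview[0])
```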
  7. Optional: To provide additional file processing for read and write operations, such as encoding and decoding, define and implement a dedicated interface:
    1. Select Enable custom stream processing.
    2. In the Java class with reader implementation field, enter the fully qualified name of the Java class with the logic that you want to apply before parsing.
      For example: com.pega.bigdata.dataset.file.repository.streamprocessing.sampleclasses.InputStreamShiftingProcessing
    3. In the Java class with writer implementation field, enter the fully qualified name of the Java class with the logic that you want to apply after serializing the file, before writing it to the system.
      For example: com.pega.bigdata.dataset.file.repository.streamprocessing.sampleclasses.OutputStreamShiftingProcessing
    For more information on the custom stream processing interface, see Requirements for custom stream processing in File data sets.
  8. From the File type drop-down list, select the type of file that you want to import or export with this data set: CSV or JSON.
  9. Optional: For CSV files, to update the settings automatically, click Configure automatically, and then go to step 12.
  10. For CSV files, update additional file settings:
    1. Specify if the file contains a header row by selecting the File contains header checkbox.
    2. In the Delimiter character list, select a character separating the fields in the selected file.
    3. In the Supported quotation marks list, select the quotation mark type used for string values in the selected file.
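The three CSV settings in this step (header row, delimiter, and quotation mark) correspond to options that most CSV parsers expose. The following Python sketch shows their effect with the standard `csv` module; it is an analogy rather than the Pega Platform parser, and `read_csv`, the generated `columnN` names, and the sample data are hypothetical.

```python
import csv
import io


def read_csv(text, has_header=True, delimiter=",", quotechar='"'):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar))
    if not rows:
        return []
    if has_header:
        header, data = rows[0], rows[1:]
    else:
        # Without a header row, fall back to generated column names.
        header = [f"column{i + 1}" for i in range(len(rows[0]))]
        data = rows
    return [dict(zip(header, row)) for row in data]


# Semicolon-delimited file that uses single quotation marks for strings.
sample = "CustomerID;Name;City\n1;'Smith; John';Boston\n2;'Lee, Ann';Austin\n"
records = read_csv(sample, has_header=True, delimiter=";", quotechar="'")
print(records)
```

The quoted value in the first record keeps its embedded semicolon, which is why the quotation mark setting must match the file.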
  11. For CSV and JSON files, update date and time settings:
    1. In the Date Time format field, enter the pattern representing date and time stamps in the selected file.
      The default pattern is: yyyy-MM-dd HH:mm:ss
    2. In the Date format field, enter the pattern representing date stamps in the selected file.
      The default pattern is: yyyy-MM-dd
    3. In the Time Of Day format field, enter the pattern representing time stamps in the selected file.
      The default pattern is: HH:mm:ss
      Note: Time properties in the selected file can be in a different time zone than the one used by Pega Platform. To avoid confusion, specify the time zone in the time properties of the file, and use the appropriate pattern in the settings.
    Result: You have configured general file settings. For CSV files, you need to complete the property mapping. For JSON files, the Mapping tab is empty, because the system automatically maps the fields, and no manual mapping is available.
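The effect of the date and time patterns, and of the time zone note above, can be illustrated outside Pega Platform. The following Python sketch uses strptime patterns as stand-ins for the Java-style patterns in this step. The mapping to strptime codes, the pattern with a zone offset, and the sample values are assumptions for illustration only.

```python
from datetime import datetime, timezone

DATE_FORMAT = "%Y-%m-%d"         # stand-in for yyyy-MM-dd
TIME_OF_DAY_FORMAT = "%H:%M:%S"  # stand-in for HH:mm:ss
# Hypothetical custom pattern that carries an explicit zone offset.
ZONED_DATE_TIME_FORMAT = "%Y-%m-%d %H:%M:%S%z"

date_value = datetime.strptime("2022-01-03", DATE_FORMAT).date()
time_value = datetime.strptime("10:30:45", TIME_OF_DAY_FORMAT).time()

# With an offset in the data, the value is unambiguous and converts cleanly.
zoned = datetime.strptime("2022-01-03 10:30:45+0100", ZONED_DATE_TIME_FORMAT)
as_utc = zoned.astimezone(timezone.utc)

print(date_value, time_value, as_utc.isoformat())
```

Without the offset, the same text could refer to different instants depending on the time zone that the reader assumes.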
  12. For CSV files, on the Mapping tab, modify the number of mapped columns:
    • To add a CSV file column, click Add mapping.
    • To remove a CSV file column and the associated property mapping, click Delete mapping for the applicable row.
  13. For CSV files, on the Mapping tab, check and complete the mapping between the columns in the CSV file and the corresponding properties in Pega Platform:
    • To map an existing property to a CSV file column, in the Property column, press the Down Arrow and choose the applicable item from the list.
    • For CSV files with a header row, to automatically create properties that are not in Pega Platform and map them to CSV file columns, click Create missing properties. Confirm the additional mapping by clicking Create.
    • To manually create properties that are not in Pega Platform and map them to CSV file columns, in the Property column, enter a property name that matches the Column entry, click the Target icon, and configure the new property. For more information, see Creating a property.
    Note: For CSV files with a header row, the Column entry in a new mapping instance must match the column name in the file.
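Conceptually, the mapping on this tab renames CSV columns to properties, and for files with a header row each Column entry must match a column name in the file. The following Python sketch models that idea with a dictionary; `map_record`, `missing_columns`, and the property names `FullName` and `HomeCity` are hypothetical and do not represent Pega Platform internals.

```python
def map_record(row, mapping):
    """Rename CSV columns to property names, ignoring unmapped columns."""
    return {prop: row[column] for column, prop in mapping.items() if column in row}


def missing_columns(header, mapping):
    """Return mapped column names that do not appear in the file header."""
    return [column for column in mapping if column not in header]


# Column entry -> property name (hypothetical names).
mapping = {"CustomerID": "CustomerID", "Name": "FullName", "City": "HomeCity"}

row = {"CustomerID": "1", "Name": "John Smith", "City": "Boston", "Extra": "ignored"}
record = map_record(row, mapping)

# A header without the City column leaves one mapping unmatched.
unmatched = missing_columns(["CustomerID", "Name"], mapping)
print(record, unmatched)
```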
  14. Confirm the new File data set configuration by clicking Save.
    Result: If CSV or JSON files are not valid, error messages display the reason for the error and a line number that identifies where the error is in the file.
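To find invalid lines before you save the data set, you can run a similar check on the file yourself. The following Python sketch reports the line number and reason for CSV rows that have an unexpected number of fields. It is a simplified assumption about what such a validation might look for, not the check that Pega Platform performs; `find_csv_errors` and the sample data are hypothetical.

```python
import csv
import io


def find_csv_errors(text, expected_columns, delimiter=","):
    """Return (line number, reason) pairs for rows with a wrong field count."""
    errors = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for row in reader:
        if len(row) != expected_columns:
            # reader.line_num is the number of source lines read so far.
            errors.append(
                (reader.line_num, f"expected {expected_columns} fields, found {len(row)}")
            )
    return errors


# The third line is missing the city value.
sample = "id,name,city\n1,Ann,Austin\n2,Bob\n3,Cy,Reno\n"
errors = find_csv_errors(sample, expected_columns=3)
print(errors)
```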
  • File Data Set file path pattern

    To process data in parallel by using multiple threads, the File Data Set operates on a collection of files instead of a single file. The file path pattern includes tokens that match existing files to the pattern or generate new files.
