Creating a File data set record for files on repositories

Updated on May 17, 2024

To import and export CSV or JSON files to and from Pega Platform, create a File data set that references a repository, and then use that data set in the source or destination shape in a data flow.

You can perform the following operations for File data sets referencing a remote repository:

Browse: Retrieves records in an undefined order.
Save: Saves records to multiple files, along with a meta file that contains the name, size, and number of records for every file. The Save operation is not available for manifest files.
Truncate: Removes all configured files and their meta files, except for the manifest file.
GetNumberOfRecords: Estimates the number of records based on the average size of the first few records and the total size of the data set files.
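
The record count estimate described above can be illustrated with a short calculation. The following Python sketch is not Pega code. The function name and the sample sizes are hypothetical, and the exact number of records that Pega Platform samples is not stated here.

```python
def estimate_record_count(sample_record_sizes, total_size_bytes):
    """Estimate the record count from a few sampled records.

    sample_record_sizes: sizes in bytes of the first few records (hypothetical input).
    total_size_bytes: combined size in bytes of all data set files.
    """
    if not sample_record_sizes:
        return 0
    average_size = sum(sample_record_sizes) / len(sample_record_sizes)
    if average_size == 0:
        return 0
    # Total size divided by the average record size gives the estimate.
    return round(total_size_bytes / average_size)


# Three sampled records average 100 bytes, so 10,000 bytes suggest about 100 records.
estimated = estimate_record_count([100, 120, 80], 10_000)
```

Because the result depends on a small sample, it is an approximation: files whose later records are much larger or smaller than the first few produce a less accurate count.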
Before you begin: Create a File data set rule instance. See Creating a File data set rule.
  1. On the New tab, in the Data source section, click Files on repository.
  2. In the Connection section, select the source repository:
    • To select one of the predefined repositories, click the Repository configuration field, press the Down Arrow key, and choose a repository.
    • To create a repository, click the Target icon to the right of the Repository configuration field, and then follow the steps in Creating a repository.
  3. In the File configuration section, select how you want to define the files to import or export:
    Define a single file or a range of files
    1. Select Use a file path.
    2. In the File path field, enter the file location.

      When importing data, you can match multiple files in a folder by using an asterisk (*) as a wild card character.

      For example: /folder/part-r-*
      When exporting data, you can define a file name that consists of a prefix and an optional date and time pattern by adding a Java SimpleDateFormat string to the file path. The SimpleDateFormat string cannot contain the following characters: " ? * < > | :
      For example: Folder/Prefix-%{yyyy-MM-dd-HH-mm-ss}.csv
      When a file is created, a unique ID is appended to the file name to ensure file uniqueness.
      For example: Export/Customer-2022-01-03-10-30-45-123456789.csv
    Define multiple files that you list in a manifest file
    Note: You can use a manifest file to define files only for read operations.
    1. Select Use a manifest file.

      For manifest files, use the following .xml format:

      For example:
    2. In the Manifest file path field, enter the location of the manifest file.
      Note: The file paths that you list in the manifest file must be relative to the repository that is configured in the data set. For example, if a file is in the root folder of the repository, specify only the file name. Otherwise, include the folder structure as part of the name.
    Repository and file path configuration example
    A sample file data set is configured to reference the default store repository and match a specific file name pattern.
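
The asterisk wildcard used for imports can be sketched as follows. This Python example only illustrates the matching idea and is not how Pega Platform implements it. The function names and the file list are hypothetical, and the sketch assumes that the asterisk matches any characters within a single folder level.

```python
import re


def wildcard_to_regex(file_path_pattern):
    """Turn a file path with asterisks into a compiled regular expression."""
    # Escape the literal parts and let each asterisk match any run of
    # characters except the folder separator (an assumption of this sketch).
    literal_parts = [re.escape(part) for part in file_path_pattern.split("*")]
    return re.compile("[^/]*".join(literal_parts))


def match_files(file_path_pattern, repository_paths):
    """Return the repository paths that match the whole pattern."""
    regex = wildcard_to_regex(file_path_pattern)
    return [path for path in repository_paths if regex.fullmatch(path)]


# Hypothetical repository listing.
paths = [
    "/folder/part-r-00000",
    "/folder/part-r-00001",
    "/folder/other.csv",
    "/folder/part-r-x/nested",
]
matched = match_files("/folder/part-r-*", paths)
```

Here `matched` contains only the two `part-r-` files directly under `/folder`.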
  4. Optional: Click Preview file.
    Result: For a file path configuration, the preview contains the file name and file contents. For a manifest file configuration, the preview shows the manifest file and the contents of the first file that is listed in the manifest.
  5. Optional: If the file is compressed, in the File configuration section, select File is compressed, and then select the Compression type.
    The supported compression types are .zip and .gz (gzip).
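
To show what the two compression types mean for the file contents, the following Python sketch decompresses gzip and zip data in memory. It is an illustration only. The function name and the `compression_type` values are hypothetical, and reading just the first entry of a zip archive is an assumption of this sketch, not documented Pega behavior.

```python
import gzip
import io
import zipfile


def read_compressed(data, compression_type):
    """Return the decompressed bytes of a gzip stream or of the first zip entry."""
    if compression_type == "gzip":
        return gzip.decompress(data)
    if compression_type == "zip":
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            first_entry = archive.namelist()[0]
            return archive.read(first_entry)
    raise ValueError(f"Unsupported compression type: {compression_type}")


csv_bytes = b"CustomerID,Name\n1,Ann\n"

# Build a .gz payload and a .zip payload in memory for the demonstration.
gz_payload = gzip.compress(csv_bytes)
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as archive:
    archive.writestr("customers.csv", csv_bytes)
zip_payload = zip_buffer.getvalue()

from_gz = read_compressed(gz_payload, "gzip")
from_zip = read_compressed(zip_payload, "zip")
```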
  6. Optional: To provide additional file processing for read and write operations, such as encoding and decoding, define and implement a dedicated interface:
    1. Select Custom stream processing.
    2. In the Java class with reader implementation field, enter the fully qualified name of the Java class with the logic that you want to apply before parsing.
      For example: com.pega.bigdata.dataset.file.repository.streamprocessing.sampleclasses.InputStreamShiftingProcessing
    3. In the Java class with writer implementation field, enter the fully qualified name of the Java class with the logic that you want to apply after serializing the file, before writing it to the system.
      For example: com.pega.bigdata.dataset.file.repository.streamprocessing.sampleclasses.OutputStreamShiftingProcessing
    For more information on the custom stream processing interface, see Requirements for custom stream processing in File data sets.
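
The sample class names above suggest a byte-shifting transformation. The following Python sketch only illustrates the idea of symmetric processing: the writer transforms bytes after serialization, and the reader reverses the transformation before parsing. It does not implement the Pega interface, which is a Java interface described in the referenced topic. The class names and the shift value here are hypothetical.

```python
import io


class ShiftingWriter:
    """Wraps a binary stream and shifts every byte before writing it."""

    def __init__(self, target, shift):
        self._target = target
        self._shift = shift

    def write(self, data):
        shifted = bytes((byte + self._shift) % 256 for byte in data)
        return self._target.write(shifted)


class ShiftingReader:
    """Wraps a binary stream and reverses the shift after reading."""

    def __init__(self, source, shift):
        self._source = source
        self._shift = shift

    def read(self, size=-1):
        raw = self._source.read(size)
        return bytes((byte - self._shift) % 256 for byte in raw)


original = b"CustomerID,Name\n1,Ann\n"
storage = io.BytesIO()

# Write path: serialize first, then apply the custom processing.
ShiftingWriter(storage, shift=3).write(original)
stored = storage.getvalue()

# Read path: undo the custom processing, then parse.
storage.seek(0)
restored = ShiftingReader(storage, shift=3).read()
```

The reader and writer must apply inverse operations with the same parameters. Otherwise, files written by the data set cannot be parsed when read back.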
  7. In the Parser configuration section, from the File type drop-down list, select the type of file that you want to import or export with this data set: CSV or JSON.
  8. Optional: For CSV files, to update the settings automatically, click Configure automatically, and then go to step 11.
  9. For CSV files, update additional file settings:
    1. Specify if the file contains a header row by selecting the File contains header checkbox.
    2. In the Delimiter character list, select a character separating the fields in the selected file.
    3. In the Supported quotation marks list, select the quotation mark type used for string values in the selected file.
  10. For CSV and JSON files, update date and time settings:
    1. In the Date time format field, enter the pattern representing date and time stamps in the selected file.
      The default pattern is: yyyy-MM-dd HH:mm:ss
    2. In the Date format field, enter the pattern representing date stamps in the selected file.
      The default pattern is: yyyy-MM-dd
    3. In the Time of day format field, enter the pattern representing time stamps in the selected file.
      The default pattern is: HH:mm:ss
      Note: Time properties in the selected file can be in a different time zone than the one used by Pega Platform. To avoid confusion, specify the time zone in the time properties of the file, and use the appropriate pattern in the settings.
    Result: You have configured general file settings. For CSV files, you need to complete the property mapping. For JSON files, the Mapping tab is empty, because the system automatically maps the fields, and no manual mapping is available.
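
To check how a pattern reads a value from a file, it can help to try it outside the platform. The following Python sketch translates a small subset of Java pattern letters into Python `strptime` directives and parses sample values. The helper name is hypothetical, only the listed letters are handled, and the pattern with the `Z` time zone letter is an illustrative custom pattern, not a Pega default.

```python
import re
from datetime import datetime

# Rough equivalents between Java pattern letters and Python directives.
_TOKENS = {
    "yyyy": "%Y",
    "MM": "%m",
    "dd": "%d",
    "HH": "%H",
    "mm": "%M",
    "ss": "%S",
    "Z": "%z",  # RFC 822 offset such as +0200
}


def to_strptime(java_pattern):
    """Translate the supported pattern letters; other characters stay as-is."""
    return re.sub("yyyy|MM|dd|HH|mm|ss|Z", lambda m: _TOKENS[m.group(0)], java_pattern)


date_value = datetime.strptime("2022-01-03", to_strptime("yyyy-MM-dd")).date()
time_value = datetime.strptime("10:30:45", to_strptime("HH:mm:ss")).time()

# A value that carries its own time zone offset, as the note recommends.
zoned = datetime.strptime(
    "2022-01-03 10:30:45 +0200", to_strptime("yyyy-MM-dd HH:mm:ss Z")
)
```

A value parsed with an explicit offset keeps that offset in `zoned.utcoffset()`, which removes ambiguity when the file and Pega Platform use different time zones.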
  11. For CSV files, on the Mapping tab, modify the number of mapped columns:
    • To add a CSV file column, click Add mapping.
    • To remove a CSV file column and the associated property mapping, click Delete mapping for the applicable row.
  12. For CSV files, on the Mapping tab, check and complete the mapping between the columns in the CSV file and the corresponding properties in Pega Platform:
    • To map an existing property to a CSV file column, in the Property column, press the Down Arrow and choose the applicable item from the list.
    • For CSV files with a header row, to automatically create properties that are not in Pega Platform and map them to CSV file columns, click Create missing properties. Confirm the additional mapping by clicking Create.
    • To manually create properties that are not in Pega Platform and map them to CSV file columns, in the Property column, enter a property name that matches the Column entry, click the Target icon, and configure the new property. For more information, see Creating a property.
    Note: For CSV files with a header row, the Column entry in a new mapping instance must match the column name in the file.
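
The CSV settings and the column-to-property mapping can be pictured with a small parser. The following Python sketch is an illustration only and does not reflect how Pega Platform parses files. The function name, the sample data, and the property names in the mapping are hypothetical.

```python
import csv
import io


def read_mapped_records(text, mapping, delimiter=",", quotechar='"'):
    """Parse CSV text with a header row and rename columns to property names.

    mapping: dictionary of CSV column name to property name. Each column key
    must match a column name in the header row of the file.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    records = []
    for row in reader:
        records.append({prop: row[column] for column, prop in mapping.items()})
    return records


# Semicolon-delimited file with a header row and a quoted value
# that contains the delimiter character.
sample = 'CustomerID;Name\n1;"Smith; Ann"\n2;Lee\n'
column_mapping = {"CustomerID": ".CustomerID", "Name": ".FullName"}

records = read_mapped_records(sample, column_mapping, delimiter=";")
```

The quoted value `"Smith; Ann"` stays in one field because the quotation mark setting tells the parser to ignore the delimiter inside quotes.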
  13. Confirm the new File data set configuration by clicking Save.
    Result: If CSV or JSON files are not valid, error messages display the reason for the error and a line number that identifies where the error is in the file.