Creating a File data set record for files on repositories

To enable a parallel load from multiple CSV or JSON files located in remote repositories or on the local file system, create a File data set that references a repository. This feature enables remote files to function as data sources for Pega Platform data sets.

You can perform the following operations for File data sets referencing a remote repository:
Browse
  Retrieves records in an undefined order.
Save
  Saves records to multiple files, along with a meta file that contains the name, size, and the number of records for every file. The Save operation is not available for manifest files.
Truncate
  Removes all configured files and their meta files, except for the manifest file.
GetNumberOfRecords
  Estimates the number of records based on the average size of the first few records and the total size of the data set files.
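
The GetNumberOfRecords estimate can be pictured as the total size of the files divided by the average size of the sampled records. The following Java sketch shows only that arithmetic. The class and method names are hypothetical, and the way Pega Platform actually samples records and measures file sizes is an assumption that this topic does not describe.

```java
final class RecordCountEstimateSketch {

    /**
     * Estimates a record count as totalBytes / average sampled record size.
     * Hypothetical helper; not a Pega Platform API.
     */
    static long estimate(long[] sampleRecordSizes, long totalBytes) {
        if (sampleRecordSizes.length == 0) {
            return 0; // nothing sampled, so no basis for an estimate
        }
        long sum = 0;
        for (long size : sampleRecordSizes) {
            sum += size;
        }
        double averageSize = (double) sum / sampleRecordSizes.length;
        if (averageSize <= 0) {
            return 0; // guard against division by zero
        }
        return Math.round(totalBytes / averageSize);
    }

    public static void main(String[] args) {
        // Three sampled records averaging 100 bytes, 1,000,000 bytes in total.
        long[] sample = {100, 120, 80};
        System.out.println(estimate(sample, 1_000_000L));
    }
}
```

Because the result depends on how representative the first few records are, treat it as an approximation rather than an exact count.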
Before you begin: Create a File data set rule instance. See Creating a File data set rule.
  1. In the Edit data set tab, in the Data Source section, click Files on repositories.
  2. In the Connection section, select the source repository:
    • To select one of the predefined repositories, click the Repository configuration field, press the Down Arrow key, and choose a repository.
    • To create a repository, click Open to the right of the Repository configuration field, and then complete the steps in Creating a repository.
    To match multiple files in a folder, use an asterisk (*) as a wildcard character in the file path.
    For example:  /folder/part-r-*
  3. In the File configuration section, select how you want to define the files to read or write:
    • For a single file or a range of files, select Use a file path.
    • For multiple files that you list in a manifest file, select Use a manifest file.
    For manifest files, use the following .xml format:
    
    <manifest>
        <files>
            <file>
                <name>file0001.csv</name>
            </file>
            <file>
                <name>file0002.csv</name>
            </file>
        </files>
    </manifest>
    
    You can use a manifest file to define files only for read operations.
  4. In the File configuration section, enter the file location.
  5. Optional: For a file path, define the date and time pattern by adding a Java SimpleDateFormat string to the file path.
    The SimpleDateFormat string cannot contain the following characters: " ? * < > | :
    For example: %{yyyy-MM-dd-HH-}
  6. Optional: If the file is compressed, select File is compressed and choose the Compression type.
    The supported compression types are .zip and .gz (GZip).
  7. Optional: To provide additional file processing for read and write operations, such as encoding and decoding, define and implement a dedicated interface:
    1. Select Custom stream processing.
    2. In the Java class with reader implementation field, enter the fully qualified name of the Java class with the logic that you want to apply before parsing.
      For example: com.pega.bigdata.dataset.file.repository.streamprocessing.sampleclasses.InputStreamShiftingProcessing
    3. In the Java class with writer implementation field, enter the fully qualified name of the Java class with the logic that you want to apply after serializing the file and before writing it to the system.
      For example: com.pega.bigdata.dataset.file.repository.streamprocessing.sampleclasses.OutputStreamShiftingProcessing
    For more information on the custom stream processing interface, see Requirements for custom stream processing in File data sets.
  8. In the Parser configuration section, update the settings for the selected file by clicking Configure automatically or by configuring the parameters manually:
    1. From the File type drop-down list, select the defined file type.
    2. For CSV files, specify if the file contains a header row by selecting the File contains header check box.
    3. For CSV files, in the Delimiter character list, select the character that separates the fields in the selected file.
    4. For CSV files, in the Supported quotation marks list, select the quotation mark type used for string values in the selected file.
    5. In the Date Time format field, enter the pattern representing date and time stamps in the selected file.
      The default pattern is: yyyy-MM-dd HH:mm:ss
    6. In the Date format field, enter the pattern representing date stamps in the selected file.
      The default pattern is: yyyy-MM-dd
    7. In the Time Of Day format field, enter the pattern representing time stamps in the selected file.
      The default pattern is: HH:mm:ss
      Note: Time properties in the selected file can be in a different time zone than the one used by Pega Platform. To avoid confusion, specify the time zone in the time properties of the file, and use the appropriate pattern in the settings.
  9. Optional: Click Preview file.
    Result: For a file path configuration, the preview contains the file name and file contents. For a manifest file configuration, the preview shows the manifest file and the contents of the first file that is listed in the manifest.
  10. For CSV files, in the Mapping tab, modify the number of mapped columns:
    • To add a CSV file column, click Add mapping.
    • To remove a CSV file column and the associated property mapping, click Delete mapping for the applicable row.

    For CSV files with a header row, the Column entry in a new mapping instance must match the column name in the file.

  11. For CSV files, in the Mapping tab, check and complete the mapping between the columns in the CSV file and the corresponding properties in Pega Platform:
    • To map an existing property to a CSV file column, in the Property column, press the Down Arrow key and choose the applicable item from the list.
    • For CSV files with a header row, to automatically create properties that are not in Pega Platform and map them to CSV file columns, click Create missing properties. Confirm the additional mapping by clicking Create.
    • To manually create properties that are not in Pega Platform and map them to CSV file columns, in the Property column, enter a property name that matches the Column entry, click Open, and configure the new property. For more information, see Creating a property.

    For JSON files, the Mapping tab is empty, because the system automatically maps the fields, and no manual mapping is available.

  12. Confirm the new File data set configuration by clicking Save.
    Result: If CSV or JSON files are not valid, error messages display the reason for the error and a line number that identifies where the error is in the file.
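
For the wildcard in step 2, the following Java sketch shows one way to picture how a path such as /folder/part-r-* selects files. It assumes that the asterisk matches any run of characters within a single folder and does not cross a slash; this topic does not state the exact matching rules, and the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

final class WildcardPathSketch {

    /**
     * Converts a path containing asterisks into a regular expression.
     * Assumption: an asterisk matches any characters except '/'.
     */
    static Pattern toPattern(String pathWithWildcard) {
        String[] literals = pathWithWildcard.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append("[^/]*"); // one asterisk in the original path
            }
            regex.append(Pattern.quote(literals[i])); // literal text, escaped
        }
        return Pattern.compile(regex.toString());
    }

    /** Returns true if the whole file path matches the wildcard path. */
    static boolean matches(String pathWithWildcard, String filePath) {
        return toPattern(pathWithWildcard).matcher(filePath).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("/folder/part-r-*", "/folder/part-r-00001"));
        System.out.println(matches("/folder/part-r-*", "/folder/other.csv"));
    }
}
```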
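
For the manifest format in step 3, the following Java sketch reads the file names from a manifest by using the standard JDK XML parser. It is an illustration of the manifest/files/file/name structure only; the class and method names are hypothetical, and it does not show how Pega Platform itself reads the manifest.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ManifestNamesSketch {

    /** Returns the text of every name element, in document order. */
    static List<String> fileNames(String manifestXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(manifestXml)));
            NodeList names = doc.getElementsByTagName("name");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < names.getLength(); i++) {
                result.add(names.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Manifest is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String manifest = "<manifest><files>"
                + "<file><name>file0001.csv</name></file>"
                + "<file><name>file0002.csv</name></file>"
                + "</files></manifest>";
        System.out.println(fileNames(manifest));
    }
}
```

Checking a manifest this way before you upload it can catch unclosed tags early, because a malformed document raises an exception instead of returning a partial list.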
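
For the date and time pattern in step 5, the following Java sketch expands a %{...} token in a file path by formatting a timestamp with SimpleDateFormat. The token syntax comes from the example in that step; the expansion logic, the class and method names, and the data.csv file name are assumptions made for illustration, not the documented behavior of Pega Platform.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DatePatternPathSketch {

    // Matches %{pattern} and captures the SimpleDateFormat pattern inside.
    private static final Pattern TOKEN = Pattern.compile("%\\{([^}]*)\\}");

    /** Replaces every %{...} token with the formatted timestamp. */
    static String expand(String path, Date when, TimeZone zone) {
        Matcher matcher = TOKEN.matcher(path);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            SimpleDateFormat format = new SimpleDateFormat(matcher.group(1), Locale.ROOT);
            format.setTimeZone(zone);
            matcher.appendReplacement(out, Matcher.quoteReplacement(format.format(when)));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical path; only the %{...} token is taken from the step above.
        String path = "/folder/%{yyyy-MM-dd-HH-}data.csv";
        System.out.println(expand(path, new Date(), TimeZone.getTimeZone("UTC")));
    }
}
```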
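
For the custom stream processing in step 7, the following Java sketch shows the general idea of wrapping streams so that bytes are transformed on write and restored on read. It uses only java.io classes and does not implement the Pega Platform interface, which is described in Requirements for custom stream processing in File data sets. The one-byte shift is a guess based on the names of the sample classes, and all class names below are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class ShiftingStreamsSketch {

    /** Adds 1 to every byte after serialization, before it reaches the sink. */
    static final class ShiftingOutputStream extends FilterOutputStream {
        ShiftingOutputStream(OutputStream out) { super(out); }

        @Override
        public void write(int b) throws IOException {
            out.write((b + 1) & 0xFF);
        }
    }

    /** Subtracts 1 from every byte before the content is parsed. */
    static final class ShiftingInputStream extends FilterInputStream {
        ShiftingInputStream(InputStream in) { super(in); }

        @Override
        public int read() throws IOException {
            int b = in.read();
            return b < 0 ? b : (b - 1) & 0xFF;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            int n = in.read(buf, off, len);
            for (int i = off; i < off + n; i++) {
                buf[i] = (byte) (buf[i] - 1);
            }
            return n;
        }
    }

    /** Writes text through the shifting writer and reads it back. */
    static String roundTrip(String text) {
        try {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            try (OutputStream out = new ShiftingOutputStream(sink)) {
                out.write(text.getBytes(StandardCharsets.UTF_8));
            }
            try (InputStream in = new ShiftingInputStream(
                    new ByteArrayInputStream(sink.toByteArray()))) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("id,name\n1,Ann"));
    }
}
```

The reader and writer must apply inverse transformations; otherwise, files that the data set saves cannot be parsed when they are read back.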
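
For the date and time formats in step 8, the following Java sketch parses a value with the default Date Time pattern and then with a variant that carries an explicit offset, which relates to the time zone note in that step. It assumes that the pattern letters have the same meaning in java.time as in the parser configuration; the offset pattern, the sample values, and the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

final class ParserDatePatternSketch {

    // Default Date Time pattern from the Parser configuration section.
    static final DateTimeFormatter DEFAULT_DATE_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss", Locale.ROOT);

    // Hypothetical variant that expects an offset such as +0200 in the file.
    static final DateTimeFormatter WITH_OFFSET =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss Z", Locale.ROOT);

    /** Parses a value that has no time zone information. */
    static LocalDateTime parseLocal(String value) {
        return LocalDateTime.parse(value, DEFAULT_DATE_TIME);
    }

    /** Parses a value with an offset and converts it to an exact instant. */
    static Instant parseWithOffset(String value) {
        return OffsetDateTime.parse(value, WITH_OFFSET).toInstant();
    }

    public static void main(String[] args) {
        System.out.println(parseLocal("2023-05-17 14:30:00"));
        System.out.println(parseWithOffset("2023-05-17 14:30:00 +0200"));
    }
}
```

A value without an offset is ambiguous until a time zone is applied to it, whereas a value with an offset identifies a single moment regardless of the server time zone.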