Creating an HDFS data set record

Updated on May 17, 2024

You must configure each instance of the HDFS data set rule before it can read data from, and save data to, an external Apache Hadoop Distributed File System (HDFS).

Before you begin: Before you can connect to an Apache HBase or HDFS data store, upload the relevant client JAR files into the application container with Pega Platform. For more information, see HDFS and HBase client and server versions supported by Pega Platform.
  1. Create an instance of the HDFS data set rule:
    1. In the header of Dev Studio, click Create > Data Model > Data Set.
    2. In the Label field, enter a short description for your data set.
    3. In the Type field, select HDFS.
    4. In the Context section, select the application context, class, and ruleset for the data set.
      For more information about the fields on this form, see Creating a rule.
    5. Click Create and Open.
  2. On the Create Data Set tab, in the Data Set Record Configuration section, define the following settings to identify your data set:
    1. In the Label field, enter the data set label.
      Result: The identifier is automatically created based on the data set label.
    2. Optional: To change the automatically created identifier, click Edit, enter an identifier name, and then click OK.
    3. In the Type list, select HDFS.
  3. In the Context section, specify the application context, applicable class, ruleset, and ruleset version of the data set.
  4. Click Create and open.
  5. In the Connection section, connect to a Hadoop configuration instance:
    1. In the Hadoop configuration instance field, select the Hadoop configuration instance that contains the HDFS storage configuration.
      You can create a Hadoop configuration instance by clicking the Target icon on the right side of the field.
    2. Test whether Pega Platform can connect to the HDFS data set by clicking Test connectivity.
      Note: The HDFS data set is optimized to support connections to one Apache Hadoop environment. When HDFS data sets connect to different Apache Hadoop environments in a single instance of a data flow rule, the data sets cannot use authenticated connections concurrently. If you need to use authenticated and non-authenticated connections at the same time, the HDFS data sets must use one Hadoop environment.
    3. In the File path field, specify a file path to the group of source and output files that the data set represents.
      Note: This group of files is based on the file within the original path, but also contains all of the files with the following pattern: fileName-XXXXX, where XXXXX are sequence numbers starting from 00000. This is a result of data flows saving records in batches. The save operation appends data to the existing HDFS data set without overwriting it. You can use * to match multiple files in a folder, for example: /folder/part-r-*
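The file naming described in the note above can be illustrated with a short sketch. It is not Pega Platform code: the helper names are hypothetical, and Python's `fnmatchcase` only stands in for the `*` wildcard, so the matching rules of a real HDFS path may differ in detail.

```python
from fnmatch import fnmatchcase


def batch_file_names(base_name: str, batches: int) -> list[str]:
    """Return the base file plus batch files named base-00000, base-00001, ...

    Hypothetical helper: it only mimics the fileName-XXXXX pattern from the note.
    """
    return [base_name] + [f"{base_name}-{i:05d}" for i in range(batches)]


def matching_paths(pattern: str, paths: list[str]) -> list[str]:
    """Return the paths that match a wildcard pattern such as /folder/part-r-*."""
    return [p for p in paths if fnmatchcase(p, pattern)]


# A folder with one original file, three batch files, and an unrelated file.
files = [f"/folder/{name}" for name in batch_file_names("part-r", 3)]
files.append("/folder/other.csv")

matched = matching_paths("/folder/part-r-*", files)
print(matched)
```

In this sketch the pattern `/folder/part-r-*` selects only the three numbered batch files, because the name `part-r` itself has no trailing hyphen.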
  6. Optional: To view the first 100 KB of records in the selected file, click Preview file.
  7. In the Parser configuration section, specify the file type that is used within the selected data set:
    • CSV:
      1. In the File type field, select CSV.
      2. In the Delimiter character field, select the character for separating properties.
      3. In the Supported quotes field, select the type of quotation marks that you want to use.
    • JSON:
      1. In the File type field, select JSON.
    • Parquet:
      1. In the File type field, select Parquet.
      2. In the Compression algorithm field, for data set write operations, specify the algorithm that is used for file compression in the data set.
        • If you do not use a file compression method in the data set, select Uncompressed.
        • If you use the gzip file compression algorithm in your data set, select Gzip.
        • If you use the SNAPPY file compression algorithm in your data set, select Snappy.
  8. In the Properties mapping section, map the properties from the HDFS data set to the corresponding Pega Platform properties, depending on your parser configuration:
    • CSV:
      1. Click Add Property.
      2. In the numbered field, specify the property that corresponds to a column in the CSV file.
        Note: Property mapping for the CSV format is based on the order of columns in the CSV file. For that reason, the order of the properties in the Properties mapping section must correspond to the order of columns in the CSV file.
    • JSON:
      • To use the auto-mapping mode, keep the Use property auto mapping check box selected.
        Note: This mode is enabled by default. In auto-mapping mode, the column names from the JSON data file are used as properties. This mode supports the nested JSON structures that are directly mapped to Page and Page List properties in the data model of the class that the data set applies to.
      • To manually map properties, perform the following actions:
        1. Clear the Use property auto mapping check box.
        2. In the JSON column, enter the name of the column that you want to map to a Pega Platform property.
        3. In the Property name field, specify a Pega Platform property that you want to map to the JSON column.
    • Parquet:
      1. Click Generate missing properties.
      2. Examine the Properties generation dialog that shows both mapped and unmapped properties.
      3. Click Submit.

      To create the mapping, Parquet utilizes properties that are defined in the data set class. You can map only the properties that are scalar and not inherited. If the property name matches a field name in the Parquet file, the property is populated with the corresponding data from the Parquet file.

      You can generate properties from the Parquet file that do not exist in Pega Platform. When you generate missing properties, Pega Platform checks for unmapped columns in the data set, and creates the missing properties in the data set class for any unmapped columns.
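The name-based Parquet mapping described above can be sketched as a simple comparison of two sets of names. This is an illustration only, not the Pega Platform implementation: the function name and the sample field names are hypothetical, and `class_properties` is assumed to already contain only the scalar, non-inherited properties of the data set class.

```python
def plan_parquet_mapping(parquet_fields: list[str], class_properties: set[str]):
    """Split Parquet field names into mapped and missing ones.

    mapped:  fields whose name matches an existing property (these get populated)
    missing: fields with no matching property (candidates for generation)
    """
    mapped = [f for f in parquet_fields if f in class_properties]
    missing = [f for f in parquet_fields if f not in class_properties]
    return mapped, missing


# Hypothetical field and property names, chosen only for the example.
mapped, missing = plan_parquet_mapping(
    ["id", "amount", "region"],
    {"id", "amount", "status"},
)
print("mapped:", mapped)
print("missing:", missing)
```

Here `region` has no matching property, so it is the kind of column that would appear as unmapped in the Properties generation dialog.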

  9. Click Save.
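
The order-based CSV mapping from step 8 can be illustrated outside Pega Platform with Python's standard `csv` module. The sketch below assumes a semicolon delimiter, double quotation marks, and three hypothetical property names. It shows why the property order must follow the column order: the parser assigns values by position only.

```python
import csv
import io


def parse_csv_by_position(text, properties, delimiter=",", quotechar='"'):
    """Map each CSV column to a property name purely by position."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    return [dict(zip(properties, row)) for row in reader]


# Sample data: the quoted second column contains the delimiter character.
sample = 'C-1;"Smith; Jane";42\nC-2;"Doe; John";37\n'

# Property order matches the column order, so values land where expected.
records = parse_csv_by_position(sample, ["CustomerID", "Name", "Age"], delimiter=";")
print(records[0])

# Swapping two properties silently mislabels the data.
swapped = parse_csv_by_position(sample, ["Name", "CustomerID", "Age"], delimiter=";")
print(swapped[0])
```

The second call does not fail. It simply stores the customer ID under `Name`, which is the kind of mismatch that a wrong property order would produce.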
