Working with HDFS data sets on the Pega 7 Platform
The Pega 7 Platform decision management functionality makes it possible to connect to an external Hadoop distributed file system (HDFS). By associating the Pega 7 Platform with an HDFS, you can access very large data sets that are stored on multiple servers that operate in parallel. This tutorial provides an overview of the HDFS data model and guides you through common types of HDFS operations on the Pega 7 Platform.
The HDFS data model
HDFS is a distributed file system that runs on commodity hardware and stores very large files (gigabytes or terabytes of data). HDFS represents the structure of files and directories as a tree. It also records attributes of directories and files, such as ownership, permissions, quotas, and replication factor.
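As a rough illustration of this data model, the following Python sketch represents a small directory tree whose nodes carry ownership, permission, and replication attributes. The class name, field names, owners, and values are hypothetical and chosen only for this example; they do not reflect how HDFS or the Pega 7 Platform stores this metadata internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One file or directory in a simplified HDFS-like tree (illustrative only)."""
    name: str
    owner: str
    permission: str
    replication: Optional[int] = None  # meaningful for files only
    children: List["Node"] = field(default_factory=list)

    def walk(self, parent=""):
        """Yield (full_path, node) pairs for this node and all of its descendants."""
        path = parent.rstrip("/") + "/" + self.name
        yield path, self
        for child in self.children:
            yield from child.walk(path)


# Hypothetical tree: the root directory, one subdirectory, and one file.
root = Node("", "hdfs", "rwxr-xr-x", children=[
    Node("ds", "pega", "rwxr-xr-x", children=[
        Node("customers.csv", "pega", "rw-r--r--", replication=3),
    ]),
])

paths = [path for path, _ in root.walk()]
print(paths)
```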
Previewing the HDFS data structure
You can view the structure of HDFS data by using the Data-Admin-DataSet-HDFS rule form. It is helpful to examine and understand the structure of the data before you process it on the Pega 7 Platform, for example, before you map the data to properties.
Prerequisites:
You need an instance of the HDFS data set and a configured instance of the Hadoop record.
- On the Pega 7 Platform, access the target HDFS data set.
- Make sure that the Pega 7 Platform is connected to the target Hadoop instance. In the Connection section, click Test connectivity to verify the connection.
- In the File system configuration section, specify the path to the file that you want to preview.
- Preview the file. This action returns the first 100 kilobytes of the file.
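The effect of the preview can be approximated outside the platform by reading only the first 100 kilobytes of a file. The sketch below uses an in-memory byte stream as a stand-in for an HDFS file; the function name and the sample data are assumptions made for illustration, and the code does not call any Pega or Hadoop API.

```python
import io

PREVIEW_BYTES = 100 * 1024  # 100 kilobytes, the preview size described above


def preview(stream, limit=PREVIEW_BYTES):
    """Return at most `limit` bytes from the start of a binary stream."""
    return stream.read(limit)


# Stand-in for a large HDFS file: roughly 400 KB of CSV-like data in memory.
sample = io.BytesIO(b"name,age,city\n" * 30000)
head = preview(sample)
print(len(head))
```

Reading a bounded prefix is enough to inspect delimiters, quoting, and field order without loading the whole file.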
Performing the run operation on HDFS data sets
From the Actions menu, you can perform the following types of run operations on an HDFS data set:
- Save: Because data flows save records in batches, the save operation must append data to an existing data set and not overwrite it. For that reason, each save operation creates a new data file. For example, if the file path in the data set configuration is specified as /ds/customers.csv, each save operation on that data set creates a file with the following indexing: /ds/customers-00000.csv, /ds/customers-00001.csv, /ds/customers-00002.csv, and so on.
- Browse: Retrieves all data from the HDFS file.
- Truncate: Removes all data files from a specific HDFS file path. For example, if the file path in the data set is specified as /ds/customers.csv, the truncate operation deletes not only /ds/customers.csv, but also /ds/customers-00000.csv, /ds/customers-00001.csv, and so on. The result of the truncate operation is always an empty file path.
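The file naming described for the save and truncate operations can be sketched in Python as follows. The five-digit, zero-padded index is inferred from the examples above; the function names are hypothetical, and the behavior of the platform beyond index 99999 is not covered by this sketch.

```python
import posixpath
import re


def indexed_path(base_path, index):
    """Build the name of the data file created by one save operation."""
    stem, ext = posixpath.splitext(base_path)
    return f"{stem}-{index:05d}{ext}"


def truncate_targets(base_path, existing_paths):
    """Select the base file and every indexed file that belongs to it."""
    stem, ext = posixpath.splitext(base_path)
    pattern = re.compile(re.escape(stem) + r"(-\d{5})?" + re.escape(ext) + r"\Z")
    return [path for path in existing_paths if pattern.match(path)]


base = "/ds/customers.csv"
saved = [indexed_path(base, i) for i in range(3)]

# Hypothetical directory listing that also contains an unrelated file.
listing = [base] + saved + ["/ds/orders.csv"]
to_delete = truncate_targets(base, listing)
print(saved)
print(to_delete)
```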
Configuring parser settings
From the Parser configuration section, choose the file format for the data set. Currently, the CSV and JSON (with one JSON object per file row) formats are supported. When you select a file format, additional configuration sections are displayed, depending on your selection.
- On the Pega 7 Platform, access the target HDFS data set.
- Test connectivity to the target Hadoop instance. In the Connection section, click Test connectivity to verify the connection.
- In the Parser configuration section, perform the following actions:
- In the File type property, select the file format that the HDFS data set uses.
- Only for the CSV format: Specify the Delimiter character property.
- Only for the CSV format, optional: Specify the Supported quotes property.
- Click Save.
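To make the two supported formats concrete, the following Python sketch parses CSV text with a configurable delimiter and quote character, and parses JSON text that holds one object per line. The function names and sample data are hypothetical, and the code only mirrors the roles of the Delimiter character and Supported quotes settings; it is not the parser that the Pega 7 Platform uses.

```python
import csv
import io
import json


def parse_csv(text, delimiter=",", quotechar='"'):
    """Split CSV text into rows of fields, honoring the delimiter and quote character."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=quotechar,
        skipinitialspace=True,
    )
    return [row for row in reader]


def parse_json_lines(text):
    """Parse text that contains one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


csv_rows = parse_csv('Bart Smith;27;"New York"\n', delimiter=";")
json_rows = parse_json_lines(
    '{"name": "Bart Smith", "age": 27}\n'
    '{"name": "Peter Parker", "age": 44}\n'
)
print(csv_rows)
print(json_rows)
```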
Configuring property mapping
You use the property mapping functionality of the Pega 7 Platform to associate data fields that are received from an external data source with properties, or with other sources or destinations, in the Pega 7 Platform. The type of property mapping that you can perform depends on your parser settings.
Configuring property mapping for the JSON file format
By default, the auto-mapping mode is enabled. In this mode, the column names from the JSON data file are used as Pega 7 Platform property names. This mode supports nested JSON structures, which are mapped directly to page and page list properties in the data model of the class that the data set applies to.
- On the Pega 7 Platform, access the target HDFS data set.
- Test connectivity to the target Hadoop instance. In the Connection section, click Test connectivity to verify the connection.
- In the Properties mapping section, perform the following actions:
- To use the auto-mapping mode, select Use property auto mapping. This mode is enabled by default.
- To manually map properties, perform the following actions:
- Clear the Use property auto mapping check box.
- In the JSON column, enter the name of the column that you want to map to a Pega 7 Platform property.
- In the Property name field, enter the name of a Pega 7 Platform property that you want to map the JSON column to.
- To add rows, click Add mapping.
- Click Save.
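The difference between auto-mapping and manual mapping for JSON data can be sketched in Python as follows. The record, the property names FullName and HomeAddress, and the function name are hypothetical. A nested object stands in for a page property and a list of objects stands in for a page list property; this is an analogy, not the platform's implementation.

```python
import json


def map_record(record, mapping=None):
    """Map one JSON record to property names.

    With no mapping, JSON names are used as property names (auto-mapping).
    With a mapping, only the listed JSON names are kept and renamed.
    """
    if mapping is None:
        return dict(record)
    return {prop: record[col] for col, prop in mapping.items() if col in record}


line = (
    '{"name": "Bart Smith", '
    '"address": {"city": "New York"}, '
    '"orders": [{"id": 1}, {"id": 2}]}'
)
record = json.loads(line)

auto = map_record(record)
manual = map_record(record, {"name": "FullName", "address": "HomeAddress"})
print(auto)
print(manual)
```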
Configuring property mapping for the CSV file format
Property mapping for the CSV format is based on the order of columns in the CSV file. Therefore, the order of properties in the Properties mapping section must correspond to the order of columns in the CSV file.
- On the Pega 7 Platform, access the target HDFS data set.
- Test connectivity to the target Hadoop instance. In the Connection section, click Test connectivity to verify the connection.
- In the Properties mapping section, perform the following actions:
- Click Add property.
- In the numbered field that is displayed, enter the property name that corresponds to a column in the CSV file.
- Click Save.
For the following example of customer data in a CSV file:
Bart Smith, 27, New York
Peter Parker, 44, Boston
Melanie Evans, 57, Atlanta
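The property mapping should list one property for each column, in the same order as the columns: first the property for the customer name, then the property for the age, and then the property for the city.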
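Positional mapping of this kind can be sketched in Python by pairing an ordered list of property names with the fields of each row. The property names Name, Age, and City are hypothetical placeholders for whatever properties the class of the data set defines.

```python
import csv
import io

# Hypothetical property names, listed in the same order as the CSV columns.
PROPERTY_ORDER = ["Name", "Age", "City"]

DATA = (
    "Bart Smith, 27, New York\n"
    "Peter Parker, 44, Boston\n"
    "Melanie Evans, 57, Atlanta\n"
)


def map_rows(text, properties):
    """Pair each field with the property at the same position in the list."""
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    return [dict(zip(properties, row)) for row in reader]


records = map_rows(DATA, PROPERTY_ORDER)
for rec in records:
    print(rec)
```

Because the pairing is purely positional, reordering the property list without reordering the columns would silently assign values to the wrong properties, which is why the order in the mapping must match the file.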
Running an HDFS data set
You can perform save, browse, and truncate operations on an HDFS data set.
- Open an HDFS data set instance on the Pega 7 Platform.
- Click Actions > Run. The Test Page dialog box is displayed.
- On the Test Page, expand the Operation menu.
- Select one of the available operations and click Run.
For more information, see DataSet-Execute method.