You must configure each instance of the HDFS data set rule before it can read data from, and save data to, an external Apache Hadoop Distributed File System (HDFS).
Connect to an instance of the Data-Admin-Hadoop configuration rule.
In the Hadoop configuration instance field, reference the Data-Admin-Hadoop configuration rule that contains HDFS storage configuration.
Click Test connectivity.
Note: The HDFS data set is optimized to support connections to one Apache Hadoop environment. If HDFS data sets in a single instance of the Data Flow rule connect to different Apache Hadoop environments, the data sets cannot use authenticated connections concurrently. To use authenticated and non-authenticated connections at the same time, the HDFS data sets must all use the same Hadoop environment.
In the File path field, specify a file path to the group of source and output files that the data set represents.
This group is based on the file in the original path, but it also contains all the files that match the following pattern: fileName-XXXXX, where XXXXX is a sequence number starting from 00000. These additional files are a result of data flows saving records in batches. The save operation appends data to the existing HDFS data set without overwriting it. You can also use * to match multiple files in a folder (for example, /folder/map-r-*).
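The file grouping described above can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name files_in_group and the sample listing are hypothetical, the sequence suffix is assumed to be exactly five digits, and * is assumed to behave like a shell-style wildcard.

```python
import fnmatch
import re


def files_in_group(file_path, candidates):
    """Return the candidate paths that belong to the group for file_path.

    Assumptions (not confirmed by the product documentation):
    - the batch suffix is a hyphen followed by exactly five digits;
    - '*' in file_path is a shell-style wildcard.
    """
    if "*" in file_path:
        # Wildcard path: keep every candidate that matches the pattern.
        return [c for c in candidates if fnmatch.fnmatchcase(c, file_path)]
    # Plain path: the file itself plus its fileName-XXXXX batch files.
    batch = re.compile(re.escape(file_path) + r"-\d{5}")
    return [c for c in candidates if c == file_path or batch.fullmatch(c)]


# Hypothetical directory listing used only for illustration.
listing = [
    "/data/customers",
    "/data/customers-00000",
    "/data/customers-00001",
    "/data/customers-old",
    "/folder/map-r-00000",
    "/folder/map-r-00001",
    "/folder/reduce-r-00000",
]

print(files_in_group("/data/customers", listing))
print(files_in_group("/folder/map-r-*", listing))
```

In this sketch, /data/customers-old is left out of the group because its suffix is not a sequence number.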
Optional: To check the beginning of the selected file, click Preview File. You can view the first 100 KB of the file.
In the Parser configuration section, select the file type that the data set will represent.
Select CSV and do the following steps:
Specify the delimiter character to use.
Specify which quotation marks are supported.
Add mapping for the properties.
Property mapping for the CSV format is based on the order of the columns. In this section, you can add and reorder properties. The first property populates or reads data from the first column, the second property from the second column, and so on.
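The order-based mapping can be sketched in Python as follows. This is an illustration only, not the parser that the data set uses: the property names, the sample data, and the parse_csv helper are hypothetical, and the delimiter and quotation mark stand in for the values selected in the previous steps.

```python
import csv
import io

# Hypothetical property names, listed in the order configured in the mapping.
properties = ["CustomerID", "Name", "City"]


def parse_csv(text, properties, delimiter=",", quotechar='"'):
    """Map each CSV column to the property at the same position."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar)
    # zip pairs the first property with the first column, and so on.
    return [dict(zip(properties, row)) for row in reader]


# Sample data with a semicolon delimiter and a quoted value containing it.
sample = '1;"Smith; John";Boston\n2;Ana;Lisbon\n'

records = parse_csv(sample, properties, delimiter=";")
print(records[0])

# Reordering the properties changes which column feeds which property.
reordered = parse_csv(sample, ["CustomerID", "City", "Name"], delimiter=";")
print(reordered[1])
```

Because the mapping is positional, reordering the property list is enough to change which column populates a property; the file itself contains no property names in this sketch.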
Select JSON and choose one of the following mapping modes:
By default, the Use property auto mapping check box is selected. In this mode, property names are used directly as column names. This mode supports nested JSON structures, which are mapped directly to the page and page list properties in the data model of the class that the data set instance applies to.
Clear the Use property auto mapping check box to manually specify the mapping between JSON columns and properties. This mode does not support nested structures.
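The difference between the two JSON mapping modes can be sketched in Python. This is a simplified model, not the product's behavior: the functions auto_map and manual_map, the column names, and the property names are hypothetical, and the choice to skip nested values in manual mode is an assumption, because the text above states only that nested structures are not supported.

```python
import json


def auto_map(json_text):
    """Auto mapping: JSON names are used directly as property names.

    Nested objects and arrays are kept, standing in for page and
    page list properties.
    """
    return json.loads(json_text)


def manual_map(json_text, column_to_property):
    """Manual mapping: copy only the explicitly mapped top-level columns."""
    data = json.loads(json_text)
    result = {}
    for column, prop in column_to_property.items():
        if column not in data:
            continue
        value = data[column]
        if isinstance(value, (dict, list)):
            # Assumption: nested values are skipped in this sketch.
            continue
        result[prop] = value
    return result


# Hypothetical record with one nested object and one nested array.
line = '{"name": "Ana", "address": {"city": "Lisbon"}, "orders": [{"id": 1}, {"id": 2}]}'

print(auto_map(line)["address"]["city"])
print(manual_map(line, {"name": "CustomerName", "address": "Address"}))
```

In the sketch, auto mapping preserves the nested address and orders values, while manual mapping returns only the flat CustomerName value.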
Click Save.