Configuring an HDFS data set

You must configure each instance of the HDFS data set rule before the rule can read data from and save data to an external Apache Hadoop Distributed File System (HDFS).

  1. Create an instance of the HDFS data set rule.

  2. Connect to an instance of the Data-Admin-Hadoop configuration rule.

    1. In the Hadoop configuration instance field, reference the Data-Admin-Hadoop configuration rule that contains the HDFS storage configuration.

    2. Click Test connectivity. For an equivalent stand-alone check outside of the rule form, see the first sketch after this procedure.

      Note: The HDFS data set is optimized to support connections to a single Apache Hadoop environment. When a single Data Flow rule instance uses HDFS data sets that connect to different Apache Hadoop environments, those data sets cannot use authenticated connections concurrently. To use authenticated and non-authenticated connections at the same time, all of the HDFS data sets must connect to the same Hadoop environment.

  3. In the File path field, specify a file path to the group of source and output files that the data set represents.

    This group is based on the file in the specified path, but also includes every file that matches the pattern fileName-XXXXX, where XXXXX is a sequence number that starts at 00000. These files are created when data flows save records in batches; the save operation appends data to the existing HDFS data set instead of overwriting it. You can also use * to match multiple files in a folder (for example, /folder/map-r-*). The second sketch after this procedure illustrates this file-name expansion.

  4. Optional: To check the beginning of the selected file, click Preview File. The preview displays the first 100 KB of the file.

  5. In the Parser configuration section, select the file type that the data set will represent.

  6. Click Save.
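
Step 2 verifies connectivity from the rule form itself. As a point of reference, the following is a minimal stand-alone sketch of the same kind of check, written against the standard Apache Hadoop client API. The namenode host, port, and class name are placeholders rather than values taken from any particular configuration, and the sketch assumes that the Hadoop client libraries are on the classpath.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsConnectivityCheck {
        public static void main(String[] args) throws IOException {
            // Placeholder namenode URI; in practice, the host and port come from
            // the Data-Admin-Hadoop configuration record referenced in step 2.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

            // Opening the file system and touching the root path fails fast
            // when the cluster is unreachable or the client is misconfigured.
            try (FileSystem fs = FileSystem.get(conf)) {
                System.out.println("Root path exists: " + fs.exists(new Path("/")));
            }
        }
    }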
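
The file-name expansion described in step 3 can also be observed with the Hadoop client's glob API. This sketch only illustrates the matching behavior; the data set performs the actual expansion at run time, and the /folder/map-r path and namenode URI are hypothetical examples.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsBatchFileListing {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020"); // placeholder

            try (FileSystem fs = FileSystem.get(conf)) {
                // A data set configured with the path /folder/map-r also covers the
                // batch files that data flows append: map-r-00000, map-r-00001, ...
                // globStatus returns the files that match the same wildcard pattern.
                FileStatus[] parts = fs.globStatus(new Path("/folder/map-r-*"));
                if (parts != null) { // null when nothing matches the pattern
                    for (FileStatus part : parts) {
                        System.out.println(part.getPath() + " (" + part.getLen() + " bytes)");
                    }
                }
            }
        }
    }

Listing the parts this way can help you confirm that a wildcard path matches the files you expect before you run a data flow against it.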
