To enable a parallel load from multiple CSV or JSON files located in remote
repositories or on the local file system, create a File data set that references a repository.
This feature enables remote files to function as data sources for Pega Platform data sets.
You can perform the following operations for File data sets that reference a remote repository:
- Browse: Retrieves records in an undefined order.
- Save: Saves records to multiple files, along with a meta file that contains the name, size, and number of records for each file. The Save operation is not available for manifest files.
- Truncate: Removes all configured files and their meta files, except for the manifest file.
- GetNumberOfRecords: Estimates the number of records based on the average size of the first few records and the total size of the data set files.
- In the Edit data set tab, in the Data Source section, click Files on repositories.
- In the Connection section, select the source repository:
  - To select one of the predefined repositories, click the Repository configuration field, press the Down Arrow key, and choose a repository.
  - To create a repository, click Open to the right of the Repository configuration field and follow the steps in Creating a repository.
  To match multiple files in a folder, use an asterisk (*) as a wildcard character.
  For example: /folder/part-r-*
- In the File configuration section, select how you want to define the files to read or write:
  - For a single file or a range of files, select Use a file path.
  - For multiple files that you list in a manifest file, select Use a manifest file.
  For manifest files, use the following XML format:
  <manifest>
    <files>
      <file>
        <name>file0001.csv</name>
      </file>
      <file>
        <name>file0002.csv</name>
      </file>
    </files>
  </manifest>
  You can use a manifest file to define files only for read operations.
- In the File configuration section, enter the file location.
- Optional: For a file path, define the date and time pattern by adding a Java SimpleDateFormat string to the file path.
  The SimpleDateFormat string does not support the following characters: " ? * < > | :
  For example: %{yyyy-MM-dd-HH-}
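  The %{...} placeholder follows java.text.SimpleDateFormat rules. The following minimal Java sketch only illustrates how such a pattern expands; the placeholder syntax and the actual resolution time are handled by the data set itself, and the path shown is hypothetical:
  import java.text.SimpleDateFormat;
  import java.util.Date;

  public class FilePathPatternExample {
      public static void main(String[] args) {
          // Hypothetical file path that contains a %{...} date and time placeholder.
          String configuredPath = "/folder/%{yyyy-MM-dd-HH-}part-r-*";

          // Resolve the placeholder with SimpleDateFormat, which is the convention the pattern follows.
          SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd-HH-");
          String resolved = configuredPath.replace("%{yyyy-MM-dd-HH-}", format.format(new Date()));

          // Prints something like: /folder/2023-11-07-16-part-r-*
          System.out.println(resolved);
      }
  }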
- Optional: If the file is compressed, select File is compressed and choose the Compression type.
  The supported compression types are .zip and .gz (GZip).
- Optional: To provide additional file processing for read and write operations, such as encoding and decoding, define and implement a dedicated interface:
  - Select Custom stream processing.
  - In the Java class with reader implementation field, enter the fully qualified name of the Java class with the logic that you want to apply before parsing.
    For example: com.pega.bigdata.dataset.file.repository.streamprocessing.sampleclasses.InputStreamShiftingProcessing
  - In the Java class with writer implementation field, enter the fully qualified name of the Java class with the logic that you want to apply after serializing the file, before writing it to the system.
    For example: com.pega.bigdata.dataset.file.repository.streamprocessing.sampleclasses.OutputStreamShiftingProcessing
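  The exact reader and writer interfaces belong to the Pega Platform API, so the following is only a minimal sketch of the kind of byte-level transformation such a class typically performs, assuming the logic ultimately wraps a standard java.io stream; the class name and shift value are hypothetical:
  import java.io.FilterInputStream;
  import java.io.IOException;
  import java.io.InputStream;

  // Hypothetical decoder that shifts every byte back by a fixed offset before the
  // file content reaches the parser, similar in spirit to the sample
  // InputStreamShiftingProcessing class named above.
  public class ShiftingInputStream extends FilterInputStream {

      private static final int SHIFT = 1;

      public ShiftingInputStream(InputStream in) {
          super(in);
      }

      @Override
      public int read() throws IOException {
          int b = super.read();
          // Leave end-of-stream (-1) untouched; shift every other byte back by SHIFT.
          return b == -1 ? -1 : (b - SHIFT) & 0xFF;
      }

      @Override
      public int read(byte[] buffer, int offset, int length) throws IOException {
          int count = super.read(buffer, offset, length);
          for (int i = 0; i < count; i++) {
              buffer[offset + i] = (byte) (buffer[offset + i] - SHIFT);
          }
          return count;
      }
  }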
- In the Parser configuration section, update the settings for the selected file by clicking Configure automatically or by configuring the parameters manually:
  - From the File type drop-down list, select the defined file type.
  - For CSV files, specify whether the file contains a header row by selecting the File contains header check box.
  - For CSV files, in the Delimiter character list, select the character that separates the fields in the selected file.
  - For CSV files, in the Supported quotation marks list, select the quotation mark type used for string values in the selected file.
  - In the Date Time format field, enter the pattern representing date and time stamps in the selected file.
    The default pattern is: yyyy-MM-dd HH:mm:ss
  - In the Date format field, enter the pattern representing date stamps in the selected file.
    The default pattern is: yyyy-MM-dd
  - In the Time Of Day format field, enter the pattern representing time stamps in the selected file.
    The default pattern is: HH:mm:ss
  Note: Time properties in the selected file can be in a different time zone than the one that Pega Platform uses. To avoid confusion, specify the time zone in the time properties of the file, and use the appropriate pattern in the settings.
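  These patterns use the common Java date and time pattern letters (yyyy for year, MM for month, and so on). The following minimal sketch shows how such patterns parse values, assuming java.text.SimpleDateFormat semantics; the sample values and the zone-aware pattern are illustrative, not defaults:
  import java.text.ParseException;
  import java.text.SimpleDateFormat;
  import java.util.Date;

  public class ParserPatternExample {
      public static void main(String[] args) throws ParseException {
          // Default Date Time pattern from the parser configuration.
          SimpleDateFormat defaultFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
          Date withoutZone = defaultFormat.parse("2023-11-07 14:30:00");

          // Illustrative pattern for values that carry an explicit time zone offset,
          // so the parsed instant does not depend on the server time zone.
          SimpleDateFormat zonedFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ssZ");
          Date withZone = zonedFormat.parse("2023-11-07 14:30:00+0200");

          System.out.println(withoutZone + " / " + withZone);
      }
  }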
- Optional: Click Preview file.
  Result: For a file path configuration, the preview contains the file name and file contents. For a manifest file configuration, the preview shows the manifest file and the contents of the first file that is listed in the manifest.
- For CSV files, in the Mapping tab, modify the number of mapped columns:
  - To add a CSV file column, click Add mapping.
  - To remove a CSV file column and the associated property mapping, click Delete mapping for the applicable row.
  For CSV files with a header row, the Column entry in a new mapping instance must match the column name in the file.
- For CSV files, in the Mapping tab, check and complete the mapping between the columns in the CSV file and the corresponding properties in Pega Platform:
  - To map an existing property to a CSV file column, in the Property column, press the Down Arrow key and choose the applicable item from the list.
  - For CSV files with a header row, to automatically create properties that are not in Pega Platform and map them to CSV file columns, click Create missing properties. Confirm the additional mapping by clicking Create.
  - To manually create properties that are not in Pega Platform and map them to CSV file columns, in the Property column, enter a property name that matches the Column entry, click Open, and configure the new property. For more information, see Creating a property.
  For CSV files with a header row, the Column entry in a new mapping instance must match the column name in the file.
  For JSON files, the Mapping tab is empty, because the system automatically maps the fields, and no manual mapping is available.
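  To illustrate the CSV mapping, a file with a header row might begin as follows; the column names and values are hypothetical, and each Column entry in the mapping must exactly match a header name such as customerID or totalSpent:
  customerID,firstName,totalSpent
  C-1001,Anna,250.75
  C-1002,Omar,99.00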
- Confirm the new File data set configuration by clicking Save.
  Result: If CSV or JSON files are not valid, error messages display the reason for the error and a line number that identifies where the error is in the file.