Document Processing Service component

The Document Processing Service (DPS) complements Pega Intelligent Virtual Assistants (IVAs) and Pega Email Bots by analyzing image-based files for text and other information. For example, by using DPS, an email bot can automatically analyze a PNG file attachment and extract key information, such as names, locations, dates, and postal codes.

DPS provides optical character recognition (OCR), highlighting, and analysis of forms and tables in image-based attachments. You can also use the DPS component as an extension point or in automations for custom solutions in your application.

DPS in automations

To use DPS in automations for your Pega Platform application, invoke the DPS component activities in your case. As a result, you can process an image-based document, such as a PDF file, and then automatically use the results in another step or stage in your application. For example, your application can use automation activities to upload a file, build a search configuration, send the document for extraction, perform highlighting, perform form and table analysis, and finally query the DPS component for results.

The following DPS automation activities are available in your application:

DPSStartExtraction
Starts the extraction by sending the input file for processing to DPS.
DPSStartHighlight
Highlights entities in the input file.
DPSStartTextractFormAnalysis
Performs form analysis by using textract, a service that automatically extracts text and data from scanned documents.
DPSStartTextractTableAnalysis
Performs table analysis by using textract, a service that automatically extracts text and data from scanned documents.
DPSStartTextractAnalyzeFile
Performs both form and table analysis by using textract, a service that automatically extracts text and data from scanned documents.
DPSGetResultData
Queries DPS for a file based on the specified document identifier and returns the extracted text, key-value forms, or tables.
DPSGetHighlighted
Queries DPS for a file based on the specified document identifier and returns a PDF file with the highlighted entities.
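
The order in which a case might call these activities can be sketched as follows. This is only an illustration of the sequencing described above, written in Python for readability; in Pega Platform you invoke the activities from your case, not from Python. The activity names come from the list above, but `run_dps_steps`, `make_stub_runner`, the `document_id` key, and the stub return values are hypothetical stand-ins and are not part of any Pega API.

```python
from typing import Callable, Dict, List

Params = Dict[str, str]
# Hypothetical hook: "run the named activity with these parameters".
ActivityRunner = Callable[[str, Params], Params]

# Activity names from the documentation, in the order the text describes:
# extract, highlight, analyze forms and tables, then query for results.
DPS_STEPS: List[str] = [
    "DPSStartExtraction",
    "DPSStartHighlight",
    "DPSStartTextractAnalyzeFile",
    "DPSGetResultData",
    "DPSGetHighlighted",
]


def run_dps_steps(run_activity: ActivityRunner, file_name: str) -> Params:
    """Run each step in order, passing earlier results to later steps."""
    context: Params = {"file_name": file_name}
    for step in DPS_STEPS:
        # Each step receives a copy of everything gathered so far.
        context.update(run_activity(step, dict(context)))
    return context


def make_stub_runner(calls: List[str]) -> ActivityRunner:
    """Build a fake runner that records calls instead of contacting DPS."""

    def run_activity(name: str, params: Params) -> Params:
        calls.append(name)
        if name == "DPSStartExtraction":
            # Assumption: extraction yields an identifier used by later steps.
            return {"document_id": "doc-001"}
        if "document_id" not in params:
            raise ValueError(f"{name} needs a document identifier")
        return {name: "done"}

    return run_activity


if __name__ == "__main__":
    calls: List[str] = []
    result = run_dps_steps(make_stub_runner(calls), "invoice.png")
    print(calls)
    print(result["document_id"])
```

The stub makes the dependency explicit: the query and highlight steps rely on the document identifier produced when the file is first sent for extraction.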

DPS as an OCR extension

To use DPS as an OCR extension instead of automations, you can invoke the following activities in a custom solution for your application:

pyOCRTextExtractor
Invokes the DPSTextExtractor activity that is responsible for text extraction from an input file.
DPSTextMarkUp
Highlights entities; the results are specified in the ProcessedFileSource property. This string also contains a configuration file in JSON format that specifies which entities to highlight and the color definition for each. You invoke this activity by using the Data-DPService page class.
DPSTextForm
Analyzes forms in an input file by using textract, a service that automatically extracts text and data from scanned documents.
DPSTextTable
Analyzes tables in an input file by using textract, a service that automatically extracts text and data from scanned documents.
DPSTextAnalyze
Analyzes both forms and tables in an input file by using textract, a service that automatically extracts text and data from scanned documents.
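
To give a feel for the kind of JSON configuration that DPSTextMarkUp works with, the following sketch builds a string that maps entities to highlight colors. The actual schema is not described here, so the key names (`highlight`, `entity`, `color`), the color format, and the helper `build_highlight_config` are all assumptions made for illustration; check the product documentation for the real structure.

```python
import json
from typing import Dict


def build_highlight_config(entity_colors: Dict[str, str]) -> str:
    """Serialize an entity-to-color mapping as a JSON string.

    The layout below is hypothetical and is NOT the documented DPS schema.
    """
    entries = [
        {"entity": name, "color": color}
        for name, color in sorted(entity_colors.items())
    ]
    return json.dumps({"highlight": entries}, indent=2)


if __name__ == "__main__":
    # Example entity types taken from the introduction: names and dates.
    print(build_highlight_config({"name": "#00FF00", "date": "#FFFF00"}))
```

Sorting the entries keeps the generated string stable between runs, which makes it easier to compare configurations.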

DPS REST API

To access DPS functions externally in an application, you can also invoke a REST API to perform document processing of image-based files, such as OCR, analysis of forms and tables, and highlighting. For more information, see Document Processing Service REST API.