Document Processing Service component
The Document Processing Service (DPS) complements Pega Intelligent Virtual Assistants (IVAs) and Pega Email Bots by analyzing image-based files for text and other information. For example, by using DPS, an email bot can automatically extract key information from a PNG file attachment, such as names, locations, dates, and postal codes.
DPS provides optical character recognition (OCR), highlighting, and analysis of forms and tables in image-based attachments. You can also use the DPS component as an extension point or in automations for custom solutions in your application.
You can optimize DPS to fit your requirements by specifying a profile for the OCR and highlighting services, and by modifying additional custom parameters in a data transform. For more information, see Configuring the Document Processing Service component and Document Processing Service profiles.
DPS in automations
To use automations for your Pega Platform application, invoke the DPS component activities in your case. As a result, you can process an image-based document, such as a PDF file, and then automatically use the results in another step or stage in your application. For example, your application can use automation activities to upload a file; build a search configuration; send the document for extraction; perform highlighting; perform form and table analysis; and finally query the DPS component for results.
The following DPS automation activities are available in your application:
- DPSStartExtraction
- Starts the extraction by sending the input file for processing to DPS.
- DPSStartHighlight
- Highlights entities in the input file.
- DPSStartTextractFormAnalysis
- Performs form analysis by using Textract, a service that automatically extracts text and data from scanned documents.
- DPSStartTextractTableAnalysis
- Performs table analysis by using Textract, a service that automatically extracts text and data from scanned documents.
- DPSStartTextractAnalyzeFile
- Performs both form and table analysis by using Textract, a service that automatically extracts text and data from scanned documents.
- DPSGetResultData
- Queries DPS for a file based on the specified document identifier and returns the extracted text, key-value forms, or tables.
- DPSGetHighlighted
- Queries DPS for a file based on the specified document identifier and returns a PDF file with the highlighted entities.
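The order in which a case might call these activities can be sketched as follows. This is only an illustration, not Pega code: the activity names come from the list above, while the helper plan_dps_steps, its parameters, and the assumption that extraction always runs first and result queries run last are hypothetical and follow the example sequence described earlier in this section.

```python
from typing import List, Optional

# Activity names are taken from the list above. The mapping keys
# ("form", "table", "both") are hypothetical labels for this sketch.
TEXTRACT_ACTIVITIES = {
    "form": "DPSStartTextractFormAnalysis",
    "table": "DPSStartTextractTableAnalysis",
    "both": "DPSStartTextractAnalyzeFile",
}


def plan_dps_steps(highlight: bool = False,
                   analysis: Optional[str] = None) -> List[str]:
    """Return an ordered list of DPS automation activity names.

    Assumed order (not confirmed by the product documentation):
    start extraction, optional highlighting, optional form/table
    analysis, then the activities that query DPS for results.
    """
    if analysis is not None and analysis not in TEXTRACT_ACTIVITIES:
        raise ValueError(f"unknown analysis type: {analysis!r}")

    steps = ["DPSStartExtraction"]
    if highlight:
        steps.append("DPSStartHighlight")
    if analysis is not None:
        steps.append(TEXTRACT_ACTIVITIES[analysis])

    # Result queries come last, once processing has been requested.
    steps.append("DPSGetResultData")
    if highlight:
        steps.append("DPSGetHighlighted")
    return steps


if __name__ == "__main__":
    for name in plan_dps_steps(highlight=True, analysis="both"):
        print(name)
```

In a real application, each name corresponds to an automation step that you configure in the case; the helper only makes the ordering explicit.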
DPS as an OCR extension
To use DPS as an OCR extension instead of automations, you can invoke the following activities in a custom solution for your application:
- pyOCRTextExtractor
- Invokes the DPSTextExtractor activity that is responsible for text extraction from an input file.
- DPSTextMarkUp
- Highlights entities and returns the results in the ProcessedFileSource property. This string also contains a configuration file in JSON format that specifies which entities to highlight and the color definition. You invoke this activity by using the Data-DPService page class.
- DPSTextForm
- Analyzes forms in an input file by using Textract, a service that automatically extracts text and data from scanned documents.
- DPSTextTable
- Analyzes tables in an input file by using Textract, a service that automatically extracts text and data from scanned documents.
- DPSTextAnalyze
- Analyzes both forms and tables in an input file by using Textract, a service that automatically extracts text and data from scanned documents.
DPS REST API
To access DPS functions externally in an application, you can also invoke a REST API to process image-based files, for example, to perform OCR, analysis of forms and tables, and highlighting. For more information, see Document Processing Service REST API.