Installing the Pega OCR component
The Optical Character Recognition (OCR) component allows the system to analyze text contained in image-based email attachments. You use this capability in a Pega Email Bot™ to improve the text analysis of emails from users. The Pega OCR component obtains content from image-based PDF, JPG, PNG, and TIFF files and converts it into electronic text format. This text is then analyzed as though it were contained in the body of the email. The Pega OCR component also provides PDF file entity highlighting of analyzed documents in an Email Bot.
- Installation procedure
- Prerequisites
- Installing ABBYY FineReader 12
- Installing the Pega OCR component
- Verifying Pega OCR component installation
Installation procedure
You must install the Pega OCR component on premises on a Linux server running an instance of Pega Platform™. You first obtain the ABBYY FineReader Installation file and use it to install the ABBYY FineReader 12 application used for optical character recognition on a Linux server. You must also import the Pega OCR component to a Pega Platform instance running on the same Linux server. You obtain both the ABBYY FineReader installation file and Pega OCR component .zip file from the Pega Marketplace.
The ABBYY FineReader Installation (abbyy_installation_pegaXX_vYYYYmmDD.zip) file available from Pega Marketplace consists of the following files and folders:
- abbyy_installation.sh - The installation script used to install ABBYY FineReader 12.
- license/*.locallicense - An open license for the ABBYY FineReader 12 application provided by Pegasystems.
The Pega OCR component .zip file available from Pega Marketplace consists of the following files and folders:
- lib/pega.ocr.component.jar - The JAR file that you must import to Pega Platform. It contains a Pega library used to control the multi-threaded usage of ABBYY FineReader Engine and an ABBYY FineReader 12 library called com.abbyy.FREngine.jar.
- component - The component that contains the integration bits to allow OCR capability in email file attachments.
Prerequisites
You can use the abbyy_installation.sh script to install the ABBYY FineReader 12 application only on the following Linux versions:
- Ubuntu/Debian 18.04
- CentOS 6.9 and 7.0
Although the script is designed to make system changes safely, ensure that you have a backup of the system before you run the abbyy_installation.sh script.
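The supported-version list above can be checked before you run the installer. The following is a minimal sketch and not part of the Pega or ABBYY tooling: is_supported_os is a hypothetical helper, and it assumes that "Ubuntu/Debian 18.04" corresponds to the ubuntu ID with version 18.04 in /etc/os-release. Older releases such as CentOS 6 may not provide /etc/os-release, in which case the usage part is skipped and you can pass the values by hand.

```shell
#!/bin/sh
# Returns 0 when the distribution ID and version match the supported list
# from the Prerequisites section; returns 1 otherwise.
# Assumption: the IDs "ubuntu" and "centos" as reported by /etc/os-release.
is_supported_os() {
    os_id=$1
    os_version=$2
    case "$os_id:$os_version" in
        ubuntu:18.04*) return 0 ;;
        centos:6.9*|centos:7|centos:7.0*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage: read the values from /etc/os-release when that file exists.
if [ -r /etc/os-release ]; then
    . /etc/os-release
    if is_supported_os "${ID:-unknown}" "${VERSION_ID:-unknown}"; then
        echo "Supported: ${ID:-unknown} ${VERSION_ID:-unknown}"
    else
        echo "Not in the supported list: ${ID:-unknown} ${VERSION_ID:-unknown}"
    fi
fi
```

The function takes the ID and version as arguments so that it can also be called with values read from another source, such as /etc/centos-release.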
Installing ABBYY FineReader 12
You must install ABBYY FineReader 12 on the Linux server of each Pega Platform instance by using an installation script. You can run this script with the following optional parameters:
- -u - Preconfigures the Linux dynamic linker configuration (ld.so.conf, referred to here as LD SO config), which holds the paths of directories that contain dynamic libraries. ABBYY FineReader 12 requires this change so that the application can find all its native libraries in the <path>/ABBYY/Bin folder.
- -i - Installs the Microsoft TrueType fonts and required libraries.
- Log in to a Linux server that is running a Pega Platform instance using the Secure Shell (SSH) protocol.
- Extract the files in the abbyy_installation_pega81_vYYYYmmDD.zip file obtained from Pega Marketplace to a directory.
- Run the following script as root in one of the following two ways. You must set the installation path so that the web applications of the app server have read access to this path.
- With optional parameters: ./abbyy_installation.sh -c install -u -i -d <abbyy_installation_path>
- Without parameters: ./abbyy_installation.sh -c install -d <abbyy_installation_path>
- Check whether the installation was successful:
- If there are dependency problems, trace what package needs the required dependency and install it.
- After you fix all the dependency problems, run the health check of the installation:
./abbyy_installation.sh -c check -d <abbyy_installation_path>
- Create ABBYY FineReader 12 data and temp folders:
- To create the default data directory for ABBYY FineReader 12, run the following command:
mkdir -p "/var/lib/ABBYY/SDK/12/FineReader Engine"; chown -R <user>:<group> "/var/lib/ABBYY"
where <user> and <group> must be replaced with the owner and group of the Java process.
- To create the default temp directory for ABBYY FineReader 12, run the following command:
mkdir -p "/tmp/ABBYY FineReader Engine 12"; chown <user>:<group> "/tmp/ABBYY FineReader Engine 12"
where <user> and <group> must be replaced with the owner and group of the Java process.
- Restart your Linux system so that the Tomcat server configuration is refreshed and the LD SO config changes are applied.
- Repeat steps 1 through 6 for each Linux server that contains a Pega Platform instance.
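Because the steps above must be repeated on every server, it can help to print the commands for review before running them as root. The following is a minimal sketch under stated assumptions: print_install_plan is a hypothetical helper, and the example path and owner (/opt/abbyy, tomcat:tomcat) are placeholders rather than defaults from the installer. The function only prints the commands from steps 3 through 5 and executes nothing.

```shell
#!/bin/sh
# Prints the commands from steps 3 through 5 for one server, in order.
# Nothing is executed; review the output, then run the commands as root.
print_install_plan() {
    install_path=$1   # value for -d <abbyy_installation_path>
    owner=$2          # <user>:<group> of the Java process
    data_dir="/var/lib/ABBYY/SDK/12/FineReader Engine"
    temp_dir="/tmp/ABBYY FineReader Engine 12"

    printf '%s\n' "./abbyy_installation.sh -c install -u -i -d \"$install_path\""
    printf '%s\n' "./abbyy_installation.sh -c check -d \"$install_path\""
    printf '%s\n' "mkdir -p \"$data_dir\""
    printf '%s\n' "chown -R $owner \"/var/lib/ABBYY\""
    printf '%s\n' "mkdir -p \"$temp_dir\""
    printf '%s\n' "chown $owner \"$temp_dir\""
}

# Example values are placeholders, not defaults from the installer.
print_install_plan "/opt/abbyy" "tomcat:tomcat"
```

Drop the -u and -i flags from the first printed command if you prefer to configure the LD SO config path and the fonts manually.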
Installing the Pega OCR component
Before using the OCR capability in Pega Platform and an Email Bot, install the Pega OCR component for a Pega Platform instance. If the Pega OCR component is not available in Pega Platform, import the component from Pega Marketplace to Pega Platform first.
- Log in to Pega Platform.
- In Dev Studio, click the name of your application, and click .
- In the Enabled components section, make sure that the Pega OCR component is displayed in the list. Enabled components section - Application rule form
- If the Pega OCR component is not listed, perform the following steps to install it.
- Click .
- In the Available components section, select the Enabled check box for the Pega OCR component.
If the Pega OCR component is not displayed in the section:
- Obtain a .zip file for the component from Pega Marketplace, for example: Pega OCR Component.zip.
- Extract the .zip file contents to get access to the /component folder.
- Click to install the file on Pega Platform. If the installation is successful, the component is displayed in the Available components section.
- Select the Enabled check box for the Pega OCR component.
- Click . The Pega OCR component is displayed in the Enabled components section.
- Click .
- Import the pega.ocr.component.jar file that is part of the .zip file that you obtained from Pega Marketplace:
- In Dev Studio, click > > > .
- Click Local file and then click and select the pega.ocr.component.jar file from a directory. Application Import wizard
- Click and follow the Import wizard instructions to import the JAR file to the Customer 06-01-01 codeset rule.
- Restart the Pega Platform instance to make sure that the imported JAR file is visible in the classpath.
Verifying Pega OCR component installation
To verify the configuration of the Pega OCR component installation files in Pega Platform:
- Log in to Pega Platform.
- From the App explorer, search for the Data-AbbyyFineReader rule, and in the Data Model > Data Transform section, open the configureAbbyyFREngine rule.
- Verify the following parameters:
- Param.abbyySdkPath - Specifies the path to the /Bin folder of the ABBYY FineReader 12 installation, for example: <abbyy_installation_path>/FREngine12/Bin.
- Param.abbyyLicensePath - Specifies the path to the license that is provided by Pegasystems, for example: <abbyy_installation_path>/licenses/pega.locallicense. The license file is automatically installed when you run the script during ABBYY FineReader 12 installation.
- Param.abbyyDataFolder - Specifies the full path to the ABBYY FineReader 12 data folder. It is created and managed by ABBYY FineReader. The default path is: /var/lib/ABBYY/SDK/12/FineReader Engine. Make sure that the Java process has read and write access rights to this folder.
- Param.abbyyTempPath - Specifies the full path to the ABBYY FineReader 12 temporary folder. The default path is: /tmp/ABBYY FineReader Engine 12. Update the value if you want to use another directory. Make sure that the Java process has read and write access rights to this folder.
- If you modified the configureAbbyyFREngine rule, save the ruleset to your application and check in the changes.
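The read and write access requirement for the data and temp folders can also be checked from the shell. The following is a minimal sketch, assuming the default paths listed above: check_rw_dir is a hypothetical helper, and the check only reflects the user who runs it, so run it as the Java process owner.

```shell
#!/bin/sh
# Reports whether a directory exists and is readable and writable
# by the user who runs this script.
check_rw_dir() {
    dir=$1
    if [ -d "$dir" ] && [ -r "$dir" ] && [ -w "$dir" ]; then
        echo "OK: $dir"
        return 0
    fi
    echo "MISSING OR NOT READ/WRITE: $dir"
    return 1
}

# Default paths from the parameter list. The "|| true" keeps the script
# going so that both folders are always reported.
check_rw_dir "/var/lib/ABBYY/SDK/12/FineReader Engine" || true
check_rw_dir "/tmp/ABBYY FineReader Engine 12" || true
```

If you changed Param.abbyyDataFolder or Param.abbyyTempPath, pass those values to check_rw_dir instead of the defaults.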
To verify that the Pega OCR component installation was successful, check whether you can use the OCR capability with an Email Bot.
- Log in to Pega Platform.
- Create an Email channel to test the Pega OCR component. For more information, see Creating an Email channel.
- In the Text analytics section of the Behavior tab, in the Analyze email attachments list, click Always.
- Click Save.
- Send a test email that also contains a PDF file attachment with OCR content to the operator email account defined for the Email channel and verify that entities were extracted from the PDF file.
Troubleshooting Pega OCR component installation
Failed to compile generated Java in ABBY error
If you see an error in the logs or tracer that states that the system failed to compile the generated Java in ABBY, make sure that you performed step 6 in the Installing the Pega OCR component procedure and that Pega Platform was restarted.
Other errors
If the test email that you sent was not analyzed correctly or the verification steps described above fail, examine the Pega Platform logs to obtain more information about the problem. Search the Pega Platform log files for the AbbyyFineReader entries.
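The log search described above can be done with grep. The following is a minimal sketch: find_abbyy_entries is a hypothetical helper, and the log file location is an assumption that depends on your deployment, so the example call is left commented out.

```shell
#!/bin/sh
# Prints every line that mentions AbbyyFineReader, prefixed with its line number.
# Returns a non-zero status when the file has no such entries or cannot be read.
find_abbyy_entries() {
    log_file=$1
    grep -n "AbbyyFineReader" "$log_file"
}

# The log location is an assumption; pass the path used by your deployment.
# find_abbyy_entries "/path/to/PegaRULES.log"
```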
Updating LD SO config path manually
To configure the ld.so.conf (LD SO config) path manually, run the following command as root:
echo "<local_path>/ABBYY/Bin" > /etc/ld.so.conf.d/abbyy.conf && ldconfig
Make sure to replace the <local_path> string above with a path to the /ABBYY/Bin folder.
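After the manual update, you can confirm that the configuration file contains the expected line. The following is a minimal sketch: conf_has_path is a hypothetical helper, and <local_path> remains a placeholder for your own installation path.

```shell
#!/bin/sh
# Succeeds when the given dynamic linker configuration file is readable
# and contains the library path as an exact, whole line.
conf_has_path() {
    conf_file=$1
    lib_path=$2
    [ -r "$conf_file" ] && grep -Fxq -- "$lib_path" "$conf_file"
}

# Example check; <local_path> stays a placeholder for your installation path.
# conf_has_path /etc/ld.so.conf.d/abbyy.conf "<local_path>/ABBYY/Bin" && echo "configured"
# To inspect the rebuilt loader cache, list it with: ldconfig -p
```

The whole-line match avoids a false positive when only a parent directory of the Bin folder is listed in the file.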
Installing Microsoft TrueType fonts manually
To install Microsoft TrueType fonts manually on an Ubuntu/Debian server, run the following commands as root or using sudo:
- apt-get update
- apt-get install ttf-mscorefonts-installer -y --force-yes
On an Ubuntu/Debian server, if you see error messages during installation, for example: 'library.so -> dependency libgomp1 not found', you must also install libgomp1. Run the following command as root:
apt-get install libgomp1
To install Microsoft TrueType fonts manually on a CentOS server, run the following commands as root or using sudo:
- yum install epel-release -y
- yum install curl cabextract fontconfig -y
- yum install https://downloads.sourceforge.net/project/mscorefonts2/rpms/msttcore-fonts-installer-2.6-1.noarch.rpm -y