OCR methods

Updated on October 19, 2022

Use the OCR methods to convert documents that contain images into searchable text.

The document can be an image, such as a faxed document, or a document that contains both text and images. This component can convert image files, PDF files, and Word documents.

Note: The image file types include .png, .jpeg, .bmp, .gif, and .tiff. Graphic file support is provided by ABBYY FineReader. For a comprehensive list of supported types, see Supported Image Formats.

GetProcessToXmlConfig

Use the GetProcessToXmlConfig method to produce information about the XML configuration. You can then pass that configuration information to the configXml parameter of the ProcessToXml method, to change the way the output XML looks.

This method returns an XML string that contains the configuration options.

Note: The following parameter settings come from ABBYY FineReader. You can turn them on or off to alter the XML output. For example, the output can contain one XML line per recognized character or, with different settings, one XML line per recognized line of text.

The following table describes the parameters available for this method.

| Parameter | Description |
| --- | --- |
| writeCharAttributes | Include the character attributes. |
| dontWriteBlocksCoordinates | Omit the block coordinates. |
| writeExtendedCharAttributes | Include the extended character attributes. |
| writeOriginalImageCoordinates | Include the original image coordinates. |
| writeNameOfBlock | Include the name of the XML block. |
| writeCharacterFormatting | Include character formatting information. |
| writeParagraphStyles | Include paragraph style information. |
| writePagesByElements | Include the pages by element. |
| writeAsciiCharAttributes | Include the ASCII character attributes. |
| writeWordRecognitionVariants | Include word recognition variants. |
| writeCharRecognitionVariants | Include character recognition variants. |
| writeLogicalStructure | Include logical structure information. |
| writeFontStyles | Include font style information. |
| writeOneCharForTab | Use one character to indicate tabs. |
| checkResult | Check the results. |
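
As a rough illustration of switching these options on or off, the following Python sketch builds an XML string with one true/false attribute per option. The option names come from the table above. The root element name, the attribute layout, and the build_config_xml helper are assumptions made for this example; they do not describe the XML that GetProcessToXmlConfig actually returns.

```python
import xml.etree.ElementTree as ET

# Option names are taken from the table above. The element and attribute
# layout is a hypothetical stand-in, not the schema that the method returns.
OPTION_NAMES = [
    "writeCharAttributes",
    "dontWriteBlocksCoordinates",
    "writeExtendedCharAttributes",
    "writeOriginalImageCoordinates",
    "writeNameOfBlock",
    "writeCharacterFormatting",
    "writeParagraphStyles",
    "writePagesByElements",
    "writeAsciiCharAttributes",
    "writeWordRecognitionVariants",
    "writeCharRecognitionVariants",
    "writeLogicalStructure",
    "writeFontStyles",
    "writeOneCharForTab",
    "checkResult",
]


def build_config_xml(enabled):
    """Return an XML string with one attribute per option, set to true or false."""
    unknown = set(enabled) - set(OPTION_NAMES)
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    root = ET.Element("XmlExportConfig")  # hypothetical root element name
    for name in OPTION_NAMES:
        root.set(name, "true" if name in enabled else "false")
    return ET.tostring(root, encoding="unicode")


# Turn on two options and leave the rest off.
config_xml = build_config_xml({"writeCharAttributes", "writeParagraphStyles"})
print(config_xml)
```

In a real automation, you would start from the string that GetProcessToXmlConfig returns and change only the values that you need.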

ProcessToPdf

Use the ProcessToPdf method to extract text from the images in the input file and output a PDF file. You can then use the PdfConnector or PdfViewer components to extract data from this PDF file.

You can use this method with three or eight parameters. If you use this method with three parameters, the defaults are used for the other five parameters.

The following table describes the parameters for this method.

| Parameter | Description |
| --- | --- |
| inputFile | (String) Enter the complete path and file name of the file that you want to process. You can convert image files, PDF files (.pdf), and Word documents (.doc and .docx). |
| outputFile | (String) Enter the complete path and file name of the PDF file that the ProcessToPdf method creates. |
| exportWithoutImages | (Boolean) Use this parameter to include or exclude the images in the output file. Set the parameter to True if you only want the OCR extracted text to display. Set the parameter to False to export the original image with the hidden text. You can select the text when you highlight the image in the PDF file. |
| ocrImagesAndText | (Boolean) Use this parameter to determine if ABBYY processes the text in the input file through OCR or passes it directly to the PDF file, without modification. If you enter False, the text in the output file is included exactly as it appears in the input file. If you enter True, ABBYY evaluates the text before it creates the output file, which can lead to inaccurate results. The text in some documents might not be retrieved if you set this parameter to False. The default is False. |
| coloredBackground | (Boolean) If the document has a dark background, change the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False. |
| lowResolutionText | (Boolean) If the text in the document has a small font size, set the lowResolutionText parameter to True. Note that when the system converts images with this setting, the method takes longer to complete. The default is False. |
ocrDictionaryType

Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as reading "Search" as "5earch". You can choose from the following options:

  • Normal - Uses a character set that is normal for the language that you are scanning. The default is Normal.
  • AlphaOnly - Limits the character set to only a-z and A-Z.
  • NumOnly - Limits the character set to only 0-9.
  • AlphaAndSymbolsOnly - Limits the character set to a-z and A-Z and special characters such as hyphens (-), colons (:), and other punctuation marks.
  • NumAndSymbolsOnly - Limits the character set to 0-9 and special characters such as hyphens (-), colons (:), dollar signs ($), and other punctuation marks.
scanLanguage

You can choose from 459 scan languages, including the most widely used languages, such as English, Spanish, and Japanese, and also some combinations of languages, such as Korean and English. Additionally, some scan language options limit the character set in a way similar to the ocrDictionaryType setting. The following are some examples:

  • English US WellKnownCode SSN
  • English UK CurrencyByDigits
  • English DateTime MonthByWords

Note: Farsi, Arabic, Hebrew, Vietnamese, and Thai are not included. The default is English.
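
The relationship between the three-parameter and eight-parameter forms can be pictured with the following Python sketch. It models the parameters and the defaults stated in the table as a small data class. The ProcessToPdfArgs class, the snake_case field names, and the file paths are hypothetical and exist only for this illustration; the real method is called from an automation, not from Python.

```python
from dataclasses import dataclass

# Option names from the ocrDictionaryType list above.
DICTIONARY_TYPES = {
    "Normal",
    "AlphaOnly",
    "NumOnly",
    "AlphaAndSymbolsOnly",
    "NumAndSymbolsOnly",
}


@dataclass(frozen=True)
class ProcessToPdfArgs:
    """Hypothetical container for the eight ProcessToPdf parameters."""

    input_file: str
    output_file: str
    export_without_images: bool
    # The remaining five fields carry the defaults stated in the table.
    ocr_images_and_text: bool = False
    colored_background: bool = False
    low_resolution_text: bool = False
    ocr_dictionary_type: str = "Normal"
    scan_language: str = "English"

    def __post_init__(self):
        if self.ocr_dictionary_type not in DICTIONARY_TYPES:
            raise ValueError(
                f"Unsupported ocrDictionaryType: {self.ocr_dictionary_type}"
            )


# Three-parameter form: the other five values fall back to their defaults.
three = ProcessToPdfArgs(r"C:\scans\invoice.tiff", r"C:\scans\invoice.pdf", False)

# Eight-parameter form: every value is stated explicitly.
eight = ProcessToPdfArgs(
    r"C:\scans\invoice.tiff",
    r"C:\scans\invoice.pdf",
    False,
    True,
    True,
    False,
    "NumAndSymbolsOnly",
    "English",
)

print(three)
print(eight)
```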

ProcessToText and ProcessToTextFile methods

Use the ProcessToText method to extract text from the images in the input file and output a string that contains the text. Use the ProcessToTextFile method to output the string to a text file.

You can then use this string or text file in a Pega Robot Studio automation. These methods can use either two or seven parameters. If you use these methods with two parameters, the defaults are used for the other five parameters.

The following table describes the parameters for these methods.

| Parameter | Description |
| --- | --- |
| inputFile | (String) Enter the complete path and file name of the file that you want to process. You can convert image files, PDF files (.pdf), and Word documents (.doc and .docx). |
| extractedText | (String) This output parameter contains the text that the system retrieves during DocumentOCR processing. |
| ocrImagesAndText | (Boolean) Use this parameter to determine if ABBYY processes the text in the input file through OCR or passes it directly to the output, without modification. If you enter False, the text in the output is included exactly as it appears in the input file. If you enter True, ABBYY evaluates the text before it creates the output, which can produce inaccurate results. The text in some documents might not be retrieved if you set this parameter to False. The default is False. |
| coloredBackground | (Boolean) If the document has a dark background, change the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False. |
| lowResolutionText | (Boolean) If the text in the document has a small font size, set the lowResolutionText parameter to True. Note that when the system converts images with this setting, the method takes longer to complete. The default is False. |
ocrDictionaryType

Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as reading "Search" as "5earch". You can choose from the following options:

  • Normal - Uses a character set that is normal for the language that is being scanned. The default is Normal.
  • AlphaOnly - Limits the character set to only a-z and A-Z.
  • NumOnly - Limits the character set to only 0-9.
  • AlphaAndSymbolsOnly - Limits the character set to a-z and A-Z and special characters such as hyphens (-), colons (:), and other punctuation marks.
  • NumAndSymbolsOnly - Limits the character set to 0-9 and special characters such as hyphens (-), colons (:), dollar signs ($), and other punctuation marks.
scanLanguage

You can choose from 459 scan languages, including the most widely used languages, such as English, Spanish, and Japanese, and also some combinations of languages, such as Korean and English. Additionally, some scan language options limit the character set in a way similar to the ocrDictionaryType setting. The following are some examples:

  • English US WellKnownCode SSN
  • English UK CurrencyByDigits
  • English DateTime MonthByWords

Note: Farsi, Arabic, Hebrew, Vietnamese, and Thai are not included. The default is English.
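
To make the effect of the ocrDictionaryType options more concrete, the following Python sketch checks whether a piece of text fits the character set that each option allows. It only models the idea of a restricted character set and is not how ABBYY performs recognition. The use of Python's string.punctuation for "special characters", the decision to always allow whitespace, and the fits_dictionary helper are assumptions made for this example.

```python
import string

LETTERS = set(string.ascii_letters)  # a-z and A-Z
DIGITS = set(string.digits)          # 0-9
SYMBOLS = set(string.punctuation)    # assumed stand-in for "special characters"

# Normal is left out because its character set depends on the scan language.
ALLOWED = {
    "AlphaOnly": LETTERS,
    "NumOnly": DIGITS,
    "AlphaAndSymbolsOnly": LETTERS | SYMBOLS,
    "NumAndSymbolsOnly": DIGITS | SYMBOLS,
}


def fits_dictionary(text, dictionary_type):
    """Return True if every non-space character of text is in the allowed set."""
    if dictionary_type == "Normal":
        return True  # not modeled here; Normal depends on the scan language
    allowed = ALLOWED[dictionary_type]
    return all(ch in allowed or ch.isspace() for ch in text)


print(fits_dictionary("Search", "AlphaOnly"))          # True
print(fits_dictionary("5earch", "AlphaOnly"))          # False
print(fits_dictionary("$12-50", "NumAndSymbolsOnly"))  # True
```

With AlphaOnly, a result such as "5earch" falls outside the allowed character set, which is why restricting the dictionary can reduce this kind of misreading.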

ProcessToXml and ProcessToXmlFile methods

Use the ProcessToXml method to extract text from an image or document and export that text in XML format. The ProcessToXmlFile method exports that text to an XML file.

You can use these methods with two, three, seven, or eight parameters. If you use the two-parameter or three-parameter variants, the defaults are used for the other parameters.

The following table describes the parameters for both methods.

| Parameter | Description |
| --- | --- |
| inputFile | (String) Enter the complete path and file name of the image file or document that you want to process. You can convert image files, PDF files (.pdf), and Word documents (.doc and .docx). |
| outputFile | (String) Enter the complete path and file name of the XML file that the ProcessToXml method creates. |
| configXml | (String) Enter the complete path and file name of the XML configuration file that contains the parameters for formatting the XML output. |
| ocrImagesAndText | (Boolean) Use this parameter to determine if ABBYY processes the text in the input file through OCR or passes it directly to the XML file, without modification. If you enter False, the text in the output file is included exactly as it appears in the input file. If you enter True, ABBYY evaluates the text before it creates the output file, which can lead to inaccurate results. The text in some documents might not be retrieved if you set this parameter to False. The default is False. |
| coloredBackground | (Boolean) If the document has a dark background, change the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False. |
| lowResolutionText | (Boolean) If the text in the document has a small font size, set the lowResolutionText parameter to True. Note that when the system converts images with this setting, the method takes longer to complete. The default is False. |
ocrDictionaryType

Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as reading "Search" as "5earch". You can choose from the following options:

  • Normal - Uses a character set that is normal for the language that you are scanning. The default is Normal.
  • AlphaOnly - Limits the character set to only a-z and A-Z.
  • NumOnly - Limits the character set to only 0-9.
  • AlphaAndSymbolsOnly - Limits the character set to a-z and A-Z and special characters such as hyphens (-), colons (:), and other punctuation marks.
  • NumAndSymbolsOnly - Limits the character set to 0-9 and special characters such as hyphens (-), colons (:), dollar signs ($), and other punctuation marks.
scanLanguage

You can choose from 459 scan languages, including the most widely used languages, such as English, Spanish, and Japanese, and also some combinations of languages, such as Korean and English. Additionally, some scan language options limit the character set in a way similar to the ocrDictionaryType setting. The following are some examples:

  • English US WellKnownCode SSN
  • English UK CurrencyByDigits
  • English DateTime MonthByWords

Note: Farsi, Arabic, Hebrew, Vietnamese, and Thai are not included. The default is English.
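
Once you have the XML output as a string, one common way to consume it outside an automation is to walk the document and collect its text nodes. The following Python sketch shows that idea with the standard library. The sample XML is invented for the example and does not reflect the element names that ProcessToXml produces, because the actual structure depends on the configuration options that you choose. The collect_text helper is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample; the real output structure depends on the configXml options.
SAMPLE_XML = """<document>
  <page>
    <line>Invoice 1042</line>
    <line>Total due: $12.50</line>
  </page>
</document>"""


def collect_text(xml_string):
    """Return the non-empty text chunks of an XML string, in document order."""
    root = ET.fromstring(xml_string)
    return [chunk.strip() for chunk in root.itertext() if chunk.strip()]


lines = collect_text(SAMPLE_XML)
print(lines)  # ['Invoice 1042', 'Total due: $12.50']
```

Because the helper ignores element names, it keeps working if you later change the configuration, although any per-character or per-line attributes in the output are not used.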
