OCR methods
Use the OCR methods to convert documents that contain images into searchable text. The document can be an image, such as a faxed document, or a document that contains both text and images. This component can convert image files, PDF files, and Word documents.

Use the GetProcessToXmlConfig method to produce information about the XML configuration. You can then pass that configuration information to the configXml parameter of the ProcessToXml method to change the way the output XML looks. This method returns an XML string that contains the configuration options. The GetProcessToXmlConfig table describes the parameters available for this method.

Use the ProcessToPdf method to extract text from the images in the input file and output a PDF file. You can then use the PdfConnector or PdfViewer components to extract data from this PDF file. You can use this method with three or eight parameters. If you use this method with three parameters, the defaults are used for the other five parameters. The ProcessToPdf table describes the parameters for this method.

Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as changing Search into 5earch. You can choose from the following options:

For the scanLanguage setting, you can choose from 459 scan languages, including the most widely used languages available, like English, Spanish, and Japanese, and also some combinations of languages, such as Korean and English. Additionally, there are scan language options that limit the character set in a way similar to how you can limit the search dictionary. The following are some examples:

Use the ProcessToText method to extract text from the images in the input file and output a string that contains the text. Use the ProcessToTextFile method to output the string to a text file. You can then use this string or text file in a Pega Robot Studio automation.
These methods can use either two or seven parameters. If you use these methods with two parameters, the defaults are used for the other five parameters. The ProcessToText and ProcessToTextFile table describes the parameters for these methods.

Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as changing Search into 5earch. You can choose from the following options:

For the scanLanguage setting, you can choose from 459 scan languages, including the most widely used languages available, like English, Spanish, and Japanese, and also some combinations of languages, such as Korean and English. Additionally, there are scan language options that limit the character set in a way similar to how you can limit the search dictionary. The following are some examples:

Use the ProcessToXml method to extract text from an image or document and export that text in XML format. The ProcessToXmlFile method exports that text to an XML file. You can use these methods with two, three, seven, or eight parameters. If you use the two-parameter or three-parameter variants, the defaults are used for the other parameters. The ProcessToXml and ProcessToXmlFile table describes the parameters for both methods.

Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as changing Search into 5earch. You can choose from the following options:

For the scanLanguage setting, you can choose from 459 scan languages, including the most widely used languages available, like English, Spanish, and Japanese, and also some combinations of languages, such as Korean and English. Additionally, there are scan language options that limit the character set in a way similar to how you can limit the search dictionary. The following are some examples:

GetProcessToXmlConfig
Parameter: Description
writeCharAttributes: Include the character attributes.
dontWriteBlocksCoordinates: Omit the block coordinates.
writeExtendedCharAttributes: Include the extended character attributes.
writeOriginalImageCoordinates: Include the original image coordinates.
writeNameOfBlock: Include the name of the XML block.
writeCharacterFormatting: Include character formatting information.
writeParagraphStyles: Include paragraph style information.
writePagesByElements: Include the pages by element.
writeAsciiCharAttributes: Include the ASCII character attributes.
writeWordRecognitionVariants: Include word recognition variants.
writeCharRecognitionVariants: Include character recognition variants.
writeLogicalStructure: Include logical structure information.
writeFontStyles: Include font style information.
writeOneCharForTab: Use one character to indicate tabs.
checkResult: Check the results.

ProcessToPdf
Parameter: Description
inputFile (String): Enter the complete path and name of the file that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).
outputFile (String): Enter the complete path and file name of the PDF file that the ProcessToPdf method creates.
exportWithoutImages (Boolean): Use this parameter to include or exclude the images in the output file. Set the parameter to True if you only want the OCR extracted text to display. Set the parameter to False to export the original image with the hidden text. You can select the text when you highlight the image in the PDF file.
ocrImagesAndText (Boolean): Use this parameter to determine if ABBYY processes the text in the input file through OCR or passes it directly to the PDF file, without modification. If you enter False, the text in the output file is included exactly as it appears in the input file. If you enter True, ABBYY evaluates the text before it creates the output file, which can lead to inaccurate results. The text in some documents might not be retrieved if you set this parameter to False. The default is False.
coloredBackground (Boolean): If the document has a dark background, set the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False.
lowResolutionText (Boolean): If the text in the document has a small font size, set the lowResolutionText parameter to True. Note that when the system processes images with this setting, the method takes longer to complete. The default is False.
ocrDictionaryType: Reduces the character set that ABBYY uses when converting an image into text. For example, AlphaOnly omits numbers from the character set.
scanLanguage: The scan language, or combination of languages, that is used for the conversion.

ProcessToText and ProcessToTextFile methods
Parameter: Description
inputFile (String): Enter the complete path and file name of the file that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).
extractedText (String): This output parameter contains the text that the system retrieves during DocumentOCR processing.
ocrImagesAndText (Boolean): Use this parameter to determine if ABBYY processes the text in the input file through OCR or passes it directly to the output, without modification. If you enter False, the text in the output is included exactly as it appears in the input file. If you enter True, ABBYY evaluates the text before it creates the output, which can produce inaccurate results. The text in some documents might not be retrieved if you set this parameter to False. The default is False.
coloredBackground (Boolean): If the document has a dark background, set the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False.
lowResolutionText (Boolean): If the text in the document has a small font size, set the lowResolutionText parameter to True. Note that when the system processes images with this setting, the method takes longer to complete. The default is False.
ocrDictionaryType: Reduces the character set that ABBYY uses when converting an image into text. For example, AlphaOnly omits numbers from the character set.
scanLanguage: The scan language, or combination of languages, that is used for the conversion.

ProcessToXml and ProcessToXmlFile methods
Parameter: Description
inputFile (String): Enter the complete path and name of the image file or document that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).
outputFile (String): Enter the complete path and file name of the XML file that the ProcessToXmlFile method creates.
configXml (String): Enter the complete path and file name of the XML configuration file that contains the parameters for formatting the XML output.
ocrImagesAndText (Boolean): Use this parameter to determine if ABBYY processes the text in the input file through OCR or passes it directly to the XML file, without modification. If you enter False, the text in the output file is included exactly as it appears in the input file. If you enter True, ABBYY evaluates the text before it creates the output file, which can lead to inaccurate results. The text in some documents might not be retrieved if you set this parameter to False. The default is False.
coloredBackground (Boolean): If the document has a dark background, set the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False.
lowResolutionText (Boolean): If the text in the document has a small font size, set the lowResolutionText parameter to True. Note that when the system processes images with this setting, the method takes longer to complete. The default is False.
ocrDictionaryType: Reduces the character set that ABBYY uses when converting an image into text. For example, AlphaOnly omits numbers from the character set.
scanLanguage: The scan language, or combination of languages, that is used for the conversion.
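
The effect of the ocrDictionaryType setting can be pictured with a small sketch. The Python code below is not ABBYY's implementation and does not call any Pega Robot Studio API. The candidate scores, the CHARSETS table, the "Default" name, and the pick_text function are all assumptions made up for illustration; only the AlphaOnly name comes from this topic. The sketch assumes that a recognizer ranks several candidate characters for each glyph and shows how removing digits from the allowed set changes the winner.

```python
import string

# Hypothetical character sets keyed by dictionary type. "AlphaOnly" is named in
# this topic; "Default" is a placeholder invented for this sketch.
CHARSETS = {
    "Default": set(string.ascii_letters + string.digits),
    "AlphaOnly": set(string.ascii_letters),
}


def pick_text(candidates, dictionary_type="Default"):
    """Pick the best-scoring allowed character at each position.

    candidates: one list per character position, each holding
    (character, confidence) pairs as a recognizer might rank them.
    """
    allowed = CHARSETS[dictionary_type]
    chosen = []
    for position in candidates:
        permitted = [(ch, score) for ch, score in position if ch in allowed]
        # Fall back to the unrestricted candidates if nothing is permitted.
        pool = permitted or position
        best_char, _ = max(pool, key=lambda pair: pair[1])
        chosen.append(best_char)
    return "".join(chosen)


# Made-up scores: the first glyph looks slightly more like "5" than "S".
scores = [
    [("5", 0.51), ("S", 0.49)],
    [("e", 0.99)],
    [("a", 0.98)],
    [("r", 0.97)],
    [("c", 0.99)],
    [("h", 0.99)],
]

print(pick_text(scores))               # 5earch
print(pick_text(scores, "AlphaOnly"))  # Search
```

With the unrestricted set, the slightly higher score for 5 wins and the result is 5earch. With the AlphaOnly set, 5 is no longer a candidate, so the result is Search. The same reasoning explains why a restricted set should only be used when you know the excluded characters do not occur in the image.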