PRPC V6.2 includes the Apache UIMA Java framework version 2.3.0. This allows your application to call upon the powerful text analysis and searching facilities of the Unstructured Information Management Architecture library. GRP-28877 GRP-24195
This feature requires advanced integration and Java skills. Visit the Apache UIMA site and contact Pegasystems for additional documentation.
The ConceptMapper is a highly configurable, high-performance dictionary lookup tool, implemented as a UIMA component. Using one of several text matching algorithms, it maps entries in a dictionary into input documents, producing UIMA annotations.
V6.2 incorporates a sample that demonstrates ConceptMapper. You can copy and extend this example. Pegasystems has modified the Java method setTokenizerDescriptor in class DictionaryResource to look up resources relative to the Java system property uima.datapath.
ConceptMapper uses these UIMA analysis engines:
Aggregate
Primitive(s)
Using the initially installed settings and this dictionary:
<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
<token canonical="United States" DOCNO="10000">
<variant base="United States"/>
<variant base="United States of America"/>
<variant base="USA"/>
</token>
<token canonical="New York City" DOCNO="10001">
<variant base = "New York City"/>
<variant base = "NYC"/>
<variant base = "Big Apple"/>
</token>
</synonym>
With the input string "The Big Apple is a nickname for New York City", ConceptMapper produces the following XMI (excerpted):
<?xml version="1.0" encoding="UTF-8" ?>
&help;
<cas:Sofa xmi:id="1" sofaNum="1" sofaID="_InitialView" mimeType="text" sofaString="The Big Apple is a nickname for New York City" />
<tcas:DocumentAnnotation xmi:id="8" sofa="1" begin="0" end="45" language="x-unspecified" />
<tokenizer:TokenAnnotation xmi:id="13" sofa="1" begin="0" end="3" text="the" tokenType="0" />
<tokenizer:TokenAnnotation xmi:id="23" sofa="1" begin="4" end="7" text="big" tokenType="0" />
…
<tokenizer:TokenAnnotation xmi:id="93" sofa="1" begin="36" end="40" text="york" tokenType="0" />
<tokenizer:TokenAnnotation xmi:id="103" sofa="1" begin="41" end="45" text="city" tokenType="0" />
<conceptMapper:DictTerm xmi:id="113" sofa="1" begin="4" end="13" DictCanon="New York City" enclosingSpan="8" matchedText="Big Apple" matchedTokens="23 33" />
<conceptMapper:DictTerm xmi:id="125" sofa="1" begin="32" end="45" DictCanon="New York City" enclosingSpan="8" matchedText="New York City" matchedTokens="83 93 103" />
…
</xmi:XMI>
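The DictTerm elements in the excerpt carry character offsets (begin, end) into the sofaString, plus the canonical form in DictCanon. As a minimal sketch of how an application might read them, the following Java class parses a hand-written XMI fragment using only the JDK's DOM parser. The class name DictTermReader, the root element, and the namespace URIs in the sample are illustrative assumptions, not taken from the PRPC sample; real output declares its own namespaces.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DictTermReader {

    // Hand-written sample based on the excerpt above.
    // The namespace URIs are placeholders (assumption), not the real UIMA ones.
    static final String SAMPLE =
        "<xmi:XMI xmlns:xmi=\"urn:example:xmi\" xmlns:cas=\"urn:example:cas\""
      + " xmlns:conceptMapper=\"urn:example:conceptMapper\">"
      + "<cas:Sofa xmi:id=\"1\" sofaNum=\"1\" sofaID=\"_InitialView\" mimeType=\"text\""
      + " sofaString=\"The Big Apple is a nickname for New York City\"/>"
      + "<conceptMapper:DictTerm xmi:id=\"113\" sofa=\"1\" begin=\"4\" end=\"13\""
      + " DictCanon=\"New York City\" matchedText=\"Big Apple\"/>"
      + "<conceptMapper:DictTerm xmi:id=\"125\" sofa=\"1\" begin=\"32\" end=\"45\""
      + " DictCanon=\"New York City\" matchedText=\"New York City\"/>"
      + "</xmi:XMI>";

    // Returns one "covered text -> canonical form" entry per DictTerm element.
    static List<String> readDictTerms(String xmi) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xmi)));

        // The analyzed text is stored on the Sofa element.
        Element sofa = (Element) doc.getElementsByTagName("cas:Sofa").item(0);
        String text = sofa.getAttribute("sofaString");

        NodeList terms = doc.getElementsByTagName("conceptMapper:DictTerm");
        List<String> result = new ArrayList<>();
        for (int i = 0; i < terms.getLength(); i++) {
            Element term = (Element) terms.item(i);
            int begin = Integer.parseInt(term.getAttribute("begin"));
            int end = Integer.parseInt(term.getAttribute("end"));
            // begin is inclusive and end is exclusive, matching String.substring.
            result.add(text.substring(begin, end) + " -> " + term.getAttribute("DictCanon"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        for (String line : readDictTerms(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

Slicing the sofaString by begin and end, rather than trusting matchedText alone, keeps the sketch usable for annotation types that carry only offsets, such as TokenAnnotation.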
Five sample descriptor files are saved as text file rules. Your application can override these as needed.
Analysis Engine Descriptors
Type System Descriptors
The dictionary resource is also a text file rule, TextAnalysis.pyDictionary.xml.
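To illustrate how the dictionary format relates variants to canonical forms, the sketch below loads a dictionary in the structure shown earlier and builds a variant-to-canonical lookup table. This is not how ConceptMapper itself loads or matches entries; the class name DictionaryVariants and the inlined dictionary string are assumptions made for the example.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DictionaryVariants {

    // Inlined copy of the sample dictionary shown earlier (assumption: a real
    // application would read the text file rule content instead).
    static final String DICTIONARY =
        "<synonym>"
      + "<token canonical=\"United States\" DOCNO=\"10000\">"
      + "<variant base=\"United States\"/>"
      + "<variant base=\"United States of America\"/>"
      + "<variant base=\"USA\"/>"
      + "</token>"
      + "<token canonical=\"New York City\" DOCNO=\"10001\">"
      + "<variant base=\"New York City\"/>"
      + "<variant base=\"NYC\"/>"
      + "<variant base=\"Big Apple\"/>"
      + "</token>"
      + "</synonym>";

    // Maps each variant's base text to the canonical form of its token.
    static Map<String, String> variantToCanonical(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        Map<String, String> lookup = new LinkedHashMap<>();
        NodeList tokens = doc.getElementsByTagName("token");
        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            String canonical = token.getAttribute("canonical");
            NodeList variants = token.getElementsByTagName("variant");
            for (int j = 0; j < variants.getLength(); j++) {
                Element variant = (Element) variants.item(j);
                lookup.put(variant.getAttribute("base"), canonical);
            }
        }
        return lookup;
    }

    public static void main(String[] args) throws Exception {
        variantToCanonical(DICTIONARY)
            .forEach((variant, canonical) -> System.out.println(variant + " -> " + canonical));
    }
}
```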
The standard activity @baseclass.pxUIMAConceptMapper implements the component. This activity:
V6.2 includes these libraries in prprivate/libredist:
Supply as input to the activity values for these parameters:
content: The unstructured text to be annotated
resourceNames: Optional. Comma separated list of the XML resource/configuration files
analysisDescriptor: Optional. Name of the analysis engine descriptor
startIndex: Optional. Index at which to start the analysis (integer)
endIndex: Optional. Index at which to end the analysis (integer)
The activity's result is returned on the parameter page, as the value of a parameter named xmiOutput, an XML document in XML Metadata Interchange format. The output has two types – the TokenAnnotation, which is produced by the OffsetTokenizer annotator, and DictTerm which is produced by the ConceptMapperOffsetTokenizer annotator.
Create a Java system property uima.datapath, set to a directory path on the current server node's file system, with write access.
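A Java system property is typically supplied as a JVM argument of the form -Duima.datapath=<directory>; where that argument is configured depends on your application server. As a rough sketch, the following check reads the property and reports whether it points to an existing, writable directory. The class name UimaDataPathCheck and the method isUsableDataPath are hypothetical helpers, not part of PRPC or UIMA.

```java
import java.io.File;

public class UimaDataPathCheck {

    // Returns true when the path names an existing directory that this JVM can write to.
    static boolean isUsableDataPath(String path) {
        if (path == null || path.trim().isEmpty()) {
            return false;
        }
        File dir = new File(path);
        return dir.isDirectory() && dir.canWrite();
    }

    public static void main(String[] args) {
        // Reads the property as set on this JVM, for example with -Duima.datapath=...
        String path = System.getProperty("uima.datapath");
        System.out.println("uima.datapath = " + path);
        System.out.println("usable = " + isUsableDataPath(path));
    }
}
```

Running a check like this on each server node can confirm that the property is visible to the JVM before the activity is called.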
About Text File rules