Detecting transaction details with Ruta
Locate and extract keywords and phrases through pattern matching by using the Apache Rule-based Text Annotation (Ruta) language.
For example, you can use a Ruta script to detect strings that contain the @ symbol and the .com sub-string, such as email addresses. In addition, you can use this detection method to extract entity types by token length (for example, zip codes or telephone numbers), or to extract entities from a word or token. This data can help you automate case field population, or process customer requests faster.
Use case
As an application developer, you want to extract details such as account numbers, case IDs, and customer information from transaction documents.
Learn about extracting transaction details by completing the tasks in the following sections.
Before you begin
- In Prediction Studio, create an entity model. For more information, see Creating entity models.
- In the entity model that you create, configure each information type that you want to extract (for example, account number, case ID, or phone number) as a separate entity type.
Detecting account numbers
Use regular expression patterns to detect account numbers with a fixed number of digits. The following example uses an account number with six digits.
- In the account_number entity type, turn on the Enable Ruta switch.
- In the Ruta script box, enter:
NUM{REGEXP("......") -> MARK(EntityType) };
where:
  - NUM is the annotation type for marking numbers and digits.
  - REGEXP("......") is the regular expression in which the number of periods represents the number of characters in a specific annotation.
This expression detects all six-digit numbers in the document as account numbers.
- Optional: To improve detection accuracy, add more regular expressions. For example, you can recognize the context in which an account number appears in a document that has a fixed structure (such as in transaction documents):
W{ REGEXP("(?i)(account)")}
W{ REGEXP("(?i)(number)")}
COLON
NUM{REGEXP("......") -> MARK(EntityType,4) };
where:
  - W{ REGEXP("(?i)(account)")} detects the word account.
  - (?i) indicates that the word is case insensitive.
  - W{ REGEXP("(?i)(number)")} detects the word number.
  - COLON detects the colon (:) character.
  - MARK(EntityType,4) marks only the fourth annotation (the number) as the entity type.
- Optional: To improve detection accuracy, add more regular expressions. For example, you can recognize the context in which an account number appears in a document without a fixed structure (such as My account number is 435399 or Details for my account number: 394854):
W{ REGEXP("(?i)(account)")}
W{ REGEXP("(?i)(number)")}
ANY[0,4]?
NUM{REGEXP("......") -> MARK(EntityType,4) };
where:
  - ANY[0,4]? optionally matches up to four annotations of any type, so that a numeric value is detected as an account number if the terms account and number appear at most four annotations before it.
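To check the context logic outside Prediction Studio, the rules above can be approximated in Python. This is a minimal sketch, not Ruta: the tokenizer is a rough stand-in for the W, NUM, and punctuation annotations, and the names TOKEN and find_account_numbers are hypothetical.

```python
import re

# Assumed tokenizer: words, digit runs, or single punctuation characters.
# It only approximates how Ruta splits text into basic annotations.
TOKEN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def find_account_numbers(text, max_gap=4):
    """Return six-digit numbers that follow 'account number' within max_gap tokens."""
    tokens = TOKEN.findall(text)
    found = []
    for i in range(len(tokens) - 1):
        # Case-insensitive context words, like (?i)(account) and (?i)(number).
        if tokens[i].lower() == "account" and tokens[i + 1].lower() == "number":
            # Up to max_gap optional tokens may precede the number, like ANY[0,4]?.
            window = tokens[i + 2 : i + 3 + max_gap]
            for tok in window:
                # Six periods applied to a number token means exactly six digits.
                if re.fullmatch(r"\d{6}", tok):
                    found.append(tok)
                    break
    return found
```

Calling find_account_numbers("Details for my account number: 394854") returns the number because only one token (the colon) sits between the context words and the value.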
Detecting case IDs
Detect work assignment numbers, such as case IDs, by using a regular expression.
- In the case_id entity type, turn on the Enable Ruta switch.
- In the Ruta script box, enter the following script:
CAP{REGEXP("..")}
"-"
NUM{REGEXP("....") -> MARK(EntityType,1,3)};
where:
  - CAP{REGEXP("..")} detects a two-character word in capital letters.
  - "-" detects the dash character.
  - NUM{REGEXP("....") -> MARK(EntityType,1,3)}; detects four-digit numbers that follow the previously detected annotations.
  - MARK(EntityType,1,3) means that annotations 1 through 3 are marked as the entity type.
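As a quick way to try sample values, the three rule elements can be collapsed into one regular expression over raw text. The following Python sketch is an approximation under that assumption; CASE_ID, find_case_ids, and the sample IDs are hypothetical.

```python
import re

# Two capital letters, a dash, and four digits, bounded on both sides so that
# longer strings such as ABC-1234 or AB-12345 are not partially matched.
CASE_ID = re.compile(r"\b[A-Z]{2}-\d{4}\b")


def find_case_ids(text):
    """Return all substrings that look like a two-letter, four-digit case ID."""
    return CASE_ID.findall(text)
```

The word boundaries play the role that token boundaries play in Ruta, where each rule element matches a whole annotation.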
Detecting salutations
Detect salutations, such as mr, mrs, and dr, by using regular expressions.
- In the salutation entity type, turn on the Enable Ruta switch.
- In the Ruta script box, enter:
W{REGEXP("(?i)(mr|mrs|dr|madam|maam|miss|sir|senorita|senor|ms)") -> MARK(EntityType)};
where:
  - "(?i)" makes the search case insensitive.
  - "|" indicates the OR condition.
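The same alternation can be tried in Python. This sketch assumes that the pattern must match a whole word, which mirrors applying the condition to a single W annotation; find_salutations is a hypothetical helper name.

```python
import re

# Same alternation as in the Ruta script; (?i) makes it case insensitive.
SALUTATION = re.compile(r"(?i)(mr|mrs|dr|madam|maam|miss|sir|senorita|senor|ms)")


def find_salutations(text):
    """Return the words in text that are, as a whole, one of the listed salutations."""
    words = re.findall(r"[A-Za-z]+", text)
    # fullmatch keeps words such as "Missing" or "Drive" from matching on a prefix.
    return [w for w in words if SALUTATION.fullmatch(w)]
```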
Detecting numeric patterns
Use a regular expression to detect numeric patterns, for example, social security numbers.
- In the entity type called security_number, turn on the Enable Ruta switch.
- In the Ruta script box, enter:
NUM{REGEXP("0.[1-9]|1..|2..|3..|4..|5..|6..|7..|8..") }
"-" NUM{REGEXP("..")}
"-" NUM{REGEXP("....") -> MARK(EntityType,1,5)};
where:
  - NUM{REGEXP("0.[1-9]|1..|2..|3..|4..|5..|6..|7..|8..") } ensures that the number does not start with 000.
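To see which values the first group accepts, the pattern can be exercised in Python. This is a sketch under the assumption that each regular expression must match its whole number token; looks_like_security_number and the sample values are hypothetical. Note that, as written, the first alternative also rejects values such as 010, and no alternative accepts a leading 9.

```python
import re

# First group, copied from the Ruta script above.
AREA = re.compile(r"0.[1-9]|1..|2..|3..|4..|5..|6..|7..|8..")


def looks_like_security_number(value):
    """Check a string against the three dash-separated number groups."""
    parts = value.split("-")
    if len(parts) != 3 or not all(re.fullmatch(r"[0-9]+", p) for p in parts):
        return False
    area, group, serial = parts
    # Three digits that pass AREA, then two digits, then four digits.
    return AREA.fullmatch(area) is not None and len(group) == 2 and len(serial) == 4
```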
Detecting email addresses
Detect email addresses by extracting phrases or words that contain the @ symbol and .com string.
- In the entity type called email_address, turn on the Enable Ruta switch.
- In the Ruta script box, enter:
Document{-> RETAINTYPE(SPACE)};
SPACE ((W|NUM) (W|NUM)[0,1])+ "@" W+? PERIOD+? W{REGEXP("(?i)([a-zA-Z]{3}|[a-zA-Z]{2})") -> MARK(EntityType,1,5)};
SPACE ((W|NUM)+ ("."|"_") )+ (W|NUM)+ "@" W[0,1]? PERIOD[0,1]? W+? PERIOD+? W{REGEXP("(?i)([a-zA-Z]{3}|[a-zA-Z]{2})") -> MARK(EntityType,1,8)};
where:
  - Line 1 makes space annotations visible to the rules that follow, so that each rule can start matching at the space before an email address.
  - Line 2 marks an email address that does not contain a period (.) or an underscore (_) before the @ symbol and contains only one period after the @ symbol. For example, [email protected].
  - Line 3 marks emails that contain a period (.) or an underscore (_) before the @ symbol and domain names with any number of period characters. For example, [email protected].
  - + matches at least one annotation.
  - +? matches at least one annotation but stops when the next rule element also matches this annotation.
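For comparison, a single regular expression over raw text can cover roughly the same shapes as the two Ruta rules. The following Python sketch is an approximation, not a translation of the rules; EMAIL, find_emails, and the sample addresses (on reserved example domains) are hypothetical.

```python
import re

EMAIL = re.compile(
    r"(?<!\S)"                            # start of text or preceded by whitespace
    r"[A-Za-z0-9]+(?:[._][A-Za-z0-9]+)*"  # local part with optional . or _ separators
    r"@"
    r"(?:[A-Za-z0-9]+\.)+"                # one or more domain labels, each ending in a period
    r"[A-Za-z]{2,3}"                      # two- or three-letter final label
    r"(?![A-Za-z0-9])"                    # the final label must end here
)


def find_emails(text):
    """Return substrings that look like email addresses with a short final label."""
    return EMAIL.findall(text)
```

Like the Ruta rules, this sketch accepts only two- or three-letter endings, so longer top-level domains are skipped.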
Detecting alphanumeric strings with a fixed number of characters
You can use regular expressions with Ruta scripts to annotate alphanumeric strings, for example, transaction numbers.
- In the transaction_number entity type, turn on the Enable Ruta switch.
- In the Ruta script box, enter:
Token{ AND(REGEXP(".........."),CONTAINS(NUM),CONTAINS(W)) -> MARK(EntityType)};
This script detects 10-character entities that contain at least one digit and at least one letter.
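The three conditions inside AND can be mirrored one by one in Python. This is a minimal sketch for trying sample values; is_transaction_number and the sample strings are hypothetical, and the letter check assumes ASCII letters.

```python
import re


def is_transaction_number(token):
    """Return True for a 10-character token with at least one digit and one letter."""
    return (
        re.fullmatch(r".{10}", token) is not None       # like REGEXP("..........")
        and re.search(r"\d", token) is not None         # like CONTAINS(NUM)
        and re.search(r"[A-Za-z]", token) is not None   # like CONTAINS(W)
    )
```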
Detecting currencies
You can detect currency and money by using a variety of methods. For example, you can create a regular expression, or create a list of keywords for various currency names, symbols, and synonyms, and then refer to that list in the Ruta script.
- In the currency entity type, turn on the Enable Ruta switch.
- In the Ruta script box, enter the following script:
W{REGEXP("(?i)(dollar|pounds|rupees)") -> MARK(EntityType)};
For information about how to create a list of keywords, see Creating entity models.
Detecting monetary values
You can detect monetary values by combining the currency code (for example, USD) and number entity types.
- In the money entity type, turn on the Enable Ruta switch.
- In the Ruta script box, enter:
EntityType{FEATURE("entityType", "currency")}
NUM {-> MARK(EntityType,1,2)};
This script detects the currency that is followed by a number. The name of the entity type must always be in lowercase.
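The idea of a currency term followed directly by a number can be sketched in Python as a check on adjacent tokens. This example reuses the keyword list from the currency script as an assumption; the tokenizer, find_money, and the sample sentence are hypothetical.

```python
import re

# Keyword list from the currency script; matching is case insensitive.
CURRENCY = re.compile(r"(?i)(dollar|pounds|rupees)")
# Assumed tokenizer: words, digit runs, or single punctuation characters.
TOKEN = re.compile(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]")


def find_money(text):
    """Return 'currency number' pairs where a currency word directly precedes a number."""
    tokens = TOKEN.findall(text)
    pairs = []
    for word, nxt in zip(tokens, tokens[1:]):
        # The currency word must match as a whole, and the next token must be numeric.
        if CURRENCY.fullmatch(word) and nxt.isdigit():
            pairs.append(f"{word} {nxt}")
    return pairs
```

Because the keyword must match a whole word, a plural such as dollars is not detected unless it is added to the list.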
Detecting dates
You can use a Ruta script to create a pattern for extracting dates. Ruta supports combining multiple detection patterns so that you can detect dates that are written in various formats.
- Declare date variables:
DECLARE VarA; //Month name
DECLARE VarB; //Day
DECLARE VarC; //Year
DECLARE VarD; //Month Number
- Detect a month name:
W{REGEXP("(?i)(january|february|march|april|may|june|july|august|september|october|november|december|jan|feb|mar|apr|jun|jul|aug|sep|oct|nov|dec)") -> MARK(VarA)};
- Detect a day number:
NUM{REGEXP("[0]?[1-9]|1[0-9]|2[0-9]|3[0-1]")} W?{REGEXP("(?i)(th|st|nd|rd)")-> MARK(VarB,1,2)};
- Detect a year:
NUM{REGEXP("19..|20..|..") -> MARK(VarC)};
- Detect a month number:
NUM{REGEXP("[0]?[1-9]|1[0-2]") -> MARK(VarD)};
- Detect a full date that follows the January 1st 2008 or February 28, 2010 pattern:
VarA VarB PM? VarC {-> MARK(EntityType,1,4)};
- Clear the date variables that you declared earlier:
VarA{->UNMARK(VarA)};
VarB{->UNMARK(VarB)};
VarC{->UNMARK(VarC)};
VarD{->UNMARK(VarD)};
Unmarking the temporary variables ensures that they do not interfere with the execution of the next script.
The script above combines five patterns, where:
- W{REGEXP("(?i)(january|february|march|april|may|june|july|august|september|october|november|december|jan|feb|mar|apr|jun|jul|aug|sep|oct|nov|dec)") -> MARK(VarA)}; detects the month and is case insensitive. The annotation is marked as VarA.
- NUM{REGEXP("[0]?[1-9]|1[0-9]|2[0-9]|3[0-1]")} W?{REGEXP("(?i)(th|st|nd|rd)") -> MARK(VarB,1,2)}; detects any number from 1 to 31, with an optional ordinal suffix (th, st, nd, or rd). The question mark character (?) indicates that this annotation is optional. This annotation is marked as VarB.
- NUM{REGEXP("19..|20..|..") -> MARK(VarC)}; detects a four-digit number that starts with 19 or 20, or a two-digit number. This annotation is marked as VarC.
- NUM{REGEXP("[0]?[1-9]|1[0-2]") -> MARK(VarD)}; detects numbers from 1 to 12 and is marked as VarD.
- VarA VarB PM? VarC {-> MARK(EntityType,1,4)}; detects a full date that follows the January 1st 2008 or February 28, 2010 pattern.
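The way the month, day, and year patterns combine into a full date can be illustrated with one Python regular expression. This is a sketch over raw text rather than annotations, so word boundaries stand in for token boundaries; MONTH, DAY, DATE, and find_dates are hypothetical names.

```python
import re

# Month names and abbreviations, as in the VarA pattern.
MONTH = (
    r"january|february|march|april|may|june|july|august|september|"
    r"october|november|december|jan|feb|mar|apr|jun|jul|aug|sep|oct|nov|dec"
)
# Day numbers from 1 to 31, as in the VarB pattern.
DAY = r"0?[1-9]|1[0-9]|2[0-9]|3[0-1]"

DATE = re.compile(
    rf"\b(?P<month>{MONTH})\s+"       # month name (VarA)
    rf"(?P<day>{DAY})(?:th|st|nd|rd)?\b"  # day with optional suffix (VarB)
    r"\s*[,.]?\s*"                    # optional punctuation mark (PM?)
    r"(?P<year>19\d\d|20\d\d|\d\d)\b",  # four- or two-digit year (VarC)
    re.IGNORECASE,
)


def find_dates(text):
    """Return (month, day, year) tuples for dates written as month, day, year."""
    return [(m["month"], m["day"], m["year"]) for m in DATE.finditer(text)]
```

The boundary after the day keeps a string such as January 2010 from being read as day 20 and year 10, which the token-based Ruta rules avoid naturally because 2010 is a single number annotation.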
In this tutorial, you learned how to use Apache Ruta to automatically discover and retrieve such business data as account numbers, email addresses, transaction numbers, financial figures, and dates.
What to do next
Include the entity models that contain Ruta scripts in your application by configuring a Text analyzer rule. For more information, see Building text analyzers.