Best practices for creating extraction models
Use extraction analysis to detect named entities and classify them into predefined categories, such as names of people, locations, and organizations.
By using extraction analysis, your application can automatically scan articles, forum threads, emails, and so on, to discover the talking points (for example, people, organizations, and places) that are discussed in them. Knowing the relevant entities in each analyzed piece of text can help in categorizing the content into well-defined and easily manageable hierarchies.
In Pega Platform, you can use the Conditional Random Fields algorithm to build a machine-learning model for extraction analysis.
Training data for an extraction model
Machine-learning models require a data set for training and testing. During training, the model learns to extract entities from the input that you provide in the form of sentences or documents that you pre-annotated with the entity types that you want to extract. By applying an algorithm to the training data, the model develops rules and patterns for entity extraction. You can set aside some of the data as the testing set to determine the accuracy of the model. Test records are excluded from the training process. After the model finishes training, it uses the patterns and rules that it learned from the training data to extract entities from the testing set. Each entity that the model correctly recognizes in the testing set increases the model's accuracy. Any mismatches between the outcomes that the model predicted (machine outcomes) and those that you annotated in the testing set (manual outcomes) decrease the final accuracy score of the model.
Building the data set for training and testing can be a difficult step in the model development process, depending on the extraction problem that you want to solve. For example, you can have an excessive amount of training data or very little of it. An excessive amount of training data can lead to model overfitting. Overfitting happens when a model learns the noise in the training data to the extent that its performance on new data suffers. When there is not enough training data, the model cannot learn all the patterns and rules that it needs to correctly distinguish between labels.
Data format for training an extraction model
Upload the data set for training and testing an extraction model as a CSV file. This file must contain the Content and Type columns, as in the following example:
Example of data for training and testing an extraction model
Content | Type |
<START:person> Justin Trudeau <END> is the <START:location> Canadian <END> <START:designation> Prime Minister <END> . | |
<START:organization> uPlusTelco <END> published their quarterly revenue statement. It is a company based in <START:location> Springfield <END> . | |
My account was locked! Its number is <START:account_no> 123-456-789 <END> . Please help! | Test |
The Content column contains the text input for the model to learn from. The input must contain all entities that you want to highlight, as shown in the preceding example. The Type column determines whether a record is assigned to the training data set or to the testing data set. If the Type column is empty for a record, that record belongs to the training data set. When building a model, you can customize the split ratio between the training and test sets. Depending on the type of content that you want to analyze, a training record can consist of a single sentence or a large block of text (for example, an email).
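To illustrate how the Content and Type columns relate, the following minimal Python sketch reads a CSV file in this layout, pulls the annotated entities out of each record, and separates training records from test records. The function names parse_record and load_data_set are hypothetical, the tag syntax is inferred from the preceding example, and the sketch assumes that the value Test in the Type column marks a test record. It is not the way Pega Platform processes the file internally.

```python
import csv
import io
import re

# Tag syntax inferred from the example table: <START:type> text <END>
TAG = re.compile(r"<START:(\w+)>\s*(.*?)\s*<END>", re.DOTALL)

def parse_record(content):
    """Return (plain_text, [(entity_type, entity_text), ...]) for one Content cell."""
    entities = [(m.group(1), " ".join(m.group(2).split()))
                for m in TAG.finditer(content)]
    # Replace each annotated span with its bare text, then normalize whitespace.
    plain = " ".join(TAG.sub(lambda m: m.group(2), content).split())
    return plain, entities

def load_data_set(csv_text):
    """Split records into (training, testing) lists based on the Type column."""
    training, testing = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = parse_record(row["Content"])
        # Assumption: "Test" marks a test record; an empty Type means training.
        is_test = (row.get("Type") or "").strip().lower() == "test"
        (testing if is_test else training).append(record)
    return training, testing

sample = '''Content,Type
"<START:person> Justin Trudeau <END> is the <START:location> Canadian <END> <START:designation> Prime Minister <END> .",
"My account was locked! Its number is <START:account_no> 123-456-789 <END> . Please help!",Test
'''

training, testing = load_data_set(sample)
for plain, entities in training + testing:
    print(plain, entities)
```

A check like this can be useful before upload, because it shows exactly which spans and entity types a record contributes.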
Guidelines for building a training and test data set
Follow these guidelines to construct a data set for training and testing an entity extraction model in Pega Platform:
- You can highlight multiple types of entities in one phrase, sentence, or document.
- Entity extraction treats text as a sequence. By applying the algorithm, the model learns the probability of an entity, given the accompanying words, and the probability of the entity being followed or preceded by another entity type. Additionally, the model learns the order of entities in the training data. This enables you to extract structured entities from documents such as bills, receipts, or invoices.
- In addition to single words or short phrases, you can mark larger chunks of text (for example, entire paragraphs) as entities. For example, by using entity extraction, you can divide an email into greeting, body, signature, and disclaimer.
- When building a training and test data set, highlight at least 10 to 20 instances per entity type in the text.
- When highlighting entities, avoid incomplete tags, missing characters, or misspelled words. Any mistakes affect the accuracy of the model.
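Some of the guidelines above can be checked mechanically before you upload the data. The following Python sketch counts the annotated instances per entity type and flags records with incomplete tags. The function name check_annotations and the min_instances parameter are hypothetical, and the default threshold of 10 reflects the lower bound of the 10 to 20 instances recommended above. The sketch does not detect misspelled words inside entities.

```python
import re
from collections import Counter

START = re.compile(r"<START:(\w*)>")
END = re.compile(r"<END>")

def check_annotations(records, min_instances=10):
    """Return (instance counts per entity type, list of problem descriptions)."""
    counts = Counter()
    problems = []
    for i, text in enumerate(records):
        starts = START.findall(text)
        # Every START tag needs a matching END tag.
        if len(starts) != len(END.findall(text)):
            problems.append(f"record {i}: unbalanced START/END tags")
        # A START tag must name an entity type.
        if "" in starts:
            problems.append(f"record {i}: START tag without an entity type")
        counts.update(s for s in starts if s)
    for entity_type, n in sorted(counts.items()):
        if n < min_instances:
            problems.append(
                f"{entity_type}: only {n} instance(s), fewer than {min_instances}")
    return counts, problems

records = [
    "<START:person> Justin Trudeau <END> is the <START:location> Canadian <END> PM.",
    "It is based in <START:location> Springfield .",  # missing <END>
]
counts, problems = check_annotations(records)
for problem in problems:
    print(problem)
```

Sorting the counts also makes near-duplicate entity names, such as location and locaton, easier to spot.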
Guidelines for analyzing entity extraction results
Follow these guidelines to correctly analyze an entity model that you built:
- When uploading the training and test data, pay attention to names of entities and the number of instances per entity type. Misspelled names lead to the creation of multiple categories and affect the model's accuracy.
- Inspect the sequence of entity prediction for errors. An incorrect prediction might be the result of an unrepresented or falsely represented sequence of entities in the training data.
- Split the data set so that 60% of the data is used for training and 40% is used for testing.
- Inspect the performance metrics of your model to discover whether the recall and precision scores are similar across all categories. An indicator that you overfitted the model is that some categories have extremely high precision and recall while others perform extremely poorly. A stable model has similar precision and recall scores across all categories.
- A low F-score can indicate that some of your entities are similar and that your training set lacks features that make the entities clearly distinguishable. The highest possible F-score is 1. In practice, a model typically reaches it only when the test data set is the same as the training data set.
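To make the metrics in these guidelines concrete, the following Python sketch computes precision, recall, and F-score per entity type from manual and machine outcomes, using the standard definitions: precision is the share of predicted entities that are correct, recall is the share of annotated entities that the model found, and the F-score is 2PR / (P + R). The function name per_type_scores is hypothetical, and the sketch counts only exact matches as correct. How Pega Platform weighs partial matches in its own scores is not covered here.

```python
from collections import Counter

def per_type_scores(manual, machine):
    """manual, machine: lists of (entity_type, entity_text) for the same test set.

    Returns {entity_type: (precision, recall, f_score)}.
    """
    manual_c, machine_c = Counter(manual), Counter(machine)
    matched = manual_c & machine_c  # exact matches only (assumption)
    scores = {}
    for etype in sorted({t for t, _ in manual} | {t for t, _ in machine}):
        true_pos = sum(n for (t, _), n in matched.items() if t == etype)
        predicted = sum(n for (t, _), n in machine_c.items() if t == etype)
        actual = sum(n for (t, _), n in manual_c.items() if t == etype)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        total = precision + recall
        f_score = 2 * precision * recall / total if total else 0.0
        scores[etype] = (precision, recall, f_score)
    return scores

manual = [("person", "Justin Trudeau"), ("location", "Canadian"),
          ("location", "Springfield")]
machine = [("person", "Trudeau"), ("location", "Canadian"),
           ("location", "Springfield"), ("location", "Prime")]

scores = per_type_scores(manual, machine)
for etype, (p, r, f) in scores.items():
    print(f"{etype}: precision={p:.2f} recall={r:.2f} f={f:.2f}")
```

In this made-up example, the location category scores well while the person category scores zero, which is the kind of imbalance across categories that the guidelines suggest investigating.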
Frequently asked questions
What is the difference between a partial match and an exact match?
When the manual outcome is exactly the same as the machine outcome, for example, both you and the model labeled Justin Trudeau as PERSON, then there is an exact match. However, if the model highlighted only Trudeau as PERSON, then the match is partial.
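The distinction can be sketched in Python by comparing character spans. The function name match_kind is hypothetical, and the sketch assumes that a partial match means that the two spans have the same entity type and overlap without being identical. The exact rule that Pega Platform applies is not specified in this article.

```python
def match_kind(manual, machine):
    """Compare two (entity_type, start, end) character spans, end exclusive.

    Returns "exact", "partial", or "none".
    """
    m_type, m_start, m_end = manual
    p_type, p_start, p_end = machine
    if m_type != p_type:
        return "none"
    if (m_start, m_end) == (p_start, p_end):
        return "exact"
    # Assumption: any overlap between the two spans counts as a partial match.
    if p_start < m_end and m_start < p_end:
        return "partial"
    return "none"

text = "Justin Trudeau is the Canadian Prime Minister."
manual = ("PERSON", 0, 14)         # "Justin Trudeau"
machine_full = ("PERSON", 0, 14)   # "Justin Trudeau"
machine_part = ("PERSON", 7, 14)   # "Trudeau"

print(match_kind(manual, machine_full))
print(match_kind(manual, machine_part))
```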
When should I choose default or paragraph-level entity extraction?
When the scope of entity detection is a single word, choose the default entity extraction. When the scope of entity extraction is more than a word, choose the paragraph-level entity extraction.