Creating entity extraction models
Automatically create a case, populate a form, or route an assignment by building entity extraction models that recognize keywords and phrases. Each entity extraction model classifies keywords and phrases into predefined categories that are called entity types, such as personal names, locations, and organizations.
Locate each entity type in unstructured text through a combination of detection methods. You can then use entity types to create and manage complex entity extraction models, such as date or date-time. In addition, entity types help you manage entities that nest other entities. For example, an address entity type can include nested entity types such as country, state, province, postal code, and street.
- In the navigation pane of Prediction Studio, click Models.
- In the header of the Models work area, click .
- In the New text extraction model window, provide the model name, language, and the applicable class.
- Click Start.
- In the Entity types section, click Add new.
- In the Add new entity type section, enter the entity type name.
- Define the detection methods for the entity type by performing any of the following actions:
- To combine multiple entity types under a parent entity type, expand the Referenced entity types menu and then click + Add entity type.
For example, you can nest such entity types as Postal code, Street, and City under a single top-level entity type, such as Address.
- To create a list of keywords that belong to the entity type, enable the Configure keywords option and then specify the keywords to detect by manually adding each entry or uploading a file.
Use this detection method when the entity type that you want to extract is an umbrella term for a finite number of associated terms or phrases that do not follow any specific pattern. For example, you can define and associate the city entity type with the keyword New York, with such synonyms as NY, NYC, Big Apple, The Five Boroughs.
- To detect entity types whose structure matches a certain pattern, enable the Configure RUTA setting and then use Apache Rule-based Text Annotation (RUTA) language to define the detection pattern.
For example, you can use a RUTA script to detect strings that contain the @ symbol and the .com sub-string as email_address. In addition, you can use this detection method to detect entity types through the token length (for example, postal_code or telephone_number) or to extract entities from a word or token. You can select and modify any of the templates that are provided.
- To detect entities by training a conditional random field (CRF) model, enable the Configure machine learning setting.
Machine learning models for detecting entities work best when entities do not follow any specific pattern but appear in a specific context or are surrounded by certain words or phrases. For example, in the sentence I work at uPlusTelco, a machine learning model might classify uPlusTelco as organization with greater confidence because the verb work and the preposition at often appear together with organization names.
- Optional: To define additional options or processing activities, perform any of the following actions:
- To exclude the entity type from the text analytics results, toggle the Is internal entity type switch. Use this setting for entity types that are building blocks of other entity types but are not important on their own in the text analytics results. For example, you can mark month name as internal when the date entity type references it.
- To change the default order of detection methods, drag detection method names into the table. For example, to enable providing feedback to the entity detection model, select Model as the preferred detection method. The method that is used to detect an entity appears as the value of the pyDetectionType property in the text analytics results.
- To specify additional steps to process the entities, in the Post-processing activity field, select or define an Activity rule. For example, you can define an activity to normalize the date format of the entities that are detected. The entity that is normalized appears as the value of the pyResolvedValue property in text analytics results.
- Confirm your changes by clicking Submit.
- Optional: To add more entity types to the model, repeat steps 5 through 9.
- If you added a Model entity type, click Create with machine learning to start the model creation wizard. For more information, see Building machine learning entity extraction models.
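The keyword and pattern detection methods described in the steps above can be illustrated outside the product with a minimal Python sketch. This is not how Prediction Studio implements detection and it does not use RUTA syntax: the regular expression is a simplified stand-in for a pattern rule, and the names CITY_KEYWORDS, EMAIL_PATTERN, detect_keywords, detect_pattern, and extract, as well as the "method" key, are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical keyword list for the "city" entity type:
# canonical value -> synonyms that should resolve to it.
CITY_KEYWORDS = {
    "New York": ["New York", "NY", "NYC", "Big Apple", "The Five Boroughs"],
}

# Simplified stand-in for a pattern rule: a token that contains
# the @ symbol and ends with the .com sub-string.
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)*\.com\b")


def detect_keywords(text, entity_type, keywords):
    """Match whole-word synonyms, trying longer synonyms first."""
    found = []
    taken = []  # character spans already claimed by a longer synonym
    pairs = [(syn, canon) for canon, syns in keywords.items() for syn in syns]
    for syn, canon in sorted(pairs, key=lambda p: -len(p[0])):
        for m in re.finditer(r"\b" + re.escape(syn) + r"\b", text):
            if any(m.start() < end and start < m.end() for start, end in taken):
                continue  # overlaps a match that was already recorded
            taken.append((m.start(), m.end()))
            found.append({
                "type": entity_type,
                "text": m.group(),
                "value": canon,
                "method": "Keywords",  # illustrative label only
                "start": m.start(),
            })
    return found


def detect_pattern(text, entity_type, pattern):
    """Return every substring that matches the compiled pattern."""
    return [
        {
            "type": entity_type,
            "text": m.group(),
            "value": m.group(),
            "method": "Pattern",  # illustrative label only
            "start": m.start(),
        }
        for m in pattern.finditer(text)
    ]


def extract(text):
    """Run both detection methods and return entities in reading order."""
    entities = detect_keywords(text, "city", CITY_KEYWORDS)
    entities += detect_pattern(text, "email_address", EMAIL_PATTERN)
    return sorted(entities, key=lambda e: e["start"])


if __name__ == "__main__":
    sample = "Flying to NYC, write to jane.doe@example.com about the Big Apple."
    for entity in extract(sample):
        print(entity["type"], entity["text"], entity["value"], entity["method"])
```

The sketch shows why a keyword list suits a finite set of terms with no shared structure, while a pattern suits values whose shape is predictable. Recording which method produced each match loosely mirrors the role that the pyDetectionType property plays in the text analytics results.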
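The post-processing option mentioned in the procedure, normalizing the date format of detected entities, can be sketched as follows. In the product this logic would live in an Activity rule, so the Python below is only an illustration of the idea. The input formats in KNOWN_FORMATS, the function names normalize_date and post_process, and the "resolved_value" key are assumptions made for this example.

```python
from datetime import datetime

# Assumed input formats; a real post-processing activity would cover
# whatever date formats the entity type actually detects.
KNOWN_FORMATS = ("%d/%m/%Y", "%B %d, %Y", "%Y-%m-%d")


def normalize_date(raw):
    """Return the date in ISO 8601 form, or None if no assumed format matches."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue  # try the next assumed format
    return None


def post_process(entity):
    """Return a copy of a detected entity with a resolved value attached."""
    resolved = dict(entity)  # leave the detected text untouched
    if entity.get("type") == "date":
        resolved["resolved_value"] = normalize_date(entity["text"])
    return resolved


if __name__ == "__main__":
    print(post_process({"type": "date", "text": "March 5, 2021"}))
```

Keeping the detected text and the normalized value side by side is similar in spirit to how the normalized entity appears separately as the value of the pyResolvedValue property.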