Preparing data for text extraction

In the Source selection step of the text extraction model creation wizard, select the extraction type and provide the data for training and testing your text extraction model.

  1. In the Extraction type section, select a recognizer type:
    • To detect word-level entities, such as person or location, select Default entity recogniser.
    • To detect paragraph-level entities, such as email disclaimer, select Paragraph entity recogniser.
  2. Optional: To view the template for testing and training data, click Download template.
    An example training data record is: Hi, this is <START:name> Bart <END>, where:
    • <START:name> – Marks the start and type of the entity. In the preceding example, the model detects the string Bart as an entity of type name.
    • <END> – Marks the end of the entity.
  3. To select and upload a CSV, XLS, or XLSX file that contains training and testing data for your text extraction model, click Choose file.
    After you select a valid file, you can preview the types of identified entities and the size of training and testing data. Depending on your business needs, you can exclude entity types from training data. Additionally, you can view errors, for example, missing <START> or <END> tags.
  4. If your file contains errors, perform any of the following actions:
    • Exclude errors from the model by selecting the Exclude below error records and build model check box.
    • Correct errors in the file and repeat step 3.
  5. Click Next.
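
Before you upload a file, it can help to check records for the tag errors that the preview reports, such as a missing <START> or <END> tag. The following Python snippet is a minimal sketch of such a check. It is not part of the product, and the function name parse_record and the error messages are hypothetical. It assumes only the tag format shown in step 2, with a typed opening tag such as <START:name> and a closing <END> tag, and it does not recognize an opening tag that has no type.

```python
import re

# Matches either a typed opening tag such as <START:name> or the closing tag <END>.
TAG = re.compile(r"<START:(?P<type>[^<>\s]+)>|<END>")


def parse_record(text):
    """Return (entities, errors) for one tagged training record.

    entities is a list of (entity_type, entity_text) tuples;
    errors is a list of human-readable problems found in the record.
    """
    entities, errors = [], []
    open_type, open_pos = None, None  # currently open entity, if any

    for match in TAG.finditer(text):
        if match.group("type") is not None:
            # Opening tag: the previous entity must already be closed.
            if open_type is not None:
                errors.append(f"<START:{open_type}> has no matching <END>")
            open_type, open_pos = match.group("type"), match.end()
        elif open_type is None:
            # Closing tag with nothing open.
            errors.append("<END> without a preceding <START>")
        else:
            # Closing tag: record the text between the two tags.
            entities.append((open_type, text[open_pos:match.start()].strip()))
            open_type = None

    if open_type is not None:
        errors.append(f"<START:{open_type}> has no matching <END>")
    return entities, errors


if __name__ == "__main__":
    print(parse_record("Hi, this is <START:name> Bart <END>"))
```

Running a check like this over every record in the file before uploading gives a list of records to correct, which is an alternative to excluding error records in step 4.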