Text extraction analysis

Text extraction analysis is the process of identifying named entities in unstructured text, such as press articles, Facebook posts, or tweets, and categorizing them. Typically, a named entity is a proper noun that belongs to a commonly understood category, such as a person, organization, or location. An entity can also be a structured value, such as a Social Security number, email address, or postal code.

Auto tags

You can configure a Text Analyzer to automatically detect and mark the most important concepts that are expressed in a document. This option is useful when you want to tag a document with the most relevant words or phrases, create word clouds, or perform faceted search according to semantic categories.

Summarization

You can generate an extractive summary from a large body of text, such as a business report or an email. With summaries, you can make important business decisions without reading complete documents. Instead, you can examine the summary together with the context of the text, in the form of extracted topics, entities, intents, and sentiment.
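
To illustrate what "extractive" means, the following minimal sketch picks the highest-scoring sentences from a text and returns them verbatim, in their original order. It is a generic word-frequency heuristic, not the algorithm that the Text Analyzer uses; the function name summarize, the STOPWORDS set, and the scoring rule are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for the illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are",
             "for", "on", "with", "that", "this", "it", "was", "were"}


def _tokens(text):
    """Lowercase word tokens with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, max_sentences=2):
    """Return the top-scoring sentences, unchanged and in document order."""
    # Naive sentence split on ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(_tokens(text))

    def score(sentence):
        # Average document-wide frequency of the sentence's words.
        words = _tokens(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original order
    return " ".join(sentences[i] for i in keep)
```

Because the summary reuses sentences from the source rather than generating new ones, every statement in it can be traced back to the original document.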

Text extraction

You can extract keywords and phrases from unstructured text through entity types. An entity type is a category of keyword or phrase, such as a person name, organization, or location. You can group similar or related entity types into models.

For each entity type, you can combine the following detection methods to locate and classify entities reliably:

Keyword-based text extraction: You can specify a list of key terms and their synonyms that belong to a particular domain. For example, you can create a list of keywords to track social media messages that pertain to the latest release of a competitor's product or group of products.
Pattern extraction: Use pattern extraction models to extract entities whose structure matches a specific pattern, for example, postal codes, case numbers, or email addresses. You can select one of the default pattern extraction samples or create custom patterns in the Rule-based Text Annotation (Ruta) script language.
Machine-learning models: Use machine learning to identify and classify named entities in text. You can select one of the default entity extraction models or create custom models in Prediction Studio by using the conditional random fields (CRF) algorithm.
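
The first two detection methods can be sketched in a few lines of code. The example below is an illustration only: it uses plain Python regular expressions rather than Ruta scripts or Prediction Studio models, and the KEYWORDS list, the PATTERNS table, and both function names are hypothetical. The patterns are simplified and are not full validators.

```python
import re

# Hypothetical keyword list: entity type -> key terms and synonyms.
KEYWORDS = {
    "product_release": ["product release", "launch", "rollout"],
}

# Hypothetical, simplified patterns: entity type -> compiled regex.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def extract_keywords(text):
    """Return (entity_type, matched_text) for each whole-word keyword hit."""
    found = []
    for entity_type, terms in KEYWORDS.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append((entity_type, match.group()))
    return found


def extract_patterns(text):
    """Return (entity_type, matched_text) for each pattern hit."""
    found = []
    for entity_type, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((entity_type, match.group()))
    return found


if __name__ == "__main__":
    sample = "The Launch is next week. Contact press@example.com, Cambridge MA 02142."
    print(extract_keywords(sample))
    print(extract_patterns(sample))
```

Keyword lists work well for closed vocabularies, and patterns work well for values with a fixed structure. Entities with neither property, such as arbitrary person names, are the case that the machine-learning method addresses.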