Training data size considerations for building text analytics models
Build your text analytics models efficiently by choosing an optimal algorithm for your training data size. Consider the building times and prediction accuracies that different types of algorithms available in Pega Platform can provide.
Depending on the type of algorithm that you use, the size of the training data for a text analytics model can affect the build time. For example, a model that is trained on a very large data set (such as 10,000 to 20,000 records per category) can take more than one hour to build.
Algorithms
Pega Platform provides a set of algorithms that you can use to train your classifier for sentiment and classification analysis. Depending on the algorithm that you use, the building times might vary. For example, Naive Bayes performs the fastest analysis of training data sets. However, other algorithms provide more accurate predictions.
- Naive Bayes
- Naive Bayes is a simple but effective algorithm for predictive modeling that assumes that the features in the training data are conditionally independent of each other. Even though this assumption rarely holds for text data, this classifier can be very effective. The main advantage of choosing Naive Bayes over the other available algorithms is that it provides the fastest build time for large training data sets. The Naive Bayes algorithm is available for classification analysis in Pega Platform.
- Maximum Entropy
- The Maximum Entropy (MaxEnt) classifier is a probabilistic classifier that belongs to the class of exponential models. Unlike Naive Bayes, MaxEnt does not assume that the features are conditionally independent of each other. Instead, the classifier iterates multiple times over the training data and, among the models that are consistent with that data, selects the one that has the largest entropy. This classifier can solve various text classification problems more accurately than Naive Bayes. However, a MaxEnt classifier takes more time to build than a Naive Bayes classifier. The MaxEnt algorithm is available for classification and sentiment analysis in Pega Platform.
- Support Vector Machine
- Support Vector Machine (SVM) is a classifier that represents training data as points in an n-dimensional space and separates the categories with a hyperplane. SVM is used to build supervised, linear, and non-probabilistic classifiers. SVM performs best with large amounts of training data; however, classifiers based on SVM are the slowest to build. The SVM algorithm is available for classification analysis in Pega Platform.
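The independence assumption that makes Naive Bayes fast to build can be illustrated with a minimal sketch. The following code is not the Pega Platform implementation; it is a generic word-count classifier with add-one smoothing, and the function names and the four sample records are hypothetical. Building the model takes a single pass over the training data, which is consistent with the short build times described above.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(records):
    """Count words per category from (text, category) pairs in one pass."""
    word_counts = defaultdict(Counter)  # category -> word -> count
    category_counts = Counter()         # category -> number of records
    vocabulary = set()
    for text, category in records:
        words = text.lower().split()
        word_counts[category].update(words)
        category_counts[category] += 1
        vocabulary.update(words)
    return word_counts, category_counts, vocabulary


def classify(text, model):
    """Return the category with the highest log-probability for the text."""
    word_counts, category_counts, vocabulary = model
    total_records = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, n_records in category_counts.items():
        # Log prior: share of training records in this category.
        score = math.log(n_records / total_records)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            # Independence assumption: each word contributes its own
            # factor, regardless of the other words in the text.
            # Add-one smoothing avoids log(0) for unseen words.
            count = word_counts[category][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training records, for illustration only.
training = [
    ("my card was charged twice", "billing"),
    ("refund the charge on my invoice", "billing"),
    ("the app crashes on login", "technical"),
    ("cannot login after the update", "technical"),
]
model = train_naive_bayes(training)
print(classify("charged twice on my invoice", model))
```

MaxEnt and SVM replace the single counting pass with iterative optimization over the training data, which is one reason why their build times grow faster as the data set grows.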
For more information about building text analytics models in Pega Platform, see Creating machine learning topic models and Determining the emotional tone of text.
Model performance
The values in the following table were derived by testing Naive Bayes, SVM, and MaxEnt algorithms in Pega Platform against training data of various sizes. The following characteristics were common to all training data:
- Number of categories in training data – 10
- Average character count per row – 233
- Train and test data split ratio – 60%/40%
- Heap size – 8 gigabytes
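As a quick check of the data volumes involved, the following sketch derives the total number of rows and the train and test row counts from the characteristics listed above. It assumes that the 60%/40% ratio means 60% for training and 40% for testing, and that the split is applied to the total number of rows; the function name is hypothetical.

```python
def split_counts(records_per_category, categories=10, train_percent=60):
    """Return (total rows, training rows, testing rows) for one data set."""
    total_rows = records_per_category * categories
    # Integer arithmetic keeps the counts exact.
    train_rows = total_rows * train_percent // 100
    return total_rows, train_rows, total_rows - train_rows


for per_category in (1_000, 10_000, 20_000):
    total, train, test = split_counts(per_category)
    print(f"{per_category} per category: {total} rows, {train} train, {test} test")
```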
You can use multiple algorithms simultaneously, as shown in the following table. However, the build might fail if the combined training data is too large. For example, in these tests SVM did not build with 10,000 or more training records per category.
Performance results of Naive Bayes, SVM, and MaxEnt algorithms
Training records per category | Total number of rows | File size (megabytes) | Does Naive Bayes build? | Does MaxEnt build? | Does SVM build? | Building time (minutes) | Testing time (minutes) |
1,000 | 10,000 | 1 | Yes | Yes | Yes | SVM: 20 | SVM: 5 |
10,000 | 100,000 | 13 | Yes | Yes | No | MaxEnt: 6.5; Naive Bayes: 0.84 | MaxEnt: 10; Naive Bayes: 10 |
20,000 | 200,000 | 26 | Yes | No | No | Naive Bayes: 35 | Naive Bayes: 22 |
20,000 | 200,000 | 26 | No | Yes | No | MaxEnt: 61 | MaxEnt: 34 |
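The table can also be used as a lookup when you plan a build. The following sketch encodes only the Yes/No columns of the table and returns the rows for the smallest tested size that covers a planned data set. The names `TABLE_ROWS` and `builds_reported_for` are hypothetical, and the results apply only to the test conditions listed above (10 categories, 8-gigabyte heap); your environment might behave differently.

```python
# Each entry: (records per category, algorithms marked "Yes" in one table row).
TABLE_ROWS = [
    (1_000, {"Naive Bayes", "MaxEnt", "SVM"}),
    (10_000, {"Naive Bayes", "MaxEnt"}),
    (20_000, {"Naive Bayes"}),
    (20_000, {"MaxEnt"}),
]


def builds_reported_for(records_per_category):
    """Return the algorithm sets from the rows of the smallest tested size
    that is at least as large as the requested size."""
    sizes = sorted({size for size, _ in TABLE_ROWS})
    covering = [size for size in sizes if size >= records_per_category]
    if not covering:
        return []  # Larger than any tested size: the table has no data.
    return [algos for size, algos in TABLE_ROWS if size == covering[0]]


print(builds_reported_for(5_000))
print(builds_reported_for(20_000))
```

Rounding up to the next tested size is a deliberately cautious choice: a data set of 5,000 records per category is checked against the 10,000-record row rather than the 1,000-record row.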