

 

Best practices for choosing predictors

Updated on May 17, 2024

When you create an adaptive or predictive model, the input fields that you select as predictor data play a crucial role in the predictive performance of that model. Some data types, such as dates and text, might require preprocessing. Follow best practices when you select predictors and choose data types for adaptive and predictive analytics.

Predictor selection considerations

When a model makes a prediction, predictive power is greatest when you include as much relevant, yet uncorrelated, information as possible. You can make a wide set of candidate predictors available, up to several hundred or more. Adaptive models automatically select the best subset of predictors. They group predictors into sets of related predictors and then select the best predictor from each group, that is, the predictor that has the strongest relationship with the outcome, measured as the area under the curve (AUC). In adaptive decisioning, this predictor selection process repeats periodically.
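
The grouping-and-selection idea can be illustrated with a minimal sketch. This is not the algorithm that adaptive models use internally. It assumes that "related" means strongly correlated (Pearson correlation above a hypothetical threshold of 0.8), that all predictors are numeric and not constant, and that a greedy pass over predictors ranked by univariate AUC is an acceptable stand-in for explicit grouping. All function names are illustrative.

```python
from math import sqrt

def auc(values, outcomes):
    """Chance that a random positive case has a higher value than a random negative case."""
    pos = [v for v, y in zip(values, outcomes) if y == 1]
    neg = [v for v, y in zip(values, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def strength(values, outcomes):
    """Univariate AUC regardless of direction (0.5 = no relationship, 1.0 = perfect)."""
    a = auc(values, outcomes)
    return max(a, 1.0 - a)

def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def select_predictors(data, outcomes, corr_threshold=0.8):
    """Keep the strongest predictor of each group of correlated predictors."""
    ranked = sorted(data, key=lambda name: strength(data[name], outcomes), reverse=True)
    selected = []
    for name in ranked:
        # A predictor joins the selection only if it is not strongly
        # correlated with a stronger predictor that was already kept.
        if all(abs(pearson(data[name], data[kept])) < corr_threshold for kept in selected):
            selected.append(name)
    return selected

# Hypothetical sample: age_months duplicates the information in age.
outcomes = [0, 0, 0, 1, 1, 1, 0, 1]
data = {
    "age": [20, 25, 30, 40, 45, 50, 35, 55],
    "age_months": [240, 300, 360, 480, 540, 600, 420, 660],
    "visits": [3, 1, 4, 1, 5, 9, 2, 6],
}
print(select_predictors(data, outcomes))
```

In this sample, age and age_months carry the same information, so only one of them survives, while the less correlated visits field is kept as an additional predictor.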

To achieve the best results, use predictors that provide data from many different data sources, such as:

  • Customer profile information, for example, age, income, gender, and current product subscriptions. This information is usually part of the Customer Analytic Record (CAR) and is refreshed regularly.
  • Interaction context, such as recent web browsing information, support call reasons, or input that is gathered during a conversation with the customer. This information can be highly relevant and, therefore, very predictive.
  • Customer behavior data, such as product usage or transaction history. The strongest predictors to predict future behavior typically contain data about past behavior.
  • Scores or other results from the offline execution of external models.
  • Other customer behavior metrics.

These data sources are included in a CAR that summarizes customer details, contextual information for the channel, and additional internal data sources, such as Interaction History.

Verify that the predictors in your models accurately predict customer behavior by monitoring their performance on a regular basis. For more information, see Adaptive models monitoring.

Note: You cannot use all data to drive predictions. There are legal and ethical reasons for not using some data, depending on the context (for example, ethnic origin, exact address, or other personal details). Before selecting such data for predictors, check with your organization about which rules apply, and focus on behavioral data that describes what the customer has done, instead of choosing the fields that describe who that person is.

Data types for predictors

Follow these guidelines to gain a basic understanding of how you can use different data types in adaptive and predictive analytics:

Numeric data
You can use basic numeric data, such as age, income, and customer lifetime value (CLV), without any preprocessing. Your model automatically divides that data into relevant value ranges by dynamically defining the bin boundaries. The following example shows uneven bin sizing for a numeric predictor:
Figure: Sample numeric data distribution from a Kaggle data set for marketing. The chart includes bars representing responses and a curve representing propensity.
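
To make the idea of uneven bin boundaries concrete, the following sketch bins a numeric predictor by equal frequency and reports the propensity per bin. It is only an illustration: the source does not describe how the model derives its boundaries, so equal-frequency binning and the function names here are assumptions.

```python
from bisect import bisect_right

def equal_frequency_bins(values, n_bins):
    """Return lower boundaries of bins 2..n so that each bin holds a similar number of values."""
    ordered = sorted(values)
    boundaries = []
    for i in range(1, n_bins):
        cut = ordered[i * len(ordered) // n_bins]
        # Skip duplicate cut points that heavily repeated values can produce.
        if not boundaries or cut > boundaries[-1]:
            boundaries.append(cut)
    return boundaries

def bin_propensity(values, outcomes, boundaries):
    """Return (case count, share of positive outcomes) for every bin."""
    counts = [[0, 0] for _ in range(len(boundaries) + 1)]
    for value, outcome in zip(values, outcomes):
        index = bisect_right(boundaries, value)  # a value equal to a boundary goes to the upper bin
        counts[index][0] += 1
        counts[index][1] += outcome
    return [(n, positives / n if n else 0.0) for n, positives in counts]

# Hypothetical ages and responses (1 = accepted).
ages = [18, 19, 20, 21, 22, 23, 30, 45, 70]
responses = [0, 0, 1, 0, 1, 1, 1, 1, 0]
boundaries = equal_frequency_bins(ages, 3)
print(boundaries, bin_propensity(ages, responses, boundaries))
```

Each bin holds three customers, but the bins span very different age ranges (under 21, 21 to 29, and 30 or older), which is the kind of uneven sizing that a data-driven binning produces.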
Categorical (symbolic) data
You can feed strings with up to 200 distinct values without any preprocessing. Such data is automatically categorized into relevant value groups, as shown in the following example:
Figure: Sample categorical data distribution. The chart shows responses and propensity for the following categories: Data Only, Platinum, Silver, Gold, Bronze.

For strings with more than 200 distinct values, group the data into fewer categories for better model performance. For more information, see Codes.
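
One simple way to reduce the number of distinct values is to keep the most frequent categories and fold the rest into a single residual category. The sketch below assumes this frequency-based approach and a hypothetical "Other" label. Grouping by business meaning, as described in the Codes section, is often the better choice when such knowledge is available.

```python
from collections import Counter

def collapse_rare_categories(values, max_categories=200, other_label="Other"):
    """Keep the most frequent categories and map all remaining values to other_label."""
    counts = Counter(values)
    if len(counts) <= max_categories:
        return list(values)
    # Reserve one slot for the residual category itself.
    keep = {category for category, _ in counts.most_common(max_categories - 1)}
    return [value if value in keep else other_label for value in values]

# Hypothetical example with a limit of 3 categories instead of 200.
plans = ["A"] * 5 + ["B"] * 3 + ["C", "D", "E"]
print(collapse_rare_categories(plans, max_categories=3))
```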

Note: Although one-hot encoding is common in data science, do not implement it for symbolic values. The built-in preprocessing feature in Pega Platform efficiently handles symbolic values without any additional preparation.
Customer identifiers
Customer identifiers are symbolic or numeric variables that have a unique value for each customer. Typically, they are not useful as predictors, although they might be predictive in special cases. For example, customer identifiers that are handed out sequentially might be predictive in a churn model, as they correlate to tenure.
Codes
For codes in which parts of the value carry meaning, feed those code fragments to the model as separate predictors. Simple values require only basic transformation. For example, you can shorten postal codes to the first 2 or 3 characters which, in most countries, denote geographical location.

For categorical numeric fields, where numbers do not carry any mathematical meaning, categorize input as symbolic in Prediction Studio. If any background information is available, such as hierarchical code grouping, add new fields that are derived from the code (for example, product versus product family).
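
The following sketch shows both ideas: shortening a postal code to a region prefix, and deriving a higher-level field from a code. The field names and the product-family lookup table are hypothetical. In practice, the hierarchy comes from your own reference data.

```python
# Hypothetical lookup from product code to product family.
PRODUCT_FAMILY = {"4010": "Mobile", "4020": "Mobile", "5010": "Broadband"}

def postal_region(postal_code, prefix_length=2):
    """Shorten a postal code to its leading characters."""
    return postal_code.strip().upper()[:prefix_length]

def code_predictors(record):
    """Derive symbolic predictors from the code fields of a customer record."""
    product_code = str(record["ProductCode"])  # treat the code as a symbol, not a number
    return {
        "PostalRegion": postal_region(record["PostalCode"]),
        "ProductCode": product_code,
        "ProductFamily": PRODUCT_FAMILY.get(product_code, "Unknown"),
    }

print(code_predictors({"PostalCode": " 1017 ab", "ProductCode": 4020}))
```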

Dates
If you use dates without any preprocessing, predictive and adaptive models categorize them as numeric data (absolute date/time value). The meaning of such input values changes over time; in other words, the same date might carry a different meaning, depending on the time of reference. For example, July 4th means "very recently" when you run a model on July 5th, but "several months ago" when you perform the analysis on December 6th.

Because of that ambiguity, avoid using absolute date/time values as predictors. Instead, take the time span until now (for example, derive age from the DateOfBirth field), or the time difference between various pairs of dates in your data fields. Additionally, you can improve predictor performance by extracting fields that denote a specific time of day, week, or month.

The following examples show date and time values used as predictors:

Figure: Sample predictors derived from a date (actual example from a Kaggle data mining competition). The chart shows responses and propensity for different months.
Figure: Sample predictors extracted from a date/time stamp (duration). The chart shows responses and propensity as ranges.
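
As a sketch of the recommended derivations, the following function turns absolute dates into a time span until now, a difference between two dates, and calendar fields. The input fields other than DateOfBirth (a last purchase time stamp and a contract start date) and all output names are hypothetical.

```python
from datetime import date, datetime

def date_predictors(date_of_birth, last_purchase, contract_start, now):
    """Derive relative and cyclical predictors from absolute date/time values."""
    had_birthday = (now.month, now.day) >= (date_of_birth.month, date_of_birth.day)
    return {
        # Time span until now
        "AgeInYears": now.year - date_of_birth.year - (0 if had_birthday else 1),
        "DaysSinceLastPurchase": (now.date() - last_purchase.date()).days,
        # Difference between a pair of dates in the data
        "TenureAtLastPurchaseDays": (last_purchase.date() - contract_start).days,
        # Specific time of day, week, or year
        "LastPurchaseHour": last_purchase.hour,
        "LastPurchaseWeekday": last_purchase.isoweekday(),  # 1 = Monday ... 7 = Sunday
        "LastPurchaseMonth": last_purchase.month,
    }

purchase = datetime(2024, 7, 4, 14, 30)
for run_time in (datetime(2024, 7, 5, 9, 0), datetime(2024, 12, 6, 9, 0)):
    fields = date_predictors(date(1990, 7, 4), purchase, date(2020, 1, 1), run_time)
    print(run_time.date(), fields["DaysSinceLastPurchase"])
```

The same purchase date yields 1 day when the model runs on July 5th and 155 days when it runs on December 6th, so the derived field keeps its meaning regardless of when the model runs.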
Text

Do not use plain text to create predictors without any preprocessing because it contains too many unique values. Instead, run a text analyzer on your text input to extract such fields as the intent, topic, and sentiment, to use them as predictors.

A text analyzer puts its output in a property of the Data-NLP-Outcome class. You can use elements from this class as model input, as strategy properties, and so on. Some of the frequently used properties include the following:

Property | Description
<Output Field>.pyTopics(1).pyName | The first category (topic) in a set of multiple topics. A related .pyConfidenceScore property assigns the confidence level for rules that are based on models.
<Output Field>.pyIntents(1).pyName | The first intent in a set of multiple intents.
<Output Field>.OverallSentiment | An overall sentiment that is mapped to a string (for example, positive, neutral, or negative).
<Output Field>.OverallSentimentValue | An overall sentiment as a numeric value (in the range of –1 to +1).
Event streams
Do not use event streams as predictors without preprocessing, but extract the data in an event strategy instead. Store the aggregations in a Decision Data Store (DDS) data set that is typically keyed by Customer ID, as shown in the following example:
Figure: High-frequency events aggregation. The Decision response shape points to the Aggregate Events strategy, which points to the Event Aggregates data set.

In the decision process, this data set is joined with the rest of the customer data, and the aggregates are treated like any other symbolic or numeric field, as shown in the following example:

Figure: Merging aggregated events with customer data. The diagram shows the Event Aggregates data set connected to an Event Data merge shape in a decision strategy.
Interaction History
Past interactions are usually very predictive. You can use Interaction History (IH) to extract such fields as the number of recent purchases, the time since last purchase, and so on. To summarize and preprocess IH to use that data in predictions, use IH summaries. A list of predictors based on IH summaries is enabled by default, without any additional setup, for all new adaptive models.

For more information about using IH data in adaptive analytics, see Add predictors based on Interaction History.

Multidimensional data
For models that make your primary decision for a customer, use lists of products, activities, and so on, as the source of useful information for predictors. Create fields from that data either through Pega Platform expressions that operate on these lists or through substrategies that work on this embedded data, and then complete aggregations in strategies. Regardless of your choice, use your intuition and data science insight to determine the possibly relevant derivatives, for example, number-of-products, average-sentiment-last-30-days, and so on.
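
The derivatives named above can be sketched outside of Pega Platform as simple aggregations over embedded lists. The record layout (a products list and a contacts list with a sentiment value per contact) is a hypothetical example. In an application, you would express the same logic in expressions or substrategies.

```python
from datetime import date, timedelta

def list_predictors(customer, today, window_days=30):
    """Summarize embedded lists of a customer record into scalar predictors."""
    cutoff = today - timedelta(days=window_days)
    recent = [c["sentiment"] for c in customer["contacts"] if c["date"] >= cutoff]
    return {
        "NumberOfProducts": len(customer["products"]),
        "HasCreditCard": any(p["family"] == "CreditCard" for p in customer["products"]),
        # None marks customers without any contact in the window.
        "AverageSentimentLast30Days": sum(recent) / len(recent) if recent else None,
    }

customer = {
    "products": [{"family": "CreditCard"}, {"family": "Savings"}],
    "contacts": [
        {"date": date(2024, 5, 10), "sentiment": 0.5},
        {"date": date(2024, 4, 20), "sentiment": -0.25},
        {"date": date(2024, 1, 5), "sentiment": -1.0},
    ],
}
print(list_predictors(customer, today=date(2024, 5, 17)))
```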
Pega internal data
For predictions in the context of a Pega Platform application, Pega Platform internal data might be useful as additional predictors, on top of external customer data that comes from outside Pega Platform.
Real-time contextual data
To increase the efficiency and performance of your models, do not limit the personalization of your decisions and predictions only to the customer. By supplementing the decision process data with the interaction context, you can adjust the predictions for a customer and provide different outcomes depending on the context. The changing circumstances might include the reason for a call, the particular part of the website or mobile app where the customer operates, the current Interactive Voice Response (IVR) menu, and so on.
Customer behavior and usage
Customer behavior and interactions, such as financial transactions, claims, calls, complaints, and flights, are typically transactional in nature. From the predictive analytics perspective, you can use that data to create derived fields that summarize or aggregate this data for better predictions, for example, by adopting the Recency, Frequency, Monetary (RFM) approach.

For example, use RFM to track the latest call of a certain type, the frequency of calls in general, and their duration or monetary value. You can perform that search across different time periods, and potentially transform or combine some of that data to extract detailed statistics, such as the average length of a call, the average gigabyte usage last month, an increase or decrease in usage over the last month compared to previous months, and so on.
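
A minimal RFM-style sketch, assuming that transactions are available as (time stamp, amount) pairs and that a 30-day window is of interest. The output names are hypothetical. The same pattern applies to call durations or data usage instead of monetary amounts.

```python
from datetime import datetime, timedelta

def rfm_predictors(transactions, now, window_days=30):
    """Summarize (time stamp, amount) pairs into recency, frequency, and monetary fields."""
    window = timedelta(days=window_days)
    past = [(ts, amount) for ts, amount in transactions if ts <= now]
    current = [amount for ts, amount in past if ts > now - window]
    previous = [amount for ts, amount in past if now - 2 * window < ts <= now - window]
    last_ts = max((ts for ts, _ in past), default=None)
    return {
        "DaysSinceLast": (now - last_ts).days if last_ts is not None else None,  # recency
        "CountInWindow": len(current),                                           # frequency
        "TotalInWindow": sum(current),                                           # monetary
        "AverageInWindow": sum(current) / len(current) if current else 0.0,
        # Positive values mean an increase compared to the preceding window.
        "ChangeVsPreviousWindow": sum(current) - sum(previous),
    }

payments = [
    (datetime(2024, 5, 10, 9, 0), 20.0),
    (datetime(2024, 5, 1, 18, 30), 40.0),
    (datetime(2024, 4, 1, 8, 0), 100.0),
]
print(rfm_predictors(payments, now=datetime(2024, 5, 17, 12, 0)))
```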

For more information, see Event streams above.

Model scores and other data science output
Scores from predictive models for different but related outcomes and other data science output might be predictive as well. Common data science output types that are useful as predictors include:
  • Classifications
  • Segmentations and clusters
  • Embeddings
  • Dimensionality reduction scores (for example, from PCA)

A typical application of data science output in analytics is the use of one higher-level product propensity score in a large number of adaptive models that are related to the same product. For example, you can apply a single score for the propensity to buy or use a credit card to all the adaptive models that are related to credit card proposition channel variants.

If you decide to use scores as predictors in your models, evaluate whether the models that include such a score perform better at the model level, by verifying the area under the curve (AUC) and the success rate metrics.
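
Outside of the product, the same comparison can be sketched on a holdout sample: compute the AUC of the propensities that a model produces without the score and with the score, against the same recorded outcomes. The propensity lists below are made-up illustration data, and the function names are hypothetical.

```python
def auc(propensities, outcomes):
    """Chance that a random positive case gets a higher propensity than a random negative case."""
    pos = [p for p, y in zip(propensities, outcomes) if y == 1]
    neg = [p for p, y in zip(propensities, outcomes) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def success_rate(outcomes):
    """Share of positive outcomes among all recorded responses."""
    return sum(outcomes) / len(outcomes)

# Made-up holdout data: recorded outcomes and model propensities.
outcomes = [1, 0, 1, 0, 0, 1]
without_score = [0.6, 0.5, 0.4, 0.3, 0.2, 0.7]
with_score = [0.7, 0.4, 0.6, 0.3, 0.2, 0.8]

print("AUC without score:", round(auc(without_score, outcomes), 3))
print("AUC with score:", round(auc(with_score, outcomes), 3))
print("Success rate:", success_rate(outcomes))
```

A higher AUC for the variant that includes the score suggests that the score adds information. A difference on a sample this small is not meaningful, so base the decision on a sufficient number of responses.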

You can use a score that is calculated offline, for example, a product affinity score that is part of the data model, and add the property as a predictor. Alternatively, you can calculate the score in real time through H2O, PAD, PMML, Google AI Platform, or SageMaker models, and add the score to the adaptive model as a parameterized predictor.
