Best practices for choosing predictors
When you create an adaptive or predictive model, the input fields that you select as predictor data play a crucial role in the predictive performance of that model. Some data types, such as dates and text, might require preprocessing. Follow best practices when you select predictors and choose data types for adaptive and predictive analytics.
Predictor selection considerations
When a model makes a prediction, predictive power is greatest when you include as much relevant, yet uncorrelated, information as possible. You can make a wide set of candidate predictors available, up to several hundred or more. Adaptive models automatically select the best subset of predictors: they group predictors into sets of related predictors and then select the best predictor from each group, that is, the predictor that has the strongest relationship with the outcome, measured as area under the curve (AUC). In adaptive decisioning, this predictor selection process repeats periodically.
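The grouping and selection idea can be illustrated with a short Python sketch. This is not the adaptive model implementation; it is a simplified approximation that assumes Pearson correlation as the measure of relatedness, a hypothetical threshold of 0.8, and made-up predictor names and values.

```python
def auc(values, outcomes):
    """Area under the ROC curve by pairwise comparison; ties count as 0.5."""
    positives = [v for v, y in zip(values, outcomes) if y == 1]
    negatives = [v for v, y in zip(values, outcomes) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def group_and_select(predictors, outcomes, threshold=0.8):
    """Return {best predictor: [group members]} for correlated groups.

    Predictors are visited from strongest to weakest, so the first member
    of each group is also the best performer in that group.
    """
    def strength(name):
        a = auc(predictors[name], outcomes)
        return max(a, 1.0 - a)  # direction of the relationship is irrelevant

    groups = {}
    for name in sorted(predictors, key=strength, reverse=True):
        for best in groups:
            if abs(pearson(predictors[name], predictors[best])) >= threshold:
                groups[best].append(name)
                break
        else:
            groups[name] = [name]  # not related to any group: start a new one
    return groups


# Hypothetical data: 1 = positive outcome, 0 = negative outcome.
outcomes = [0, 0, 0, 1, 1, 1, 0, 1]
predictors = {
    "Age": [22, 25, 30, 45, 50, 60, 35, 40],
    "Income": [20, 24, 31, 40, 52, 58, 42, 39],
    "WebVisits": [1, 5, 2, 4, 6, 3, 2, 5],
}
groups = group_and_select(predictors, outcomes)
selected = list(groups)
```

In this sketch, Age and Income are strongly correlated, so only the stronger of the two is kept, while WebVisits forms its own group.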
To achieve the best results, use predictors that provide data from many different data sources, such as:
- Customer profile information, for example, age, income, gender, and current product subscriptions. This information is usually part of the Customer Analytic Record (CAR) and is refreshed regularly.
- Interaction context, such as recent web browsing information, support call reasons, or input that is gathered during a conversation with the customer. This information can be highly relevant and, therefore, very predictive.
- Customer behavior data, such as product usage or transaction history. The strongest predictors of future behavior typically contain data about past behavior.
- Scores or other results from the off-line execution of external models.
- Other customer behavior metrics.
These data sources are included in a CAR that summarizes customer details, contextual information for the channel, and additional internal data sources, such as Interaction History.
Verify that the predictors in your models accurately predict customer behavior by monitoring their performance on a regular basis. For more information, see Adaptive models monitoring.
Data types for predictors
Follow these guidelines to gain a basic understanding of how you can use different data types in adaptive and predictive analytics:
- Numeric data
- You can use basic numeric data, such as age, income, and customer lifetime value (CLV), without any preprocessing. Your model automatically divides that data into relevant value ranges by dynamically defining the bin boundaries. The following example shows uneven bin sizing for a numeric predictor:
- Categorical (symbolic) data
- You can feed strings with up to 200 distinct values without any
preprocessing. Such data is automatically categorized into relevant value
groups, as shown in the following example:
For strings with more than 200 distinct values, group the data into fewer categories for better model performance. For more information, see Codes.
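One possible way to reduce the number of distinct values is to keep the most frequent values and map the remainder to a single category. The following Python sketch assumes a frequency-based approach; the function name, the `Other` label, and the sample values are hypothetical, and grouping by business meaning is often a better choice when such knowledge is available.

```python
from collections import Counter


def group_rare_values(values, max_categories=200, other_label="Other"):
    """Keep the most frequent values; map all others to one shared label."""
    counts = Counter(values)
    # Reserve one slot for the shared label itself.
    keep = {value for value, _ in counts.most_common(max_categories - 1)}
    return [value if value in keep else other_label for value in values]


# Hypothetical input, limited to 3 categories to keep the example small.
cities = ["Boston", "Boston", "Austin", "Austin", "Reno", "Fargo"]
grouped = group_rare_values(cities, max_categories=3)
```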
- Customer identifiers
- Customer identifiers are symbolic or numeric variables that have a unique value for each customer. Typically, they are not useful as predictors, although they might be predictive in special cases. For example, customer identifiers that are handed out sequentially might be predictive in a churn model because they correlate with tenure. However, as a best practice, exclude them from your model.
- Codes
- For meaningful numeric fields, feed code fragments to the model as separate
predictors. Simple values require only basic transformation. For example,
you can shorten postal codes to the first 2 or 3 characters which, in most
countries, denote geographical location.
For categorical numeric fields, where numbers do not carry any mathematical meaning, categorize input as symbolic in Prediction Studio. If any background information is available, such as hierarchical code grouping, add new fields that are derived from the code (for example, product versus product family).
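As a minimal sketch of both transformations, the following Python example shortens a postal code to its leading characters and derives a product family from a product code. The function names, the lookup table, and the codes in it are hypothetical placeholders for your own reference data.

```python
def postal_region(postal_code, prefix_length=2):
    """Return the leading characters of a postal code as a region predictor."""
    return postal_code.strip().upper()[:prefix_length]


# Hypothetical hierarchy: product code -> product family.
PRODUCT_FAMILY = {
    "CC-GOLD": "Credit cards",
    "CC-PLAT": "Credit cards",
    "SAV-01": "Savings",
}


def product_family(product_code):
    """Derive a coarser symbolic field from a detailed product code."""
    return PRODUCT_FAMILY.get(product_code, "Unknown")


region_short = postal_region("  1017 ab")
region_long = postal_region("SW1A 1AA", prefix_length=3)
family = product_family("CC-PLAT")
```

The derived fields are symbolic, so even when a code consists only of digits, treat the result as a category rather than as a number.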
- Dates
- If you use dates without any preprocessing, predictive and adaptive models
categorize them as numeric data (absolute date/time value). The meaning of
such input values changes over time; in other words, the same date might
carry a different meaning, depending on the time of reference. For example,
July 4th is very recent when you run a model on July 5th, but
when you perform the analysis on December 6th, the same date is several
months in the past.
Because of that ambiguity, avoid using absolute date/time values as predictors. Instead, take the time span until now (for example, derive age from the DateOfBirth field), or the time difference between various pairs of dates in your data fields. Additionally, you can improve predictor performance by extracting fields that denote a specific time of day, week, or month.
The following examples show date and time values used as predictors:
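The derivations described for dates can be sketched in Python as follows. The function and field names are hypothetical, and the reference date is passed in explicitly so that the same input always produces the same result.

```python
from datetime import date, datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def age_in_years(date_of_birth, today):
    """Time span until now: whole years since the date of birth."""
    years = today.year - date_of_birth.year
    # Subtract one year if the birthday has not yet occurred this year.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years


def days_between(earlier, later):
    """Time difference between a pair of dates, in days."""
    return (later - earlier).days


def time_parts(timestamp):
    """Fields that denote a specific time of day, week, or month."""
    return {
        "HourOfDay": timestamp.hour,
        "DayOfWeek": DAY_NAMES[timestamp.weekday()],
        "DayOfMonth": timestamp.day,
    }


age = age_in_years(date(1990, 8, 15), today=date(2024, 7, 5))
recent = days_between(date(2024, 7, 4), date(2024, 7, 5))
older = days_between(date(2024, 7, 4), date(2024, 12, 6))
parts = time_parts(datetime(2024, 7, 4, 14, 30))
```

The two `days_between` calls use the same absolute date but yield 1 and 155, which is the difference in meaning that an absolute date/time value cannot express.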
- Text
- Do not use plain text to create predictors without any preprocessing because it contains too many unique values. Instead, run a text prediction on your text input to extract such fields as the intent, topic, and sentiment, to use them as predictors.
A text prediction puts its output in a property of the Data-NLP-Outcome class. You can use elements from this class as model input, as strategy properties, and so on. Some of the frequently used properties include the following:
| Property | Description |
| --- | --- |
| <Output Field>.pyTopics(1).pyName | The first category (topic) in a set of multiple topics. A related .pyConfidenceScore property assigns the confidence level for rules that are based on models. |
| <Output Field>.pyIntents(1).pyName | The first intent in a set of multiple intents. |
| <Output Field>.OverallSentiment | An overall sentiment that is mapped to a string (for example, positive, neutral, or negative). |
| <Output Field>.OverallSentimentValue | An overall sentiment as a numeric value (in the range of –1 to +1). |
- Event streams
- Do not use event streams as predictors without preprocessing, but extract
the data in an event strategy instead. Store the aggregations in a Decision
Data Store (DDS) data set that is typically keyed by Customer ID, as shown
in the following example:
In the decision process, this data set is joined with the rest of the customer data, and the aggregates are treated like any other symbolic or numeric field, as shown in the following example:
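The aggregate-then-join pattern can be illustrated outside of Pega Platform with a small Python sketch. It does not use an event strategy or a DDS data set; plain dictionaries stand in for both, and the field names (`CustomerID`, `Value`, `EventCount`, `TotalValue`) are hypothetical.

```python
from collections import defaultdict


def aggregate_events(events):
    """Summarize raw events into one record per customer ID."""
    aggregates = defaultdict(lambda: {"EventCount": 0, "TotalValue": 0.0})
    for event in events:
        record = aggregates[event["CustomerID"]]
        record["EventCount"] += 1
        record["TotalValue"] += event["Value"]
    return dict(aggregates)


def join_aggregates(customers, aggregates):
    """Add the aggregates to each customer record as ordinary fields."""
    default = {"EventCount": 0, "TotalValue": 0.0}
    return {
        customer_id: {**profile, **aggregates.get(customer_id, default)}
        for customer_id, profile in customers.items()
    }


# Hypothetical event stream and customer data.
events = [
    {"CustomerID": "C1", "Value": 10.0},
    {"CustomerID": "C2", "Value": 5.0},
    {"CustomerID": "C1", "Value": 2.5},
]
customers = {"C1": {"Age": 41}, "C3": {"Age": 29}}

aggregates = aggregate_events(events)
joined = join_aggregates(customers, aggregates)
```

Customers without any events receive zero-valued aggregates in this sketch, so every record exposes the same set of fields.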
- Interaction History
- Past interactions are usually very predictive. You can use Interaction
History (IH) to extract such fields as the number of recent purchases, the
time since last purchase, and so on. To summarize and preprocess IH to use
that data in predictions, use IH summaries. A list of predictors based on IH
summaries is enabled by default, without any additional setup, for all new
adaptive models.
For more information about using IH data in adaptive analytics, see Add predictors based on Interaction History.
- Multidimensional data
- For models that make your primary decision for a customer, use lists of products, activities, and so on, as the source of useful information for predictors. Create fields from that data either through Pega Platform expressions that operate on these lists or through substrategies that work on this embedded data, and then complete aggregations in strategies. Regardless of your choice, use your intuition and data science insight to determine the possibly relevant derivatives, for example, number-of-products, average-sentiment-last-30-days, and so on.
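As an illustration of such derivatives, the following Python sketch computes a product count and an average sentiment over the last 30 days from embedded lists. It is not a Pega Platform expression or a substrategy; the function name, the input shapes, and the sample values are assumptions.

```python
from datetime import date, timedelta


def derive_fields(products, sentiments, as_of):
    """Derive flat predictor fields from lists that belong to one customer.

    products:   list of product names
    sentiments: list of (date, score) pairs, with scores from -1 to +1
    """
    cutoff = as_of - timedelta(days=30)
    recent = [score for day, score in sentiments if cutoff <= day <= as_of]
    return {
        "NumberOfProducts": len(products),
        # None marks a customer without any recent sentiment scores.
        "AverageSentimentLast30Days": sum(recent) / len(recent) if recent else None,
    }


# Hypothetical customer data.
fields = derive_fields(
    products=["Checking account", "Gold card"],
    sentiments=[
        (date(2024, 6, 1), -0.5),
        (date(2024, 6, 20), 0.5),
        (date(2024, 7, 1), 0.25),
    ],
    as_of=date(2024, 7, 5),
)
```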
- Pega internal data
- For predictions in the context of a Pega Platform application, consider adding Pega Platform internal data as predictors on top of external, non-Pega Platform customer data.
- Real-time contextual data
- To increase the efficiency and performance of your models, do not limit the personalization of your decisions and predictions to customer data alone. By supplementing the decision process data with the interaction context, you can adjust the predictions for a customer and provide different outcomes depending on the context. The changing circumstances might include the reason for a call, the particular part of the website or mobile app where the customer operates, the current Interactive Voice Response (IVR) menu, and so on.
- Customer behavior and usage
- Customer behavior and interactions, such as financial transactions, claims,
calls, complaints, and flights, are typically transactional in nature. From
the predictive analytics perspective, you can use that data to create
derived fields that summarize or aggregate this data for better predictions,
for example, by adopting the Recency, Frequency, Monetary (RFM)
approach.
For example, use RFM to track the latest call of a certain type, the frequency of calls in general, and their duration or monetary value. You can perform that search across different time periods, and potentially transform or combine some of that data to extract detailed statistics, such as the average length of a call, the average gigabyte usage last month, an increase or decrease in usage over the last month compared to previous months, and so on.
For more information, see Event streams above.
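A minimal Python sketch of RFM-style derived fields for a single customer follows. The function name, the output field names, and the optional time window are assumptions; the transactions are hypothetical (date, amount) pairs.

```python
from datetime import date, timedelta


def rfm(transactions, as_of, window_days=None):
    """Compute Recency, Frequency, and Monetary fields for one customer.

    transactions: list of (date, amount) pairs
    window_days:  if set, only transactions in the last N days are counted
    """
    if window_days is not None:
        cutoff = as_of - timedelta(days=window_days)
        transactions = [(d, a) for d, a in transactions if d >= cutoff]
    if not transactions:
        return {"RecencyDays": None, "Frequency": 0, "Monetary": 0.0}
    latest = max(d for d, _ in transactions)
    return {
        "RecencyDays": (as_of - latest).days,        # time since latest event
        "Frequency": len(transactions),              # number of events
        "Monetary": sum(a for _, a in transactions),  # total value
    }


# Hypothetical call records: (date of call, cost of call).
calls = [
    (date(2024, 5, 1), 20.0),
    (date(2024, 6, 10), 35.5),
    (date(2024, 6, 28), 12.0),
]
all_time = rfm(calls, as_of=date(2024, 7, 5))
last_30_days = rfm(calls, as_of=date(2024, 7, 5), window_days=30)
```

Computing the same fields over several windows, as in the two calls above, is one way to capture an increase or decrease in activity over time.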
- Model scores and other data science output
- Scores from predictive models for different but related outcomes and other
data science output might be predictive as well. Common data science output
types that are useful as predictors include:
- Classifications
- Segmentations and clusters
- Embeddings
- Dimensionality reduction scores (PCA)
A typical application of data science output in analytics is the use of a higher-level product propensity score for a large number of adaptive models that are related to the same product. For example, you can apply a single score for the propensity to buy or use a credit card to all the adaptive models that are related to credit card proposition channel variants.
If you decide to use scores as predictors in your models, evaluate whether the models that include such a score perform better at the model level, by verifying the area under the curve (AUC) and the success rate metrics.
You can use an offline-calculated score, for example, for product affinity, that is part of the data model, and add the property as a predictor. Alternatively, you can calculate the score in real time through H2O, PAD, PMML, Google AI Platform, or SageMaker models and add the score to the adaptive model as a parameterized predictor.