Understanding the Pega Customer Decision Hub approach to protecting customer data
With the increasing number of data breaches being reported in the news, IT departments are beginning to take more drastic stances on data protection. Sometimes, security policies state that everything should be encrypted. While the tenet of having multiple layers of security calls for universal encryption at some layers of the stack, we are also seeing clashes between blanket approaches and what makes sense technically, both in terms of marketing functional conflicts and performance considerations.
This document explains Pega's approach to protecting customer data for use by the Pega Customer Decision Hub product, incorporating what is practical, performant, and safe.
Why not just encrypt everything using Pega's PropertyEncrypt control policy feature?
The fundamental problem we run into with encryption of customer data in the field of Marketing and Customer Engagement is that mathematically, there is a limit to what comparison operations can be applied to encrypted data without decrypting it. Two encrypted strings or numbers can be compared for an exact match, but no substring or range comparisons can be made. Although research is being done today in the emerging area of Homomorphic Encryption, current technology does not make it feasible to filter for Age > 18 on customer records if Age is stored in an encrypted form.
This effectively means that if Pega's PropertyEncrypt mechanism is used on a field, as described in this article, that property cannot be referenced in any segment criteria, nor compared in engagement policies (eligibility, applicability, or suitability).
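The comparison limit can be illustrated with a minimal Python sketch. It uses a keyed hash (HMAC-SHA256) as a stand-in for a deterministic encryption scheme. This is an assumption made for illustration only and does not reflect how PropertyEncrypt is implemented; the key and the `protect` helper are hypothetical.

```python
import hashlib
import hmac

# Hypothetical key, for illustration only.
SECRET_KEY = b"demo-key-not-for-production"


def protect(value: str) -> str:
    """Return a deterministic opaque token for a value (stand-in for ciphertext)."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


ages = list(range(100))
tokens = [protect(str(age)) for age in ages]

# Exact match survives: the same input always produces the same token.
same_token = protect("18") == tokens[18]

# Ordering does not survive: sorting by token scrambles the ages, so a
# range filter such as Age > 18 cannot be evaluated on the tokens alone.
ages_sorted_by_token = [age for _, age in sorted(zip(tokens, ages))]
order_preserved = ages_sorted_by_token == ages

print(same_token, order_preserved)
```

In this sketch the equality check succeeds, while the order of the tokens bears no relation to the order of the underlying ages, which is why range criteria cannot be applied to the protected values.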
The reason why segments cannot compare properties encrypted by Pega Platform is that segment logic is evaluated as SQL in the backing relational database (for example, Postgres or Oracle). While all relational database management systems have their own on-disk encryption, they do not know anything about Pega's encryption. Only Pega code can decrypt Pega-encrypted fields, and since Pega code is not evaluating the SQL, a segment that makes comparisons on Pega-encrypted fields will see zero results come back.
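A minimal sketch of this effect follows, using Python's built-in `sqlite3` module as a stand-in for the backing database (an assumption; the behavior described above concerns Postgres or Oracle). The table, column names, key, and the `protect` helper are hypothetical, and the keyed hash only simulates a value that the database cannot decrypt.

```python
import hashlib
import hmac
import sqlite3

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key


def protect(value: str) -> str:
    """Stand-in for application-level encryption the database cannot reverse."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, last_name TEXT, region TEXT)"
)
conn.executemany(
    "INSERT INTO customer (id, last_name, region) VALUES (?, ?, ?)",
    [
        (1, protect("Cohen"), "NE"),
        (2, protect("Smith"), "NE"),
        (3, protect("Cohen"), "SW"),
    ],
)

# Segment criterion written against the plaintext value: the database only
# sees opaque tokens, so nothing matches.
encrypted_matches = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE last_name = ?", ("Cohen",)
).fetchone()[0]

# A criterion on a field that is not application-encrypted works as usual.
plain_matches = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE region = ?", ("NE",)
).fetchone()[0]

print(encrypted_matches, plain_matches)  # prints "0 2"
```

The query on the protected column returns no rows even though two customers carry the matching last name, because the SQL engine compares the literal against stored tokens it cannot interpret.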
Pega's PropertyEncrypt is designed with the use case of encrypting highly confidential personally identifiable information (PII) so it can only be seen by an agent (a user) using a Pega user interface. In support of that goal, the decryption is done at the very last moment, right before the field is displayed to the user in the user interface. Decryption is not invoked by background processing. Because of that, even strategies, which call When rules generated with the Condition Builder, cannot make comparisons against encrypted fields.
Even with a way to make background processing call the decryption logic, the problem would be one of performance, as batch processing for marketing use cases is already a CPU-intensive process. While a few milliseconds spent on decryption is perfectly acceptable for fields displayed in a user interface, where a human will spend at least a few seconds looking at the screen, calling this decryption code thousands of times per second per thread (as is common with large-scale batch decisioning, which is frequently used in outbound marketing) would introduce a performance bottleneck.
If we cannot encrypt specific fields for use in comparisons, what do we do?
We believe the key is to be selective about what data is encrypted at each level of the stack. We have several levels of encryption in our stack:
- Encryption in the database (AWS's RDS service) – Data at rest
- Encryption on the wire (HTTPS queries to the database) – Data in motion
- Encryption on the wire (HTTPS requests from web browsers to Pega) – Data in motion
- Encryption in Pega Platform (PropertyEncrypt) – Data at rest and in memory
There is no reason not to encrypt all data in the database, the first level in the list above. In fact, Pega Cloud is provisioned this way by default, using the standard AES-256 encryption algorithm. So a data protection policy that requires all data at rest to be encrypted is already satisfied, without having to use any additional Pega Platform data encryption. Most Pega clients operate at this level of encryption.
All connections between the Pega runtime and the database run over HTTPS, so data in motion is encrypted as well, without having to do any additional configuration.
All data between the Pega system and the customer's browser or an agent's browser runs over HTTPS, so personalized content in motion is encrypted as well, without having to do any additional configuration.
What to do about PII?
Pega's view on PII in Pega Customer Decision Hub is that it should be treated specially. Like premium status on an airline, if everyone is Platinum, then Platinum is not such a special status. In a similar way, if all data is considered to be PII, then truly private data is not treated any differently from other data. Pega Customer Decision Hub's interpretation of personally identifiable information is data that can be used to identify one particular person out of a group of similar people. For example, a social security number would be considered PII, as this number, taken by itself, could be used to immediately identify one particular person without ambiguity. A social security number should definitely be encrypted and treated in the most special way.
At the same time, the Pega view on using PII data in marketing is that PII is not meant to be used for targeting specific people with specific actions. In the world of traditional segment-oriented campaign marketing, this would mean that PII fields should not be used as segmentation criteria – not just because it is not technically possible to use Pega-encrypted fields in a SQL query, but because it is not a basis for good marketing practice. As an extreme example, it is hard to think of a good reason to offer a specific action to people whose social security number ends with 89. Likewise, it is hard to think of a reason why an action would be applicable only to people whose last name may be Cohen. Beyond not being good marketing practice, such targeting criteria border on unethical.
Let’s take a less extreme example of date of birth. It is very reasonable, especially in Healthcare-oriented use cases, for people to qualify for offers when they are reaching specific birthdays, such as 65. We could make the argument that it is not date of birth, but number of days until their 65th birthday, which should be used for marketing purposes. Nevertheless, it is hard to calculate such a derived attribute without also exposing the date of birth, as one could calculate the date of birth from the number of days until a certain age. In this scenario, we would not consider date of birth to be Personally Identifiable Information, as you could not identify a particular person solely by their date of birth.
Yet, as we appreciate CISOs often must adopt a stricter policy, the way to treat date of birth as special while still enabling use cases that need to qualify or disqualify people for actions, is to create a derived field that extracts the targeting criteria, while obfuscating the truly Personally Identifiable data. In this example, Pega would advise creating a field called IsWithin60DaysOf65thBirthday, which would not be encrypted in Pega Platform (it would still be encrypted in the database). A background job that runs outside of Pega Cloud and outside of Pega Customer Decision Hub, which has access to the unencrypted data (for example, via ETL from the definitive source or calculated using a custom written script that has the ability to decrypt date of birth), evaluates every customer periodically and sets this flag. Only the value of the flag would make its way into Pega Cloud / Pega Customer Decision Hub. The job could run only once a week or even twice a month if we are looking to set the flag when a person enters the period of time that is 60 days before their 65th birthday – it does not matter if we talk to them exactly 60 days before it, or 45 days before it.
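The flag calculation performed by such a background job might look like the following minimal Python sketch. The function names are hypothetical, the window is assumed to run from 60 days before the 65th birthday up to the birthday itself, and the handling of 29 February birthdays is an assumption that a real policy may define differently.

```python
from datetime import date, timedelta


def birthday_in_year(date_of_birth: date, year: int) -> date:
    """Birthday in a given year; 29 February falls back to 28 February
    in non-leap years (an assumption, policies differ)."""
    try:
        return date_of_birth.replace(year=year)
    except ValueError:
        return date(year, 2, 28)


def is_within_60_days_of_65th_birthday(date_of_birth: date, today: date) -> bool:
    """True from 60 days before the 65th birthday up to the birthday itself."""
    sixty_fifth = birthday_in_year(date_of_birth, date_of_birth.year + 65)
    return timedelta(0) <= sixty_fifth - today <= timedelta(days=60)


# Only the resulting flag, never the date of birth, would be loaded
# into the marketing customer profile.
today = date(2025, 3, 1)
print(is_within_60_days_of_65th_birthday(date(1960, 4, 15), today))  # True (45 days away)
print(is_within_60_days_of_65th_birthday(date(1960, 6, 1), today))   # False (92 days away)
```

Because the job only needs to catch customers somewhere inside the window, running it weekly or twice a month with the current date as `today` is sufficient in this sketch.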
Another obfuscation technique that is often deployed for this purpose is bucketing. In this technique, a similar ETL process is used to bucket value ranges together. For example, the marketing customer profile field may be called DemographicBucket. People who are between ages 35-45 would be assigned the value A. People who are between 46-65 would be assigned value B, and so on. Only the marketing operations team would know the precise boundaries. With this knowledge, the marketing operations team can create obfuscated queries without true PII data being exposed, while true PII data remains encrypted and is never used for query purposes.
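A minimal sketch of the bucketing step, assuming it runs in the ETL process outside Pega. The `BUCKETS` table and `demographic_bucket` function are hypothetical names; only the two ranges given in the example above are included, and ages outside them are left unassigned here.

```python
from typing import Optional

# (lowest age, highest age, bucket label); boundaries known only to
# the marketing operations team.
BUCKETS = [
    (35, 45, "A"),
    (46, 65, "B"),
]


def demographic_bucket(age: int) -> Optional[str]:
    """Map an exact age to its bucket label, or None if no bucket applies."""
    for low, high, label in BUCKETS:
        if low <= age <= high:
            return label
    return None  # ages outside the defined ranges get no bucket in this sketch


print(demographic_bucket(40))  # A
print(demographic_bucket(46))  # B
```

Only the label would be stored in DemographicBucket, so a segment can filter on a value such as A without the exact age ever reaching the marketing profile.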
Naturally, if marketers wanted to send a Happy Birthday e-mail to people on their birthday, these obfuscation approaches would not work. In this case, an approach would be to store only the birthday (day and month) but not the year, in a form that can be accessed for marketing purposes. If done in conjunction with the bucketing approach, the true date of birth would remain undiscoverable.
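As a rough sketch of the birthday-only approach, the ETL step could keep just the month and day. The function names and the MM-DD storage format are assumptions made for illustration.

```python
from datetime import date


def birthday_key(date_of_birth: date) -> str:
    """Return month and day as MM-DD, discarding the year."""
    return f"{date_of_birth.month:02d}-{date_of_birth.day:02d}"


def has_birthday_today(stored_key: str, today: date) -> bool:
    """Compare the stored MM-DD value with today's month and day."""
    return stored_key == birthday_key(today)


key = birthday_key(date(1960, 4, 5))
print(key)                                        # 04-05
print(has_birthday_today(key, date(2025, 4, 5)))  # True
```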
What about compound data fields which, together, constitute PII?
It is widely acknowledged that while individual fields may not constitute PII, if you take enough of them together, you can zero in on specific people. A Zip code may have 100,000 people in it. In that Zip code, there may be 1000 people who are within 60 days of their 65th birthday. Of those 1000 people, there may be 500 who live in single family homes. Of those 500, there may be 100 who also have 2 children. Of those 100, there may be 10 who drive BMWs. Of those 10, there may be 5 that are of the model year 2009. Of those 5, there may be 1 that is green. By this logic, every single one of those fields might be considered PII, while also being valid criteria to qualify or disqualify people for actions.
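The narrowing effect can be reproduced on a toy data set. The following Python sketch counts how many records still match a target person as each additional field is applied; the records, field names, and the `matches_as_criteria_accumulate` function are all invented for illustration.

```python
def matches_as_criteria_accumulate(records, target, fields):
    """For each prefix of `fields`, count the records that agree with
    `target` on every field in that prefix."""
    counts = []
    candidates = list(records)
    for field in fields:
        candidates = [r for r in candidates if r[field] == target[field]]
        counts.append(len(candidates))
    return counts


records = [
    {"zip": "12345", "home": "single", "car": "BMW", "color": "green"},
    {"zip": "12345", "home": "single", "car": "BMW", "color": "black"},
    {"zip": "12345", "home": "single", "car": "Ford", "color": "green"},
    {"zip": "12345", "home": "condo", "car": "BMW", "color": "green"},
    {"zip": "54321", "home": "single", "car": "BMW", "color": "green"},
]

target = records[0]
print(matches_as_criteria_accumulate(records, target, ["zip", "home", "car", "color"]))
# [4, 3, 2, 1]
```

Each field on its own leaves several candidates, but the combination isolates a single record, which is the triangulation risk described above.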
In this situation, we get into a discussion of risk mitigation. As most people in information security acknowledge, one sure way to design a 100% risk free data strategy is to restrict all access to data. That, unfortunately, is not practical for most businesses as it restricts usage of its most valuable resource – its data. So, a balance must be struck and the residual risk must be neutralized through security best practices.
A well accepted best practice of information security is to have multiple layers of security.
A practical step toward protecting the most critical data while still enabling the business to make use of its data is to use database-level encryption for all data at rest. This ensures that if a malicious attacker were able to breach the AWS single-tenant security domain, and was then also able to connect to the database or gain access to the underlying data storage medium, they would not be able to export a dump of all data on which to run analysis to triangulate data sets of non-PII data back to individual people. Pega configures this by default in Pega Cloud deployments.
Even if an attacker were able to breach the AWS single-tenant security domain and establish a presence on a host that had access to enable promiscuous Ethernet sniffing on the subnet, they would not be able to gain access to any customer data because all data in motion is encrypted with HTTPS.
A further level of security would be to create derived fields that obfuscate fields enough that they thwart triangulation, while still allowing enough specificity for marketing purposes. Earlier, we provided an example of IsWithin60DaysOf65thBirthday, to obfuscate date of birth. Another such example would be to take only the first three digits of a US Zip code, as that is usually representative of a geographical region (such as a metropolitan area), while adding considerable ambiguity for triangulation purposes. Zip code, as well as the full address, would remain encrypted, while First3DigitsofZipCode would be unencrypted.
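A minimal sketch of deriving First3DigitsofZipCode during data load. The function name and the validation rule (at least five digits, so that ZIP+4 values are accepted) are assumptions.

```python
def first_3_digits_of_zip(zip_code: str) -> str:
    """Return the first three digits of a US Zip code as a string,
    preserving leading zeros."""
    digits = "".join(ch for ch in zip_code if ch.isdigit())
    if len(digits) < 5:
        raise ValueError("expected a 5-digit ZIP or ZIP+4 code")
    return digits[:3]


print(first_3_digits_of_zip("02142"))       # 021
print(first_3_digits_of_zip("90210-1234"))  # 902
```

The value is kept as a string rather than a number so that prefixes with a leading zero are not altered.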
Finally, truly individually identifiable fields such as social security number and first and last name, should indeed be encrypted using Pega encryption – and not used for segmentation or engagement policy filtering.
How does PII get unencrypted for last mile delivery?
In any outbound marketing technology system, we face the issue of how far outward we can extend our data protection. As an extreme example, even if we send a personalized e-mail and control the entire SMTP delivery path all the way to the recipient's inbox, there is still the risk that another party reads the mail in that inbox. This is why in the most secure situations, such as bank notices, customers must log in to a special portal using their authenticated credentials to read the notice. Unfortunately, that is inconvenient for the recipient, so unless the recipient is highly motivated to retrieve the message, the message will likely go unread. This approach is not feasible for marketing offer e-mails, although it is feasible for transactional e-mails. Pega does not have any out-of-the-box connectors to such delivery systems, though one could be built bespoke for a project. We are left with the fact that for the most typical consumer-friendly channels, such as normal e-mail and push notifications (which must trust third-party servers not to expose the messages), we cannot expect full end-to-end encryption all the way to the consumer.
Because we must accept that normal marketing content sent to customers could be intercepted by unintended recipients, it is best practice either not to expose any personalized fields that could be construed as PII, or to obfuscate them. A good example of this is to XXX out all but the last 3 digits of a Social Security number. Similar to the approach above for IsWithin60DaysOf65thBirthday, an unencrypted field of ObfuscatedSSN can be populated during data load time and passed along to delivery engines for personalization purposes.
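Populating ObfuscatedSSN at data load time could be done along the lines of this minimal Python sketch. The `obfuscate_ssn` function is a hypothetical helper, and the sample number is a placeholder, not real data.

```python
def obfuscate_ssn(ssn: str, visible: int = 3) -> str:
    """Replace every digit except the last `visible` digits with X,
    keeping separators such as hyphens in place."""
    total_digits = sum(ch.isdigit() for ch in ssn)
    to_mask = max(total_digits - visible, 0)
    out = []
    for ch in ssn:
        if ch.isdigit() and to_mask > 0:
            out.append("X")
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)


print(obfuscate_ssn("123-45-6789"))  # XXX-XX-X789
```

Only the masked value would be stored unencrypted and handed to delivery engines; the full number stays in the more secure definitive source.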
There is judgement to be applied here. As a consumer, I may appreciate an email telling me my account is about to be terminated on January 14th, 2021 if I do not pay my renewal. However, if January 14th, 2021 is a special date that is unique to the individual (for example, if renewals always happen on one's birthday), then it would be better to personalize the email by saying "Your policy will expire within the next 45 days", and not pass along the specific date to a delivery engine. In this scenario, like the others, there must be a point in the ETL process where data is accessed from a more secure definitive source, decrypted, obfuscated to a less specific form, and then stored for marketing purposes.
In summary, Pega neither advises nor supports storing or using unique personally identifiable data for marketing or targeting purposes. This paper covers practices for maintaining a high level of protection for consumer data while still enabling the business to perform marketing functions using that data. Striking this balance requires careful consideration as to what truly is uniquely personally identifiable versus what, when taken in conjunction with other data, could be used to triangulate a consumer. Pega supports and advocates making use of multiple levels of security to treat all data securely while treating truly special data with the respect our consumers' data deserves.