Cassandra overview
Apache Cassandra is the primary example of a backing technology that underpins the Decision Data Store data set. The following sections provide an overview of the most important Cassandra features in terms of scalability, data distribution, consistency, and architecture.
Apache Cassandra
Cassandra handles the database operations for Pega Platform decision management by providing fast access to the data that is essential in making next-best-action decisions in both batch and real time.
- Distributed and decentralized
- Cassandra is a distributed system, which means that it is capable of running on multiple machines while appearing to users as a unified whole. Every node in a Cassandra cluster is identical. No single node performs organizational operations that are distinct from any other node. Instead, Cassandra features a peer-to-peer protocol and uses gossip to synchronize and maintain a list of nodes that are alive or dead.
- Elastically scalable
- Responsibility for data storage and processing is shared across many machines in the Cassandra cluster, which reduces the reliance on any single machine. Instead of hosting all data on a single server or replicating all of the data on every server in the cluster, Cassandra partitions the data horizontally and hosts the portions separately.
- Consistent
- Cassandra is an eventually consistent storage system, a design that helps it meet production requirements for performance, reliability, scalability, and high availability. Eventual consistency means that all updates eventually reach all replicas. Divergent versions of the same data may exist temporarily, but they are eventually reconciled to a consistent state. Eventual consistency is a tradeoff that is made to achieve high availability, and it involves some read and write latency.
- The replication factor is the number of nodes in the cluster to which each update (an add, update, or delete operation) is propagated. Replication makes the database more resilient to machine failures, and it can also improve performance by distributing the data across multiple machines.
- The consistency level controls how many replicas in the cluster must acknowledge a write operation, or respond to a read operation, in order to be successful.
- For example, you can set the consistency level to a number equal to the replication factor to gain stronger consistency at the cost of synchronous blocking operations, which wait for all nodes to be updated in order to declare success.
- Row and column-oriented
- In Cassandra, rows do not need to have the same number of columns. Instead, column families arrange columns into tables and are controlled by keyspaces. A keyspace is a logical namespace that holds the column families, as well as certain configuration properties.
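The relationship between the replication factor and the consistency level described above can be illustrated with a short calculation. The following Python sketch is not part of Pega Platform or Cassandra, and the function names are hypothetical. It assumes only the standard quorum arithmetic: a quorum is a majority of replicas, and a read is guaranteed to overlap the latest acknowledged write when the number of replicas that acknowledge the write plus the number that answer the read exceeds the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Return the smallest majority of replicas: floor(RF / 2) + 1."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def reads_overlap_writes(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """Return True when every read replica set must intersect every write
    replica set, that is, when write_acks + read_acks > replication_factor."""
    for acks in (write_acks, read_acks):
        if not 1 <= acks <= replication_factor:
            raise ValueError("acknowledgements must be between 1 and the replication factor")
    return write_acks + read_acks > replication_factor


if __name__ == "__main__":
    rf = 3  # hypothetical replication factor, chosen for illustration
    q = quorum(rf)
    print("quorum size:", q)
    print("quorum write + quorum read overlap:", reads_overlap_writes(rf, q, q))
    print("single-replica write + read overlap:", reads_overlap_writes(rf, 1, 1))
    print("all-replica write + single read overlap:", reads_overlap_writes(rf, rf, 1))
```

With a replication factor of 3, writing and reading at a quorum of 2 replicas gives overlapping replica sets while still tolerating one unavailable node. Requiring all 3 replicas for a write corresponds to the blocking behavior described above, where the operation waits for every replica before it is declared successful.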
Decision Data Store
The Decision Data Store is the repository for analytical data from a variety of sources and is deployed as part of the Pega Platform node cluster. The Decision Data Store consists of nodes that connect to an external Cassandra cluster using one-to-many relationships, as shown in the following figure:
Each node that comprises the Decision Data Store handles data in JSON format for each customer, from different sources. The data is distributed and replicated around the cluster, and is stored in the node file system.
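To make the idea of distributing and replicating data around a cluster more concrete, the following Python sketch places keys on a toy hash ring. It is an illustration only and makes several assumptions: the class `ToyRing`, the node names, and the key `customer-42` are hypothetical; each node owns a single position on the ring; and MD5 is used purely as a convenient hash. It is not the partitioner or replication strategy that Cassandra or Pega Platform actually uses.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is used only for illustration)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class ToyRing:
    """A toy hash ring: each node owns one token, and a key is stored on the
    first node at or after the key's token plus the next nodes clockwise."""

    def __init__(self, nodes, replication_factor):
        unique_nodes = sorted(set(nodes))
        if not 1 <= replication_factor <= len(unique_nodes):
            raise ValueError("replication factor must be between 1 and the node count")
        self.replication_factor = replication_factor
        # Sort nodes by their ring position so that lookups can use bisection.
        self._ring = sorted((token(node), node) for node in unique_nodes)
        self._tokens = [position for position, _ in self._ring]

    def replicas(self, partition_key):
        """Return the nodes that hold copies of the given key."""
        start = bisect.bisect_left(self._tokens, token(partition_key)) % len(self._ring)
        return [
            self._ring[(start + offset) % len(self._ring)][1]
            for offset in range(self.replication_factor)
        ]


if __name__ == "__main__":
    ring = ToyRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
    print(ring.replicas("customer-42"))
```

In this sketch, no node holds all of the data, yet every key has several copies, so the loss of one node does not make a key unavailable. The same key always maps to the same replicas, which is what allows any node to locate the data without a central coordinator.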
In earlier Pega Platform versions, it was possible to configure a Decision Data Store node cluster that used a Cassandra database in embedded mode, as shown in the following figure:
This type of configuration is now deprecated and not used for new deployments. However, it is still supported for systems that have been updated from earlier versions of Pega Platform to the current version.
Supported configurations
Pega only supports Cassandra distributions that are based on genuine Apache Cassandra. Note that while many distributions state that they are Cassandra compatible, there may be some caveats that cause incompatibility with Pega Platform. Pega only supports Instaclustr and DataStax Enterprise (DSE). You can find examples of such distributions on the Apache Cassandra website.
Deployment
In Pega Cloud environments, the Cassandra database is preconfigured. No action is required.
In on-premises and client-managed cloud environments, you need to install and operate your own Cassandra cluster, and connect it to Pega Platform. For more information, see Connecting to an external Cassandra database through the Decision Data Store service.