Recommendations for choosing the partitioning key
A partitioning key is a property that you define in a Database Table data set to distribute read operations into partitions across the nodes. Choose a property that has the appropriate number of distinct values to ensure balanced distribution and best performance.
When choosing a property to use as the partition key in a Database Table data set, consider the following configuration aspects:
Distinct values
To ensure a balanced distribution, select a property that is suitable for partitioning. It should be a property with not too many and not too few distinct values. For example, if the table contains customer information, country information is a suitable property for partitioning because it contains enough shared distinct values, but email addresses are not because they typically have as many distinct values as customer entries.
System resources
Another consideration when choosing the partitioning key is the number of nodes assigned in the data flow service and the number of threads defined in the batch data flow run that will use the data set as the source. In a batch run, one thread can process one partition at a time. When a partition is fully processed, the next available partition is picked up. One thread might finish processing a partition faster than the other threads. Therefore, to ensure that threads do not remain idle, the optimal number of partitions should be the number of nodes in a given data flow service instance multiplied by the number of threads defined in the batch data flow run, and multiplied by two:
number of partitions = number of nodes × number of threads × 2
Previous topic Creating a Database Table data set that uses the relational database Next topic Connecting to an external Cassandra database through a Database Table data set