Skip to main content

         This documentation site is for previous versions. Visit our new documentation site for current releases.      

Recommendations for choosing the partitioning key

Updated on July 5, 2022

A partitioning key is a property that you define in a Database Table data set to distribute read operations into partitions across the nodes. Choose a property that has the appropriate number of distinct values to ensure balanced distribution and best performance.

When choosing a property to use as the partition key in a Database Table data set, consider the following configuration aspects:

Distinct values

To ensure a balanced distribution, select a property that is suitable for partitioning. It should be a property with not too many and not too few distinct values. For example, if the table contains customer information, country information is a suitable property for partitioning because it contains enough shared distinct values, but email addresses are not because they typically have as many distinct values as customer entries.

System resources

Another consideration when choosing the partitioning key is the number of nodes assigned in the data flow service and the number of threads defined in the batch data flow run that will use the data set as the source. In a batch run, one thread can process one partition at a time. When a partition is fully processed, the next available partition is picked up. One thread might finish processing a partition faster than the other threads. Therefore, to ensure that threads do not remain idle, the optimal number of partitions should be the number of nodes in a given data flow service instance multiplied by the number of threads defined in the batch data flow run, and multiplied by two:

number of partitions = number of nodes × number of threads × 2

For example:

In the following example, one node is assigned to the batch data flow service:

Nodes in the batch data flow service
The batch data flow service landing page. One node is assigned.

Five threads are defined in the data flow run:

Threads in a data flow run
The batch data flow run configuration form. The number of threads is five.

The optimal number of partitions in this configuration is 1 × 5 × 2 = 10.

  • Previous topic Creating a Database Table data set that uses the relational database
  • Next topic Connecting to an external Cassandra database through a Database Table data set

Have a question? Get answers now.

Visit the Support Center to ask questions, engage in discussions, share ideas, and help others.

Did you find this content helpful?

Want to help us improve this content?

We'd prefer it if you saw us at our best. is not optimized for Internet Explorer. For the optimal experience, please use:

Close Deprecation Notice
Contact us