

This documentation site is for previous versions. Visit our new documentation site for current releases.
 

Constructing a sample

Updated on May 17, 2024

A sample is a subset of historical data that you extract by applying a selection or sampling method to the data source. Constructing a sample lets you create the development, validation, and test data sets that you use for analysis and modeling.

  1. In the Data preparation step, in the Sample construction workspace, from the Select the weight field if present drop-down list, select an available weight field.
    Typically, a weight field is available when the data was already sampled before you used it in the Prediction Studio portal. If you do not specify a weight field, each case counts as one.
  2. In the Select the fields to sample grid, specify the fields you want to include in the sample:
    1. In the Type column, select a field type from the drop-down list.
      Select the Not used type for fields that you want to exclude from the sample.
    2. Optional: In the Description column, enter a field definition.
    3. Optional: In the User defined field, type a new name for a field.
  3. Select a sampling method:
    • To sample a simple proportion of cases, select the Uniform sampling option.
      This method fills the sample table with a random selection of records from the source. The probability of selection is set to achieve the specified percentage or number of cases.
    • To sample a different proportion of each value of the selected field (stratum) that represents the behavior to be predicted, perform the following actions:
      1. Select the Stratified sampling option.
      2. From the Stratum field drop-down list, select the field that you want to sample.
      3. In the table with stratum values, in the Ratio column, set the proportion of population cases to source records.
      4. In the Sample percentage column, enter the percentage of records that you want to sample.
        Note: A population is a group of cases with known behavior that is consistent with the group of cases whose behavior you want to predict. You use the population to extract data samples for modeling and validation.
      This method fills the sample table with a random selection of records from each stratum value (class).

  4. In the Hold-out sets section, define the sample percentage that you want to use for development, validation, and testing:
    • To divide cases among the sets, select the Setting percentages for each set option.
    • To divide cases based on the values that are available for a field, select the User defined field option.
  5. Optional: Select a field from the data source so that all records with the same value for that field are assigned to the same hold-out set.
    For example, you can place family members from the same household in one hold-out set. Family members might have similar profiles, so splitting them across hold-out sets can hide overfitting when you validate a model.
    Note: The hold-out set that each group of records goes to is selected at random.
  6. Confirm the sample construction by clicking Next.
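
The two sampling methods in step 3 can be illustrated outside the product with a short Python sketch. It is not Prediction Studio code: the function names, the record layout, and the `churned` field are hypothetical, and the Ratio column and weight field are not modeled. It assumes that each record is kept independently with a probability equal to the requested percentage.

```python
import random


def uniform_sample(records, percentage, seed=None):
    """Keep each record independently with probability percentage / 100."""
    rng = random.Random(seed)
    probability = percentage / 100.0
    return [record for record in records if rng.random() < probability]


def stratified_sample(records, stratum_field, percentages, seed=None):
    """Sample each stratum value at its own percentage.

    `percentages` maps a stratum value to a sample percentage (0-100).
    Stratum values that are missing from the mapping are not sampled.
    """
    rng = random.Random(seed)
    sample = []
    for record in records:
        probability = percentages.get(record[stratum_field], 0) / 100.0
        if rng.random() < probability:
            sample.append(record)
    return sample


# Hypothetical source data: 1 in 10 cases shows the behavior to be predicted.
records = [
    {"id": i, "churned": "yes" if i % 10 == 0 else "no"} for i in range(1000)
]

# Roughly 30% of all records, regardless of behavior.
simple = uniform_sample(records, 30, seed=7)

# Every rare "yes" case, but only about 20% of the common "no" cases.
balanced = stratified_sample(records, "churned", {"yes": 100, "no": 20}, seed=7)
```

Stratified sampling is typically useful when the behavior to be predicted is rare, because it lets you keep most of the rare cases while thinning out the common ones.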

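The hold-out assignment in steps 4 and 5 can be sketched in the same way. This is a minimal illustration under stated assumptions, not the product's implementation: `assign_hold_out_sets`, the `household` field, and the percentages are hypothetical. Each group of records that shares a value for the grouping field draws one hold-out set at random, weighted by the set percentages, and every record in the group inherits it.

```python
import random


def assign_hold_out_sets(records, percentages, group_field=None, seed=None):
    """Return one hold-out set name per record, in record order.

    `percentages` maps a set name to its share, for example
    {"development": 60, "validation": 20, "test": 20}.
    If `group_field` is given, records with the same value for that field
    always receive the same set; otherwise each record is assigned on its own.
    """
    rng = random.Random(seed)
    names = list(percentages)
    weights = [percentages[name] for name in names]
    chosen = {}  # group key -> hold-out set name
    assignments = []
    for index, record in enumerate(records):
        # Without a grouping field, every record forms its own group.
        key = record[group_field] if group_field is not None else index
        if key not in chosen:
            chosen[key] = rng.choices(names, weights=weights, k=1)[0]
        assignments.append(chosen[key])
    return assignments


# Hypothetical data: three customers per household.
customers = [{"id": i, "household": i // 3} for i in range(30)]
shares = {"development": 60, "validation": 20, "test": 20}
sets = assign_hold_out_sets(customers, shares, group_field="household", seed=3)
```

Because the draw happens once per group rather than once per record, the realized set sizes follow the percentages only approximately, and less closely when groups are large or few.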