Constructing a sample
A sample is a subset of historical data that you extract by applying a selection or sampling method to the data source. Constructing a sample helps you build the development, validation, and test data sets for analysis and modeling.
- From the weight field drop-down list, click an available weight field.
Typically, a weight field is available when you sample the data before using it in the Prediction Studio portal. If you do not specify the field, each case counts as one.
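To illustrate what the weight field changes, the following minimal Python sketch counts cases with and without weights. The record layout, the `weight` key, and the `case_count` helper are hypothetical and are not part of Prediction Studio.

```python
def case_count(records, weight_field=None):
    """Return the number of cases that the records represent.

    When no weight field is given, each record counts as one case.
    """
    if weight_field is None:
        return len(records)
    return sum(record[weight_field] for record in records)


# Hypothetical pre-sampled records: a weight of 10 means that the record
# stands for ten cases in the data before sampling.
records = [
    {"id": 1, "weight": 10},
    {"id": 2, "weight": 10},
    {"id": 3, "weight": 1},
]

unweighted = case_count(records)          # each record counts as one
weighted = case_count(records, "weight")  # weights are summed
```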
- In the Select the fields to sample grid, specify the fields you want to include in the sample:
- In the Type column, select a field type from the drop-down list. Select the Not used type for fields that you want to exclude from the sample.
- Optional: In the Description column, enter a field definition.
- Optional: In the User defined field, type a new name for a field.
- Select a sampling method:
If you want to sample a simple proportion of cases, select the Uniform sampling option.
This method fills the sample table with a random selection of records from the source. The probability of selection is set to achieve the specified percentage or number of cases.
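As a rough illustration of uniform sampling, the Python sketch below shows two common ways to draw a simple random sample: by percentage and by exact number of cases. It is an assumption about the general technique, not a description of how Prediction Studio implements it, and the function names are hypothetical.

```python
import random


def uniform_sample(records, percentage, seed=None):
    """Keep each record independently with probability percentage / 100."""
    rng = random.Random(seed)
    probability = percentage / 100.0
    return [record for record in records if rng.random() < probability]


def uniform_sample_exact(records, count, seed=None):
    """Draw exactly `count` records without replacement."""
    rng = random.Random(seed)
    return rng.sample(records, min(count, len(records)))


source = list(range(1000))
by_percentage = uniform_sample(source, 20, seed=42)    # about 20% of records
by_count = uniform_sample_exact(source, 200, seed=42)  # exactly 200 records
```

With the percentage variant, the sample size varies slightly from run to run because each record is selected independently.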
If you want to sample a different proportion of each value for the selected field (stratum) that represents the behavior to be predicted, perform the following actions:
- Select the Stratified sampling option.
- From the Stratum field drop-down list, select the field you want to sample.
- In the table with stratum values, in the Ratio column, set the proportion of population cases to source records.
- For each stratum value, enter the percentage of records that you want to sample.
Note: A population is a group of cases with known behavior that is consistent with the group of cases whose behavior you want to predict. You use the population to extract data samples for modeling and validation.
This method fills the sample table with random selections of each class.
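The idea behind stratified sampling can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the `outcome` field, the percentages, and the `stratified_sample` helper are hypothetical, and the Ratio setting is not modeled.

```python
import random
from collections import defaultdict


def stratified_sample(records, stratum_field, percentages, seed=None):
    """Sample a different percentage of records for each stratum value.

    `percentages` maps a stratum value to the percentage of its records to
    keep. Stratum values that are missing from the mapping are not sampled.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for record in records:
        by_stratum[record[stratum_field]].append(record)

    sample = []
    for value, group in by_stratum.items():
        keep = round(len(group) * percentages.get(value, 0) / 100.0)
        sample.extend(rng.sample(group, keep))
    return sample


# Hypothetical source: 100 "yes" outcomes and 900 "no" outcomes.
source = [{"id": i, "outcome": "yes" if i < 100 else "no"} for i in range(1000)]

# Keep every rare "yes" case but only 10% of the common "no" cases.
sample = stratified_sample(source, "outcome", {"yes": 100, "no": 10}, seed=7)
yes_count = sum(1 for r in sample if r["outcome"] == "yes")
no_count = sum(1 for r in sample if r["outcome"] == "no")
```

Sampling the rare value at a higher rate than the common one is a typical reason to choose this method over uniform sampling.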
- Define the sample percentage that you want to use for development, validation, and testing:
- To divide cases among the sets, select the Setting percentages for each set option.
- To divide cases that are available for the field, select the User defined field option.
- Optional: Select a field from the data source to assign the records with the same value to one hold-out set.
For example, you can place family members from the same household into one hold-out set. Family members might have similar profiles, which can cause overfitting in the validation of data if they are not in one hold-out set.
Note: The type of hold-out set is selected at random.
- Confirm the sample construction by clicking Next.
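The hold-out set options described in the steps above can be sketched in Python as follows. This is a minimal sketch, assuming that each group of records is assigned to a set at random according to the set percentages. The `household` field, the 60/20/20 split, and the `assign_hold_out_sets` helper are hypothetical.

```python
import random


def assign_hold_out_sets(records, percentages, group_field=None, seed=None):
    """Assign each record to a set chosen at random using the percentages.

    When `group_field` is given, records that share a value for that field
    always land in the same set.
    """
    rng = random.Random(seed)
    names = list(percentages)
    weights = [percentages[name] for name in names]
    chosen = {}  # group key -> name of the set chosen for that group
    sets = {name: [] for name in names}

    for index, record in enumerate(records):
        # Without a grouping field, every record forms its own group.
        key = record[group_field] if group_field else index
        if key not in chosen:
            chosen[key] = rng.choices(names, weights=weights, k=1)[0]
        sets[chosen[key]].append(record)
    return sets


# Hypothetical customers, four per household.
customers = [{"id": i, "household": i // 4} for i in range(400)]
split = {"development": 60, "validation": 20, "test": 20}
sets = assign_hold_out_sets(customers, split, group_field="household", seed=3)
```

Because whole households are assigned together, the resulting set sizes only approximate the requested percentages.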