Constructing a sample
A sample is a subset of historical data that you extract by applying a selection or sampling method to the data source. Sample construction produces the development, validation, and test data sets that you use for analysis and modeling.
- In the Data preparation step, in the Sample construction workspace, from the Select the weight field if present drop-down list, click an available weight field.
Typically, a weight field is available when you sample the data before using it in the Prediction Studio portal. If you do not specify the field, each case counts as one.
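As a rough illustration of what a weight field means, the following minimal Python sketch counts cases with and without weights. It is not how Prediction Studio computes counts internally; the function `effective_case_count`, the `weight` field name, and the sample records are hypothetical.

```python
def effective_case_count(records, weight_field=None):
    """Sum case weights; with no weight field, each record counts as one case."""
    if weight_field is None:
        return float(len(records))
    return float(sum(r[weight_field] for r in records))


# Hypothetical pre-sampled data: record 1 stands in for 10 original cases.
records = [
    {"id": 1, "weight": 10},
    {"id": 2, "weight": 1},
    {"id": 3, "weight": 1},
]

unweighted = effective_case_count(records)          # each case counts as one
weighted = effective_case_count(records, "weight")  # weights are summed
```

In this sketch, three records represent twelve cases once the weights are applied.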
- In the Select the fields to sample grid, specify the fields you want to include in the sample:
- In the Type column, select a field type from the drop-down list.
Select the Not used type for fields that you want to exclude from the sample.
- Optional: In the Description column, enter a field definition.
- Optional: In the User defined field, type a new name for a field.
- Select a sampling method:
If you want to sample a simple proportion of cases, select the Uniform sampling option.
This method fills the sample table with a random selection of records from the source. The probability of selection is set to achieve the specified percentage or number of cases.
If you want to sample a different proportion of each value for the selected field (stratum) that represents the behavior to be predicted, perform the following actions:
- Select the Stratified sampling option.
- From the Stratum field drop-down list, select the field you want to sample.
- In the table with stratum values, in the Ratio column, set the proportion of population cases to source records.
- In the Sample percentage column, enter the percentage of records that you want to sample.
Note: A population is a group of cases with known behavior that is consistent with the group of cases whose behavior you want to predict. You use the population to extract data samples for modeling and validation.
This method fills the sample table with random selections of each class.
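The difference between the two sampling methods can be sketched in plain Python. This is only an illustration of the general techniques, not the Prediction Studio implementation: the function names, the `churned` stratum field, and the per-value percentages are assumptions, and the Ratio column is not modeled.

```python
import random
from collections import defaultdict


def uniform_sample(records, percentage, rng):
    """Keep each record independently with probability percentage / 100."""
    p = percentage / 100.0
    return [r for r in records if rng.random() < p]


def stratified_sample(records, stratum_field, percentages, rng):
    """Sample each stratum value at its own percentage.

    `percentages` is a hypothetical mapping of stratum value -> percent to keep.
    """
    by_value = defaultdict(list)
    for r in records:
        by_value[r[stratum_field]].append(r)
    sample = []
    for value, group in by_value.items():
        k = round(len(group) * percentages.get(value, 0) / 100.0)
        sample.extend(rng.sample(group, k))  # random selection without replacement
    return sample


rng = random.Random(42)  # fixed seed so the sketch is repeatable
# Hypothetical source: 1000 records, 10% of which show the rare behavior.
records = [{"id": i, "churned": "yes" if i % 10 == 0 else "no"} for i in range(1000)]

uniform = uniform_sample(records, 20, rng)
stratified = stratified_sample(records, "churned", {"yes": 100, "no": 10}, rng)
```

Here the stratified sample keeps every rare "yes" record and only a tenth of the "no" records, which is a common reason to prefer stratified over uniform sampling when the behavior to predict is rare.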
- In the Hold-out sets section, define the sample percentage that you want to use for development, validation, and testing:
- To divide cases among the sets, select the Setting percentages for each set option.
- To divide cases that are available for the field, select the User defined field option.
- Optional: Select a field from the data source to assign the records with the same value to one hold-out set.
For example: You can place family members from the same household into one hold-out set. Family members might have similar profiles, which can cause overfitting to go undetected during validation if they are not in one hold-out set.
Note: The type of hold-out set is selected at random.
- Confirm the sample construction by clicking Next.
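The idea of keeping records with the same field value in one hold-out set can be sketched as follows. This is a minimal illustration under stated assumptions, not the product's algorithm: the function `assign_hold_out_sets`, the `household` field, and the 60/20/20 split are hypothetical.

```python
import random


def assign_hold_out_sets(records, group_field, percentages, rng):
    """Assign every record that shares a group value to the same randomly chosen set."""
    names = list(percentages)
    weights = [percentages[n] for n in names]
    set_for_group = {}
    assignment = {}
    for r in records:
        group = r[group_field]
        if group not in set_for_group:
            # The set type is picked at random, weighted by the set percentages.
            set_for_group[group] = rng.choices(names, weights=weights, k=1)[0]
        assignment[r["id"]] = set_for_group[group]
    return assignment


rng = random.Random(7)  # fixed seed so the sketch is repeatable
# Hypothetical data: 30 people in 10 households of 3.
records = [{"id": i, "household": i // 3} for i in range(30)]

assignment = assign_hold_out_sets(
    records,
    "household",
    {"development": 60, "validation": 20, "test": 20},
    rng,
)
```

Because the random draw happens once per household rather than once per record, no household is split between development and validation data. The resulting set sizes only approximate the requested percentages.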