When evaluating machine learning algorithms through resampling, each train/test partition should be a representative subset of the whole data set. This post covers three ways to achieve such reliable resampling procedures: stratified resampling, block resampling, and resampling with predefined folds or custom indices.
library(mlr3)
library(mlbench)
library(data.table)
In classification tasks, the target class distribution should be similar in each train/test split, which is achieved by stratification. This is particularly useful in the case of imbalanced classes and small data sets.
Stratification can also be performed with respect to explanatory categorical variables to ensure that all subgroups are represented in all training and test sets.
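To make the idea concrete, here is a minimal base R sketch of stratified fold assignment, independent of mlr3. The toy labels and the helper name stratified_folds are made up for illustration; the point is only that folds are assigned within each class, so every fold inherits roughly the overall class ratio.

```r
# Hypothetical helper (not part of mlr3): assign k folds separately within each stratum.
stratified_folds = function(strata, k) {
  folds = integer(length(strata))
  for (idx in split(seq_along(strata), strata)) {
    # shuffle the rows of this stratum, then deal them out to folds 1..k in turn
    shuffled = idx[sample.int(length(idx))]
    folds[shuffled] = rep_len(seq_len(k), length(shuffled))
  }
  folds
}

set.seed(1)
# toy imbalanced target: 70 "good" and 30 "bad" observations
y = factor(rep(c("good", "bad"), times = c(70, 30)))
folds = stratified_folds(y, k = 3)

# class counts per fold; each fold holds close to 30% "bad"
fold_table = table(fold = folds, class = y)
fold_table
```

Without the per-class split, a plain random assignment could by chance put very few "bad" observations into one fold, which is exactly what stratification prevents.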
In mlr3, each task has a slot col_roles. This slot shows the general roles that columns play throughout the different stages of the machine learning process. At a minimum, the col_roles slot shows which variables will be used as features and which as the target. However, the col_roles slot can be more diverse, and some variables might even serve multiple roles. We can specify the variable used for stratification in task$col_roles$stratum. This will be illustrated in the following example using the german_credit data:
gc_tsk = tsk("german_credit")
gc_tsk$col_roles
$feature
[1] "age" "amount"
[3] "credit_history" "duration"
[5] "employment_duration" "foreign_worker"
[7] "housing" "installment_rate"
[9] "job" "number_credits"
[11] "other_debtors" "other_installment_plans"
[13] "people_liable" "personal_status_sex"
[15] "present_residence" "property"
[17] "purpose" "savings"
[19] "status" "telephone"
$target
[1] "credit_risk"
$name
character(0)
$order
character(0)
$stratum
character(0)
$group
character(0)
$weight
character(0)
$uri
character(0)
We use the target feature called credit_risk to specify stratification with respect to the target variable:
gc_tsk$col_roles$stratum = "credit_risk"
# alternatively gc_tsk$col_roles$stratum = gc_tsk$col_roles$target
After the specification of task$col_roles$stratum, the active binding task$strata will show the number of observations in each group and the corresponding row ids:
gc_tsk$strata
N row_id
1: 700 1,3,4,6,7,8,...
2: 300 2, 5,10,11,12,14,...
Specify 3-fold cross validation and instantiate the resampling on the task:
set.seed(123)
cv3 = rsmp("cv", folds = 3)
cv3$instantiate(gc_tsk)
cv3$instance
row_id fold
1: 1 1
2: 3 1
3: 6 1
4: 8 1
5: 13 1
---
996: 952 3
997: 953 3
998: 974 3
999: 980 3
1000: 981 3
Check if the target class distribution is similar in each fold:
dt = merge(cv3$instance, gc_tsk$data()[, row_id := .I], by = "row_id")
dt[, .(class_ratio = sum(credit_risk == "bad") /
sum(credit_risk == "good")), by = fold]
fold class_ratio
1: 1 0.4273504
2: 3 0.4291845
3: 2 0.4291845
And compare it with the target class distribution from the whole data set:
dt[, .(class_ratio = sum(credit_risk == "bad") / sum(credit_risk == "good"))]
class_ratio
1: 0.4285714
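The same check can be done without merging tables, by asking the resampling for the row ids of each test set. This is a self-contained sketch that assumes mlr3 is installed; the object names task, cv, and ratios are chosen here for illustration only.

```r
library(mlr3)

task = tsk("german_credit")
# stratify on the target, as in the example above
task$col_roles$stratum = "credit_risk"

set.seed(123)
cv = rsmp("cv", folds = 3)
cv$instantiate(task)

# bad/good ratio of the target within each test set
ratios = vapply(seq_len(cv$iters), function(i) {
  y = task$truth(rows = cv$test_set(i))
  sum(y == "bad") / sum(y == "good")
}, numeric(1))
ratios
```

Each value should lie close to the ratio of the whole data set, 300 / 700.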
Note that the variable used for stratification does not necessarily have to be the target class. In fact, multiple categorical features can be used for stratification to maintain their frequency distribution in each fold:
gc_tsk$col_roles$stratum = c("housing", "telephone")
gc_tsk$strata
N row_id
1: 280 1,13,20,21,26,30,...
2: 433 2, 3, 7, 9,10,14,...
3: 47 4, 5, 45, 76,134,192,...
4: 61 6,19,37,55,63,69,...
5: 63 8, 48, 60, 72, 96,100,...
6: 116 11,12,15,22,23,28,...
To check whether stratification based on multiple categorical features works, we need to instantiate the CV folds again, as we changed the features used for stratification:
set.seed(123)
cv3$instantiate(gc_tsk)
cv3$instance
row_id fold
1: 21 1
2: 26 1
3: 30 1
4: 39 1
5: 42 1
---
996: 747 3
997: 860 3
998: 866 3
999: 953 3
1000: 973 3
Again, we check the relative frequency of observations in each group (combination of housing and telephone) across all folds:
dt = merge(cv3$instance, gc_tsk$data()[, row_id := .I], by = "row_id")
dt = dt[, .(freq = .N), by = list(fold, housing, telephone)]
dt = dcast(dt, housing + telephone ~ fold, value.var = "freq")
dt[, c(3:5) := lapply(.SD, function(x) x / sum(x)), .SDcols = 3:5]
dt
housing telephone 1 2 3
1: for free no 0.11607143 0.11711712 0.11480363
2: for free yes (under customer name) 0.06250000 0.06306306 0.06344411
3: rent no 0.43154762 0.43243243 0.43504532
4: rent yes (under customer name) 0.27976190 0.27927928 0.28096677
5: own no 0.04761905 0.04804805 0.04531722
6: own yes (under customer name) 0.06250000 0.06006006 0.06042296
And compare it with the relative frequency from the whole data set:
housing telephone freq
1: for free no 0.116
2: for free yes (under customer name) 0.063
3: rent no 0.433
4: rent yes (under customer name) 0.280
5: own no 0.047
6: own yes (under customer name) 0.061
It is evident that in each fold, the combinations of housing and telephone have similar frequencies that also coincide with the frequencies from the whole data set.
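Conceptually, stratifying on several categorical features means stratifying on their combinations. The base R sketch below uses made-up toy data (the vectors housing and telephone here are invented, not the german_credit columns) to show how combined strata and their whole-data relative frequencies can be computed.

```r
set.seed(42)
# toy categorical features, invented for illustration
housing = sample(c("rent", "own", "for free"), size = 200, replace = TRUE)
telephone = sample(c("no", "yes"), size = 200, replace = TRUE)

# one stratum per observed combination of the two features
stratum = interaction(housing, telephone, drop = TRUE)

# relative frequency of each combination in the whole data set:
# group count divided by the total number of rows
overall_freq = prop.table(table(stratum))
round(overall_freq, 3)
```

Because every count is divided by the same total, the relative frequencies sum to one, which is a quick sanity check for this kind of table.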
An additional concern when specifying resampling is respecting the natural grouping of the data. Blocking refers to the situation where subsets of observations belong together and must not be separated during resampling. Hence, for one train/test set pair the entire block is either in the training set or in the test set.
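As a rough illustration of what blocking means for fold assignment, the following base R sketch assigns folds to whole groups rather than to single rows. The helper grouped_folds and the toy patient vector are hypothetical and not part of mlr3.

```r
# Hypothetical helper (not part of mlr3): assign folds to whole groups, not single rows.
grouped_folds = function(group, k) {
  ids = unique(group)
  # shuffle the group ids and deal them out to folds 1..k
  fold_of_id = rep_len(seq_len(k), length(ids))[sample.int(length(ids))]
  names(fold_of_id) = as.character(ids)
  # every row inherits the fold of its group
  unname(fold_of_id[as.character(group)])
}

set.seed(1)
# toy data: 10 patients, some measured more than once
patient = c(1, 1, 2, 3, 3, 3, 4, 5, 6, 6, 7, 8, 9, 9, 10)
folds = grouped_folds(patient, k = 3)

# number of distinct folds per patient; all values are 1 when blocks are kept together
folds_per_patient = tapply(folds, patient, function(f) length(unique(f)))
folds_per_patient
```

Note that folds are balanced in the number of groups, not in the number of rows, so fold sizes can differ when group sizes differ.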
The following example is based on the BreastCancer data set from the mlbench package:
data(BreastCancer, package = "mlbench")
bc_tsk = TaskClassif$new(
id = "BreastCancer",
backend = BreastCancer,
target = "Class",
positive = "malignant")
In the BreastCancer data set, for example, several observations have the same “Id” (Sample code number) which implies these are samples taken from the same patient at different times.
# Count how many Ids occur more than once
sum(table(BreastCancer$Id) > 1)
[1] 46
There are 46 Ids with more than one observation (row).
The model trained on this data set will be used to predict the cancer status of new patients. Hence, we have to make sure that each Id occurs in exactly one fold, so that all observations with the same Id are used either for training or for evaluating the model, never both. This way, we get less biased performance estimates via k-fold cross validation. The following example illustrates block cross validation, which can be achieved by specifying a blocking factor in the task$col_roles$group slot:
# Use Id column as block factor
bc_tsk$col_roles$group = "Id"
# Remove Id from feature
# bc_tsk$col_roles$feature = setdiff(bc_tsk$col_roles$feature, "Id")
cv5 = rsmp("cv", folds = 5)
set.seed(123)
cv5$instantiate(bc_tsk)
cv5$instance
row_id fold
1: 1016277 1
2: 1044572 1
3: 1049815 1
4: 1050718 1
5: 1054590 1
---
641: 1369821 5
642: 1371026 5
643: 1371920 5
644: 714039 5
645: 841769 5
In this case, the row_id column of the cv5$instance slot refers to values of the grouping variable "Id". Additionally, the number of rows of cv5$instance is the same as the number of unique groups.
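One way to verify both statements is sketched below. It is self-contained and assumes that mlr3 and mlbench are installed; the object names task, cv, ids_known, and one_row_per_id are chosen here for illustration.

```r
library(mlr3)
data(BreastCancer, package = "mlbench")

task = TaskClassif$new(
  id = "BreastCancer",
  backend = BreastCancer,
  target = "Class",
  positive = "malignant")
# use Id as the blocking factor
task$col_roles$group = "Id"

set.seed(123)
cv = rsmp("cv", folds = 5)
cv$instantiate(task)

# every entry of the instance is one of the Ids ...
ids_known = all(cv$instance$row_id %in% BreastCancer$Id)
# ... and there is one row per unique Id
one_row_per_id = nrow(cv$instance) == length(unique(BreastCancer$Id))
c(ids_known, one_row_per_id)
```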
If the specified blocking groups are respected, each Id appears in exactly one fold. To inspect whether blocking was successful when generating the folds, we count in how many distinct folds each Id appears and print the Ids that appear in more than one fold:
dt = merge(bc_tsk$data(), cv5$instance, by.x = "Id", by.y = "row_id")
dt = dt[, .(unique_folds = length(unique(fold))), by = Id]
dt[unique_folds > 1, ]
Empty data.table (0 rows and 2 cols): Id,unique_folds
As expected, the table is empty, as there are no Ids present in more than one fold.
In some use cases, it might be necessary to use predefined folds. When using k-fold cross validation without repetition, this can be achieved by manually creating a feature that denotes the folds and assigning it to the task$col_roles$group slot. First, we create a vector that contains 5 predefined folds:
set.seed(1)
folds = sample(rep(1:5, length.out = nrow(BreastCancer)),
size = nrow(BreastCancer),
replace = FALSE)
head(folds, 20)
[1] 4 4 4 1 4 5 2 2 2 2 4 5 1 2 5 5 2 2 1 1
table(folds)
folds
1 2 3 4 5
140 140 140 140 139
This vector is now added to the data set and will be used as grouping factor just as when defining block resampling:
bc_tsk = TaskClassif$new(
id = "BreastCancer",
backend = data.frame(BreastCancer, foldIds = as.factor(folds)),
target = "Class",
positive = "malignant")
bc_tsk$col_roles$group = "foldIds"
# Remove "foldIds" from features
# bc_tsk$col_roles$feature = setdiff(bc_tsk$col_roles$feature, "foldIds")
We now instantiate a 5-fold CV that will respect the predefined folds:
cv5 = rsmp("cv", folds = 5)
set.seed(123)
cv5$instantiate(bc_tsk)
cv5$instance
row_id fold
1: 3 1
2: 1 2
3: 4 3
4: 2 4
5: 5 5
Since we have only five predefined folds, the cv5$instance data table has five rows and shows which of our foldIds values (contained in the row_id column) belongs to which instantiated fold. To check if the predefined groups are respected, we count in how many distinct folds each foldIds value appears:
dt = merge(bc_tsk$data(), cv5$instance, by.x = "foldIds", by.y = "row_id")
dt[, .(unique_folds = length(unique(fold))), by = foldIds]
foldIds unique_folds
1: 1 1
2: 2 1
3: 3 1
4: 4 1
5: 5 1
There are five groups and each foldIds value appears in exactly one fold. This means that each instantiated fold corresponds to one of the predefined folds.
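As a possible alternative, mlr3 also offers a custom_cv resampling that takes the fold assignment directly, so the fold ids need not be stored as a grouping column. The sketch below assumes an installed mlr3 version that provides rsmp("custom_cv"), and it uses the german_credit task with a made-up fold vector rather than the BreastCancer setup above.

```r
library(mlr3)

task = tsk("german_credit")

set.seed(1)
# predefined fold ids, one per row (invented for illustration)
folds = sample(rep(1:5, length.out = task$nrow))

ccv = rsmp("custom_cv")
# f is a factor with one entry per row of the task
ccv$instantiate(task, f = factor(folds))

# the test set of iteration i should contain exactly the rows with fold id i
same_rows = vapply(seq_len(ccv$iters), function(i) {
  setequal(ccv$test_set(i), which(folds == i))
}, logical(1))
same_rows
```

With this approach the fold column never becomes part of the task, so there is no need to remove it from the features afterwards.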
The previous example does not cover how to perform repeated k-fold CV or time series CV with predefined indices. This is possible via the custom resampling (mlr_resamplings_custom, constructed with rsmp("custom")), to which a list of predefined train and test indices can be assigned. In the following example, a custom resampling is created using indices created by caret::createMultiFolds():
gc_tsk = tsk("german_credit")
train_ind = caret::createMultiFolds(gc_tsk$truth(), k = 5, times = 10)
test_ind = lapply(train_ind, function(x) setdiff(1:gc_tsk$nrow, x))
rc = rsmp("custom")
rc$instantiate(gc_tsk, train_ind, test_ind)
We now check if the instantiated custom resampling contains the intended folds:
# check it for the first fold
all.equal(train_ind[[1]], rc$train_set(1))
[1] TRUE
# check it for all folds
unlist(lapply(1:rc$iters, function(i) all.equal(train_ind[[i]], rc$train_set(i))))
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE TRUE TRUE TRUE
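Time series CV, mentioned above, also comes down to building suitable index lists. A base R sketch of expanding-window splits follows; the helper name expanding_window and its arguments are hypothetical. The resulting train_sets and test_sets lists have the same shape as train_ind and test_ind above, so they could be handed to rc$instantiate() in the same way, assuming the rows of the task are ordered in time.

```r
# Hypothetical helper: expanding-window train/test indices for n time-ordered rows.
expanding_window = function(n, initial, horizon) {
  # last training row of each split
  ends = seq(from = initial, to = n - horizon, by = horizon)
  list(
    # training set: everything up to and including the end point
    train_sets = lapply(ends, function(e) seq_len(e)),
    # test set: the next `horizon` rows after the end point
    test_sets = lapply(ends, function(e) (e + 1):(e + horizon))
  )
}

splits = expanding_window(n = 100, initial = 60, horizon = 10)

length(splits$train_sets)
range(splits$test_sets[[1]])
```

Each training set ends right before its test set begins, so no future observation leaks into training.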
This post shows how to control the resampling process when using mlr3 in order to account for data specificities.
For attribution, please cite this work as
Dragicevic & Casalicchio (2020, March 30). mlr3gallery: Resampling - Stratified, Blocked and Predefined. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-03-30-stratification-blocking/
BibTeX citation
@misc{dragicevic2020resampling,
  author = {Dragicevic, Milan and Casalicchio, Giuseppe},
  title = {mlr3gallery: Resampling - Stratified, Blocked and Predefined},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-03-30-stratification-blocking/},
  year = {2020}
}