The following example describes a situation where we aim to remove correlated features. In essence, this means that we drop features until no two features have a correlation higher than a given cutoff. This is often useful when we, for example, want to use linear models.
This tutorial assumes familiarity with the basics of mlr3pipelines. Consult the mlr3book if some aspects are not fully understandable. Additionally, we compare different cutoff values via tuning using the mlr3tuning package. Again, the mlr3book has an intro to mlr3tuning and paradox.
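Throughout, we assume the following packages are loaded (a minimal setup sketch; the caret and rpart packages also need to be installed):

library("mlr3")
library("mlr3pipelines")
library("mlr3tuning")
library("paradox")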
The example describes a fairly involved use case, where the behavior of PipeOpSelect is manipulated via a trafo on its ParamSet.
The basic pipeline looks as follows: we use PipeOpSelect to select a set of variables, followed by an rpart learner.
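A minimal sketch of this pipeline (the GraphLearner wrapper makes the graph usable as an ordinary learner; the object pipeline is reused during tuning below):

# PipeOpSelect (id "select") filters features before they reach the learner
graph = po("select") %>>% lrn("classif.rpart")
pipeline = GraphLearner$new(graph)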
Now we get to the magic:
We want to use the function caret::findCorrelation() from the caret package to select uncorrelated variables. This function has a cutoff parameter that specifies the maximum correlation allowed between variables. In order to expose this value as a numeric parameter we can tune over, we specify the following ParamSet.
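A minimal version, using paradox's ParamDbl, might look like this (correlation cutoffs naturally range from 0 to 1):

ps = ParamSet$new(list(
  ParamDbl$new("cutoff", lower = 0, upper = 1)
))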
We define a selector function that takes a Task as input and returns the names of the features we aim to keep. We then use a trafo to transform the cutoff into such a selector, which is what PipeOpSelect can work with: the trafo constructs the function (capturing the current cutoff value) and assigns it to the select.selector parameter. Note that we set x$cutoff = NULL in order to remove the temporary parameter we introduced, as PipeOpSelect does not know what to do with it.
ps$trafo = function(x, param_set) {
  cutoff = x$cutoff
  # The selector: compute the correlation matrix of all features and
  # drop one feature of every pair correlated above the cutoff.
  x$select.selector = function(task) {
    fn = task$feature_names
    data = task$data(cols = fn)
    drop = caret::findCorrelation(cor(data), cutoff = cutoff, exact = TRUE, names = TRUE)
    setdiff(fn, drop)
  }
  # Remove the temporary parameter; PipeOpSelect does not know it.
  x$cutoff = NULL
  x
}
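To check what the trafo produces, we can apply it by hand to a single value (illustrative only; during tuning this transformation happens automatically):

trafoed = ps$trafo(list(cutoff = 0.7), ps)
# the temporary cutoff entry is gone; only the selector function remains
names(trafoed)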
If you are not sure you understand the trafo concept, consult the mlr3book, which has a section on it.
To tune over different values for cutoff, we first create a TuningInstanceSingleCrit.
inst = TuningInstanceSingleCrit$new(
  task = tsk("iris"),
  learner = pipeline,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("none"),
  # don't need the following line for optimization, this is for
  # demonstration that different features were selected
  store_models = TRUE
)
Then we run the tuning:
tnr("grid_search")$optimize(inst)
cutoff learner_param_vals x_domain classif.ce
1: 0.8888889 <list[2]> <list[1]> 0.04666667
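The tuned value can then be applied by setting the pipeline's parameters to the instance's result, e.g. (a sketch using result_learner_param_vals; not part of the original tuning run):

# set the best configuration found (includes the trafo'd selector) and retrain
pipeline$param_set$values = inst$result_learner_param_vals
pipeline$train(tsk("iris"))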
To demonstrate that different cutoff values result in different features being selected, we can inspect the trained models. Note that this inspects only the model trained on the first CV fold of each evaluated configuration. Which features are excluded depends on the training data seen by the pipeline and may differ between folds, even at the same cutoff value.
inst$archive$data("x_domain")[
  order(cutoff),
  list(cutoff, classif.ce,
    featurenames = lapply(resample_result, function(x) {
      x$learners[[1]]$model$classif.rpart$train_task$feature_names
    }))]
cutoff classif.ce featurenames
1: 0.0000000 0.29333333
2: 0.1111111 0.29333333
3: 0.2222222 0.28666667
4: 0.3333333 0.28666667
5: 0.4444444 0.28666667
6: 0.5555556 0.28666667
7: 0.6666667 0.28666667
8: 0.7777778 0.28666667
9: 0.8888889 0.04666667
10: 1.0000000 0.06666667
Voilà, we customized a PipeOp using quite advanced knowledge of mlr3pipelines and paradox in only a few lines of code.