Select uncorrelated features

The following example describes a situation where we aim to remove correlated features. In essence, this means that we drop features until no pair of features has a correlation higher than a given cutoff. This is often useful when, for example, we want to use linear models.

Martin Binder, Florian Pfisterer
02-25-2020


Prerequisites

This tutorial assumes familiarity with the basics of mlr3pipelines. Consult the mlr3book if any aspects are unclear. Additionally, we compare different cutoff values via tuning with the mlr3tuning package; again, the mlr3book contains an introduction to mlr3tuning and paradox.

The example describes a fairly involved use case, where the behavior of PipeOpSelect is manipulated via a trafo on its ParamSet.

Getting started


library("mlr3")
library("mlr3pipelines")
library("paradox")
library("mlr3tuning")

The basic pipeline looks as follows: we use PipeOpSelect to select a set of variables, followed by an rpart learner.


pipeline = po("select") %>>% lrn("classif.rpart")

Now we get to the magic:

We want to use the function caret::findCorrelation() from the caret package to select uncorrelated variables. This function has a cutoff parameter that specifies the maximum correlation allowed between any pair of variables. To expose this cutoff as a numeric parameter we can tune over, we specify the following ParamSet:


ps = ParamSet$new(list(ParamDbl$new("cutoff", 0, 1)))
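Before wiring this into the pipeline, it may help to see caret::findCorrelation() in isolation. The following is a minimal sketch on the four numeric iris features, assuming the caret package is installed; the cutoff value of 0.9 is chosen arbitrarily for illustration:

```r
# Correlation matrix of the four numeric iris features
cormat = cor(iris[, 1:4])

# Names of the features findCorrelation() suggests dropping so that
# no remaining pair has an absolute correlation above the cutoff
caret::findCorrelation(cormat, cutoff = 0.9, exact = TRUE, names = TRUE)
```

The features we want to keep are then simply all feature names except the ones returned here, which is exactly what the selector below computes via setdiff().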

Now we use a trafo to turn the cutoff into a selector function, which is what PipeOpSelect can work with: the function takes a Task as input and returns the names of the features we aim to keep. Note that we set x$cutoff = NULL to remove the temporary parameter we introduced, as PipeOpSelect does not know what to do with it.


ps$trafo = function(x, param_set) {
  cutoff = x$cutoff
  x$select.selector = function(task) {
    fn = task$feature_names
    data = task$data(cols = fn)
    drop = caret::findCorrelation(cor(data), cutoff = cutoff, exact = TRUE, names = TRUE)
    setdiff(fn, drop)
  }
  x$cutoff = NULL
  x
}

If you are not sure you understand the trafo concept, consult the mlr3book, which has a section on trafos.
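To check that the trafo behaves as intended, we can apply it to a parameter value manually. This is a quick sanity check using the ps object defined above (we call the trafo function directly with a named list and the ParamSet):

```r
x = ps$trafo(list(cutoff = 0.5), ps)

names(x)  # only "select.selector" remains; "cutoff" was removed

# The selector can be called on a Task directly; it returns the
# names of the features kept at this cutoff
x$select.selector(tsk("iris"))
```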

To now tune over different values for cutoff, we first create a TuningInstanceSingleCrit.


inst = TuningInstanceSingleCrit$new(
  task = tsk("iris"),
  learner = pipeline,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("none"),
  # don't need the following line for optimization, this is for
  # demonstration that different features were selected
  store_models = TRUE
)

Then we run the tuning:


tnr("grid_search")$optimize(inst)

      cutoff learner_param_vals  x_domain classif.ce
1: 0.8888889          <list[2]> <list[1]>       0.04

To demonstrate that different cutoff values result in different features being selected, we can inspect the trained models as follows. Note that this inspects only the trained model of the first CV fold of each evaluated configuration. The features being excluded depend on the training data seen by the pipeline and may differ between folds, even at the same cutoff value.


inst$archive$data("x_domain")[
  order(cutoff),
  list(cutoff, classif.ce,
    featurenames = lapply(resample_result, function(x) {
      x$learners[[1]]$model$classif.rpart$train_task$feature_names
    }
  ))]

       cutoff classif.ce                                      featurenames
 1: 0.0000000 0.30666667                                      Sepal.Length
 2: 0.1111111 0.31333333                          Sepal.Length,Sepal.Width
 3: 0.2222222 0.29333333                          Sepal.Length,Sepal.Width
 4: 0.3333333 0.29333333                          Sepal.Length,Sepal.Width
 5: 0.4444444 0.29333333                          Sepal.Length,Sepal.Width
 6: 0.5555556 0.29333333                          Sepal.Length,Sepal.Width
 7: 0.6666667 0.29333333                          Sepal.Length,Sepal.Width
 8: 0.7777778 0.29333333                          Sepal.Length,Sepal.Width
 9: 0.8888889 0.04000000              Petal.Width,Sepal.Length,Sepal.Width
10: 1.0000000 0.07333333 Petal.Length,Petal.Width,Sepal.Length,Sepal.Width

Voilà, we effectively created our own PipeOp that makes use of advanced mlr3pipelines and paradox functionality in only a few lines of code.

Citation

For attribution, please cite this work as

Binder & Pfisterer (2020, Feb. 25). mlr3gallery: Select uncorrelated features. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-02-25-remove-correlated-features/

BibTeX citation

@misc{binder2020select,
  author = {Binder, Martin and Pfisterer, Florian},
  title = {mlr3gallery: Select uncorrelated features},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-02-25-remove-correlated-features/},
  year = {2020}
}