Impute Missing Variables

Tags: classification, imputation, mlr3pipelines, pima data set

We show how to use mlr3pipelines to augment the “mlr_learners_classif.ranger” learner with automatic imputation.

Florian Pfisterer


This tutorial assumes familiarity with the basics of mlr3pipelines; consult the mlr3book if some aspects are unclear. It deals with the problem of missing data.

The random forest implementation in the package ranger unfortunately does not support missing values. Therefore, missing values must be imputed before the data is passed to the learner.

We show how to use mlr3pipelines to augment the ranger random forest learner with automatic imputation.

Construct the Base Objects

First, we take an example task with missing values (pima) and create the ranger learner:


task = tsk("pima")
task

<TaskClassif:pima> (768 x 9)
* Target: diabetes
* Properties: twoclass
* Features (8):
  - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure, triceps

learner = lrn("classif.ranger")
learner

<LearnerClassifRanger:classif.ranger>
* Model: -
* Parameters: num.threads=1
* Packages: ranger
* Predict Type: response
* Feature types: logical, integer, numeric, character, factor, ordered
* Properties: importance, multiclass, oob_error, twoclass, weights

We can now inspect the task for missing values. task$missings() returns the count of missing values for each variable.

task$missings()

diabetes      age  glucose  insulin     mass pedigree pregnant pressure  triceps 
       0        0        5      374       11        0        0       35      227 

Additionally, we can see that the ranger learner cannot handle missing values, as "missings" is not among its properties:

learner$properties

[1] "importance" "multiclass" "oob_error"  "twoclass"   "weights"   

For comparison, other learners, e.g. the rpart learner, can handle missing values internally:

lrn("classif.rpart")$properties

[1] "importance"        "missings"          "multiclass"        "selected_features" "twoclass"         
[6] "weights"          
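Whether a learner copes with missing values on its own is encoded in its properties, so this difference can also be checked programmatically (a small sketch based on the properties printed above):

```r
library(mlr3)
library(mlr3learners)  # provides classif.ranger

# ranger does not declare the "missings" property, rpart does
"missings" %in% lrn("classif.ranger")$properties  # FALSE
"missings" %in% lrn("classif.rpart")$properties   # TRUE
```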

Before we dive deeper, we quickly visualize the two columns with the most missing values (this requires the mlr3viz and GGally packages):

library(mlr3viz)
autoplot(task$clone()$select(c("insulin", "triceps")), type = "pairs")

Operators overview

An overview of the implemented PipeOps for imputation can be obtained like so:

dt = as.data.table(mlr_pipeops)
dt[grepl("impute", dt$key) | grepl("miss", dt$key), ]
              key packages                    tags                                        feature_types input.num
1: imputeconstant                         missings logical,integer,numeric,character,factor,ordered,...         1
2:     imputehist graphics                missings                                      integer,numeric         1
3:  imputelearner                         missings                               logical,factor,ordered         1
4:     imputemean                         missings                                      numeric,integer         1
5:   imputemedian    stats                missings                                      numeric,integer         1
6:     imputemode                         missings               factor,integer,logical,numeric,ordered         1
7:      imputeoor                         missings             character,factor,integer,numeric,ordered         1
8:   imputesample                         missings               factor,integer,logical,numeric,ordered         1
9:        missind          missings,data transform logical,integer,numeric,character,factor,ordered,...         1
   output.num input.type.train input.type.predict output.type.train output.type.predict
1:          1             Task               Task              Task                Task
2:          1             Task               Task              Task                Task
3:          1             Task               Task              Task                Task
4:          1             Task               Task              Task                Task
5:          1             Task               Task              Task                Task
6:          1             Task               Task              Task                Task
7:          1             Task               Task              Task                Task
8:          1             Task               Task              Task                Task
9:          1             Task               Task              Task                Task

Construct Operators

mlr3pipelines contains several imputation methods. We focus on rather simple ones, and show how to impute missing values for factor features and numeric features respectively.

Since our task only has numeric features, we do not need to deal with imputing factor levels and can instead concentrate on imputing numeric values.

We do this in a two-step process:

* We create new indicator columns that tell us whether the value of a feature is "missing" or "present". We achieve this using the missind PipeOp.
* We impute the missing numeric values themselves, here by sampling from each feature's empirical histogram using the imputehist PipeOp.

We also have to make sure to apply the pipe operators in the correct order!

imp_missind = po("missind")
imp_num = po("imputehist", param_vals = list(affect_columns = selector_type("numeric")))

To better understand what each step does, we can look at the results of every PipeOp separately.

We can manually trigger the PipeOp to test the operator on our task:

ext_task = imp_missind$train(list(task))[[1]]
ext_task$data()
     diabetes missing_glucose missing_insulin missing_mass missing_pressure missing_triceps
  1:      pos         present         missing      present          present         present
  2:      neg         present         missing      present          present         present
  3:      pos         present         missing      present          present         missing
  4:      neg         present         present      present          present         present
  5:      pos         present         present      present          present         present
 ---                                                                                       
764:      neg         present         present      present          present         present
765:      neg         present         missing      present          present         present
766:      neg         present         present      present          present         present
767:      pos         present         missing      present          present         missing
768:      neg         present         missing      present          present         present

For imputehist, we can do the same:

ext_task = imp_num$train(list(task))[[1]]
ext_task$data()
     diabetes age pedigree pregnant glucose   insulin mass pressure  triceps
  1:      pos  50    0.627        6     148 175.63652 33.6       72 35.00000
  2:      neg  31    0.351        1      85 152.54269 26.6       66 29.00000
  3:      pos  32    0.672        8     183 134.90456 23.3       64 35.14458
  4:      neg  21    0.167        1      89  94.00000 28.1       66 23.00000
  5:      pos  33    2.288        0     137 168.00000 43.1       40 35.00000
 ---                                                                       
764:      neg  63    0.171       10     101 180.00000 32.9       76 48.00000
765:      neg  27    0.340        2     122 156.48499 36.8       70 27.00000
766:      neg  30    0.245        5     121 112.00000 26.2       72 23.00000
767:      pos  47    0.349        1     126  34.65536 30.1       60 23.98408
768:      neg  23    0.315        1      93 118.42039 30.4       70 31.00000

This time we obtain the imputed data set without missing values:

ext_task$missings()

diabetes      age pedigree pregnant  glucose  insulin     mass pressure  triceps 
       0        0        0        0        0        0        0        0        0 

Putting everything together

Now we have to put all PipeOps together in order to form a graph that handles imputation automatically.

We do this by creating a Graph that copies the data twice, processes each copy using the respective imputation method, and afterwards unions the features. For this we need the following two PipeOps:

* copy: creates copies of the data.
* featureunion: merges the two tasks together.

graph = po("copy", 2) %>>% gunion(list(imp_missind, imp_num)) %>>% po("featureunion")

As a last step, we append the learner we planned on using:

graph = graph %>>% po(learner)

We can now visualize the resulting graph:

graph$plot()

Correct imputation is especially important when applying imputation to held-out data during the predict step. If applied incorrectly, imputation could leak information from the test set, which potentially skews our performance estimates. mlr3pipelines takes this complexity away from the user and handles correct imputation internally.
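A small sketch of why this works: each imputation PipeOp fits its imputation model during $train() and stores it in its state; $predict() then reuses that stored state instead of re-estimating anything on the new data.

```r
library(mlr3)
library(mlr3pipelines)

imp = po("imputehist")
# $train() fits a histogram per numeric feature and stores it in the PipeOp's state
train_out = imp$train(list(tsk("pima")))[[1]]
# $predict() imputes by sampling from the histograms fitted during training,
# so nothing is estimated from the data we predict on
pred_out = imp$predict(list(tsk("pima")))[[1]]
imp$is_trained  # TRUE: the operator carries its training state
```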

By wrapping this graph into a GraphLearner, we can now resample the full graph, here with a 3-fold cross-validation:

glearner = GraphLearner$new(graph)
rr = resample(task, glearner, rsmp("cv", folds = 3))
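The resampling result can then be aggregated into a single performance estimate, e.g. the classification error (classif.ce is the standard mlr3 measure name). Rebuilt here as a self-contained sketch of the pipeline above:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# same imputation pipeline as above, rebuilt so this chunk runs on its own
graph = po("copy", 2) %>>%
  gunion(list(
    po("missind"),
    po("imputehist", param_vals = list(affect_columns = selector_type("numeric")))
  )) %>>%
  po("featureunion") %>>%
  po(lrn("classif.ranger"))

glearner = GraphLearner$new(graph)
rr = resample(tsk("pima"), glearner, rsmp("cv", folds = 3))

# mean classification error across the 3 folds
rr$aggregate(msr("classif.ce"))
```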

Missing values during prediction

In some cases, we have missing values only in the data we want to predict on. To showcase this, we create a copy of the task with additional missing values in two columns.

dt = task$data()
dt[1:10, "age"] = NA
dt[30:70, "pedigree"] = NA
task2 = TaskClassif$new("pima2", dt, target = "diabetes")
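A quick comparison of the missing-value counts confirms the difference; the chunk below rebuilds task and task2 so it runs on its own:

```r
library(mlr3)

task = tsk("pima")
dt = task$data()  # returns a copy, so the original task stays intact
dt[1:10, "age"] = NA
dt[30:70, "pedigree"] = NA
task2 = TaskClassif$new("pima2", dt, target = "diabetes")

# age and pedigree had no NAs originally; task2 now has 10 and 41
task$missings()[c("age", "pedigree")]
task2$missings()[c("age", "pedigree")]
```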

And now we train on task, while predicting on task2:

glearner$train(task)
glearner$predict(task2)

<PredictionClassif> for 768 observations:
    row_ids truth response
          1   pos      pos
          2   neg      neg
          3   pos      pos
        766   neg      neg
        767   pos      pos
        768   neg      neg

Missing factor features

For factor features, the process works analogously. Instead of imputehist, we can for example use imputeoor. This will simply replace every NA in each factor variable with a new level that encodes "missing".

A full graph might then look like this:

imp_missind = po("missind", param_vals = list(affect_columns = NULL, which = "all"))
imp_fct = po("imputeoor",
  param_vals = list(affect_columns = selector_type("factor")))
graph = po("copy", 2) %>>%
  gunion(list(imp_missind, imp_num %>>% imp_fct)) %>>%
  po("featureunion")

Note that we specify the parameter affect_columns = NULL when initializing missind, because we also want indicator columns for our factor features. By default, affect_columns would be set to selector_invert(selector_type(c("factor", "ordered", "character"))). We also set the parameter which to "all" to add indicator columns for all features, regardless of whether values were missing during training or not.
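Selectors are plain functions that map a task to a subset of its feature names, so we can inspect what the default selector of missind would pick, e.g. on the pima task:

```r
library(mlr3)
library(mlr3pipelines)

# the default affect_columns of missind, written out explicitly
sel = selector_invert(selector_type(c("factor", "ordered", "character")))
sel(tsk("pima"))  # on pima: all 8 numeric features
```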

In order to test our new graph, we again create a situation where our task has missing factor values. As the pima task does not have any factor features, we use the famous boston_housing task.

# t1 is the training data without missings
t1 = tsk("boston_housing")

# t2 is the prediction data with missings
dt = t1$data()
dt[1:10, chas := NA][20:30, rm := NA]
t2 = TaskRegr$new("bh", dt, target = "medv")

Now we train on t1 and predict on t2:

gl = GraphLearner$new(graph %>>% po(lrn("regr.ranger")))
gl$train(t1)
gl$predict(t2)
<PredictionRegr> for 506 observations:
    row_ids truth response
          1  24.0 24.80063
          2  21.6 22.32329
          3  34.7 33.87577
        504  23.9 24.10095
        505  22.0 22.44476
        506  11.9 16.32286

Success! We learned how to deal with missing values in less than 10 minutes.


For attribution, please cite this work as

Pfisterer (2020, Jan. 31). mlr3gallery: Impute Missing Variables. Retrieved from

BibTeX citation

  author = {Pfisterer, Florian},
  title = {mlr3gallery: Impute Missing Variables},
  url = {},
  year = {2020}