Impute missing variables

We show how to use mlr3pipelines to augment the “mlr_learners_classif.ranger” learner with automatic imputation.

Florian Pfisterer
01-31-2020

Prerequisites

This tutorial assumes familiarity with the basics of mlr3pipelines; consult the mlr3book if any aspects are unclear. The tutorial deals with the problem of missing data.

The random forest implementation in the package ranger unfortunately does not support missing values. Therefore, missing values must be imputed before the data is passed to the learner.

We show how to use mlr3pipelines to augment the ranger random forest learner with automatic imputation.

Construct the Base Objects

First, we take an example task with missing values (pima) and create the ranger learner:


library(mlr3)
library(mlr3learners)

task = tsk("pima")
print(task)

<TaskClassif:pima> (768 x 9)
* Target: diabetes
* Properties: twoclass
* Features (8):
  - dbl (8): age, glucose, insulin, mass, pedigree, pregnant, pressure,
    triceps

learner = lrn("classif.ranger")
print(learner)

<LearnerClassifRanger:classif.ranger>
* Model: -
* Parameters: list()
* Packages: ranger
* Predict Type: response
* Feature types: logical, integer, numeric, character, factor, ordered
* Properties: importance, multiclass, oob_error, twoclass, weights

We can now inspect the task for missing values. task$missings() returns the count of missing values for each variable.


task$missings()

diabetes      age  glucose  insulin     mass pedigree pregnant pressure 
       0        0        5      374       11        0        0       35 
 triceps 
     227 
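
Relating these counts to the task's row count gives the proportion of missing values per feature. A small convenience sketch, not part of the original post:


# proportion of missing values per feature, sorted in decreasing order
sort(task$missings() / task$nrow, decreasing = TRUE)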

Additionally, we can see that the ranger learner cannot handle missing values, as the "missings" property is absent from its properties:


learner$properties

[1] "importance" "multiclass" "oob_error"  "twoclass"   "weights"   

For comparison, other learners, e.g. the rpart learner, can handle missing values internally:


lrn("classif.rpart")$properties

[1] "importance"        "missings"          "multiclass"       
[4] "selected_features" "twoclass"          "weights"          

Before we dive deeper, we quickly visualize the two columns with the most missing values:


library(mlr3viz)
autoplot(task$clone()$select(c("insulin", "triceps")), type = "pairs")

Operators overview

An overview of the implemented PipeOps for imputation can be obtained as follows:


library(mlr3pipelines)
dt = as.data.table(mlr_pipeops)
dt[grepl("impute", dt$key) | grepl("miss", dt$key), ]

              key packages                    tags
1: imputeconstant                         missings
2:     imputehist graphics                missings
3:  imputelearner                         missings
4:     imputemean                         missings
5:   imputemedian    stats                missings
6:     imputemode                         missings
7:      imputeoor                         missings
8:   imputesample                         missings
9:        missind          missings,data transform
                                          feature_types input.num output.num
1: logical,integer,numeric,character,factor,ordered,...         1          1
2:                                      integer,numeric         1          1
3:                               logical,factor,ordered         1          1
4:                                      numeric,integer         1          1
5:                                      numeric,integer         1          1
6:               factor,integer,logical,numeric,ordered         1          1
7:             character,factor,integer,numeric,ordered         1          1
8:               factor,integer,logical,numeric,ordered         1          1
9: logical,integer,numeric,character,factor,ordered,...         1          1
   input.type.train input.type.predict output.type.train output.type.predict
1:             Task               Task              Task                Task
2:             Task               Task              Task                Task
3:             Task               Task              Task                Task
4:             Task               Task              Task                Task
5:             Task               Task              Task                Task
6:             Task               Task              Task                Task
7:             Task               Task              Task                Task
8:             Task               Task              Task                Task
9:             Task               Task              Task                Task

Construct Operators

mlr3pipelines contains several imputation methods. We focus on rather simple ones, and show how to impute missing values for factor features and numeric features respectively.

Since our task only has numeric features, we do not need to deal with imputing factor levels, and can instead concentrate on imputing numeric values:

We do this in a two-step process:

* We create new indicator columns that tell us whether the value of a feature is "missing" or "present". We achieve this using the missind PipeOp.
* Afterwards, we impute the missing numeric values, here using the histogram-based imputehist PipeOp.

We also have to make sure to apply the pipe operators in the correct order!


imp_missind = po("missind")
imp_num = po("imputehist", param_vals = list(affect_columns = selector_type("numeric")))

In order to better understand the individual steps, we can look at the results of each PipeOp separately.

We can manually trigger the PipeOp to test the operator on our task:


ext_task = imp_missind$train(list(task))[[1]]
ext_task$data()

     diabetes missing_glucose missing_insulin missing_mass missing_pressure
  1:      pos         present         missing      present          present
  2:      neg         present         missing      present          present
  3:      pos         present         missing      present          present
  4:      neg         present         present      present          present
  5:      pos         present         present      present          present
 ---                                                                       
764:      neg         present         present      present          present
765:      neg         present         missing      present          present
766:      neg         present         present      present          present
767:      pos         present         missing      present          present
768:      neg         present         missing      present          present
     missing_triceps
  1:         present
  2:         present
  3:         missing
  4:         present
  5:         present
 ---                
764:         present
765:         present
766:         present
767:         missing
768:         present

Note that missind creates indicator columns only for features that actually contained missing values during training; fully observed columns such as age receive no indicator. For imputehist, we can trigger the training step in the same way:


ext_task = imp_num$train(list(task))[[1]]
ext_task$data()

     diabetes age pedigree pregnant glucose   insulin mass pressure  triceps
  1:      pos  50    0.627        6     148 160.31981 33.6       72 35.00000
  2:      neg  31    0.351        1      85 404.07936 26.6       66 29.00000
  3:      pos  32    0.672        8     183  59.58139 23.3       64 47.06329
  4:      neg  21    0.167        1      89  94.00000 28.1       66 23.00000
  5:      pos  33    2.288        0     137 168.00000 43.1       40 35.00000
 ---                                                                        
764:      neg  63    0.171       10     101 180.00000 32.9       76 48.00000
765:      neg  27    0.340        2     122  42.96671 36.8       70 27.00000
766:      neg  30    0.245        5     121 112.00000 26.2       72 23.00000
767:      pos  47    0.349        1     126 227.33020 30.1       60 39.60904
768:      neg  23    0.315        1      93  41.90460 30.4       70 31.00000

This time we obtain the imputed data set without missing values.


ext_task$missings()

diabetes      age pedigree pregnant  glucose  insulin     mass pressure 
       0        0        0        0        0        0        0        0 
 triceps 
       0 

Putting everything together

Now we have to put all PipeOps together in order to form a graph that handles imputation automatically.

We do this by creating a Graph that copies the data twice, processes each copy using the respective imputation method, and afterwards unions the features. For this we need the following two PipeOps:

* copy: creates copies of the data.
* featureunion: merges the two tasks together.


graph = po("copy", 2) %>>% gunion(list(imp_missind, imp_num)) %>>% po("featureunion")

As a last step, we append the learner we planned on using:


graph = graph %>>% po(learner)

We can now visualize the resulting graph:


graph$plot()

Resampling

Correct imputation is especially important when applying imputation to held-out data during the predict step. If applied incorrectly, imputation could leak information from the test set, potentially skewing our performance estimates. mlr3pipelines takes this complexity away from the user and handles correct imputation internally.

By wrapping this graph into a GraphLearner, we can now resample the full graph, here with 3-fold cross-validation:


glearner = GraphLearner$new(graph)
rr = resample(task, glearner, rsmp("cv", folds = 3))
rr$aggregate()

classif.ce 
 0.2421875 
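
Besides the aggregated classification error, we can inspect the per-fold performance, e.g. using accuracy as an alternative measure:


# per-fold scores with the accuracy measure
rr$score(msr("classif.acc"))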

Missing values during prediction

In some cases, missing values occur only in the data we want to predict on. To showcase this, we create a copy of the task and introduce missing values in two additional columns, age and pedigree.


dt = task$data()
dt[1:10, "age"] = NA
dt[30:70, "pedigree"] = NA
task2 = TaskClassif$new("pima2", dt, target = "diabetes")
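
A quick check confirms the newly introduced missing values:


# age and pedigree now contain missing values in task2
task2$missings()[c("age", "pedigree")]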

We now train on task and predict on task2:


glearner$train(task)
glearner$predict(task2)

<PredictionClassif> for 768 observations:
    row_id truth response
         1   pos      pos
         2   neg      neg
         3   pos      pos
---                      
       766   neg      neg
       767   pos      pos
       768   neg      neg

Missing factor features

For factor features, the process works analogously. Instead of using imputehist, we can for example use imputeoor (out-of-range imputation), which simply replaces every NA in a factor variable with a new level representing "missing".
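
To see imputeoor in isolation, the following minimal sketch applies it to a hypothetical toy task with a single factor feature (the data is made up for illustration):


# toy task with a missing value in a factor feature
df = data.frame(
  y = factor(c("a", "b", "a", "b")),
  f = factor(c("x", NA, "y", "x"))
)
toy = TaskClassif$new("toy", df, target = "y")
# imputeoor replaces the NA with a new factor level
po("imputeoor")$train(list(toy))[[1]]$data()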

A full graph might then look like this:


imp_missind = po("missind", param_vals = list(affect_columns = NULL, which = "all"))
imp_fct = po("imputeoor",
  param_vals = list(affect_columns = selector_type("factor")))
graph = po("copy", 2) %>>%
  gunion(list(imp_missind, imp_num %>>% imp_fct)) %>>%
  po("featureunion")

Note that we specify the parameter affect_columns = NULL when initializing missind, because we also want indicator columns for our factor features. By default, affect_columns would be set to selector_invert(selector_type(c("factor", "ordered", "character"))). We also set the parameter which to "all" to add indicator columns for all features, regardless of whether values were missing during training or not.
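
We can verify these settings directly on the constructed PipeOp:


# the parameter values we set when constructing missind
imp_missind$param_set$values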

In order to test our new graph, we again create a situation where the prediction data has missing values in factor features. As the pima task does not have any factor features, we use the well-known boston_housing task instead.


# t1 is the training data without missings
t1 = tsk("boston_housing")

# t2 is the prediction data with missings
dt = t1$data()
dt[1:10, chas := NA][20:30, rm := NA]
t2 = TaskRegr$new("bh", dt, target = "medv")

Now we train on t1 and predict on t2:


gl = GraphLearner$new(graph %>>% po(lrn("regr.ranger")))
gl$train(t1)
gl$predict(t2)

<PredictionRegr> for 506 observations:
    row_id truth response
         1  24.0 24.64824
         2  21.6 22.08679
         3  34.7 34.02308
---                      
       504  23.9 23.93832
       505  22.0 22.57185
       506  11.9 16.17228

Success! We learned how to deal with missing values in less than 10 minutes.

Citation

For attribution, please cite this work as

Pfisterer (2020, Jan. 31). mlr3gallery: Impute missing variables. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-01-30-impute-missing-levels/

BibTeX citation

@misc{pfisterer2020impute,
  author = {Pfisterer, Florian},
  title = {mlr3gallery: Impute missing variables},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-01-30-impute-missing-levels/},
  year = {2020}
}