mlr3pipelines Tutorial - German Credit

mlr3pipelines imputation filtering stacking german credit

In this use case, we continue working with the German credit dataset. We already used different Learners on it in previous posts and tried to optimize their hyperparameters. To make things interesting, we artificially introduce missing values into the dataset, perform imputation and filtering, and stack Learners.

Martin Binder , Florian Pfisterer
03-11-2020

Outline

This is the third part of a series of use cases on the German credit dataset.

In this tutorial, we continue working with the German credit dataset. We already used different Learners on it and tried to optimize their hyperparameters. Now we will do four additional things:

  1. We preprocess the data as an integrated step of the model fitting process
  2. We tune the associated preprocessing parameters
  3. We stack multiple Learners in an ensemble model
  4. We discuss some techniques that enable Learners to tackle challenging datasets that they could not handle otherwise (we will outline below what challenging means in this context)

Prerequisites

We again use the German credit dataset, but restrict ourselves to the factor features. To make things a bit harder for our Learners, we introduce missing values into the dataset:

# load the packages used throughout this post
library("mlr3")
library("mlr3learners")
library("mlr3pipelines")
library("mlr3filters")
library("mlr3tuning")
library("paradox")
library("data.table")
library("ggplot2")

task = tsk("german_credit")
credit_full = task$data()
credit = credit_full[, sapply(credit_full, FUN = is.factor), with = FALSE]

set.seed(20191101)
# randomly set ~10% of each column's values to NA:
# indexing a vector with NA yields NA, while indexing with TRUE keeps the value
credit = credit[, lapply(.SD, function(x) {
  x[sample(c(TRUE, NA), length(x), replace = TRUE, prob = c(.9, .1))]
})]
credit$credit_risk = credit_full$credit_risk
task = TaskClassif$new("GermanCredit", credit, "credit_risk")
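
We can check that the missing values were indeed introduced:

task$missings()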

We instantiate a Resampling object for this Task, so that all learners are evaluated on the same cross-validation folds and their performance is comparable:

set.seed(20191101)
cv10_instance = rsmp("cv")$instantiate(task)

You can uncomment the following line if you are running this locally:

# future::plan("multisession") # uncomment for parallelization

Intro

In this use case, we will take a look at composite machine learning algorithms that may incorporate data preprocessing or the combination of multiple Learners (“ensemble methods”).

We use the mlr3pipelines package that enables us to chain PipeOps into data flow Graphs.
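
As a quick illustration of the chaining syntax, the following two PipeOps are unrelated to our credit data and only demonstrate the %>>% operator:

# chain two PipeOps into a Graph and plot the resulting data flow
gr = po("scale") %>>% po("pca")
gr$plot()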

Available PipeOps are listed in the mlr_pipeops dictionary:

mlr_pipeops
<DictionaryPipeOp> with 64 stored values
Keys: boxcox, branch, chunk, classbalancing, classifavg, classweights,
  colapply, collapsefactors, colroles, copy, datefeatures, encode,
  encodeimpact, encodelmer, featureunion, filter, fixfactors, histbin,
  ica, imputeconstant, imputehist, imputelearner, imputemean,
  imputemedian, imputemode, imputeoor, imputesample, kernelpca,
  learner, learner_cv, missind, modelmatrix, multiplicityexply,
  multiplicityimply, mutate, nmf, nop, ovrsplit, ovrunite, pca, proxy,
  quantilebin, randomprojection, randomresponse, regravg,
  removeconstants, renamecolumns, replicate, scale, scalemaxabs,
  scalerange, select, smote, spatialsign, subsample, targetinvert,
  targetmutate, targettrafoscalerange, textvectorizer, threshold,
  tunethreshold, unbranch, vtreat, yeojohnson
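
For an overview with more metadata, such as the input and output types of each PipeOp, the dictionary can be converted to a table:

as.data.table(mlr_pipeops)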

Missing Value Imputation

We have just introduced missing values into our data. While some Learners can deal with missing values, many cannot. Trying to train a random forest fails because of this:

ranger = lrn("classif.ranger")
ranger$train(task)
Error: Missing data in columns: credit_history, employment_duration, foreign_worker, housing, installment_rate, job, number_credits, other_debtors, other_installment_plans, people_liable, personal_status_sex, present_residence, property, purpose, savings, status, telephone.
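
As an aside, whether a Learner can deal with missing values can be checked via its properties field; the "missings" property would be listed there if it could:

# "missings" does not appear among ranger's properties,
# which is why training on incomplete data fails
lrn("classif.ranger")$properties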

We can perform imputation of missing values using a PipeOp. To find out which imputation PipeOps are available, we do the following:

mlr_pipeops$keys("^impute")
[1] "imputeconstant" "imputehist"     "imputelearner"  "imputemean"    
[5] "imputemedian"   "imputemode"     "imputeoor"      "imputesample"  

We choose to impute factor features by out-of-range imputation, i.e. by adding a new factor level (via PipeOpImputeOOR). Let's apply the PipeOp directly to create an imputed Task and see how it works:

imputer = po("imputeoor")
task_imputed = imputer$train(list(task))[[1]]
task_imputed$missings()
            credit_risk          credit_history     employment_duration 
                      0                       0                       0 
         foreign_worker                 housing        installment_rate 
                      0                       0                       0 
                    job          number_credits           other_debtors 
                      0                       0                       0 
other_installment_plans           people_liable     personal_status_sex 
                      0                       0                       0 
      present_residence                property                 purpose 
                      0                       0                       0 
                savings                  status               telephone 
                      0                       0                       0 
head(task_imputed$data())
   credit_risk                              credit_history employment_duration
1:        good     all credits at this bank paid back duly            >= 7 yrs
2:         bad no credits taken/all credits paid back duly    1 <= ... < 4 yrs
3:        good     all credits at this bank paid back duly    4 <= ... < 7 yrs
4:        good no credits taken/all credits paid back duly    4 <= ... < 7 yrs
5:         bad    existing credits paid back duly till now    1 <= ... < 4 yrs
6:        good                                    .MISSING    1 <= ... < 4 yrs
   foreign_worker housing installment_rate                       job
1:             no    rent             < 20 skilled employee/official
2:       .MISSING    rent   25 <= ... < 35 skilled employee/official
3:       .MISSING    rent   25 <= ... < 35      unskilled - resident
4:             no     own   25 <= ... < 35 skilled employee/official
5:             no     own   20 <= ... < 25 skilled employee/official
6:             no     own   25 <= ... < 35      unskilled - resident
   number_credits other_debtors other_installment_plans people_liable
1:            2-3          none                    none        0 to 2
2:              1          none                    none        0 to 2
3:       .MISSING          none                    none     3 or more
4:              1      .MISSING                    none     3 or more
5:            2-3      .MISSING                    none     3 or more
6:              1          none                    none     3 or more
                    personal_status_sex present_residence              property
1:                             .MISSING          >= 7 yrs              .MISSING
2: female : non-single or male : single  1 <= ... < 4 yrs unknown / no property
3:                             .MISSING  4 <= ... < 7 yrs unknown / no property
4:               male : married/widowed          >= 7 yrs          car or other
5:               male : married/widowed          >= 7 yrs           real estate
6:               male : married/widowed          >= 7 yrs           real estate
               purpose                    savings
1: furniture/equipment             ... >= 1000 DM
2: furniture/equipment unknown/no savings account
3:             repairs unknown/no savings account
4:          car (used) unknown/no savings account
5:              others unknown/no savings account
6:             repairs             ... >= 1000 DM
                                       status                 telephone
1:                                   .MISSING yes (under customer name)
2:                                 ... < 0 DM                  .MISSING
3: ... >= 200 DM / salary for at least 1 year                        no
4:                        no checking account                  .MISSING
5:                        no checking account                        no
6: ... >= 200 DM / salary for at least 1 year yes (under customer name)

We need complete data not only during training but also during prediction, and using the same imputation heuristic for both is the most consistent strategy. This way, the imputation can in fact be seen as part of the complete learner (and could even be tuned).

If we used the imputed Task for Resampling, we would leak information from the test set into the training set. Therefore, it is mandatory to attach the imputation operator to the Learner itself, creating a GraphLearner:

imp_ranger = GraphLearner$new(po("imputeoor") %>>% ranger)

imp_ranger$train(task)

This GraphLearner can be used for resampling – like an ordinary Learner:

rr = resample(task, learner = imp_ranger, resampling = cv10_instance)
rr$aggregate()
classif.ce 
     0.254 
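
As noted above, the imputation strategy itself can be treated as a tunable part of the pipeline. A minimal sketch of how one might select between two imputation methods using PipeOpBranch; the branch labels "oor" and "mode" are our own:

# route the data through exactly one of two imputation PipeOps
branches = c("oor", "mode")
imp_branched = po("branch", branches) %>>%
  gunion(list(po("imputeoor"), po("imputemode"))) %>>%
  po("unbranch", branches) %>>%
  lrn("classif.ranger")

# the active branch is now an ordinary hyperparameter that could be tuned
imp_searchspace = ParamSet$new(list(
  ParamFct$new("branch.selection", levels = branches)
))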

Feature Filtering

Typically, sparse models, i.e. models with few(er) features, are desirable for a variety of reasons, e.g., enhanced interpretability or lower costs of acquiring data. Furthermore, sparse models may actually be associated with better performance, especially if overfitting is anticipated. We can use feature filtering to keep only the most informative features. Filters are implemented in the mlr3filters package and listed in the following dictionary:

mlr_filters
<DictionaryFilter> with 18 stored values
Keys: anova, auc, carscore, cmim, correlation, disr, find_correlation,
  importance, information_gain, jmi, jmim, kruskal_test, mim, mrmr,
  njmim, performance, permutation, variance

We apply the FilterMIM (mutual information maximization) Filter as implemented in the praznik package. This Filter allows for the selection of the top k features with the highest mutual information.

filter = flt("mim")
filter$calculate(task_imputed)$scores
                 status          credit_history                 savings 
           0.0560477846            0.0272292095            0.0162383750 
                purpose                property                 housing 
           0.0145509110            0.0117234634            0.0114749476 
    employment_duration     personal_status_sex other_installment_plans 
           0.0085307023            0.0056389930            0.0052711015 
          other_debtors          foreign_worker       present_residence 
           0.0032857631            0.0030732135            0.0018579748 
       installment_rate           people_liable          number_credits 
           0.0016124103            0.0014452494            0.0010182556 
              telephone                     job 
           0.0006202092            0.0003274007 

Using this Filter, you may wonder at what cost the reduction of the feature space comes. We can investigate the trade-off between the number of features and performance by tuning. We incorporate our filtering strategy into the pipeline using PipeOpFilter. As before, we need to perform imputation first, since the Filter also relies on complete data:

fpipe = po("imputeoor") %>>% po("filter", flt("mim"), filter.nfeat = 3)
fpipe$train(task)[[1]]$head()
   credit_risk                              credit_history
1:        good     all credits at this bank paid back duly
2:         bad no credits taken/all credits paid back duly
3:        good     all credits at this bank paid back duly
4:        good no credits taken/all credits paid back duly
5:         bad    existing credits paid back duly till now
6:        good                                    .MISSING
                      savings                                     status
1:             ... >= 1000 DM                                   .MISSING
2: unknown/no savings account                                 ... < 0 DM
3: unknown/no savings account ... >= 200 DM / salary for at least 1 year
4: unknown/no savings account                        no checking account
5: unknown/no savings account                        no checking account
6:             ... >= 1000 DM ... >= 200 DM / salary for at least 1 year

We can now tune over the mim.filter.nfeat parameter. It controls how many features are kept by the Filter and eventually used by the learner:

searchspace = ParamSet$new(list(
  ParamInt$new("mim.filter.nfeat", lower = 1, upper = length(task$feature_names))
))

The problem is one-dimensional (i.e. only one parameter is tuned), so we make use of a grid search. For higher dimensions, strategies like random search are more appropriate:

inst = TuningInstanceSingleCrit$new(
  task, fpipe %>>% lrn("classif.ranger"), cv10_instance, msr("classif.ce"),
  searchspace, trm("none")
)
tuner = tnr("grid_search")

The tuning procedure may take some time:

tuner$optimize(inst)
   mim.filter.nfeat learner_param_vals  x_domain classif.ce
1:               12          <list[4]> <list[1]>      0.251
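
To make use of the result, one could rebuild the pipeline as a GraphLearner and set the optimized parameter values, a sketch assuming the result fields of TuningInstanceSingleCrit:

# train a final model with the best found number of features
best_learner = GraphLearner$new(fpipe %>>% lrn("classif.ranger"))
best_learner$param_set$values = inst$result_learner_param_vals
best_learner$train(task)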

Plotting the performance against the number of features reveals the possible trade-off between sparsity and predictive performance:

arx = inst$archive$data()
ggplot(arx, aes(x = mim.filter.nfeat, y = classif.ce)) + geom_line()

Stacking

We want to build a model that is based on the predictions of other Learners. This means that we need these predictions already during training. PipeOpLearnerCV handles this very specific case: it performs cross-validation during the training phase and returns the cross-validated predictions as new features. We use "prob" predictions because they carry more information than "response" predictions:

stackgraph = po("imputeoor") %>>%
  gunion(list(
    po("learner_cv", lrn("classif.ranger", predict_type = "prob")),
    po("learner_cv", lrn("classif.kknn", predict_type = "prob")))) %>>%
  po("featureunion") %>>% lrn("classif.log_reg")

The Graph we have built is already pretty complex, so we plot it:

stackgraph$plot()

We now compare the performance of the stacked learner to the performance of the individual Learners:

bmr = benchmark(data.table(
  task = list(task),
  learner = list(
    stackgraph,
    GraphLearner$new(po("imputeoor") %>>% lrn("classif.ranger")),
    GraphLearner$new(po("imputeoor") %>>% lrn("classif.kknn")),
    GraphLearner$new(po("imputeoor") %>>% lrn("classif.log_reg"))),
  resampling = list(cv10_instance)))
bmr$aggregate()[, c("learner_id", "classif.ce")]
                                                           learner_id
1: imputeoor.classif.ranger.classif.kknn.featureunion.classif.log_reg
2:                                           imputeoor.classif.ranger
3:                                             imputeoor.classif.kknn
4:                                          imputeoor.classif.log_reg
   classif.ce
1:      0.250
2:      0.259
3:      0.323
4:      0.284

If we train the stacked learner and inspect the final Learner (the logistic regression), we can see how "important" each base Learner is:

stackgraph$train(task)
$classif.log_reg.output
NULL
summary(stackgraph$pipeops$classif.log_reg$state$model)

Call:
stats::glm(formula = task$formula(), family = "binomial", data = task$data(), 
    model = FALSE)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.6734  -0.8102  -0.5686   0.9771   2.2354  

Coefficients: (2 not defined because of singularities)
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)                2.6830     0.3174   8.452   <2e-16 ***
classif.ranger.prob.good  -4.7986     0.5227  -9.180   <2e-16 ***
classif.ranger.prob.bad        NA         NA      NA       NA    
classif.kknn.prob.good    -0.5390     0.3190  -1.690   0.0911 .  
classif.kknn.prob.bad          NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1221.7  on 999  degrees of freedom
Residual deviance: 1076.1  on 997  degrees of freedom
AIC: 1082.1

Number of Fisher Scoring iterations: 4

The random forest has the higher contribution, as indicated by the much larger absolute coefficient of its predicted probability.

Robustify: Preventing new Prediction Factor Levels and other Problems

We now shift the context, using the complete German credit dataset:

task = tsk("german_credit")

There is a potential practical problem for both small data sets and data sets with covariates that have many factor levels: it may happen that not all factor levels occur in the training data, simply because rare levels are never sampled. Prediction may then fail because the Learner does not know how to handle unseen factor levels:

task_unseen = task$clone()$filter(1:30)
logreg = lrn("classif.log_reg")
logreg$train(task_unseen)
logreg$predict(task)
Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor job has new levels unemployed/unskilled - non-resident

Not only logistic regression but many other Learners cannot handle new levels during prediction. We therefore use PipeOpFixFactors to prevent this. PipeOpFixFactors introduces NA values for unseen levels, which means we may need to impute afterwards. For this we can use PipeOpImputeSample, with affect_columns set so that only factor features are imputed.

Another potential problem is all-constant features:

task_constant = task$clone()$filter(1:2)
logreg = lrn("classif.log_reg")
logreg$train(task_constant)
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]): contrasts can be applied only to factors with 2 or more levels

This can be fixed using PipeOpRemoveConstants.
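
A quick sketch showing the effect: applying the PipeOp to the pathological two-row Task drops the offending constant columns:

# only the non-constant features survive
po("removeconstants")$train(list(task_constant))[[1]]$feature_names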

Both problems, unseen factor levels and constant features, can be handled simultaneously using the following Graph:

robustify = po("fixfactors") %>>%
  po("removeconstants") %>>%
  po("imputesample", affect_columns = selector_type(c("ordered", "factor")))

robustify$plot()

This robust learner works even in very pathological conditions:

roblogreg = GraphLearner$new(robustify %>>% logreg)

roblogreg$train(task_constant)
roblogreg$predict(task)
<PredictionClassif> for 1000 observations:
    row_id truth response
         1  good     good
         2   bad      bad
         3  good     good
---                      
       998  good      bad
       999   bad      bad
      1000  good      bad

Your Ideas

There are many more possibilities for preprocessing with PipeOps. Feel free to try different combinations of preprocessing steps and Learners and discover this variety by yourself! One idea to get you started is sketched below.
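
For example, going back to the imputation part above, one could replace the out-of-range imputation by model-based imputation. A sketch, imputing factor columns with a classification tree fitted on the remaining features:

# PipeOpImputeLearner imputes the columns matching the learner's task type;
# rpart itself can handle missing values in its own training data
imp_tree = po("imputelearner", lrn("classif.rpart"))
imp_ranger2 = GraphLearner$new(imp_tree %>>% lrn("classif.ranger"))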

Citation

For attribution, please cite this work as

Binder & Pfisterer (2020, March 11). mlr3gallery: mlr3pipelines Tutorial - German Credit. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-03-11-mlr3pipelines-tutorial-german-credit/

BibTeX citation

@misc{binder2020mlr3pipelines,
  author = {Binder, Martin and Pfisterer, Florian},
  title = {mlr3gallery: mlr3pipelines Tutorial - German Credit},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-03-11-mlr3pipelines-tutorial-german-credit/},
  year = {2020}
}