Tuning Over Multiple Learners

This use case shows how to tune over multiple learners for a single task.

Jakob Richter, Bernd Bischl
02-01-2020

This use case shows how to tune over multiple learners for a single task and, in a second step, how to jointly tune the choice of learner and its hyperparameters.

This is an advanced use case: basic familiarity with mlr3, mlr3tuning, and mlr3pipelines is assumed.

The Setup

Assume you are given some ML task and want to compare a couple of learners, probably because you want to select the best of them at the end of the analysis. That is a very standard scenario; it actually sounds so common that you might wonder why this needs an (advanced) blog post, and why it involves pipelines. We will consider 2 cases: (a) running the learners with their defaults, so without tuning, and (b) with tuning.

Let’s load some packages and define our learners.


set.seed(1)
library(mlr3)
library(mlr3tuning)
library(mlr3pipelines)
library(mlr3learners)
library(paradox)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")

learns = list(
  lrn("classif.xgboost", id = "xgb"),
  lrn("classif.ranger", id = "rf")
)
learns_ids = sapply(learns, function(x) x$id)

task = tsk("sonar") # some random data for this demo
cv1 = rsmp("cv", folds = 2) # inner loop for nested CV
cv2 = rsmp("cv", folds = 5) # outer loop for nested CV

Default Parameters

The Benchmark-Table Approach

Assume we don’t want to perform tuning and are fine with running all learners with their respective default hyperparameters. We simply run benchmark() on the learners and the task. That tabulates our results nicely and shows us what works best.


bg = benchmark_grid(task, learns, cv2)
b = benchmark(bg)
b$aggregate(measures = msr("classif.ce"))

   nr      resample_result task_id learner_id resampling_id iters classif.ce
1:  1 <ResampleResult[18]>   sonar        xgb            cv     5  0.2743322
2:  2 <ResampleResult[18]>   sonar         rf            cv     5  0.1728223

The Pipelines Approach

Ok, why would we ever want to change the simple approach above - and use pipelines / tuning for this? Three reasons:

  1. What we are doing with benchmark is actually statistically flawed, insofar as we report the error of the numerically best method from the benchmark table as its estimated future performance. If we do that, we have “optimized on the CV” (we basically ran a grid search over our learners!) and we know that this will produce optimistically biased results. NB: This is a somewhat exaggerated criticism if we are going over only a handful of options, and the bias will be very small. But it will be noticeable if we do this over hundreds of learners, so it is important to understand the underlying problem. This is a somewhat subtle point, and this gallery post is more about technical hints for mlr3, so we will stop this discussion here.
  2. For some tuning algorithms, you might have a chance to select from the set of algorithms more efficiently than by running the full benchmark. Because of the categorical nature of the problem, you will not be able to learn things like “If learner A works badly, I don’t have to try learner B”, but you can potentially save some resampling iterations. Assume you have to select from 100 candidates, experiments are expensive, and you use a 20-fold CV. If learner A has very bad results in the first 5 folds of the CV, you might already want to stop there. “Racing” would be such a tuning algorithm.
  3. It foreshadows what comes later in this post, where we also tune the learners’ hyperparameters.

The pipeline has just a single purpose in this example: it should allow us to switch between different learners, depending on a hyperparameter. The pipe consists of three elements: a branching node (po("branch")) that routes the data to exactly one of its named output channels, the learners themselves (combined in parallel via gunion()), and an unbranching node (po("unbranch")) that collects the result from whichever branch was active.


pipe =
  po("branch", options = learns_ids) %>>%
  gunion(lapply(learns, po)) %>>%
  po("unbranch")
pipe$plot()

The pipeline now has quite a lot of available hyperparameters: it includes all hyperparameters of all contained learners. As we don’t tune them here (yet), we don’t care about them (yet). The first hyperparameter, however, is special: branch.selection controls through which (named) branching channel our data flows (see the short sketch after the output below for how to set it manually).


pipe$param_set$ids()

 [1] "branch.selection"                "xgb.booster"                    
 [3] "xgb.watchlist"                   "xgb.eta"                        
 [5] "xgb.gamma"                       "xgb.max_depth"                  
 [7] "xgb.min_child_weight"            "xgb.subsample"                  
 [9] "xgb.colsample_bytree"            "xgb.colsample_bylevel"          
[11] "xgb.colsample_bynode"            "xgb.num_parallel_tree"          
[13] "xgb.lambda"                      "xgb.lambda_bias"                
[15] "xgb.alpha"                       "xgb.objective"                  
[17] "xgb.eval_metric"                 "xgb.base_score"                 
[19] "xgb.max_delta_step"              "xgb.missing"                    
[21] "xgb.monotone_constraints"        "xgb.tweedie_variance_power"     
[23] "xgb.nthread"                     "xgb.nrounds"                    
[25] "xgb.feval"                       "xgb.verbose"                    
[27] "xgb.print_every_n"               "xgb.early_stopping_rounds"      
[29] "xgb.maximize"                    "xgb.sample_type"                
[31] "xgb.normalize_type"              "xgb.rate_drop"                  
[33] "xgb.skip_drop"                   "xgb.one_drop"                   
[35] "xgb.tree_method"                 "xgb.grow_policy"                
[37] "xgb.max_leaves"                  "xgb.max_bin"                    
[39] "xgb.callbacks"                   "xgb.sketch_eps"                 
[41] "xgb.scale_pos_weight"            "xgb.updater"                    
[43] "xgb.refresh_leaf"                "xgb.feature_selector"           
[45] "xgb.top_k"                       "xgb.predictor"                  
[47] "xgb.save_period"                 "xgb.save_name"                  
[49] "xgb.xgb_model"                   "xgb.interaction_constraints"    
[51] "xgb.outputmargin"                "xgb.ntreelimit"                 
[53] "xgb.predleaf"                    "xgb.predcontrib"                
[55] "xgb.approxcontrib"               "xgb.predinteraction"            
[57] "xgb.reshape"                     "xgb.training"                   
[59] "rf.num.trees"                    "rf.mtry"                        
[61] "rf.importance"                   "rf.write.forest"                
[63] "rf.min.node.size"                "rf.replace"                     
[65] "rf.sample.fraction"              "rf.class.weights"               
[67] "rf.splitrule"                    "rf.num.random.splits"           
[69] "rf.split.select.weights"         "rf.always.split.variables"      
[71] "rf.respect.unordered.factors"    "rf.scale.permutation.importance"
[73] "rf.keep.inbag"                   "rf.holdout"                     
[75] "rf.num.threads"                  "rf.save.memory"                 
[77] "rf.verbose"                      "rf.oob.error"                   
[79] "rf.max.depth"                    "rf.alpha"                       
[81] "rf.min.prop"                     "rf.regularization.factor"       
[83] "rf.regularization.usedepth"      "rf.seed"                        
[85] "rf.minprop"                      "rf.predict.all"                 
[87] "rf.se.method"                   

pipe$param_set$params$branch.selection

                 id    class lower upper levels        default
1: branch.selection ParamFct    NA    NA xgb,rf <NoDefault[3]>
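
To build intuition for this parameter, here is a small sketch (not part of the original analysis) that routes the data through one branch manually, trains the graph as a learner, and predicts on the task. All other hyperparameters stay at their defaults.


glrn_demo = GraphLearner$new(pipe)                   # wrap the graph as a learner
glrn_demo$param_set$values$branch.selection = "rf"   # activate the ranger branch
glrn_demo$train(task)                                # only the selected branch is trained
head(glrn_demo$predict(task)$response)               # predictions of the trained graph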

We can now tune over this pipeline, and running a grid search seems a good idea to “touch” every available learner. NB: We have now written down, in much more complicated code, what we did before with benchmark.


glrn = GraphLearner$new(pipe, id = "g") # connect pipe to mlr3
ps = ParamSet$new(list(
  ParamFct$new("branch.selection", levels = c("rf", "xgb"))
))
instance = TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn,
  resampling = cv1,
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("none")
)
tuner = tnr("grid_search")
tuner$optimize(instance)

   branch.selection learner_param_vals  x_domain classif.ce
1:               rf          <list[3]> <list[1]>  0.2067308

instance$archive$data("x_domain")

   branch.selection classif.ce      resample_result           timestamp
1:              xgb  0.2932692 <ResampleResult[18]> 2020-08-08 04:44:36
2:               rf  0.2067308 <ResampleResult[18]> 2020-08-08 04:44:37
   batch_nr x_domain_branch.selection
1:        1                       xgb
2:        2                        rf

instance$result

   branch.selection learner_param_vals  x_domain classif.ce
1:               rf          <list[3]> <list[1]>  0.2067308
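
As a side note, here is a hedged sketch of how one could refit the selected configuration on the full task afterwards, using the tuning instance’s result (glrn and task as defined above; result_learner_param_vals holds the parameter values of the best configuration):


glrn$param_set$values = instance$result_learner_param_vals  # set the best configuration
glrn$train(task)  # final model, trained on the complete task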

But: via this approach we can now get unbiased performance results through nested resampling with the AutoTuner (which would make much more sense if we were selecting from 100 models and not 2).


at = AutoTuner$new(
  learner = glrn,
  resampling = cv1,
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("none"),
  tuner = tuner
)
rr = resample(task, at, cv2, store_models = TRUE)
# access 1st inner tuning result
ll = rr$data$learner[[1]]
ll$tuning_result

   branch.selection learner_param_vals  x_domain classif.ce
1:               rf          <list[3]> <list[1]>   0.253012

ll$archive$data("x_domain")

   branch.selection classif.ce      resample_result           timestamp
1:              xgb  0.3253012 <ResampleResult[18]> 2020-08-08 04:44:39
2:               rf  0.2530120 <ResampleResult[18]> 2020-08-08 04:44:40
   batch_nr x_domain_branch.selection
1:        1                       xgb
2:        2                        rf
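
To actually extract the unbiased performance estimate mentioned above, we aggregate the outer resampling result; a short sketch using the standard ResampleResult methods:


rr$aggregate(msr("classif.ce"))  # unbiased estimate of the whole selection procedure
rr$score(msr("classif.ce"))      # per-fold scores of the outer CV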

Model-Selection and Tuning with Pipelines

Now let’s select from our given set of models and tune their hyperparameters at the same time. One way to do this is to define a search space for each individual learner, wrap each learner with an AutoTuner, and then call benchmark() on them. As this is pretty standard, we will skip it here (a rough sketch follows below) and show an even neater option, where you tune over models and hyperparameters in one go. If you have quite a large space of potential learners and combine this with an efficient tuning algorithm, this can save quite some time, as you can learn during optimization which options work best and focus on them. NB: Many AutoML systems work in a very similar way.
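
For completeness, here is a rough sketch of the skipped “standard” approach: one AutoTuner per learner with its own small search space, benchmarked against each other. The search spaces and budgets below are illustrative assumptions, not recommendations.


at_rf = AutoTuner$new(
  learner = lrn("classif.ranger", id = "rf"),
  resampling = cv1,
  measure = msr("classif.ce"),
  search_space = ParamSet$new(list(ParamInt$new("mtry", lower = 1L, upper = 20L))),
  terminator = trm("evals", n_evals = 10),
  tuner = tnr("random_search")
)
at_xgb = AutoTuner$new(
  learner = lrn("classif.xgboost", id = "xgb"),
  resampling = cv1,
  measure = msr("classif.ce"),
  search_space = ParamSet$new(list(ParamInt$new("nrounds", lower = 1L, upper = 500L))),
  terminator = trm("evals", n_evals = 10),
  tuner = tnr("random_search")
)
bg2 = benchmark_grid(task, list(at_rf, at_xgb), cv2)
# b2 = benchmark(bg2)  # runs one nested CV per AutoTuner, so this takes a while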

Define the Search Space

Remember that the pipeline contains the joint set of all contained hyperparameters, each prefixed with the respective PipeOp ID to make the names unique.


as.data.table(pipe$param_set)[,1:4]

                                 id    class lower upper
 1:                branch.selection ParamFct    NA    NA
 2:                     xgb.booster ParamFct    NA    NA
 3:                   xgb.watchlist ParamUty    NA    NA
 4:                         xgb.eta ParamDbl     0     1
 5:                       xgb.gamma ParamDbl     0   Inf
 6:                   xgb.max_depth ParamInt     0   Inf
 7:            xgb.min_child_weight ParamDbl     0   Inf
 8:                   xgb.subsample ParamDbl     0     1
 9:            xgb.colsample_bytree ParamDbl     0     1
10:           xgb.colsample_bylevel ParamDbl     0     1
11:            xgb.colsample_bynode ParamDbl     0     1
12:           xgb.num_parallel_tree ParamInt     1   Inf
13:                      xgb.lambda ParamDbl     0   Inf
14:                 xgb.lambda_bias ParamDbl     0   Inf
15:                       xgb.alpha ParamDbl     0   Inf
16:                   xgb.objective ParamUty    NA    NA
17:                 xgb.eval_metric ParamUty    NA    NA
18:                  xgb.base_score ParamDbl  -Inf   Inf
19:              xgb.max_delta_step ParamDbl     0   Inf
20:                     xgb.missing ParamDbl  -Inf   Inf
21:        xgb.monotone_constraints ParamInt    -1     1
22:      xgb.tweedie_variance_power ParamDbl     1     2
23:                     xgb.nthread ParamInt     1   Inf
24:                     xgb.nrounds ParamInt     1   Inf
25:                       xgb.feval ParamUty    NA    NA
26:                     xgb.verbose ParamInt     0     2
27:               xgb.print_every_n ParamInt     1   Inf
28:       xgb.early_stopping_rounds ParamInt     1   Inf
29:                    xgb.maximize ParamLgl    NA    NA
30:                 xgb.sample_type ParamFct    NA    NA
31:              xgb.normalize_type ParamFct    NA    NA
32:                   xgb.rate_drop ParamDbl     0     1
33:                   xgb.skip_drop ParamDbl     0     1
34:                    xgb.one_drop ParamLgl    NA    NA
35:                 xgb.tree_method ParamFct    NA    NA
36:                 xgb.grow_policy ParamFct    NA    NA
37:                  xgb.max_leaves ParamInt     0   Inf
38:                     xgb.max_bin ParamInt     2   Inf
39:                   xgb.callbacks ParamUty    NA    NA
40:                  xgb.sketch_eps ParamDbl     0     1
41:            xgb.scale_pos_weight ParamDbl  -Inf   Inf
42:                     xgb.updater ParamUty    NA    NA
43:                xgb.refresh_leaf ParamLgl    NA    NA
44:            xgb.feature_selector ParamFct    NA    NA
45:                       xgb.top_k ParamInt     0   Inf
46:                   xgb.predictor ParamFct    NA    NA
47:                 xgb.save_period ParamInt     0   Inf
48:                   xgb.save_name ParamUty    NA    NA
49:                   xgb.xgb_model ParamUty    NA    NA
50:     xgb.interaction_constraints ParamUty    NA    NA
51:                xgb.outputmargin ParamLgl    NA    NA
52:                  xgb.ntreelimit ParamInt     1   Inf
53:                    xgb.predleaf ParamLgl    NA    NA
54:                 xgb.predcontrib ParamLgl    NA    NA
55:               xgb.approxcontrib ParamLgl    NA    NA
56:             xgb.predinteraction ParamLgl    NA    NA
57:                     xgb.reshape ParamLgl    NA    NA
58:                    xgb.training ParamLgl    NA    NA
59:                    rf.num.trees ParamInt     1   Inf
60:                         rf.mtry ParamInt     1   Inf
61:                   rf.importance ParamFct    NA    NA
62:                 rf.write.forest ParamLgl    NA    NA
63:                rf.min.node.size ParamInt     1   Inf
64:                      rf.replace ParamLgl    NA    NA
65:              rf.sample.fraction ParamDbl     0     1
66:                rf.class.weights ParamDbl  -Inf   Inf
67:                    rf.splitrule ParamFct    NA    NA
68:            rf.num.random.splits ParamInt     1   Inf
69:         rf.split.select.weights ParamDbl     0     1
70:       rf.always.split.variables ParamUty    NA    NA
71:    rf.respect.unordered.factors ParamFct    NA    NA
72: rf.scale.permutation.importance ParamLgl    NA    NA
73:                   rf.keep.inbag ParamLgl    NA    NA
74:                      rf.holdout ParamLgl    NA    NA
75:                  rf.num.threads ParamInt     1   Inf
76:                  rf.save.memory ParamLgl    NA    NA
77:                      rf.verbose ParamLgl    NA    NA
78:                    rf.oob.error ParamLgl    NA    NA
79:                    rf.max.depth ParamInt  -Inf   Inf
80:                        rf.alpha ParamDbl  -Inf   Inf
81:                     rf.min.prop ParamDbl  -Inf   Inf
82:        rf.regularization.factor ParamUty    NA    NA
83:      rf.regularization.usedepth ParamLgl    NA    NA
84:                         rf.seed ParamInt  -Inf   Inf
85:                      rf.minprop ParamDbl  -Inf   Inf
86:                  rf.predict.all ParamLgl    NA    NA
87:                    rf.se.method ParamFct    NA    NA
                                 id    class lower upper

We decide to tune the mtry parameter of the random forest and the nrounds parameter of xgboost. Additionally, we tune the branching parameter that selects our learner.

We also have to reflect the hierarchical structure of the parameter set via dependencies (admittedly, this is somewhat inconvenient): we can only set the mtry value if the pipe is configured to use the random forest (ranger), and the same applies to the xgboost parameter.


ps = ParamSet$new(list(
  ParamFct$new("branch.selection", levels = c("rf", "xgb")),
  # more complicated, but programmatic way for the above:
  # pipe$param_set$params$branch.selection$clone()
  ParamInt$new("rf.mtry", lower = 1L, upper = 20L),
  ParamInt$new("xgb.nrounds", lower = 1, upper = 500)
))

# FIXME this seems pretty inconvenient
ps$add_dep("rf.mtry", "branch.selection", CondEqual$new("rf"))
ps$add_dep("xgb.nrounds", "branch.selection", CondEqual$new("xgb"))

The code is very similar to before; we just swap out the search space and now use random search.


instance = TuningInstanceSingleCrit$new(
  task = task,
  learner = glrn,
  resampling = cv1,
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("evals", n_evals = 10)
)
tuner = tnr("random_search")
tuner$optimize(instance)

   branch.selection rf.mtry xgb.nrounds learner_param_vals  x_domain classif.ce
1:               rf       1          NA          <list[4]> <list[2]>  0.1682692

instance$archive$data(unnest = "x_domain")

    branch.selection rf.mtry xgb.nrounds classif.ce      resample_result
 1:               rf      13          NA  0.1778846 <ResampleResult[18]>
 2:              xgb      NA         371  0.2019231 <ResampleResult[18]>
 3:               rf       1          NA  0.1682692 <ResampleResult[18]>
 4:              xgb      NA         337  0.2019231 <ResampleResult[18]>
 5:              xgb      NA         354  0.2019231 <ResampleResult[18]>
 6:              xgb      NA         339  0.2019231 <ResampleResult[18]>
 7:               rf      15          NA  0.1826923 <ResampleResult[18]>
 8:               rf       9          NA  0.1682692 <ResampleResult[18]>
 9:              xgb      NA         294  0.2019231 <ResampleResult[18]>
10:               rf      12          NA  0.1826923 <ResampleResult[18]>
              timestamp batch_nr x_domain_branch.selection x_domain_rf.mtry
 1: 2020-08-08 04:44:59        1                        rf               13
 2: 2020-08-08 04:45:00        2                       xgb               NA
 3: 2020-08-08 04:45:01        3                        rf                1
 4: 2020-08-08 04:45:03        4                       xgb               NA
 5: 2020-08-08 04:45:04        5                       xgb               NA
 6: 2020-08-08 04:45:06        6                       xgb               NA
 7: 2020-08-08 04:45:07        7                        rf               15
 8: 2020-08-08 04:45:08        8                        rf                9
 9: 2020-08-08 04:45:10        9                       xgb               NA
10: 2020-08-08 04:45:11       10                        rf               12
    x_domain_xgb.nrounds
 1:                   NA
 2:                  371
 3:                   NA
 4:                  337
 5:                  354
 6:                  339
 7:                   NA
 8:                   NA
 9:                  294
10:                   NA

instance$result

   branch.selection rf.mtry xgb.nrounds learner_param_vals  x_domain classif.ce
1:               rf       1          NA          <list[4]> <list[2]>  0.1682692

The following shows a quick way to visualize the tuning results.


resdf = instance$archive$data(unnest = "x_domain")
resdf = reshape(resdf,
  varying = c("xgb.nrounds", "rf.mtry"),
  v.names = "param_value",
  timevar = "param",
  times = c("xgb.nrounds", "rf.mtry"),
  direction = "long")
library(ggplot2)
g = ggplot(resdf, aes(x = param_value, y = classif.ce))
g = g + geom_point()
g = g + facet_grid(~param, scales = "free")
g

Nested resampling, which is now really needed since we actually tuned hyperparameters:


at = AutoTuner$new(
  learner = glrn,
  resampling = cv1,
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("evals", n_evals = 10),
  tuner = tuner
)
rr = resample(task, at, cv2, store_models = TRUE)
# access 1st inner tuning result
ll = rr$data$learner[[1]]
ll$tuning_result

   branch.selection rf.mtry xgb.nrounds learner_param_vals  x_domain classif.ce
1:              xgb      NA         178          <list[3]> <list[2]>  0.2048193

ll$archive$data(unnest = "x_domain")

    branch.selection rf.mtry xgb.nrounds classif.ce      resample_result
 1:              xgb      NA         178  0.2048193 <ResampleResult[18]>
 2:              xgb      NA         106  0.2048193 <ResampleResult[18]>
 3:               rf      17          NA  0.2168675 <ResampleResult[18]>
 4:              xgb      NA         428  0.2048193 <ResampleResult[18]>
 5:               rf      11          NA  0.2048193 <ResampleResult[18]>
 6:              xgb      NA         193  0.2048193 <ResampleResult[18]>
 7:               rf      10          NA  0.2228916 <ResampleResult[18]>
 8:              xgb      NA         137  0.2048193 <ResampleResult[18]>
 9:               rf       8          NA  0.2228916 <ResampleResult[18]>
10:              xgb      NA         332  0.2048193 <ResampleResult[18]>
              timestamp batch_nr x_domain_branch.selection x_domain_xgb.nrounds
 1: 2020-08-08 04:45:13        1                       xgb                  178
 2: 2020-08-08 04:45:14        2                       xgb                  106
 3: 2020-08-08 04:45:16        3                        rf                   NA
 4: 2020-08-08 04:45:17        4                       xgb                  428
 5: 2020-08-08 04:45:18        5                        rf                   NA
 6: 2020-08-08 04:45:20        6                       xgb                  193
 7: 2020-08-08 04:45:21        7                        rf                   NA
 8: 2020-08-08 04:45:22        8                       xgb                  137
 9: 2020-08-08 04:45:23        9                        rf                   NA
10: 2020-08-08 04:45:25       10                       xgb                  332
    x_domain_rf.mtry
 1:               NA
 2:               NA
 3:               17
 4:               NA
 5:               11
 6:               NA
 7:               10
 8:               NA
 9:                8
10:               NA
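
To inspect more than just the first fold, here is a hedged sketch that collects the inner tuning results of all outer folds into one table, using the same rr$data$learner access as above:


library(data.table)
# one row per outer fold: the best configuration found by the inner tuning
inner = lapply(rr$data$learner, function(l) l$tuning_result)
rbindlist(inner, fill = TRUE)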

Citation

For attribution, please cite this work as

Richter & Bischl (2020, Feb. 1). mlr3gallery: Tuning Over Multiple Learners. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-02-01-tuning-multiplexer/

BibTeX citation

@misc{richter2020tuning,
  author = {Richter, Jakob and Bischl, Bernd},
  title = {mlr3gallery: Tuning Over Multiple Learners},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-02-01-tuning-multiplexer/},
  year = {2020}
}