Threshold Tuning for Classification Tasks

mlr3tuning tuning optimization nested resampling mlr3pipelines pima data set classification

Predicting probabilities in classification tasks allows us to adjust the probability thresholds required for assigning an observation to a certain class. This can lead to improved predictive performance.

Florian Pfisterer
10-14-2020

Intro

Predicting probabilities in classification tasks allows us to adjust the probability thresholds required for assigning an observation to a certain class. This can lead to improved classification performance, especially in cases where we aim to trade off metrics such as the false positive and false negative rates.

This is often done, for example, in ROC analysis. The mlr3book also has a chapter on ROC analysis for the interested reader. This post does not focus on ROC analysis; instead, it covers the general problem of adjusting classification thresholds for arbitrary metrics.

This post assumes some familiarity with mlr3 as well as the mlr3pipelines and mlr3tuning packages, as both are used throughout the post. The mlr3book contains more details on those two packages. This post is a more in-depth version of the article on threshold tuning in the mlr3book.

Prerequisites

Before we start, we have to load all required packages.
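
The package-loading chunk is not reproduced in this copy of the post. A minimal sketch, assuming that the three packages named in the introduction are the ones needed and that they are installed:

```r
# Assumption: these are the packages the examples in this post rely on.
library(mlr3)          # tasks, learners, resampling, measures
library(mlr3pipelines) # po(), %>>%, GraphLearner
library(mlr3tuning)    # AutoTuner, tuners, terminators
```

The paradox package, which is used for defining search spaces, is loaded further below where it is first needed.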

Thresholds: A short intro

To understand thresholds, we briefly showcase the effect of setting different thresholds.

First, we create a learner that predicts probabilities and use it to predict on holdout data, storing the prediction:

learner = lrn("classif.rpart", predict_type = "prob")
rr = resample(tsk("pima"), learner, rsmp("holdout"))
prd = rr$prediction()
prd
<PredictionClassif> for 256 observations:
    row_ids truth response   prob.pos  prob.neg
          4   neg      neg 0.07303371 0.9269663
          6   neg      neg 0.00000000 1.0000000
         12   pos      pos 0.85333333 0.1466667
---                                            
        759   neg      neg 0.07303371 0.9269663
        763   neg      neg 0.06896552 0.9310345
        765   neg      neg 0.07303371 0.9269663

If we now look at the confusion matrix, the off-diagonal elements are the errors made by our model (false positives and false negatives), while the on-diagonal elements are the cases our model predicted correctly.

# Print confusion matrix
prd$confusion
        truth
response pos neg
     pos  56  26
     neg  33 141
# Print False Positives and False Negatives
prd$score(list(msr("classif.fp"), msr("classif.fn")))
classif.fp classif.fn 
        26         33 

By adjusting the classification threshold, in this case the probability required to predict the positive class, we can shift predictions between the positive class (first row) and the negative class (second row), trading false negatives for false positives or vice versa.

# Lower threshold: More positives
prd$set_threshold(0.25)$confusion
        truth
response pos neg
     pos  64  39
     neg  25 128
# Higher threshold: Fewer positives
prd$set_threshold(0.75)$confusion
        truth
response pos neg
     pos  45  16
     neg  44 151
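
To make the mechanics explicit, the same effect can be reproduced by hand in base R. The sketch below uses hypothetical toy probabilities (not the pima predictions) and assumes that an observation is labelled "pos" whenever its probability reaches the threshold; how mlr3 breaks exact ties may differ.

```r
# Hypothetical toy data: predicted probability of "pos" and the true labels.
prob_pos = c(0.90, 0.80, 0.60, 0.40, 0.30, 0.10)
truth = factor(c("pos", "pos", "neg", "pos", "neg", "neg"), levels = c("pos", "neg"))

# Assign "pos" whenever the probability reaches the threshold
# (the handling of exact ties is an assumption of this sketch).
apply_threshold = function(prob_pos, threshold) {
  factor(ifelse(prob_pos >= threshold, "pos", "neg"), levels = c("pos", "neg"))
}

# Confusion matrix with predictions in rows and truth in columns.
confusion = function(response, truth) {
  table(response = response, truth = truth)
}

confusion(apply_threshold(prob_pos, 0.50), truth)
confusion(apply_threshold(prob_pos, 0.25), truth) # more positives
confusion(apply_threshold(prob_pos, 0.75), truth) # fewer positives
```

Lowering the threshold moves observations from the second row to the first, which removes false negatives at the price of additional false positives; raising it does the opposite.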

This threshold value can now be adjusted optimally for a given measure, such as accuracy. How this can be done is discussed in the following section.
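
As a rough illustration of what "adjusted optimally" means, the following base-R sketch evaluates accuracy over a grid of candidate thresholds and keeps the best one. The data are hypothetical, and the grid search merely stands in for the optimizers used later in the post.

```r
# Hypothetical toy predictions (not taken from the post).
toy_prob = c(0.95, 0.85, 0.70, 0.65, 0.45, 0.35, 0.20, 0.05)
toy_truth = c("pos", "pos", "pos", "neg", "pos", "neg", "neg", "neg")

# Accuracy obtained when predicting "pos" for probabilities >= threshold.
accuracy_at = function(threshold) {
  response = ifelse(toy_prob >= threshold, "pos", "neg")
  mean(response == toy_truth)
}

# Evaluate a grid of candidate thresholds and keep the first best one.
grid = (0:20) / 20
acc = vapply(grid, accuracy_at, numeric(1))
best = grid[which.max(acc)]

# Show all thresholds that reach the maximum accuracy.
data.frame(threshold = grid, accuracy = acc)[acc == max(acc), ]
```

Because accuracy is a step function of the threshold, several thresholds can be equally good; which.max() simply returns the first of them.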

Adjusting thresholds: Two strategies

set.seed(20201014)

Currently, mlr3pipelines offers two main strategies for adjusting classification thresholds. The first is to expose the thresholds as a hyperparameter of the Learner by using PipeOpThreshold, which allows us to tune the thresholds via an outside optimizer from mlr3tuning.

Alternatively, we can use PipeOpTuneThreshold, which automatically tunes the threshold after each learner fit.

In this blog post, we’ll go through both strategies.

PipeOpThreshold

PipeOpThreshold can be put directly after a Learner.

A simple example would be:

gr = lrn("classif.rpart", predict_type = "prob") %>>% po("threshold")
l = GraphLearner$new(gr)

Note that predict_type = "prob" is required for po("threshold") to have any effect.

The thresholds are now exposed as a hyperparameter of the GraphLearner we created:

l$param_set
<ParamSetCollection>
                              id    class lower upper nlevels        default
 1:       classif.rpart.minsplit ParamInt     1   Inf     Inf             20
 2:      classif.rpart.minbucket ParamInt     1   Inf     Inf <NoDefault[3]>
 3:             classif.rpart.cp ParamDbl     0     1     Inf           0.01
 4:     classif.rpart.maxcompete ParamInt     0   Inf     Inf              4
 5:   classif.rpart.maxsurrogate ParamInt     0   Inf     Inf              5
 6:       classif.rpart.maxdepth ParamInt     1    30      30             30
 7:   classif.rpart.usesurrogate ParamInt     0     2       3              2
 8: classif.rpart.surrogatestyle ParamInt     0     1       2              0
 9:           classif.rpart.xval ParamInt     0   Inf     Inf             10
10:     classif.rpart.keep_model ParamLgl    NA    NA       2          FALSE
11:         threshold.thresholds ParamUty    NA    NA     Inf <NoDefault[3]>
    value
 1:      
 2:      
 3:      
 4:      
 5:      
 6:      
 7:      
 8:      
 9:     0
10:      
11:   0.5

We can now tune those thresholds from the outside.

Before tuning, we have to define which hyperparameters we want to tune over. In this example, we only tune over the thresholds parameter of the threshold PipeOp. We could also jointly tune over additional hyperparameters, e.g. rpart’s cp parameter.

As the Task we aim to optimize for is a binary task, we can simply specify the threshold parameter as a single number between 0 and 1:

library(paradox)
ps = ParamSet$new(list(
  ParamDbl$new("threshold.thresholds", lower = 0, upper = 1)
))
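
A joint search space of the kind mentioned above could be sketched as follows. This is only an illustration: it uses the ps() and p_dbl() shorthand, which we assume is available in the installed paradox version, and the bounds chosen for cp are arbitrary.

```r
library(paradox)

# Sketch of a joint search space: the threshold plus rpart's complexity
# parameter. The cp bounds are illustrative assumptions, not recommendations.
search_space_joint = ps(
  classif.rpart.cp = p_dbl(lower = 0.001, upper = 0.1),
  threshold.thresholds = p_dbl(lower = 0, upper = 1)
)

# List the parameter ids contained in the search space.
search_space_joint$ids()
```

Such an object could be passed as search_space in place of the threshold-only ParamSet. The remainder of the post continues with the threshold-only ParamSet ps.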

We now create an AutoTuner, which automatically tunes the supplied learner over the ParamSet we supplied above.

at = AutoTuner$new(
  learner = l,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("evals", n_evals = 5L),
  tuner = TunerRandomSearch$new()
)

at$train(tsk("german_credit"))

For multi-class Tasks, this is a little more complicated. We have to transform a set of ParamDbl into the format expected by threshold.thresholds: a named numeric vector containing one threshold per class. This can be easily achieved via a trafo function:

ps = ParamSet$new(list(
  ParamDbl$new("versicolor", lower = 0, upper = 1),
  ParamDbl$new("setosa", lower = 0, upper = 1),
  ParamDbl$new("virginica", lower = 0, upper = 1)
))
ps$trafo = function(x, param_set) {
  list(threshold.thresholds = mlr3misc::map_dbl(x, identity))
}

Inside the trafo, we simply collect all set params into a named vector via map_dbl and store it in the threshold.thresholds slot expected by the learner.
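
To see what the trafo produces and how such a vector affects predictions, here is a small base-R sketch with hypothetical values. The decision rule shown (divide each class probability by its threshold and pick the largest ratio) reflects our understanding of how mlr3 applies multi-class thresholds; treat it as an assumption rather than a specification.

```r
# Hypothetical values as a tuner might propose them: one number per class.
x = list(versicolor = 0.2, setosa = 0.5, virginica = 0.3)

# Base-R equivalent of mlr3misc::map_dbl(x, identity): a named numeric vector.
thresholds = vapply(x, identity, numeric(1))
config = list(threshold.thresholds = thresholds)

# Assumed multi-class rule: divide each class probability by its threshold
# and predict the class with the largest ratio.
prob = c(setosa = 0.40, versicolor = 0.35, virginica = 0.25)
ratio = prob / thresholds[names(prob)]
predicted = names(which.max(ratio))
predicted
```

In this toy case the most probable class is setosa, but the low threshold for versicolor makes versicolor the predicted class.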

Again, we create an AutoTuner, which automatically tunes the supplied learner over the ParamSet we supplied above.

at2 = AutoTuner$new(
  learner = l,
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = ps,
  terminator = trm("evals", n_evals = 5L),
  tuner = TunerRandomSearch$new()
)
at2$train(tsk("iris"))

One drawback of this strategy is that it requires us to fit a new model for each new threshold setting. While setting a threshold and computing performance is relatively cheap, fitting the learner is often more computationally demanding. A better strategy is therefore often to optimize the thresholds separately after each model fit.

PipeOpTuneThreshold

PipeOpTuneThreshold, on the other hand, works together with PipeOpLearnerCV. It optimizes the thresholds directly on the cross-validated predictions made by this PipeOp.
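
The idea can be sketched in base R without any mlr3 machinery. The example below is a simplified stand-in, not the implementation used by mlr3pipelines: the data are simulated, a logistic regression replaces rpart, and a plain grid replaces the optimizer.

```r
set.seed(1)

# Hypothetical simulated binary data with one informative feature.
n = 200
x = rnorm(n)
y = rbinom(n, size = 1, prob = plogis(1.5 * x - 1))
dat = data.frame(x = x, y = y)

# Step 1: 3-fold cross-validated (out-of-fold) probability predictions.
folds = sample(rep(1:3, length.out = n))
oof_prob = numeric(n)
for (k in 1:3) {
  fit = glm(y ~ x, family = binomial(), data = dat[folds != k, ])
  oof_prob[folds == k] = predict(fit, newdata = dat[folds == k, ], type = "response")
}

# Step 2: pick the threshold with the lowest classification error on those
# predictions. A plain grid stands in for the optimizer here.
grid = (1:19) / 20
ce = vapply(grid, function(t) mean((oof_prob >= t) != (dat$y == 1)), numeric(1))
best_threshold = grid[which.min(ce)]

# Step 3: refit on all data; the tuned threshold is applied when predicting.
final_fit = glm(y ~ x, family = binomial(), data = dat)
predict_class = function(newdata) {
  as.integer(predict(final_fit, newdata = newdata, type = "response") >= best_threshold)
}
```

The model is fitted only once per fold, and every candidate threshold is then evaluated on the stored predictions, which is what makes this strategy cheaper than refitting for each threshold.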

A simple example would be:

gr = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>% po("tunethreshold")
l2 = GraphLearner$new(gr)

Note that predict_type = "prob" is required for po("tunethreshold") to have any effect. Additionally, note that this time no threshold parameter is exposed; it is tuned automatically inside the pipeline.

l2$param_set
<ParamSetCollection>
                                        id    class lower upper nlevels
 1:        classif.rpart.resampling.method ParamFct    NA    NA       2
 2:         classif.rpart.resampling.folds ParamInt     2   Inf     Inf
 3: classif.rpart.resampling.keep_response ParamLgl    NA    NA       2
 4:                 classif.rpart.minsplit ParamInt     1   Inf     Inf
 5:                classif.rpart.minbucket ParamInt     1   Inf     Inf
 6:                       classif.rpart.cp ParamDbl     0     1     Inf
 7:               classif.rpart.maxcompete ParamInt     0   Inf     Inf
 8:             classif.rpart.maxsurrogate ParamInt     0   Inf     Inf
 9:                 classif.rpart.maxdepth ParamInt     1    30      30
10:             classif.rpart.usesurrogate ParamInt     0     2       3
11:           classif.rpart.surrogatestyle ParamInt     0     1       2
12:                     classif.rpart.xval ParamInt     0   Inf     Inf
13:               classif.rpart.keep_model ParamLgl    NA    NA       2
14:           classif.rpart.affect_columns ParamUty    NA    NA     Inf
15:                  tunethreshold.measure ParamUty    NA    NA     Inf
16:                tunethreshold.optimizer ParamUty    NA    NA     Inf
17:                tunethreshold.log_level ParamUty    NA    NA     Inf
           default      value
 1: <NoDefault[3]>         cv
 2: <NoDefault[3]>          3
 3: <NoDefault[3]>      FALSE
 4:             20           
 5: <NoDefault[3]>           
 6:           0.01           
 7:              4           
 8:              5           
 9:             30           
10:              2           
11:              0           
12:             10          0
13:          FALSE           
14:  <Selector[1]>           
15: <NoDefault[3]> classif.ce
16: <NoDefault[3]>      gensa
17:  <function[1]>       warn

If we now use the GraphLearner, it automatically adjusts the thresholds during prediction.
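
A minimal sketch of such a use, assuming mlr3 and mlr3pipelines are installed along with the GenSA package needed by the default optimizer. The graph is rebuilt here so that the snippet is self-contained, and the train/test split is an arbitrary choice for illustration.

```r
library(mlr3)
library(mlr3pipelines)

# Same graph as above, rebuilt so that this snippet runs on its own.
gr = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>% po("tunethreshold")
l2 = GraphLearner$new(gr)

# Hypothetical split: two thirds of the pima rows for training, the rest for testing.
set.seed(1)
task = tsk("pima")
train_ids = sample(task$row_ids, size = floor(0.67 * task$nrow))
test_ids = setdiff(task$row_ids, train_ids)

# Training fits the learner and tunes the threshold on cross-validated predictions.
l2$train(task, row_ids = train_ids)

# Prediction applies the tuned threshold without any further user input.
prd = l2$predict(task, row_ids = test_ids)
prd$score(msr("classif.ce"))
```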

Note that we can set ResamplingInsample as a resampling strategy for PipeOpLearnerCV in order to evaluate predictions on the “training” data. This is generally not advised, as it might lead to over-fitting on the thresholds, but it can significantly reduce runtime.
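
One way to switch to in-sample predictions is to set the resampling.method hyperparameter listed in the parameter set above. The sketch below assumes that "insample" is the name of the corresponding level, which matches our reading of the PipeOpLearnerCV documentation.

```r
library(mlr3)
library(mlr3pipelines)

# Rebuild the graph so that this snippet runs on its own.
gr_insample = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>%
  po("tunethreshold")
l_insample = GraphLearner$new(gr_insample)

# Switch the internal resampling from 3-fold CV to in-sample predictions.
# Assumption: "insample" is the level name accepted by resampling.method.
l_insample$param_set$values$classif.rpart.resampling.method = "insample"
l_insample$param_set$values$classif.rpart.resampling.method
```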

Finally, we can compare no threshold tuning to the tunethreshold approach:

Comparison of the approaches

bmr = benchmark(benchmark_grid(
  learners = list(
    no_tuning = lrn("classif.rpart"),
    internal = l2
  ),
  tasks = tsk("german_credit"),
  resamplings = rsmp("cv", folds = 3L)
))
bmr$aggregate(list(msr("classif.ce"), msr("classif.fnr")))
   nr      resample_result       task_id                  learner_id
1:  1 <ResampleResult[21]> german_credit               classif.rpart
2:  2 <ResampleResult[21]> german_credit classif.rpart.tunethreshold
   resampling_id iters classif.ce classif.fnr
1:            cv     3  0.2699796   0.1249980
2:            cv     3  0.2719906   0.0878889

In this run, threshold tuning reduced the false negative rate (0.088 vs. 0.125), while the classification error stayed essentially unchanged (0.272 vs. 0.270).

Citation

For attribution, please cite this work as

Pfisterer (2020, Oct. 14). mlr3gallery: Threshold Tuning for Classification Tasks. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-10-14-threshold-tuning/

BibTeX citation

@misc{pfisterer2020threshold,
  author = {Pfisterer, Florian},
  title = {mlr3gallery: Threshold Tuning for Classification Tasks},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-10-14-threshold-tuning/},
  year = {2020}
}