Predicting probabilities in classification tasks allows us to adjust the probability thresholds required for assigning an observation to a certain class. This can lead to improved classification performance, especially for cases where we e.g. aim to balance off metrics such as false positive and false negative rates.

This is, for example, often done in ROC analysis. The mlr3book also has a chapter on ROC analysis for the interested reader. This post does not focus on ROC analysis, but instead on the general problem of adjusting classification thresholds for arbitrary metrics.

This post assumes some familiarity with mlr3, as well as the mlr3pipelines and mlr3tuning packages, as both are used throughout the post. The mlr3book contains more details on those two packages. This post is a more in-depth version of the article on threshold tuning in the mlr3book.

Before we start, we load all required packages:
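The original snippet does not show the loading code; a sketch covering the packages used later in this post:

```
library(mlr3)          # tasks, learners, resampling
library(mlr3pipelines) # po(), GraphLearner
library(mlr3tuning)    # AutoTuner, tuners, terminators
library(paradox)       # ParamSet for search spaces
```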

In order to understand thresholds, we will quickly showcase the effect of setting different thresholds:

First we create a learner that predicts probabilities and use it to predict on holdout data, storing the prediction.

```
learner = lrn("classif.rpart", predict_type = "prob")
rr = resample(tsk("pima"), learner, rsmp("holdout"))
prd = rr$prediction()
prd
```

```
<PredictionClassif> for 256 observations:
    row_ids truth response   prob.pos  prob.neg
          4   neg      neg 0.07303371 0.9269663
          6   neg      neg 0.00000000 1.0000000
         12   pos      pos 0.85333333 0.1466667
---
        759   neg      neg 0.07303371 0.9269663
        763   neg      neg 0.06896552 0.9310345
        765   neg      neg 0.07303371 0.9269663
```

If we now look at the confusion matrix, the off-diagonal elements are errors made by our model (*false positives* and *false negatives*), while the on-diagonal elements are cases our model predicted correctly.

```
# Print confusion matrix
prd$confusion
```

```
        truth
response pos neg
     pos  56  26
     neg  33 141
```
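The false positive and false negative counts shown next can also be computed directly as mlr3 measures; a sketch using the `classif.fp` and `classif.fn` measure keys:

```
# Score the stored prediction with the false positive / false negative measures
prd$score(msrs(c("classif.fp", "classif.fn")))
```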

```
classif.fp classif.fn
26 33
```

By adjusting the **classification threshold**, in this case the probability required to predict the positive class, we can now trade off predicting more positive cases (first row) against predicting fewer negative cases (second row) or vice versa.

```
# Lower threshold: More positives
prd$set_threshold(0.25)$confusion
```

```
        truth
response pos neg
     pos  64  39
     neg  25 128
```

```
# Higher threshold: Fewer positives
prd$set_threshold(0.75)$confusion
```

```
        truth
response pos neg
     pos  45  16
     neg  44 151
```

This threshold value can now be adjusted optimally for a given measure, such as accuracy. How this can be done is discussed in the following section.

```
set.seed(20201014)
```

Currently, `mlr3pipelines` offers two main strategies for adjusting classification thresholds. We can either expose the thresholds as a hyperparameter of the learner by using `PipeOpThreshold`, which allows us to tune the thresholds via an outside optimizer from mlr3tuning. Alternatively, we can use `PipeOpTuneThreshold`, which automatically tunes the threshold after each learner fit.

In this blog post, we'll go through both strategies.

`PipeOpThreshold` can be put directly after a `Learner`.

A simple example would be:

```
gr = lrn("classif.rpart", predict_type = "prob") %>>% po("threshold")
l = GraphLearner$new(gr)
```

Note that `predict_type = "prob"` is required for `po("threshold")` to have any effect.

The thresholds are now exposed as a hyperparameter of the `GraphLearner` we created:

```
l$param_set
```

```
<ParamSetCollection>
id class lower upper nlevels default
1: classif.rpart.minsplit ParamInt 1 Inf Inf 20
2: classif.rpart.minbucket ParamInt 1 Inf Inf <NoDefault[3]>
3: classif.rpart.cp ParamDbl 0 1 Inf 0.01
4: classif.rpart.maxcompete ParamInt 0 Inf Inf 4
5: classif.rpart.maxsurrogate ParamInt 0 Inf Inf 5
6: classif.rpart.maxdepth ParamInt 1 30 30 30
7: classif.rpart.usesurrogate ParamInt 0 2 3 2
8: classif.rpart.surrogatestyle ParamInt 0 1 2 0
9: classif.rpart.xval ParamInt 0 Inf Inf 10
10: classif.rpart.keep_model ParamLgl NA NA 2 FALSE
11: threshold.thresholds ParamUty NA NA Inf <NoDefault[3]>
value
1:
2:
3:
4:
5:
6:
7:
8:
9: 0
10:
11: 0.5
```

We can now tune those thresholds from the outside as follows:

Before tuning, we have to define which hyperparameters we want to tune over. In this example, we only tune over the `thresholds` parameter of the `threshold` `PipeOp`. You can easily imagine that we could also jointly tune over additional hyperparameters, e.g. rpart's `cp` parameter.

As the `Task` we aim to optimize for is a binary task, we can simply specify the threshold parameter:
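A sketch of such a search space using paradox; the bounds 0 and 1 are the natural range of a probability threshold:

```
# Search space containing only the threshold for the positive class
ps = ParamSet$new(list(
  ParamDbl$new("threshold.thresholds", lower = 0, upper = 1)
))
```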

We now create an `AutoTuner`, which automatically tunes the supplied learner over the `ParamSet` described above.
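A sketch of such an `AutoTuner`; the task, resampling, terminator and random search tuner are illustrative choices, not prescribed by the post:

```
at = AutoTuner$new(
  learner = l,  # the GraphLearner created above
  resampling = rsmp("cv", folds = 3L),
  measure = msr("classif.ce"),
  search_space = ParamSet$new(list(
    ParamDbl$new("threshold.thresholds", lower = 0, upper = 1)
  )),
  terminator = trm("evals", n_evals = 5L),
  tuner = tnr("random_search")
)
at$train(tsk("pima"))
```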

For multi-class `Tasks`, this is a little more complicated. We have to use a `trafo` to transform a set of `ParamDbl`s into the desired format for `threshold.thresholds`: a named numeric vector containing the thresholds. This can be easily achieved via a `trafo` function:
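A sketch for a hypothetical three-class task with classes "a", "b" and "c"; the parameter ids must match the task's class labels:

```
# One ParamDbl per class; the trafo packs them into a named numeric vector
ps = ParamSet$new(list(
  ParamDbl$new("a", lower = 0, upper = 1),
  ParamDbl$new("b", lower = 0, upper = 1),
  ParamDbl$new("c", lower = 0, upper = 1)
))
ps$trafo = function(x, param_set) {
  list(threshold.thresholds = mlr3misc::map_dbl(x, identity))
}
```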

Inside the `trafo`, we simply collect all set params into a named vector via `map_dbl` and store it in the `threshold.thresholds` slot expected by the learner.

Again, we create an `AutoTuner`, which automatically tunes the supplied learner over this `ParamSet`.

One drawback of this strategy is that it requires us to fit a new model for each threshold setting. While setting a threshold and computing the resulting performance is relatively cheap, fitting the learner is often far more computationally demanding. A better strategy is therefore often to optimize the thresholds separately after each model fit.

`PipeOpTuneThreshold`, on the other hand, works together with `PipeOpLearnerCV`: it directly optimizes the cross-validated predictions made by this `PipeOp`.

A simple example would be:

```
gr = po("learner_cv", lrn("classif.rpart", predict_type = "prob")) %>>% po("tunethreshold")
l2 = GraphLearner$new(gr)
```

Note that `predict_type = "prob"` is required for `po("tunethreshold")` to have any effect. Additionally, note that this time no `threshold` parameter is exposed; it is tuned automatically and internally.

```
l2$param_set
```

```
<ParamSetCollection>
id class lower upper nlevels
1: classif.rpart.resampling.method ParamFct NA NA 2
2: classif.rpart.resampling.folds ParamInt 2 Inf Inf
3: classif.rpart.resampling.keep_response ParamLgl NA NA 2
4: classif.rpart.minsplit ParamInt 1 Inf Inf
5: classif.rpart.minbucket ParamInt 1 Inf Inf
6: classif.rpart.cp ParamDbl 0 1 Inf
7: classif.rpart.maxcompete ParamInt 0 Inf Inf
8: classif.rpart.maxsurrogate ParamInt 0 Inf Inf
9: classif.rpart.maxdepth ParamInt 1 30 30
10: classif.rpart.usesurrogate ParamInt 0 2 3
11: classif.rpart.surrogatestyle ParamInt 0 1 2
12: classif.rpart.xval ParamInt 0 Inf Inf
13: classif.rpart.keep_model ParamLgl NA NA 2
14: classif.rpart.affect_columns ParamUty NA NA Inf
15: tunethreshold.measure ParamUty NA NA Inf
16: tunethreshold.optimizer ParamUty NA NA Inf
17: tunethreshold.log_level ParamUty NA NA Inf
default value
1: <NoDefault[3]> cv
2: <NoDefault[3]> 3
3: <NoDefault[3]> FALSE
4: 20
5: <NoDefault[3]>
6: 0.01
7: 4
8: 5
9: 30
10: 2
11: 0
12: 10 0
13: FALSE
14: <Selector[1]>
15: <NoDefault[3]> classif.ce
16: <NoDefault[3]> gensa
17: <function[1]> warn
```

If we now use the `GraphLearner`, it automatically adjusts the `thresholds` during prediction.
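A minimal usage sketch, training on the full task and predicting on it again purely for illustration:

```
# Thresholds are tuned internally during train(); prediction uses them
l2$train(tsk("pima"))
prd2 = l2$predict(tsk("pima"))
prd2$confusion
```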

Note that we can set `ResamplingInsample` as a resampling strategy for `PipeOpLearnerCV` in order to evaluate predictions on the "training" data. This is generally not advised, as it might lead to over-fitting on the thresholds, but it can significantly reduce runtime.
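A sketch of this variant, using the `resampling.method` parameter of `po("learner_cv")` visible in the parameter set printed above:

```
# Insample resampling: thresholds are tuned on predictions for the training data
gr_insample = po("learner_cv", lrn("classif.rpart", predict_type = "prob"),
  resampling.method = "insample") %>>% po("tunethreshold")
l3 = GraphLearner$new(gr_insample)
```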

Finally, we can compare no threshold tuning to the `tunethreshold` approach:

```
bmr = benchmark(benchmark_grid(
  learners = list(
    no_tuning = lrn("classif.rpart"),
    internal = l2
  ),
  tasks = tsk("german_credit"),
  resamplings = rsmp("cv", folds = 3L)
))
bmr$aggregate(list(msr("classif.ce"), msr("classif.fnr")))
```

```
   nr      resample_result       task_id                  learner_id
1:  1 <ResampleResult[21]> german_credit               classif.rpart
2:  2 <ResampleResult[21]> german_credit classif.rpart.tunethreshold
   resampling_id iters classif.ce classif.fnr
1:            cv     3  0.2699796   0.1249980
2:            cv     3  0.2719906   0.0878889
```

We obtained a comparable classification error and a considerably lower false negative rate!

For attribution, please cite this work as

Pfisterer (2020, Oct. 14). mlr3gallery: Threshold Tuning for Classification Tasks. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-10-14-threshold-tuning/

BibTeX citation

@misc{pfisterer2020threshold, author = {Pfisterer, Florian}, title = {mlr3gallery: Threshold Tuning for Classification Tasks}, url = {https://mlr3gallery.mlr-org.com/posts/2020-10-14-threshold-tuning/}, year = {2020} }