We demonstrate how to optimize the hyperparameters of a support vector machine (SVM).

This is the first part of the practical tuning series. The other parts can be found here:

- Part II - Tune a Preprocessing Pipeline
- Part III - Build an Automated Machine Learning System
- Part IV - Tuning and Parallel Processing

In this post, we demonstrate how to optimize the hyperparameters of a support vector machine (SVM). We are using the mlr3 machine learning framework with the mlr3tuning extension package.

First, we start by showing the basic building blocks of mlr3tuning and tune the `cost`

and `gamma`

hyperparameters of an SVM with a radial basis function on the Iris data set. After that, we use transformations to tune the `cost`

hyperparameter on the logarithmic scale. Next, we explain the importance of dependencies to tune hyperparameters like `degree`

which are dependent on the choice of kernel. After that, we fit an SVM with optimized hyperparameters on the full dataset. Finally, nested resampling is used to compute an unbiased performance estimate of our tuned SVM.

We load the mlr3verse package which pulls in the most important packages for this example.

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output clearly represented. The `lgr`

package is used for logging in all mlr3 packages. The mlr3 logger prints the logging messages from the base package, whereas the bbotk logger is responsible for logging messages from the optimization packages (e.g. mlr3tuning ).

```
set.seed(7832)
lgr::get_logger("mlr3")$set_threshold("warn")
lgr::get_logger("bbotk")$set_threshold("warn")
```

In the example, we use the Iris data set which classifies 150 flowers in three species of Iris. The flowers are characterized by sepal length and width and petal length and width. The Iris data set allows us to quickly fit models to it. However, the influence of hyperparameter tuning on the predictive performance might be minor. Other data sets might give more meaningful tuning results.

```
# retrieve the task from mlr3
task = tsk("iris")
# generate a quick textual overview using the skimr package
skimr::skim(task$data())
```

Name | task$data() |

Number of rows | 150 |

Number of columns | 5 |

Key | NULL |

_______________________ | |

Column type frequency: | |

factor | 1 |

numeric | 4 |

________________________ | |

Group variables | None |

**Variable type: factor**

skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|

Species | 0 | 1 | FALSE | 3 | set: 50, ver: 50, vir: 50 |

**Variable type: numeric**

skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|

Petal.Length | 0 | 1 | 3.76 | 1.77 | 1.0 | 1.6 | 4.35 | 5.1 | 6.9 | ▇▁▆▇▂ |

Petal.Width | 0 | 1 | 1.20 | 0.76 | 0.1 | 0.3 | 1.30 | 1.8 | 2.5 | ▇▁▇▅▃ |

Sepal.Length | 0 | 1 | 5.84 | 0.83 | 4.3 | 5.1 | 5.80 | 6.4 | 7.9 | ▆▇▇▅▂ |

Sepal.Width | 0 | 1 | 3.06 | 0.44 | 2.0 | 2.8 | 3.00 | 3.3 | 4.4 | ▁▆▇▂▁ |

We choose the support vector machine implementation from the e1071 package (which is based on LIBSVM) and use it as a classification machine by setting `type`

to `"C-classification"`

.

```
learner = lrn("classif.svm", type = "C-classification", kernel = "radial")
```

For tuning, it is important to create a search space that defines the type and range of the hyperparameters. A learner stores all information about its hyperparameters in the slot `$param_set`

. Not all parameters are tunable. We have to choose a subset of the hyperparameters we want to tune.

```
print(learner$param_set)
```

```
<ParamSet>
id class lower upper nlevels default parents
1: cachesize ParamDbl -Inf Inf Inf 40
2: class.weights ParamUty NA NA Inf
3: coef0 ParamDbl -Inf Inf Inf 0 kernel
4: cost ParamDbl 0 Inf Inf 1 type
5: cross ParamInt 0 Inf Inf 0
6: decision.values ParamLgl NA NA 2 FALSE
7: degree ParamInt 1 Inf Inf 3 kernel
8: fitted ParamLgl NA NA 2 TRUE
9: gamma ParamDbl 0 Inf Inf <NoDefault[3]> kernel
10: kernel ParamFct NA NA 4 radial
11: nu ParamDbl -Inf Inf Inf 0.5 type
12: scale ParamUty NA NA Inf TRUE
13: shrinking ParamLgl NA NA 2 TRUE
14: tolerance ParamDbl 0 Inf Inf 0.001
15: type ParamFct NA NA 2 C-classification
value
1:
2:
3:
4:
5:
6:
7:
8:
9:
10: radial
11:
12:
13:
14:
15: C-classification
```

We use the `to_tune()`

function to define the range over which the hyperparameter should be tuned. We opt for the `cost`

and `gamma`

hyperparameters of the `radial`

kernel and set the tuning ranges with lower and upper bounds.

We specify how to evaluate the performance of the different hyperparameter configurations. For this, we choose 3-fold cross validation as the resampling strategy and the classification error as the performance measure.

Usually, we have to select a budget for the tuning. This is done by choosing a `Terminator`

, which stops the tuning e.g. after a performance level is reached or after a given time. However, some tuners like grid search terminate themselves. In this case, we choose a terminator that never stops and the tuning is not stopped before all grid points are evaluated.

```
terminator = trm("none")
```

At this point, we can construct a `TuningInstanceSingleCrit`

that describes the tuning problem.

```
instance = TuningInstanceSingleCrit$new(
task = task,
learner = learner,
resampling = resampling,
measure = measure,
terminator = terminator
)
print(instance)
```

```
<TuningInstanceSingleCrit>
* State: Not optimized
* Objective: <ObjectiveTuning:classif.svm_on_iris>
* Search Space:
<ParamSet>
id class lower upper nlevels default value
1: cost ParamDbl 0.1 10 Inf <NoDefault[3]>
2: gamma ParamDbl 0.0 5 Inf <NoDefault[3]>
* Terminator: <TerminatorNone>
* Terminated: FALSE
* Archive:
<ArchiveTuning>
Null data.table (0 rows and 0 cols)
```

Finally, we have to choose a `Tuner`

. Grid Search discretizes numeric parameters into a given resolution and constructs a grid from the Cartesian product of these sets. Categorical parameters produce a grid over all levels specified in the search space. In this example, we only use a resolution of 5 to keep the runtime low. Usually, a higher resolution is used to create a denser grid.

```
<TunerGridSearch>
* Parameters: resolution=5, batch_size=1
* Parameter classes: ParamLgl, ParamInt, ParamDbl, ParamFct
* Properties: dependencies, single-crit, multi-crit
* Packages: -
```

We can preview the proposed configurations by using `generate_design_grid()`

. This function is internally executed by `TunerGridSearch`

.

```
generate_design_grid(learner$param_set$search_space(), resolution = 5)
```

```
<Design> with 25 rows:
cost gamma
1: 0.100 0.00
2: 0.100 1.25
3: 0.100 2.50
4: 0.100 3.75
5: 0.100 5.00
6: 2.575 0.00
7: 2.575 1.25
8: 2.575 2.50
9: 2.575 3.75
10: 2.575 5.00
11: 5.050 0.00
12: 5.050 1.25
13: 5.050 2.50
14: 5.050 3.75
15: 5.050 5.00
16: 7.525 0.00
17: 7.525 1.25
18: 7.525 2.50
19: 7.525 3.75
20: 7.525 5.00
21: 10.000 0.00
22: 10.000 1.25
23: 10.000 2.50
24: 10.000 3.75
25: 10.000 5.00
cost gamma
```

We trigger the tuning by passing the `TuningInstanceSingleCrit`

to the `$optimize()`

method of the `Tuner`

. The instance is modified in-place.

```
tuner$optimize(instance)
```

We plot the performances depending on the evaluated `cost`

and `gamma`

values.

The points mark the evaluated `cost`

and `gamma`

values. We should not infer the performance of new values from the heatmap since it is only an interpolation. However, we can see the general interaction between the hyperparameters.

Tuning a learner can be shortened by using the `tune()`

-shortcut.

```
learner = lrn("classif.svm", type = "C-classification", kernel = "radial")
learner$param_set$values$cost = to_tune(0.1, 10)
learner$param_set$values$gamma = to_tune(0, 5)
instance = tune(
method = "grid_search",
task = tsk("iris"),
learner = learner,
resampling = rsmp ("holdout"),
measure = msr("classif.ce"),
resolution = 5
)
```

Next, we want to tune the `cost`

and `gamma`

hyperparameter more efficiently. It is recommended to tune `cost`

and `gamma`

on the logarithmic scale (Hsu, Chang, and Lin 2003). The log transformation emphasizes smaller `cost`

and `gamma`

values but also creates large values. Therefore, we use a log transformation to emphasize this region of the search space with a denser grid.

Generally speaking, transformations can be used to convert hyperparameters to a new scale. These transformations are applied before the proposed configuration is passed to the `Learner`

. We can directly define the transformation in the `to_tune()`

function. The lower and upper bound is set on the original scale.

```
learner = lrn("classif.svm", type = "C-classification", kernel = "radial")
# tune from 2^-15 to 2^15 on a log scale
learner$param_set$values$cost = to_tune(p_dbl(-15, 15, trafo = function(x) 2^x))
# tune from 2^-15 to 2^5 on a log scale
learner$param_set$values$gamma = to_tune(p_dbl(-15, 5, trafo = function(x) 2^x))
```

Transformations to the log scale are the ones most commonly used. We can use a shortcut for this transformation. The lower and upper bound is set on the transformed scale.

We create a new `TuningInstanceSingleCrit`

and trigger the tuning.

```
instance = tune(
method = "grid_search",
task = task,
learner = learner,
resampling = resampling,
measure = measure,
resolution = 5
)
```

The hyperparameter values after the transformation are stored in the `x_domain`

column as lists. We can expand these lists into multiple columns by using `as.data.table()`

. The hyperparameter names are prefixed by `x_domain`

.

```
data = as.data.table(instance$archive)
data[, .(cost, gamma, x_domain_cost, x_domain_gamma)]
```

```
cost gamma x_domain_cost x_domain_gamma
1: 11.512925 -11.512925 1.000000e+05 1.000000e-05
2: 5.756463 0.000000 3.162278e+02 1.000000e+00
3: -11.512925 11.512925 1.000000e-05 1.000000e+05
4: 0.000000 5.756463 1.000000e+00 3.162278e+02
5: -11.512925 -5.756463 1.000000e-05 3.162278e-03
6: 0.000000 0.000000 1.000000e+00 1.000000e+00
7: 11.512925 5.756463 1.000000e+05 3.162278e+02
8: -5.756463 -11.512925 3.162278e-03 1.000000e-05
9: -11.512925 -11.512925 1.000000e-05 1.000000e-05
10: -5.756463 11.512925 3.162278e-03 1.000000e+05
11: -11.512925 5.756463 1.000000e-05 3.162278e+02
12: 11.512925 0.000000 1.000000e+05 1.000000e+00
13: -11.512925 0.000000 1.000000e-05 1.000000e+00
14: 5.756463 -11.512925 3.162278e+02 1.000000e-05
15: 5.756463 5.756463 3.162278e+02 3.162278e+02
16: 5.756463 -5.756463 3.162278e+02 3.162278e-03
17: 5.756463 11.512925 3.162278e+02 1.000000e+05
18: 11.512925 11.512925 1.000000e+05 1.000000e+05
19: 11.512925 -5.756463 1.000000e+05 3.162278e-03
20: -5.756463 -5.756463 3.162278e-03 3.162278e-03
21: 0.000000 -11.512925 1.000000e+00 1.000000e-05
22: 0.000000 11.512925 1.000000e+00 1.000000e+05
23: 0.000000 -5.756463 1.000000e+00 3.162278e-03
24: -5.756463 0.000000 3.162278e-03 1.000000e+00
25: -5.756463 5.756463 3.162278e-03 3.162278e+02
cost gamma x_domain_cost x_domain_gamma
```

We plot the performances depending on the evaluated `cost`

and `gamma`

values.

```
library(ggplot2)
library(scales)
autoplot(instance, type = "points", cols_x = c("x_domain_cost", "x_domain_gamma")) +
scale_x_continuous(
trans = log2_trans(),
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x))) +
scale_y_continuous(
trans = log2_trans(),
breaks = trans_breaks("log10", function(x) 10^x),
labels = trans_format("log10", math_format(10^.x)))
```

Dependencies ensure that certain parameters are only proposed depending on values of other hyperparameters. We want to tune the `degree`

hyperparameter that is only needed for the `polynomial`

kernel.

```
learner = lrn("classif.svm", type = "C-classification")
learner$param_set$values$cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$kernel = to_tune(c("polynomial", "radial"))
learner$param_set$values$degree = to_tune(1, 4)
```

The dependencies are already stored in the learner parameter set.

```
learner$param_set$deps
```

```
id on cond
1: cost type <CondEqual[9]>
2: nu type <CondEqual[9]>
3: degree kernel <CondEqual[9]>
4: coef0 kernel <CondAnyOf[9]>
5: gamma kernel <CondAnyOf[9]>
```

The `gamma`

hyperparameter depends on the kernel being `polynomial`

, `radial`

or `sigmoid`

```
learner$param_set$deps$cond[[5]]
```

`CondAnyOf: x ∈ {polynomial, radial, sigmoid}`

whereas the `degree`

hyperparameter is solely used by the `polynomial`

kernel.

```
learner$param_set$deps$cond[[3]]
```

`CondEqual: x = polynomial`

We preview the grid to show the effect of the dependencies.

```
generate_design_grid(learner$param_set$search_space(), resolution = 2)
```

```
<Design> with 12 rows:
cost gamma kernel degree
1: -11.51293 -11.51293 polynomial 1
2: -11.51293 -11.51293 polynomial 4
3: -11.51293 -11.51293 radial NA
4: -11.51293 11.51293 polynomial 1
5: -11.51293 11.51293 polynomial 4
6: -11.51293 11.51293 radial NA
7: 11.51293 -11.51293 polynomial 1
8: 11.51293 -11.51293 polynomial 4
9: 11.51293 -11.51293 radial NA
10: 11.51293 11.51293 polynomial 1
11: 11.51293 11.51293 polynomial 4
12: 11.51293 11.51293 radial NA
```

The value for `degree`

is `NA`

if the dependency on the `kernel`

is not satisfied.

We create a new `TuningInstanceSingleCrit`

and trigger the tuning.

```
instance = tune(
method = "grid_search",
task = task,
learner = learner,
resampling = resampling,
measure = measure,
resolution = 3
)
```

```
instance$result
```

```
cost gamma kernel degree learner_param_vals x_domain classif.ce
1: 0 0 polynomial 1 <list[5]> <list[4]> 0.02
```

We add the optimized hyperparameters to the learner and train the learner on the full dataset.

```
learner = lrn("classif.svm")
learner$param_set$values = instance$result_learner_param_vals
learner$train(task)
```

The trained model can now be used to make predictions on new data. A common mistake is to report the performance estimated on the resampling sets on which the tuning was performed (`instance$result_y`

) as the model’s performance. These scores might be biased and overestimate the ability of the fitted model to predict with new data. Instead, we have to use nested resampling to get an unbiased performance estimate.

Tuning should not be performed on the same resampling sets which are used for evaluating the model itself, since this would result in a biased performance estimate. Nested resampling uses an outer and inner resampling to separate the tuning from the performance estimation of the model. We can use the `AutoTuner`

class for running nested resampling. The `AutoTuner`

wraps a `Learner`

and tunes the hyperparameter of the learner during `$train()`

. This is our inner resampling loop.

```
learner = lrn("classif.svm", type = "C-classification")
learner$param_set$values$cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$kernel = to_tune(c("polynomial", "radial"))
learner$param_set$values$degree = to_tune(1, 4)
resampling_inner = rsmp("cv", folds = 3)
terminator = trm("none")
tuner = tnr("grid_search", resolution = 3)
at = AutoTuner$new(
learner = learner,
resampling = resampling_inner,
measure = measure,
terminator = terminator,
tuner = tuner,
store_models = TRUE)
```

We put the `AutoTuner`

into a `resample()`

call to get the outer resampling loop.

We check the inner tuning results for stable hyperparameters. This means that the selected hyperparameters should not vary too much. We might observe unstable models in this example because the small data set and the low number of resampling iterations might introduce too much randomness. Usually, we aim for the selection of stable hyperparameters for all outer training sets.

```
cost gamma kernel degree learner_param_vals x_domain classif.ce
1: -11.51293 11.51293 polynomial 1 <list[5]> <list[4]> 0.01960784
2: -11.51293 11.51293 polynomial 1 <list[5]> <list[4]> 0.04010695
3: 11.51293 11.51293 polynomial 1 <list[5]> <list[4]> 0.01990493
```

Next, we want to compare the predictive performances estimated on the outer resampling to the inner resampling (`extract_inner_tuning_results(rr)`

). Significantly lower predictive performances on the outer resampling indicate that the models with the optimized hyperparameters overfit the data.

```
rr$score()
```

```
task task_id learner learner_id
1: <TaskClassif[46]> iris <AutoTuner[41]> classif.svm.tuned
2: <TaskClassif[46]> iris <AutoTuner[41]> classif.svm.tuned
3: <TaskClassif[46]> iris <AutoTuner[41]> classif.svm.tuned
resampling resampling_id iteration prediction
1: <ResamplingCV[19]> cv 1 <PredictionClassif[19]>
2: <ResamplingCV[19]> cv 2 <PredictionClassif[19]>
3: <ResamplingCV[19]> cv 3 <PredictionClassif[19]>
classif.ce
1: 0.00
2: 0.02
3: 0.04
```

The archive of the `AutoTuner`

s allows us to inspect all evaluated hyperparameters configurations with the associated predictive performances.

```
rr$learners[[1]]$archive
```

```
<ArchiveTuning>
cost gamma kernel degree classif.ce timestamp batch_nr
1: 12 12 polynomial 2 0.15 2021-03-11 16:21:44 1
2: 0 12 polynomial 4 0.20 2021-03-11 16:21:44 2
3: -12 0 polynomial 1 0.74 2021-03-11 16:21:45 3
4: -12 12 polynomial 1 0.02 2021-03-11 16:21:45 4
5: -12 -12 radial NA 0.74 2021-03-11 16:21:45 5
6: 0 0 polynomial 2 0.15 2021-03-11 16:21:45 6
7: 12 12 polynomial 4 0.20 2021-03-11 16:21:45 7
8: 0 -12 radial NA 0.74 2021-03-11 16:21:46 8
9: -12 12 polynomial 4 0.20 2021-03-11 16:21:46 9
10: -12 0 radial NA 0.74 2021-03-11 16:21:46 10
11: -12 -12 polynomial 2 0.74 2021-03-11 16:21:46 11
12: 0 0 polynomial 4 0.19 2021-03-11 16:21:46 12
13: 0 -12 polynomial 4 0.74 2021-03-11 16:21:46 13
14: 0 0 radial NA 0.05 2021-03-11 16:21:46 14
15: 0 -12 polynomial 2 0.74 2021-03-11 16:21:47 15
16: 0 12 radial NA 0.74 2021-03-11 16:21:47 16
17: -12 12 radial NA 0.74 2021-03-11 16:21:47 17
18: -12 -12 polynomial 1 0.74 2021-03-11 16:21:47 18
19: 0 12 polynomial 1 0.05 2021-03-11 16:21:47 19
20: 12 -12 radial NA 0.04 2021-03-11 16:21:48 20
21: 12 -12 polynomial 4 0.74 2021-03-11 16:21:48 21
22: -12 12 polynomial 2 0.15 2021-03-11 16:21:48 22
23: -12 -12 polynomial 4 0.74 2021-03-11 16:21:48 23
24: 12 -12 polynomial 2 0.74 2021-03-11 16:21:48 24
25: -12 0 polynomial 4 0.74 2021-03-11 16:21:48 25
26: -12 0 polynomial 2 0.74 2021-03-11 16:21:48 26
27: 12 0 polynomial 1 0.05 2021-03-11 16:21:49 27
28: 12 0 polynomial 4 0.20 2021-03-11 16:21:49 28
29: 12 0 radial NA 0.07 2021-03-11 16:21:49 29
30: 12 0 polynomial 2 0.15 2021-03-11 16:21:49 30
31: 0 -12 polynomial 1 0.74 2021-03-11 16:21:49 31
32: 12 -12 polynomial 1 0.02 2021-03-11 16:21:50 32
33: 12 12 polynomial 1 0.07 2021-03-11 16:21:52 33
34: 0 0 polynomial 1 0.02 2021-03-11 16:21:52 34
35: 0 12 polynomial 2 0.15 2021-03-11 16:21:52 35
36: 12 12 radial NA 0.74 2021-03-11 16:21:52 36
cost gamma kernel degree classif.ce timestamp batch_nr
```

The aggregated performance of all outer resampling iterations is essentially the unbiased performance of an SVM with optimal hyperparameter found by grid search.

```
rr$aggregate()
```

```
classif.ce
0.02
```

Applying nested resampling can be shortened by using the `tune_nested()`

-shortcut.

```
learner = lrn("classif.svm", type = "C-classification")
learner$param_set$values$cost = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$gamma = to_tune(p_dbl(1e-5, 1e5, logscale = TRUE))
learner$param_set$values$kernel = to_tune(c("polynomial", "radial"))
learner$param_set$values$degree = to_tune(1, 4)
rr = tune_nested(
method = "grid_search",
task = tsk("iris"),
learner = learner,
inner_resampling = rsmp ("cv", folds = 3),
outer_resampling = rsmp("cv", folds = 3),
measure = msr("classif.ce"),
resolution = 3
)
```

The mlr3book includes chapters on tuning spaces and hyperparameter tuning. The mlr3cheatsheets contain frequently used commands and workflows of mlr3.

Hsu, Chih-wei, Chih-chung Chang, and Chih-Jen Lin. 2003. “A Practical Guide to Support Vector Classification.”

For attribution, please cite this work as

Becker, et al. (2021, March 9). mlr3gallery: Practical Tuning Series - Tune a Support Vector Machine. Retrieved from https://mlr3gallery.mlr-org.com/posts/2021-03-09-practical-tuning-series-tune-a-support-vector-machine/

BibTeX citation

@misc{becker2021practical, author = {Becker, Marc and Ullmann, Theresa and Lang, Michel and Bischl, Bernd and Richter, Jakob and Binder, Martin}, title = {mlr3gallery: Practical Tuning Series - Tune a Support Vector Machine}, url = {https://mlr3gallery.mlr-org.com/posts/2021-03-09-practical-tuning-series-tune-a-support-vector-machine/}, year = {2021} }