mlr3 Basics on “Iris” - Hello World!

mlr3 basics iris data set classification

Basic ML operations on iris: Train, predict, score, resample and benchmark. A simple, hands-on intro to mlr3.

Bernd Bischl
03-18-2020

Goals and Prerequisites

This use case shows how to use the basic mlr3 package on the iris Task, so it’s our “Hello World” example. It assumes no prior knowledge of ML or mlr3. Most of the content here is also covered, in more detail, in the mlr3book. Hence we will not make many general comments, but keep it hands-on and short.

The following operations are shown: creating tasks and learners, training and predicting, evaluation, changing hyperparameters, resampling, and benchmarking.

Loading basic packages

We load the mlr3verse package which pulls in the most important packages for this example. The mlr3learners package loads additional learners.
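In code, this setup is just the following (assuming both packages are already installed):

```r
# install.packages(c("mlr3verse", "mlr3learners"))  # if not yet installed
library(mlr3verse)    # attaches mlr3 and the core extension packages
library(mlr3learners) # registers additional learners in the mlr_learners dictionary
```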

Creating tasks and learners

Let’s work on the canonical, simple iris data set, and try out some ML algorithms. We will start by using a decision tree with default settings.

# creates mlr3 task from scratch, from a data.frame
# 'target' names the column in the dataset we want to learn to predict
task = as_task_classif(iris, target = "Species")
# in this case we could also take the iris example from mlr3's dictionary of shipped example tasks
# 2 equivalent calls to create a task. The second is just sugar for the user.
task = mlr_tasks$get("iris")
task = tsk("iris")
print(task)
<TaskClassif:iris> (150 x 5)
* Target: Species
* Properties: multiclass
* Features (4):
  - dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width
# create learner from dictionary of mlr3learners
# 2 equivalent calls:
learner_1 = mlr_learners$get("classif.rpart")
learner_1 = lrn("classif.rpart")
print(learner_1)
<LearnerClassifRpart:classif.rpart>
* Model: -
* Parameters: xval=0
* Packages: rpart
* Predict Type: response
* Feature types: logical, integer, numeric, factor, ordered
* Properties: importance, missings, multiclass, selected_features, twoclass, weights

Train and predict

Now the usual ML operations: Train on some observations, predict on others.

# train learner on subset of task
learner_1$train(task, row_ids = 1:120)
# this is what the decision tree looks like
print(learner_1$model)
n= 120 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 120 70 setosa (0.41666667 0.41666667 0.16666667)  
  2) Petal.Length< 2.45 50  0 setosa (1.00000000 0.00000000 0.00000000) *
  3) Petal.Length>=2.45 70 20 versicolor (0.00000000 0.71428571 0.28571429)  
    6) Petal.Length< 4.95 49  1 versicolor (0.00000000 0.97959184 0.02040816) *
    7) Petal.Length>=4.95 21  2 virginica (0.00000000 0.09523810 0.90476190) *
# predict using observations from task
prediction = learner_1$predict(task, row_ids = 121:150)
# predict using "new" observations from an external data.frame
prediction = learner_1$predict_newdata(newdata = iris[121:150, ])
print(prediction)
<PredictionClassif> for 30 observations:
    row_ids     truth   response
          1 virginica  virginica
          2 virginica versicolor
          3 virginica  virginica
---                             
         28 virginica  virginica
         29 virginica  virginica
         30 virginica  virginica

Evaluation

Let’s score our Prediction object with some metrics and take a deeper look by inspecting the confusion matrix.

head(as.data.table(mlr_measures))
              key task_type     packages predict_type task_properties
1:            aic      <NA>                  response                
2:            bic      <NA>                  response                
3:    classif.acc   classif mlr3measures     response                
4:    classif.auc   classif mlr3measures         prob        twoclass
5:   classif.bacc   classif mlr3measures     response                
6: classif.bbrier   classif mlr3measures         prob        twoclass
scores = prediction$score(msr("classif.acc"))
print(scores)
classif.acc 
  0.8333333 
scores = prediction$score(msrs(c("classif.acc", "classif.ce")))
print(scores)
classif.acc  classif.ce 
  0.8333333   0.1666667 
cm = prediction$confusion
print(cm)
            truth
response     setosa versicolor virginica
  setosa          0          0         0
  versicolor      0          0         5
  virginica       0          0        25
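As a quick sanity check (not part of the original post), the accuracy score from above can be recovered directly from the confusion matrix: the correctly classified observations sit on its diagonal.

```r
# accuracy = correctly classified / total, i.e. 25 / 30 here
sum(diag(cm)) / sum(cm)
[1] 0.8333333
```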

Changing hyperpars

The Learner contains information about all parameters that can be configured, including data type, constraints, defaults, etc. We can change the hyperparameters either during construction or later through an active binding.

as.data.table(learner_1$param_set)
                id    class lower upper nlevels
 1:             cp ParamDbl     0     1     Inf
 2:     keep_model ParamLgl    NA    NA       2
 3:     maxcompete ParamInt     0   Inf     Inf
 4:       maxdepth ParamInt     1    30      30
 5:   maxsurrogate ParamInt     0   Inf     Inf
 6:      minbucket ParamInt     1   Inf     Inf
 7:       minsplit ParamInt     1   Inf     Inf
 8: surrogatestyle ParamInt     0     1       2
 9:   usesurrogate ParamInt     0     2       3
10:           xval ParamInt     0   Inf     Inf
# set a hyperparameter during construction ...
learner_2 = lrn("classif.rpart", predict_type = "prob", minsplit = 50)
# ... or equivalently later, through the active binding
learner_2$param_set$values$minsplit = 50
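To verify that the change took effect, the currently set hyperparameter values can be inspected (a small addition, not shown in the original post):

```r
# list of all explicitly set hyperparameters, e.g. minsplit and xval
learner_2$param_set$values
```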

Resampling

Resampling simply repeats the train-predict-score loop and collects all results in a nice data.table::data.table().

cv10 = rsmp("cv", folds = 10)
rr = resample(task, learner_1, cv10)
print(rr)
<ResampleResult> of 10 iterations
* Task: iris
* Learner: classif.rpart
* Warnings: 0 in 0 iterations
* Errors: 0 in 0 iterations
rr$score(msrs(c("classif.acc", "classif.ce")))
    iteration task_id    learner_id resampling_id classif.acc classif.ce
 1:         1    iris classif.rpart            cv   0.8666667  0.1333333
 2:         2    iris classif.rpart            cv   0.9333333  0.0666667
 3:         3    iris classif.rpart            cv   0.9333333  0.0666667
 4:         4    iris classif.rpart            cv   0.8666667  0.1333333
 5:         5    iris classif.rpart            cv   1.0000000  0.0000000
 6:         6    iris classif.rpart            cv   0.8666667  0.1333333
 7:         7    iris classif.rpart            cv   0.9333333  0.0666667
 8:         8    iris classif.rpart            cv   0.9333333  0.0666667
 9:         9    iris classif.rpart            cv   1.0000000  0.0000000
10:        10    iris classif.rpart            cv   1.0000000  0.0000000
# get all predictions nicely concatenated in a table
prediction = rr$prediction()
head(as.data.table(prediction))
   row_ids      truth   response
1:      25     setosa     setosa
2:      53 versicolor  virginica
3:      54 versicolor versicolor
4:      59 versicolor versicolor
5:      60 versicolor versicolor
6:      65 versicolor versicolor
cm = prediction$confusion
print(cm)
            truth
response     setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         45         5
  virginica       0          5        45

Populating the learner dictionary

mlr3learners ships with a dozen popular Learners, which we can list from the dictionary. If we want more, we can install the extension package mlr3extralearners from GitHub. Note how the dictionary grows in size after the installation.

head(as.data.table(mlr_learners)[, c("key", "packages")])
                  key packages
1: classif.AdaBoostM1    RWeka
2:        classif.C50      C50
3:        classif.IBk    RWeka
4:        classif.J48    RWeka
5:       classif.JRip    RWeka
6:        classif.LMT    RWeka
# remotes::install_github("mlr-org/mlr3extralearners")
library(mlr3extralearners)
print(as.data.table(mlr_learners)[, c("key", "packages")])
                    key               packages
  1: classif.AdaBoostM1                  RWeka
  2:        classif.C50                    C50
  3:        classif.IBk                  RWeka
  4:        classif.J48                  RWeka
  5:       classif.JRip                  RWeka
 ---                                          
132:        surv.ranger                 ranger
133:         surv.rfsrc randomForestSRC,pracma
134:         surv.rpart  rpart,distr6,survival
135:           surv.svm            survivalsvm
136:       surv.xgboost                xgboost

Benchmarking multiple learners

The benchmark() function can conveniently compare multiple Learners on the same dataset(s).

learners = list(learner_1, learner_2, lrn("classif.randomForest"))
grid = benchmark_grid(task, learners, cv10)
bmr = benchmark(grid)
print(bmr)
<BenchmarkResult> of 30 rows with 3 resampling runs
 nr task_id           learner_id resampling_id iters warnings errors
  1    iris        classif.rpart            cv    10        0      0
  2    iris        classif.rpart            cv    10        0      0
  3    iris classif.randomForest            cv    10        0      0
print(bmr$aggregate(measures = msrs(c("classif.acc", "classif.ce"))))
   nr      resample_result task_id           learner_id resampling_id iters classif.acc classif.ce
1:  1 <ResampleResult[20]>    iris        classif.rpart            cv    10   0.9200000 0.08000000
2:  2 <ResampleResult[20]>    iris        classif.rpart            cv    10   0.9333333 0.06666667
3:  3 <ResampleResult[20]>    iris classif.randomForest            cv    10   0.9400000 0.06000000

Conclusion

We left out a lot of details and other features. If you want to know more, read the mlr3book and the documentation of the mentioned packages.

Citation

For attribution, please cite this work as

Bischl (2020, March 18). mlr3gallery: mlr3 Basics on "Iris" - Hello World!. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-03-18-iris-mlr3-basics/

BibTeX citation

@misc{mlr3-basics-iris,
  author = {Bischl, Bernd},
  title = {mlr3gallery: mlr3 Basics on "Iris" - Hello World!},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-03-18-iris-mlr3-basics/},
  year = {2020}
}