Bike Sharing Demand

mlr3tuning tuning optimization nested resampling benchmarking filtering branching bike sharing data set regression

This use case provides an example on tuning and benchmarking in mlr3verse using data from Capital Bikeshare.

Henri Funk, Andreas Hofheinz, Marius Girnat, Marc Becker


The following examples were created as part of the Introduction to Machine Learning Lecture at LMU Munich. The goal of the project was to create and compare one or several machine learning pipelines for the problem at hand together with exploratory analysis and an exposition of results. The posts were contributed to the mlr3gallery by the authors and edited for better legibility by the editor. We want to thank the authors for allowing us to publish their results. Note that the correctness of the results cannot be guaranteed.


Bike sharing is a network of publicly shared bicycles that can be rented for a certain period of time at different locations within a city and then returned at any station. For the suppliers of those networks, it is important to always have enough bicycles available, whereby the demand depends on a variety of factors. Therefore it is essential to have a forecasting system that predicts the demand taking into account several variables. In this document, we develop such a system using the mlr3verse package. For our analysis, we used the Kaggle Bike Sharing Demand data set, which was provided by Hadi Fanaee Tork using data from Capital Bikeshare (Fanaee-T and Gama 2013). After ingesting the data into R we continue with some descriptive statistics for our training data set. We then preprocess the data and carry out feature and target engineering. Afterwards we proceed with the modeling process in which we fit, tune and benchmark the KNN, CART and RandomForest learner. We conclude with our model prediction for the data.


This tutorial assumes familiarity with the basics of mlr3tuning and mlr3pipelines. Consult the mlr3book if some aspects are unclear. Note that expensive calculations are loaded from pre-saved rda files in this tutorial to save computation time.

We initialize the random number generator with a fixed seed for reproducibility, and decrease the verbosity of the logger to keep the output clearly represented.
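A minimal sketch of this setup, assuming the lgr package (the logging backend used by mlr3) is installed. The seed value is an arbitrary placeholder and not necessarily the one used for the results reported here.

```r
# Fix the RNG seed so that resampling splits and random search are repeatable.
# The value 7832 is an arbitrary placeholder.
set.seed(7832)

# Only report warnings and errors from mlr3 and from bbotk, the optimization
# backend of mlr3tuning. Guarded so the snippet also runs without lgr.
if (requireNamespace("lgr", quietly = TRUE)) {
  lgr::get_logger("mlr3")$set_threshold("warn")
  lgr::get_logger("bbotk")$set_threshold("warn")
}

# With a fixed seed, random draws are identical across runs.
set.seed(7832); first_draw = runif(3)
set.seed(7832); second_draw = runif(3)
identical(first_draw, second_draw)
```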



Note that we use the data from the UCI Machine Learning Repository here instead of the Kaggle data set due to license restrictions. The UCI data set is a slightly adjusted version of the same underlying data.


train_datetime = train$datetime

The Bike Sharing Demand data set contains 10886 rows and 12 columns.

Table 1: Column description training data
Column Name Description
datetime hourly date
season \[\begin{cases} 1, \ spring \\ 2, \ summer \\ 3, \ fall \\ 4, \ winter \\ \end{cases}\]
holiday \[\begin{cases} 1, \ holiday \\ 0, \ else \\ \end{cases}\]
workingday \[\begin{cases} 1, \ neither \ weekend \ nor \ holiday \\ 0, \ else \\ \end{cases}\]
weather \[\begin{cases} 1, \ Clear, \ Few \ clouds\\ 2, \ Mist \ and \ Cloudy\\ 3, \ Light \ Snow, \ Light \ Rain, \ Thunderstorm\\ 4, \ Heavy \ Rain, \ Ice \ Pellets\\ \end{cases}\]
temp temperature in Celsius
atemp ‘feels like’ temperature in Celsius
humidity relative humidity
windspeed wind speed
casual number of non-registered user rentals initiated
registered number of registered user rentals initiated
count number of total rentals

Table 1 gives a detailed overview of the columns of our training data set. Besides our target variable count, we have several weather related variables we can use for our prediction. Moreover, we have the date and hour of the observation as well as holiday and workingday indicators. Since there are no missing values in our data, we do not need to apply any further imputation methods.

Descriptive Statistics

The correlation of the continuous variables of our training data is shown in Figure 1. We find that there is an almost perfect positive correlation between atemp and temp (\(r = 0.98\)), which makes sense, since the ‘feels like’ temperature is very similar to the real temperature. Moreover, there is a positive correlation between count and atemp as well as count and temp (\(r = 0.39\)). Additionally, we see that our target variable count is negatively correlated with the humidity (\(r = - 0.32\)).

Figure 1: Correlation matrix of continuous variables in training data

Figure 2 shows the Bike Sharing demand over the entire time frame of our training data. We find an overall positive trend with seasonality, whereby the demand is higher in summer than in winter.

Figure 2: Bike Sharing demand over time

As we can see in Figure 3, the bike sharing demand is fairly evenly distributed over a month. Note that we have no values for days after the 19th day of each month in our training data since those days are included in our test data.

Figure 3: Distribution of Bike Sharing demand over a month


Note that count == casual + registered holds for every row.

all.equal(train$count, train$casual + train$registered)
[1] TRUE

So we cannot use casual and registered as features to predict count, since we only know them once we also know count. Therefore we can remove them without affecting the prediction quality. Furthermore, the features atemp (the ‘feels like’ temperature) and temp (the real temperature) are almost perfectly correlated.

cor(train$atemp, train$temp)
[1] 0.9849481

Therefore we omit the temp feature during the modeling process to reduce the number of features and the complexity of our data. Additionally, we convert nominal and ordinal features to factors and datetime to POSIXct. The structure of our pre-processed data is shown in Table 2.
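The preprocess() helper used in this post is not shown. A minimal base R sketch of the steps just described could look as follows. The function name preprocess_sketch, the UTC time zone and the two toy rows are assumptions made for illustration only.

```r
# Hypothetical stand-in for the preprocessing steps described in the text.
preprocess_sketch = function(data) {
  # Drop the leakage columns and the redundant temperature column.
  data = data[, !(names(data) %in% c("casual", "registered", "temp")), drop = FALSE]
  # Treat the coded categorical columns as factors.
  for (col in c("season", "holiday", "workingday", "weather")) {
    data[[col]] = factor(data[[col]])
  }
  # Parse the time stamp; the time zone is an assumption.
  data$datetime = as.POSIXct(data$datetime, tz = "UTC")
  data
}

# Two toy rows with the same twelve columns as the training data.
toy = data.frame(
  datetime = c("2011-01-01 00:00:00", "2011-01-01 01:00:00"),
  season = c(1, 1), holiday = c(0, 0), workingday = c(0, 0), weather = c(1, 2),
  temp = c(9.84, 9.02), atemp = c(14.395, 13.635), humidity = c(81, 80),
  windspeed = c(0, 0), casual = c(3, 8), registered = c(13, 32), count = c(16, 40)
)
toy_clean = preprocess_sketch(toy)
str(toy_clean)
```

The result has nine columns, which matches the number of variables listed in Table 2.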

train = preprocess(train)
Table 2: Processed training data
Variable Class Mode
atemp -none- numeric
count -none- numeric
datetime POSIXct numeric
holiday factor numeric
humidity -none- numeric
season factor numeric
weather factor numeric
windspeed -none- numeric
workingday factor numeric

Feature engineering

Since we have exactly one observation per time stamp, each datetime value is unique and therefore has no predictive value by itself. However, if we split it into hour, weekday, month and year, we obtain useful predictors. The structure of the resulting training data is shown in Table 3.

Note that PipeOpDateFeatures provides an alternative to engineering such features by hand: it automatically extracts a set of date-related features from a POSIXct date variable.

Note: In this scenario, we assume that we are not interested in forecasting the demand for the following years (given the other information); instead, we try to obtain a model that describes the existing data as well as possible. In a forecasting scenario, we would need a time-sensitive resampling strategy to ensure that our model generalizes to the data we are interested in predicting (the future).
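The engineer_features() helper is not shown in this post either. As a rough base R sketch, splitting a POSIXct time stamp into the four components could be done like this. The function name engineer_features_sketch, the string encodings of the factor levels and the toy data are assumptions.

```r
# Hypothetical stand-in for the date-based feature engineering described above.
engineer_features_sketch = function(data) {
  dt = data$datetime
  data$hour = factor(format(dt, "%H"))     # hour of day, "00" to "23"
  data$weekday = factor(format(dt, "%u"))  # ISO weekday, "1" = Monday
  data$month = factor(format(dt, "%m"))    # month, "01" to "12"
  data$year = factor(format(dt, "%Y"))
  data$datetime = NULL                     # the raw time stamp is dropped
  data
}

toy = data.frame(
  datetime = as.POSIXct(c("2011-01-01 05:00:00", "2012-07-16 18:00:00"), tz = "UTC"),
  count = c(1, 500)
)
toy_features = engineer_features_sketch(toy)
toy_features
```

Encoding the components as factors mirrors Table 3, where hour, weekday, month and year all have class factor.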

train = engineer_features(train)
Table 3: Feature engineered training data
Variable Class Mode
atemp -none- numeric
count -none- numeric
holiday factor numeric
hour factor numeric
humidity -none- numeric
month factor numeric
season factor numeric
weather factor numeric
weekday factor numeric
windspeed -none- numeric
workingday factor numeric
year factor numeric

In Figure 4 we use our engineered hour variable to plot the Bike Sharing demand by hour and temperature. Note that the ‘feels like’ temperature atemp is normalized. We find that the highest demand occurs between 18h and 20h at higher temperatures.

Figure 4: Bike Sharing demand by hour and temperature

Target engineering

As mentioned in our descriptive analysis (chapter 2.2), we find that our target variable count is highly right-skewed. Since this could cause problems during model fitting, we perform a \(z = log(x + 1)\) transformation. In the right plot in Figure 5, we see that the distribution of our transformed count variable is much less skewed than before.

Figure 5: Density transformation of count

For predictions we have to transform the result back to the original scale with \(x = exp(z) - 1\) (note that \(exp(log(x + 1)) - 1 = x\)).

mlr3 simplifies this process by offering Pipeline Operators (PipeOps for short). These PipeOps are computational steps that can be arranged into pipelines to manage the data flow in mlr3.

Using PipeOpTargetTrafo we can dynamically transform our target before the prediction and transform it back into our original scale afterwards.
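Before wrapping the transformation in a pipeline, the round trip between the two scales can be checked by hand in base R. The counts below are made up for illustration.

```r
# Made-up hourly counts, including an hour without rentals.
count = c(0, 1, 16, 977)

z = log1p(count)   # forward transformation, log(count + 1)
back = expm1(z)    # inverse transformation, exp(z) - 1

all.equal(back, count)

# A plain log() has no finite value for hours without rentals.
log(0)     # -Inf
log1p(0)   # 0
```

This is why the shift by one matters: hours with zero rentals stay finite on the transformed scale.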

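log1p_target_trafo = function(learner) {
  # wrap the learner so that the target is transformed before training
  ppl = ppl("targettrafo", graph = learner)
  # forward transformation: z = log(x + 1)
  ppl$param_set$values$targetmutate.trafo = function(x) log1p(x)
  # inverse transformation applied to the predictions: x = exp(z) - 1
  ppl$param_set$values$targetmutate.inverter = function(x) list(response = expm1(x$response))
  glrn = GraphLearner$new(ppl, task_type = "regr")
  glrn$id = paste0("glrn.", learner$id)
  glrn
}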


Task Initialization

First, we have to initialize our bike task. We use our training data as the backend and count as the target variable.

bike_task = as_task_regr(train, target = "count", id = "bike_sharing")

Descriptive Analysis with mlr3

The mlr3viz package enables the user to quickly create insightful descriptive graphics. We use its autoplot function to compare the values of our target variable count across the seasons and between holidays and other days. The results are visualized in Figure 6. We notice that there is almost no difference between the seasons, with the exception of spring, when the value of count is lower. Moreover, there is no substantial difference between holidays and other days.

Figure 6: Autoplot season/holiday

To examine relationships between variables in the training data, we can also use the autoplot function and set the type argument to “pairs”. We make use of this option to further investigate the relationship between atemp, windspeed and count and compare it over the two years of our data. The result is visualized in Figure 7. In the upper left graph, we notice that the value of log1p_count is higher in 2012 than in 2011. In contrast, the distributions of atemp and windspeed do not seem to differ between the years. Also, the correlations of the variables are fairly similar in both years.

Figure 7: Relationship between atemp, windspeed and count by year

Model training

In our modeling process, we consider three learners: kknn, rpart and ranger. For hyperparameter optimization, we use the so-called AutoTuner. An AutoTuner is a learner that tunes its hyperparameters during training. The technical implementation is provided by mlr3tuning. For more details see the mlr3book.
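As a small self-contained illustration of the AutoTuner idea, the following sketch tunes a single rpart parameter on the built-in mtcars task. It assumes a recent mlr3tuning release, in which auto_tuner() takes a Tuner object through the tuner argument; the code in this post passes method = "grid_search" instead, which is the older form of the interface. The task, the grid resolution and the budget are chosen only to keep the run short.

```r
library(mlr3verse)

# Small built-in task so that the example runs in a few seconds.
task = tsk("mtcars")

at = auto_tuner(
  tuner = tnr("grid_search", resolution = 5L),
  learner = lrn("regr.rpart"),
  resampling = rsmp("cv", folds = 3L),
  measure = msr("regr.rmse"),
  search_space = ps(cp = p_dbl(lower = 0.001, upper = 0.1)),
  term_evals = 5L
)

# Training evaluates every proposed cp value with the inner cross-validation
# and then refits the learner on the full task with the best value found.
at$train(task)
at$tuning_result
```

After training, the AutoTuner can be used like any other learner, for example with predict() or inside resample() and benchmark().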

To make the learners as comparable as possible, the resampling strategy (Resampling), which enables us to measure the performance of a learner configuration, and the tuning strategy are kept fixed for all compared learners. If a learner’s settings differ from the general case, the reason is given in the section on the respective AutoTuner.

resampling_cv_5 = rsmp("cv", folds = 5L)
resampling_outer_cv3 = rsmp("cv", folds = 3L)
measures = msr("regr.rmse")
tnr("grid_search", resolution = 10L)
* Parameters: resolution=10, batch_size=1
* Parameter classes: ParamLgl, ParamInt, ParamDbl, ParamFct
* Properties: dependencies, single-crit, multi-crit
* Packages: -
trm("evals", n_evals = 20L)
* Parameters: n_evals=20, k=0

As the computational cost of nested resampling is high and tuning over many parameters would be very complex, each AutoTuner is restricted to two hyperparameters that seem to have a strong influence on the RMSE. The search ranges of numeric hyperparameters are set around their default values.

The resampling and the terminator used during auto tuning are set to relatively few iterations to save computational effort. In regular cases we would suggest using 5- or 10-fold cross-validation. Auto tuning should evaluate enough grid points to give the user an impression of the whole hyperparameter space.
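The measure used throughout, the root mean squared error, is simple enough to write out by hand. A minimal base R sketch with made-up values; the helper name rmse is chosen for illustration and is not an mlr3 function.

```r
# Root mean squared error between observed and predicted values.
rmse = function(truth, response) {
  sqrt(mean((truth - response)^2))
}

truth = c(10, 20, 30)
response = c(12, 18, 33)
rmse(truth, response)  # sqrt((4 + 4 + 9) / 3)
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.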


kknn is a k-nearest-neighbor learner whose hyperparameters are shown in Table 4.

kknn_learner = lrn("regr.kknn")
Table 4: Parameters of kknn learner
id lower upper levels default
k 1 Inf NULL 7
distance 0 Inf NULL 2
kernel NA NA rectangular , triangular , epanechnikov, biweight , triweight , cos , inv , gaussian , rank , optimal optimal

For simplicity we tune only two of those parameters:

  1. k: Number of neighbors considered.

  2. distance: Parameter p of the Minkowski distance.

For more information see the kknn documentation.
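To make the role of the distance parameter concrete, here is a small base R sketch of the Minkowski distance of order p. The helper name minkowski and the two example points are chosen for illustration.

```r
# Minkowski distance of order p between two numeric vectors.
minkowski = function(a, b, p) {
  sum(abs(a - b)^p)^(1 / p)
}

a = c(0, 0)
b = c(3, 4)

minkowski(a, b, p = 1)  # Manhattan distance: 7
minkowski(a, b, p = 2)  # Euclidean distance: 5
minkowski(a, b, p = 3)  # between the largest coordinate difference (4) and 5
```

With growing p, the largest coordinate difference increasingly dominates the distance.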

search_space_kknn = ps(
  regr.kknn.k = p_int(lower = 1L, upper = 50L),
  regr.kknn.distance = p_int(lower = 1L, upper = 3L))

Now we set up the autotuner for the kknn learner and train it.

ppl_kknn = log1p_target_trafo(kknn_learner)

at_kknn = auto_tuner(
  method = "grid_search",
  learner = ppl_kknn,
  resampling = resampling_cv_5,
  measure = measures,
  search_space = search_space_kknn,
  term_evals = 20L,
  resolution = 10L,
  batch_size = 8L
)


In Figure 8 we see the RMSE for the considered hyperparameters of the kknn learner. It seems that a higher order p of the Minkowski distance leads to a lower RMSE. Moreover, we see that the RMSE is lowest around k = 10 neighbors. For smaller k the fit might be too wiggly, for larger k it might be too global (bias–variance tradeoff). The minimal RMSE that the AutoTuner found is attained at p = 3 and k = 6.

Figure 8: RMSE on tuning grid of kknn autotuner


rpart is a CART learner whose hyperparameters are shown in Table 5.

rpart_learner = lrn("regr.rpart")
Table 5: Parameters of rpart learner
id lower upper levels default
cp 0 1 NULL 0.01
maxcompete 0 Inf NULL 4
maxdepth 1 30 NULL 30
maxsurrogate 0 Inf NULL 5
minbucket 1 Inf NULL round(minsplit / 3)
minsplit 1 Inf NULL 20
surrogatestyle 0 1 NULL 0
usesurrogate 0 2 NULL 2
xval 0 Inf NULL 10

Again, for simplicity, we tune only two of those parameters:

  1. minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted.

  2. cp: complexity parameter. Any split that does not decrease the overall lack of fit by a factor of cp is not attempted.

For more information browse the rpart Documentation.
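The effect of cp on tree size can be illustrated with rpart directly. The sketch below uses the built-in mtcars data rather than the bike data, and the helper count_leaves as well as the two cp values are chosen for illustration.

```r
library(rpart)

# Count the leaves of a regression tree grown on mtcars for a given
# complexity parameter cp.
count_leaves = function(cp) {
  fit = rpart(mpg ~ ., data = mtcars,
    control = rpart.control(cp = cp, minsplit = 5))
  sum(fit$frame$var == "<leaf>")
}

leaves_large_cp = count_leaves(0.1)
leaves_small_cp = count_leaves(0.001)
c(cp_0.1 = leaves_large_cp, cp_0.001 = leaves_small_cp)
```

A smaller cp lets more splits pass the improvement threshold, so the tree for cp = 0.001 has at least as many leaves as the one for cp = 0.1.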

search_space_cart = ps(
  regr.rpart.minsplit = p_int(lower = 1L, upper = 30L),
  regr.rpart.cp = p_dbl(lower = 0.001, upper = 0.1))

Now we set up the autotuner for the rpart learner and train it.

ppl_rpart = log1p_target_trafo(rpart_learner)

at_rpart = auto_tuner(
  method = "grid_search",
  learner = ppl_rpart,
  resampling = resampling_cv_5,
  measure = measures,
  search_space = search_space_cart,
  term_evals = 20L,
  resolution = 10L,
  batch_size = 8L
)


In Figure 9 we see the RMSE for the considered hyperparameters of the rpart learner. The learner performs worst for minsplit < 3, that is, when splits are attempted even in very small nodes. For minsplit > 3, the parameter seems to have no significant influence on the RMSE in the given hyperparameter space. In contrast, the error decreases considerably for relatively low cp. Lowering cp reduces the level of pruning, resulting in larger trees. We reach the minimal RMSE at the lowest considered value, cp = 0.001.

Figure 9: RMSE on tuning grid rpart learner


ranger is a random forest learner whose hyperparameters are shown in Table 6.

ranger_learner = lrn("regr.ranger", importance = "impurity")
Table 6: Parameters of ranger learner
id levels
alpha NULL
always.split.variables NULL
holdout TRUE, FALSE
splitrule variance , extratrees, maxstat
verbose TRUE, FALSE
write.forest TRUE, FALSE

Feature Importance

First, we take advantage of the ability of the ranger model to calculate the importance of its features. The importance filter implemented in mlr3filters provides the possibility to access this calculation via mlr3 syntax.

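filter_ranger = flt("importance", learner = ranger_learner)
# the original assignment is truncated; an assumed completion is to compute
# the filter scores on the task and collect them in a table
filter_ranger$calculate(bike_task)
feature_importance = as.data.table(filter_ranger)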

In Figure 10 we see that the features differ in their importance. Some features do not seem to contribute enough to the model performance to justify including them in the learner. To test this hypothesis, we set up a learner that can try different feature combinations.

Figure 10: Feature importance in ranger learner

We use the features in decreasing order of feature importance. That is, if we use \(n\) features to predict the target, we use the \(n\) most important features according to their impurity importance. To this end, we set up a graph learner in the next step.
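The selection rule itself is easy to sketch in base R. The importance scores below are made up for illustration; the real ones come from the ranger importance filter. The helper name top_features is an assumption as well.

```r
# Made-up importance scores for a handful of features.
importance = c(hour = 950, atemp = 310, year = 180, humidity = 120,
  workingday = 90, windspeed = 20)

# Keep the n highest-scoring features, in decreasing order of importance.
top_features = function(scores, n) {
  names(sort(scores, decreasing = TRUE))[seq_len(n)]
}

top_features(importance, n = 3)
```

In the pipeline, the filter PipeOp plays the role of this helper, with the number of kept features exposed as a tunable parameter.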

Ranger Pipe

Setting up the Ranger Pipe enables us to compare models with different feature combinations. To do so, we use the mlr3 packages mlr3pipelines and mlr3filters and create a new learner ranger_feature. This learner is able to select the most important features according to the impurity filter.

po_flt = po("filter", filter_ranger, param_vals = list(filter.nfeat = 11L)) %>>%
  po("learner", ranger_learner$clone())

Tuning Random Forest

In addition to the regular ranger hyperparameters, we are now able to tune any parameter of the importance filter PipeOp.

Again, for simplicity, we tune only two of those parameters:

  1. regr.ranger.mtry: Number of variables to possibly split at in each node. Default is the (rounded down) square root of the number of variables.

  2. importance.filter.nfeat: A parameter of the importance filter that controls the number of features passed on to the model.

For more information see the ranger documentation and mlr3 pipelines.

search_space_ranger = ps(
  regr.ranger.mtry = p_int(lower = 1L, upper = 11L),
  importance.filter.nfeat = p_int(lower = 1L, upper = 11L))

To ensure that regr.ranger.mtry never exceeds importance.filter.nfeat, we define the grid for the tuner manually. We constrain the possible number of features to the 1, 6, 10 and 11 most important feature(s). The resulting grid, shown in Figure 11, consists of 28 combinations to tune our learner on.

grid = generate_design_grid(search_space_ranger, resolution = 11)
grid$data = grid$data[regr.ranger.mtry <= importance.filter.nfeat]
grid$data = grid$data[importance.filter.nfeat %in% c(1, 6, 10, 11)]

Figure 11: Search space for ranger autotuner
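The size of this restricted grid can be verified with a few lines of base R. This is a sketch that mirrors the two filter conditions with expand.grid() instead of generate_design_grid(); the column names mtry and nfeat are shortened for readability.

```r
# All combinations of mtry and nfeat on the integer grid 1..11.
full_grid = expand.grid(mtry = 1:11, nfeat = 1:11)

# Keep valid combinations for the four selected feature counts.
keep = full_grid$mtry <= full_grid$nfeat & full_grid$nfeat %in% c(1, 6, 10, 11)
valid = full_grid[keep, ]

nrow(valid)         # 1 + 6 + 10 + 11 = 28
table(valid$nfeat)  # number of admissible mtry values per feature count
```

For each feature count n there are exactly n admissible values of mtry, which gives the 28 design points.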

Since the hyperparameter space is small, one could consider exploring the whole space of nfeat and mtry here. To save time, we stick to the 28 combinations above, that is, all valid values of mtry for 1, 6, 10 and 11 features.

ranger_feature = log1p_target_trafo(po_flt)

at_ranger = auto_tuner(
  method = "design_points",
  learner = ranger_feature,
  resampling = resampling_cv_5,
  measure = measures,
  search_space = search_space_ranger,
  design = grid$data,
  batch_size = 8L
)


In Figure 12 we see the RMSE for the evaluated hyperparameters of the ranger autotuner. With increasing mtry and number of features used in the model, the RMSE decreases. The best considered combination is achieved with 11 features and mtry = 10. For higher values of mtry and number of features the RMSE is slightly higher.

Figure 12: RMSE on tuning grid ranger learner


Benchmarking has two objectives. First, we want to obtain a performance measure for all the fitted models through resampling.

Second, we want to compare the models with each other and examine how tuning improved each of them. As stated in Chapter 3.3, we use a 3-fold CV as our benchmarking resampling method, while a 5-fold CV is evaluated inside each auto-tuned object to obtain the best hyperparameter combination. Note that this method does not lead to an unbiased estimate of the generalization error (GE), but rather is a step to compare the learners to each other.

lrns = list(
  rpart_learner, at_rpart,
  ranger_learner, at_ranger,
  kknn_learner, at_kknn
)

design = benchmark_grid(
  tasks = bike_task,
  learners = lrns,
  resamplings = resampling_outer_cv3
)

bmr = benchmark(design)
Table 7: Benchmarking results
learner_id resampling_id iters regr.rmse
glrn..tuned cv 3 45.33
regr.ranger cv 3 64.37
glrn.regr.kknn.tuned cv 3 81.03
glrn.regr.rpart.tuned cv 3 87.35
regr.rpart cv 3 93.95
regr.kknn cv 3 121.88

Tuning enables us to optimize hyperparameters in our model via trial and error. For more details see mlr3tuning. To evaluate and compare the performance of each tuned learner, we set up a benchmark. Results are shown in Table 7 and Figure 13. The boxplot shows the RMSE from the outer resampling for each tuned learner from chapter 3.3 and its non-tuned equivalent. The benchmarking resampling validates our results from hyperparameter tuning for each autotuner in a 3-fold cross-validation. We find that with auto tuning each of the learners improved its performance compared to its equivalent using default hyperparameters. However, for rpart not only the performance increases but also the variance. That does not apply to the tuned kknn and ranger learners, for which the results seem to be comparatively stable. The tuned random forest performs best among all tuned and untuned learners compared in the benchmark. Consequently, the tuned ranger is our learner of choice for prediction.

Nested Resampling

Benchmarking is a fast and simple way to compare different approaches to each other. However, from a theoretical perspective, this approach does not lead to unbiased estimates of the generalization error. Since we only used three different learners for benchmarking, this bias should be insignificant. Nevertheless, we would like to showcase how to obtain unbiased results with more advanced techniques that are available in mlr3. Therefore we evaluate the performance of our learners by “true” unbiased nested resampling. In this case, tuning and model selection are only performed during inner resampling. The purpose of outer resampling is to obtain an unbiased performance estimate for the configuration selected by inner resampling. This method is also more exhaustive, as it tunes the graph learner again in each of the three outer resampling iterations. Benchmarking, on the other hand, “only” tunes each autotuner once and cross-validates with the optimized hyperparameters from the first iteration.
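The index bookkeeping behind nested resampling can be sketched in base R without any learner. The sample size of 30 and the seed are made up; the fold counts (3 outer, 5 inner) follow the setup used in this post.

```r
set.seed(1)
n = 30  # small made-up sample size

# Outer 3-fold split: every row is assigned to exactly one test fold.
outer_fold = sample(rep(1:3, length.out = n))

for (i in 1:3) {
  outer_test = which(outer_fold == i)
  outer_train = which(outer_fold != i)

  # Inner 5-fold split for tuning, drawn from the outer training rows only.
  inner_fold = sample(rep(1:5, length.out = length(outer_train)))
  inner_validation = split(outer_train, inner_fold)

  # No row of the outer test fold is ever seen during tuning.
  stopifnot(length(intersect(unlist(inner_validation), outer_test)) == 0)
  cat("outer fold", i, ":", length(outer_train), "rows for tuning,",
    length(outer_test), "rows for evaluation\n")
}
```

Because the outer test rows never enter the inner folds, the score computed on them is not influenced by the hyperparameter search.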

The Pipeline

We create a simple branched pipe, with our already known learners from the previous chapters. Note that we do not implement the autotuners but the actual learners here, since we aim to tune over the whole pipeline as a learner here, rather than over each learner individually. Therefore our selected learners kknn, rpart and importance.ranger can be considered as hyperparameters in a more universal graph-learner glrn. For more information see mlr3pipelines.

list = sapply(list(po(kknn_learner), po(rpart_learner), po(ranger_learner)), log1p_target_trafo)
list[[3]] = po("filter", filter_ranger, param_vals = list(filter.nfeat = 11L)) %>>%
  po("learner", list[[3]]$clone())
pipe = ppl("branch", list)

Figure 13: Pipeline Learner

The Parameter Set

A few words should be said about this rather complex parameter set:

search_space = ps(
  branch.selection =
    p_int(1L, 3L),
  glrn.regr.kknn.regr.kknn.k =
    p_int(1L, 50L, depends = branch.selection == 1),
  glrn.regr.kknn.regr.kknn.distance =
    p_int(1L, 3L, depends = branch.selection == 1),
  glrn.regr.rpart.regr.rpart.minsplit =
    p_int(1L, 30L, depends = branch.selection == 2),
  glrn.regr.rpart.regr.rpart.cp =
    p_dbl(0.001, 0.1, depends = branch.selection == 2),
  importance.filter.nfeat =
    p_fct(c("6", "9", "11"), depends = branch.selection == 3),
  importance.mtry =
    p_int(7L, 11L, depends = branch.selection == 3),
  .extra_trafo = function(x, param_set) {
    if (x$branch.selection == 3L) {
      # convert the factor level to an integer and rescale mtry so that it
      # never exceeds the number of selected features
      x$importance.filter.nfeat = as.integer(x$importance.filter.nfeat)
      x$importance.mtry =
        ceiling(x$importance.mtry/11 * x$importance.filter.nfeat)
    }
    x
  }
)
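The rescaling of mtry in the transformation is worth a closer look: mtry is sampled from 7 to 11 and then mapped onto the number of features kept by the filter. A small base R sketch of this arithmetic, with the helper name rescale_mtry chosen for illustration:

```r
# Same arithmetic as in the transformation: scale the sampled mtry by the
# share of the 11 features that the filter keeps, then round up.
rescale_mtry = function(mtry, nfeat) ceiling(mtry / 11 * nfeat)

sampled_mtry = 7:11
feature_counts = c(6, 9, 11)

rescaled = sapply(feature_counts, function(nfeat) rescale_mtry(sampled_mtry, nfeat))
dimnames(rescaled) = list(paste0("mtry_", sampled_mtry), paste0("nfeat_", feature_counts))
rescaled
```

For 6 selected features, for example, the sampled values 7 to 11 map to 4, 5, 5, 6 and 6, so the transformed mtry stays within the number of available features.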

Resampling the AutoTuner

Now we can perform nested resampling using our previously created pipe. We directly resample the autotuner to reduce computing time. We can save the tuned learners in our resampling result by setting store_models = TRUE.

multi_at = auto_tuner(
  method = "random_search",
  learner = GraphLearner$new(pipe, task_type = "regr"),
  resampling =  resampling_cv_5,
  measure = measures,
  search_space = search_space,
  term_evals = 21L,
  batch_size = 8L)

multi_at_rr = resample(bike_task, multi_at, resampling_outer_cv3, store_models = TRUE)
Table 8: nested resampling outer-cv results
RMSE learner-branch k distance minsplit cp nfeat mtry
Outer CV-1
153.32849 1 42 1 NA NA NA NA
162.19474 2 NA NA 11 0.0828685 NA NA
155.51215 1 2 1 NA NA NA NA
75.32467 3 NA NA NA NA 9 11
131.58625 1 46 3 NA NA NA NA
114.10944 2 NA NA 2 0.0032251 NA NA
158.65536 2 NA NA 24 0.0426784 NA NA
149.35820 2 NA NA 24 0.0328857 NA NA
153.95453 1 46 1 NA NA NA NA
146.41769 1 31 2 NA NA NA NA
162.04286 2 NA NA 26 0.0579197 NA NA
146.21584 1 14 1 NA NA NA NA
146.09092 1 30 2 NA NA NA NA
129.33858 2 NA NA 29 0.0094714 NA NA
162.19474 2 NA NA 2 0.0649928 NA NA
162.19474 2 NA NA 23 0.0997120 NA NA
96.80351 3 NA NA NA NA 6 9
151.38144 1 1 2 NA NA NA NA
81.54096 3 NA NA NA NA 11 11
75.03666 3 NA NA NA NA 9 9
81.58822 3 NA NA NA NA 11 7
Outer CV-2
149.69703 1 43 2 NA NA NA NA
81.02090 3 NA NA NA NA 11 11
118.82557 1 23 3 NA NA NA NA
108.78322 1 14 3 NA NA NA NA
149.53066 1 18 1 NA NA NA NA
157.60186 2 NA NA 26 0.0669369 NA NA
153.63629 2 NA NA 17 0.0339350 NA NA
73.72908 3 NA NA NA NA 9 7
157.16419 2 NA NA 9 0.0528822 NA NA
96.41697 3 NA NA NA NA 6 8
147.15158 2 NA NA 23 0.0232039 NA NA
73.78747 3 NA NA NA NA 9 9
135.36045 2 NA NA 9 0.0186021 NA NA
96.12271 3 NA NA NA NA 6 8
73.11978 3 NA NA NA NA 9 9
80.03013 3 NA NA NA NA 11 8
157.60186 2 NA NA 5 0.0978836 NA NA
73.47320 3 NA NA NA NA 9 9
80.78996 3 NA NA NA NA 11 9
157.60186 2 NA NA 26 0.0932684 NA NA
96.11884 3 NA NA NA NA 6 9
Outer CV-3
96.33161 3 NA NA NA NA 6 8
147.15158 2 NA NA 15 0.0298682 NA NA
143.75300 1 19 2 NA NA NA NA
153.73550 1 49 2 NA NA NA NA
125.87469 2 NA NA 27 0.0077381 NA NA
157.60186 2 NA NA 3 0.0966107 NA NA
80.27757 3 NA NA NA NA 11 11
159.05765 2 NA NA 14 0.0995111 NA NA
144.32775 1 22 2 NA NA NA NA
133.80813 1 50 3 NA NA NA NA
158.64727 2 NA NA 20 0.0524506 NA NA
72.88095 3 NA NA NA NA 9 9
131.36197 2 NA NA 24 0.0136067 NA NA
131.37231 2 NA NA 26 0.0147357 NA NA
96.03287 3 NA NA NA NA 6 8
125.84646 2 NA NA 19 0.0080141 NA NA
72.09021 3 NA NA NA NA 9 11
79.97418 3 NA NA NA NA 11 9
96.99585 3 NA NA NA NA 6 9
149.99131 1 23 1 NA NA NA NA
158.64727 2 NA NA 25 0.0570872 NA NA
159.05765 2 NA NA 15 0.0780135 NA NA
158.64727 2 NA NA 9 0.0479037 NA NA
144.65745 1 5 1 NA NA NA NA
159.05765 2 NA NA 13 0.0639729 NA NA
170.51641 1 1 1 NA NA NA NA
159.05765 2 NA NA 18 0.0649569 NA NA
131.36197 2 NA NA 22 0.0129953 NA NA
152.20995 1 32 1 NA NA NA NA
146.60947 1 13 1 NA NA NA NA

Table 8 shows the hyperparameter configurations evaluated in each of the three outer resampling iterations. Nested resampling yields results very similar to those of the benchmark in the previous chapter.


Now we use our tuned random forest model to predict the Bike Sharing demand for the train and test data.

Before we can make the predictions, we have to preprocess the test data and engineer features in the same way we did for our training data.

test_datetime = test$datetime
test = preprocess(test)
test = engineer_features(test)

Now we can use our random forest autotuner to predict the log1p_count variable.

pred_train = at_ranger$predict(bike_task)
pred_test = at_ranger$predict_newdata(test)

A visual comparison of the predicted demand with the actual demand is shown in Figure 14. The points depict the actual values and the green bars show our predicted values. The gaps without points cover the days after the 19th of each month, for which we only have test data. We find that overall the model predictions are quite close to the actual values and that the forecast seems reasonable for the test data.

Figure 14: Bike Sharing Demand: Forecast vs. Actual

In Figure 15, we compare the predictions with the actual data grouped by the days of the month. Again, we find that the predictions are very close to the actual values. However, the model seems to systematically underestimate the actual values slightly.

Figure 15: Distribution of Bike Sharing demand over a month


The mlr3verse makes the process of model fitting, tuning, benchmarking and nested resampling intuitive and fast. The included packages provide access to a set of commonly used learners and their train and predict methods using R6 classes. Moreover, one can extend the capabilities of the learners from the original packages through tuning, benchmarking, imputation and even ensembling. mlr3 provides these methods in a simple syntax to create understandable and comparable code architectures. Using these resources we set up a generalized tuning process for the kknn, rpart and ranger learners. By tuning two hyperparameters of each learner, we were able to considerably increase the performance of each learner. Performance comparison through nested resampling provided a solid foundation for choosing the best learner to predict the test set. With a rather simple random forest learner, we made it into the top 15% of Kaggle competition results. Further research can be carried out with regard to the apparent issue of systematic underestimation. A model with more affinity to the risk of a higher count could improve performance. To obtain such a model, one could, for example, try to increase the variability of the model by choosing a smaller min.node.size.

Fanaee-T, Hadi, and Joao Gama. 2013. “Event Labeling Combining Ensemble Detectors and Background Knowledge.” Progress in Artificial Intelligence, 1–15.



For attribution, please cite this work as

Funk, et al. (2020, July 27). mlr3gallery: Bike Sharing Demand. Retrieved from

BibTeX citation

  author = {Funk, Henri and Hofheinz, Andreas and Girnat, Marius and Becker, Marc},
  title = {mlr3gallery: Bike Sharing Demand},
  url = {},
  year = {2020}