This use case provides an example on tuning and benchmarking in mlr3verse using data from Capital Bikeshare.
The following examples were created as part of the Introduction to Machine Learning Lecture at LMU Munich. The goal of the project was to create and compare one or several machine learning pipelines for the problem at hand together with exploratory analysis and an exposition of results. The posts were contributed to the mlr3gallery by the authors and edited for better legibility by the editor. We want to thank the authors for allowing us to publish their results. Note, that correctness of the results can not be guaranteed.
Bike sharing is a network of publicly shared bicycles that can be rented for a certain period of time at different locations within a city and then returned at any station. For the suppliers of those networks, it is important to always have enough bicycles available, whereby the demand depends on a variety of factors. Therefore it is essential to have a forecasting system that predicts the demand taking into account several variables. In this document, we develop such a system using the mlr3verse
package. For our analysis, we used the Kaggle Bike Sharing Demand data set, which was provided by Hadi Fanaee Tork using data from Capital Bikeshare. (Fanaee-T and Gama 2013) After ingesting the data into R we continue with some descriptive statistics for our training data set. We then preprocess the data and carry out feature and target engineering. Afterwards we proceed with the modeling process in which we fit, tune and benchmark the KNN, CART and RandomForest learner of the mlr3verse
package. We conclude with our model prediction for the data.
This tutorial assumes familiarity with the basics of mlr3tuning and mlr3pipelines. Consult the mlr3book if some aspects are not fully understandable.
Note, that expensive calculations are pre-saved in rds files in this tutorial to save computational time.
Note that we use the data from UCI Machinelearning Repository here instead of the Kaggle data set due to License restrictions. The UCI dataset is a slightly adjusted version of the same underlying data.
train_datetime = train$datetime
The Bike Sharing Demand data set contains 10886 rows and 12 columns.
Column Name | Description |
---|---|
datetime | hourly date |
season | \[\begin{cases} 1, \ spring \\ 2, \ summer \\ 3, \ fall \\ 4, \ winter \\ \end{cases}\] |
holiday | \[\begin{cases} 1, \ holiday \\ 0, \ else \\ \end{cases}\] |
workingday | \[\begin{cases} 1, \ neither \ weekend \ nor \ holiday \\ 0, \ else \\ \end{cases}\] |
weather | \[\begin{cases} 1, \ Clear, \ Few \ clouds\\ 2, \ Mist \ and \ Cloudy\\ 3, \ Light \ Snow, \ Light \ Rain, \ Thunderstorm\\ 4, \ Heavy \ Rain, \ Ice \ Pallets\\ \end{cases}\] |
temp | temperature in Celsius |
atemp | ‘feels like’ temperature in Celsius |
humidity | relative humidity |
windspeed | wind speed |
casual | number of non-registered user rentals initiated |
registered | number of registered user rentals initiated |
count | number of total rentals |
Table 1 gives a detailed overview of the columns of our training data set. Besides our target variable count
, we have several weather related variables we can use for our prediction. Moreover, we have the date and hour of the observation as well as holiday and workingday indicators. Since there are no missing values in our data, we do not need to apply any further imputation methods.
The correlation of the continuous variables of our training data is shown in Figure 1. We find that there is an almost perfect positive correlation between atemp
and temp
(\(r = 0.98\)), swhich makes sense, since the ‘feels like’ temperature is very similar to the real temperature. Moreover, there is a positive correlation between count
and atemp
as well as count
and temp
(\(r = 0.39\)). Additionally, we see that our target variable count
is negatively correlated with the humidity
(\(r = - 0.32\)).
Figure 2 shows the Bike Sharing demand over the entire time frame of our training data. We find an overall positive trend with seasonality, whereby the demand is higher in summer than in winter.
As we can see in Figure 3, the bike sharing demand is fairly evenly distributed over a month. Note that we have no values for days after the 19th day of each month in our training data since those days are included in our test data.
It applies that count == casual + registered
.
all.equal(train$count, train$casual + train$registered)
[1] TRUE
So we cannot use casual
and registered
as features to predict count
since we only know them once we also know count
. Therefore we can remove them without affecting the prediction quality. Furthermore the features atem
the ‘feels like’ temperature and temp
the real temperature have a nearly perfect correlation.
cor(train$atem, train$temp)
[1] 0.9849481
Therefore we omit temp
feature during the modeling process to reduce the number of features and the complexity of our data structure. Additionally, we convert nominal and ordinal features to factors and datetime
to POSIXct. The structure of our pre-processed data is shown in Table 2.
train = preprocess(train)
Variable | Class | Mode |
---|---|---|
atemp | -none- | numeric |
count | -none- | numeric |
datetime | POSIXct | numeric |
holiday | factor | numeric |
humidity | -none- | numeric |
season | factor | numeric |
weather | factor | numeric |
windspeed | -none- | numeric |
workingday | factor | numeric |
Since we have one observation at each time, each time stamp (datetime
) is a unique value and therefore has no predictive quality. However, if we split it into hour
, weekday
, month
and year
we obtain useful predictors. The structure of our pre-processed training data is visualized in Table 3.
Note, that PipeOpDateFeatures provides an alternative to engineering such features by automatically extracting a set of date-related features from a POSIXct
date variable. Note: In this scenario, we assume that we are not interested in forecasting the demand for the following years (provided the other information), but we instead try to obtain a model that describes the existing data as well as possible. In a forecasting scenario, we would need to use a different time-sensitvie resampling strategy in order to assure that our model generalizes to the data we are interested in predicting (the future).
train = engineer_features(train)
Variable | Class | Mode |
---|---|---|
atemp | -none- | numeric |
count | -none- | numeric |
holiday | factor | numeric |
hour | factor | numeric |
humidity | -none- | numeric |
month | factor | numeric |
season | factor | numeric |
weather | factor | numeric |
weekday | factor | numeric |
windspeed | -none- | numeric |
workingday | factor | numeric |
year | factor | numeric |
In Figure 4 we use our engineered hour
variable to plot the Bike Sharing demand by hour and temperature. Note, that the felt temperature atemp
is normalized. We find that the highest demand occurs between 18h and 20h at higher temperatures.
As mentioned in our descriptive analysis (chapter 2.2), we find that our target variable count
is highly right-skewed. Since this could cause problems during model fitting, we perform a \(z = log(x + 1)\) transformation. In the right plot in Figure 5, we see that the distribution of our transformed count
variable is much less skewed than before.
For predictions we have to transform it back to the original scale with \(x = exp(z) - 1\).
(Note that \(exp(log(x + 1)) - 1 = x\)) mlr3 simplifies this process by offering Pipeline Operators (PipeOps for short). These PipeOps
are computational steps that can be arranged into Pipelines to manage data flow in mlr3.
Using PipeOpTargetTrafo
we can dynamically transform our target before the prediction and transform it back into our original scale afterwards.
log1p_target_trafo = function(learner, graph = TRUE) {
ppl = ppl("targettrafo", graph = learner)
ppl$param_set$values$targettrafosimple.trafo = function(x) log1p(x)
ppl$param_set$values$targettrafosimple.inverter = function(x) expm1(x)
if (graph == TRUE) return(GraphLearner$new(ppl, task_type = "regr"))
ppl
}
First, we have to initialize our bike task. As back-end, we use our training data and as target variable count
.
bike_task = TaskRegr$new("bike_sharing", backend = train, target = "count")
mlr3
The mlr3viz
package enables the user to quickly create insightful descriptive graphics. We use this function to compare the values of our target variable count
over the seasons and between holidays and other days. The results are visualized in Figure 6. We notice that there is almost no difference between the seasons with the exception of spring when the value of count
is lower. Moreover, there is no real change between holidays and other days.
To examine relationships between variables in train data, we can also use the autoplot
function and set the type
argument to “pairs”. We make use of this option to further investigate the relationship between atemp
, windspeed
and count
and compare it over the two years of our data. The result is visualized in Figure 7. In the upper left graph, we notice that there the value of log1p_count
is higher in 2012 than in 2011. In contrast, the distribtions of atemp
and windspeed
do not seem to differ between the years. Also, the correlations of the variables are fairly similar in both years.
In our modeling process, we consider three learners: kknn
, rpart
and ranger
. For parameter optimization, we use the so-called autotuner
. An Autotuner
is a learner, that tunes its hyperparameters during training. The technical implementation is provided by mlr3tuning
. For more details see the mlr3manual.
In order to achieve maximum comparability between the learners the Resampling
objects and which enable us to measure a learner-configuration’s performance as well as the tuning strategy are kept fixed for all compared learners. If a learner’s settings differ from the general case, the reason is given in the section on the autotuner
concerned.
nested resampling
is a method to obtain an unbiased performance estimation for each learner. For more information see Nested Resamling. During our modeling process, we use 5-fold CV for inner resampling
to avoid overfitting when optimizing the model parameters (resampling_cv_5
). In order to avoid overfitting during hyperparameter tuning, we use 3-fold CV as outer resampling
method (resampling_outer_cv3
). Nested resampling is done in our Chapter 5 “Nested Resampling” where we evaluate and compare the performance of the learners in a single Graph Learner.
measure
defines the loss function that the autotuner evaluates during training. As the Kaggle competition requires participants to minimize the RMSLE
we use the RMSE measure
since our target variable is already log-transformed.
tuning method
will be Grid Search with a grid resolution of 10. As we chose two hyperparameters to tune over, the two-dimensional grid consists of \(10^{2} = 100\) configurations. The tuner grid is called tuner_grid_10
.
terminator
limits the tuning process to 20 hyperparameter combinations, which will be drawn randomly from the tuner_grid_10
. The terminator is called term_evals20
.
As the computational time of nested resampling
is high and tuning over multiple parameters would be very complex, the autotuners
are be restricted to two hyperparameters, that seem to have a strong influence on the regr.rmse
. The hyperparameter space of numeric hyperparameter is set around their default values for tuning.
Resampling and Terminator during auto tuning are set to relatively small iterations to save up computational effort. In regular cases we would suggest using 5- or 10-fold crossvalidiation. Auto tuning should exploit enough grid points to give the user an impression about the whole hyperparameter space.
kknn
is a k-nearest-neighbor learner whose hyperparameters are shown in Table 4.
kknn_learner = lrn("regr.kknn")
kknn_learner$param_set
id | lower | upper | levels | default |
---|---|---|---|---|
k | 1 | Inf | NULL | 7 |
distance | 0 | Inf | NULL | 2 |
kernel | NA | NA | rectangular , triangular , epanechnikov, biweight , triweight , cos , inv , gaussian , rank , optimal | optimal |
scale | NA | NA | TRUE, FALSE | TRUE |
ykernel | NA | NA | NULL | NULL |
For simplicity we tune only two of those parameters:
k
> Number of neighbors considered.
distance
> Parameter of Minkowski distance
For more information see the kknn documentation.
Now we set up the autotuner for the kknn
learner and train it.
set.seed(123456)
ppl_kknn = log1p_target_trafo(kknn_learner)
at_kknn = AutoTuner$new(ppl_kknn,
resampling = resampling_cv_5,
measure = measures,
search_space = param_set_kknn,
terminator = term_evals20,
tuner = tuner_grid_10
)
at_kknn$train(bike_task)
archive_kknn = at_kknn$archive$data[, c("regr.rmse",
"regr.kknn.k",
"regr.kknn.distance")]
In Figure 8 we see the RMSE for the considered hyperparameter of the kknn
autotuner. It seems that a higher order p
of the Minkowski distance leads to a lower RMSE
. Moreover, we see that the RMSE
is lowest around k = 10
neighbors. For smaller k
the estimation might be too wiggly, for higher k
it might be too global (Bias-Variance trade off). The minimal RMSE that the autotuner
could find is given for p = 3
and k = 12
.
rpart
is CART learner whose hyperparameters are shown in Table 5.
rpart_learner = lrn("regr.rpart")
rpart_learner$param_set
id | lower | upper | levels | default |
---|---|---|---|---|
minsplit | 1 | Inf | NULL | 20 |
minbucket | 1 | Inf | NULL | <environment: 0x561283d6ede0> |
cp | 0 | 1 | NULL | 0.01 |
maxcompete | 0 | Inf | NULL | 4 |
maxsurrogate | 0 | Inf | NULL | 5 |
maxdepth | 1 | 30 | NULL | 30 |
usesurrogate | 0 | 2 | NULL | 2 |
surrogatestyle | 0 | 1 | NULL | 0 |
xval | 0 | Inf | NULL | 10 |
keep_model | NA | NA | TRUE, FALSE | FALSE |
Again, for simplicity, we tune only two of those parameters:
minsplit
> the minimum number of observations that must exist in a node in order for a split to be attempted.
cp
> complexity parameter. Any split that does not decrease the overall lack of fit by a factor of cp is not attempted. […]
For more information browse the rpart Documentation.
Now we set up the autotuner for the rpart
learner and train it.
set.seed(123456)
ppl_rpart = log1p_target_trafo(rpart_learner)
at_rpart = AutoTuner$new(ppl_rpart,
resampling = resampling_cv_5,
measure = measures,
search_space = param_set_cart,
terminator = term_evals20,
tuner = tuner_grid_10
)
at_rpart$train(bike_task)
archive_rpart = at_rpart$archive$data[, c("regr.rmse",
"regr.rpart.minsplit",
"regr.rpart.cp")]
In Figure 9 we see the RMSE for the considered hyperparameter of the rpart
autotuner. The learner performs the worst if we set the minimum size of nodes in minsplit < 3
. For splitting criteria minsplit > 3
seems to have no significant influence on the RMSE in the given hyperparameter space. In contrast, we find that for relatively low cp
the error decreases considerably. Lowering cp
reduces the level of pruning, resulting in larger trees. We reach the minimal regr.rmse
at the lowest considered cp = 0.001
.
ranger
is a Random Forest Learner whose hyperparameters are shown in Table 6.
ranger_learner = lrn("regr.ranger", importance = "impurity")
ranger_learner$param_set
id | levels |
---|---|
num.trees | NULL |
mtry | NULL |
importance | none , impurity , impurity_corrected, permutation |
… | … |
seed | NULL |
quantreg | TRUE, FALSE |
se.method | jack , infjack |
Feature Importance
First, we take advantage of the ability of the ranger_learner
to calculate the importance of its features. The importance filter
implemented in mlr3filters
provides the possibility to access this calculation via mlr3
syntax.
filter_ranger = flt("importance", learner = ranger_learner)
filter_ranger$calculate(bike_task)
feature_importance = as.data.table(filter_ranger)
In Figure 10 we see there is a difference in features. Some features don’t seem to make a contribution to the model performance that is worthwhile to implement to the learner’s architecture. To test this hypothesis we are going to set up a learner that can test different feature combinations.
We use the features in decreasing order of feature importance. Say, if we use \(n\) features to predict the target we use the \(n\) most important features according to their impurity
. Therefore, we set up a graph learner in the next step.
Ranger Pipe
Setting up the Ranger Pipe enables us to compare models with different feature combinations. Therefore we use the mlr3
packages mlr3pipelines
and mlr3filters
and create a new learner ranger_feature
. This learner is now able to select the most important features from the impurity filtering.
Tuning Random Forest
In addition to the regular ranger
features, we are now able to tune on any parameter from the importance PipeOp
.
Again, for simplicity, we tune only two of those parameters:
regr.ranger.mtry
> Number of variables to possibly split at in each node. Default is the (rounded down) square root of the number variables. […]
importance.filter.nfeat
> A param from importance
that allows to control the size of features computed in the model.
For more information see the ranger documentation and mlr3 pipelines.
To ensure that regr.ranger.mtry
never exceeds importance.filter.nfeat
, we define the search space for the tuner manually. We constraint the possible number of features to the 1, 6 , 10 and 11 most important feature(s). The resulting grid shown in Figure 11, expands to over 28 combinations to tune our learner on.
grid = generate_design_grid(param_set_ranger, resolution = 11)
grid$data = grid$data[regr.ranger.mtry <= importance.filter.nfeat]
grid$data = grid$data[importance.filter.nfeat %in% c(1, 6, 10, 11)]
To ensure that regr.ranger.mtry
never exceeds, importance.filter.nfeat
we define a search space. Since the space for hyperparameters is limited, one could consider here to exploit the whole parameter space for nfeat
and mtry
. To save up time we will stick to 28 combinations here, including all possible combinations at 1, 6, 10 and 11 levels.
set.seed(123456)
ranger_feature = log1p_target_trafo(po_flt)
at_ranger = AutoTuner$new(ranger_feature,
resampling = resampling_cv_5,
measure = measures,
search_space = param_set_ranger,
terminator = trm("evals", n_evals = 28L),
tuner = mlr3tuning::tnr("design_points", design = grid$data)
)
# at_ranger$store_tuning_instance = FALSE
at_ranger$train(bike_task)
archive_ranger = at_ranger$archive$data[, c("regr.rmse",
"importance.filter.nfeat",
"regr.ranger.mtry")]
In Figure 12 we see the RMSE for the evaluated hyperparameter of the ranger autotuner. With increasing mtry
and the number features
used in the model, the RMSE decreases. The best considered combination is achieved with 11 features
and mtry = 10
. For higher values of mtry
and number of features features
the RMSE
is slightly higher.
Benchmarking has two objectives. First, we want to obtain a performance measure for all the fitted models through resampling.
Second, we want to compare the models with each other and examine how tuning improved each of them. As stated in Chapter 3.3 we use a 3-fold CV as our benchmarking resampling method, a 5-fold CV is evaluated inside each auto tuned object to obtain the best hyperparameter combination. Note, that this method does not lead to an unbiased GE, but rather is a step to compare the learners to each other.
lrns = list(
rpart_learner, at_rpart,
ranger_learner, at_ranger,
kknn_learner, at_kknn
)
design = benchmark_grid(
tasks = bike_task,
learners = lrns,
resamplings = resampling_outer_cv3
)
set.seed(123456)
bmr = benchmark(design)
bmr_table = bmr$aggregate(msr("regr.rmse")) %>%
select(-c(nr, resample_result, task_id))
learner_id | resampling_id | iters | regr.rmse |
---|---|---|---|
targettrafosimple.importance.regr.ranger.targetinverter.tuned | cv | 3 | 44.09 |
regr.ranger | cv | 3 | 61.96 |
targettrafosimple.regr.kknn.targetinverter.tuned | cv | 3 | 76.42 |
regr.rpart | cv | 3 | 89.73 |
targettrafosimple.regr.rpart.targetinverter.tuned | cv | 3 | 102.25 |
regr.kknn | cv | 3 | 120.60 |
Tuning enables us to optimize hyperparameters in our model via trial and error. For more details see mlr3tuning. To evaluate each tuned learners’ performance and compare it, we set up benchmarking. Results are shown in Table 7 and Figure 13. In the boxplot resulting RMSE from the outer resampling are shown for each tuned learner from chapter 3.3 and its non-tuned equivalent. The benchmarking resampling validates our results from hyperparameter tuning for each autotuner in a 3-fold-cross-validation. We find that with auto tuning each of the learners could increase its performance compared to its’ equivalent that is using default hyperparameters. However, for rpart
not only performance is increasing but also variance. That does not apply to the tuned kknn
and ranger
learners for which the results seem to be comparatively stable. With a RMSE of 0.31 the tuned random forest performs the best among tuned and untuned learners compared in benchmarking. Consequently, the tuned ranger
is our learner of choice for prediction.
png("bmr.png")
autoplot(bmr, measure = msr("regr.rmse")) +
theme_bw() +
scale_x_discrete(labels = c("KNN", "Random Forest", "CART",
"Random Forest Tuned" , "KNN Tuned", "CART Tuned")) +
labs(y = "RMSE") +
coord_flip() +
theme(axis.title = element_text(size = 9))
dev.off()
Benchmarking is a fast and simple way to compare different approaches to each other. However, from a theoretical perspective, this approach does not lead to unbiased general error estimators. Since we only used three different learners for benchmarking this bias should be insignificant. Nevertheless, we would like to provide a showcase for how to obtain unbiased results with more advanced techniques that are available in mlr3
. Therefore we evaluate the performance of our learners by “true” unbiased nested resampling. In this case, tuning and model selection is only performed during inner resampling. The purpose of outer resampling is to evaluate the unbiased score from inner resampling. This method is also more exhaustive, as it tunes the graph learner in each of the three outer resampling CVs again. Benchmarking on the other hand, “only” tunes each autotuner once and cross-validates with the optimized hyperparameters from the first iteration.
We create a simple branched pipe, with our already known learners from the previous chapters. Note that we do not implement the autotuners but the actual learners here, since we aim to tune over the whole pipeline as a learner here, rather than over each learner individually. Therefore our selected learners kknn
, rpart
and importance.ranger
can be considered as hyperparameters in a more universal graph-learner glrn
. For more information see mlr3pipelines.
A few words should be said about this rather complex parameter set:
importance.ranger
, for which we tune over \(\{6, 10, 11\}\) features because the autotuners had the best performance at these values.k
or distance
can only be called if the selected branch is regr.kknn
. The same applies to the parameters of the other learners. For this purpose we add dependencies to the parameter.ParameterSet
nfeat
is set to levels
. This is used to control the levels, a grid can tune over. mtry
is initialized in the transformation depending on the randomly chosen nfeat
.ps = ParamSet$new(list(
ParamInt$new("branch.selection", lower = 1L, upper = 3L),
ParamInt$new("targettrafosimple.regr.kknn.targetinverter.regr.kknn.k", lower = 1L, upper = 50L),
ParamInt$new("targettrafosimple.regr.kknn.targetinverter.regr.kknn.distance", lower = 1L, upper = 3L),
ParamInt$new("targettrafosimple.regr.rpart.targetinverter.regr.rpart.minsplit", lower = 1L, upper = 30L),
ParamDbl$new("targettrafosimple.regr.rpart.targetinverter.regr.rpart.cp", lower = 0.001, upper = 0.1),
ParamFct$new("importance.filter.nfeat", levels = c("6", "9", "11")),
ParamInt$new("importance.mtry", lower = 7L, upper = 11L)
))
apply(
cbind(rep(1L:3L, each = 2), ps$ids()[-1]),
MARGIN = 1, function(x) ps$add_dep(x[2], "branch.selection", CondEqual$new(as.integer(x[1])))
)
ps$trafo = function(x, param_set) {
if (x$branch.selection == 3L) {
x$importance.filter.nfeat = as.integer(x$importance.filter.nfeat)
x$importance.mtry = ceiling(x$importance.mtry/11 * x$importance.filter.nfeat)
}
x
}
Now we can perform nested resampling using our previously created pipe. We directly resample this autotuner to reduce computing time. We can save the tuned learners in our resampling result by setting store_model = TRUE
.
multi_at = AutoTuner$new(
learner = GraphLearner$new(pipe, task_type = "regr"),
resampling = rsmp("cv", folds = 5),
measure = measures,
search_space = ps,
term = trm("evals", n_evals = 21),
tuner = mlr3tuning::tnr("random_search")
)
set.seed(12345)
multi_at_rr = resample(bike_task, multi_at, resampling_outer_cv3, store_models = TRUE)
#arrange archive
cols = c("regr.rmse", "branch.selection",
"targettrafosimple.regr.kknn.targetinverter.regr.kknn.k",
"targettrafosimple.regr.kknn.targetinverter.regr.kknn.distance",
"targettrafosimple.regr.rpart.targetinverter.regr.rpart.minsplit",
"targettrafosimple.regr.rpart.targetinverter.regr.rpart.cp",
"importance.filter.nfeat", "importance.mtry")
multi_at_archiv = lapply(
1:3,
function(x) multi_at_rr$data$learner[[x]]$archive$data[, ..cols]
)
RMSE | learner-branch | k | distance | minsplit | cp | nfeat | mtry |
---|---|---|---|---|---|---|---|
Outer CV-1 | |||||||
148.46285 | 2 | NA | NA | 17 | 0.0335980 | NA | NA |
78.19935 | 3 | NA | NA | NA | NA | 11 | 10 |
156.95734 | 2 | NA | NA | 22 | 0.0407070 | NA | NA |
148.46285 | 2 | NA | NA | 5 | 0.0306334 | NA | NA |
131.29139 | 1 | 45 | 3 | NA | NA | NA | NA |
77.70967 | 3 | NA | NA | NA | NA | 11 | 9 |
70.69794 | 3 | NA | NA | NA | NA | 9 | 11 |
124.32739 | 1 | 30 | 3 | NA | NA | NA | NA |
204.37702 | 2 | NA | NA | 1 | 0.0218316 | NA | NA |
72.40151 | 3 | NA | NA | NA | NA | 6 | 8 |
159.44030 | 2 | NA | NA | 5 | 0.0916700 | NA | NA |
128.91845 | 1 | 39 | 3 | NA | NA | NA | NA |
159.44030 | 2 | NA | NA | 17 | 0.0641958 | NA | NA |
148.37438 | 1 | 22 | 1 | NA | NA | NA | NA |
159.44030 | 2 | NA | NA | 8 | 0.0745519 | NA | NA |
70.86133 | 3 | NA | NA | NA | NA | 9 | 7 |
72.21180 | 3 | NA | NA | NA | NA | 6 | 7 |
144.47091 | 1 | 25 | 2 | NA | NA | NA | NA |
159.44030 | 2 | NA | NA | 28 | 0.0759690 | NA | NA |
77.22525 | 3 | NA | NA | NA | NA | 6 | 11 |
72.41628 | 3 | NA | NA | NA | NA | 6 | 8 |
Outer CV-2 | |||||||
70.90703 | 3 | NA | NA | NA | NA | 9 | 11 |
128.60097 | 2 | NA | NA | 29 | 0.0111552 | NA | NA |
73.32828 | 3 | NA | NA | NA | NA | 6 | 9 |
155.05227 | 1 | 45 | 1 | NA | NA | NA | NA |
145.04972 | 1 | 25 | 2 | NA | NA | NA | NA |
133.32470 | 1 | 6 | 2 | NA | NA | NA | NA |
78.19142 | 3 | NA | NA | NA | NA | 11 | 9 |
126.72121 | 2 | NA | NA | 15 | 0.0076925 | NA | NA |
144.14502 | 1 | 8 | 1 | NA | NA | NA | NA |
174.11230 | 1 | 1 | 1 | NA | NA | NA | NA |
73.28386 | 3 | NA | NA | NA | NA | 6 | 9 |
144.89675 | 2 | NA | NA | 12 | 0.0239967 | NA | NA |
78.08017 | 3 | NA | NA | NA | NA | 11 | 7 |
73.09981 | 3 | NA | NA | NA | NA | 6 | 10 |
159.00586 | 2 | NA | NA | 15 | 0.0560581 | NA | NA |
78.12298 | 3 | NA | NA | NA | NA | 11 | 8 |
147.90780 | 1 | 17 | 1 | NA | NA | NA | NA |
155.53159 | 1 | 48 | 1 | NA | NA | NA | NA |
159.33757 | 2 | NA | NA | 3 | 0.0810964 | NA | NA |
127.10846 | 1 | 34 | 3 | NA | NA | NA | NA |
159.33757 | 2 | NA | NA | 5 | 0.0593319 | NA | NA |
Outer CV-3 | |||||||
148.38511 | 1 | 27 | 1 | NA | NA | NA | NA |
157.53131 | 2 | NA | NA | 20 | 0.0900243 | NA | NA |
76.41411 | 3 | NA | NA | NA | NA | 11 | 7 |
130.27486 | 1 | 45 | 3 | NA | NA | NA | NA |
69.57183 | 3 | NA | NA | NA | NA | 9 | 7 |
130.47776 | 2 | NA | NA | 9 | 0.0116256 | NA | NA |
157.53131 | 2 | NA | NA | 24 | 0.0761248 | NA | NA |
157.53131 | 2 | NA | NA | 8 | 0.0848399 | NA | NA |
146.32105 | 1 | 36 | 2 | NA | NA | NA | NA |
157.53131 | 2 | NA | NA | 18 | 0.0828409 | NA | NA |
141.57592 | 2 | NA | NA | 25 | 0.0202572 | NA | NA |
120.43218 | 1 | 26 | 3 | NA | NA | NA | NA |
135.86966 | 1 | 12 | 2 | NA | NA | NA | NA |
157.53131 | 2 | NA | NA | 4 | 0.0706477 | NA | NA |
147.61153 | 1 | 24 | 1 | NA | NA | NA | NA |
157.53131 | 2 | NA | NA | 6 | 0.0708308 | NA | NA |
202.20482 | 2 | NA | NA | 1 | 0.0132166 | NA | NA |
157.53131 | 2 | NA | NA | 5 | 0.0715144 | NA | NA |
76.33956 | 3 | NA | NA | NA | NA | 11 | 7 |
157.11866 | 2 | NA | NA | 4 | 0.0402888 | NA | NA |
157.53131 | 2 | NA | NA | 18 | 0.0963503 | NA | NA |
The table shows the results on the first of three outer resampling results. Nested resampling carries out very similar results as Benchmarking in the previous chapter.
Now we use our tuned random forest model to predict the Bike Sharing demand for the train and test data.
Before we can make the predictions, we have to preprocess the test data and engineer features in the same way we did for our training data.
test_datetime = test$datetime
test = preprocess(test)
test = engineer_features(test)
Now we can use our random forest autotuner
to predict the log1p_count
variable.
pred_train = at_ranger$predict(bike_task)
pred_test = at_ranger$predict_newdata(test)
A visual comparison between the predicted demand with the actual demand is shown in Figure 14. The points depict the actual values and the green bars show our predicted values. The spaces with no points are the 19th day until the end of each month for which we only have test data. We find that overall the model predictions are quite close to the actual values and that the forecast seems reasonable for the test data.
In Figure 15, we compare the predictions with the actual data grouped by the days of the month. Again, we find that the predictions are very close to the actual values. However, the model seems to systematically underestimate the actual values slightly.
The mlr3verse
makes the process of model fitting, tuning, benchmarking and nested resampling intuitive and fast. The included packages provide access to a set of commonly used learners and their train and predict methods using R6 classes. Moreover one can increase the capabilities of learners of the original packages through tuning, benchmarking as well as imputing and even ensembling. mlr3
provides these methods in a simple syntax to create understandable and comparable code architectures. Using these resources we set up a generalized tuning process for kknn
, rpart
and ranger
learners. By tuning two hyperparameters of each learner, we were able to considerably increase performance of each learner. Performance comparison through nested resampling provided a solid foundation for choosing the best learner to predict the test set. With a rather simple random forest learner, we made it into the top 15% of Kaggle competition results. Further research can be carried out with regard to the apparent issue of systematic underestimation. A model with more affinity to the risk of a higher count
could improve performance. To obtain such a model, one could, for example, try to increase the variability of the model by choosing a smaller min.nodesize
.
Fanaee-T, Hadi, and Joao Gama. 2013. “Event Labeling Combining Ensemble Detectors and Background Knowledge.” Progress in Artificial Intelligence, 1–15.
For attribution, please cite this work as
Funk, et al. (2020, July 27). mlr3gallery: Bike Sharing Demand - Use Case. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-07-27-bikesharing-demand/
BibTeX citation
@misc{funk2020bike, author = {Funk, Henri and Hofheinz, Andreas and Girnat, Marius}, title = {mlr3gallery: Bike Sharing Demand - Use Case}, url = {https://mlr3gallery.mlr-org.com/posts/2020-07-27-bikesharing-demand/}, year = {2020} }