We show how to use mlr3pipelines to do regression chains.

In this tutorial we demonstrate how to use mlr3pipelines to handle multi-target regression by arranging regression models as a chain, i.e., creating a linear sequence of regression models.

In a simple regression chain, regression models are arranged in a linear sequence. Here, the first model will use the input to predict a single output and the second model will use the input and the prediction output of the first model to make its own prediction and so on. For more details, see e.g. Spyromitros-Xioufis et al. (2016).

The following sections describe an approach towards working with `Tasks`

that have multiple targets. E.g., in the example below, we have three target variables \(y_{1}\) to \(y_{3}\). This type of `Task`

can be created via the mlr3multioutput package (currently under development) in the future. `mlr3multioutput`

will also offer simple chaining approaches as pre-built pipelines (so called `ppls`

). The current goal of this post is to show how such modeling steps can be written as a relatively small amount of pipeline steps and how such steps can be put together. Writing pipelines with such steps allows for great flexibility in modeling more complicated scenarios such as the ones described below.

In the following, we rely on some toy data. We simulate 100 responses to three target variables, \(y_{1}\), \(y_{2}\), and \(y_{3}\) following a multivariate normal distribution with a mean and covariance matrix of:

```
y1 y2 y3
1 2 3
```

```
(sigma = matrix(c(1, -0.5, 0.25, -0.5, 1, -0.25, 0.25, -0.25, 1),
nrow = 3, ncol = 3, byrow = TRUE))
```

```
[,1] [,2] [,3]
[1,] 1.00 -0.50 0.25
[2,] -0.50 1.00 -0.25
[3,] 0.25 -0.25 1.00
```

```
Y = rmvnorm(n, mean = mean, sigma = sigma)
```

The feature variables \(x_{1}\), and \(x_{2}\) are simulated as follows: \(x_{1}\) is simply given by \(y_{1}\) and an independent normally distributed error term and \(x_{2}\) is given by \(y_{2}\) and an independent normally distributed error term.

The final data is given as:

```
dat = as.data.table(cbind(Y, x1, x2))
str(dat)
```

```
Classes 'data.table' and 'data.frame': 100 obs. of 5 variables:
$ y1: num 0.681 1.836 0.355 1.783 0.974 ...
$ y2: num 2.33 1.735 3.126 0.691 1.573 ...
$ y3: num 3.19 3.14 2.74 4.31 2.77 ...
$ x1: num 0.788 1.754 0.174 1.844 1.05 ...
$ x2: num 2.336 1.665 2.967 0.651 1.634 ...
- attr(*, ".internal.selfref")=<externalptr>
```

This simulates a situation where we have multiple target variables that are correlated with each other, such that predicting them along with each other can improve the resulting prediction model. As a real-world example for such a situation, consider e.g. hospital data, where time spent in the ICU (not known a priori) heavily influences the cost incurred by a patient’s treatment.

If you feel confident to already have a good feeling of the data, feel free to skip this section. If not, you can use the rgl package to play around with the following four 3D plots with either the feature variables or \(y_{1}\) and \(y_{2}\) on the x- and y-axis and the target variables on the respective z-axes:

```
library(rgl)
colfun = colorRampPalette(c("#161B1D", "#ADD8E6"))
```

```
setorder(dat, y1)
plot3d(dat$x1, dat$x2, dat$y1,
xlab = "x1", ylab = "x2", zlab = "y1",
type = "s", radius = 0.1, col = colfun(n))
```

```
setorder(dat, y2)
plot3d(dat$x1, dat$x2, dat$y2,
xlab = "x1", ylab = "x2", zlab = "y2",
type = "s", radius = 0.1, col = colfun(n))
```

```
setorder(dat, y3)
plot3d(dat$x1, dat$x2, dat$y3,
xlab = "x1", ylab = "x2", zlab = "y3",
type = "s", radius = 0.1, col = colfun(n))
```

```
setorder(dat, y3)
plot3d(dat$y1, dat$y2, dat$y3,
xlab = "y1", ylab = "y2", zlab = "y3",
type = "s", radius = 0.1, col = colfun(n))
```

In our regression chain, the first model will predict \(y_{1}\). Therefore, we initialize our `Task`

with respect to this target:

```
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)
task = TaskRegr$new("multiregression", backend = dat, target = "y1")
```

As `Learners`

we will use simple `linear regression models`

. Our pipeline building the regression chain then has to do the following:

- Use the input to predict \(y_{1}\) within the first learner (i.e., \(y_{1} \sim x_{1} + x_{2}\)).
- Combine the input with the prediction of \(y_{1}\), \(\hat{y_{1}}\) and use this to predict \(y_{2}\) within the second learner (i.e., \(y_{2} \sim x_{1} + x_{2} + \hat{y_{1}}\)).
- Combine the input with the prediction of \(y_{2}\) and use this to predict \(y_{3}\) within the final third learner (i.e., \(y_{3} \sim x_{1} + x_{2} + \hat{y_{1}} + \hat{y_{2}}\)).

To combine predictions of a `Learner`

with the previous input, we rely on `PipeOpLearnerCV`

and `PipeOpNOP`

arranged in parallel via `gunion`

combined via `PipeOpFeatureUnion`

. To drop the respective remaining target variables as features, we rely on `PipeOpColRoles`

. The first step of predicting \(y_{1}\) looks like the following:

```
step1 = po("copy", outnum = 2, id = "copy1") %>>%
gunion(list(
po("colroles", id = "drop_y2_y3",
param_vals = list(new_role = list(y2 = character(), y3 = character()))) %>>%
po("learner_cv", learner = lrn("regr.lm"), id = "y1_learner"),
po("nop", id = "nop1")
)) %>>%
po("featureunion", id = "union1")
step1$plot()
```

Training using the input `Task`

, shows us how the output and the `$state`

look like:

```
step1_out = step1$train(task)[[1]]
step1_out
```

```
<TaskRegr:multiregression> (100 x 6)
* Target: y1
* Properties: -
* Features (5):
- dbl (5): x1, x2, y1_learner.response, y2, y3
```

```
step1$state
```

```
$copy1
list()
$drop_y2_y3
$drop_y2_y3$dt_columns
[1] "x1" "x2" "y2" "y3"
$drop_y2_y3$affected_cols
[1] "x1" "x2" "y2" "y3"
$drop_y2_y3$intasklayout
id type
1: x1 numeric
2: x2 numeric
3: y2 numeric
4: y3 numeric
$drop_y2_y3$outtasklayout
id type
1: x1 numeric
2: x2 numeric
$drop_y2_y3$outtaskshell
Empty data.table (0 rows and 3 cols): y1,x1,x2
$y1_learner
$y1_learner$model
Call:
stats::lm(formula = task$formula(), data = task$data())
Coefficients:
(Intercept) x1 x2
-0.03762 0.99851 0.01364
$y1_learner$log
Empty data.table (0 rows and 3 cols): stage,class,msg
$y1_learner$train_time
[1] 0.005
$y1_learner$train_task
<TaskRegr:multiregression> (100 x 3)
* Target: y1
* Properties: -
* Features (2):
- dbl (2): x1, x2
$y1_learner$affected_cols
[1] "x1" "x2"
$y1_learner$intasklayout
id type
1: x1 numeric
2: x2 numeric
$y1_learner$outtasklayout
id type
1: y1_learner.response numeric
$y1_learner$outtaskshell
Empty data.table (0 rows and 2 cols): y1,y1_learner.response
$nop1
list()
$union1
list()
```

Within the second step we then have to define \(y_{2}\) as the new target. This can be done using `PipeOpUpdateTarget`

(note that `PipeOpUpdateTarget`

currently is not exported but will be in a future version). By default, `PipeOpUpdateTarget`

drops the original target from the feature set, here \(y_{1}\).

```
mlr_pipeops$add("update_target", mlr3pipelines:::PipeOpUpdateTarget)
```

```
step2 = po("update_target", id = "y2_target",
param_vals = list(new_target_name = "y2")) %>>%
po("copy", outnum = 2, id = "copy2") %>>%
gunion(list(
po("colroles", id = "drop_y3",
param_vals = list(new_role = list(y3 = character()))) %>>%
po("learner_cv", learner = lrn("regr.lm"), id = "y2_learner"),
po("nop", id = "nop2")
)) %>>%
po("featureunion", id = "union2")
```

Again, we can train to see how the output and `$state`

look like, but now using the output of `step1`

as the input:

```
step2_out = step2$train(step1_out)[[1]]
step2_out
```

```
<TaskRegr:multiregression> (100 x 6)
* Target: y2
* Properties: -
* Features (5):
- dbl (5): x1, x2, y1_learner.response, y2_learner.response, y3
```

```
step2$state
```

```
$y2_target
list()
$copy2
list()
$drop_y3
$drop_y3$dt_columns
[1] "x1" "x2" "y1_learner.response"
[4] "y3"
$drop_y3$affected_cols
[1] "y1_learner.response" "x1" "x2"
[4] "y3"
$drop_y3$intasklayout
id type
1: x1 numeric
2: x2 numeric
3: y1_learner.response numeric
4: y3 numeric
$drop_y3$outtasklayout
id type
1: x1 numeric
2: x2 numeric
3: y1_learner.response numeric
$drop_y3$outtaskshell
Empty data.table (0 rows and 4 cols): y2,y1_learner.response,x1,x2
$y2_learner
$y2_learner$model
Call:
stats::lm(formula = task$formula(), data = task$data())
Coefficients:
(Intercept) y1_learner.response x1
0.07135 0.22773 -0.25186
x2
0.97877
$y2_learner$log
Empty data.table (0 rows and 3 cols): stage,class,msg
$y2_learner$train_time
[1] 0.012
$y2_learner$train_task
<TaskRegr:multiregression> (100 x 4)
* Target: y2
* Properties: -
* Features (3):
- dbl (3): x1, x2, y1_learner.response
$y2_learner$affected_cols
[1] "y1_learner.response" "x1" "x2"
$y2_learner$intasklayout
id type
1: x1 numeric
2: x2 numeric
3: y1_learner.response numeric
$y2_learner$outtasklayout
id type
1: y2_learner.response numeric
$y2_learner$outtaskshell
Empty data.table (0 rows and 2 cols): y2,y2_learner.response
$nop2
list()
$union2
list()
```

In the final third step we define \(y_{3}\) as the new target (again, `PipeOpUpdateTarget`

drops the previous original target from the feature set, here \(y_{2}\)):

Using the output of `step2`

as input:

```
step3_out = step3$train(step2_out)[[1]]
step3_out
```

`NULL`

```
step3$state
```

```
$y3_target
list()
$y3_learner
$y3_learner$model
Call:
stats::lm(formula = task$formula(), data = task$data())
Coefficients:
(Intercept) y2_learner.response y1_learner.response
2.6445 0.8155 3.8776
x1 x2
-3.5217 -0.7304
$y3_learner$log
Empty data.table (0 rows and 3 cols): stage,class,msg
$y3_learner$train_time
[1] 0.013
$y3_learner$train_task
<TaskRegr:multiregression> (100 x 5)
* Target: y3
* Properties: -
* Features (4):
- dbl (4): x1, x2, y1_learner.response, y2_learner.response
```

The complete pipeline, more precisely `Graph`

, looks like the following:

```
graph = step1 %>>% step2 %>>% step3
graph$plot()
```

By wrapping our `Graph`

in a `GraphLearner`

, we can perform `3-fold cross-validation`

and get an estimated average of the root-mean-square error (of course, in a real world setting splitting the data in a training and test set should have been done):

```
learner = GraphLearner$new(graph)
rr = resample(task, learner = learner, resampling = rsmp("cv", folds = 3))
rr$aggregate(msr("regr.mse"))
```

```
regr.mse
0.7265587
```

For completeness, we also show how a prediction step without having any target variable data available would look like:

```
dat_predict = as.data.table(cbind(x1, x2, y1 = NA, y2 = NA, y3 = NA))
learner$train(task)
learner$predict_newdata(dat_predict)
```

```
<PredictionRegr> for 100 observations:
row_ids truth response
1 NA 3.116960
2 NA 3.327345
3 NA 3.010821
---
98 NA 3.462541
99 NA 3.020585
100 NA 3.664326
```

Note that we have to initialize the `Task`

with \(y_{1}\) as the target but the pipeline will automatically predict \(y_{3}\) in the final step as our final target, which was our ultimate goal here.

Spyromitros-Xioufis, Eleftherios, Grigorios Tsoumakas, William Groves, and Ioannis Vlahavas. 2016. “Multi-Target Regression via Input Space Expansion: Treating Targets as Inputs.” *Machine Learning* 104 (1): 55–98. https://doi.org/10.1007/s10994-016-5546-z.

For attribution, please cite this work as

Schneider (2020, April 18). mlr3gallery: Regression Chains. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-04-18-regression-chains/

BibTeX citation

@misc{schneider2020regression, author = {Schneider, Lennart}, title = {mlr3gallery: Regression Chains}, url = {https://mlr3gallery.mlr-org.com/posts/2020-04-18-regression-chains/}, year = {2020} }