# Regression chains

We show how to use mlr3pipelines to do regression chains.

Lennart Schneider
04-18-2020

In this tutorial we demonstrate how to use mlr3pipelines to handle multi-target regression by arranging regression models as a chain, i.e., creating a linear sequence of regression models.

## Regression Chains

In a simple regression chain, regression models are arranged in a linear sequence. Here, the first model will use the input to predict a single output and the second model will use the input and the prediction output of the first model to make its own prediction and so on. For more details, see e.g. Spyromitros-Xioufis et al. (2016).

## Before you start

The following sections describe an approach towards working with Tasks that have multiple targets. E.g., in the example below, we have three target variables $$y_{1}$$ to $$y_{3}$$. This type of Task can be created via the mlr3multioutput package (currently under development) in the future. mlr3multioutput will also offer simple chaining approaches as pre-built pipelines (so called ppls). The current goal of this post is to show how such modeling steps can be written as a relatively small amount of pipeline steps and how such steps can be put together. Writing pipelines with such steps allows for great flexibility in modeling more complicated scenarios such as the ones described below.

## Data

In the following, we rely on some toy data. We simulate 100 responses to three target variables, $$y_{1}$$, $$y_{2}$$, and $$y_{3}$$ following a multivariate normal distribution with a mean and covariance matrix of:

library(data.table)
library(mvtnorm)
set.seed(2409)
n = 100
(mean = c(y1 = 1, y2 = 2, y3 = 3))

y1 y2 y3
1  2  3 
(sigma = matrix(c(1, -0.5, 0.25, -0.5, 1, -0.25, 0.25, -0.25, 1),
nrow = 3, ncol = 3, byrow = TRUE))

      [,1]  [,2]  [,3]
[1,]  1.00 -0.50  0.25
[2,] -0.50  1.00 -0.25
[3,]  0.25 -0.25  1.00
Y = rmvnorm(n, mean = mean, sigma = sigma)


The feature variables $$x_{1}$$, and $$x_{2}$$ are simulated as follows: $$x_{1}$$ is simply given by $$y_{1}$$ and an independent normally distributed error term and $$x_{2}$$ is given by $$y_{2}$$ and an independent normally distributed error term.

x1 = Y[, 1] + rnorm(n, sd = 0.1)
x2 = Y[, 2] + rnorm(n, sd = 0.1)


The final data is given as:

dat = as.data.table(cbind(Y, x1, x2))
str(dat)

Classes 'data.table' and 'data.frame':  100 obs. of  5 variables:
$y1: num 0.681 1.836 0.355 1.783 0.974 ...$ y2: num  2.33 1.735 3.126 0.691 1.573 ...
$y3: num 3.19 3.14 2.74 4.31 2.77 ...$ x1: num  0.788 1.754 0.174 1.844 1.05 ...
$x2: num 2.336 1.665 2.967 0.651 1.634 ... - attr(*, ".internal.selfref")=<externalptr>  This simulates a situation where we have multiple target variables that are correlated with each other, such that predicting them along with each other can improve the resulting prediction model. As a real-world example for such a situation, consider e.g. hospital data, where time spent in the ICU (not known a priori) heavily influences the cost incurred by a patient’s treatment. ## 3D Visualization of the Data If you feel confident to already have a good feeling of the data, feel free to skip this section. If not, you can use the rgl package to play around with the following four 3D plots with either the feature variables or $$y_{1}$$ and $$y_{2}$$ on the x- and y-axis and the target variables on the respective z-axes: library(rgl) colfun = colorRampPalette(c("#161B1D", "#ADD8E6"))  setorder(dat, y1) plot3d(dat$x1, dat$x2, dat$y1,
xlab = "x1", ylab = "x2", zlab = "y1",
type = "s", radius = 0.1, col = colfun(n))

setorder(dat, y2)
plot3d(dat$x1, dat$x2, dat$y2, xlab = "x1", ylab = "x2", zlab = "y2", type = "s", radius = 0.1, col = colfun(n))  setorder(dat, y3) plot3d(dat$x1, dat$x2, dat$y3,
xlab = "x1", ylab = "x2", zlab = "y3",
type = "s", radius = 0.1, col = colfun(n))

setorder(dat, y3)
plot3d(dat$y1, dat$y2, dat$y3, xlab = "y1", ylab = "y2", zlab = "y3", type = "s", radius = 0.1, col = colfun(n))  ## Building the Pipeline In our regression chain, the first model will predict $$y_{1}$$. Therefore, we initialize our Task with respect to this target: library(mlr3) library(mlr3learners) library(mlr3pipelines) task = TaskRegr$new("multiregression", backend = dat, target = "y1")


As Learners we will use simple linear regression models. Our pipeline building the regression chain then has to do the following:

• Use the input to predict $$y_{1}$$ within the first learner (i.e., $$y_{1} \sim x_{1} + x_{2}$$).
• Combine the input with the prediction of $$y_{1}$$, $$\hat{y_{1}}$$ and use this to predict $$y_{2}$$ within the second learner (i.e., $$y_{2} \sim x_{1} + x_{2} + \hat{y_{1}}$$).
• Combine the input with the prediction of $$y_{2}$$ and use this to predict $$y_{3}$$ within the final third learner (i.e., $$y_{3} \sim x_{1} + x_{2} + \hat{y_{1}} + \hat{y_{2}}$$).

To combine predictions of a Learner with the previous input, we rely on PipeOpLearnerCV and PipeOpNOP arranged in parallel via gunion combined via PipeOpFeatureUnion. To drop the respective remaining target variables as features, we rely on PipeOpColRoles. The first step of predicting $$y_{1}$$ looks like the following:

step1 = po("copy", outnum = 2, id = "copy1") %>>%
gunion(list(
po("colroles", id = "drop_y2_y3",
param_vals = list(new_role = list(y2 = character(), y3 = character()))) %>>%
po("learner_cv", learner = lrn("regr.lm"), id = "y1_learner"),
po("nop", id = "nop1")
)) %>>%
po("featureunion", id = "union1")
step1$plot()  Training using the input Task, shows us how the output and the $state look like:

step1_out = step1$train(task)[[1]] step1_out  <TaskRegr:multiregression> (100 x 6) * Target: y1 * Properties: - * Features (5): - dbl (5): x1, x2, y1_learner.response, y2, y3 step1$state

$copy1 list()$drop_y2_y3
$drop_y2_y3$dt_columns
[1] "x1" "x2" "y2" "y3"

$drop_y2_y3$affected_cols
[1] "x1" "x2" "y2" "y3"

$drop_y2_y3$intasklayout
id    type
1: x1 numeric
2: x2 numeric
3: y2 numeric
4: y3 numeric

$drop_y2_y3$outtasklayout
id    type
1: x1 numeric
2: x2 numeric

$drop_y2_y3$outtaskshell
Empty data.table (0 rows and 3 cols): y1,x1,x2

$y1_learner$y1_learner$model Call: stats::lm(formula = task$formula(), data = task$data()) Coefficients: (Intercept) x1 x2 -0.03762 0.99851 0.01364$y1_learner$log Empty data.table (0 rows and 3 cols): stage,class,msg$y1_learner$train_time [1] 0.004$y1_learner$train_task <TaskRegr:multiregression> (100 x 3) * Target: y1 * Properties: - * Features (2): - dbl (2): x1, x2$y1_learner$affected_cols [1] "x1" "x2"$y1_learner$intasklayout id type 1: x1 numeric 2: x2 numeric$y1_learner$outtasklayout id type 1: y1_learner.response numeric$y1_learner$outtaskshell Empty data.table (0 rows and 2 cols): y1,y1_learner.response$nop1
list()

$union1 list() Within the second step we then have to define $$y_{2}$$ as the new target. This can be done using PipeOpUpdateTarget (note that PipeOpUpdateTarget currently is not exported but will be in a future version). By default, PipeOpUpdateTarget drops the original target from the feature set, here $$y_{1}$$. mlr_pipeops$add("update_target", mlr3pipelines:::PipeOpUpdateTarget)

step2 = po("update_target", id = "y2_target",
param_vals = list(new_target_name = "y2")) %>>%
po("copy", outnum = 2, id = "copy2") %>>%
gunion(list(
po("colroles", id = "drop_y3",
param_vals = list(new_role = list(y3 = character()))) %>>%
po("learner_cv", learner = lrn("regr.lm"), id = "y2_learner"),
po("nop", id = "nop2")
)) %>>%
po("featureunion", id = "union2")


Again, we can train to see how the output and $state look like, but now using the output of step1 as the input: step2_out = step2$train(step1_out)[[1]]
step2_out

<TaskRegr:multiregression> (100 x 6)
* Target: y2
* Properties: -
* Features (5):
- dbl (5): x1, x2, y1_learner.response, y2_learner.response, y3
step2$state  $y2_target
list()

$copy2 list()$drop_y3
$drop_y3$dt_columns
[1] "x1"                  "x2"                  "y1_learner.response"
[4] "y3"

$drop_y3$affected_cols
[1] "y1_learner.response" "x1"                  "x2"
[4] "y3"

$drop_y3$intasklayout
id    type
1:                  x1 numeric
2:                  x2 numeric
3: y1_learner.response numeric
4:                  y3 numeric

$drop_y3$outtasklayout
id    type
1:                  x1 numeric
2:                  x2 numeric
3: y1_learner.response numeric

$drop_y3$outtaskshell
Empty data.table (0 rows and 4 cols): y2,y1_learner.response,x1,x2

$y2_learner$y2_learner$model Call: stats::lm(formula = task$formula(), data = task$data()) Coefficients: (Intercept) y1_learner.response x1 0.07135 0.22773 -0.25186 x2 0.97877$y2_learner$log Empty data.table (0 rows and 3 cols): stage,class,msg$y2_learner$train_time [1] 0.009$y2_learner$train_task <TaskRegr:multiregression> (100 x 4) * Target: y2 * Properties: - * Features (3): - dbl (3): x1, x2, y1_learner.response$y2_learner$affected_cols [1] "y1_learner.response" "x1" "x2"$y2_learner$intasklayout id type 1: x1 numeric 2: x2 numeric 3: y1_learner.response numeric$y2_learner$outtasklayout id type 1: y2_learner.response numeric$y2_learner$outtaskshell Empty data.table (0 rows and 2 cols): y2,y2_learner.response$nop2
list()

$union2 list() In the final third step we define $$y_{3}$$ as the new target (again, PipeOpUpdateTarget drops the previous original target from the feature set, here $$y_{2}$$): step3 = po("update_target", id = "y3_target", param_vals = list(new_target_name = "y3")) %>>% po("learner", learner = lrn("regr.lm"), id = "y3_learner")  Using the output of step2 as input: step3_out = step3$train(step2_out)[[1]]
step3_out

NULL
step3$state  $y3_target
list()

$y3_learner$y3_learner$model Call: stats::lm(formula = task$formula(), data = task$data()) Coefficients: (Intercept) y2_learner.response y1_learner.response 2.6445 0.8155 3.8776 x1 x2 -3.5217 -0.7304$y3_learner$log Empty data.table (0 rows and 3 cols): stage,class,msg$y3_learner$train_time [1] 0.008$y3_learner$train_task <TaskRegr:multiregression> (100 x 5) * Target: y3 * Properties: - * Features (4): - dbl (4): x1, x2, y1_learner.response, y2_learner.response The complete pipeline, more precisely Graph, looks like the following: graph = step1 %>>% step2 %>>% step3 graph$plot()


## Evaluating the Pipeline

By wrapping our Graph in a GraphLearner, we can perform 3-fold cross-validation and get an estimated average of the root-mean-square error (of course, in a real world setting splitting the data in a training and test set should have been done):

learner = GraphLearner$new(graph) rr = resample(task, learner = learner, resampling = rsmp("cv", folds = 3)) rr$aggregate(msr("regr.mse"))

 regr.mse
0.7265587 

## Predicting with the Pipeline

For completeness, we also show how a prediction step without having any target variable data available would look like:

dat_predict = as.data.table(cbind(x1, x2, y1 = NA, y2 = NA, y3 = NA))
learner$train(task) learner$predict_newdata(dat_predict)

<PredictionRegr> for 100 observations:
row_id truth response
1    NA 3.116960
2    NA 3.327345
3    NA 3.010821
---
98    NA 3.462541
99    NA 3.020585
100    NA 3.664326

Note that we have to initialize the Task with $$y_{1}$$ as the target but the pipeline will automatically predict $$y_{3}$$ in the final step as our final target, which was our ultimate goal here.

Spyromitros-Xioufis, Eleftherios, Grigorios Tsoumakas, William Groves, and Ioannis Vlahavas. 2016. “Multi-Target Regression via Input Space Expansion: Treating Targets as Inputs.” Machine Learning 104 (1): 55–98. https://doi.org/10.1007/s10994-016-5546-z.

### Citation

Schneider (2020, April 18). mlr3gallery: Regression chains. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-04-18-regression-chains/
@misc{schneider2020regression,
}