A pipeline for the titanic data set - Advanced

imputation classification mlr3pipelines feature engineering

This post shows how to build a Graph using the mlr3pipelines package on the “titanic” dataset. Moreover, feature engineering, data imputation and benchmarking are covered.

Florian Pfisterer
04-27-2020

This is the second post of the titanic use case series. You can find the first use case here.

In this section we will focus on more advanced usage of mlr3pipelines. Specifically, this section illustrates the different options when it comes to data imputation and feature engineering. Furthermore, the section shows how to benchmark the resulting models and compare our results. Again, we need to take some preparatory steps.

In case you are only tuning in for the second part, we first need to load some packages. The data is part of the mlr3data package.
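The loading chunk itself is not shown here, so the following is only a sketch of what it might look like. The list of packages is an assumption, inferred from the functions used later in this post.

```r
# Packages inferred from the functions used in this post (an assumption).
library("mlr3")           # tasks, learners, resampling, benchmarking
library("mlr3learners")   # provides the classif.ranger learner
library("mlr3pipelines")  # PipeOps, Graphs and GraphLearner
library("mlr3viz")        # autoplot() methods for tasks and benchmark results
library("ggplot2")        # theme() and element_text() used to style plots

# Load the titanic data from mlr3data without attaching the whole package.
data("titanic", package = "mlr3data")
dim(titanic)
```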

As in the basics chapter, we use the titanic data set. To recap we have undertaken the following steps:

# splitting the data set
titanic_train = titanic[1:891, ]
titanic_test = titanic[892:1309, ]

# setting up the task
task = TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

# setting up the learner
learner = lrn("classif.rpart")

# setting up our resampling method
rdesc = rsmp("cv", folds = 3L)$instantiate(task)
res = resample(task, learner, rdesc, store_models = TRUE)

Imputation

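A very simple way to handle missing values is to just impute a constant value for each feature. We could, e.g., impute every character or factor column with missing and every numeric column with -999. Depending on the model, this might actually be fine. This approach has drawbacks though: a sentinel such as -999 distorts the distribution of the feature, and it could collide with a value that actually occurs in the data.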

As a result, instead of imputing a constant value, we will do two things:

* Draw samples from each numeric feature's histogram using PipeOpImputeHist.
* Add an additional column for each variable that indicates whether a value was missing or not. If the fact that a value was missing is informative, this column preserves that information.
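To make the difference concrete, here is a minimal base R sketch on a toy vector (not the titanic column). It only illustrates the idea of histogram sampling plus an indicator column. It is not the PipeOpImputeHist implementation, and all variable names are made up for the example.

```r
set.seed(1)
age = c(22, 38, NA, 35, NA, 54)   # toy data
is_missing = is.na(age)

# Constant imputation: the sentinel drags the mean far away from the data.
age_const = ifelse(is_missing, -999, age)
mean(age, na.rm = TRUE)  # 37.25
mean(age_const)          # about -308.17

# 1) Indicator column: 1 = missing, 0 = observed.
missing_age = as.numeric(is_missing)

# 2) Histogram-based imputation: pick a bin with probability proportional
#    to its count, then draw uniformly within that bin.
h = hist(age, plot = FALSE)       # hist() drops NA values
bin = sample(length(h$counts), sum(is_missing), replace = TRUE, prob = h$counts)
age_hist = age
age_hist[is_missing] = runif(sum(is_missing), h$breaks[bin], h$breaks[bin + 1])

cbind(age, age_const, age_hist, missing_age)
```

The imputed values in age_hist stay within the range covered by the histogram bins, while missing_age keeps the information about which rows were originally missing.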

The simpler scheme of filling in a fixed value is called ‘imputation with constants’. It is already implemented in mlr3pipelines and can be done using PipeOpImputeConstant.

Remember that we are trying to optimize our predictive power by using a random forest model (mlr_learners_classif.ranger). Now, random forest models do not naturally handle missing values which is the reason why we need imputation. Before imputation, our data looks as follows:

task$missings()
survived      age embarked     fare    parch   pclass      sex   sib_sp 
       0      177        2        0        0        0        0        0 

Let’s first deal with the categorical variables:

po_newlvl = po("imputeoor")
task_newlvl = po_newlvl$train(list(task))[[1]]

Note that we use the PipeOp in an unusual way, which is why the syntax does not look very clean. We’ll learn how to use a full graph below.

First, let’s look at the result:

task_newlvl$missings()
survived     fare    parch   pclass      sex   sib_sp      age embarked 
       0        0        0        0        0        0        0        0 

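Cool! embarked does not have missing values anymore. Note that PipeOpImputeOOR by default also imputes numeric columns, using a value outside the observed range, which is why age no longer shows missing values either. For character, factor and ordered columns it adds a new level.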
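For a categorical column, the idea of treating "missing" as its own level can be sketched in base R as follows. The helper impute_new_level is hypothetical and written for this example only, and the level name ".MISSING" is used here as an assumption about what imputeoor produces.

```r
# Hypothetical helper: replace NA by an explicit level.
impute_new_level = function(x, level = ".MISSING") {
  x = as.character(x)
  x[is.na(x)] = level
  factor(x)
}

embarked = factor(c("S", "C", NA, "Q", "S"))  # toy data
embarked_imputed = impute_new_level(embarked)
table(embarked_imputed)
```

A model can then treat missingness as just another category instead of failing on NA.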

For the numeric features we want to do two things, impute values and add an indicator column. In order to do this, we need a more complicated structure, a Graph.

Our po_indicator creates the indicator column. We tell it to only do this for numeric and integer columns via its param_vals, and additionally tell it to create a numeric column (0 = “not missing”, 1 = “missing”).

po_indicator = po("missind",
  param_vals = list(affect_columns = selector_type(c("numeric", "integer")), type = "numeric")
)

Now we can simultaneously impute features from the histogram and create indicator columns. This can be achieved using the gunion function, which puts two operations in parallel:

gr = gunion(list(po_indicator, po("imputehist")))
gr = gr %>>% po("featureunion")

Afterwards, we cbind the resulting data using po("featureunion"), connecting the different operations using our graph connector: %>>%. We can now also connect the out-of-range imputation (imputeoor):

gr = gr %>>% po("imputeoor")

and see what happens when we now train the whole Graph:

task_imputed = gr$clone()$train(task)[[1]]
task_imputed$missings()
   survived missing_age      pclass         sex        fare       parch 
          0           0           0           0           0           0 
     sib_sp         age    embarked 
          0           0           0 

Awesome, now we do not have any missing values!

autoplot(task_imputed)

We could now use task_imputed for resampling and see whether a ranger model does better. But this is dangerous! If we preprocess all training data at once, data could leak through the different cross-validation folds. In order to do this properly, we have to process the training data in every fold separately. Luckily, this is automatically handled in our Graph, if we use it through a GraphLearner.
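The leakage problem can be illustrated with a small base R sketch. It uses median imputation on simulated data as a stand-in (an assumption made for brevity, not the histogram imputation used in this post) and compares a fill value computed on all rows with fill values computed per fold on the training rows only.

```r
set.seed(42)
x = rnorm(30, mean = 30, sd = 10)                # toy numeric feature
x[sample(30, 6)] = NA                            # knock out six values
fold = sample(rep(1:3, length.out = length(x)))  # 3-fold assignment

# Leaky: the fill value is computed once from all rows, including test rows.
global_fill = median(x, na.rm = TRUE)

# Proper: one fill value per fold, computed from that fold's training rows only.
fold_fill = sapply(1:3, function(k) median(x[fold != k], na.rm = TRUE))

# Impute each test fold with the value learned on its training rows.
x_imputed = x
for (k in 1:3) {
  idx = fold == k & is.na(x)
  x_imputed[idx] = fold_fill[k]
}

global_fill
fold_fill
```

The per-fold values generally differ from the global one. Wrapping the preprocessing in a GraphLearner gives this per-fold behaviour without writing the loop by hand.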

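We can simply append a learner to the Graph and create a GraphLearner from this. Note that, as the code is written, learner still refers to the classif.rpart learner from the recap; the ranger learner is only defined in the feature engineering section below.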

glrn = GraphLearner$new(gr$clone() %>>%
  po("imputesample") %>>%
  po("fixfactors") %>>%
  po(learner))

We needed to use the following commands for the Graph:

* fixfactors: Removes empty factor levels and removes factor levels that do not exist during training.
* imputesample: In some cases, if missing factor levels do not occur during training but only while predicting, imputeoor does not create a new level. For those, we sample a random value.
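What these two steps do at prediction time can be sketched in base R on toy factors. This is a conceptual illustration only, not the code of the PipeOps themselves, and the variable names are made up.

```r
set.seed(1)
train_title = factor(c("Mr", "Mrs", "Miss", "Mr"))
test_title = factor(c("Mr", "Rev", "Miss"))  # "Rev" never occurs in training

# fixfactors-like step: restrict the test data to the training levels.
# Values with an unseen level become NA.
test_fixed = factor(as.character(test_title), levels = levels(train_title))
is.na(test_fixed)

# imputesample-like step: fill the new NAs by sampling from the training values.
n_missing = sum(is.na(test_fixed))
test_fixed[is.na(test_fixed)] = sample(train_title, n_missing, replace = TRUE)
test_fixed
```

After both steps the test factor has exactly the levels the model saw during training and no missing values.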

res = resample(task, glrn, rdesc, store_models = TRUE)
agg = res$aggregate(msr("classif.acc"))
agg
classif.acc 
  0.7991021 

So our model has not improved much; it currently has an accuracy of 0.799.

Feature Engineering

To improve the model further, we will engineer new features using PipeOpMutate, which also showcases the power of mlr3pipelines. Additionally, we will make use of the character columns. Hence, we will re-select them:

task$col_roles$feature = c(task$feature_names, c("cabin", "name", "ticket"))
library("stringi")
po_ftextract = po("mutate", param_vals = list(
  mutation = list(
    fare_per_person = ~ fare / (parch + sib_sp + 1),
    deck = ~ factor(stri_sub(cabin, 1, 1)),
    title = ~ factor(stri_match(name, regex = ", (.*)\\.")[, 2]),
    surname = ~ factor(stri_match(name, regex = "(.*),")[, 2]),
    ticket_prefix = ~ factor(stri_replace_all_fixed(stri_trim(stri_match(ticket, regex = "(.*) ")[, 2]), ".", ""))
  )
))

Quickly checking what happens:

task_eng = po_ftextract$clone()$train(list(task))[[1]]
task_eng$data()
     survived age embarked    fare parch pclass    sex sib_sp cabin
  1:       no  22        S  7.2500     0      3   male      1  <NA>
  2:      yes  38        C 71.2833     0      1 female      1   C85
  3:      yes  26        S  7.9250     0      3 female      0  <NA>
  4:      yes  35        S 53.1000     0      1 female      1  C123
  5:       no  35        S  8.0500     0      3   male      0  <NA>
 ---                                                               
887:       no  27        S 13.0000     0      2   male      0  <NA>
888:      yes  19        S 30.0000     0      1 female      0   B42
889:       no  NA        S 23.4500     2      3 female      1  <NA>
890:      yes  26        C 30.0000     0      1   male      0  C148
891:       no  32        Q  7.7500     0      3   male      0  <NA>
                                                    name           ticket
  1:                             Braund, Mr. Owen Harris        A/5 21171
  2: Cumings, Mrs. John Bradley (Florence Briggs Thayer)         PC 17599
  3:                              Heikkinen, Miss. Laina STON/O2. 3101282
  4:        Futrelle, Mrs. Jacques Heath (Lily May Peel)           113803
  5:                            Allen, Mr. William Henry           373450
 ---                                                                     
887:                               Montvila, Rev. Juozas           211536
888:                        Graham, Miss. Margaret Edith           112053
889:            Johnston, Miss. Catherine Helen "Carrie"       W./C. 6607
890:                               Behr, Mr. Karl Howell           111369
891:                                 Dooley, Mr. Patrick           370376
     fare_per_person deck title   surname ticket_prefix
  1:         3.62500 <NA>    Mr    Braund           A/5
  2:        35.64165    C   Mrs   Cumings            PC
  3:         7.92500 <NA>  Miss Heikkinen       STON/O2
  4:        26.55000    C   Mrs  Futrelle          <NA>
  5:         8.05000 <NA>    Mr     Allen          <NA>
 ---                                                   
887:        13.00000 <NA>   Rev  Montvila          <NA>
888:        30.00000    B  Miss    Graham          <NA>
889:         5.86250 <NA>  Miss  Johnston           W/C
890:        30.00000    C    Mr      Behr          <NA>
891:         7.75000 <NA>    Mr    Dooley          <NA>
autoplot(task_eng$clone()$select(c("sex", "age")), type = "pairs")
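As a sanity check, some of the derived columns can be reproduced by hand for three of the rows printed above. This sketch uses base R regular expressions instead of stringi (an assumption made to keep it dependency-free); the patterns agree with the ones in the mutation for these rows, but are not guaranteed to be equivalent for every name or ticket.

```r
toy = data.frame(
  name = c("Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Allen, Mr. William Henry"),
  ticket = c("A/5 21171", "STON/O2. 3101282", "373450"),
  fare = c(7.25, 7.925, 8.05),
  parch = c(0, 0, 0),
  sib_sp = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# Fare divided by the size of the travelling party (passenger included).
toy$fare_per_person = toy$fare / (toy$parch + toy$sib_sp + 1)

# Surname: everything before the comma. Title: between ", " and the period.
toy$surname = sub(",.*$", "", toy$name)
toy$title = sub("^.*, (.*)\\..*$", "\\1", toy$name)

# Ticket prefix: text before the last space with periods removed,
# NA if the ticket contains no space.
has_prefix = grepl(" ", toy$ticket, fixed = TRUE)
prefix = gsub(".", "", sub("^(.*) .*$", "\\1", toy$ticket), fixed = TRUE)
toy$ticket_prefix = ifelse(has_prefix, prefix, NA_character_)

toy[, c("fare_per_person", "title", "surname", "ticket_prefix")]
```

The values match rows 1, 3 and 5 of the task_eng output.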

Now we can put everything together again: we concatenate our new PipeOp with the Graph created above and use PipeOpSelect in order to de-select the character features we used for feature extraction. Additionally, we collapse the ‘surname’, so only surnames that make up more than 3 % of the data are kept (no_collapse_above_prevalence = 0.03).
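The prevalence threshold can be illustrated on a toy factor in base R. The sketch only computes which levels lie above a threshold; how the remaining levels are merged is decided by PipeOpCollapseFactors and is not reproduced here. The data and the threshold of 0.25 are made up for readability.

```r
# Toy surnames: two frequent families and three singletons.
surname = factor(c(rep("Andersson", 4), rep("Sage", 3), "Braund", "Allen", "Behr"))

# Share of rows per level.
prevalence = prop.table(table(surname))
round(prevalence, 2)

# Levels above the threshold are the ones protected from collapsing.
threshold = 0.25
kept = names(prevalence)[as.vector(prevalence) > threshold]
kept
```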

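In summary, we do the following:

* mutate: Create the new features.
* collapsefactors: Collapse rare factor levels.
* select: Drop the character columns.
* gunion: In parallel, create missingness indicators and impute numeric features from their histogram.
* featureunion: Combine the indicator columns with the imputed features.
* imputeoor: Impute the remaining missing values with an out-of-range value or a new level.
* fixfactors: Remove factor levels that do not occur during training.
* imputesample: Sample values for anything that is still missing after fixfactors.
* learner: Train the ranger learner on the result.

The full graph we created is the following: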

learner = lrn("classif.ranger")
learner$param_set$values = list(num.trees = 500, min.node.size = 4)
gr_final = po_ftextract %>>%
  po("collapsefactors", param_vals = list(no_collapse_above_prevalence = 0.03)) %>>%
  po("select", param_vals = list(selector = selector_invert(selector_type("character")))) %>>%
  gunion(list(po_indicator, po("imputehist"))) %>>%
  po("featureunion") %>>%
  po("imputeoor") %>>%
  po("fixfactors") %>>%
  po("imputesample") %>>%
  po(learner)

Evaluation

Let us see if things have improved:

glrn = GraphLearner$new(gr_final)
res = resample(task, glrn, rdesc, store_models = TRUE)
agg_ranger = res$aggregate(msr("classif.acc"))
agg_ranger
classif.acc 
  0.8249158 

We have improved even more!

Benchmarking

To undertake benchmarking, we need to set up a benchmarking design. The first step is creating a list with the learners we used, namely the learners from the first and second part of this use case.

lrn_list = list(
  lrn("classif.rpart", predict_type = "prob"),
  lrn("classif.ranger", predict_type = "prob")
)

Now we can define our benchmark design. benchmark_grid creates an exhaustive design in which every learner is evaluated on every task with the same train/test splits, so the results are directly comparable.

bm_design = benchmark_grid(tasks = task_imputed, resamplings = rsmp("cv", folds = 10), learners = lrn_list)
bmr = benchmark(bm_design, store_models = TRUE)
print(bmr)
<BenchmarkResult> of 20 rows with 2 resampling runs
 nr task_id     learner_id resampling_id iters warnings errors
  1 titanic  classif.rpart            cv    10        0      0
  2 titanic classif.ranger            cv    10        0      0
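Before plotting, the aggregated scores can be pulled out of a benchmark result with $aggregate(). The following standalone sketch mirrors the setup above on a built-in task so that it runs without the titanic objects. The sonar task, the featureless baseline and the 3-fold setting are stand-ins chosen for this example, not the setup of this post.

```r
library("mlr3")
set.seed(1)

# Exhaustive design: every learner on the task, sharing the same CV splits.
design = benchmark_grid(
  tasks = tsk("sonar"),
  learners = lrns(c("classif.featureless", "classif.rpart"), predict_type = "prob"),
  resamplings = rsmp("cv", folds = 3)
)
bmr_demo = benchmark(design)

# One row per learner, with measures averaged over the folds.
agg = as.data.frame(bmr_demo$aggregate(msrs(c("classif.acc", "classif.auc"))))
agg[, c("learner_id", "classif.acc", "classif.auc")]
```

Applied to the result above, the same call would be bmr$aggregate(msrs(c("classif.acc", "classif.auc"))).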

So, where do we go from here? We could for instance use a boxplot:

autoplot(bmr) + ggplot2::theme(axis.text.x = ggplot2::element_text(angle = 45, hjust = 1))

Furthermore, we can compare sensitivity and specificity with a ROC curve. Here we need to ensure that the benchmark result only contains a single Task:

autoplot(bmr$clone(deep = TRUE)$filter(task_ids = "titanic"), type = "roc")

Moreover, one can compare the precision-recall curves:

# Precision vs Recall
ggplot2::autoplot(bmr, type = "prc")

As one can see, there are various options when it comes to benchmarking and visualizing. You could have a look at some other use cases in our gallery for inspiration.

Future

In this case we have examined a number of different features, but there are many more things to explore! We could extract even more information from the different features and see what happens. But now you are left to yourself! There are many Kaggle kernels available that treat the Titanic data set. These can be a great starting point for finding even better models.

Citation

For attribution, please cite this work as

Pfisterer (2020, April 27). mlr3gallery: A pipeline for the titanic data set - Advanced. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-04-27-mlr3pipelines-Imputation-titanic/

BibTeX citation

@misc{pfisterer2020a,
  author = {Pfisterer, Florian},
  title = {mlr3gallery: A pipeline for the titanic data set - Advanced},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-04-27-mlr3pipelines-Imputation-titanic/},
  year = {2020}
}