A pipeline for the titanic data set - Basics

This post shows how to build a Graph using the mlr3pipelines package on the “titanic” dataset.

Florian Pfisterer
03-12-2020

Intro

First, we load the required packages and the data. The data is part of the mlr3data package.


library("mlr3")
library("mlr3learners")
library("mlr3pipelines")
library("mlr3data")
library("mlr3misc")
library("mlr3viz")
data("titanic")

The titanic data is very interesting to analyze, even though it is part of many tutorials and showcases. This is because it involves many steps that real-world applications of machine learning techniques often require, such as missing value imputation and handling factors.

Following features are illustrated in this use case section:

In order to obtain solutions comparable to official leaderboards, such as the ones available from Kaggle, we split the data into train and test set before doing any further analysis. Here we are using the predefined split used by Kaggle.


titanic_train <- titanic[1:891, ]
titanic_test <- titanic[892:1309, ]
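As a quick sanity check on this split, the sketch below counts missing labels in both parts. It assumes that the rows of `titanic` in mlr3data are ordered so that the first 891 rows are Kaggle's training data, as the split above implies; the names `n_missing_train` and `n_missing_test` are only illustrative.

```r
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
titanic_test <- titanic[892:1309, ]

# The two parts should cover the full data set without overlap.
stopifnot(nrow(titanic_train) + nrow(titanic_test) == nrow(titanic))

# Count missing labels in each part. Kaggle does not publish test labels,
# so missing values in `survived` should be confined to the test part.
n_missing_train <- sum(is.na(titanic_train$survived))
n_missing_test <- sum(is.na(titanic_test$survived))
c(train = n_missing_train, test = n_missing_test)
```

If the training part showed missing labels, the row ranges would not match the Kaggle split and would need to be revisited.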

Exploratory Data Analysis

With the dataset, we get an explanation of the meanings of the different variables:


survived        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

We can use the skimr package in order to get a first overview of the data:


skimr::skim(titanic_train)
Table 1: Data summary
Name                    titanic_train
Number of rows          891
Number of columns       11
_______________________
Column type frequency:
character               3
factor                  4
numeric                 4
________________________
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
name           0          1.00           12   82   0      891       0
ticket         0          1.00           3    18   0      681       0
cabin          687        0.23           1    15   0      147       0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
survived       0          1              FALSE    2         no: 549, yes: 342
pclass         0          1              TRUE     3         3: 491, 1: 216, 2: 184
sex            0          1              FALSE    2         mal: 577, fem: 314
embarked       2          1              FALSE    3         S: 644, C: 168, Q: 77

Variable type: numeric

skim_variable  n_missing  complete_rate  mean   sd     p0    p25    p50    p75   p100    hist
age            177        0.8            29.70  14.53  0.42  20.12  28.00  38    80.00   ▂▇▅▂▁
sib_sp         0          1.0            0.52   1.10   0.00  0.00   0.00   1     8.00    ▇▁▁▁▁
parch          0          1.0            0.38   0.81   0.00  0.00   0.00   0     6.00    ▇▁▁▁▁
fare           0          1.0            32.20  49.69  0.00  7.91   14.45  31    512.33  ▇▁▁▁▁

skimr::skim(titanic_test)
Table 1: Data summary
Name                    titanic_test
Number of rows          418
Number of columns       11
_______________________
Column type frequency:
character               3
factor                  4
numeric                 4
________________________
Group variables         None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
name           0          1.00           13   63   0      418       0
ticket         0          1.00           3    18   0      363       0
cabin          327        0.22           1    15   0      76        0

Variable type: factor

skim_variable  n_missing  complete_rate  ordered  n_unique  top_counts
survived       418        0              FALSE    0         yes: 0, no: 0
pclass         0          1              TRUE     3         3: 218, 1: 107, 2: 93
sex            0          1              FALSE    2         mal: 266, fem: 152
embarked       0          1              FALSE    3         S: 270, C: 102, Q: 46

Variable type: numeric

skim_variable  n_missing  complete_rate  mean   sd     p0    p25    p50    p75   p100    hist
age            86         0.79           30.27  14.18  0.17  21.0   27.00  39.0  76.00   ▂▇▃▂▁
sib_sp         0          1.00           0.45   0.90   0.00  0.0    0.00   1.0   8.00    ▇▁▁▁▁
parch          0          1.00           0.39   0.98   0.00  0.0    0.00   0.0   9.00    ▇▁▁▁▁
fare           1          1.00           35.63  55.91  0.00  7.9    14.45  31.5  512.33  ▇▁▁▁▁

Here we can also inspect the data for differences in the train and test set. This might be important, as shifts in the data distribution often make our models unreliable.
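One simple way to make such a comparison explicit is to put per-feature summaries of both sets side by side. The following base R sketch compares the share of missing values and the quartiles of the numeric features; the object names `missing_share` and `quartiles` are illustrative and not part of the original analysis.

```r
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
titanic_test <- titanic[892:1309, ]

# All columns except the target are treated as features here.
features <- setdiff(names(titanic), "survived")

# Share of missing values per feature in each set.
missing_share <- data.frame(
  feature = features,
  train = vapply(features, function(f) mean(is.na(titanic_train[[f]])), numeric(1)),
  test = vapply(features, function(f) mean(is.na(titanic_test[[f]])), numeric(1)),
  row.names = NULL
)
missing_share

# Quartiles of each numeric feature, train vs. test.
numeric_features <- features[vapply(features, function(f) is.numeric(titanic[[f]]), logical(1))]
quartiles <- lapply(numeric_features, function(f) {
  rbind(
    train = quantile(titanic_train[[f]], probs = c(0.25, 0.5, 0.75), na.rm = TRUE),
    test = quantile(titanic_test[[f]], probs = c(0.25, 0.5, 0.75), na.rm = TRUE)
  )
})
names(quartiles) <- numeric_features
quartiles
```

Large gaps between the train and test rows of either summary would hint at a distribution shift worth investigating before modelling.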


DataExplorer::plot_bar(titanic_train, nrow = 5, ncol = 3)


DataExplorer::plot_histogram(titanic_train, nrow = 2, ncol = 3)


DataExplorer::plot_boxplot(titanic_train, by = "survived", nrow = 2, ncol = 3)

We can now create a Task from our data. As we want to classify whether the person survived or not, we will create a TaskClassif. We’ll ignore the ‘titanic_test’ data for now and come back to it later.

A first model


task <- TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task

<TaskClassif:titanic> (891 x 11)
* Target: survived
* Properties: twoclass
* Features (10):
  - chr (3): cabin, name, ticket
  - dbl (2): age, fare
  - fct (2): embarked, sex
  - int (2): parch, sib_sp
  - ord (1): pclass

Our Task currently has 3 features of type character, which we don’t really know how to handle: “cabin”, “name” and “ticket”. Additionally, our skim of the data showed that they have many unique values (up to 891).

We’ll drop them for now and see how we can deal with them later on.


task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))
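To see what is left after dropping these columns, we can query the task directly. The sketch below rebuilds the task so that it runs on its own and then uses `task$feature_types` and `task$missings()`; the name `missing_counts` is only illustrative.

```r
library("mlr3")
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
task <- TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

# Remaining features and their types.
task$feature_types

# Number of missing values per column; keep only columns that have any.
missing_counts <- task$missings()
missing_counts[missing_counts > 0]
```

The columns listed here are the ones that learners without built-in handling of missing values will not accept as they are.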

Additionally, we create a resampling instance, which allows us to compare different models on the same train/test splits.


rdesc <- rsmp("cv", folds = 3L)$instantiate(task)
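Instantiating the resampling fixes the row ids of each fold, which is what makes later comparisons fair. The sketch below inspects the folds; the `set.seed()` call and the names `train_ids`, `test_ids` and `all_test_ids` are additions for illustration and do not appear in the original post.

```r
library("mlr3")
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
task <- TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

set.seed(1)  # assumption: fixed seed so the folds are reproducible
rdesc <- rsmp("cv", folds = 3L)$instantiate(task)

# Number of resampling iterations (one per fold).
rdesc$iters

# Row ids used for training and testing in the first iteration.
train_ids <- rdesc$train_set(1)
test_ids <- rdesc$test_set(1)
length(intersect(train_ids, test_ids))  # train and test rows do not overlap

# Across all folds, every row is used for testing exactly once.
all_test_ids <- unlist(lapply(seq_len(rdesc$iters), function(i) rdesc$test_set(i)))
setequal(all_test_ids, task$row_ids)
```

Because the same `rdesc` object is passed to every `resample()` call, each learner is evaluated on identical splits.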

To get a first impression of what performance we can expect, we fit a simple decision tree:


learner <- mlr_learners$get("classif.rpart")
# or shorter:
learner <- lrn("classif.rpart")

res <- resample(task, learner, rdesc, store_models = TRUE)
agg <- res$aggregate(msr("classif.acc"))
agg

classif.acc 
  0.8080808 

So any further model should reach an accuracy above 0.808 in order to improve over the simple decision tree. To get there, we might need to do some feature engineering.
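To put the decision tree's accuracy into perspective, it can help to compare it with a learner that ignores all features. The sketch below is an addition to the original analysis: it evaluates `lrn("classif.featureless")`, which always predicts the majority class, on the same kind of 3-fold cross-validation and also prints the tree's per-fold scores. The seed and the names `res_baseline`, `res_tree` and `acc_baseline` are illustrative assumptions.

```r
library("mlr3")
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
task <- TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

set.seed(1)  # assumption: fixed seed so the folds are reproducible
rdesc <- rsmp("cv", folds = 3L)$instantiate(task)

# Baseline that always predicts the most frequent class.
res_baseline <- resample(task, lrn("classif.featureless"), rdesc)
acc_baseline <- res_baseline$aggregate(msr("classif.acc"))
acc_baseline

# Decision tree on the same folds, with one accuracy value per fold.
res_tree <- resample(task, lrn("classif.rpart"), rdesc)
res_tree$score(msr("classif.acc"))
```

Since 549 of the 891 training passengers did not survive, the baseline should land at roughly 0.62, so the tree's gain comes from actually using the features. The per-fold scores also show how much the estimate varies between splits.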

Optimizing the model

If we now try to fit a ‘ranger’ random forest model, we will get an error, as ‘ranger’ models cannot natively handle missing values.


learner <- lrn("classif.ranger")
learner$param_set$values <- list(num.trees = 250, min.node.size = 4)
res <- resample(task, learner, rdesc, store_models = TRUE)

Error: Missing data in columns: age, embarked.

This means we have to find a way to impute the missing values, which requires more advanced commands of the mlr3pipelines package.
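As a minimal sketch of what such an imputation step could look like, the code below chains two imputation PipeOps in front of the ranger learner and wraps the result in a `GraphLearner`. The choice of median imputation for numeric features and mode imputation for factors is an assumption made for illustration, not necessarily the strategy the author has in mind; the seed and the names `graph` and `glrn` are likewise illustrative.

```r
library("mlr3")
library("mlr3learners")
library("mlr3pipelines")
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
task <- TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

set.seed(1)  # assumption: fixed seed so the folds are reproducible
rdesc <- rsmp("cv", folds = 3L)$instantiate(task)

# Impute numeric features (here: age) by their median and the remaining
# features with missing values (here: embarked) by their mode, then fit ranger.
graph <- po("imputemedian") %>>%
  po("imputemode") %>>%
  lrn("classif.ranger", num.trees = 250, min.node.size = 4)

# A GraphLearner behaves like any other learner, so it can be resampled.
glrn <- GraphLearner$new(graph)
res <- resample(task, glrn, rdesc)
res$aggregate(msr("classif.acc"))
```

Because the imputation is part of the learner, the imputation values are computed on the training rows of each fold only, which avoids leaking information from the test rows.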

Citation

For attribution, please cite this work as

Pfisterer (2020, March 12). mlr3gallery: A pipeline for the titanic data set - Basics. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-03-12-intro-pipelines-titanic/

BibTeX citation

@misc{pfisterer2020a,
  author = {Pfisterer, Florian},
  title = {mlr3gallery: A pipeline for the titanic data set - Basics},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-03-12-intro-pipelines-titanic/},
  year = {2020}
}