# A pipeline for the titanic data set - Basics

This post shows how to build a Graph using the mlr3pipelines package on the “titanic” dataset.

Florian Pfisterer
03-12-2020

## Intro

First, we load the required packages and the data. The data is part of the mlr3data package.

library("mlr3")
library("mlr3learners")
library("mlr3pipelines")
library("mlr3data")
library("mlr3misc")
library("mlr3viz")
data("titanic")


The titanic data is interesting to analyze even though it appears in many tutorials and showcases, because it requires many steps that are common in real-world applications of machine learning, such as missing value imputation and the handling of factors.

The following features are illustrated in this use case:

• Summarizing the data set
• Visualizing data
• Splitting data into train and test data sets
• Defining a task and a learner

In order to obtain solutions comparable to official leaderboards, such as the ones available from Kaggle, we split the data into a train and a test set before doing any further analysis. Here we use the predefined split used by Kaggle.

titanic_train <- titanic[1:891, ]
titanic_test <- titanic[892:1309, ]
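A quick sanity check of the split can be sketched as follows. It assumes only what the summaries in this post report: that the `titanic` data from mlr3data has 1309 rows and that `survived` is missing for exactly the test rows.

```r
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
titanic_test <- titanic[892:1309, ]

# The two parts should cover all rows without overlap
nrow(titanic_train) + nrow(titanic_test) == nrow(titanic)

# Labels should be available for the training rows only
anyNA(titanic_train$survived)
all(is.na(titanic_test$survived))
```

If the last line returns `TRUE`, the test part cannot be used for local evaluation, which is why resampling on the training part is used later on.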


## Exploratory Data Analysis

With the dataset, we get an explanation of the meanings of the different variables:

survived        Survival (0 = No; 1 = Yes)
pclass          Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

We can use the skimr package in order to get a first overview of the data:

skimr::skim(titanic_train)

Data summary:

• Name: titanic_train
• Number of rows: 891
• Number of columns: 11
• Column type frequency: character 3, factor 4, numeric 4
• Group variables: None

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1.00 | 12 | 82 | 0 | 891 | 0 |
| ticket | 0 | 1.00 | 3 | 18 | 0 | 681 | 0 |
| cabin | 687 | 0.23 | 1 | 15 | 0 | 147 | 0 |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| survived | 0 | 1 | FALSE | 2 | no: 549, yes: 342 |
| pclass | 0 | 1 | TRUE | 3 | 3: 491, 1: 216, 2: 184 |
| sex | 0 | 1 | FALSE | 2 | mal: 577, fem: 314 |
| embarked | 2 | 1 | FALSE | 3 | S: 644, C: 168, Q: 77 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 177 | 0.8 | 29.70 | 14.53 | 0.42 | 20.12 | 28.00 | 38 | 80.00 | ▂▇▅▂▁ |
| sib_sp | 0 | 1.0 | 0.52 | 1.10 | 0.00 | 0.00 | 0.00 | 1 | 8.00 | ▇▁▁▁▁ |
| parch | 0 | 1.0 | 0.38 | 0.81 | 0.00 | 0.00 | 0.00 | 0 | 6.00 | ▇▁▁▁▁ |
| fare | 0 | 1.0 | 32.20 | 49.69 | 0.00 | 7.91 | 14.45 | 31 | 512.33 | ▇▁▁▁▁ |

skimr::skim(titanic_test)

Data summary:

• Name: titanic_test
• Number of rows: 418
• Number of columns: 11
• Column type frequency: character 3, factor 4, numeric 4
• Group variables: None

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| name | 0 | 1.00 | 13 | 63 | 0 | 418 | 0 |
| ticket | 0 | 1.00 | 3 | 18 | 0 | 363 | 0 |
| cabin | 327 | 0.22 | 1 | 15 | 0 | 76 | 0 |

Variable type: factor

| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| survived | 418 | 0 | FALSE | 0 | yes: 0, no: 0 |
| pclass | 0 | 1 | TRUE | 3 | 3: 218, 1: 107, 2: 93 |
| sex | 0 | 1 | FALSE | 2 | mal: 266, fem: 152 |
| embarked | 0 | 1 | FALSE | 3 | S: 270, C: 102, Q: 46 |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 86 | 0.79 | 30.27 | 14.18 | 0.17 | 21.0 | 27.00 | 39.0 | 76.00 | ▂▇▃▂▁ |
| sib_sp | 0 | 1.00 | 0.45 | 0.90 | 0.00 | 0.0 | 0.00 | 1.0 | 8.00 | ▇▁▁▁▁ |
| parch | 0 | 1.00 | 0.39 | 0.98 | 0.00 | 0.0 | 0.00 | 0.0 | 9.00 | ▇▁▁▁▁ |
| fare | 1 | 1.00 | 35.63 | 55.91 | 0.00 | 7.9 | 14.45 | 31.5 | 512.33 | ▇▁▁▁▁ |

Here we can also inspect the data for differences in the train and test set. This might be important, as shifts in the data distribution often make our models unreliable.

DataExplorer::plot_bar(titanic_train, nrow = 5, ncol = 3)


DataExplorer::plot_histogram(titanic_train, nrow = 2, ncol = 3)

DataExplorer::plot_boxplot(titanic_train, by = "survived", nrow = 2, ncol = 3)
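The comparison between train and test set mentioned above can also be made numerically. The following is a minimal sketch in base R; the helper `missing_share()` is a hypothetical name introduced here for illustration and is not part of any package.

```r
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
titanic_test <- titanic[892:1309, ]

# Hypothetical helper: share of missing values for each of the given columns
missing_share <- function(df, cols) {
  vapply(cols, function(col) mean(is.na(df[[col]])), numeric(1))
}

# The target is excluded, as it is missing for all test rows by construction
features <- setdiff(names(titanic), "survived")

comparison <- data.frame(
  train = missing_share(titanic_train, features),
  test = missing_share(titanic_test, features)
)
round(comparison, 2)

# Class proportions of a categorical feature in both parts
round(rbind(
  train = prop.table(table(titanic_train$pclass)),
  test = prop.table(table(titanic_test$pclass))
), 2)
```

Large differences between the two columns of such a table would hint at a shift between train and test data.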


We can now create a Task from our data. As we want to classify whether the person survived or not, we will create a TaskClassif. We’ll ignore the ‘titanic_test’ data for now and come back to it later.

## A first model

task <- TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task

<TaskClassif:titanic> (891 x 11)
* Target: survived
* Properties: twoclass
* Features (10):
  - chr (3): cabin, name, ticket
  - dbl (2): age, fare
  - fct (2): embarked, sex
  - int (2): parch, sib_sp
  - ord (1): pclass

Our Task currently has 3 features of type character, which we don’t really know how to handle: “cabin”, “name” and “ticket”. Additionally, from our skim of the data, we have seen that they have many unique values (up to 891). We’ll drop them for now and see how we can deal with them later on.

task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

Additionally, we create an instantiated resampling. Instantiating fixes the train/test splits, so that different learners can be compared on exactly the same folds.

rdesc <- rsmp("cv", folds = 3L)$instantiate(task)


To get a first impression of what performance we can expect, we fit a simple decision tree:

learner <- mlr_learners$get("classif.rpart")
# or shorter:
learner <- lrn("classif.rpart")
res <- resample(task, learner, rdesc, store_models = TRUE)
agg <- res$aggregate(msr("classif.acc"))
agg

classif.acc
0.8080808 

So any further model needs an accuracy above 0.808 in order to improve over the simple decision tree. In order to improve more, we might need to do some feature engineering.
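The aggregated accuracy is an average over the three folds, so it can help to look at the fold-wise values before treating it as a fixed baseline. The sketch below rebuilds the task from above so that it runs on its own; the seed is an arbitrary choice made here for reproducibility, so the folds (and therefore the numbers) will not match the ones reported in this post.

```r
library("mlr3")
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
task <- TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

set.seed(1) # arbitrary seed, only to make the folds reproducible
rdesc <- rsmp("cv", folds = 3L)$instantiate(task)
res <- resample(task, lrn("classif.rpart"), rdesc)

# Accuracy of each individual fold
scores <- res$score(msr("classif.acc"))
scores[, c("iteration", "classif.acc")]

# Spread of the fold-wise accuracy
range(scores$classif.acc)
```

If the fold-wise values differ by several percentage points, a small gain of a new model over the baseline should be read with caution.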

## Optimizing the model

If we now try to fit a ‘ranger’ random forest model, we get an error, as ‘ranger’ models cannot natively handle missing values.

learner <- lrn("classif.ranger")
learner$param_set$values <- list(num.trees = 250, min.node.size = 4)
res <- resample(task, learner, rdesc, store_models = TRUE)

Error: Missing data in columns: age, embarked.
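The columns named in the error can be confirmed directly on the task with `task$missings()`, which counts missing values per column. The sketch rebuilds the task so that it runs on its own; the counts should agree with the skim summary of the training data (177 for age, 2 for embarked).

```r
library("mlr3")
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
task <- TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

# Number of missing values per column of the task
missings <- task$missings()

# Keep only the columns that actually contain missing values
missings[missings > 0]
```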

This means we have to find a way to impute the missing values. To learn how to use more advanced commands of the mlr3pipelines package see:
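As a minimal sketch of what such an imputation step could look like, the block below chains two imputation PipeOps in front of the ‘ranger’ learner. The choice of median imputation for numeric features and mode imputation for factors is an assumption made here for illustration and not necessarily the approach a more advanced pipeline would take. The accuracy is not shown, as it depends on the randomly drawn folds.

```r
library("mlr3")
library("mlr3learners")
library("mlr3pipelines")
library("mlr3data")
data("titanic", package = "mlr3data")

titanic_train <- titanic[1:891, ]
task <- TaskClassif$new("titanic", titanic_train, target = "survived", positive = "yes")
task$select(cols = setdiff(task$feature_names, c("cabin", "name", "ticket")))

# Assumption: median imputation for numeric features (age),
# mode imputation for the remaining factor features (embarked)
impute <- po("imputemedian") %>>% po("imputemode")

# Check the imputation steps on their own: no missing values should remain
imputed_task <- impute$train(task)[[1]]
sum(imputed_task$missings())

# Append the random forest and wrap the whole graph as a single learner
graph <- impute %>>% lrn("classif.ranger", num.trees = 250, min.node.size = 4)
graph_learner <- GraphLearner$new(graph)

# Imputation is now re-estimated on the training part of every fold
rdesc <- rsmp("cv", folds = 3L)$instantiate(task)
res <- resample(task, graph_learner, rdesc)
res$aggregate(msr("classif.acc"))
```

Wrapping the imputation and the learner in one `GraphLearner` means the imputation values are computed from the training rows of each fold only, so no information from the held-out rows leaks into the model.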

### Citation

Pfisterer (2020, March 12). mlr3gallery: A pipeline for the titanic data set - Basics. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-03-12-intro-pipelines-titanic/