The package “xgboost” unfortunately does not support handling of categorical features. Therefore, it is required to manually convert factor columns to numerical dummy features. We show how to use “mlr3pipelines” to augment the “mlr_learners_classif.xgboost” learner with an automatic factor encoding.
First, we take an example task with factors (german_credit) and create the xgboost learner:
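The code producing the two objects printed below is missing from the extracted text; a minimal sketch using mlr3's sugar functions could look as follows (the parameters verbose = 0 and nrounds = 100 are taken from the learner printout):

```r
library(mlr3)
library(mlr3learners)  # provides the classif.xgboost learner

# create the example task and the xgboost learner
task = tsk("german_credit")
learner = lrn("classif.xgboost", verbose = 0, nrounds = 100)

task     # prints the task summary shown below
learner  # prints the learner summary shown below
```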
<TaskClassif:german_credit> (1000 x 21)
* Target: credit_risk
* Properties: twoclass
* Features (20):
- fct (14): credit_history, employment_duration, foreign_worker,
housing, job, other_debtors, other_installment_plans,
people_liable, personal_status_sex, property, purpose, savings,
status, telephone
- int (3): age, amount, duration
- ord (3): installment_rate, number_credits, present_residence
<LearnerClassifXgboost:classif.xgboost>
* Model: -
* Parameters: verbose=0, nrounds=100
* Packages: xgboost
* Predict Type: response
* Feature types: logical, integer, numeric
* Properties: importance, missings, multiclass, twoclass, weights
We now compare the feature types of the task with the feature types supported by the learner:
unique(task$feature_types$type)
[1] "integer" "factor" "ordered"
learner$feature_types
[1] "logical" "integer" "numeric"
setdiff(task$feature_types$type, learner$feature_types)
[1] "factor" "ordered"
In this example, we have to convert factors and ordered factors to numeric columns to apply the xgboost learner. Because xgboost is based on decision trees (at least in its default settings), it is perfectly fine to convert the ordered factors to integer. Unordered factors must still be encoded though.
The factor encoder’s man page can be found under mlr_pipeops_encode. Here, we decide to use “treatment” encoding (the first factor level serves as baseline, and there will be a new binary column for each additional level). We restrict the operator to factor columns using the respective Selector, selector_type():
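Treatment encoding itself is ordinary dummy coding; a quick base-R illustration, independent of mlr3:

```r
# Treatment encoding in base R: the first level ("a") is the baseline,
# each remaining level gets its own 0/1 indicator column.
x = factor(c("a", "b", "c", "a"))
model.matrix(~ x)  # columns: (Intercept), xb, xc
```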
library(mlr3pipelines)
fencoder = po("encode", method = "treatment",
affect_columns = selector_type("factor"))
We can manually trigger the PipeOp to test the operator on our task:
fencoder$train(list(task))
$output
<TaskClassif:german_credit> (1000 x 50)
* Target: credit_risk
* Properties: twoclass
* Features (49):
- dbl (43): credit_history.all.credits.at.this.bank.paid.back.duly,
credit_history.critical.account.other.credits.elsewhere,
credit_history.existing.credits.paid.back.duly.till.now,
credit_history.no.credits.taken.all.credits.paid.back.duly,
employment_duration....7.yrs, employment_duration...1.yr,
employment_duration.1..........4.yrs,
employment_duration.4..........7.yrs, foreign_worker.yes,
housing.own, housing.rent,
job.manager.self.empl.highly.qualif..employee,
job.skilled.employee.official, job.unskilled...resident,
other_debtors.co.applicant, other_debtors.guarantor,
other_installment_plans.none, other_installment_plans.stores,
people_liable.3.or.more,
personal_status_sex.female...non.single.or.male...single,
personal_status_sex.female...single,
personal_status_sex.male...married.widowed,
property.building.soc..savings.agr....life.insurance,
property.car.or.other, property.real.estate, purpose.business,
purpose.car..new., purpose.car..used., purpose.domestic.appliances,
purpose.education, purpose.furniture.equipment,
purpose.radio.television, purpose.repairs, purpose.retraining,
purpose.vacation, savings........1000.DM, savings.......100.DM,
savings.100..........500.DM, savings.500..........1000.DM,
status........200.DM...salary.for.at.least.1.year,
status.......0.DM, status.0.........200.DM,
telephone.yes..under.customer.name.
- int (3): age, amount, duration
- ord (3): installment_rate, number_credits, present_residence
The ordered factors remained untouched; all other factors have been converted to numeric columns. To also convert the ordered variables installment_rate, number_credits, and present_residence, we construct the colapply operator with the converter as.integer():
ord_to_int = po("colapply", applicator = as.integer,
affect_columns = selector_type("ordered"))
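What as.integer() does to an ordered factor can be checked directly in base R; the integer codes follow the level order, which is exactly what a tree-based learner needs:

```r
# as.integer() maps each value of an ordered factor to the rank of its level
f = ordered(c("low", "high", "mid"), levels = c("low", "mid", "high"))
as.integer(f)  # 1 3 2
```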
Applied on the original task, it converts the ordered factor columns to integer:
ord_to_int$train(list(task))
$output
<TaskClassif:german_credit> (1000 x 21)
* Target: credit_risk
* Properties: twoclass
* Features (20):
- fct (14): credit_history, employment_duration, foreign_worker,
housing, job, other_debtors, other_installment_plans,
people_liable, personal_status_sex, property, purpose, savings,
status, telephone
- int (6): age, amount, duration, installment_rate, number_credits,
present_residence
Finally, we construct a linear pipeline consisting of fencoder, ord_to_int, and the learner:
pipe = fencoder %>>% ord_to_int %>>% learner
print(pipe)
Graph with 3 PipeOps:
ID State sccssors prdcssors
encode <list> colapply
colapply <list> classif.xgboost encode
classif.xgboost <<UNTRAINED>> colapply
The pipeline is wrapped in a GraphLearner so that it behaves like a regular learner:
glearner = GraphLearner$new(pipe)
We can now apply the new learner on the task, here with a 3-fold cross validation:
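The resampling call itself is missing from the extracted text; it presumably looks like this sketch using mlr3's resample() with a 3-fold CV resampling (the object name rr matches the aggregate call below):

```r
# 3-fold cross-validation of the graph learner (sketch; console output not shown)
resampling = rsmp("cv", folds = 3)
rr = resample(task, glearner, resampling)
```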
[05:24:35] WARNING: amalgamation/../src/learner.cc:1061: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
rr$aggregate()
classif.ce
0.2539995
Success! We augmented xgboost with handling of factors and ordered factors. If we combine this learner with a tuner from mlr3tuning, we get a universal and competitive learner.
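A hedged sketch of such a combination, wrapping glearner in an AutoTuner from mlr3tuning; the tuned parameters, their ranges, and the budget are illustrative choices, not taken from the original post:

```r
library(mlr3tuning)
library(paradox)

# Tune two xgboost hyperparameters of the pipeline; the "classif.xgboost."
# prefix addresses the xgboost PipeOp inside the graph.
search_space = ps(
  classif.xgboost.eta       = p_dbl(lower = 0.01, upper = 0.3),
  classif.xgboost.max_depth = p_int(lower = 1, upper = 10)
)

at = AutoTuner$new(
  learner      = glearner,
  resampling   = rsmp("cv", folds = 3),
  measure      = msr("classif.ce"),
  search_space = search_space,
  terminator   = trm("evals", n_evals = 20),
  tuner        = tnr("random_search")
)
# at$train(task) would run the tuning and fit a final model
# on the full task with the best configuration found.
```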
For attribution, please cite this work as
Lang (2020, Jan. 31). mlr3gallery: Encode factor levels for xgboost. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-01-31-encode-factors-for-xgboost/
BibTeX citation
@misc{lang2020encode,
  author = {Lang, Michel},
  title = {mlr3gallery: Encode factor levels for xgboost},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-01-31-encode-factors-for-xgboost/},
  year = {2020}
}