Encode factor levels for xgboost

The “xgboost” package does not handle categorical features itself, so factor columns have to be converted to numerical dummy features manually. We show how to use “mlr3pipelines” to augment the “mlr_learners_classif.xgboost” learner with automatic factor encoding.

Michel Lang
01-31-2020

Construct the Base Objects

First, we take an example task with factors (german_credit) and create the xgboost learner:


library(mlr3)
library(mlr3learners)

task = tsk("german_credit")
print(task)

<TaskClassif:german_credit> (1000 x 21)
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker,
    housing, job, other_debtors, other_installment_plans,
    people_liable, personal_status_sex, property, purpose, savings,
    status, telephone
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence

learner = lrn("classif.xgboost", nrounds = 100)
print(learner)

<LearnerClassifXgboost:classif.xgboost>
* Model: -
* Parameters: verbose=0, nrounds=100
* Packages: xgboost
* Predict Type: response
* Feature types: logical, integer, numeric
* Properties: importance, missings, multiclass, twoclass, weights

We now compare the feature types of the task with the feature types supported by the learner:


unique(task$feature_types$type)

[1] "integer" "factor"  "ordered"

learner$feature_types

[1] "logical" "integer" "numeric"

setdiff(task$feature_types$type, learner$feature_types)

[1] "factor"  "ordered"

In this example, we have to convert factors and ordered factors to numeric columns before we can apply the xgboost learner. Because xgboost is based on decision trees (at least in its default settings), and tree splits depend only on the order of the feature values, it is perfectly fine to convert the ordered factors to integers. Unordered factors have no such order and must still be encoded.
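
The integer conversion of an ordered factor can be illustrated with a small base R sketch. The toy vector rating and its levels are made up for this illustration and are not part of the german_credit task:

```r
# Hypothetical toy data, not taken from the german_credit task.
rating = factor(c("low", "high", "medium", "low"),
  levels = c("low", "medium", "high"), ordered = TRUE)

# as.integer() returns the position of each value in the level order.
codes = as.integer(rating)
print(codes)
# [1] 1 3 2 1

# Comparisons on the integer codes agree with comparisons on the ordered
# factor, so a threshold split on the codes separates the same observations.
print(all((rating <= "medium") == (codes <= 2L)))
# [1] TRUE
```

The codes keep the order of the levels but not the level labels, which is all a threshold-based split needs.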

Construct Operators

The factor encoder’s man page can be found under mlr_pipeops_encode. Here, we decide to use “treatment” encoding (first factor level serves as baseline, and there will be a new binary column for each additional level). We restrict the operator to factor columns using the respective Selector selector_type():


library(mlr3pipelines)
fencoder = po("encode", method = "treatment",
  affect_columns = selector_type("factor"))
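
To see what treatment encoding produces, here is a minimal base R sketch on a made-up factor. It uses model.matrix() from the stats package as a stand-in and assumes R's default contrast settings; it is not meant to describe how the encode PipeOp is implemented, and the column names it generates differ from those of mlr3pipelines:

```r
# Hypothetical toy data; the levels are sorted alphabetically: blue, green, red.
toy = data.frame(color = factor(c("red", "green", "blue", "green")))

# With R's default options, model.matrix() applies treatment contrasts to
# unordered factors. Dropping the intercept column leaves one 0/1 column per
# non-baseline level; the first level ("blue") is the baseline.
dummies = model.matrix(~ color, data = toy)[, -1, drop = FALSE]
print(dummies)
```

The baseline level "blue" is represented by a row of zeros, so a factor with three levels yields two new columns.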

We can manually trigger the PipeOp to test the operator on our task:


fencoder$train(list(task))

$output
<TaskClassif:german_credit> (1000 x 50)
* Target: credit_risk
* Properties: twoclass
* Features (49):
  - dbl (43): credit_history.all.credits.at.this.bank.paid.back.duly,
    credit_history.critical.account.other.credits.elsewhere,
    credit_history.existing.credits.paid.back.duly.till.now,
    credit_history.no.credits.taken.all.credits.paid.back.duly,
    employment_duration....7.yrs, employment_duration...1.yr,
    employment_duration.1..........4.yrs,
    employment_duration.4..........7.yrs, foreign_worker.yes,
    housing.own, housing.rent,
    job.manager.self.empl.highly.qualif..employee,
    job.skilled.employee.official, job.unskilled...resident,
    other_debtors.co.applicant, other_debtors.guarantor,
    other_installment_plans.none, other_installment_plans.stores,
    people_liable.3.or.more,
    personal_status_sex.female...non.single.or.male...single,
    personal_status_sex.female...single,
    personal_status_sex.male...married.widowed,
    property.building.soc..savings.agr....life.insurance,
    property.car.or.other, property.real.estate, purpose.business,
    purpose.car..new., purpose.car..used., purpose.domestic.appliances,
    purpose.education, purpose.furniture.equipment,
    purpose.radio.television, purpose.repairs, purpose.retraining,
    purpose.vacation, savings........1000.DM, savings.......100.DM,
    savings.100..........500.DM, savings.500..........1000.DM,
    status........200.DM...salary.for.at.least.1.year,
    status.......0.DM, status.0.........200.DM,
    telephone.yes..under.customer.name.
  - int (3): age, amount, duration
  - ord (3): installment_rate, number_credits, present_residence

The ordered factors remained untouched, while all other factors have been converted to numeric columns. To also convert the ordered variables installment_rate, number_credits, and present_residence, we construct the colapply operator with the converter as.integer():


ord_to_int = po("colapply", applicator = as.integer,
  affect_columns = selector_type("ordered"))

Applied to the original task, it changes the ordered factor columns to integer and leaves the unordered factors as they are:


ord_to_int$train(list(task))

$output
<TaskClassif:german_credit> (1000 x 21)
* Target: credit_risk
* Properties: twoclass
* Features (20):
  - fct (14): credit_history, employment_duration, foreign_worker,
    housing, job, other_debtors, other_installment_plans,
    people_liable, personal_status_sex, property, purpose, savings,
    status, telephone
  - int (6): age, amount, duration, installment_rate, number_credits,
    present_residence
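
As a sanity check, the two operators can be applied one after the other by hand to confirm that no unsupported feature type is left. This is only a sketch: the names task_encoded, task_numeric, supported, and unsupported are introduced here for illustration, and the vector of supported types is copied from the learner printout above rather than queried from the learner:

```r
library(mlr3)
library(mlr3pipelines)

# Rebuild the task and both operators so that this block runs on its own.
task = tsk("german_credit")
fencoder = po("encode", method = "treatment",
  affect_columns = selector_type("factor"))
ord_to_int = po("colapply", applicator = as.integer,
  affect_columns = selector_type("ordered"))

# Apply the encoder first and feed its output task into the converter.
task_encoded = fencoder$train(list(task))$output
task_numeric = ord_to_int$train(list(task_encoded))$output

# Feature types supported by the xgboost learner (copied from its printout).
supported = c("logical", "integer", "numeric")
unsupported = setdiff(task_numeric$feature_types$type, supported)
print(unsupported)
```

If the returned vector is empty, every remaining feature has a type the learner accepts.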

Construct Pipeline

Finally, we construct a linear pipeline consisting of

  1. the factor encoder fencoder,
  2. the ordered factor converter ord_to_int, and
  3. the xgboost base learner.

pipe = fencoder %>>% ord_to_int %>>% learner
print(pipe)

Graph with 3 PipeOps:
              ID         State        sccssors prdcssors
          encode        <list>        colapply          
        colapply        <list> classif.xgboost    encode
 classif.xgboost <<UNTRAINED>>                  colapply

The pipeline is wrapped in a GraphLearner so that it behaves like a regular learner:


glearner = GraphLearner$new(pipe)

We can now apply the new learner to the task, here with a 3-fold cross-validation:


rr = resample(task, glearner, rsmp("cv", folds = 3))
rr$aggregate()

classif.ce 
 0.2620075 
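
The aggregated error depends on the random fold assignment, so repeated runs will not reproduce the number above exactly. The following sketch fixes the random seed (the value 42 is arbitrary) and also prints the error of each fold. It rebuilds the pipeline so that it runs on its own and assumes that the xgboost package is installed:

```r
library(mlr3)
library(mlr3learners)
library(mlr3pipelines)

# Rebuild the pipeline from the steps above.
task = tsk("german_credit")
pipe = po("encode", method = "treatment",
    affect_columns = selector_type("factor")) %>>%
  po("colapply", applicator = as.integer,
    affect_columns = selector_type("ordered")) %>>%
  lrn("classif.xgboost", nrounds = 100)
glearner = GraphLearner$new(pipe)

# Fixing the seed makes the random fold assignment repeatable.
set.seed(42)
rr = resample(task, glearner, rsmp("cv", folds = 3))

# One row per fold, with the classification error on each test set.
scores = rr$score(msr("classif.ce"))
print(scores[, c("iteration", "classif.ce")])

# Mean classification error over the three folds.
print(rr$aggregate(msr("classif.ce")))
```

Looking at the per-fold values next to the mean gives a rough impression of how much the estimate varies between folds.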

Success! We augmented xgboost with handling of factors and ordered factors. If we combine this learner with a tuner from mlr3tuning, we get a universal and competitive learner.

Citation

For attribution, please cite this work as

Lang (2020, Jan. 31). mlr3gallery: Encode factor levels for xgboost. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-01-31-encode-factors-for-xgboost/

BibTeX citation

@misc{lang2020encode,
  author = {Lang, Michel},
  title = {mlr3gallery: Encode factor levels for xgboost},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-01-31-encode-factors-for-xgboost/},
  year = {2020}
}