Feature Engineering of Date-Time Variables

We show how to engineer features using date-time variables.

Lennart Schneider
05-02-2020

In this tutorial, we demonstrate how mlr3pipelines can be used to easily engineer features based on date-time variables. Relying on the Bike Sharing Dataset and the ranger learner, we compare the RMSE of a random forest using only the original features (baseline) to the RMSE of a random forest using newly engineered features on top of the original ones.

Motivation

A single date-time variable (i.e., a POSIXct column) contains plenty of information, ranging from year, month, day, hour, minute and second to other features such as the week of the year or the day of the week. Moreover, most of these features are of a cyclical nature, i.e., the eleventh and twelfth hours of a day are one hour apart, but so are the 23rd hour and midnight of the following day (see also this blog post and fastai for more information).

Not respecting this cyclical nature results in treating hours on a linear continuum. One way to handle a cyclical feature \(\mathbf{x}\) is to compute the sine and cosine transformation of \(\frac{2 \pi \mathbf{x}}{\mathbf{x}_{\text{max}}}\), with \(\mathbf{x}_{\text{max}} = 24\) for hours and \(60\) for minutes and seconds.

This results in a two-dimensional representation of the feature that places all values on a circle, so the first and last values of a cycle end up close together.
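To illustrate, here is a minimal sketch of this transformation for the hour feature (the variable names are purely illustrative):


# map each hour onto the unit circle; the period is 24 for hours
hour = 0:23
hour_sin = sin(2 * pi * hour / 24)
hour_cos = cos(2 * pi * hour / 24)
# hour 23 and hour 0 are now close in the (sin, cos) plane
head(cbind(hour, hour_sin, hour_cos))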

mlr3pipelines provides PipeOpDateFeatures, a PipeOp that can be used to automatically engineer features based on POSIXct columns, including the handling of cyclical features.

This is useful because most learners cannot handle dates and POSIXct variables directly and therefore require such a conversion prior to training.

Bike Sharing

The Bike Sharing Dataset contains the hourly count of rental bikes between the years 2011 and 2012 in the Capital Bikeshare system, along with the corresponding weather and seasonal information. The dataset can be downloaded from the UCI Machine Learning Repository. After reading in the data, we fix some factor levels and convert some data types:


# download the dataset and read in the hourly data
tmp = tempfile()
download.file(
  "https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip",
  tmp)
bikes = read.csv(unz(tmp, filename = "hour.csv"), as.is = TRUE)
# fix factor levels and convert data types
bikes$season = factor(bikes$season,
  labels = c("winter", "spring", "summer", "fall"))
bikes$holiday = as.logical(bikes$holiday)
bikes$workingday = as.logical(bikes$workingday)
bikes$weathersit = as.factor(bikes$weathersit)

Our goal will be to predict the total number of rented bikes in a given hour: cnt.


str(bikes)

'data.frame':   17379 obs. of  17 variables:
 $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
 $ dteday    : chr  "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
 $ season    : Factor w/ 4 levels "winter","spring",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
 $ hr        : int  0 1 2 3 4 5 6 7 8 9 ...
 $ holiday   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ weekday   : int  6 6 6 6 6 6 6 6 6 6 ...
 $ workingday: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ weathersit: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 2 1 1 1 1 ...
 $ temp      : num  0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
 $ atemp     : num  0.288 0.273 0.273 0.288 0.288 ...
 $ hum       : num  0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
 $ windspeed : num  0 0 0 0 0 0.0896 0 0 0 0 ...
 $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
 $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
 $ cnt       : int  16 40 32 13 1 1 2 3 8 14 ...

The original dataset does not contain a POSIXct column, but we can easily generate one based on the other variables available (note that, as no information regarding minutes and seconds is available, we set them to :00:00):


bikes$date = as.POSIXct(strptime(paste0(bikes$dteday, " ", bikes$hr, ":00:00"),
  tz = "GMT", format = "%Y-%m-%d %H:%M:%S"))
summary(bikes$date)

                 Min.               1st Qu.                Median 
"2011-01-01 00:00:00" "2011-07-04 22:30:00" "2012-01-02 21:00:00" 
                 Mean               3rd Qu.                  Max. 
"2012-01-02 15:41:22" "2012-07-02 06:30:00" "2012-12-31 23:00:00" 

Baseline Random Forest

We construct a new regression task and create a vector of train and test indices:


library(mlr3)
library(mlr3learners)
set.seed(2906)
tsk = TaskRegr$new("bikes", backend = bikes, target = "cnt")
train.idx = sample(seq_len(tsk$nrow), size = 0.7 * tsk$nrow)
test.idx = setdiff(seq_len(tsk$nrow), train.idx)

This allows us to construct a train and test task:


tsk_train = tsk$clone()$filter(train.idx)
tsk_test = tsk$clone()$filter(test.idx)
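As a quick check, the two tasks now contain roughly 70% and 30% of the observations, respectively:


c(tsk_train$nrow, tsk_test$nrow)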

To estimate the performance on unseen data, we will use a 3-fold cross-validation.

Note that this means models are partly trained on future data and validated on past data, which is usually bad practice for temporal data but should suffice for this example:


cv3 = rsmp("cv", folds = 3)

To obtain reliable estimates on how well our model generalizes to the future, we would have to split our training and test sets according to the date variable.
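For example, a simple temporal holdout could be built by splitting on the date variable, say training on 2011 and testing on 2012 (a sketch that we do not pursue further here):


# temporal split: train on 2011, test on 2012
train.idx.time = which(bikes$date < as.POSIXct("2012-01-01", tz = "GMT"))
test.idx.time = which(bikes$date >= as.POSIXct("2012-01-01", tz = "GMT"))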

As our baseline model, we use a random forest (the ranger learner). For the baseline, we only use the original features that are sensible and drop instant (the record index), dteday (the year-month-day as a character, not usable) and date (our new POSIXct variable, which we will only use later). We also do not use casual (count of casual users) and registered (count of registered users) as features, because together they add up to cnt and could be used as alternative target variables if we were interested in only the casual or registered users.


lrn_rf = lrn("regr.ranger")
tsk_train_rf = tsk_train$clone()$select(setdiff(
  tsk$feature_names,
  c("instant", "dteday", "date", "casual", "registered")
  )
)
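As a quick check, we can inspect the features that remain for the baseline:


tsk_train_rf$feature_names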

We can then use resample with 3-fold cross-validation:


res_rf = resample(tsk_train_rf, learner = lrn_rf, resampling = cv3)
res_rf$score(msr("regr.mse"))

             task task_id                 learner  learner_id
1: <TaskRegr[42]>   bikes <LearnerRegrRanger[32]> regr.ranger
2: <TaskRegr[42]>   bikes <LearnerRegrRanger[32]> regr.ranger
3: <TaskRegr[42]>   bikes <LearnerRegrRanger[32]> regr.ranger
           resampling resampling_id iteration           prediction regr.mse
1: <ResamplingCV[19]>            cv         1 <PredictionRegr[18]> 4492.038
2: <ResamplingCV[19]>            cv         2 <PredictionRegr[18]> 4720.677
3: <ResamplingCV[19]>            cv         3 <PredictionRegr[18]> 4381.281

The average RMSE is given by:


sprintf("RMSE ranger original features: %s", round(sqrt(res_rf$aggregate()),
  digits = 2))

[1] "RMSE ranger original features: 67.32"

We now want to improve our baseline model by using newly engineered features based on the date POSIXct column.

PipeOpDateFeatures

To engineer new features we use PipeOpDateFeatures. This PipeOp automatically dispatches on the POSIXct columns of the data and by default adds plenty of new date-time-related features. Here, we want to add all of them except minute and second, because this information is not available in the data. As we additionally want to use cyclical versions of the features, we set cyclic = TRUE:


library(mlr3pipelines)
pop = po("datefeatures", param_vals = list(
  cyclic = TRUE, minute = FALSE, second = FALSE)
)

Training this pipeline will simply add the new features (and remove the original POSIXct feature(s) used for the feature engineering; see also the keep_date_var parameter). In our training task, we can now drop the features yr, mnth, hr, and weekday, because our pipeline will generate them anyway:


tsk_train_ex = tsk_train$clone()$select(setdiff(
  tsk$feature_names,
  c("instant", "dteday", "yr", "mnth", "hr", "weekday", "casual", "registered")
  )
)
pop$train(list(tsk_train_ex))

$output
<TaskRegr:bikes> (12165 x 29)
* Target: cnt
* Properties: -
* Features (28):
  - dbl (23): atemp, date.day_of_month, date.day_of_month_cos,
    date.day_of_month_sin, date.day_of_week, date.day_of_week_cos,
    date.day_of_week_sin, date.day_of_year, date.day_of_year_cos,
    date.day_of_year_sin, date.hour, date.hour_cos, date.hour_sin,
    date.month, date.month_cos, date.month_sin, date.week_of_year,
    date.week_of_year_cos, date.week_of_year_sin, date.year, hum, temp,
    windspeed
  - lgl (3): date.is_day, holiday, workingday
  - fct (2): season, weathersit

Note that it may be useful to familiarize yourself with PipeOpRemoveConstants, which can be used after the feature engineering to remove features that are constant; PipeOpDateFeatures does not do this step automatically.
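For example, both steps could be chained as follows (a sketch that we do not use below):


graph = po("datefeatures", param_vals = list(
  cyclic = TRUE, minute = FALSE, second = FALSE)
) %>>%
  po("removeconstants")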

To combine this feature engineering step with a random forest (ranger learner), we now construct a GraphLearner.

Using the New Features in a GraphLearner

We create a GraphLearner consisting of the PipeOpDateFeatures step and the ranger learner. This GraphLearner then behaves like any other Learner:


grl = GraphLearner$new(
  po("datefeatures", param_vals = list(
    cyclic = TRUE, minute = FALSE, second = FALSE)
  ) %>>%
  lrn("regr.ranger")
)
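Hyperparameters of both steps are exposed in the combined parameter set of the GraphLearner, prefixed with the id of the respective step. For example, we could set the number of trees of the random forest (500 is already ranger's default, so this leaves the results unchanged):


grl$param_set$values$regr.ranger.num.trees = 500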

Using resample with 3-fold cross-validation on the training task yields:


tsk_train_grl = tsk_train$clone()$select(setdiff(
  tsk$feature_names,
  c("instant", "dteday", "yr", "mnth", "hr", "weekday", "casual", "registered")
  )
)
res_grl = resample(tsk_train_grl, learner = grl, resampling = cv3)
res_grl$score(msr("regr.mse"))

             task task_id            learner               learner_id
1: <TaskRegr[42]>   bikes <GraphLearner[31]> datefeatures.regr.ranger
2: <TaskRegr[42]>   bikes <GraphLearner[31]> datefeatures.regr.ranger
3: <TaskRegr[42]>   bikes <GraphLearner[31]> datefeatures.regr.ranger
           resampling resampling_id iteration           prediction regr.mse
1: <ResamplingCV[19]>            cv         1 <PredictionRegr[18]> 2366.673
2: <ResamplingCV[19]>            cv         2 <PredictionRegr[18]> 2688.929
3: <ResamplingCV[19]>            cv         3 <PredictionRegr[18]> 2209.474

The average RMSE is given by


sprintf("RMSE graph learner date features: %s", round(sqrt(res_grl$aggregate()),
  digits = 2))

[1] "RMSE graph learner date features: 49.21"

and therefore improved by almost 30%!
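The relative improvement can be computed directly from the aggregated scores:


# 1 - 49.21 / 67.32, i.e., roughly a 27% reduction in RMSE
1 - sqrt(res_grl$aggregate()) / sqrt(res_rf$aggregate())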

Finally, we fit our GraphLearner on the complete training task and predict on the test task:


tsk_train$select(setdiff(
  tsk$feature_names,
  c("instant", "dteday", "yr", "mnth", "hr", "weekday", "casual", "registered")
  )
)
grl$train(tsk_train)

tsk_test$select(setdiff(
  tsk$feature_names,
  c("instant", "dteday", "yr", "mnth", "hr", "weekday", "casual", "registered")
  )
)
pred = grl$predict(tsk_test)
pred$score(msr("regr.mse"))

regr.mse 
1832.432 

Taking the square root yields the RMSE on the held-out test data:


sprintf("RMSE graph learner date features: %s", round(sqrt(pred$score(msr("regr.mse"))),
  digits = 2))

[1] "RMSE graph learner date features: 42.81"

Citation

For attribution, please cite this work as

Schneider (2020, May 2). mlr3gallery: Feature Engineering of Date-Time Variables. Retrieved from https://mlr3gallery.mlr-org.com/posts/2020-05-02-feature-engineering-of-date-time-variables/

BibTeX citation

@misc{schneider2020feature,
  author = {Schneider, Lennart},
  title = {mlr3gallery: Feature Engineering of Date-Time Variables},
  url = {https://mlr3gallery.mlr-org.com/posts/2020-05-02-feature-engineering-of-date-time-variables/},
  year = {2020}
}