Rule Your Data with Tidy Validation Reports. Design

2017-11-28

rstats ruler

The story of the design of the ruler package: dplyr-style exploration and validation of data frame like objects.

Prologue

Some time ago I had a task to write data validation code. As with most R practitioners, this led me to explore existing solutions. I was looking for a package with the following features:

  • It should take relatively little time to learn before it can be used comfortably. Preferably, it should be built with tidyverse in mind.
  • It should be quite flexible in terms of types of validation rules.
  • It should offer functionality for both validations (with relatively simple output format) and assertions (with relatively flexible behaviour).
  • Pipe-friendliness.
  • Validating only data frames would be enough.

After devoting a couple of days to research, I didn’t find any package that fully (subjectively) met my needs (for a composed list look here). So I decided to write one myself. More precisely, it turned into not one but two packages: ruler and keyholder, which powers some of ruler’s functionality.

This post is a rather long story about key moments in ruler’s design process. To learn about other aspects, see its README (for a relatively brief introduction) or vignettes (for a more thorough description of package capabilities).

Overview

In my mind, the whole process of data validation should be performed in the following steps:

  • Create conditions (rules) for data to meet.
  • Expose data to them and obtain some kind of unified report as a result.
  • Act based on the report.

The design process, however, went through a slightly different sequence of definition steps:

  • Validation result.
  • Rules definition.
  • Exposing process.
  • Act after exposure.

Of course, there was switching between these items in order to ensure they would work well together, but I feel this order was decisive for the end result.

suppressMessages(library(dplyr))
suppressMessages(library(purrr))
library(ruler)

Validation result

Dplyr data units

I started by trying to formulate validation simply and clearly: it is a process of checking whether something satisfies certain conditions. As validating only data frames was enough, something should be thought of as parts of a data frame, which I will call data units. Certain conditions might be represented as functions, which I will call rules, each associated with some data unit and returning TRUE if the condition is satisfied and FALSE otherwise.

I decided to make the dplyr package the default tool for creating rules, basically because it satisfies most of the conditions I had in mind. I also tend to use it for interactive validation of data frames, as, I am sure, do many other R users. Its pipe-friendliness brings another important feature: interactive code can be transformed into a function just by replacing the initial data frame variable with a dot (.). This creates a functional sequence, “a function which applies the entire chain of right-hand sides in turn to its input.”
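
As a minimal sketch of that transformation (the name total_abs and the input vector are made up for illustration, and only magrittr is assumed to be installed):

```r
library(magrittr)

# Interactive version: the pipeline starts from a concrete object
x <- c(-2, 0, 3, 5)
x %>% abs() %>% sum()

# Replacing the starting object with a dot turns the same chain
# into a reusable function (a functional sequence)
total_abs <- . %>% abs() %>% sum()

# The result is an ordinary function that can be applied to any input
total_abs(x)
total_abs(c(-1, -1))
```

The same trick is what lets a chain of dplyr verbs written for one data frame be stored and later applied to another.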

dplyr offers a set of tools for operating with the following data units (see comments):

is_integerish <- function(x) {all(x == as.integer(x))}
z_score <- function(x) {abs(x - mean(x)) / sd(x)}

mtcars_tbl <- mtcars %>% as_tibble()

# Data frame as a whole
validate_data <- . %>% summarise(nrow_low = n() >= 15,
                                 nrow_up = n() <= 20)
mtcars_tbl %>% validate_data()
## # A tibble: 1 x 2
##   nrow_low nrow_up
##      <lgl>   <lgl>
## 1     TRUE   FALSE

# Group as a whole
validate_groups <- . %>% group_by(vs, am) %>%
  summarise(vs_am_low = n() >= 7) %>%
  ungroup()
mtcars_tbl %>% validate_groups()
## # A tibble: 4 x 3
##      vs    am vs_am_low
##   <dbl> <dbl>     <lgl>
## 1     0     0      TRUE
## 2     0     1     FALSE
## 3     1     0      TRUE
## 4     1     1      TRUE

# Column as a whole
validate_columns <- . %>%
  summarise_if(is_integerish, funs(is_enough_sum = sum(.) >= 14))
mtcars_tbl %>% validate_columns()
## # A tibble: 1 x 6
##   cyl_is_enough_sum hp_is_enough_sum vs_is_enough_sum am_is_enough_sum
##               <lgl>            <lgl>            <lgl>            <lgl>
## 1              TRUE             TRUE             TRUE            FALSE
## # ... with 2 more variables: gear_is_enough_sum <lgl>,
## #   carb_is_enough_sum <lgl>

# Row as a whole
validate_rows <- . %>% filter(vs == 1) %>%
  transmute(is_enough_sum = rowSums(.) >= 200)
mtcars_tbl %>% validate_rows()
## # A tibble: 14 x 1
##   is_enough_sum
##           <lgl>
## 1          TRUE
## 2          TRUE
## 3          TRUE
## 4          TRUE
## 5          TRUE
## # ... with 9 more rows

# Cell
validate_cells <- . %>%
  transmute_if(is.numeric, funs(is_out = z_score(.) > 1)) %>%
  slice(-(1:5))
mtcars_tbl %>% validate_cells()
## # A tibble: 27 x 11
##   mpg_is_out cyl_is_out disp_is_out hp_is_out drat_is_out wt_is_out
##        <lgl>      <lgl>       <lgl>     <lgl>       <lgl>     <lgl>
## 1      FALSE      FALSE       FALSE     FALSE        TRUE     FALSE
## 2      FALSE       TRUE        TRUE      TRUE       FALSE     FALSE
## 3      FALSE       TRUE       FALSE      TRUE       FALSE     FALSE
## 4      FALSE       TRUE       FALSE     FALSE       FALSE     FALSE
## 5      FALSE      FALSE       FALSE     FALSE       FALSE     FALSE
## # ... with 22 more rows, and 5 more variables: qsec_is_out <lgl>,
## #   vs_is_out <lgl>, am_is_out <lgl>, gear_is_out <lgl>, carb_is_out <lgl>

Tidy data validation report

After realizing this type of dplyr structure, I noticed the following points.

In order to use dplyr as a tool for creating rules, there should be one extra level of abstraction for the whole functional sequence. It is not a single rule but rather several rules. In other words, it is a function that answers multiple questions about one type of data unit. I decided to call this a rule pack or simply a pack.

In order to identify whether some data unit obeys some rule, one needs to describe that data unit, the rule and the result of validation. Descriptions of the last two are simple: for the rule it is a combination of pack and rule names (which should always be defined), and for the validation result it is the value TRUE or FALSE.

Description of data unit is trickier. After some thought, I decided that the most balanced way to do it is with two variables:

  • var (character) which represents the variable name of data unit:
    • Value “.all” is reserved for “all columns as a whole”.
    • Value equal to some column name indicates column of data unit.
    • Value not equal to some column name indicates the name of group: it is created by uniting (with delimiter) group levels of grouping columns.
  • id (integer) which represents the row index of data unit:
    • Value 0 is reserved for “all rows as a whole”.
    • Value not equal to 0 indicates the row index of data unit.

Combinations of these variables describe all mentioned data units:

  • var == '.all' and id == 0: Data as a whole.
  • var != '.all' and id == 0: Group (var shouldn’t be an actual column name) or column (var should be an actual column name) as a whole.
  • var == '.all' and id != 0: Row as a whole.
  • var != '.all' and id != 0: Described cell.
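
The combinations above can be written down as a small lookup. This is only an illustrative sketch in base R: describe_unit is a hypothetical helper, not a ruler function, and the group name "0.1" is a made-up example of united group levels.

```r
# Hypothetical helper (not part of ruler): name the data unit described
# by a (var, id) pair, given the column names of the validated data frame
describe_unit <- function(var, id, col_names) {
  is_all_cols <- var == ".all"
  is_all_rows <- id == 0
  if (is_all_cols && is_all_rows) {
    "data"
  } else if (is_all_cols) {
    "row"
  } else if (!is_all_rows) {
    "cell"
  } else if (var %in% col_names) {
    "column"
  } else {
    "group"
  }
}

cols <- names(mtcars)
describe_unit(".all", 0, cols)   # data as a whole
describe_unit("am", 0, cols)     # column as a whole
describe_unit("0.1", 0, cols)    # group: "0.1" is not a column name
describe_unit(".all", 19, cols)  # row 19 as a whole
describe_unit("mpg", 6, cols)    # cell in column mpg, row 6
```

Note that the group versus column distinction depends on the column names of the data, which is why the helper needs them as an argument.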

With this knowledge in mind, I decided that the tidy data validation report should be a tibble with the following columns:

  • pack <chr> : Pack name.
  • rule <chr> : Rule name inside pack.
  • var <chr> : Variable name of data unit.
  • id <int> : Row index of data unit.
  • value <lgl> : Whether the described data unit obeys the rule.
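
To show why this long format is convenient, here is a hand-written report with these five columns (the rows are typed in manually for illustration, not produced by ruler) and a couple of ordinary operations on it:

```r
library(tibble)

# Hand-written report with the five columns described above
report <- tibble(
  pack  = c("my_data", "my_data", "my_col"),
  rule  = c("nrow_low", "nrow_up", "is_enough_sum"),
  var   = c(".all", ".all", "am"),
  id    = c(0L, 0L, 0L),
  value = c(TRUE, FALSE, FALSE)
)

# Breakers are simply the rows with value == FALSE
breakers <- report[!report$value, ]
breakers

# Count breakers per pack
table(breakers$pack)
```

Because every validated data unit occupies exactly one row, questions like "which rules were broken" or "how many breakers per pack" reduce to filtering and counting.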

Exposure

Using only the described report as validation output is possible if only information about breakers (data units which do not obey their respective rules) is of interest. However, reproducibility is a big deal in the R community, and keeping information about the call can be helpful for future use.

This idea led to the creation of another object in ruler called packs info. It is also a tibble, and it contains all information about the exposure call:

  • name <chr> : Name of the rule pack. This column is used to match column pack in tidy report.
  • type <chr> : Name of pack type. Indicates which data unit pack checks.
  • fun <list> : List of actually used rule pack functions.
  • remove_obeyers <lgl> : Value of convenience argument of the future expose function. It indicates whether rows about obeyers (data units that obey certain rule) were removed from report after applying pack.

To fully represent validation, the two described tibbles should be returned together. So the actual validation result was decided to be an exposure, which is basically an S3 class list with two tibbles, packs_info and report. This data structure is fairly easy to understand and use. For example, exposures can be bound together, which is useful for combining several validation results. Also, its elements are regular tibbles which can be filtered, summarised, joined, etc.
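
A rough sketch of such a structure in base R follows. Everything here is hypothetical: the class name exposure_sketch and both helper functions are invented for illustration, plain data frames stand in for tibbles, and the fun list-column is omitted for brevity. It is not ruler's actual implementation.

```r
# Hypothetical constructor: an S3 list holding the two tables
new_exposure_sketch <- function(packs_info, report) {
  structure(
    list(packs_info = packs_info, report = report),
    class = "exposure_sketch"
  )
}

# Hypothetical binder: stack both elements row-wise
bind_exposure_sketch <- function(x, y) {
  new_exposure_sketch(
    packs_info = rbind(x$packs_info, y$packs_info),
    report = rbind(x$report, y$report)
  )
}

exp_1 <- new_exposure_sketch(
  data.frame(name = "my_data", type = "data_pack", remove_obeyers = TRUE,
             stringsAsFactors = FALSE),
  data.frame(pack = "my_data", rule = "nrow_up", var = ".all", id = 0L,
             value = FALSE, stringsAsFactors = FALSE)
)
exp_2 <- new_exposure_sketch(
  data.frame(name = "my_col", type = "col_pack", remove_obeyers = TRUE,
             stringsAsFactors = FALSE),
  data.frame(pack = "my_col", rule = "is_enough_sum", var = "am", id = 0L,
             value = FALSE, stringsAsFactors = FALSE)
)

both <- bind_exposure_sketch(exp_1, exp_2)
both$packs_info
both$report
```

The name column of packs_info and the pack column of report keep the two tables joinable after binding.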

Rules definition

Interpretation of dplyr output

I wanted to use pure dplyr for creating rule packs, i.e. without extra knowledge of the data unit to be validated. However, I found it impossible to do without running into annoying edge cases. The problem with this approach is that all dplyr outputs are tibbles with similar structures. The only differentiating features are:

  • summarise without grouping returns a tibble with one row and user-defined column names.
  • summarise with grouping returns a tibble with the number of rows equal to the number of summarised groups. Columns consist of grouping and user-defined ones.
  • transmute returns a tibble with the same number of rows as the input data frame and user-defined column names.
  • Scoped variants of summarise and transmute differ from regular ones in another mechanism of creating columns. They apply all supplied functions to all chosen columns. Resulting names are “the shortest … needed to uniquely identify the output”. It means that:
    • In case of one function they are column names.
    • In case of more than one function and one column they are function names.
    • In case of more than one column and function they are combinations of column and function names, pasted with character _ (which, unfortunately, is hardcoded). To force this behaviour in previous cases both columns and functions should be named inside of helper functions vars and funs. To match output columns with combination of validated column and rule, this option is preferred. However, there is a need of different separator between column and function names, as character _ is frequently used in column names.

The first attempt was to use the following algorithm to interpret (identify validated data unit) the output:

  • If there is at least one non-logical column then groups are validated. The reason is that in most cases grouping columns are character or factor ones. This already introduces an edge case with logical grouping columns.
  • Combination of whether number of rows equals 1 (n_rows_one) and presence of name separator in all column names (all_contain_sep) is used to make interpretation:
    • If n_rows_one == TRUE and all_contain_sep == FALSE then data is validated.
    • If n_rows_one == TRUE and all_contain_sep == TRUE then columns are validated.
    • If n_rows_one == FALSE and all_contain_sep == FALSE then rows are validated. This introduces an edge case when output has one row which is intended to be validated. It will be interpreted as ‘data as a whole’.
    • If n_rows_one == FALSE and all_contain_sep == TRUE then cells are validated. This also has edge case when output has one row in which cells are intended to be validated. It will be interpreted as ‘columns as a whole’.
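
The algorithm above can be sketched in a few lines of base R. The function guess_unit is a hypothetical illustration of the described logic, not the code used inside ruler, and it assumes the literal separator "._." in column names.

```r
# Hypothetical sketch of the guessing logic for a rule pack output
guess_unit <- function(out, sep = "._.") {
  # Any non-logical column is taken as a sign of grouping columns
  is_lgl <- vapply(out, is.logical, logical(1))
  if (!all(is_lgl)) {
    return("group")
  }

  n_rows_one <- nrow(out) == 1
  all_contain_sep <- all(grepl(sep, names(out), fixed = TRUE))

  if (n_rows_one && !all_contain_sep) {
    "data"
  } else if (n_rows_one && all_contain_sep) {
    "column"
  } else if (!n_rows_one && !all_contain_sep) {
    "row"
  } else {
    "cell"
  }
}

guess_unit(data.frame(nrow_low = TRUE, nrow_up = FALSE))
guess_unit(data.frame(vs = c(0, 1), vs_low = c(TRUE, FALSE)))
guess_unit(data.frame(`am_._.is_enough` = FALSE, check.names = FALSE))
guess_unit(data.frame(is_enough_sum = c(TRUE, FALSE)))
guess_unit(data.frame(`mpg_._.is_out` = c(TRUE, FALSE), check.names = FALSE))
```

The edge cases mentioned above are easy to reproduce with it: a one-row output meant as a row validation is reported as "data".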

Despite having edge cases, this algorithm is good for guessing the validated data unit, which can be useful for interactive use. Its important prerequisite is a simple way of forcing extended naming in scoped dplyr verbs with a custom, rarely used separator.

Pack creation

Researching a pure dplyr-style way of creating rule packs left no choice but to create a mechanism for supplying information about the data unit of interest along with pack functions. It consists of the following important principles.

Use ruler’s function rules() instead of funs(). Its goals are to force the use of full naming in scoped dplyr verbs as much as possible and to impute missing rule names (as every rule should be named for the validation report). rules is just a wrapper for funs with the extra functionality of naming every output element and adding a prefix to those names (which will be used as a part of the separator between column and rule names). By default the prefix is the string ._. (dot, underscore, dot). It is chosen for its, hopefully, rare usage inside column names and for its symbolism (it is the Morse code of the letter ‘R’).

funs(mean, sd)
## <fun_calls>
## $ mean: mean(.)
## $ sd  : sd(.)

rules(mean, sd)
## <fun_calls>
## $ ._.rule..1: mean(.)
## $ ._.rule..2: sd(.)

rules(mean, sd, .prefix = "___")
## <fun_calls>
## $ ___rule..1: mean(.)
## $ ___rule..2: sd(.)

rules(fn_1 = mean, fn_2 = sd)
## <fun_calls>
## $ ._.fn_1: mean(.)
## $ ._.fn_2: sd(.)

Note that in case of using only one column in scoped verb it should be named within dplyr::vars in order to force full naming.

Use functions supported by keyholder to build rule packs. One of the main features I was going to implement is the ability to validate only a subset of all possible data units, for example only the last two rows (or columns) of a data frame. There is no problem with columns: they can be specified with summarise_at. However, the default way of specifying rows is by subsetting the data frame, after which all information about the original row positions is lost. To solve this, I needed a mechanism for tracking rows as invisibly to the user as possible. This led to the creation of the keyholder package (which is also on CRAN now). To learn details about it go to its site or read my previous post.
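
The row tracking problem can be illustrated with base R alone. The sketch below is not how keyholder works internally; it records positions in an ordinary, visible column (the name .id is an arbitrary choice here) purely to show what information needs to survive subsetting.

```r
df <- data.frame(x = c(10, 20, 30, 40, 50))

# After subsetting and resetting row names (tibbles behave this way),
# the remaining rows are simply numbered from 1 again
sub <- df[df$x > 30, , drop = FALSE]
rownames(sub) <- NULL
seq_len(nrow(sub))

# Illustrative workaround: record the row index before subsetting
df_keyed <- df
df_keyed$.id <- seq_len(nrow(df_keyed))
sub_keyed <- df_keyed[df_keyed$x > 30, , drop = FALSE]

# The original positions are still available after subsetting
sub_keyed$.id
```

With the original index preserved, a broken rule found in the subset can be reported against the row of the full data frame.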

Use specific rule pack wrappers for certain data units. Their goal is to create S3 classes for rule packs in order to carry information about data unit of interest through exposing process. All of them always return a list with supplied functions but with changed attribute class (with additional group_vars and group_sep for group_packs()). Note that packs might be named inside these functions, which is recommended. If not, names will be imputed during exposing process. Also note that supplied functions are not checked to be correct in terms of validating specified data unit. This is done during exposure (exposing process).

# Data unit. Rule pack is manually named 'my_data'
my_data_packs <- data_packs(my_data = validate_data)
map(my_data_packs, class)
## $my_data
## [1] "data_pack" "rule_pack" "fseq"      "function"

# Group unit. Need to supply grouping variables explicitly
my_group_packs <- group_packs(validate_groups, .group_vars = c("vs", "am"))
map(my_group_packs, class)
## [[1]]
## [1] "group_pack" "rule_pack"  "fseq"       "function"

# Column unit. Need to be rewritten using `rules`
my_col_packs <- col_packs(
  my_col = . %>%
    summarise_if(is_integerish, rules(is_enough_sum = sum(.) >= 14))
)
map(my_col_packs, class)
## $my_col
## [1] "col_pack"  "rule_pack" "fseq"      "function"

# Row unit. One can supply several rule packs
my_row_packs <- row_packs(
  my_row_1 = validate_rows,
  my_row_2 = . %>% transmute(is_vs_one = vs == 1)
)
map(my_row_packs, class)
## $my_row_1
## [1] "row_pack"  "rule_pack" "fseq"      "function" 
## 
## $my_row_2
## [1] "row_pack"  "rule_pack" "fseq"      "function"

# Cell unit. Also needs to be rewritten using `rules`.
my_cell_packs <- cell_packs(
  my_cell = . %>%
    transmute_if(is.numeric, rules(is_out = z_score(.) > 1)) %>%
    slice(-(1:5))
)
map(my_cell_packs, class)
## $my_cell
## [1] "cell_pack" "rule_pack" "fseq"      "function"

Exposing process

After sorting things out with the formats of the validation result and rule packs, it was time to combine them in ruler’s main function: expose(). I had the following requirements:

  • It should be insertable inside common %>% pipelines as smoothly and flexibly as possible. Two main examples are validating a data frame before performing some operations with it and actually obtaining the results of validation.
  • It should be possible to apply expose sequentially with different rule packs. In this case the exposure (validation report) after the first call should be updated with the new exposure. In other words, the result should be as if those rule packs were both supplied to expose in one call.

These requirements led to the following main design property of expose: it never modifies content of input data frame but possibly creates or updates attribute exposure with validation report. To access validation data there are wrappers get_exposure(), get_report() and get_packs_info(). The whole exposing process can be described as follows:

  • Apply all supplied rule packs to a version of the input data frame keyed with keyholder::use_id.
  • Impute names of rule packs based on a possibly present exposure (from a previous use of expose) and the validated data units.
  • Bind the possibly present exposure with the new ones and create or update attribute exposure with the result.
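
The main design property (content stays untouched, results accumulate in an attribute) can be sketched in base R. Here expose_sketch is a hypothetical stand-in that takes ready-made report rows instead of rule packs; it only illustrates the attribute mechanics, not ruler's actual expose().

```r
# Hypothetical stand-in for expose(): append report rows to the
# "exposure" attribute and return the data frame itself
expose_sketch <- function(.tbl, new_report) {
  attr(.tbl, "exposure") <- rbind(attr(.tbl, "exposure"), new_report)
  .tbl
}

report_1 <- data.frame(pack = "my_data", rule = "nrow_up", var = ".all",
                       id = 0L, value = FALSE, stringsAsFactors = FALSE)
report_2 <- data.frame(pack = "my_row", rule = "is_enough_sum", var = ".all",
                       id = 19L, value = FALSE, stringsAsFactors = FALSE)

# Two sequential calls, as in a pipeline with two expose() steps
exposed <- expose_sketch(expose_sketch(mtcars, report_1), report_2)

# The columns are untouched ...
identical(exposed$mpg, mtcars$mpg)

# ... while the attribute holds report rows from both calls
attr(exposed, "exposure")
```

Because the return value is still the input data frame, such a step can sit in the middle of a pipeline without disturbing the steps after it.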

Also it was decided (for flexibility and convenience) to add the following arguments to expose:

  • .rule_sep. It is a regular expression used to delimit column and function names in the output of scoped dplyr verbs. By default it is the string ._. possibly surrounded by punctuation characters. This is done to account for dplyr’s hardcoded use of _ in scoped verbs. Note that .rule_sep should take into account the separator used in rules().
  • .remove_obeyers. It is a logical argument indicating whether to automatically remove elements which obey rules from the tidy validation report. It can be very useful because the usual result of validation is a handful of rule breakers. If .remove_obeyers could not be set to TRUE (which is the default), the validation report would grow unnecessarily big.
  • .guess. By default expose guesses the type of an unsupported rule pack with the algorithm described before. In order to write strict and robust code this can be set to FALSE, in which case an error will be thrown after detecting an unfamiliar pack type.
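
To illustrate what .rule_sep has to accomplish, here is a base R sketch of splitting output names into column and rule parts. The regular expression is my approximation of "the string ._. possibly surrounded by punctuation characters", not the exact default used by ruler, and the names are typed in by hand.

```r
# Assumed approximation of the default separator: the string "._."
# optionally surrounded by punctuation characters
rule_sep <- "[[:punct:]]*\\._\\.[[:punct:]]*"

# Names of the kind a scoped verb produces with rules(is_enough_sum = ...):
# dplyr pastes column and function names together with "_"
out_names <- c("cyl_._.is_enough_sum", "am_._.is_enough_sum")

parts <- strsplit(out_names, rule_sep)

# First part is the validated column, second part is the rule name
col_part <- vapply(parts, `[`, character(1), 1)
rule_part <- vapply(parts, `[`, character(1), 2)
col_part
rule_part
```

Allowing surrounding punctuation is what absorbs the extra "_" that dplyr inserts, while underscores inside the rule name itself are left alone.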

Some examples:

mtcars_tbl %>%
  expose(my_data_packs, my_col_packs) %>%
  get_exposure()
##   Exposure
## 
## Packs info:
## # A tibble: 2 x 4
##      name      type             fun remove_obeyers
##     <chr>     <chr>          <list>          <lgl>
## 1 my_data data_pack <S3: data_pack>           TRUE
## 2  my_col  col_pack  <S3: col_pack>           TRUE
## 
## Tidy data validation report:
## # A tibble: 2 x 5
##      pack          rule   var    id value
##     <chr>         <chr> <chr> <int> <lgl>
## 1 my_data       nrow_up  .all     0 FALSE
## 2  my_col is_enough_sum    am     0 FALSE

# Note that `id` starts from 6 as rows 1:5 were removed from validating
mtcars_tbl %>%
  expose(my_cell_packs, .remove_obeyers = FALSE) %>%
  get_exposure()
##   Exposure
## 
## Packs info:
## # A tibble: 1 x 4
##      name      type             fun remove_obeyers
##     <chr>     <chr>          <list>          <lgl>
## 1 my_cell cell_pack <S3: cell_pack>          FALSE
## 
## Tidy data validation report:
## # A tibble: 297 x 5
##      pack   rule   var    id value
##     <chr>  <chr> <chr> <int> <lgl>
## 1 my_cell is_out   mpg     6 FALSE
## 2 my_cell is_out   mpg     7 FALSE
## 3 my_cell is_out   mpg     8 FALSE
## 4 my_cell is_out   mpg     9 FALSE
## 5 my_cell is_out   mpg    10 FALSE
## # ... with 292 more rows

# Note name imputation and guessing
mtcars_tbl %>%
  expose(my_data_packs, .remove_obeyers = FALSE) %>%
  expose(validate_rows) %>%
  get_exposure()
##   Exposure
## 
## Packs info:
## # A tibble: 2 x 4
##          name      type             fun remove_obeyers
##         <chr>     <chr>          <list>          <lgl>
## 1     my_data data_pack <S3: data_pack>          FALSE
## 2 row_pack..1  row_pack  <S3: row_pack>           TRUE
## 
## Tidy data validation report:
## # A tibble: 3 x 5
##          pack          rule   var    id value
##         <chr>         <chr> <chr> <int> <lgl>
## 1     my_data      nrow_low  .all     0  TRUE
## 2     my_data       nrow_up  .all     0 FALSE
## 3 row_pack..1 is_enough_sum  .all    19 FALSE

Act after exposure

After creating a data frame with attribute exposure, it is pretty straightforward to design how to perform any action. It is implemented in function act_after_exposure with the following arguments:

  • .tbl: the result of using expose().
  • .trigger: function which takes .tbl as argument and returns TRUE if some action needs to be performed.
  • .actor: function which takes .tbl as argument and performs the action.

Basically act_after_exposure() is doing the following:

  • Check that .tbl has a proper exposure attribute.
  • Compute whether to perform intended action by computing .trigger(.tbl).
  • If the trigger results in TRUE then .actor(.tbl) is returned. Otherwise .tbl is returned.
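
These three steps fit in a few lines. The function below is a simplified, hypothetical re-implementation in base R (act_after_sketch, any_breaker and notify_and_return are names invented here), with a hand-made exposure attribute holding only a report; the real act_after_exposure() performs a more thorough check of the exposure.

```r
# Hypothetical, simplified version of the described logic
act_after_sketch <- function(.tbl, .trigger, .actor) {
  if (is.null(attr(.tbl, "exposure"))) {
    stop("Input has no 'exposure' attribute.")
  }
  if (isTRUE(.trigger(.tbl))) .actor(.tbl) else .tbl
}

# Hand-made input: a data frame with one breaker in its report
exposed <- mtcars
attr(exposed, "exposure") <- data.frame(
  pack = "my_data", rule = "nrow_up", var = ".all", id = 0L, value = FALSE,
  stringsAsFactors = FALSE
)

# Trigger: is there any breaker? Actor: notify and pass the data on
any_breaker <- function(.tbl) any(!attr(.tbl, "exposure")$value)
notify_and_return <- function(.tbl) {
  message("Some breakers found in exposure.")
  .tbl
}

result <- act_after_sketch(exposed, any_breaker, notify_and_return)
```

Here the actor only produces a side effect and returns its input, so result can be piped further.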

It is a good idea for .actor to do one of two things:

  • Making side effects. For example, throwing an error (if the condition in .trigger is met), printing some information and so on. In this case it should return .tbl to be used properly inside a pipe.
  • Changing .tbl based on exposure information. In this case it should return the imputed version of .tbl.

As a main use case, ruler has function assert_any_breaker. It is a wrapper for act_after_exposure with .trigger checking for the presence of any breaker in the exposure and .actor being a notifier about it.

mtcars_tbl %>%
  expose(my_data_packs) %>%
  assert_any_breaker()
##   Breakers report
## Tidy data validation report:
## # A tibble: 1 x 5
##      pack    rule   var    id value
##     <chr>   <chr> <chr> <int> <lgl>
## 1 my_data nrow_up  .all     0 FALSE
## Error: assert_any_breaker: Some breakers found in exposure.

Conclusions

  • Design process of a package deserves its own story.
  • Package ruler offers tools for dplyr-style exploration and validation of data frame like objects. With its help, validation is done in three steps, each designed for a specific purpose.

sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
## 
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8    
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8   
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] bindrcpp_0.2 ruler_0.1.0  purrr_0.2.4  dplyr_0.7.4 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.13     knitr_1.17       bindr_0.1        magrittr_1.5    
##  [5] tidyselect_0.2.3 R6_2.2.2         rlang_0.1.4      stringr_1.2.0   
##  [9] tools_3.4.2      htmltools_0.3.6  yaml_2.1.14      rprojroot_1.2   
## [13] digest_0.6.12    assertthat_0.2.0 tibble_1.3.4     bookdown_0.5    
## [17] tidyr_0.7.2      glue_1.2.0       evaluate_0.10.1  rmarkdown_1.7   
## [21] blogdown_0.2     stringi_1.1.5    compiler_3.4.2   keyholder_0.1.1 
## [25] backports_1.1.1  pkgconfig_2.0.1
