Usage of ruler package
2017-12-05
Usage examples of ruler package: dplyr-style exploration and validation of data frame like objects.
Prologue
My previous post tells the story of the design of my ruler package, which presents tools for “… creating data validation pipelines and tidy reports”. This package offers a framework for exploring and validating data-frame-like objects using the dplyr grammar of data manipulation.
This post is intended to show some close-to-reality ruler usage examples. The described methods and approaches reflect the package design. Along the way you will learn why Yoda and Jabba the Hutt are “outliers” among core “Star Wars” characters.
For more information see the README (for a relatively brief but comprehensive introduction) or the vignettes (for a more thorough description of package capabilities).
Beware of a lot of code.
Overview
suppressMessages(library(dplyr))
suppressMessages(library(purrr))
library(ruler)
The general way of performing validation with ruler can be described with the following steps:
- Formulate a validation task. It is usually stated in the form of a yes-no question or true-false statement about some part (data unit) of an input data frame. A data unit can be one of: data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole], cell. For example: does every column contain elements with a sum of more than 100?
- Create a dplyr-style validation function (rule pack) which checks the desired data unit for obedience to [possibly] several rules:
  - Create dplyr code for “interactive” validation. Take care to use only functions supported by the keyholder package. An example with “enough_sum” as a rule name:
    mtcars %>% summarise_all(funs(enough_sum = sum(.) > 100))
  - Use ruler’s function rules() instead of explicit or implicit usage of funs():
    mtcars %>% summarise_all(rules(enough_sum = sum(.) > 100))
  - Modify the code to create a magrittr functional sequence by replacing the input variable name with a dot (.):
    . %>% summarise_all(rules(enough_sum = sum(.) > 100))
  - Wrap with a rule specification function to explicitly identify the validated data unit and to name the rule pack. In this case it is col_packs() for the column data unit with “is_enough_sum” as the rule pack name:
    col_packs(is_enough_sum = . %>% summarise_all(rules(is_enough = sum(.) > 100)))
- Expose data to rules to obtain a validation result (exposure). Use ruler’s expose() function for that. It doesn’t modify the contents of the input data frame but creates/updates the exposure attribute. Exposure is a list with information about used rule packs (packs_info) and a tidy data validation report (report).
- Act after exposure. It can be:
  - Observing validation results with get_exposure(), get_packs_info() or get_report().
  - Making assertions if specific rules are not followed in the desired way.
  - Imputing the input data frame based on the report.
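The “functional sequence” step relies on a magrittr feature: a pipeline that starts with a dot is not evaluated but returned as a function of one argument. Here is a minimal sketch of that idea using only magrittr and base R. The name sum_is_enough is made up for illustration and is not part of ruler.

```r
library(magrittr)

# A pipeline that starts with `.` is stored as a function, not executed
sum_is_enough <- . %>% sum() %>% is_greater_than(100)

# The stored sequence can later be applied to any input
is.function(sum_is_enough)
sum_is_enough(mtcars$mpg)  # sum of mpg is well above 100
sum_is_enough(mtcars$am)   # sum of am is well below 100
```

This is why a rule pack can be defined once and exposed to different data frames later.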
In the examples we will use the starwars data from the dplyr package (to celebrate an upcoming new episode). It is a tibble with every row describing one “Star Wars” character. Every example starts with a validation task stated in italic and performs validation from beginning to end.
Create rule packs
Data
■ Does starwars have 1) number of rows 1a) at least 50; 1b) at most 60; 2) number of columns 2a) at least 10; 2b) at most 15?
check_data_dims <- data_packs(
  check_dims = . %>% summarise(
    nrow_low = nrow(.) >= 50, nrow_up = nrow(.) <= 60,
    ncol_low = ncol(.) >= 10, ncol_up = ncol(.) <= 15
  )
)
starwars %>%
  expose(check_data_dims) %>%
  get_exposure()
## Exposure
##
## Packs info:
## # A tibble: 1 x 4
## name type fun remove_obeyers
## <chr> <chr> <list> <lgl>
## 1 check_dims data_pack <S3: data_pack> TRUE
##
## Tidy data validation report:
## # A tibble: 1 x 5
## pack rule var id value
## <chr> <chr> <chr> <int> <lgl>
## 1 check_dims nrow_up .all 0 FALSE
The result is interpreted as follows:
- Data was exposed to one rule pack for data as a whole (data rule pack) named “check_dims”. For it all obeyers (data units which follow the specified rule) were removed from the validation report.
- The combination of var equal to .all and id equal to 0 means that data as a whole is validated.
- The input data frame doesn’t obey (because value is equal to FALSE) rule nrow_up from rule pack check_dims.
■ Does starwars have enough rows for characters 1) with blond hair; 2) humans; 3) humans with blond hair?
check_enough_rows <- data_packs(
  enough_blond = . %>% filter(hair_color == "blond") %>%
    summarise(is_enough = n() > 10),
  enough_humans = . %>% summarise(
    is_enough = sum(species == "Human", na.rm = TRUE) > 30
  ),
  ehough_blond_humans = . %>% filter(
    hair_color == "blond", species == "Human"
  ) %>%
    summarise(is_enough = n() > 5)
)
starwars %>%
  expose(check_enough_rows) %>%
  get_exposure()
## Exposure
##
## Packs info:
## # A tibble: 3 x 4
## name type fun remove_obeyers
## <chr> <chr> <list> <lgl>
## 1 enough_blond data_pack <S3: data_pack> TRUE
## 2 enough_humans data_pack <S3: data_pack> TRUE
## 3 ehough_blond_humans data_pack <S3: data_pack> TRUE
##
## Tidy data validation report:
## # A tibble: 2 x 5
## pack rule var id value
## <chr> <chr> <chr> <int> <lgl>
## 1 enough_blond is_enough .all 0 FALSE
## 2 ehough_blond_humans is_enough .all 0 FALSE
New information gained from example:
- Rule specification functions can be supplied with multiple rule packs all of which will be independently used during exposing.
■ Does starwars have enough numeric columns?
check_enough_num_cols <- data_packs(
  enough_num_cols = . %>% summarise(
    is_enough = sum(map_lgl(., is.numeric)) > 1
  )
)
starwars %>%
  expose(check_enough_num_cols) %>%
  get_report()
## Tidy data validation report:
## # A tibble: 0 x 5
## # ... with 5 variables: pack <chr>, rule <chr>, var <chr>, id <int>,
## # value <lgl>
- If no breaker is found, get_report() returns a tibble with zero rows and the usual columns.
Group
■ Does group defined by hair color and gender have a member from Tatooine?
has_hair_gender_tatooine <- group_packs(
  hair_gender_tatooine = . %>%
    group_by(hair_color, gender) %>%
    summarise(has_tatooine = any(homeworld == "Tatooine")),
  .group_vars = c("hair_color", "gender"),
  .group_sep = "__"
)
starwars %>%
  expose(has_hair_gender_tatooine) %>%
  get_report()
## Tidy data validation report:
## # A tibble: 12 x 5
## pack rule var id value
## <chr> <chr> <chr> <int> <lgl>
## 1 hair_gender_tatooine has_tatooine auburn__female 0 FALSE
## 2 hair_gender_tatooine has_tatooine auburn, grey__male 0 FALSE
## 3 hair_gender_tatooine has_tatooine auburn, white__male 0 FALSE
## 4 hair_gender_tatooine has_tatooine blonde__female 0 FALSE
## 5 hair_gender_tatooine has_tatooine grey__male 0 FALSE
## # ... with 7 more rows
- group_packs() needs grouping columns supplied via .group_vars.
- Column var of the validation report contains levels of grouping columns to identify the group. By default they are pasted together with a dot (.). To change that supply the .group_sep argument.
- 12 combinations of hair_color and gender don’t have a character from Tatooine. They are “auburn”-“female”, “auburn, grey”-“male” and so on.
Column
■ Does every list-column have 1) enough average length; 2) enough unique elements?
check_list_cols <- col_packs(
  check_list_cols = . %>%
    summarise_if(
      is.list,
      rules(
        is_enough_mean = mean(map_int(., length)) >= 1,
        length(unique(unlist(.))) >= 10
      )
    )
)
starwars %>%
  expose(check_list_cols) %>%
  get_report()
## Tidy data validation report:
## # A tibble: 3 x 5
## pack rule var id value
## <chr> <chr> <chr> <int> <lgl>
## 1 check_list_cols is_enough_mean vehicles 0 FALSE
## 2 check_list_cols is_enough_mean starships 0 FALSE
## 3 check_list_cols rule..2 films 0 FALSE
- To specify rule functions inside dplyr’s scoped verbs use ruler::rules(). It powers correct output interpretation during the exposing process and imputes missing rule names based on the present rules in the current rule pack.
- Columns vehicles and starships don’t have enough average length and column films doesn’t have enough unique elements.
■ Are all values of column birth_year non-NA?
starwars %>%
  expose(
    col_packs(
      . %>% summarise_at(
        vars(birth_year = "birth_year"),
        rules(all_present = all(!is.na(.)))
      )
    )
  ) %>%
  get_report()
## Tidy data validation report:
## # A tibble: 1 x 5
## pack rule var id value
## <chr> <chr> <chr> <int> <lgl>
## 1 col_pack..1 all_present birth_year 0 FALSE
- To correctly validate one column with a scoped dplyr verb, the column should be a named argument inside vars(). This is needed for correct interpretation of the rule pack output.
Row
■ Has character appeared in enough films? As character is defined by row, this is a row pack.
has_enough_films <- row_packs(
  enough_films = . %>% transmute(is_enough = map_int(films, length) >= 3)
)
starwars %>%
  expose(has_enough_films) %>%
  get_report() %>%
  left_join(y = starwars %>% transmute(id = 1:n(), name),
            by = "id") %>%
  print(.validate = FALSE)
## Tidy data validation report:
## # A tibble: 64 x 6
## pack rule var id value name
## <chr> <chr> <chr> <int> <lgl> <chr>
## 1 enough_films is_enough .all 8 FALSE R5-D4
## 2 enough_films is_enough .all 9 FALSE Biggs Darklighter
## 3 enough_films is_enough .all 12 FALSE Wilhuff Tarkin
## 4 enough_films is_enough .all 15 FALSE Greedo
## 5 enough_films is_enough .all 18 FALSE Jek Tono Porkins
## # ... with 59 more rows
- 64 characters haven’t appeared in 3 films or more. Those are characters described in starwars in rows 8, 9, etc. (counting based on input data).
■ Is character with height less than 100 a droid?
is_short_droid <- row_packs(
  is_short_droid = . %>% filter(height < 100) %>%
    transmute(is_droid = species == "Droid")
)
starwars %>%
  expose(is_short_droid) %>%
  get_report() %>%
  left_join(y = starwars %>% transmute(id = 1:n(), name, height),
            by = "id") %>%
  print(.validate = FALSE)
## Tidy data validation report:
## # A tibble: 5 x 7
## pack rule var id value name height
## <chr> <chr> <chr> <int> <lgl> <chr> <int>
## 1 is_short_droid is_droid .all 19 FALSE Yoda 66
## 2 is_short_droid is_droid .all 29 FALSE Wicket Systri Warrick 88
## 3 is_short_droid is_droid .all 45 FALSE Dud Bolt 94
## 4 is_short_droid is_droid .all 72 FALSE Ratts Tyerell 79
## 5 is_short_droid is_droid .all 73 NA R4-P17 96
- One can expose only a subset of rows by using filter or slice. The value of the id column in the result will reflect the row number in the original input data frame. This feature is powered by the keyholder package. In order to use it, the rule pack should be created using its supported functions.
- A value equal to NA is treated as a rule breaker.
- 5 “not tall” characters are not droids.
Cell
■ Is non-NA numeric cell not an outlier based on z-score? This is a bit tricky. To present outliers as rule breakers one should ask whether a cell is not an outlier.
z_score <- function(x, ...) {
  abs(x - mean(x, ...)) / sd(x, ...)
}
cell_isnt_outlier <- cell_packs(
  dbl_not_outlier = . %>%
    transmute_if(
      is.numeric,
      rules(isnt_out = z_score(., na.rm = TRUE) < 3 | is.na(.))
    )
)
starwars %>%
  expose(cell_isnt_outlier) %>%
  get_report() %>%
  left_join(y = starwars %>% transmute(id = 1:n(), name),
            by = "id") %>%
  print(.validate = FALSE)
## Tidy data validation report:
## # A tibble: 4 x 6
## pack rule var id value name
## <chr> <chr> <chr> <int> <lgl> <chr>
## 1 dbl_not_outlier isnt_out height 19 FALSE Yoda
## 2 dbl_not_outlier isnt_out mass 16 FALSE Jabba Desilijic Tiure
## 3 dbl_not_outlier isnt_out birth_year 16 FALSE Jabba Desilijic Tiure
## 4 dbl_not_outlier isnt_out birth_year 19 FALSE Yoda
- 4 non-NA numeric cells appear to be outliers within their columns.
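To make the rule itself easier to follow, here is the same z-score condition applied to a single toy vector in base R. The vector x is invented for illustration: twenty equal values, one extreme value and one missing value.

```r
z_score <- function(x, ...) {
  abs(x - mean(x, ...)) / sd(x, ...)
}

# Toy column: twenty equal values, one extreme value, one missing value
x <- c(rep(10, 20), 100, NA)

# Same condition as in the rule: a cell obeys if its z-score is below 3
# or if it is NA (missing cells are explicitly let through)
isnt_out <- z_score(x, na.rm = TRUE) < 3 | is.na(x)

# Position of the only breaker: the extreme value
which(!isnt_out)
```

Without the is.na() part the missing cell would produce NA and would be reported as a breaker too.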
Expose data to rules
■ Do groups defined by species, gender and eye_color (3 different checks) have appropriate size?
starwars %>%
  expose(
    group_packs(. %>% group_by(species) %>% summarise(isnt_many = n() <= 5),
                .group_vars = "species")
  ) %>%
  expose(
    group_packs(. %>% group_by(gender) %>% summarise(isnt_many = n() <= 60),
                .group_vars = "gender"),
    .remove_obeyers = FALSE
  ) %>%
  expose(is_enough_eye_color = . %>% group_by(eye_color) %>%
           summarise(isnt_many = n() <= 20)) %>%
  get_exposure() %>%
  print(n_report = Inf)
## Exposure
##
## Packs info:
## # A tibble: 3 x 4
## name type fun remove_obeyers
## <chr> <chr> <list> <lgl>
## 1 group_pack..1 group_pack <S3: group_pack> TRUE
## 2 group_pack..2 group_pack <S3: group_pack> FALSE
## 3 is_enough_eye_color group_pack <S3: group_pack> TRUE
##
## Tidy data validation report:
## # A tibble: 7 x 5
## pack rule var id value
## <chr> <chr> <chr> <int> <lgl>
## 1 group_pack..1 isnt_many Human 0 FALSE
## 2 group_pack..2 isnt_many female 0 TRUE
## 3 group_pack..2 isnt_many hermaphrodite 0 TRUE
## 4 group_pack..2 isnt_many male 0 FALSE
## 5 group_pack..2 isnt_many none 0 TRUE
## 6 group_pack..2 isnt_many NA 0 TRUE
## 7 is_enough_eye_color isnt_many brown 0 FALSE
- expose() can be applied sequentially, which results in updating the existing exposure with new information.
- expose() imputes names of supplied unnamed rule packs based on the present rule packs for the same data unit type.
- expose() by default removes obeyers (rows with data units that obey respective rules) from the validation report. To stop doing that use .remove_obeyers = FALSE during the expose() call.
- expose() by default guesses the type of the supplied rule pack based only on its output. This has some annoying edge cases but is suitable for interactive usage. To turn this feature off use .guess = FALSE as an argument for expose(). Also, to avoid edge cases create rule packs with appropriate wrappers.
■ Perform some previous checks with one expose().
my_packs <- list(check_data_dims, is_short_droid, cell_isnt_outlier)
str(my_packs)
## List of 3
## $ :List of 1
## ..$ check_dims:function (value)
## .. ..- attr(*, "class")= chr [1:4] "data_pack" "rule_pack" "fseq" "function"
## $ :List of 1
## ..$ is_short_droid:function (value)
## .. ..- attr(*, "class")= chr [1:4] "row_pack" "rule_pack" "fseq" "function"
## $ :List of 1
## ..$ dbl_not_outlier:function (value)
## .. ..- attr(*, "class")= chr [1:4] "cell_pack" "rule_pack" "fseq" "function"
starwars_exposed_list <- starwars %>%
  expose(my_packs)
starwars_exposed_arguments <- starwars %>%
  expose(check_data_dims, is_short_droid, cell_isnt_outlier)
identical(starwars_exposed_list, starwars_exposed_arguments)
## [1] TRUE
- expose() can take as its rule pack argument a list of lists [of lists, of lists, …] with functions at any depth. This enables creating a list of rule packs wrapped with *_packs() functions (which all return a list of functions).
- expose() can have multiple rule packs as separate arguments.
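The idea of “functions at any depth” can be illustrated with a small recursive helper in base R. The helper collect_functions() is hypothetical and written only for this sketch; ruler’s own handling of nested lists may differ.

```r
# Hypothetical helper: gather all functions from an arbitrarily nested list
collect_functions <- function(x) {
  if (is.function(x)) return(list(x))
  if (!is.list(x)) return(list())
  # Flatten one level after collecting from every element
  unlist(lapply(x, collect_functions), recursive = FALSE)
}

# Functions stored at different depths, similar in shape to my_packs
nested <- list(list(a = sum), list(list(b = mean), c = max))
flat <- collect_functions(nested)

length(flat)
all(vapply(flat, is.function, logical(1)))
```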
Act after exposure
■ Throw an error if any non-NA value of mass is more than 1000.
starwars %>%
  expose(
    col_packs(
      low_mass = . %>% summarise_at(
        vars(mass = "mass"),
        rules(is_small_mass = all(. <= 1000, na.rm = TRUE))
      )
    )
  ) %>%
  assert_any_breaker()
## Breakers report
## Tidy data validation report:
## # A tibble: 1 x 5
## pack rule var id value
## <chr> <chr> <chr> <int> <lgl>
## 1 low_mass is_small_mass mass 0 FALSE
## Error: assert_any_breaker: Some breakers found in exposure.
- assert_any_breaker() checks whether the validation report contains at least one breaker and throws an error if it does.
However, the offered solution via a column pack doesn’t show which rows break the rule. To do that one can use a cell pack:
starwars %>%
  expose(
    cell_packs(
      low_mass = . %>% transmute_at(
        vars(mass = "mass"),
        rules(is_small_mass = (. <= 1000) | is.na(.))
      )
    )
  ) %>%
  assert_any_breaker()
## Breakers report
## Tidy data validation report:
## # A tibble: 1 x 5
## pack rule var id value
## <chr> <chr> <chr> <int> <lgl>
## 1 low_mass is_small_mass mass 16 FALSE
## Error: assert_any_breaker: Some breakers found in exposure.
■ Remove numeric columns with mean value below certain threshold. To achieve that one should formulate rule as “column mean should be above threshold”, identify breakers and act upon this information.
remove_bad_cols <- function(.tbl) {
  bad_cols <- .tbl %>%
    get_report() %>%
    pull(var) %>%
    unique()
  .tbl[, setdiff(colnames(.tbl), bad_cols)]
}
starwars %>%
  expose(
    col_packs(
      . %>% summarise_if(is.numeric, rules(mean(., na.rm = TRUE) >= 100))
    )
  ) %>%
  act_after_exposure(
    .trigger = any_breaker,
    .actor = remove_bad_cols
  ) %>%
  remove_exposure()
## # A tibble: 87 x 11
## name height hair_color skin_color eye_color gender homeworld
## <chr> <int> <chr> <chr> <chr> <chr> <chr>
## 1 Luke Skywalker 172 blond fair blue male Tatooine
## 2 C-3PO 167 <NA> gold yellow <NA> Tatooine
## 3 R2-D2 96 <NA> white, blue red <NA> Naboo
## 4 Darth Vader 202 none white yellow male Tatooine
## 5 Leia Organa 150 brown light brown female Alderaan
## # ... with 82 more rows, and 4 more variables: species <chr>,
## # films <list>, vehicles <list>, starships <list>
- act_after_exposure() is a wrapper for performing actions after exposing. It takes a .trigger function to trigger the action and an .actor function to perform the action and return its result.
- any_breaker() is a function which returns TRUE if the tidy validation report attached to its input has any breaker and FALSE otherwise.
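The trigger/actor idea can be sketched in a few lines of base R. Everything below is hypothetical and simplified: act_after(), has_breaker() and drop_reported() are made-up names, and the report is a hand-made data frame stored in an attribute, which only loosely imitates what ruler does.

```r
# Hypothetical helper mirroring the trigger/actor idea; not ruler's code
act_after <- function(x, trigger, actor) {
  if (isTRUE(trigger(x))) actor(x) else x
}

# Toy table with a hand-made report stored as an attribute
tbl <- data.frame(height = c(172, 167), mass = c(77, 75))
attr(tbl, "report") <- data.frame(
  var = "mass", value = FALSE, stringsAsFactors = FALSE
)

# Trigger: is there any rule value that is not TRUE?
has_breaker <- function(x) any(!attr(x, "report")$value, na.rm = TRUE)

# Actor: drop the columns mentioned in the report
drop_reported <- function(x) {
  x[, setdiff(colnames(x), attr(x, "report")$var), drop = FALSE]
}

res <- act_after(tbl, has_breaker, drop_reported)
colnames(res)
```

Keeping the trigger and the actor as separate functions lets the same action be reused with different conditions.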
Conclusions
- Yoda and Jabba the Hutt are outliers among other “Star Wars” characters: Yoda by height and birth year, Jabba by mass and also birth year.
- There are fewer than 10 “Star Wars” films so far.
- ruler offers flexible and extendable functionality for common validation tasks. Validation can be done for data [as a whole], group of rows [as a whole], column [as a whole], row [as a whole] and cell. After exposing the data frame of interest to rules and obtaining a tidy validation report, one can perform any action based on this information: explore the report, throw an error, impute the input data frame, etc.