General gems of comperes
2018-05-17
Examples of exported functions from the comperes package that could be useful for general tasks.
Prologue
I am very glad to announce that my new package comperes is on CRAN now. It provides tools for managing competition results in a tidy manner as much as possible. For more information go to:
- Package README.
- Package vignettes.
- My previous post for usage examples based on the built-in hp_survey data set (results of my Harry Potter Books Survey).
Besides tools for competition results, comperes offers some functions that can be useful in more general tasks. This post presents examples of their most common usage.
Overview
This post covers the following themes:
- Compute vector levels with levels2().
- Manage item summaries with summarise_item() and join_item_summary().
- Convert pairwise data with long_to_mat() and mat_to_long().
For examples we will use a shortened version of the everlasting mtcars data set. We will need the following setup:
library(comperes)
library(rlang)
# For example analysis
library(dplyr)
library(tibble)
mtcars_tbl <- mtcars %>%
  rownames_to_column(var = "car") %>%
  select(car, cyl, vs, carb) %>%
  as_tibble()
Compute vector levels
We will start with the simplest function. During comperes development, the idea behind it really helped me reason more clearly about the package's functional API. I am talking about levels2(), which computes “levels” of any non-list vector.
It has the following logic: if x has a levels attribute, then return levels(x); otherwise return the character representation of the vector’s sorted unique values. Notes about the design and implementation of this function:
- I hesitated a lot about whether it should return character or the same type as the input vector in case x has no levels. In many practical cases the latter behavior is needed. However, in the end I decided that type-stable output (levels(x) always returns a character vector or NULL) is better.
- Conversion to character is done after sorting, which is really important when dealing with numeric vectors.
This function is helpful when one needs to produce unique values in a standardized manner (for example, during pairwise distance computation). Some examples:
levels2(mtcars_tbl$cyl)
## [1] "4" "6" "8"
# Importance of conversion to character after sorting
tricky_vec <- c(10, 1, 2, 12)
sort(as.character(tricky_vec))
## [1] "1" "10" "12" "2"
levels2(tricky_vec)
## [1] "1" "2" "10" "12"
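The logic described above can be sketched in a few lines of base R. This is only an illustration of the description, not the package's actual source; the name levels2_sketch is hypothetical.

```r
# Hypothetical re-implementation of the described logic (not comperes source)
levels2_sketch <- function(x) {
  x_levels <- levels(x)
  if (!is.null(x_levels)) {
    # Input has a levels attribute: return it unchanged
    return(x_levels)
  }
  # Sort first, convert to character afterwards
  as.character(sort(unique(x)))
}

levels2_sketch(factor(c("b", "a", "b"), levels = c("b", "a")))
## [1] "b" "a"
levels2_sketch(c(10, 1, 2, 12))
## [1] "1"  "2"  "10" "12"
```

Either branch returns a character vector, which is the type stability discussed in the notes.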
Manage item summaries
Arguably, the most common task in data analysis is computation of group summaries. This task is conveniently done by consecutive application of dplyr’s group_by(), summarise() and ungroup() (to return a regular data frame and not a grouped one). comperes offers a wrapper summarise_item() for this task (which always returns a tibble instead of a data frame) with the additional feature of modifying column names by adding a prefix (which will be handy soon):
cyl_vs_summary <- mtcars_tbl %>%
  summarise_item(
    item = c("cyl", "vs"),
    n = n(), mean_carb = mean(carb),
    .prefix = "cyl_vs__"
  )
cyl_vs_summary
## # A tibble: 5 x 4
## cyl vs cyl_vs__n cyl_vs__mean_carb
## <dbl> <dbl> <int> <dbl>
## 1 4. 0. 1 2.00
## 2 4. 1. 10 1.50
## 3 6. 0. 3 4.67
## 4 6. 1. 4 2.50
## 5 8. 0. 14 3.50
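For comparison, here is roughly what the wrapper saves you from writing by hand. This is a sketch of the equivalent plain dplyr pipeline, assuming only the behavior described above; it uses mtcars directly to stay self-contained, and cyl_vs_manual and is_summary are illustrative names.

```r
library(dplyr)

# Group, summarise, and drop grouping by hand
cyl_vs_manual <- mtcars %>%
  group_by(cyl, vs) %>%
  summarise(n = n(), mean_carb = mean(carb)) %>%
  ungroup()

# Add the prefix to summary columns only; base R renaming is used here
# so the sketch does not depend on a particular dplyr version
is_summary <- !(names(cyl_vs_manual) %in% c("cyl", "vs"))
names(cyl_vs_manual)[is_summary] <-
  paste0("cyl_vs__", names(cyl_vs_manual)[is_summary])

cyl_vs_manual
```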
Sometimes there is also a need to compare actual values with their summaries across different groupings. For example, determine whether a car’s number of carburetors (carb) is bigger than the average value per different groupings: by number of cylinders cyl and V/S vs.
To simplify this task, comperes offers the join_item_summary() function: it computes the item summary with summarise_item() and joins it (with dplyr::left_join()) to the input data frame:
# Save (with rlang magic) expression for reused summary
carb_summary <- list(mean_carb = expr(mean(carb)))
# Create new columns with joined grouped summaries
mtcats_gear_summary <- mtcars_tbl %>%
  join_item_summary("cyl", !!! carb_summary, .prefix = "cyl__") %>%
  join_item_summary("vs", !!! carb_summary, .prefix = "vs__")
print(mtcats_gear_summary, width = Inf)
## # A tibble: 32 x 6
## car cyl vs carb cyl__mean_carb vs__mean_carb
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mazda RX4 6. 0. 4. 3.43 3.61
## 2 Mazda RX4 Wag 6. 0. 4. 3.43 3.61
## 3 Datsun 710 4. 1. 1. 1.55 1.79
## 4 Hornet 4 Drive 6. 1. 1. 3.43 1.79
## 5 Hornet Sportabout 8. 0. 2. 3.50 3.61
## # ... with 27 more rows
# Compute comparisons
mtcats_gear_summary %>%
  mutate_at(vars(ends_with("mean_carb")), funs(carb > .)) %>%
  select(car, ends_with("mean_carb")) %>%
  rename_at(vars(-car), funs(gsub("__mean_carb$", "", .)))
## # A tibble: 32 x 3
## car cyl vs
## <chr> <lgl> <lgl>
## 1 Mazda RX4 TRUE TRUE
## 2 Mazda RX4 Wag TRUE TRUE
## 3 Datsun 710 FALSE FALSE
## 4 Hornet 4 Drive FALSE FALSE
## 5 Hornet Sportabout FALSE FALSE
## # ... with 27 more rows
Adding different prefixes helps to navigate through columns with different summaries.
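The "summarise, then join back" step can also be written out with plain dplyr. The following is a minimal sketch for the cyl grouping only, assuming the behavior described above; cyl_summary and mtcars_joined are illustrative names, and mtcars is used directly so the block runs on its own.

```r
library(dplyr)

# Summary per number of cylinders, with an already prefixed column name
cyl_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(cyl__mean_carb = mean(carb)) %>%
  ungroup()

# Join the summary back so that every original row carries its group's value
mtcars_joined <- mtcars %>%
  select(cyl, vs, carb) %>%
  left_join(cyl_summary, by = "cyl")

head(mtcars_joined, 3)
```

The number of rows is unchanged by the join because cyl_summary has exactly one row per value of cyl.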
Convert pairwise data
One of the main features of comperes is the ability to compute Head-to-Head values of players in competition. There are functions h2h_long() and h2h_mat() which produce output in “long” (tibble with each row describing one ordered pair) and “matrix” (matrix with each cell value describing the pair in the corresponding row and column) formats respectively.
These formats of pairwise data are quite common: “long” is better for tidy computing and “matrix” is better for result presentation. Also, converting a distance matrix to a data frame with pair data is a theme of several Stack Overflow questions (for example, this one and that one).
Package comperes has functions as_h2h_long() and as_h2h_mat() for converting between those formats. They are powered by “general usage” functions long_to_mat() and mat_to_long(). Here is an example of how they can be used to convert between different formats of pairwise distances:
# Compute matrix of pairwise distances based on all numeric columns
dist_mat <- mtcars_tbl %>%
  select_if(is.numeric) %>%
  dist() %>%
  as.matrix()
dist_mat[1:4, 1:4]
## 1 2 3 4
## 1 0.000000 0.000000 3.741657 3.162278
## 2 0.000000 0.000000 3.741657 3.162278
## 3 3.741657 3.741657 0.000000 2.000000
## 4 3.162278 3.162278 2.000000 0.000000
# Convert to data frame (tibble in this case)
dist_tbl <- dist_mat %>%
  mat_to_long(row_key = "id_1", col_key = "id_2", value = "dist")
dist_tbl
## # A tibble: 1,024 x 3
## id_1 id_2 dist
## <chr> <chr> <dbl>
## 1 1 1 0.
## 2 1 2 0.
## 3 1 3 3.74
## 4 1 4 3.16
## 5 1 5 2.83
## # ... with 1,019 more rows
# Convert tibble back to matrix
dist_mat_new <- dist_tbl %>%
  # To make natural row and column sortings
  mutate_at(vars("id_1", "id_2"), as.numeric) %>%
  long_to_mat(row_key = "id_1", col_key = "id_2", value = "dist")
identical(dist_mat, dist_mat_new)
## [1] TRUE
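To make the two formats concrete, the same round trip can be sketched in base R on a tiny hand-made matrix. This is not how comperes implements the conversion; mat, long and mat_new are illustrative names, and the row order of long here (column by column) may differ from what mat_to_long() returns.

```r
# Small symmetric "distance" matrix with named rows and columns
ids <- c("a", "b", "c")
mat <- matrix(
  c(0, 1, 2,
    1, 0, 3,
    2, 3, 0),
  nrow = 3, dimnames = list(ids, ids)
)

# Matrix to long: one row per (row, column) pair
long <- data.frame(
  id_1 = rownames(mat)[as.vector(row(mat))],
  id_2 = colnames(mat)[as.vector(col(mat))],
  dist = as.vector(mat),
  stringsAsFactors = FALSE
)

# Long to matrix: place every value at the position given by its pair of keys
rows <- sort(unique(long$id_1))
cols <- sort(unique(long$id_2))
mat_new <- matrix(
  NA_real_, nrow = length(rows), ncol = length(cols),
  dimnames = list(rows, cols)
)
mat_new[cbind(match(long$id_1, rows), match(long$id_2, cols))] <- long$dist

stopifnot(identical(mat, mat_new))
```

Pairs that are absent from the long data would stay NA in the matrix, which is one reason a dedicated function with an explicit fill value is convenient.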
Conclusion
- Package comperes provides not only tools for managing competition results but also functions with general purpose:
    - Compute vector levels with levels2(). Usually used to produce unique values in a standardized manner.
    - Manage item summaries with summarise_item() and join_item_summary(). May be used to concisely compute comparisons of values with summaries from different groupings.
    - Convert pairwise data with long_to_mat() and mat_to_long(). Very helpful in converting pairwise distances between “long” and “matrix” formats.
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
##
## locale:
## [1] LC_CTYPE=ru_UA.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=ru_UA.UTF-8 LC_COLLATE=ru_UA.UTF-8
## [5] LC_MONETARY=ru_UA.UTF-8 LC_MESSAGES=ru_UA.UTF-8
## [7] LC_PAPER=ru_UA.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] methods stats graphics grDevices utils datasets base
##
## other attached packages:
## [1] bindrcpp_0.2.2 tibble_1.4.2 dplyr_0.7.5.9000 rlang_0.2.0
## [5] comperes_0.2.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.16 knitr_1.20 bindr_0.1.1 magrittr_1.5
## [5] tidyselect_0.2.4 R6_2.2.2 stringr_1.3.0 tools_3.4.4
## [9] xfun_0.1 utf8_1.1.3 cli_1.0.0 htmltools_0.3.6
## [13] yaml_2.1.17 rprojroot_1.3-2 digest_0.6.15 assertthat_0.2.0
## [17] crayon_1.3.4 bookdown_0.7 purrr_0.2.4 glue_1.2.0
## [21] evaluate_0.10.1 rmarkdown_1.9 blogdown_0.5 stringi_1.1.6
## [25] compiler_3.4.4 pillar_1.2.1 backports_1.1.2 pkgconfig_2.0.1