General gems of comperes

2018-05-17

rstats comperes

Examples of exported functions from the comperes package that could be useful for general tasks.

Prologue

I am very glad to announce that my new package comperes is now on CRAN. It provides tools for managing competition results in as tidy a manner as possible. For more information, see:

Package README.
Package vignettes.
My previous post, with usage examples based on the built-in hp_survey data set (results of my Harry Potter Books Survey).

Besides tools for competition results, comperes offers some functions that can be useful in more general tasks. This post presents examples of their most common usage.

Overview

This post covers the following themes:

Compute vector levels with levels2().
Manage item summaries with summarise_item() and join_item_summary().
Convert pairwise data with long_to_mat() and mat_to_long().

For examples we will use a shortened version of the everlasting mtcars data set. We will need the following setup:

library(comperes)
library(rlang)
# For example analysis
library(dplyr)
library(tibble)

mtcars_tbl <- mtcars %>%
  rownames_to_column(var = "car") %>%
  select(car, cyl, vs, carb) %>%
  as_tibble()

Compute vector levels

We will start with the simplest function. During comperes development, the idea behind it really helped me reason more clearly about the package's functional API. I am talking about levels2(), which computes “levels” of any non-list vector.

It has the following logic: if x has a levels attribute, return levels(x); otherwise return the character representation of the vector’s sorted unique values. Notes about the design and implementation of this function:

I hesitated a lot about whether it should return a character vector or a vector of the same type as the input when x has no levels. In many practical cases the latter behavior is needed. However, in the end I decided that type-stable output (levels(x) always returns a character vector or NULL) is better.
Conversion to character is done after sorting, which is really important when dealing with numeric vectors.

This function is helpful when one needs to produce unique values in a standardized manner (for example, during pairwise distance computation). Some examples:

levels2(mtcars_tbl$cyl)
## [1] "4" "6" "8"

# Importance of conversion to character after sorting
tricky_vec <- c(10, 1, 2, 12)
sort(as.character(tricky_vec))
## [1] "1"  "10" "12" "2"
levels2(tricky_vec)
## [1] "1"  "2"  "10" "12"
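
The logic described above can be sketched in a few lines of base R. This is only an illustration: levels2_sketch() is a hypothetical name, and the code is not taken from the package source, so the real levels2() may handle details (such as missing values) differently.

```r
# Hypothetical sketch of the described logic; not the comperes implementation
levels2_sketch <- function(x) {
  if (is.null(levels(x))) {
    # No levels attribute: sort unique values first, then convert to character
    as.character(sort(unique(x)))
  } else {
    # Levels attribute present: return it as is
    levels(x)
  }
}

num_levels <- levels2_sketch(c(10, 1, 2, 12))
fct_levels <- levels2_sketch(factor(c("b", "a"), levels = c("b", "a")))

num_levels
fct_levels
```

For the numeric input the values come back in numeric order, and for the factor the original order of levels is preserved.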

Manage item summaries

Arguably, the most common task in data analysis is the computation of group summaries. This is conveniently done by consecutively applying dplyr’s group_by(), summarise() and ungroup() (to return a regular data frame rather than a grouped one). comperes offers a wrapper for this task, summarise_item(), which always returns a tibble and has the additional feature of adding a prefix to the summary column names (which will be handy soon):

cyl_vs_summary <- mtcars_tbl %>%
  summarise_item(
    item = c("cyl", "vs"),
    n = n(), mean_carb = mean(carb),
    .prefix = "cyl_vs__"
  )
cyl_vs_summary
## # A tibble: 5 x 4
##     cyl    vs cyl_vs__n cyl_vs__mean_carb
##   <dbl> <dbl>     <int>             <dbl>
## 1    4.    0.         1              2.00
## 2    4.    1.        10              1.50
## 3    6.    0.         3              4.67
## 4    6.    1.         4              2.50
## 5    8.    0.        14              3.50
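
For comparison, here is roughly what the same summary looks like when written with plain dplyr verbs. It is a sketch of the group_by(), summarise(), ungroup() chain mentioned above, with the prefix added by hand; the name cyl_vs_manual is made up for this example, and the snippet is not meant to mirror how summarise_item() is implemented internally.

```r
library(dplyr)
library(tibble)

mtcars_tbl <- mtcars %>%
  rownames_to_column(var = "car") %>%
  select(car, cyl, vs, carb) %>%
  as_tibble()

# Manual version: group, summarise, ungroup, then prefix summary columns
cyl_vs_manual <- mtcars_tbl %>%
  group_by(cyl, vs) %>%
  summarise(n = n(), mean_carb = mean(carb)) %>%
  ungroup() %>%
  rename(cyl_vs__n = n, cyl_vs__mean_carb = mean_carb)

cyl_vs_manual
```

The wrapper saves the ungroup() and rename() steps, which matters most when the same summary is repeated for several groupings.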

Sometimes there is also a need to compare actual values with their summaries across different groupings. For example, to determine whether a car’s number of carburetors (carb) is bigger than the average value under different groupings: by number of cylinders (cyl) and by V/S (vs).

To simplify this task, comperes offers join_item_summary(): it computes the item summary with summarise_item() and joins it (with dplyr::left_join()) to the input data frame:

# Save (with rlang magic) expression for reused summary
carb_summary <- list(mean_carb = expr(mean(carb)))

# Create new columns with joined grouped summaries
mtcars_carb_summary <- mtcars_tbl %>%
  join_item_summary("cyl", !!! carb_summary, .prefix = "cyl__") %>%
  join_item_summary("vs",  !!! carb_summary, .prefix = "vs__")

print(mtcars_carb_summary, width = Inf)
## # A tibble: 32 x 6
##   car                 cyl    vs  carb cyl__mean_carb vs__mean_carb
##   <chr>             <dbl> <dbl> <dbl>          <dbl>         <dbl>
## 1 Mazda RX4            6.    0.    4.           3.43          3.61
## 2 Mazda RX4 Wag        6.    0.    4.           3.43          3.61
## 3 Datsun 710           4.    1.    1.           1.55          1.79
## 4 Hornet 4 Drive       6.    1.    1.           3.43          1.79
## 5 Hornet Sportabout    8.    0.    2.           3.50          3.61
## # ... with 27 more rows

# Compute comparisons
mtcars_carb_summary %>%
  mutate_at(vars(ends_with("mean_carb")), funs(carb > .)) %>%
  select(car, ends_with("mean_carb")) %>%
  rename_at(vars(-car), funs(gsub("__mean_carb$", "", .)))
## # A tibble: 32 x 3
##   car               cyl   vs   
##   <chr>             <lgl> <lgl>
## 1 Mazda RX4         TRUE  TRUE 
## 2 Mazda RX4 Wag     TRUE  TRUE 
## 3 Datsun 710        FALSE FALSE
## 4 Hornet 4 Drive    FALSE FALSE
## 5 Hornet Sportabout FALSE FALSE
## # ... with 27 more rows

Adding different prefixes helps navigate through columns with different summaries.
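
To make the "summarise, then join" idea concrete, here is a minimal sketch of one such step done by hand with dplyr. The names cyl_means and joined_manual are invented for this illustration; the snippet only mimics the described behavior for a single grouping and is not the package's own code.

```r
library(dplyr)
library(tibble)

mtcars_tbl <- mtcars %>%
  rownames_to_column(var = "car") %>%
  select(car, cyl, vs, carb) %>%
  as_tibble()

# Step 1: per-group summary with an already prefixed column name
cyl_means <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(cyl__mean_carb = mean(carb)) %>%
  ungroup()

# Step 2: attach the summary to every row of the original data
joined_manual <- left_join(mtcars_tbl, cyl_means, by = "cyl")

joined_manual
```

Each additional grouping would need its own summary and join, which is the repetition join_item_summary() removes.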

Convert pairwise data

One of the main features of comperes is the ability to compute Head-to-Head values of players in a competition. There are functions h2h_long() and h2h_mat() which produce output in “long” (tibble with each row describing one ordered pair) and “matrix” (matrix with each cell value describing the pair in the corresponding row and column) formats respectively.

These formats of pairwise data are quite common: “long” is better for tidy computing and “matrix” is better for result presentation. Also, converting a distance matrix to a data frame with pair data is the theme of several Stack Overflow questions (for example, this one and that one).

Package comperes has functions as_h2h_long() and as_h2h_mat() for converting between those formats. They are powered by the “general usage” functions long_to_mat() and mat_to_long(). Here is an example of how they can be used to convert between different formats of pairwise distances:

# Compute matrix of pairwise distances based on all numeric columns
dist_mat <- mtcars_tbl %>%
  select_if(is.numeric) %>%
  dist() %>%
  as.matrix()
dist_mat[1:4, 1:4]
##          1        2        3        4
## 1 0.000000 0.000000 3.741657 3.162278
## 2 0.000000 0.000000 3.741657 3.162278
## 3 3.741657 3.741657 0.000000 2.000000
## 4 3.162278 3.162278 2.000000 0.000000

# Convert to data frame (tibble in this case)
dist_tbl <- dist_mat %>%
  mat_to_long(row_key = "id_1", col_key = "id_2", value = "dist")
dist_tbl
## # A tibble: 1,024 x 3
##   id_1  id_2   dist
##   <chr> <chr> <dbl>
## 1 1     1      0.  
## 2 1     2      0.  
## 3 1     3      3.74
## 4 1     4      3.16
## 5 1     5      2.83
## # ... with 1,019 more rows

# Convert tibble back to matrix
dist_mat_new <- dist_tbl %>%
  # To make natural row and column sortings
  mutate_at(vars("id_1", "id_2"), as.numeric) %>%
  long_to_mat(row_key = "id_1", col_key = "id_2", value = "dist")
identical(dist_mat, dist_mat_new)
## [1] TRUE
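
To show what the two formats hold, here is a small base R sketch of the same round trip on a toy 3 x 3 matrix. It is an illustration under my own assumptions, not the implementation of mat_to_long() or long_to_mat(): the object names are made up, and rows of the long table come out in column-major order, which may differ from the ordering the package functions produce.

```r
# Toy symmetric "distance" matrix with named rows and columns
ids <- c("a", "b", "c")
toy_mat <- matrix(
  c(0, 1, 2,
    1, 0, 3,
    2, 3, 0),
  nrow = 3, dimnames = list(ids, ids)
)

# Matrix -> long: one row per (row key, column key) pair
toy_long <- data.frame(
  id_1 = rownames(toy_mat)[row(toy_mat)],
  id_2 = colnames(toy_mat)[col(toy_mat)],
  dist = as.vector(toy_mat),
  stringsAsFactors = FALSE
)

# Long -> matrix: fill cells addressed by (row name, column name) pairs
toy_mat_new <- matrix(NA_real_, nrow = 3, ncol = 3, dimnames = list(ids, ids))
toy_mat_new[cbind(toy_long$id_1, toy_long$id_2)] <- toy_long$dist

identical(toy_mat, toy_mat_new)
```

The long table has one row for every cell of the matrix, so a 3 x 3 matrix gives 9 rows, just as the 32 x 32 distance matrix above gives 1,024.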

Conclusion

Package comperes provides not only tools for managing competition results but also general-purpose functions:
- Compute vector levels with levels2(). Usually used to produce unique values in a standardized manner.
- Manage item summaries with summarise_item() and join_item_summary(). May be used to concisely compute comparisons of values with summaries from different groupings.
- Convert pairwise data with long_to_mat() and mat_to_long(). Very helpful in converting pairwise distances between “long” and “matrix” formats.

sessionInfo()

sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
## 
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8    
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8   
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2   tibble_1.4.2     dplyr_0.7.5.9000 rlang_0.2.0     
## [5] comperes_0.2.0  
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16     knitr_1.20       bindr_0.1.1      magrittr_1.5    
##  [5] tidyselect_0.2.4 R6_2.2.2         stringr_1.3.0    tools_3.4.4     
##  [9] xfun_0.1         utf8_1.1.3       cli_1.0.0        htmltools_0.3.6 
## [13] yaml_2.1.17      rprojroot_1.3-2  digest_0.6.15    assertthat_0.2.0
## [17] crayon_1.3.4     bookdown_0.7     purrr_0.2.4      glue_1.2.0      
## [21] evaluate_0.10.1  rmarkdown_1.9    blogdown_0.5     stringi_1.1.6   
## [25] compiler_3.4.4   pillar_1.2.1     backports_1.1.2  pkgconfig_2.0.1
