Harry Potter and competition results with comperes

2018-05-09

rstats comperes

Exploration of Harry Potter Books Survey results with help of my new comperes package.

Prologue

About a month ago I decided to add interesting data set to my almost finished (hopefully, soon to be on CRAN) comperes package. Data should represent results of some not ordinary competition. After some thought I picked a “competition” between Harry Potter books with a goal eventually to rate them from worst to best. After a series of events I ended up creating data myself. You can read more about that in my previous post.

Post and survey in general were popularized mostly among R users with R-bloggers (which gave me ~53 respondents), Twitter (which added the rest) and Reddit (which added ~0 people as post was deleted soon after publication). Survey managed to attract 182 respondents. I want to greatly thank all people who took their time to take part in and spread a word about my survey. Special thanks goes to Mara Averick who started a Twitter wave.

This post has two goals:

  • Present and explore results of the survey.
  • Demonstrate basic functionality of comperes package. To learn more go to its README and vignettes.

Overview

Survey results can be obtained by installing development version of comperes package from GitHub. They are present as package data named hp_survey.

This post will cover the following topics:

  • Exploration of survey results (most important being Book scores section).
  • Description of comperes competition results formats with conversion hp_survey to one of them.
  • Head-to-Head “performance” of books against each other.

We will need the following setup:

library(dplyr)
library(tidyr)
library(rlang)
library(stringr)
library(ggplot2)
library(comperes)

set.seed(201805)

theme_set(theme_bw())

# Authenticity palette
hp_pal <- c(Gryff = "#D02037", Huffl = "#F0C346",
            Raven = "#2450A8", Raven_light = "#0088FF",
            Slyth = "#09774A")

# For less noisy bar charts
theme_bar <- function() {
  list(theme(panel.grid.major.x = element_blank(),
             panel.grid.minor.x = element_blank()))
}

Exploration

Data preparation

hp_suvery is a tibble (enhanced data frame) and has the following columns:

  • person <int> : Identifier of a person.
  • book <chr> : Identifier of a Harry Potter book. Its values are of the form “HP_x” where “x” represents book’s number in the series (from 1 to 7).
  • score <chr> : Book’s score. Can be one of “1 - Poor”, “2 - Fair”, “3 - Good”, “4 - Very Good”, “5 - Excellent”.

For exploration, let’s transform hp_survey for more expressive code and results:

  • Convert scores to numerical.
  • Add book names.
book_names <- c(
    "Philosopher's (Sorcerer's) Stone (#1)",
    "Chamber of Secrets (#2)",
    "Prisoner of Azkaban (#3)",
    "Goblet of Fire (#4)",
    "Order of the Phoenix (#5)",
    "Half-Blood Prince (#6)",
    "Deathly Hallows (#7)"
  )
book_name_tbl <- tibble(
  book = paste0("HP_", 1:7),
  book_name = factor(book_names, levels = book_names)
)

hp <- hp_survey %>%
  # Extract numerical score
  rename(score_chr = score) %>%
  mutate(score = as.integer(gsub("[^0-9].*$", "", score_chr))) %>%
  # Add book names
  left_join(y = book_name_tbl, by = "book")

hp
## # A tibble: 657 x 5
##   person book  score_chr     score book_name                            
##    <int> <chr> <chr>         <int> <fct>                                
## 1      1 HP_6  5 - Excellent     5 Half-Blood Prince (#6)               
## 2      1 HP_7  5 - Excellent     5 Deathly Hallows (#7)                 
## 3      2 HP_1  3 - Good          3 Philosopher's (Sorcerer's) Stone (#1)
## 4      2 HP_4  5 - Excellent     5 Goblet of Fire (#4)                  
## 5      2 HP_5  2 - Fair          2 Order of the Phoenix (#5)            
## # ... with 652 more rows

Subset uniformity

The first step in the survey was to choose the first element in the randomly shuffled list to simulate generation of random subset from all books. Each of 127 list element was connected to one subset. Lets visualize subset frequency to ensure a good faith of respondents:

# Compute subset representations
hp_subsets <- hp %>%
  arrange(person, book) %>%
  group_by(person) %>%
  summarise(subset = paste0(book, collapse = "-"))

# Compute the number of actually picked subsets
n_distinct(hp_subsets$subset)
## [1] 95

# Visualize
hp_subsets %>%
  ggplot(aes(subset)) +
    geom_bar(fill = hp_pal["Gryff"]) +
    labs(
      x = "Subset", y = "Number of times subset was picked",
      title = "Picked subsets have fairly uniform distribution"
    ) +
    scale_x_discrete(labels = NULL) +
    theme_bar() +
    theme(axis.ticks.x = element_blank())

So there are 95 subsets actually picked and their distribution seems reasonably uniform. This is enough for me to confirm that randomization for subsets was successful.

Book presence

Other important thing to explore is number of times book was actually rated:

hp %>%
  ggplot(aes(book_name)) +
    geom_bar(fill = hp_pal["Huffl"]) +
    # Cool way to wrap labels for a given width
    scale_x_discrete(labels = function(x) str_wrap(x, width = 15)) +
    labs(
      x = "", y = "Number of present scores",
      title = "Some books were rated more times than others",
      subtitle = "But it seems fine"
    ) +
    theme_bar()

Book scores

The most obvious way to summarise book “performance” is its mean score of numerical representation of scale. Using mean is not harmful in this study as no outlier can be present.

hp_book_score <- hp %>%
  group_by(book_name) %>%
  summarise(mean_score = round(mean(score), digits = 2)) %>%
  arrange(desc(mean_score))

hp_book_score
## # A tibble: 7 x 2
##   book_name                             mean_score
##   <fct>                                      <dbl>
## 1 Prisoner of Azkaban (#3)                    4.19
## 2 Half-Blood Prince (#6)                      4.13
## 3 Goblet of Fire (#4)                         4.00
## 4 Deathly Hallows (#7)                        3.96
## 5 Philosopher's (Sorcerer's) Stone (#1)       3.91
## 6 Order of the Phoenix (#5)                   3.90
## 7 Chamber of Secrets (#2)                     3.55

So, “the best” book seems to be “Harry Potter and the Prisoner of Azkaban (#3)”.

For more understanding of results, lets also visualize score distribution.

hp %>%
  # Compute share of score per book
  count(book_name, score) %>%
  group_by(book_name) %>%
  mutate(share = n / sum(n)) %>%
  ungroup() %>%
  # Visualize
  ggplot() +
    geom_col(
      aes(score, share, colour = score, fill = score),
      show.legend = FALSE
    ) +
    geom_text(
      data = hp_book_score,
      mapping = aes(label = paste0("Mean = ", mean_score)),
      x = -Inf, y = Inf,
      hjust = -0.05, vjust = 1.3
    ) +
    facet_wrap(~ book_name) +
    scale_x_continuous(
      breaks = 1:5,
      labels = c("1\nPoor", "2\nFair", "3\nGood",
                 "4\nVery\nGood", "5\nExcellent")
    ) +
    scale_fill_gradient(low = hp_pal["Raven"], high = hp_pal["Raven_light"]) +
    scale_colour_gradient(low = hp_pal["Raven"], high = hp_pal["Raven_light"]) +
    labs(
      x = "", y = "Score share per book",
      title = '"Prisoner of Azkaban (#3)" seems to be "the best" HP book',
      caption = "@echasnovski"
    ) +
    theme_bar()

Competition results

Formats of comperes

Understanding of competition is quite general: it is a set of games (abstract event) in which players (abstract entity) gain some abstract scores (typically numeric). Inside games all players are treated equally. The most natural example is sport results, however not the only one. For example, product rating can be considered as a competition between products as “players”. Here a “game” is a customer that reviews a set of products by rating them with numerical “score” (stars, points, etc.).

In case of Harry Potter Books Survey results “game” is an act of respondent taking part in survey, “player” - Harry Potter book, “score” - discrete scale values converted to numerical score from 1 to 5.

In comperes there are two supported formats of competition results:

  • Long format. It is the most abstract way of presenting competition results. Basically, it is a data frame (or tibble) with columns game (game identifier), player (player identifier) and score where each row represents the score of particular player in particular game. One game can consist from variable number of players which makes this format more usable. Extra columns are allowed.
  • Wide format is a more convenient way to store results with fixed number of players in a game. Each row represents scores of all players in particular game. Data should be organized in pairs of columns “player”-“score”. Identifier of a pair should go after respective keyword and consist only from digits. For example: player1, score1, player2, score2. Order doesn’t matter. Column game is optional. Extra columns are also allowed.

Programmatically these formats are implemented as S3 classes longcr and widecr respectively. Essentially, they are tibbles with fixed structure. Objects of these classes should be created using functions as_longcr() and as_widecr() which also do conversions to other format.

Conversion

hp_survey presents results in long format.

hp_cr <- hp_survey %>%
  transmute(
    game = person, player = book,
    score = as.integer(gsub("[^0-9].*$", "", score))
  ) %>%
  as_longcr()

hp_cr
## # A longcr object:
## # A tibble: 657 x 3
##    game player score
##   <int> <chr>  <int>
## 1     1 HP_6       5
## 2     1 HP_7       5
## 3     2 HP_1       3
## 4     2 HP_4       5
## 5     2 HP_5       2
## # ... with 652 more rows

Here is the demonstration of conversion to wide format. It detects the maximum number of players in a game, which is 7, and assumes that data is missing in games with less number of players.

as_widecr(hp_cr)
## # A widecr object:
## # A tibble: 182 x 15
##    game player1 score1 player2 score2 player3 score3 player4 score4
##   <int> <chr>    <int> <chr>    <int> <chr>    <int> <chr>    <int>
## 1     1 HP_6         5 HP_7         5 <NA>        NA <NA>        NA
## 2     2 HP_1         3 HP_4         5 HP_5         2 HP_6         4
## 3     3 HP_1         3 HP_3         4 HP_5         1 <NA>        NA
## 4     4 HP_6         5 HP_7         5 <NA>        NA <NA>        NA
## 5     5 HP_4         4 HP_5         3 <NA>        NA <NA>        NA
## # ... with 177 more rows, and 6 more variables: player5 <chr>,
## #   score5 <int>, player6 <chr>, score6 <int>, player7 <chr>, score7 <int>

Head-to-Head

Functionality of comperes

Head-to-Head value is a summary statistic of direct confrontation between two players. It is assumed that this value can be computed based only on the players’ matchups (results for ordered pairs of players from one game). In other words, every game is converted into series of “subgames” between ordered pairs of players (including selfplay) which is stored as widecr object. After that, summary of item, defined by columns player1 and player2, is computed.

comperes has function get_matchups() for computing matchups:

get_matchups(hp_cr)
## # A widecr object:
## # A tibble: 2,697 x 5
##    game player1 score1 player2 score2
##   <int> <chr>    <int> <chr>    <int>
## 1     1 HP_6         5 HP_6         5
## 2     1 HP_6         5 HP_7         5
## 3     1 HP_7         5 HP_6         5
## 4     1 HP_7         5 HP_7         5
## 5     2 HP_1         3 HP_1         3
## # ... with 2,692 more rows

To compute multiple Head-to-Head values, use h2h_long() supplying competition results and summarizing expressions in dplyr::summarise() fashion. They will be applied to a data frame of matchups.

hp_cr_h2h <- hp_cr %>% h2h_long(
  # Number of macthups
  n = n(),
  # Number of wins plus half the number of ties
  # num_wins() is a function from comperes to compute number of times
  # first score is bigger than second one
  num_wins = num_wins(score1, score2, half_for_draw = TRUE),
  # Mean rating of a book scored in matchups with other books
  mean_score = mean(score1),
  # Mean rating difference of books scored in direct matchups
  mean_score_diff = mean(score1 - score2)
) %>%
  mutate_if(is.numeric, funs(round(., 2)))

hp_cr_h2h
## # A long format of Head-to-Head values:
## # A tibble: 49 x 6
##   player1 player2     n num_wins mean_score mean_score_diff
##   <chr>   <chr>   <dbl>    <dbl>      <dbl>           <dbl>
## 1 HP_1    HP_1      88.     44.0       3.91          0.    
## 2 HP_1    HP_2      42.     29.5       3.88          0.500 
## 3 HP_1    HP_3      51.     19.5       3.92         -0.390 
## 4 HP_1    HP_4      48.     24.0       3.79          0.0400
## 5 HP_1    HP_5      42.     21.5       3.79          0.    
## # ... with 44 more rows

So here we see, for example, that HP_1 and HP_2 had 42 matchups, i.e. they were rated by the same person 42 times. HP_1 “won” 29.5 (respecting ties) times, gained mean score of 3.88 in those matchups and had, on average, 0.5 points more.

There is also an h2h_mat() function which computes a matrix of Head-to-Head values for one expression.

hp_cr %>% h2h_mat(num_wins(score1, score2, half_for_draw = TRUE))
## # A matrix format of Head-to-Head values:
##      HP_1 HP_2 HP_3 HP_4 HP_5 HP_6 HP_7
## HP_1 44.0 29.5 19.5 24.0 21.5 17.0 24.0
## HP_2 12.5 40.0 12.0 11.5 10.5 12.0 19.0
## HP_3 31.5 32.0 49.0 31.5 28.0 25.0 33.5
## HP_4 24.0 33.5 26.5 49.5 23.5 30.5 31.5
## HP_5 20.5 25.5 15.0 24.5 42.0 23.0 24.5
## HP_6 25.0 30.0 20.0 27.5 24.0 50.0 34.0
## HP_7 26.0 34.0 21.5 29.5 25.5 26.0 54.0

For more convenient usage, comperes has a list h2h_funs of some common Head-to-Head functions stored as expressions. To use them you need a little bit of rlang’s unquoting magic.

h2h_funs[1:3]
## $mean_score_diff
## mean(score1 - score2)
## 
## $mean_score_diff_pos
## max(mean(score1 - score2), 0)
## 
## $mean_score
## mean(score1)

hp_cr %>% h2h_long(!!! h2h_funs)
## # A long format of Head-to-Head values:
## # A tibble: 49 x 11
##   player1 player2 mean_score_diff mean_score_diff_pos mean_score
##   <chr>   <chr>             <dbl>               <dbl>      <dbl>
## 1 HP_1    HP_1             0.                  0.           3.91
## 2 HP_1    HP_2             0.500               0.500        3.88
## 3 HP_1    HP_3            -0.392               0.           3.92
## 4 HP_1    HP_4             0.0417              0.0417       3.79
## 5 HP_1    HP_5             0.                  0.           3.79
## # ... with 44 more rows, and 6 more variables: sum_score_diff <int>,
## #   sum_score_diff_pos <dbl>, sum_score <int>, num_wins <dbl>,
## #   num_wins2 <dbl>, num <int>

Harry Potter books

Head-to-Head “performance” of Harry Potter books is summarised in the following plot:

hp_cr_h2h %>%
  gather(h2h_fun, value, -player1, -player2) %>%
  # Manually produce a dummy colour variable to use in facets
  group_by(h2h_fun) %>%
  mutate(col = (value - min(value)) / (max(value) - min(value))) %>%
  ungroup() %>%
  # Make factors for correct orders
  mutate(
    player1 = factor(player1, levels = rev(sort(unique(player1)))),
    player2 = factor(player2, levels = sort(unique(player2))),
    h2h_fun = factor(h2h_fun,
                     levels = c("n", "num_wins",
                                "mean_score", "mean_score_diff")),
    h2h_fun = recode(
      h2h_fun,
      n = "Number of matchups (ratings by common person)",
      num_wins = 'Number of "wins" in matchups (half for ties)',
      mean_score = "Mean score in matchups",
      mean_score_diff = "Mean score difference in matchups"
    )
  ) %>%
  # Visualize
  ggplot(aes(player1, player2)) +
    geom_text(
      aes(label = value, colour = col),
      size = 5, fontface = "bold", show.legend = FALSE
    ) +
    facet_wrap(~ h2h_fun, scales = "free") +
    # To coordinate well with matrix form of Head-to-Head results
    coord_flip() +
    scale_colour_gradient(low = hp_pal["Slyth"], high = hp_pal["Gryff"]) +
    labs(
      x = "", y = "",
      title = "Head-to-Head performance of Harry Potter books",
      subtitle = paste0(
        '"HP_x" means Harry Potter book number "x" in series\n',
        "Numbers are Head-to-Head values of book in row against book in column"
      ),
      caption = "@echasnovski"
    ) +
    theme_classic() +
    theme(strip.text = element_text(face = "bold"))

There is a lot of information hidden in this plot. The most obvious discoveries:

  • It happened that book “HP_7” (“Deathly Hallows”) was rated with “HP_4” (“Goblet of Fire”) by one person the most: 61 times.
  • “HP_7” scored over “HP_2” (“Chamber of Secrets”) the most wins (34, half for ties) as did “HP_6” (“Half-Blood Prince”) over “HP_7”.
  • Book “HP_6” made the highest mean score of 4.36 in matchups with “HP_2”, which is bigger by 0.23 from its overall mean score.
  • In terms of score differences, “HP_3” (“Prisoner of Azkaban”) did best in matchups with “HP_2”, scoring on average 0.77 points more. This pair also represents “the best” and “the worst” books in terms of mean score.

Conclusion

  • A public call for help in creating data set for R package shouldn’t be made on Reddit but rather on R-bloggers or Twitter.
  • Among all original Harry Potter books, “Harry Potter and the Prisoner of Azkaban” seems to be considered “best” among R users. “Harry Potter and the Chamber of Secrets” suffers the opposite fate.
  • Package comperes is useful for storing, manipulating and summarising abstract competition results.
  • However informative, manually inspecting competition results with direct summaries and Head-to-Head tables is hard. They can display complex nature of performance relations between players. Next analysis of Harry Potter Books Survey data will be using my package comperank which implements different ranking methods for automatic discovery of player’s performance.
sessionInfo()
sessionInfo()
## R version 3.4.4 (2018-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.4 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/openblas-base/libblas.so.3
## LAPACK: /usr/lib/libopenblasp-r0.2.18.so
## 
## locale:
##  [1] LC_CTYPE=ru_UA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=ru_UA.UTF-8        LC_COLLATE=ru_UA.UTF-8    
##  [5] LC_MONETARY=ru_UA.UTF-8    LC_MESSAGES=ru_UA.UTF-8   
##  [7] LC_PAPER=ru_UA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=ru_UA.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] methods   stats     graphics  grDevices utils     datasets  base     
## 
## other attached packages:
## [1] bindrcpp_0.2.2   comperes_0.2.0   ggplot2_2.2.1    stringr_1.3.0   
## [5] rlang_0.2.0      tidyr_0.8.0.9000 dplyr_0.7.5.9000
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.16     pillar_1.2.1     compiler_3.4.4   plyr_1.8.4      
##  [5] bindr_0.1.1      tools_3.4.4      digest_0.6.15    evaluate_0.10.1 
##  [9] tibble_1.4.2     gtable_0.2.0     pkgconfig_2.0.1  cli_1.0.0       
## [13] yaml_2.1.17      blogdown_0.5     xfun_0.1         knitr_1.20      
## [17] rprojroot_1.3-2  grid_3.4.4       tidyselect_0.2.4 glue_1.2.0      
## [21] R6_2.2.2         rmarkdown_1.9    bookdown_0.7     purrr_0.2.4     
## [25] magrittr_1.5     backports_1.1.2  scales_0.5.0     htmltools_0.3.6 
## [29] assertthat_0.2.0 colorspace_1.3-2 labeling_0.3     utf8_1.1.3      
## [33] stringi_1.1.6    lazyeval_0.2.1   munsell_0.4.3    crayon_1.3.4

Statistical uncertainty with R and pdqr

2019-11-11

rstats pdqr

Local randomness in R

2019-08-13

rstats

Arguments of stats::density()

2019-08-06

rstats pdqr

comments powered by Disqus