An Introduction to the dragracer Package

library(dragracer)
#> The library is now open
library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)

The dragracer package has three data sets. The first is episode-level data (rpdr_ep). These data contain some more granular information about each episode that may not be discernible from how episodes are typically summarized on Wikipedia (e.g. mini-challenge winners, runway themes [where applicable], lip-sync song and artist). The second data set is contestant-level (rpdr_contestants). This data frame includes the contestant name, hometown, and purported date of birth and age by the start of the show. The third data set is episode-contestant-level data (rpdr_contep). This is the most familiar form of the data that a reader of the show’s Wikipedia entries could discern. They include information about how a contestant fared in a particular episode (i.e. whether they won, scored high, were safe, scored low, or were in the bottom). The show’s fans are accustomed to seeing this form of the data as akin to a pyramid. However, I convert the data from wide to long, making the data akin to a survival data-generating process.

Here are some potential uses of the data.

Summarizing the Data

A user can learn about how to summarize data. Here, we can get the average age of the contestants by season from the rpdr_contestants data.

rpdr_contestants %>%
  group_by(season) %>%
  summarize(mean_age = mean(age))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 2
#>    season mean_age
#>    <chr>     <dbl>
#>  1 S01        31  
#>  2 S02        27.6
#>  3 S03        28.2
#>  4 S04        29.2
#>  5 S05        28  
#>  6 S06        29.3
#>  7 S07        30.2
#>  8 S08        29.8
#>  9 S09        30.4
#> 10 S10        28.3
#> 11 S11        29.3
#> 12 S12        28.4

A user can also see which musical artists have appeared most for lip-syncs. The answer here is, unsurprisingly, RuPaul.

rpdr_ep %>%
  group_by(lipsyncartist) %>%
  summarize(n = n()) %>% 
  na.omit %>%
  arrange(-n) %>% head(10)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 10 x 2
#>    lipsyncartist       n
#>    <chr>           <int>
#>  1 RuPaul             11
#>  2 Britney Spears      5
#>  3 Madonna             5
#>  4 Aretha Franklin     4
#>  5 Lady Gaga           4
#>  6 Donna Summer        3
#>  7 Whitney Houston     3
#>  8 Ariana Grande       2
#>  9 Blondie             2
#> 10 Chaka Khan          2

A user can also see how Jinkx Monsoon, the GOAT, fared in all her episodes.

rpdr_contep %>%
  filter(contestant == "Jinkx Monsoon") %>%
  select(season, contestant, episode, outcome, finale)
#> # A tibble: 12 x 5
#>    season contestant    episode outcome finale
#>    <chr>  <chr>           <dbl> <chr>    <dbl>
#>  1 S05    Jinkx Monsoon       1 SAFE         0
#>  2 S05    Jinkx Monsoon       2 HIGH         0
#>  3 S05    Jinkx Monsoon       3 HIGH         0
#>  4 S05    Jinkx Monsoon       4 HIGH         0
#>  5 S05    Jinkx Monsoon       5 WIN          0
#>  6 S05    Jinkx Monsoon       6 HIGH         0
#>  7 S05    Jinkx Monsoon       7 HIGH         0
#>  8 S05    Jinkx Monsoon       8 HIGH         0
#>  9 S05    Jinkx Monsoon       9 WIN          0
#> 10 S05    Jinkx Monsoon      10 HIGH         0
#> 11 S05    Jinkx Monsoon      11 BTM          0
#> 12 S05    Jinkx Monsoon      14 WIN          1

Merging Across Data

Previous versions of the data included all sorts of information at the contestant-level. For release, I decided to strip that information from the data in order to allow the user to learn how to do this. For example, if you were interested in summarizing how each contestant did in their particular season on various metrics, here’s how you might do that.

First, let’s merge in the mini-challenge data. Mini-challenges are irregular; not every episode has them. Indeed, Season 12 had very few of them. So, they get special treatment in the episode-level data.


rpdr_ep %>%
  select(season, minicw1:minicw3) %>%
  group_by(season) %>%
  gather(Category, contestant, minicw1:minicw3) %>%
  na.omit %>%
  group_by(season, contestant) %>%
  summarize(minicwins = n()) %>%
  left_join(rpdr_contestants, .) %>%
  mutate(minicwins = ifelse(is.na(minicwins), 0, minicwins)) -> D
#> `summarise()` regrouping output by 'season' (override with `.groups` argument)
#> Joining, by = c("season", "contestant")

Now, let’s merge in data from the episode-contestant-level about how each contestant fared, excluding finales and specials. We’ll calculate all sorts of things here, including estimated “points per episode” and “Dusted or Busted” scores.

rpdr_contep %>%
  filter(participant == 1 & finale == 0 & penultimate == 0) %>%
  mutate(high = ifelse(outcome == "HIGH", 1, 0),
         win = ifelse(outcome == "WIN", 1, 0),
         low = ifelse(outcome == "LOW", 1, 0),
         safe = ifelse(outcome == "SAFE", 1, 0),
         highsafe = ifelse(outcome %in% c("HIGH", "SAFE"), 1, 0),
         winhigh = ifelse(outcome %in% c("HIGH", "WIN"), 1, 0),
         btm = ifelse(outcome == "BTM", 1, 0),
         lowbtm = ifelse(outcome %in% c("BTM", "LOW"), 1, 0)) %>%
  group_by(season,contestant,rank) %>%
  mutate(numcontests = n()) %>%
  group_by(season,contestant, numcontests, rank) %>%
  summarize(perc_high = sum(high)/unique(numcontests),
            perc_win = sum(win)/unique(numcontests),
            perc_winhigh = sum(winhigh)/unique(numcontests),
            perc_low = sum(low)/unique(numcontests),
            perc_btm = sum(btm)/unique(numcontests),
            perc_lowbtm = sum(lowbtm)/unique(numcontests),
            num_high = sum(high),
            num_win = sum(win),
            num_winhigh = sum(winhigh),
            num_btm = sum(btm),
            num_low = sum(low),
            num_lowbtm = sum(lowbtm),
            db_score = 2*sum(win, na.rm=T) +
              1*sum(high, na.rm=T) +
              (sum(safe, na.rm=T)*0) +
              (sum(low, na.rm=T)*-1) + (sum(btm, na.rm=T)*-2)) %>%
  ungroup() %>%
  mutate(points = (2*num_win + num_high - num_low + (-2)*num_btm),
            ppe = points/numcontests) %>%
  full_join(D, .) -> D
#> `summarise()` regrouping output by 'season', 'contestant', 'numcontests' (override with `.groups` argument)
#> Joining, by = c("season", "contestant")

How, let’s look at who had the highest “Dusted or Busted” score across all seasons.

D %>%
  arrange(-db_score) %>%
  head(10) %>%
  select(season, contestant, rank, db_score)
#> # A tibble: 10 x 4
#>    season contestant      rank db_score
#>    <chr>  <chr>          <dbl>    <dbl>
#>  1 S06    Bianca Del Rio     1       10
#>  2 S05    Jinkx Monsoon      1        9
#>  3 S09    Shea Couleé        3        9
#>  4 S09    Sasha Velour       1        8
#>  5 S02    Tyra Sanchez       1        7
#>  6 S03    Raja               1        7
#>  7 S03    Manila Luzon       2        7
#>  8 S04    Sharon Needles     1        7
#>  9 S12    Gigi Goode         2        7
#> 10 S05    Alaska             2        6

Let’s also see who has the highest “points per episode” score.

D %>%
  arrange(-ppe) %>%
  head(10) %>%
  select(season, contestant, rank, ppe)
#> # A tibble: 10 x 4
#>    season contestant      rank   ppe
#>    <chr>  <chr>          <dbl> <dbl>
#>  1 S06    Bianca Del Rio     1 1    
#>  2 S05    Jinkx Monsoon      1 0.818
#>  3 S09    Shea Couleé        3 0.818
#>  4 S01    Ongina             5 0.8  
#>  5 S02    Tyra Sanchez       1 0.778
#>  6 S09    Sasha Velour       1 0.727
#>  7 S01    Nina Flowers       2 0.667
#>  8 S04    Sharon Needles     1 0.636
#>  9 S03    Raja               1 0.583
#> 10 S03    Manila Luzon       2 0.583

Feel free to use the data for your own ends or learn R from it.