dragracer
Packagelibrary(dragracer)
#> The library is now open
library(tibble)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(tidyr)
The dragracer
package has three data sets. The first is episode-level data (rpdr_ep
). These data contain some more granular information about each episode that may not be discernible from how episodes are typically summarized on Wikipedia (e.g. mini-challenge winners, runway themes [where applicable], lip-sync song and artist). The second data set is contestant-level (rpdr_contestants
). This data frame includes the contestant name, hometown, and purported date of birth and age by the start of the show. The third data set is episode-contestant-level data (rpdr_contep
). This is the most familiar form of the data that a reader of the show’s Wikipedia entries could discern. They include information about how a contestant fared in a particular episode (i.e. whether they won, scored high, were safe, scored low, or were in the bottom). The show’s fans are accustomed to seeing this form of the data as akin to a pyramid. However, I convert the data from wide to long, making the data akin to a survival data-generating process.
Here are some potential uses of the data.
A user can learn about how to summarize data. Here, we can get the average age of the contestants by season from the rpdr_contestants
data.
rpdr_contestants %>%
group_by(season) %>%
summarize(mean_age = mean(age))
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 12 x 2
#> season mean_age
#> <chr> <dbl>
#> 1 S01 31
#> 2 S02 27.6
#> 3 S03 28.2
#> 4 S04 29.2
#> 5 S05 28
#> 6 S06 29.3
#> 7 S07 30.2
#> 8 S08 29.8
#> 9 S09 30.4
#> 10 S10 28.3
#> 11 S11 29.3
#> 12 S12 28.4
A user can also see which musical artists have appeared most for lip-syncs. The answer here is, unsurprisingly, RuPaul.
rpdr_ep %>%
group_by(lipsyncartist) %>%
summarize(n = n()) %>%
na.omit %>%
arrange(-n) %>% head(10)
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 10 x 2
#> lipsyncartist n
#> <chr> <int>
#> 1 RuPaul 11
#> 2 Britney Spears 5
#> 3 Madonna 5
#> 4 Aretha Franklin 4
#> 5 Lady Gaga 4
#> 6 Donna Summer 3
#> 7 Whitney Houston 3
#> 8 Ariana Grande 2
#> 9 Blondie 2
#> 10 Chaka Khan 2
A user can also see how Jinkx Monsoon, the GOAT, fared in all her episodes.
rpdr_contep %>%
filter(contestant == "Jinkx Monsoon") %>%
select(season, contestant, episode, outcome, finale)
#> # A tibble: 12 x 5
#> season contestant episode outcome finale
#> <chr> <chr> <dbl> <chr> <dbl>
#> 1 S05 Jinkx Monsoon 1 SAFE 0
#> 2 S05 Jinkx Monsoon 2 HIGH 0
#> 3 S05 Jinkx Monsoon 3 HIGH 0
#> 4 S05 Jinkx Monsoon 4 HIGH 0
#> 5 S05 Jinkx Monsoon 5 WIN 0
#> 6 S05 Jinkx Monsoon 6 HIGH 0
#> 7 S05 Jinkx Monsoon 7 HIGH 0
#> 8 S05 Jinkx Monsoon 8 HIGH 0
#> 9 S05 Jinkx Monsoon 9 WIN 0
#> 10 S05 Jinkx Monsoon 10 HIGH 0
#> 11 S05 Jinkx Monsoon 11 BTM 0
#> 12 S05 Jinkx Monsoon 14 WIN 1
Previous versions of the data included all sorts of information at the contestant-level. For release, I decided to strip that information from the data in order to allow the user to learn how to do this. For example, if you were interested in summarizing how each contestant did in their particular season on various metrics, here’s how you might do that.
First, let’s merge in the mini-challenge data. Mini-challenges are irregular; not every episode has them. Indeed, Season 12 had very few of them. So, they get special treatment in the episode-level data.
rpdr_ep %>%
select(season, minicw1:minicw3) %>%
group_by(season) %>%
gather(Category, contestant, minicw1:minicw3) %>%
na.omit %>%
group_by(season, contestant) %>%
summarize(minicwins = n()) %>%
left_join(rpdr_contestants, .) %>%
mutate(minicwins = ifelse(is.na(minicwins), 0, minicwins)) -> D
#> `summarise()` regrouping output by 'season' (override with `.groups` argument)
#> Joining, by = c("season", "contestant")
Now, let’s merge in data from the episode-contestant-level about how each contestant fared, excluding finales and specials. We’ll calculate all sorts of things here, including estimated “points per episode” and “Dusted or Busted” scores.
rpdr_contep %>%
filter(participant == 1 & finale == 0 & penultimate == 0) %>%
mutate(high = ifelse(outcome == "HIGH", 1, 0),
win = ifelse(outcome == "WIN", 1, 0),
low = ifelse(outcome == "LOW", 1, 0),
safe = ifelse(outcome == "SAFE", 1, 0),
highsafe = ifelse(outcome %in% c("HIGH", "SAFE"), 1, 0),
winhigh = ifelse(outcome %in% c("HIGH", "WIN"), 1, 0),
btm = ifelse(outcome == "BTM", 1, 0),
lowbtm = ifelse(outcome %in% c("BTM", "LOW"), 1, 0)) %>%
group_by(season,contestant,rank) %>%
mutate(numcontests = n()) %>%
group_by(season,contestant, numcontests, rank) %>%
summarize(perc_high = sum(high)/unique(numcontests),
perc_win = sum(win)/unique(numcontests),
perc_winhigh = sum(winhigh)/unique(numcontests),
perc_low = sum(low)/unique(numcontests),
perc_btm = sum(btm)/unique(numcontests),
perc_lowbtm = sum(lowbtm)/unique(numcontests),
num_high = sum(high),
num_win = sum(win),
num_winhigh = sum(winhigh),
num_btm = sum(btm),
num_low = sum(low),
num_lowbtm = sum(lowbtm),
db_score = 2*sum(win, na.rm=T) +
1*sum(high, na.rm=T) +
(sum(safe, na.rm=T)*0) +
(sum(low, na.rm=T)*-1) + (sum(btm, na.rm=T)*-2)) %>%
ungroup() %>%
mutate(points = (2*num_win + num_high - num_low + (-2)*num_btm),
ppe = points/numcontests) %>%
full_join(D, .) -> D
#> `summarise()` regrouping output by 'season', 'contestant', 'numcontests' (override with `.groups` argument)
#> Joining, by = c("season", "contestant")
How, let’s look at who had the highest “Dusted or Busted” score across all seasons.
D %>%
arrange(-db_score) %>%
head(10) %>%
select(season, contestant, rank, db_score)
#> # A tibble: 10 x 4
#> season contestant rank db_score
#> <chr> <chr> <dbl> <dbl>
#> 1 S06 Bianca Del Rio 1 10
#> 2 S05 Jinkx Monsoon 1 9
#> 3 S09 Shea Couleé 3 9
#> 4 S09 Sasha Velour 1 8
#> 5 S02 Tyra Sanchez 1 7
#> 6 S03 Raja 1 7
#> 7 S03 Manila Luzon 2 7
#> 8 S04 Sharon Needles 1 7
#> 9 S12 Gigi Goode 2 7
#> 10 S05 Alaska 2 6
Let’s also see who has the highest “points per episode” score.
D %>%
arrange(-ppe) %>%
head(10) %>%
select(season, contestant, rank, ppe)
#> # A tibble: 10 x 4
#> season contestant rank ppe
#> <chr> <chr> <dbl> <dbl>
#> 1 S06 Bianca Del Rio 1 1
#> 2 S05 Jinkx Monsoon 1 0.818
#> 3 S09 Shea Couleé 3 0.818
#> 4 S01 Ongina 5 0.8
#> 5 S02 Tyra Sanchez 1 0.778
#> 6 S09 Sasha Velour 1 0.727
#> 7 S01 Nina Flowers 2 0.667
#> 8 S04 Sharon Needles 1 0.636
#> 9 S03 Raja 1 0.583
#> 10 S03 Manila Luzon 2 0.583
Feel free to use the data for your own ends or learn R from it.