Revisiting a fun data challenge from last year
Title photo from Diana Polekhina on Unsplash.
‘Hanukkah of Data’ is a data challenge where you have to solve 8 puzzles surrounding a fictional data set. I participated in last year’s challenge, and it was a lot of fun to revisit the old puzzles and rework some of my solutions.
The ‘speed run’ challenge in particular had some twists in it, so I will focus on those puzzles in this blog post. The puzzles are mostly the same as in the normal version, but the data contains some more difficult edge cases.
To solve the puzzles, we use the following R libraries:
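A minimal setup that should cover everything used below (my assumption, since tidyverse 2.0+ also attaches lubridate and clean_names() comes from janitor):

library(tidyverse) # dplyr, purrr, stringr, readr; 2.0+ also attaches lubridate
library(assertr)   # verify() for inline sanity checks
library(janitor)   # clean_names()
library(targets)   # only needed to run the pipeline itself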
I’ve already used the tidyverse in last year’s challenge. This year I am also using assertr to detect problems with my solutions as early as possible.
Behind the scenes, the whole analysis is created as a targets pipeline. So if you want to reproduce the analysis, perform the following steps:
- renv::restore() to restore the R package versions
- targets::tar_make() to run the pipeline

To find the rug, we will need to contact a private investigator. The investigator’s last name can be spelled by using the letters printed on the phone buttons. For example: 2 has “ABC”, 3 has “DEF”, etc.
The keypad can be represented in R like this:
phone_letter <- function() {
tibble(letter = letters) |>
mutate(
nr = as.character(case_when(
letter %in% c("a", "b", "c") ~ 2,
letter %in% c("d", "e", "f") ~ 3,
letter %in% c("g", "h", "i") ~ 4,
letter %in% c("j", "k", "l") ~ 5,
letter %in% c("m", "n", "o") ~ 6,
letter %in% c("p", "q", "r", "s") ~ 7,
letter %in% c("t", "u", "v") ~ 8,
letter %in% c("w", "x", "y", "z") ~ 9,
TRUE ~ NA_real_)))
}
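For example, the letter “s” sits on the 7 key:

phone_letter() |>
  filter(letter == "s")
#> # A tibble: 1 × 2
#>   letter nr
#>   <chr>  <chr>
#> 1 s      7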
We then need to find the last name of each person. This can be a little bit tricky: names can also include special suffixes like Jr. or Roman numerals like III. Therefore we use a regex to filter for name parts that start with an uppercase letter and end with one or more lowercase letters. After that, we transform the data so that every letter gets its own row:
last_names <- function(df_customers) {
df_customers |>
transmute(
customerid,
name_split = str_split(name, "\\s+"),
name_split = map(
name_split,
~ str_subset(.x, regex("^[A-Z][a-z]+$"))),
last_name = map_chr(name_split, last),
letter = str_split(last_name, ""),
phone_chr = str_remove_all(phone, "-")) |>
unnest(letter) |>
mutate(letter = str_to_lower(letter))
}
By combining both data sources, we can answer the question:
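One way to do the combination, as a sketch: join the digits onto the letters, collapse each last name into a digit string, and keep the customer whose phone number spells their own last name (investigator() and its internals are my reconstruction, not necessarily the original code):

investigator <- function(df_last_names, df_phone_letter) {
  df_last_names |>
    left_join(df_phone_letter, by = "letter") |>
    group_by(customerid, last_name, phone_chr) |>
    # collapse the per-letter digits back into one string per customer
    summarise(name_as_nr = str_c(nr, collapse = ""), .groups = "drop") |>
    filter(name_as_nr == phone_chr) |>
    verify(length(customerid) == 1)
}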
| customerid | name | phone |
|---|---|---|
Now we are looking for the contractor to whom the rug was given by a cleaning company.
Look for customers with the initials we are searching for (‘ds’ in the speed run):
initials <- function(df_customers, searched_initials) {
df_customers |>
mutate(
name_split = str_split(name, "\\s+"),
name_split = map(
name_split, ~ str_subset(.x, regex("^[a-z]+$", ignore_case = TRUE))),
first_name = map_chr(name_split, first),
last_name = map_chr(name_split, last)) |>
transmute(
customerid,
initials = str_to_lower(str_c(
str_sub(first_name, end = 1), str_sub(last_name, end = 1)))) |>
filter(initials == searched_initials)
}
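Applied to the speed-run initials:

# df_customers is the customers table loaded by the pipeline
df_initials <- initials(df_customers, "ds")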
We then look for products that are ‘coffee’ or ‘bagels’. Join the order items to the filtered products:
coffee_bagels <- function(df_products, df_order_items) {
df_coffee <- df_products |>
filter(str_detect(desc, regex("coffee", ignore_case = TRUE)))
df_bagel <- df_products |>
filter(str_detect(desc, regex("bagel", ignore_case = TRUE)))
bind_rows(
list(bagel = df_bagel, coffee = df_coffee), .id = "coffee_bagel") |>
left_join(df_order_items, by = "sku")
}
Look for 2017 orders where coffee or bagels were bought. Keep only those where the customer has the above-mentioned initials:
order_contractor <- function(df_orders, df_coffee_bagels, df_initials) {
df_orders |>
filter(year(ordered) == 2017) |>
inner_join(df_coffee_bagels, by = "orderid") |>
group_by(customerid, day = floor_date(ordered, unit = "day")) |>
summarise(
coffee = any(coffee_bagel == "coffee"),
bagel = any(coffee_bagel == "bagel"), .groups = "drop_last") |>
summarise(coffee_and_bagel = any(coffee & bagel)) |>
filter(coffee_and_bagel) |>
semi_join(df_initials, by = "customerid") |>
verify(length(customerid) == 1)
}
| customerid | name | phone |
|---|---|---|
Search for the neighbor with the spider hat. The filtering conditions are the following:

- born in a Year of the Goat (Chinese zodiac)
- a Libra, i.e. born between September 23 and October 23
- living in the same area as the contractor

These are the specific filter conditions for the speed run. For the other versions of the game they are different, but the general filtering stays the same.
Define goat years (pasted from Wikipedia):
goat_years <- function() {
df_goat_raw <- tibble::tribble(
~Start.date, ~End.date, ~Heavenly.branch,
"13 February 1907", "1 February 1908", "Fire Goat",
"1 February 1919", "19 February 1920", "Earth Goat",
"17 February 1931", "5 February 1932", "Metal Goat",
"5 February 1943", "24 January 1944", "Water Goat",
"24 January 1955", "11 February 1956", "Wood Goat",
"9 February 1967", "29 January 1968", "Fire Goat",
"28 January 1979", "15 February 1980", "Earth Goat",
"15 February 1991", "3 February 1992", "Metal Goat",
"1 February 2003", "21 January 2004", "Water Goat",
"19 February 2015", "7 February 2016", "Wood Goat",
"6 February 2027", "25 January 2028", "Fire Goat",
"24 January 2039", "11 February 2040", "Earth Goat",
"11 February 2051", "31 January 2052", "Metal Goat",
"29 January 2063", "16 February 2064", "Water Goat",
"15 February 2075", "4 February 2076", "Wood Goat",
"3 February 2087", "23 January 2088", "Fire Goat",
"21 January 2099", "8 February 2100", "Earth Goat") |>
clean_names()
df_goat_raw |>
mutate(
across(c(start_date, end_date), ~ parse_date(.x, "%d %B %Y")))
}
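In the spirit of catching problems early, we can assert that the table parsed cleanly (a quick check I’d add, not part of the original pipeline):

goat_years() |>
  verify(!is.na(start_date) & !is.na(end_date)) |>
  verify(start_date < end_date)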
spider_hat <- function(df_customers, df_contractor, df_chinese_year) {
df_customers |>
filter(
map_lgl(
birthdate,
~ any(
df_chinese_year$start_date <= .x & df_chinese_year$end_date >= .x)),
case_when(
month(birthdate) == 9 ~ day(birthdate) >= 23,
month(birthdate) == 10 ~ day(birthdate) <= 23,
TRUE ~ FALSE),
str_detect(citystatezip, df_contractor$citystatezip)) |>
select(customerid, name, citystatezip, phone) |>
verify(length(customerid) == 1)
}
| customerid | name | citystatezip | phone |
|---|---|---|---|
Look for order items that are ‘pastries’:
order_items_pastries <- function(df_order_items, df_products) {
df_products_pastries <- df_products |>
filter(str_detect(sku, regex("bky", ignore_case = TRUE)))
df_order_items |>
semi_join(df_products_pastries, by = "sku")
}
Look for people who order pastries early in the morning:
tinder_woman <- function(df_orders, df_order_items_pastries, df_customers) {
df_order_items_pastries |>
left_join(df_orders, by = "orderid") |>
filter(hour(ordered) < 9) |>
arrange(ordered) |>
group_by(day = floor_date(ordered, "day")) |>
summarise(
earliest_order = min(ordered),
customerid = unique(customerid[ordered == earliest_order])) |>
count(customerid, sort = TRUE) |>
slice(1) |>
left_join(
select(df_customers, customerid, name, phone),
by = c("customerid"))
}
| customerid | n | name | phone |
|---|---|---|---|
Look for people who live in Staten Island (not needed for the speed run):
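The filter itself would be a simple string match on citystatezip. A sketch, assuming the borough name appears verbatim in that column (staten_island() is a hypothetical name):

staten_island <- function(df_customers) {
  df_customers |>
    filter(str_detect(
      citystatezip, regex("Staten Island", ignore_case = TRUE)))
}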
Look for products that represent cat food for senior cats:
senior_cat_food <- function(df_products) {
df_products |>
filter(
str_detect(desc, regex("cat\\s+food", ignore_case = TRUE)),
str_detect(desc, regex("senior", ignore_case = TRUE)))
}
Combine the information and look for the woman we are searching for:
cat_lady <- function(df_order_items, df_orders, df_senior_cat_food,
df_staten_island) {
df_order_items |>
semi_join(df_senior_cat_food, by = "sku") |>
left_join(select(df_orders, orderid, customerid), by = "orderid") |>
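    # the Staten Island filter is not needed for the speed run: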
#semi_join(df_staten_island, by = "customerid") |>
count(customerid, sort = TRUE) |>
slice(1)
}
| customerid | name | citystatezip | phone |
|---|---|---|---|
Calculate the margin for each order item and determine the customer with the lowest total margin:
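A sketch of both steps, assuming the order items carry qty and unit_price columns and the products table a wholesale_cost column (lowest_margin_customer() is a name I made up, not the original pipeline code):

lowest_margin_customer <- function(df_orders, df_order_items, df_products) {
  df_order_items |>
    left_join(select(df_products, sku, wholesale_cost), by = "sku") |>
    # margin per line item: revenue minus cost of goods
    mutate(margin = qty * (unit_price - wholesale_cost)) |>
    left_join(select(df_orders, orderid, customerid), by = "orderid") |>
    group_by(customerid) |>
    summarise(total_margin = sum(margin), .groups = "drop") |>
    slice_min(total_margin, n = 1)
}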
| customerid | name | phone |
|---|---|---|
Find all orders that contain a colored item:
color_orders <- function(df_orders, df_order_items, df_products) {
df_orders |>
left_join(df_order_items, by = c("orderid")) |>
left_join(df_products, by = "sku") |>
mutate(
color = str_remove_all(str_extract(desc, "\\(.+\\)"), "\\(|\\)"),
day = as_date(floor_date(ordered, unit = "day"))) |>
filter(!is.na(color))
}
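For a made-up description like “Noah’s Poster (orange)”, the regex pulls out just the color:

str_remove_all(str_extract("Noah's Poster (orange)", "\\(.+\\)"), "\\(|\\)")
#> [1] "orange"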
Search for orders that happened in close proximity to the orders of the frugal cousin (within 0.3 minutes, i.e. 18 seconds):
ex_boyfriend <- function(df_color_orders, df_frugal_cousin) {
df_color_orders_fc <- df_color_orders |>
semi_join(df_frugal_cousin, by = "customerid") |>
mutate(start = ordered - dminutes(0.3), end = ordered + dminutes(0.3))
df_color_orders |>
anti_join(df_color_orders_fc, by = join_by(customerid)) |>
inner_join(
select(df_color_orders_fc, day, start, end), by = join_by(day)) |>
filter(ordered >= start & ordered <= end) |>
verify(length(customerid) == 1)
}
| customerid | name | phone |
|---|---|---|
Look for products that are collectibles:
collectibles <- function(df_products) {
df_products |>
filter(str_detect(sku, "COL"))
}
Find the person who has all the collectibles:
collector <- function(df_orders, df_order_items, df_collectibles) {
df_order_items |>
semi_join(df_collectibles, by = "sku") |>
left_join(df_orders, by = "orderid") |>
group_by(customerid) |>
summarise(anz_coll = n_distinct(sku)) |>
filter(anz_coll == nrow(df_collectibles)) |>
verify(length(customerid) == 1)
}
| customerid | name | phone |
|---|---|---|
As in last year’s challenge, I had a lot of fun solving the Hanukkah of Data puzzles. I revisited my previous solutions and improved them to solve the new challenges. By using functions from the assertr package, I could spot difficulties early. Especially during the speed run at the end of the challenge, this kind of assertive programming made it easier for me to adjust my solutions to more challenging data and edge cases. I’m already looking forward to next year’s challenges :-).
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/duju211/hanukkah_of_data, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
During (2023, Dec. 31). Datannery: Hanukkah of Data Revisited (2023 / Speedrun). Retrieved from https://www.datannery.com/posts/hanukkah-of-data-revisited-2023-speedrun/
BibTeX citation
@misc{during2023hanukkah,
  author = {During, Julian},
  title = {Datannery: Hanukkah of Data Revisited (2023 / Speedrun)},
  url = {https://www.datannery.com/posts/hanukkah-of-data-revisited-2023-speedrun/},
  year = {2023}
}