Short data challenge released over 8 days of Hanukkah in 2022
Title photo from Gaelle Marcel on Unsplash
‘Hanukkah of Data’ is a data challenge in which you solve eight puzzles around a fictional data set.
We are asked to find a rug for our granduncle Noah, who owns a store. The store keeps data about customers, products and orders; with the help of this data (provided as CSV files) we work through the eight puzzles.
The data for this analysis consists of four CSV files: ‘noahs-customers.csv’, ‘noahs-orders.csv’, ‘noahs-orders_items.csv’ and ‘noahs-products.csv’. The data is read into R with readr::read_csv, specifying the correct column types in the function calls.
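A minimal import sketch could look like the following. The file paths and the column specification for the customers file are assumptions based on the columns shown further below; the other files are read with readr's type guessing for brevity.

library(tidyverse)

df_customers_raw <- read_csv(
  "noahs-customers.csv",
  col_types = cols(
    customerid = col_character(),
    name = col_character(),
    address = col_character(),
    citystatezip = col_character(),
    birthdate = col_date(),
    phone = col_character()))

df_orders_raw <- read_csv("noahs-orders.csv")
df_order_items_raw <- read_csv("noahs-orders_items.csv")
df_products_raw <- read_csv("noahs-products.csv")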
There are a total of 11,080 customers in the raw data. Perform the following preprocessing steps:
Combine all digits of the phone number into phone_chr
Split the name into first_name and last_name and derive the lowercase initials
Drop list columns and remove duplicated customer ids (leaving 11,079 distinct customers)
customers <- function(df_customers_raw) {
  df_customers_raw |>
    mutate(
      phone_chr = map_chr(str_extract_all(phone, "\\d"), str_flatten),
      name_split = str_split(name, "\\s+"),
      first_name = map_chr(name_split, ~ str_flatten(.x[.x != last(.x)])),
      last_name = map_chr(name_split, last),
      initials = str_to_lower(str_glue(
        "{str_sub(first_name, end = 1)}{str_sub(last_name, end = 1)}"))) |>
    select(where(negate(is_list))) |>
    distinct(customerid, .keep_all = TRUE)
}
# A tibble: 11,079 × 10
customerid name address citystatezip birthdate phone phone_chr
<chr> <chr> <chr> <chr> <date> <chr> <chr>
1 1001 Jack Qu… 201 E … Los Angeles… 1960-05-14 805-… 80528785…
2 1002 David P… 224C T… Staten Isla… 1978-04-04 516-… 51676816…
3 1003 Carrie … 1608 W… Tampa, FL 3… 1969-01-21 727-… 72720904…
4 1004 Steven … 178½ E… Manhattan, … 1953-08-17 607-… 60794195…
5 1005 Christi… 270 W … Bronx, NY 1… 1983-06-06 212-… 21275990…
6 1006 Amanda … 183-48… Saint Alban… 1962-07-08 914-… 91442131…
7 1007 Mark Co… 14-47 … College Poi… 1967-04-14 585-… 58555419…
8 1008 Jill St… 735A A… Manhattan, … 1959-06-11 516-… 51630704…
9 1009 Samuel … 56 Ric… Brooklyn, N… 1988-10-24 929-… 92986987…
10 1010 Brenda … 2821 B… Bronx, NY 1… 1960-09-07 914-… 91420526…
# ℹ 11,069 more rows
# ℹ 3 more variables: first_name <chr>, last_name <chr>,
# initials <chr>
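As a quick check of the derived columns, the helper can be applied to a single hand-made row (a hypothetical customer, assuming the tidyverse is attached as above):

customers(tibble(
  customerid = "9999",
  name = "Rivka Adir",
  phone = "212-555-0123"))
# phone_chr = "2125550123", first_name = "Rivka",
# last_name = "Adir", initials = "ra"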
There are a total of 1,124 products. Perform the following preprocessing steps:
Extract the category by removing all digits from the sku string
Extract additional information (for example the colour) from the bracketed text in desc into add_info
Remove the remaining brackets from add_info and desc
products <- function(df_products_raw) {
  df_products_raw |>
    mutate(
      category = str_remove(sku, "\\d+"),
      add_info = str_extract(desc, "\\(.+\\)"),
      desc = if_else(!is.na(add_info), str_remove(desc, add_info), desc),
      across(c(add_info, desc), ~ str_trim(str_remove_all(.x, "\\)|\\("))))
}
# A tibble: 1,124 × 5
sku desc wholesale_cost category add_info
<chr> <chr> <dbl> <chr> <chr>
1 DLI0002 Smoked Whitefish Sandwich 9.33 DLI <NA>
2 PET0005 Vegan Cat Food, Turkey & … 4.35 PET <NA>
3 HOM0018 Power Radio 21.8 HOM red
4 KIT0034 Azure Ladle 2.81 KIT <NA>
5 PET0041 Gluten-free Cat Food, Pum… 4.6 PET <NA>
6 PET0045 Gluten-free Cat Food, Sal… 4.32 PET <NA>
7 TOY0048 Electric Doll 10.2 TOY <NA>
8 CMP0061 Network Printer 136. CMP <NA>
9 DLI0066 Pickled Herring Sandwich 9.94 DLI <NA>
10 TOY0085 Noah's Toy Soldier 12.0 TOY <NA>
# ℹ 1,114 more rows
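The bracket handling happens in two steps: first the bracketed text is copied into add_info, then the parentheses are stripped from both columns. A one-row example (taken from the table above, assuming the tidyverse is attached) makes this visible:

products(tibble(
  sku = "HOM0018",
  desc = "Power Radio (red)",
  wholesale_cost = 21.8))
# category = "HOM", desc = "Power Radio", add_info = "red"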
There are a total of 214,207 orders with 427,258 ordered items. Join the ordered items to the orders as a list column to make them easily accessible:
orders <- function(df_orders_raw, df_order_items) {
  df_orders_raw |>
    select(-items) |>
    nest_join(df_order_items, by = "orderid", name = "items")
}
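The later puzzles access product columns such as desc, category, add_info and wholesale_cost from inside the nested items, so the order items are presumably enriched with the product table before nesting. A sketch of how the helpers could be wired together; the join is an assumption:

df_customers <- customers(df_customers_raw)
df_products <- products(df_products_raw)

# Assumption: add product information to each ordered item before nesting,
# because later steps filter the nested items by desc, category and add_info.
df_order_items <- df_order_items_raw |>
  left_join(df_products, by = "sku")

df_orders <- orders(df_orders_raw, df_order_items)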
To find the rug, we first need to contact a private investigator. Typing her last name on a phone keypad gives exactly her phone number: 2 carries “ABC”, 3 carries “DEF”, and so on.
The keypad of the phone can be represented in R like this:
phone_letter <- function() {
  tibble(letter = letters) |>
    mutate(
      nr = ((row_number() - 1) %/% 3) + 2,
      nr = as.character(if_else(nr == 10, 0, nr)))
}
# A tibble: 26 × 2
letter nr
<chr> <chr>
1 a 2
2 b 2
3 c 2
4 d 3
5 e 3
6 f 3
7 g 4
8 h 4
9 i 4
10 j 5
# ℹ 16 more rows
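As a small sanity check, a made-up last name can be translated to digits with this lookup table (assuming the tidyverse is attached):

tibble(letter = c("f", "a", "d")) |>
  left_join(phone_letter(), by = "letter") |>
  summarise(digits = str_flatten(nr))
# digits = "323"

Note that this grouping puts exactly three letters on every key, which deviates from a real keypad from the letter ‘s’ onward.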
By combining this representation with the data on hand, we can determine the investigator:
investigator <- function(df_customers, df_phone_letter) {
  # Translate each customer's last name into the digits of a phone keypad
  df_customers_pro <- df_customers |>
    transmute(
      name_pro = str_split(str_to_lower(str_remove_all(
        last_name, "\\s")), ""),
      customerid) |>
    unnest(name_pro) |>
    left_join(df_phone_letter, by = c("name_pro" = "letter")) |>
    group_by(customerid) |>
    summarise(phone_pro = str_flatten(nr))

  # Keep customers whose spelled last name matches their complete phone number
  df_customers |>
    left_join(df_customers_pro, by = "customerid") |>
    filter(str_length(str_extract(phone_chr, phone_pro)) == str_length(phone_chr))
}
customerid | name | phone |
---|---|---|
Now we are looking for a contractor, to whom the rug was given by a cleaning company. The following is known: the contractor has the initials ‘JD’, ordered coffee and a bagel at Noah’s in 2017, and has not placed any orders after 2017:
contractor <- function(df_orders, df_customers, coffee_bagel_regex) {
  df_customers_after_2017 <- df_orders |>
    filter(year(ordered) > 2017) |>
    distinct(customerid)

  df_rel_orders <- df_orders |>
    filter(year(ordered) <= 2017) |>
    anti_join(df_customers_after_2017, by = "customerid") |>
    unnest(items) |>
    filter(str_detect(desc, coffee_bagel_regex)) |>
    semi_join(filter(df_customers, initials == "jd"), by = "customerid")

  df_customers |>
    semi_join(df_rel_orders, by = "customerid")
}
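The coffee-and-bagel pattern is passed in as a parameter. A plausible call, with the exact regex being an assumption:

df_contractor <- contractor(
  df_orders, df_customers,
  coffee_bagel_regex = regex("coffee|bagel", ignore_case = TRUE))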
customerid | name | citystatezip | phone |
---|---|---|---|
We are searching for the neighbor of the contractor. The neighbor has the following characteristics:
Born in a ‘Year of the Dog’
An Aries, i.e. born between March 21 and April 20
Lives in the same neighborhood as the contractor
Dog years are determined by the following function:
det_dog_years <- function() {
  2018 - (1:9 * 12)
}
Resulting in: 2006, 1994, 1982, 1970, 1958, 1946, 1934, 1922, 1910
With this information we can find out who the neighbor is:
spider_hat <- function(df_customers, df_contractor, dog_years) {
  df_customers |>
    filter(year(birthdate) %in% dog_years) |>
    filter(
      case_when(
        month(birthdate) == 3 ~ day(birthdate) >= 21,
        month(birthdate) == 4 ~ day(birthdate) <= 20,
        TRUE ~ FALSE)) |>
    filter(str_detect(citystatezip, df_contractor$citystatezip))
}
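The result of puzzle 2 feeds directly into this step; a sketch of the call, with det_dog_years() supplying the birth years:

df_spider_hat <- spider_hat(df_customers, df_contractor, det_dog_years())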
customerid | name | citystatezip | birthdate | phone |
---|---|---|---|---|
The next person has the following habit: she is at Noah’s early every morning to buy fresh pastries. Filter the products for the ‘BKY’ category and search for the person who most often places the first pastry order of the day:
tinder_woman <- function(df_products, df_orders, df_customers,
                         product_pastries) {
  df_products_pastries <- df_products |>
    filter(category == product_pastries)

  df_orders_pastries <- df_orders |>
    unnest(items) |>
    semi_join(df_products_pastries, by = "sku")

  df_first_pastries <- df_orders_pastries |>
    mutate(day = floor_date(ordered, unit = "day")) |>
    group_by(day) |>
    filter(ordered == min(ordered)) |>
    ungroup() |>
    count(customerid, sort = TRUE)

  df_customers |>
    semi_join(slice(df_first_pastries, 1), by = "customerid")
}
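Following the text above, the bakery category is passed in as ‘BKY’ (sketch):

df_tinder_woman <- tinder_woman(
  df_products, df_orders, df_customers, product_pastries = "BKY")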
customerid | name | phone |
---|---|---|
In the 5th puzzle we need to find a lady who keeps a lot of cats as pets. We know the following about her and her pets:
The cats are senior
She lives in Queens
She has a lot of cats
cat_lady <- function(df_products, df_orders, df_customers) {
  df_products_cat <- df_products |>
    filter(str_detect(desc, regex("cat", ignore_case = TRUE))) |>
    filter(str_detect(desc, regex("senior", ignore_case = TRUE)))

  df_customers_queens <- df_customers |>
    filter(str_detect(citystatezip, regex("queens", ignore_case = TRUE)))

  df_customer_id <- df_orders |>
    semi_join(df_customers_queens, by = "customerid") |>
    unnest(items) |>
    semi_join(df_products_cat, by = "sku") |>
    group_by(customerid) |>
    summarise(sku = str_flatten(unique(sku), ", "), anz = n()) |>
    filter(anz > 2)

  df_customers |>
    semi_join(df_customer_id, by = "customerid")
}
customerid | name | phone |
---|---|---|
The cat lady has a very frugal cousin. She tells us: “Noah […] loses money whenever she comes in the store”:
frugal_cousin <- function(df_orders, df_customers) {
  df_customer_id <- df_orders |>
    mutate(
      profit = map_dbl(items, ~ sum(.x$unit_price - .x$wholesale_cost))) |>
    group_by(customerid) |>
    summarise(profit = sum(profit)) |>
    top_n(n = 1, wt = -profit)

  df_customers |>
    semi_join(df_customer_id, "customerid")
}
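To illustrate the profit term inside map_dbl: for a hypothetical order with two items, Noah’s profit is the sum of the per-item margins, which can be negative:

items <- tibble(unit_price = c(5.00, 3.50), wholesale_cost = c(5.60, 3.90))
sum(items$unit_price - items$wholesale_cost)
# (5.00 - 5.60) + (3.50 - 3.90) = -1, so Noah loses money on this order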
customerid | name | phone |
---|---|---|
The frugal cousin tells us how she met her ex-boyfriend: they were buying the same product at the same time, but in different colors. So we look at all the orders from the frugal cousin and search for another customer who bought the same product in a different color within a few minutes of her:
ex_boyfriend <- function(df_frugal_cousin, df_orders, df_customers) {
  # All items the frugal cousin bought that carry a colour in add_info
  df_orders_color <- df_orders |>
    semi_join(df_frugal_cousin, by = "customerid") |>
    unnest(items) |>
    mutate(day = floor_date(ordered, "day")) |>
    filter(!is.na(add_info))

  # Orders of the same product on the same day, in a different colour,
  # placed within ten minutes of the cousin's order
  df_order_rel <- df_orders |>
    mutate(day = floor_date(ordered, "day")) |>
    semi_join(df_orders_color, by = "day") |>
    unnest(items) |>
    inner_join(
      df_orders_color, by = c("desc", "day"),
      suffix = c("_male", "_female")) |>
    filter(
      add_info_male != add_info_female,
      ordered_male >= ordered_female - dminutes(10)
        & ordered_male <= ordered_female + dminutes(10))

  df_customers |>
    semi_join(df_order_rel, by = c("customerid" = "customerid_male"))
}
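Puzzle 7 takes the result of puzzle 6 as input, so the two steps can be chained (sketch):

df_frugal_cousin <- frugal_cousin(df_orders, df_customers)
df_ex_boyfriend <- ex_boyfriend(df_frugal_cousin, df_orders, df_customers)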
customerid | name | phone |
---|---|---|
Finally, we search for a collector. The collector is in possession of a complete set of Noah’s collectibles. By looking at the data, the relevant category is ‘COL’:
collector <- function(df_products, df_orders, df_customers, product_collect) {
  df_products_collect <- df_products |>
    filter(category == product_collect)

  df_orders_unnested <- df_orders |>
    unnest(items)

  df_customer_id <- df_orders_unnested |>
    semi_join(df_products_collect, by = c("sku")) |>
    group_by(customerid) |>
    summarise(anz_col = n_distinct(sku)) |>
    arrange(desc(anz_col)) |>
    slice(1)

  df_customers |>
    semi_join(df_customer_id, by = "customerid")
}
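With the collectibles category ‘COL’ from the text above, a plausible call is:

df_collector <- collector(
  df_products, df_orders, df_customers, product_collect = "COL")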
customerid | name | phone |
---|---|---|
Participating in “Hanukkah of Data” was both fun and rewarding. Solving the eight puzzles sharpened my problem-solving and coding skills. I look forward to next year’s edition and am eager to see what it holds.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/duju211/hanukkah_of_data, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
During (2023, Jan. 15). Datannery: Hanukkah of Data. Retrieved from https://www.datannery.com/posts/hanukkah-of-data/
BibTeX citation
@misc{during2023hanukkah,
  author = {During, Julian},
  title = {Datannery: Hanukkah of Data},
  url = {https://www.datannery.com/posts/hanukkah-of-data/},
  year = {2023}
}