Short data challenge released over 8 days of Hanukkah in 2022
Title photo from Gaelle Marcel on Unsplash
‘Hanukkah of Data’ is a data challenge in which you solve eight puzzles around a fictional data set.
We are asked to find a rug for our granduncle Noah, who owns a store. The store keeps data about customers, products and orders; with the help of this data (provided as CSV files) we work through the eight puzzles.
The data for this analysis consists of four CSV files: ‘noahs-customers.csv’, ‘noahs-orders.csv’, ‘noahs-orders_items.csv’ and ‘noahs-products.csv’. The data is read into R with readr::read_csv, specifying the correct column types in the function calls.
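A minimal import sketch could look like the following. The file paths and the column specification for the customers file are assumptions based on the columns shown further below; the other files are read with readr's type guessing for brevity.

library(tidyverse)

df_customers_raw <- read_csv(
  "noahs-customers.csv",
  col_types = cols(
    customerid = col_character(),
    name = col_character(),
    address = col_character(),
    citystatezip = col_character(),
    birthdate = col_date(),
    phone = col_character()))

df_orders_raw <- read_csv("noahs-orders.csv")
df_order_items_raw <- read_csv("noahs-orders_items.csv")
df_products_raw <- read_csv("noahs-products.csv")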
There are a total of 11,080 customers in the raw data. Perform the following preprocessing steps:
Combine all digits of the phone number into phone_chr
Split the name into first_name and last_name and derive the lowercase initials
Drop list columns and remove duplicated customer ids (leaving 11,079 distinct customers)
customers <- function(df_customers_raw) {
  df_customers_raw |>
    mutate(
      phone_chr = map_chr(str_extract_all(phone, "\\d"), str_flatten),
      name_split = str_split(name, "\\s+"),
      first_name = map_chr(name_split, ~ str_flatten(.x[.x != last(.x)])),
      last_name = map_chr(name_split, last),
      initials = str_to_lower(str_glue(
        "{str_sub(first_name, end = 1)}{str_sub(last_name, end = 1)}"))) |>
    select(where(negate(is_list))) |>
    distinct(customerid, .keep_all = TRUE)
}
# A tibble: 11,079 × 10
customerid name address citystatezip birthdate phone phone_chr
<chr> <chr> <chr> <chr> <date> <chr> <chr>
1 1001 Jack Qu… 201 E … Los Angeles… 1960-05-14 805-… 80528785…
2 1002 David P… 224C T… Staten Isla… 1978-04-04 516-… 51676816…
3 1003 Carrie … 1608 W… Tampa, FL 3… 1969-01-21 727-… 72720904…
4 1004 Steven … 178½ E… Manhattan, … 1953-08-17 607-… 60794195…
5 1005 Christi… 270 W … Bronx, NY 1… 1983-06-06 212-… 21275990…
6 1006 Amanda … 183-48… Saint Alban… 1962-07-08 914-… 91442131…
7 1007 Mark Co… 14-47 … College Poi… 1967-04-14 585-… 58555419…
8 1008 Jill St… 735A A… Manhattan, … 1959-06-11 516-… 51630704…
9 1009 Samuel … 56 Ric… Brooklyn, N… 1988-10-24 929-… 92986987…
10 1010 Brenda … 2821 B… Bronx, NY 1… 1960-09-07 914-… 91420526…
# ℹ 11,069 more rows
# ℹ 3 more variables: first_name <chr>, last_name <chr>,
# initials <chr>
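As a quick check of the derived columns, the helper can be applied to a single hand-made row (a hypothetical customer, assuming the tidyverse is attached as above):

customers(tibble(
  customerid = "9999",
  name = "Rivka Adir",
  phone = "212-555-0123"))
# phone_chr = "2125550123", first_name = "Rivka",
# last_name = "Adir", initials = "ra"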
There are a total of 1,124 products. Perform the following preprocessing steps:
Extract the category by removing all digits from the sku string
Extract additional information (for example the colour) from the bracketed text in desc into add_info
Remove the remaining brackets from add_info and desc
products <- function(df_products_raw) {
  df_products_raw |>
    mutate(
      category = str_remove(sku, "\\d+"),
      add_info = str_extract(desc, "\\(.+\\)"),
      desc = if_else(!is.na(add_info), str_remove(desc, add_info), desc),
      across(c(add_info, desc), ~ str_trim(str_remove_all(.x, "\\)|\\("))))
}
# A tibble: 1,124 × 5
sku desc wholesale_cost category add_info
<chr> <chr> <dbl> <chr> <chr>
1 DLI0002 Smoked Whitefish Sandwich 9.33 DLI <NA>
2 PET0005 Vegan Cat Food, Turkey & … 4.35 PET <NA>
3 HOM0018 Power Radio 21.8 HOM red
4 KIT0034 Azure Ladle 2.81 KIT <NA>
5 PET0041 Gluten-free Cat Food, Pum… 4.6 PET <NA>
6 PET0045 Gluten-free Cat Food, Sal… 4.32 PET <NA>
7 TOY0048 Electric Doll 10.2 TOY <NA>
8 CMP0061 Network Printer 136. CMP <NA>
9 DLI0066 Pickled Herring Sandwich 9.94 DLI <NA>
10 TOY0085 Noah's Toy Soldier 12.0 TOY <NA>
# ℹ 1,114 more rows
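The bracket handling happens in two steps: first the bracketed text is copied into add_info, then the parentheses are stripped from both columns. A one-row example (taken from the table above, assuming the tidyverse is attached) makes this visible:

products(tibble(
  sku = "HOM0018",
  desc = "Power Radio (red)",
  wholesale_cost = 21.8))
# category = "HOM", desc = "Power Radio", add_info = "red"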
There are a total of 214,207 orders with 427,258 ordered items. Join the ordered items to the orders as a list column to make them easily accessible:
orders <- function(df_orders_raw, df_order_items) {
  df_orders_raw |>
    select(-items) |>
    nest_join(df_order_items, by = "orderid", name = "items")
}
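The later puzzles access product columns such as desc, category, add_info and wholesale_cost from inside the nested items, so the order items are presumably enriched with the product table before nesting. A sketch of how the helpers could be wired together; the join is an assumption:

df_customers <- customers(df_customers_raw)
df_products <- products(df_products_raw)

# Assumption: add product information to each ordered item before nesting,
# because later steps filter the nested items by desc, category and add_info.
df_order_items <- df_order_items_raw |>
  left_join(df_products, by = "sku")

df_orders <- orders(df_orders_raw, df_order_items)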
To find the rug, we first need to contact a private investigator. Typing her last name on a phone keypad gives exactly her phone number: 2 carries “ABC”, 3 carries “DEF”, and so on.
The keypad of the phone can be represented in R like this:
phone_letter <- function() {
  tibble(letter = letters) |>
    mutate(
      nr = ((row_number() - 1) %/% 3) + 2,
      nr = as.character(if_else(nr == 10, 0, nr)))
}
# A tibble: 26 × 2
letter nr
<chr> <chr>
1 a 2
2 b 2
3 c 2
4 d 3
5 e 3
6 f 3
7 g 4
8 h 4
9 i 4
10 j 5
# ℹ 16 more rows
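As a small sanity check, a made-up last name can be translated to digits with this lookup table (assuming the tidyverse is attached):

tibble(letter = c("f", "a", "d")) |>
  left_join(phone_letter(), by = "letter") |>
  summarise(digits = str_flatten(nr))
# digits = "323"

Note that this grouping puts exactly three letters on every key, which deviates from a real keypad from the letter ‘s’ onward.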
By combining this representation with the data on hand, we can determine the investigator:
investigator <- function(df_customers, df_phone_letter) {
  # Translate each customer's last name into the digits of a phone keypad
  df_customers_pro <- df_customers |>
    transmute(
      name_pro = str_split(str_to_lower(str_remove_all(
        last_name, "\\s")), ""),
      customerid) |>
    unnest(name_pro) |>
    left_join(df_phone_letter, by = c("name_pro" = "letter")) |>
    group_by(customerid) |>
    summarise(phone_pro = str_flatten(nr))

  # Keep customers whose spelled last name matches their complete phone number
  df_customers |>
    left_join(df_customers_pro, by = "customerid") |>
    filter(str_length(str_extract(phone_chr, phone_pro)) == str_length(phone_chr))
}
customerid | name | phone |
---|---|---|
Now we are looking for a contractor, to whom the rug was given by a cleaning company. The following is known: the contractor has the initials ‘JD’, ordered coffee and a bagel at Noah’s in 2017, and has not placed any orders after 2017:
contractor <- function(df_orders, df_customers, coffee_bagel_regex) {
  df_customers_after_2017 <- df_orders |>
    filter(year(ordered) > 2017) |>
    distinct(customerid)

  df_rel_orders <- df_orders |>
    filter(year(ordered) <= 2017) |>
    anti_join(df_customers_after_2017, by = "customerid") |>
    unnest(items) |>
    filter(str_detect(desc, coffee_bagel_regex)) |>
    semi_join(filter(df_customers, initials == "jd"), by = "customerid")

  df_customers |>
    semi_join(df_rel_orders, by = "customerid")
}
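The coffee-and-bagel pattern is passed in as a parameter. A plausible call, with the exact regex being an assumption:

df_contractor <- contractor(
  df_orders, df_customers,
  coffee_bagel_regex = regex("coffee|bagel", ignore_case = TRUE))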
customerid | name | citystatezip | phone |
---|---|---|---|
We are searching for the neighbor of the contractor. The neighbor has the following characteristics:
Born in a ‘Year of the Dog’
An Aries, i.e. born between March 21 and April 20
Lives in the same neighborhood as the contractor
Dog years are determined by the following function:
det_dog_years <- function() {
  2018 - (1:9 * 12)
}
Resulting in: 2006, 1994, 1982, 1970, 1958, 1946, 1934, 1922, 1910
With this information we can find out who the neighbor is:
spider_hat <- function(df_customers, df_contractor, dog_years) {
  df_customers |>
    filter(year(birthdate) %in% dog_years) |>
    filter(
      case_when(
        month(birthdate) == 3 ~ day(birthdate) >= 21,
        month(birthdate) == 4 ~ day(birthdate) <= 20,
        TRUE ~ FALSE)) |>
    filter(str_detect(citystatezip, df_contractor$citystatezip))
}
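The result of puzzle 2 feeds directly into this step; a sketch of the call, with det_dog_years() supplying the birth years:

df_spider_hat <- spider_hat(df_customers, df_contractor, det_dog_years())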
customerid | name | citystatezip | birthdate | phone |
---|---|---|---|---|
The next person has the following habit: she is at Noah’s early every morning to buy fresh pastries. Filter the products for the ‘BKY’ category and search for the person who most often places the first pastry order of the day:
tinder_woman <- function(df_products, df_orders, df_customers,
                         product_pastries) {
  df_products_pastries <- df_products |>
    filter(category == product_pastries)

  df_orders_pastries <- df_orders |>
    unnest(items) |>
    semi_join(df_products_pastries, by = "sku")

  df_first_pastries <- df_orders_pastries |>
    mutate(day = floor_date(ordered, unit = "day")) |>
    group_by(day) |>
    filter(ordered == min(ordered)) |>
    ungroup() |>
    count(customerid, sort = TRUE)

  df_customers |>
    semi_join(slice(df_first_pastries, 1), by = "customerid")
}
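Following the text above, the bakery category is passed in as ‘BKY’ (sketch):

df_tinder_woman <- tinder_woman(
  df_products, df_orders, df_customers, product_pastries = "BKY")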
customerid | name | phone |
---|---|---|
In the 5th puzzle we need to find a lady who keeps a lot of cats as pets. We know the following about her and her pets:
The cats are senior
She lives in Queens
She has a lot of cats
cat_lady <- function(df_products, df_orders, df_customers) {
  df_products_cat <- df_products |>
    filter(str_detect(desc, regex("cat", ignore_case = TRUE))) |>
    filter(str_detect(desc, regex("senior", ignore_case = TRUE)))

  df_customers_queens <- df_customers |>
    filter(str_detect(citystatezip, regex("queens", ignore_case = TRUE)))

  df_customer_id <- df_orders |>
    semi_join(df_customers_queens, by = "customerid") |>
    unnest(items) |>
    semi_join(df_products_cat, by = "sku") |>
    group_by(customerid) |>
    summarise(sku = str_flatten(unique(sku), ", "), anz = n()) |>
    filter(anz > 2)

  df_customers |>
    semi_join(df_customer_id, by = "customerid")
}
customerid | name | phone |
---|---|---|
The cat lady has a very frugal cousin. She tells us: “Noah […] loses money whenever she comes in the store”:
frugal_cousin <- function(df_orders, df_customers) {
  df_customer_id <- df_orders |>
    mutate(
      profit = map_dbl(items, ~ sum(.x$unit_price - .x$wholesale_cost))) |>
    group_by(customerid) |>
    summarise(profit = sum(profit)) |>
    top_n(n = 1, wt = -profit)

  df_customers |>
    semi_join(df_customer_id, "customerid")
}
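To illustrate the profit term inside map_dbl: for a hypothetical order with two items, Noah’s profit is the sum of the per-item margins, which can be negative:

items <- tibble(unit_price = c(5.00, 3.50), wholesale_cost = c(5.60, 3.90))
sum(items$unit_price - items$wholesale_cost)
# (5.00 - 5.60) + (3.50 - 3.90) = -1, so Noah loses money on this order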
customerid | name | phone |
---|---|---|
The frugal cousin tells us how she met her ex-boyfriend: they were buying the same product at the same time, but in different colors. So we look at all the orders from the frugal cousin and search for another customer who bought the same product in a different color within a few minutes of her:
ex_boyfriend <- function(df_frugal_cousin, df_orders, df_customers) {
  # All items the frugal cousin bought that carry a colour in add_info
  df_orders_color <- df_orders |>
    semi_join(df_frugal_cousin, by = "customerid") |>
    unnest(items) |>
    mutate(day = floor_date(ordered, "day")) |>
    filter(!is.na(add_info))

  # Orders of the same product on the same day, in a different colour,
  # placed within ten minutes of the cousin's order
  df_order_rel <- df_orders |>
    mutate(day = floor_date(ordered, "day")) |>
    semi_join(df_orders_color, by = "day") |>
    unnest(items) |>
    inner_join(
      df_orders_color, by = c("desc", "day"),
      suffix = c("_male", "_female")) |>
    filter(
      add_info_male != add_info_female,
      ordered_male >= ordered_female - dminutes(10)
        & ordered_male <= ordered_female + dminutes(10))

  df_customers |>
    semi_join(df_order_rel, by = c("customerid" = "customerid_male"))
}
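Puzzle 7 takes the result of puzzle 6 as input, so the two steps can be chained (sketch):

df_frugal_cousin <- frugal_cousin(df_orders, df_customers)
df_ex_boyfriend <- ex_boyfriend(df_frugal_cousin, df_orders, df_customers)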
customerid | name | phone |
---|---|---|
Finally, we search for a collector. The collector is in possession of a complete set of Noah’s collectibles. By looking at the data, the relevant category is ‘COL’:
collector <- function(df_products, df_orders, df_customers, product_collect) {
  df_products_collect <- df_products |>
    filter(category == product_collect)

  df_orders_unnested <- df_orders |>
    unnest(items)

  df_customer_id <- df_orders_unnested |>
    semi_join(df_products_collect, by = c("sku")) |>
    group_by(customerid) |>
    summarise(anz_col = n_distinct(sku)) |>
    arrange(desc(anz_col)) |>
    slice(1)

  df_customers |>
    semi_join(df_customer_id, by = "customerid")
}
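With the collectibles category ‘COL’ from the text above, a plausible call is:

df_collector <- collector(
  df_products, df_orders, df_customers, product_collect = "COL")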
customerid | name | phone |
---|---|---|
Participating in “Hanukkah of Data” was both fun and rewarding. Solving the eight puzzles sharpened my problem-solving and coding skills. I look forward to next year’s edition and am eager to see what it holds.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/duju211/hanukkah_of_data, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
During (2023, Jan. 15). Datannery: Hanukkah of Data. Retrieved from https://www.datannery.com/posts/hanukkah-of-data/
BibTeX citation
@misc{during2023hanukkah,
  author = {During, Julian},
  title = {Datannery: Hanukkah of Data},
  url = {https://www.datannery.com/posts/hanukkah-of-data/},
  year = {2023}
}