Hanukkah of data 2022

R
Puzzles
Author

Philippe Massicotte

Published

December 28, 2022

Hanukkah of data 2022

If you are looking for some data science challenges, don’t look anymore, Hanukkah of Data is there for you. You can think of it as similar to Advent of Code but tailored toward data-oriented puzzles with only eight challenges. In this blog post, I will try to solve all the puzzles using R.

Getting the data

The first step of the challenge is to get the data. The trick is that the data is contained in a password-protected zip file. The very first challenge is to find the password of the zip file. You have to find out the Hebrew year that corresponds to 2017-01-01. To do so, I simply used this website and found out that the password was 5777. With this information, let us download the compressed file and unzip it into a temporary folder.

url <- "https://hanukkah.bluebird.sh/5783/noahs-csv.zip"

td <- tempdir()
tf <- tempfile(tmpdir = td, fileext = ".zip")

curl::curl_download(url, destfile = tf)

system(command = paste0("unzip -o -P 5777 ", tf, " -d ", td))

If we look at all the unzipped files, we can see it consists in four CSV files that we can import in R.

files <- fs::dir_ls(td, glob = "*.csv")

files
#> /tmp/Rtmp8clYlA/noahs-customers.csv    /tmp/Rtmp8clYlA/noahs-orders.csv       
#> /tmp/Rtmp8clYlA/noahs-orders_items.csv /tmp/Rtmp8clYlA/noahs-products.csv

customers <- read_csv(files[str_detect(files, "noahs-customers.csv")])
orders_items <- read_csv(files[str_detect(files, "noahs-orders_items.csv")])
orders <- read_csv(files[str_detect(files, "noahs-orders.csv")])
products <- read_csv(files[str_detect(files, "noahs-products.csv")])

After exploring the data, I have found out that there is one duplicated customer in noahs-customers.csv. Specifically, customerid should uniquely identify each customer. Let’s remove this duplicated entry (more about that later).

customers |>
  janitor::get_dupes()

customers <- customers |>
  distinct(customerid, .keep_all = TRUE)

Explore the data

To explore the relationships across the different tables, I am using the {dm} R package. This was the first time using this package and it looks interesting if you work with relational databases. Let’s create a dm object using the dm() function with the four data frame we have.

# install.packages("dm")
# install.packages("DiagrammeR")
library(dm)

noahs_dm <- dm(customers, orders_items, orders, products)

noahs_dm
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `customers`, `orders_items`, `orders`, `products`
#> Columns: 19
#> Primary keys: 0
#> Foreign keys: 0

Then, we can specify the primary key of each table using the dm_add_pk() function.

noahs_dm <- noahs_dm |>
  dm_add_pk(customers, customerid) |>
  dm_add_pk(orders, orderid) |>
  dm_add_pk(products, sku)

noahs_dm
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `customers`, `orders_items`, `orders`, `products`
#> Columns: 19
#> Primary keys: 3
#> Foreign keys: 0

One condition to set up a primary key is that it must be unique. The enum_pk_candidates() function can be used to scan all values in a table and will determine if any variable is suitable to be used as a primary key. As we can see below, customerid in noahs-customers.csv is duplicated twice in the data. This is why I used the distinct() function earlier to remove the duplicated entry.

read_csv(files[str_detect(files, "noahs-customers.csv")]) |>
  enum_pk_candidates() |>
  gt::gt()
columns candidate why
customerid FALSE has duplicate values: 4308 (2)
name FALSE has duplicate values: Jennifer Smith (8), James Williams (7), David Johnson (6), James Johnson (6), Lisa Smith (6), …
address FALSE has duplicate values: 1 East River Pl (4), 12B Cooper Pl (3), 205 W 88th St (3), 1023 38th St (2), 1167A Lexington Ave (2), …
citystatezip FALSE has duplicate values: Brooklyn, NY 11234 (121), Corona, NY 11368 (121), Bronx, NY 10467 (111), Brooklyn, NY 11236 (99), Brooklyn, NY 11211 (98), …
birthdate FALSE has duplicate values: 1953-01-12 (5), 1970-04-15 (5), 1946-06-28 (4), 1949-11-16 (4), 1952-03-29 (4), …
phone FALSE has duplicate values: 838-616-4951 (2), 929-906-5980 (2)

We can now use the dm_add_fk() function to specify the foreign keys in each table. Basically, dm_add_fk(orders_items, sku, products) means that the sky variable/key is used to link the orders_items and products tables together.

noahs_dm <- noahs_dm |>
  dm_add_fk(orders_items, sku, products) |>
  dm_add_fk(orders_items, orderid, orders) |>
  dm_add_fk(orders, customerid, customers)

noahs_dm
#> ── Metadata ────────────────────────────────────────────────────────────────────
#> Tables: `customers`, `orders_items`, `orders`, `products`
#> Columns: 19
#> Primary keys: 3
#> Foreign keys: 3

Finally, we can visualize all the relationships with dm_draw().

noahs_dm |>
  dm_draw()

I am not going to use more of dm for the rest of this post, but I am sure there is a lot more to cover.

Puzzle 1: beehive

For this first puzzle, we have to find the phone number that matches the customer’s last name. The first step consists in extracting customers’ last names and filtering out to keep those with only 10 characters (the length of the phone numbers).

df <- customers |>
  select(name, phone) |>
  mutate(last_name = str_match(name, "\\s{1}(\\S+)$")[, 2]) |>
  mutate(last_name_n = nchar(last_name)) |>
  filter(last_name_n == 10)

df

Then, I am creating a lookup table that matches letters to the numbers of a phone. With that, I convert each letter of the last name into a string and filter to keep the phone number that is equal to the last name.

phone_lut <- c(
  "a" = 2,
  "b" = 2,
  "c" = 2,
  "d" = 3,
  "e" = 3,
  "f" = 3,
  "g" = 4,
  "h" = 4,
  "i" = 4,
  "j" = 5,
  "k" = 5,
  "l" = 5,
  "m" = 6,
  "n" = 6,
  "o" = 6,
  "p" = 7,
  "q" = 7,
  "r" = 7,
  "s" = 7,
  "t" = 8,
  "u" = 8,
  "v" = 8,
  "w" = 9,
  "x" = 9,
  "y" = 9,
  "z" = 9
)

df |>
  mutate(phone = str_remove_all(phone, "-")) |>
  mutate(
    number =
      map_chr(str_split(tolower(last_name), ""), 
      \(x) paste0(phone_lut[x], collapse = ""))
  ) |>
  filter(phone == number)

Puzzle 2: snail

The first step is to extract customer initials and only keep “JD”.

customers_initials <- customers  |>
  mutate(initials = gsub("[a-z]", "", name), .after = name) |>
  mutate(initials = str_remove_all(initials, " ")) |>
  filter(initials == "JD")

customers_initials

Now we can search for the customer who made Rug Cleaner orders in 2017.

customers_initials |>
  left_join(orders) |>
  filter(lubridate::year(ordered) == 2017) |>
  left_join(orders_items) |>
  left_join(products) |>
  filter(str_detect(desc, regex("Rug Cleaner", ignore_case = TRUE))) |>
  add_count(customerid) |>
  filter(n > 1) |>
  distinct(name, phone)

Puzzle 3: spider

We search for someone Aries born in the year of the Dog. After Googling around, I managed to get the corresponding years. Hence, the first step was to find out customers born in those years. The catch here is that we can not only look at the years for finding the possible candidates because Aries time is only between March 21st and April 19th. We also know that this person was also living in the same neighbourhood as Jeremy (the previous owner of the rug). It was just a matter of filtering out customers living in the same area.

jeremy <- customers |>
  filter(name == "Jeremy Davis")

customers |>
  filter(
    lubridate::year(birthdate) %in% 
    c(1922, 1934, 1946, 1958, 1970, 1982, 1994, 2006, 2018, 2030),
    between(format(birthdate, "%m%d"), "0321", "0419")
  ) |>
  filter(citystatezip == jeremy$citystatezip) |>
  distinct(name, phone)

Puzzle 4: owl

For this puzzle, we have to find the woman who buys from bakeries daily before 5 am.

products |>
  filter(str_starts(sku, "BKY")) |>
  left_join(orders_items) |>
  left_join(orders) |>
  filter(lubridate::hour(ordered) < 5) |>
  left_join(customers) |>
  count(customerid, name, phone, sort = TRUE) |>
  slice(1)

Puzzle 5: koala

We are told that the next owner lives in the Queen’s village and also regularly buys food for his/her older cats.

customers |>
  filter(str_detect(citystatezip, regex("queens village", ignore_case = TRUE))) |>
  left_join(orders) |>
  left_join(orders_items) |>
  left_join(products) |>
  filter(str_detect(desc, regex("senior cat", ignore_case = TRUE))) |>
  group_by(name, phone) |>
  summarise(n = n_distinct(orderid), .groups = "drop") |>
  slice_max(n, n = 1)

Puzzle 6: squirrel

Here, we have to find the person who only bought items that were on discount. First, calculate the total paid price and the total wholesale cost.

purchases <- orders_items |>
  left_join(orders) |>
  left_join(products) |>
  group_by(orderid) |>
  summarise(
    across(c(unit_price, wholesale_cost), 
    .fns = list("tot" = \(x) sum(qty * x)))
  ) |>
  left_join(orders) |>
  left_join(customers)

Now we can find customers who only buy when everything is on sale (i.e, when the total paid is lower than the total wholesale cost).

purchases |>
  group_by(customerid) |>
  filter(all(unit_price_tot < wholesale_cost_tot)) |>
  add_count(customerid) |>
  filter(n > 1) |>
  distinct(customerid, name, phone)

Puzzle 7: toucan

This puzzle was for me the most challenging. I had to think for a while before starting to implement the solution. The key ideas to solve this problem are:

  1. Extract the colour from the item description.
  2. Find the person who bought the same item as Emily a moment after her, but with a different colour.
colored_products <- products |>
  extract(desc,
    into = c("desc", "color"), "([^()\\n]+)(?:(?:\\()(\\w*)(?:\\)))?"
  ) |>
  mutate(desc = str_squish(desc)) |>
  mutate(color = na_if(color, "")) |>
  drop_na(color)
  
colored_products

Based on this list, we have to find the date on which Emily made purchases and for which the same item was bought in at least two colours.

colored_products |>
  left_join(orders_items) |>
  left_join(orders) |>
  left_join(customers) |>
  group_by(date = as.Date(ordered), desc) |>
  filter(length(unique(color)) > 1 & any(customerid == 8342)) |>
  ungroup() |>
  distinct(desc, color, name, color, phone, ordered) |>
  arrange(ordered) 

Looking at the time of purchase, we can find that Jonathan Adams also bought an Electric Machine with a different colour just a few seconds after Emily.

Puzzle 8: snake

For this puzzle, we have to find the customer who is collecting the most different items sold by Noah’s.

products |>
  filter(str_detect(desc, "Noah's")) |>
  left_join(orders_items) |>
  left_join(orders) |>
  group_by(customerid) |>
  summarise(n = n_distinct(desc)) |>
  slice_max(n, n = 1) |>
  left_join(customers) |>
  select(name, phone)

Conclusions

These quick data challenges were quite fun to solve. It looks like that more puzzles will be added in the future, so keep an eye open if you are interested.

Session info
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.2 (2022-10-31)
#>  os       Ubuntu 22.10
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_CA:en
#>  collate  en_CA.UTF-8
#>  ctype    en_CA.UTF-8
#>  tz       America/Toronto
#>  date     2022-12-29
#>  pandoc   2.17.1.1 @ /usr/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
#>  ! package       * version date (UTC) lib source
#>  P assertthat      0.2.1   2019-03-21 [?] CRAN (R 4.2.2)
#>  P backports       1.4.1   2021-12-13 [?] CRAN (R 4.2.2)
#>  P bit             4.0.4   2020-08-04 [?] RSPM (R 4.2.2)
#>  P bit64           4.0.5   2020-08-30 [?] CRAN (R 4.2.2)
#>  P broom           1.0.0   2022-07-01 [?] CRAN (R 4.2.1)
#>  P cachem          1.0.6   2021-08-19 [?] CRAN (R 4.2.2)
#>  P callr           3.7.1   2022-07-13 [?] RSPM (R 4.2.0)
#>  P cellranger      1.1.0   2016-07-27 [?] CRAN (R 4.2.2)
#>  P cli             3.3.0   2022-04-25 [?] CRAN (R 4.2.1)
#>  P colorspace      2.0-3   2022-02-21 [?] CRAN (R 4.2.2)
#>  P crayon          1.5.1   2022-03-26 [?] CRAN (R 4.2.1)
#>  P curl            4.3.2   2021-06-23 [?] RSPM (R 4.2.2)
#>  P DBI             1.1.3   2022-06-18 [?] RSPM (R 4.2.0)
#>  P dbplyr          2.2.1   2022-06-27 [?] CRAN (R 4.2.2)
#>  P devtools        2.4.4   2022-07-20 [?] RSPM (R 4.2.0)
#>  P DiagrammeR      1.0.9   2022-03-05 [?] RSPM (R 4.2.2)
#>  P digest          0.6.29  2021-12-01 [?] CRAN (R 4.2.1)
#>  P dm            * 1.0.3   2022-10-12 [?] RSPM (R 4.2.2)
#>  P dplyr         * 1.0.9   2022-04-28 [?] CRAN (R 4.2.1)
#>  P ellipsis        0.3.2   2021-04-29 [?] CRAN (R 4.2.2)
#>  P evaluate        0.16    2022-08-09 [?] CRAN (R 4.2.1)
#>  P extrafont       0.18    2022-04-12 [?] CRAN (R 4.2.2)
#>  P extrafontdb     1.0     2012-06-11 [?] CRAN (R 4.2.2)
#>  P fansi           1.0.3   2022-03-24 [?] CRAN (R 4.2.2)
#>  P fastmap         1.1.0   2021-01-25 [?] CRAN (R 4.2.2)
#>  P forcats       * 0.5.1   2021-01-27 [?] RSPM (R 4.2.0)
#>  P fs              1.5.2   2021-12-08 [?] CRAN (R 4.2.2)
#>  P gargle          1.2.0   2021-07-02 [?] RSPM (R 4.2.2)
#>  P generics        0.1.3   2022-07-05 [?] CRAN (R 4.2.2)
#>  P ggplot2       * 3.3.6   2022-05-03 [?] CRAN (R 4.2.1)
#>  P ggpmthemes    * 0.0.2   2022-08-13 [?] Github (pmassicotte/ggpmthemes@993d61e)
#>  P glue            1.6.2   2022-02-24 [?] CRAN (R 4.2.2)
#>  P googledrive     2.0.0   2021-07-08 [?] CRAN (R 4.2.2)
#>  P googlesheets4   1.0.1   2022-08-13 [?] CRAN (R 4.2.2)
#>  P gt              0.7.0   2022-08-25 [?] RSPM (R 4.2.0)
#>  P gtable          0.3.0   2019-03-25 [?] CRAN (R 4.2.1)
#>  P haven           2.5.0   2022-04-15 [?] RSPM (R 4.2.0)
#>  P hms             1.1.1   2021-09-26 [?] RSPM (R 4.2.0)
#>  P htmltools       0.5.3   2022-07-18 [?] CRAN (R 4.2.2)
#>  P htmlwidgets     1.5.4   2021-09-08 [?] CRAN (R 4.2.1)
#>  P httpuv          1.6.5   2022-01-05 [?] RSPM (R 4.2.0)
#>  P httr            1.4.4   2022-08-17 [?] CRAN (R 4.2.2)
#>  P igraph          1.3.5   2022-09-22 [?] RSPM (R 4.2.2)
#>  P janitor         2.1.0   2021-01-05 [?] CRAN (R 4.2.1)
#>  P jsonlite        1.8.2   2022-10-02 [?] RSPM (R 4.2.0)
#>  P knitr           1.39    2022-04-26 [?] RSPM (R 4.2.0)
#>  P later           1.3.0   2021-08-18 [?] RSPM (R 4.2.2)
#>  P lifecycle       1.0.1   2021-09-24 [?] CRAN (R 4.2.1)
#>  P lubridate       1.8.0   2021-10-07 [?] RSPM (R 4.2.2)
#>  P magrittr        2.0.3   2022-03-30 [?] RSPM (R 4.2.0)
#>  P memoise         2.0.1   2021-11-26 [?] CRAN (R 4.2.2)
#>  P mime            0.12    2021-09-28 [?] CRAN (R 4.2.2)
#>  P miniUI          0.1.1.1 2018-05-18 [?] RSPM (R 4.2.0)
#>  P modelr          0.1.8   2020-05-19 [?] RSPM (R 4.2.0)
#>  P munsell         0.5.0   2018-06-12 [?] CRAN (R 4.2.2)
#>  P pillar          1.8.0   2022-07-18 [?] RSPM (R 4.2.0)
#>  P pkgbuild        1.3.1   2021-12-20 [?] RSPM (R 4.2.0)
#>  P pkgconfig       2.0.3   2019-09-22 [?] CRAN (R 4.2.2)
#>  P pkgload         1.3.0   2022-06-27 [?] RSPM (R 4.2.2)
#>  P prettyunits     1.1.1   2020-01-24 [?] CRAN (R 4.2.2)
#>  P processx        3.7.0   2022-07-07 [?] CRAN (R 4.2.1)
#>  P profvis         0.3.7   2020-11-02 [?] RSPM (R 4.2.0)
#>  P promises        1.2.0.1 2021-02-11 [?] CRAN (R 4.2.1)
#>  P ps              1.7.1   2022-06-18 [?] CRAN (R 4.2.1)
#>  P purrr         * 0.3.4   2020-04-17 [?] CRAN (R 4.2.1)
#>  P R6              2.5.1   2021-08-19 [?] CRAN (R 4.2.2)
#>  P RColorBrewer    1.1-3   2022-04-03 [?] CRAN (R 4.2.2)
#>  P Rcpp            1.0.9   2022-07-08 [?] RSPM (R 4.2.0)
#>  P readr         * 2.1.2   2022-01-30 [?] RSPM (R 4.2.2)
#>  P readxl          1.4.1   2022-08-17 [?] CRAN (R 4.2.2)
#>  P remotes         2.4.2   2021-11-30 [?] RSPM (R 4.2.2)
#>    renv            0.15.5  2022-05-26 [1] CRAN (R 4.2.2)
#>  P reprex          2.0.2   2022-08-17 [?] CRAN (R 4.2.2)
#>  P rlang           1.0.4   2022-07-12 [?] CRAN (R 4.2.1)
#>  P rmarkdown       2.15    2022-08-16 [?] RSPM (R 4.2.0)
#>  P rstudioapi      0.13    2020-11-12 [?] RSPM (R 4.2.0)
#>  P Rttf2pt1        1.3.8   2020-01-10 [?] RSPM (R 4.2.0)
#>  P rvest           1.0.2   2021-10-16 [?] RSPM (R 4.2.0)
#>  P sass            0.4.2   2022-07-16 [?] RSPM (R 4.2.2)
#>  P scales          1.2.0   2022-04-13 [?] RSPM (R 4.2.0)
#>  P sessioninfo     1.2.2   2021-12-06 [?] RSPM (R 4.2.0)
#>  P shiny           1.7.2   2022-07-19 [?] RSPM (R 4.2.0)
#>  P snakecase       0.11.0  2019-05-25 [?] CRAN (R 4.2.1)
#>  P stringi         1.7.8   2022-07-11 [?] CRAN (R 4.2.2)
#>  P stringr       * 1.4.0   2019-02-10 [?] RSPM (R 4.2.0)
#>  P tibble        * 3.1.8   2022-07-22 [?] CRAN (R 4.2.2)
#>  P tidyr         * 1.2.0   2022-02-01 [?] RSPM (R 4.2.2)
#>  P tidyselect      1.1.2   2022-02-21 [?] CRAN (R 4.2.1)
#>  P tidyverse     * 1.3.2   2022-07-18 [?] CRAN (R 4.2.1)
#>  P tzdb            0.3.0   2022-03-28 [?] CRAN (R 4.2.2)
#>  P urlchecker      1.0.1   2021-11-30 [?] RSPM (R 4.2.0)
#>  P usethis         2.1.6   2022-05-25 [?] RSPM (R 4.2.0)
#>  P utf8            1.2.2   2021-07-24 [?] CRAN (R 4.2.2)
#>  P vctrs           0.4.1   2022-04-13 [?] CRAN (R 4.2.1)
#>  P visNetwork      2.1.2   2022-09-29 [?] RSPM (R 4.2.2)
#>  P vroom           1.5.7   2021-11-30 [?] RSPM (R 4.2.2)
#>  P withr           2.5.0   2022-03-03 [?] CRAN (R 4.2.2)
#>  P xfun            0.32    2022-08-10 [?] CRAN (R 4.2.1)
#>  P xml2            1.3.3   2021-11-30 [?] CRAN (R 4.2.2)
#>  P xtable          1.8-4   2019-04-21 [?] RSPM (R 4.2.0)
#>  P yaml            2.3.5   2022-02-21 [?] CRAN (R 4.2.1)
#> 
#>  [1] /media/work/r-blog/renv/library/R-4.2/x86_64-pc-linux-gnu
#>  [2] /usr/local/lib/R/library
#> 
#>  P ── Loaded and on-disk path mismatch.
#> 
#> ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────