Datannery: Hairy Football Challenge

Julian During

Idea

Recently I came across a trend on social media: A football fan states ‘I won’t cut my hair until club ’X’ wins 5 games in a row’.

I asked myself the question: How long would this presumably be for my favorite club? Is it time to join the challenge?

Data

In this analysis the following libraries are used:

library(tarchetypes)
library(conflicted)
library(tidyverse)
library(targets)
library(RSQLite)
library(assertr)
library(distill)
library(slider)
library(httr2)
library(DBI)
library(fs)
library(gt)

conflicts_prefer(dplyr::filter)
conflicts_prefer(dplyr::lag)

To answer the initial question I use a data set from kaggle. It can be found under the following URL:

base_url <- "https://www.kaggle.com/api/v1/datasets/download/hugomathien/soccer"

Before we can start analyzing the data we need to download it:

Request and save the raw zip data using function from the httr2 package (Wickham 2025a)
Unzip the downloaded file with the unzip (R Core Team 2024) function
Delete the zip file (Hester, Wickham, and Csárdi 2025) and return the sqlite file

download_data <- function(base_url) {
  kaggle_req <- request(base_url) |>
    req_auth_basic(
      username = Sys.getenv("kaggle_user"), password = Sys.getenv("kaggle_key")
    )

  zip_path <- req_perform(kaggle_req, path = "soccer.zip")

  db_path <- unzip(zip_path$body, overwrite = TRUE)
  file_delete(zip_path$body)

  return(db_path)
}
db_file <- download_data(base_url)

The environment variables kaggle_user and kaggle_key can be created on the kaggle website

We now have a sqlite file in our project folder that we want to further investigate. As a first step we want to filter the included match data for a specific club.

Create a connection to the file using ‘DBI’ (R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller 2024) and ‘RSQLite’ (Müller et al. 2025)
Filter and transform the data using the Tidyverse suite of packages (Wickham 2023)
By using the ‘dplyr’ (Wickham et al. 2023) functions, the Tidyverse pipelines get translated to SQL automatically. We simply need to call the ‘collect’ function at the end of the pipeline
Add an assertive test (Fischetti 2023) to make sure the key variable (id) is unique

Search the database for games of my club:

club_games_raw <- function(db_file, club_name) {
  con <- dbConnect(SQLite(), db_file)
  on.exit(dbDisconnect(con))

  qry_club <- tbl(con, "Team") |>
    filter(team_long_name == club_name) |>
    select(team_api_id, team_long_name)

  qry_club_games <- tbl(con, "Match") |>
    select(id, home_team_api_id, away_team_api_id) |>
    pivot_longer(
      cols = c(home_team_api_id, away_team_api_id),
      values_to = "team_api_id", names_to = "home_away"
    ) |>
    inner_join(qry_club, by = join_by(team_api_id))

  df_club_games_raw <- tbl(con, "Match") |>
    inner_join(qry_club_games, by = join_by(id)) |>
    select(
      id, league_id, team_long_name, date, home_team_api_id, away_team_api_id,
      home_team_goal, away_team_goal, home_away
    ) |>
    collect()

  df_club_games_raw |>
    assert(is_uniq, id)
}
df_bvb_games_raw <- club_games_raw(db_file, "Borussia Dortmund")

Let’s take a first look at the data:

glimpse(df_bvb_games_raw)

Rows: 272
Columns: 9
$ id               <int> 7829, 7846, 7864, 7883, 7891, 7909, 7918, 7…
$ league_id        <int> 7809, 7809, 7809, 7809, 7809, 7809, 7809, 7…
$ team_long_name   <chr> "Borussia Dortmund", "Borussia Dortmund", "…
$ date             <chr> "2008-11-02 00:00:00", "2008-11-15 00:00:00…
$ home_team_api_id <int> 9789, 9789, 9789, 9789, 9789, 9789, 9789, 9…
$ away_team_api_id <int> 9911, 9810, 8721, 9788, 8178, 9823, 8398, 8…
$ home_team_goal   <int> 1, 4, 0, 2, 1, 1, 1, 0, 1, 3, 2, 4, 6, 3, 3…
$ away_team_goal   <int> 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 3, 0…
$ home_away        <chr> "home_team_api_id", "home_team_api_id", "ho…

Now that the data is locally available, we can further preprocess it using the full R functionality (which might be hard or impossible to translate into SQL):

Parse the date column (Spinu, Grolemund, and Wickham 2024)
Split the home_away column and extract the first part of the string (Wickham 2025b)
Determine if the result represents a ‘victory’
Make sure the date column is unique (assuming there can’t be two matches at the same day)

club_games <- function(df_club_games_raw) {
  df_club_games <- df_club_games_raw |>
    mutate(
      team_long_name,
      date = as_date(ymd_hms(date)),
      home_away = str_split_i(home_away, "_", 1),
      victory = if_else(
        home_away == "home",
        home_team_goal > away_team_goal,
        away_team_goal > home_team_goal
      ), .keep = "none"
    )

  df_club_games |>
    assert(is_uniq, date)
}
df_bvb_games <- club_games(df_bvb_games_raw)

The data is now in the right form to perform calculations across a sliding window (Vaughan 2024). Before continuing, define after how many victories in a row we are talking about a ‘streak’:

victories_streak <- 5L

Now to the sliding window calculations:

How many victories are achieved in a row
Calculate the number of the streak

club_winning_streak <- function(df_club_games, victories_streak) {
  df_club_games |>
    arrange(date) |>
    mutate(
      victories_in_a_row = slide_int(
        victory, \(x) sum(cumall(rev(x))),
        .before = Inf
      ),
      winning_streak_nr = cumsum(
        victories_in_a_row >= victories_streak &
          lag(victories_in_a_row < victories_streak, default = 0)
      )
    )
}
df_bvb_winning_streak <- club_winning_streak(df_bvb_games, victories_streak)

After the sliding window calculation, we filter observations that represent the last winning streak. Also clubs with less than two winning streaks are filtered out.

mean_days_between_streaks <- function(df_club_winning_streak) {
  df_club_winning_streak |>
    filter(
      winning_streak_nr != max(winning_streak_nr),
      max(winning_streak_nr) >= 2
    ) |>
    group_by(team_long_name, winning_streak_nr) |>
    summarise(
      min_date = min(date), max_date = max(date), .groups = "drop_last"
    ) |>
    mutate(
      days_between_streaks = difftime(max_date, min_date, units = "days")
    ) |>
    summarise(
      mean_days_between_streaks = round(
        mean(days_between_streaks, na.rm = TRUE), 2
      )
    )
}
df_bvb_mean_days_between_streaks <- mean_days_between_streaks(df_bvb_winning_streak)

Now that the data is in the right form, we can visualise it. Create a combination of a “Step” and a “Point” plot with ‘ggplot2’ (Wickham et al. 2025)

vis_winning_streak <- function(df_club_winning_streak,
                               df_club_mean_days_between_streaks) {
  df_club_winning_streak |>
    ggplot(aes(x = date, y = victories_in_a_row)) +
    geom_point() +
    geom_step() +
    geom_hline(yintercept = 5, linetype = "dotted") +
    theme_light() +
    labs(
      title = str_glue(
        "'{df_club_mean_days_between_streaks$team_long_name}' Winning Streaks"
      ),
      subtitle = str_glue(
        "Mean Time between Streaks: ",
        "{df_club_mean_days_between_streaks$mean_days_between_streaks} days"
      ),
      y = "Victories in a Row", x = "Date"
    )
}
gg_winning_streak <- vis_winning_streak(df_bvb_winning_streak, df_bvb_mean_days_between_streaks)

If we repeat the process of all clubs, we can look at the top 10 clubs with the lowest mean time between winning streaks:

team_long_name	mean_days_between_streaks
Real Madrid CF	127.25 days
FC Barcelona	132.30 days
Rangers	151.14 days
Celtic	166.80 days
Manchester United	172.00 days
SL Benfica	176.40 days
FC Bayern Munich	198.15 days
FC Basel	200.46 days
Juventus	203.31 days
RSC Anderlecht	215.00 days

We looked at the winning streaks of one particular club and compared it to the top ten clubs in the data. This gives us a good idea about the challenge and I am interested in how long the ongoing challenges will last :-).

Fischetti, Tony. 2023. Assertr: Assertive Programming for r Analysis Pipelines. https://docs.ropensci.org/assertr/.

Hester, Jim, Hadley Wickham, and Gábor Csárdi. 2025. Fs: Cross-Platform File System Operations Based on Libuv. https://fs.r-lib.org.

Müller, Kirill, Hadley Wickham, David A. James, and Seth Falcon. 2025. RSQLite: SQLite Interface for r. https://rsqlite.r-dbi.org.

R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

R Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and Kirill Müller. 2024. DBI: R Database Interface. https://dbi.r-dbi.org.

Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2024. Lubridate: Make Dealing with Dates a Little Easier. https://lubridate.tidyverse.org.

Vaughan, Davis. 2024. Slider: Sliding Window Functions. https://github.com/r-lib/slider.

Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.

———. 2025a. Httr2: Perform HTTP Requests and Process the Responses. https://httr2.r-lib.org.

———. 2025b. Stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org.

Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and Teun van den Brand. 2025. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.

Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org.

Hairy Football Challenge

Idea

Data

References

Corrections

Reuse

Citation