Hairy Football Challenge

Calculate mean times between winning streaks of football clubs using sliding window calculations.

Julian During www.datannery.com
2025-11-15

Idea

Recently I came across a trend on social media: A football fan states ‘I won’t cut my hair until club ’X’ wins 5 games in a row’.

I asked myself the question: How long would this presumably be for my favorite club? Is it time to join the challenge?

Data

In this analysis the following libraries are used:

To answer the initial question I use a data set from kaggle. It can be found under the following URL:

base_url <- "https://www.kaggle.com/api/v1/datasets/download/hugomathien/soccer"

Before we can start analyzing the data we need to download it:

download_data <- function(base_url) {
  kaggle_req <- request(base_url) |>
    req_auth_basic(
      username = Sys.getenv("kaggle_user"), password = Sys.getenv("kaggle_key")
    )

  zip_path <- req_perform(kaggle_req, path = "soccer.zip")

  db_path <- unzip(zip_path$body, overwrite = TRUE)
  file_delete(zip_path$body)

  return(db_path)
}
db_file <- download_data(base_url)

We now have a sqlite file in our project folder that we want to further investigate. As a first step we want to filter the included match data for a specific club.

Search the database for games of my club:

club_games_raw <- function(db_file, club_name) {
  con <- dbConnect(SQLite(), db_file)
  on.exit(dbDisconnect(con))

  qry_club <- tbl(con, "Team") |>
    filter(team_long_name == club_name) |>
    select(team_api_id, team_long_name)

  qry_club_games <- tbl(con, "Match") |>
    select(id, home_team_api_id, away_team_api_id) |>
    pivot_longer(
      cols = c(home_team_api_id, away_team_api_id),
      values_to = "team_api_id", names_to = "home_away"
    ) |>
    inner_join(qry_club, by = join_by(team_api_id))

  df_club_games_raw <- tbl(con, "Match") |>
    inner_join(qry_club_games, by = join_by(id)) |>
    select(
      id, league_id, team_long_name, date, home_team_api_id, away_team_api_id,
      home_team_goal, away_team_goal, home_away
    ) |>
    collect()

  df_club_games_raw |>
    assert(is_uniq, id)
}
df_bvb_games_raw <- club_games_raw(db_file, "Borussia Dortmund")

Let’s take a first look at the data:

glimpse(df_bvb_games_raw)
Rows: 272
Columns: 9
$ id               <int> 7829, 7846, 7864, 7883, 7891, 7909, 7918, 7…
$ league_id        <int> 7809, 7809, 7809, 7809, 7809, 7809, 7809, 7…
$ team_long_name   <chr> "Borussia Dortmund", "Borussia Dortmund", "…
$ date             <chr> "2008-11-02 00:00:00", "2008-11-15 00:00:00…
$ home_team_api_id <int> 9789, 9789, 9789, 9789, 9789, 9789, 9789, 9…
$ away_team_api_id <int> 9911, 9810, 8721, 9788, 8178, 9823, 8398, 8…
$ home_team_goal   <int> 1, 4, 0, 2, 1, 1, 1, 0, 1, 3, 2, 4, 6, 3, 3…
$ away_team_goal   <int> 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 3, 0…
$ home_away        <chr> "home_team_api_id", "home_team_api_id", "ho…

Now that the data is locally available, we can further preprocess it using the full R functionality (which might be hard or impossible to translate into SQL):

club_games <- function(df_club_games_raw) {
  df_club_games <- df_club_games_raw |>
    mutate(
      team_long_name,
      date = as_date(ymd_hms(date)),
      home_away = str_split_i(home_away, "_", 1),
      victory = if_else(
        home_away == "home",
        home_team_goal > away_team_goal,
        away_team_goal > home_team_goal
      ), .keep = "none"
    )

  df_club_games |>
    assert(is_uniq, date)
}
df_bvb_games <- club_games(df_bvb_games_raw)

The data is now in the right form to perform calculations across a sliding window (Vaughan 2024). Before continuing, define after how many victories in a row we are talking about a ‘streak’:

victories_streak <- 5L

Now to the sliding window calculations:

club_winning_streak <- function(df_club_games, victories_streak) {
  df_club_games |>
    arrange(date) |>
    mutate(
      victories_in_a_row = slide_int(
        victory, \(x) sum(cumall(rev(x))),
        .before = Inf
      ),
      winning_streak_nr = cumsum(
        victories_in_a_row >= victories_streak &
          lag(victories_in_a_row < victories_streak, default = 0)
      )
    )
}
df_bvb_winning_streak <- club_winning_streak(df_bvb_games, victories_streak)

After the sliding window calculation, we filter observations that represent the last winning streak. Also clubs with less than two winning streaks are filtered out.

mean_days_between_streaks <- function(df_club_winning_streak) {
  df_club_winning_streak |>
    filter(
      winning_streak_nr != max(winning_streak_nr),
      max(winning_streak_nr) >= 2
    ) |>
    group_by(team_long_name, winning_streak_nr) |>
    summarise(
      min_date = min(date), max_date = max(date), .groups = "drop_last"
    ) |>
    mutate(
      days_between_streaks = difftime(max_date, min_date, units = "days")
    ) |>
    summarise(
      mean_days_between_streaks = round(
        mean(days_between_streaks, na.rm = TRUE), 2
      )
    )
}
df_bvb_mean_days_between_streaks <- mean_days_between_streaks(df_bvb_winning_streak)

Now that the data is in the right form, we can visualise it. Create a combination of a “Step” and a “Point” plot with ‘ggplot2’ (Wickham et al. 2025)

vis_winning_streak <- function(df_club_winning_streak,
                               df_club_mean_days_between_streaks) {
  df_club_winning_streak |>
    ggplot(aes(x = date, y = victories_in_a_row)) +
    geom_point() +
    geom_step() +
    geom_hline(yintercept = 5, linetype = "dotted") +
    theme_light() +
    labs(
      title = str_glue(
        "'{df_club_mean_days_between_streaks$team_long_name}' Winning Streaks"
      ),
      subtitle = str_glue(
        "Mean Time between Streaks: ",
        "{df_club_mean_days_between_streaks$mean_days_between_streaks} days"
      ),
      y = "Victories in a Row", x = "Date"
    )
}
gg_winning_streak <- vis_winning_streak(df_bvb_winning_streak, df_bvb_mean_days_between_streaks)

If we repeat the process of all clubs, we can look at the top 10 clubs with the lowest mean time between winning streaks:

team_long_name mean_days_between_streaks
Real Madrid CF 127.25 days
FC Barcelona 132.30 days
Rangers 151.14 days
Celtic 166.80 days
Manchester United 172.00 days
SL Benfica 176.40 days
FC Bayern Munich 198.15 days
FC Basel 200.46 days
Juventus 203.31 days
RSC Anderlecht 215.00 days

We looked at the winning streaks of one particular club and compared it to the top ten clubs in the data. This gives us a good idea about the challenge and I am interested in how long the ongoing challenges will last :-).

Fischetti, Tony. 2023. Assertr: Assertive Programming for r Analysis Pipelines. https://docs.ropensci.org/assertr/.
Hester, Jim, Hadley Wickham, and Gábor Csárdi. 2025. Fs: Cross-Platform File System Operations Based on Libuv. https://fs.r-lib.org.
Müller, Kirill, Hadley Wickham, David A. James, and Seth Falcon. 2025. RSQLite: SQLite Interface for r. https://rsqlite.r-dbi.org.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
R Special Interest Group on Databases (R-SIG-DB), Hadley Wickham, and Kirill Müller. 2024. DBI: R Database Interface. https://dbi.r-dbi.org.
Spinu, Vitalie, Garrett Grolemund, and Hadley Wickham. 2024. Lubridate: Make Dealing with Dates a Little Easier. https://lubridate.tidyverse.org.
Vaughan, Davis. 2024. Slider: Sliding Window Functions. https://github.com/r-lib/slider.
Wickham, Hadley. 2023. Tidyverse: Easily Install and Load the Tidyverse. https://tidyverse.tidyverse.org.
———. 2025a. Httr2: Perform HTTP Requests and Process the Responses. https://httr2.r-lib.org.
———. 2025b. Stringr: Simple, Consistent Wrappers for Common String Operations. https://stringr.tidyverse.org.
Wickham, Hadley, Winston Chang, Lionel Henry, Thomas Lin Pedersen, Kohske Takahashi, Claus Wilke, Kara Woo, Hiroaki Yutani, Dewey Dunnington, and Teun van den Brand. 2025. Ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. https://ggplot2.tidyverse.org.
Wickham, Hadley, Romain François, Lionel Henry, Kirill Müller, and Davis Vaughan. 2023. Dplyr: A Grammar of Data Manipulation. https://dplyr.tidyverse.org.

References

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://codeberg.org/duju211/football_hair2, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

During (2025, Nov. 26). Datannery: Hairy Football Challenge. Retrieved from https://www.datannery.com/posts/hairy-football-challenge/

BibTeX citation

@misc{during2025hairy,
  author = {During, Julian},
  title = {Datannery: Hairy Football Challenge},
  url = {https://www.datannery.com/posts/hairy-football-challenge/},
  year = {2025}
}