Calculate mean times between winning streaks of football clubs using sliding window calculations.
Recently I came across a trend on social media: A football fan states ‘I won’t cut my hair until club ’X’ wins 5 games in a row’.
I asked myself the question: How long would this presumably be for my favorite club? Is it time to join the challenge?
In this analysis the following libraries are used:
To answer the initial question I use a data set from kaggle. It can be found under the following URL:
base_url <- "https://www.kaggle.com/api/v1/datasets/download/hugomathien/soccer"
Before we can start analyzing the data we need to download it:
httr2 package (Wickham 2025a)unzip (R Core Team 2024) functiondownload_data <- function(base_url) {
kaggle_req <- request(base_url) |>
req_auth_basic(
username = Sys.getenv("kaggle_user"), password = Sys.getenv("kaggle_key")
)
zip_path <- req_perform(kaggle_req, path = "soccer.zip")
db_path <- unzip(zip_path$body, overwrite = TRUE)
file_delete(zip_path$body)
return(db_path)
}
db_file <- download_data(base_url)
We now have a sqlite file in our project folder that we want to further investigate. As a first step we want to filter the included match data for a specific club.
id) is
uniqueSearch the database for games of my club:
club_games_raw <- function(db_file, club_name) {
con <- dbConnect(SQLite(), db_file)
on.exit(dbDisconnect(con))
qry_club <- tbl(con, "Team") |>
filter(team_long_name == club_name) |>
select(team_api_id, team_long_name)
qry_club_games <- tbl(con, "Match") |>
select(id, home_team_api_id, away_team_api_id) |>
pivot_longer(
cols = c(home_team_api_id, away_team_api_id),
values_to = "team_api_id", names_to = "home_away"
) |>
inner_join(qry_club, by = join_by(team_api_id))
df_club_games_raw <- tbl(con, "Match") |>
inner_join(qry_club_games, by = join_by(id)) |>
select(
id, league_id, team_long_name, date, home_team_api_id, away_team_api_id,
home_team_goal, away_team_goal, home_away
) |>
collect()
df_club_games_raw |>
assert(is_uniq, id)
}
df_bvb_games_raw <- club_games_raw(db_file, "Borussia Dortmund")
Let’s take a first look at the data:
glimpse(df_bvb_games_raw)
Rows: 272
Columns: 9
$ id <int> 7829, 7846, 7864, 7883, 7891, 7909, 7918, 7…
$ league_id <int> 7809, 7809, 7809, 7809, 7809, 7809, 7809, 7…
$ team_long_name <chr> "Borussia Dortmund", "Borussia Dortmund", "…
$ date <chr> "2008-11-02 00:00:00", "2008-11-15 00:00:00…
$ home_team_api_id <int> 9789, 9789, 9789, 9789, 9789, 9789, 9789, 9…
$ away_team_api_id <int> 9911, 9810, 8721, 9788, 8178, 9823, 8398, 8…
$ home_team_goal <int> 1, 4, 0, 2, 1, 1, 1, 0, 1, 3, 2, 4, 6, 3, 3…
$ away_team_goal <int> 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 3, 0…
$ home_away <chr> "home_team_api_id", "home_team_api_id", "ho…
Now that the data is locally available, we can further preprocess it using the full R functionality (which might be hard or impossible to translate into SQL):
home_away column and extract the first part of the string
(Wickham 2025b)club_games <- function(df_club_games_raw) {
df_club_games <- df_club_games_raw |>
mutate(
team_long_name,
date = as_date(ymd_hms(date)),
home_away = str_split_i(home_away, "_", 1),
victory = if_else(
home_away == "home",
home_team_goal > away_team_goal,
away_team_goal > home_team_goal
), .keep = "none"
)
df_club_games |>
assert(is_uniq, date)
}
df_bvb_games <- club_games(df_bvb_games_raw)
The data is now in the right form to perform calculations across a sliding window (Vaughan 2024). Before continuing, define after how many victories in a row we are talking about a ‘streak’:
victories_streak <- 5L
Now to the sliding window calculations:
club_winning_streak <- function(df_club_games, victories_streak) {
df_club_games |>
arrange(date) |>
mutate(
victories_in_a_row = slide_int(
victory, \(x) sum(cumall(rev(x))),
.before = Inf
),
winning_streak_nr = cumsum(
victories_in_a_row >= victories_streak &
lag(victories_in_a_row < victories_streak, default = 0)
)
)
}
df_bvb_winning_streak <- club_winning_streak(df_bvb_games, victories_streak)
After the sliding window calculation, we filter observations that represent the last winning streak. Also clubs with less than two winning streaks are filtered out.
mean_days_between_streaks <- function(df_club_winning_streak) {
df_club_winning_streak |>
filter(
winning_streak_nr != max(winning_streak_nr),
max(winning_streak_nr) >= 2
) |>
group_by(team_long_name, winning_streak_nr) |>
summarise(
min_date = min(date), max_date = max(date), .groups = "drop_last"
) |>
mutate(
days_between_streaks = difftime(max_date, min_date, units = "days")
) |>
summarise(
mean_days_between_streaks = round(
mean(days_between_streaks, na.rm = TRUE), 2
)
)
}
df_bvb_mean_days_between_streaks <- mean_days_between_streaks(df_bvb_winning_streak)
Now that the data is in the right form, we can visualise it. Create a combination of a “Step” and a “Point” plot with ‘ggplot2’ (Wickham et al. 2025)
vis_winning_streak <- function(df_club_winning_streak,
df_club_mean_days_between_streaks) {
df_club_winning_streak |>
ggplot(aes(x = date, y = victories_in_a_row)) +
geom_point() +
geom_step() +
geom_hline(yintercept = 5, linetype = "dotted") +
theme_light() +
labs(
title = str_glue(
"'{df_club_mean_days_between_streaks$team_long_name}' Winning Streaks"
),
subtitle = str_glue(
"Mean Time between Streaks: ",
"{df_club_mean_days_between_streaks$mean_days_between_streaks} days"
),
y = "Victories in a Row", x = "Date"
)
}
gg_winning_streak <- vis_winning_streak(df_bvb_winning_streak, df_bvb_mean_days_between_streaks)
If we repeat the process of all clubs, we can look at the top 10 clubs with the lowest mean time between winning streaks:
| team_long_name | mean_days_between_streaks |
|---|---|
| Real Madrid CF | 127.25 days |
| FC Barcelona | 132.30 days |
| Rangers | 151.14 days |
| Celtic | 166.80 days |
| Manchester United | 172.00 days |
| SL Benfica | 176.40 days |
| FC Bayern Munich | 198.15 days |
| FC Basel | 200.46 days |
| Juventus | 203.31 days |
| RSC Anderlecht | 215.00 days |
We looked at the winning streaks of one particular club and compared it to the top ten clubs in the data. This gives us a good idea about the challenge and I am interested in how long the ongoing challenges will last :-).
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://codeberg.org/duju211/football_hair2, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
During (2025, Nov. 26). Datannery: Hairy Football Challenge. Retrieved from https://www.datannery.com/posts/hairy-football-challenge/
BibTeX citation
@misc{during2025hairy,
author = {During, Julian},
title = {Datannery: Hairy Football Challenge},
url = {https://www.datannery.com/posts/hairy-football-challenge/},
year = {2025}
}