An article about friendly web scraping and a remarkable time trial
In Tour de France history, time trials have always been a spectacle of their own. Oftentimes they are the decisive factor in who wins that year's general classification. This was also the case this year, with Jonas Vingegaard's stunning performance on the 16th stage from Passy to Combloux.
In this post we want to take a closer look at Jonas Vingegaard's performance, heavily inspired by this post.
In order to reproduce this analysis in one go, you have to perform the following steps:
renv::restore()
targets::tar_make()
Alternatively, you could copy and paste the code chunks into your R session and execute them one after the other. In that case you would have to install the required packages by hand.
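Judging from the functions used in the code chunks below, the analysis relies on at least the following packages (this list is inferred from the code, so treat it as an approximation rather than the original package list):
library(dplyr)      # data wrangling verbs (filter, mutate, joins, ...)
library(purrr)      # map_chr() and friends
library(stringr)    # regex(), str_glue(), str_detect(), ...
library(readr)      # parse_integer()
library(tibble)     # tibble()
library(rvest)      # html_elements(), html_attr(), html_table()
library(polite)     # bow(), nod(), scrape()
library(janitor)    # clean_names()
library(lubridate)  # ms(), seconds()
library(forcats)    # fct_inorder(), fct_rev()
library(ggplot2)    # plotting
library(ggbeeswarm) # geom_beeswarm()
library(scales)     # label_percent()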
Define global variables:
cycling_stats_url <- "https://www.procyclingstats.com/"
start_year <- 2000
stages_overview_path <- "race/tour-de-france/"
stages_urls_css <- ".basic a:nth-child(1)"
time_trial_regex <- regex("\\(ITT\\)", ignore_case = TRUE)
First we need the raw data. We will get it from https://www.procyclingstats.com/.
The exact path from which we get the data is determined by the string 'race/tour-de-france/' plus the year of the Tour de France edition (going back to the year 2000):
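The tdf_editions helper itself is not shown in this post. A minimal sketch that would produce the tibble below could look like this (the upper bound of 2024 is an assumption, inferred from the 25 rows shown):
tdf_editions <- function(start_year, stages_overview_path) {
  # one row per Tour edition, with the path to its stages overview page
  tibble(year = start_year:2024) |>
    mutate(so_path = str_glue("{stages_overview_path}{year}"))
}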
df_tdf_editions <- tdf_editions(start_year, stages_overview_path)
# A tibble: 25 × 2
year so_path
<int> <glue>
1 2000 race/tour-de-france/2000
2 2001 race/tour-de-france/2001
3 2002 race/tour-de-france/2002
4 2003 race/tour-de-france/2003
5 2004 race/tour-de-france/2004
6 2005 race/tour-de-france/2005
7 2006 race/tour-de-france/2006
8 2007 race/tour-de-france/2007
9 2008 race/tour-de-france/2008
10 2009 race/tour-de-france/2009
# ℹ 15 more rows
Pull all paths into a vector for later analysis:
so_paths <- pull(df_tdf_editions, so_path)
Scrape the stage overview for each edition using the following helper function. After a first introduction to the host (done further below), we alter the URL by appending the path from the code chunk above. We do this with the polite::nod function. Once the exact URL is determined, we scrape the content of the webpage using the polite::scrape function. Keep only certain elements of the scraped HTML by looking for the CSS selector '.basic a:nth-child(1)'. The selector was determined using the techniques described in this vignette.
scrape_overview <- function(session_bow, so_path, stages_urls_css) {
  scrape_url <- nod(session_bow, path = so_path)
  stages_overview_html <- scrape(scrape_url)
  stages_overview_nodes <- stages_overview_html |>
    html_elements(stages_urls_css)
  tibble(
    href = html_attr(stages_overview_nodes, "href"),
    desc = html_text(stages_overview_nodes)) |>
    mutate(url = scrape_url$url)
}
Before we apply this function multiple times, we first introduce ourselves with the polite::bow function. After that we apply the above function to each edition. By doing this, we automatically comply with the scraping restrictions defined in robots.txt.
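The stages_overview_raw wrapper is not reproduced in this post. A minimal sketch consistent with the description above (bow once, then map scrape_overview over all paths and row-bind the results) could be the following; the function body and naming are assumptions:
stages_overview_raw <- function(cycling_stats_url, so_paths, stages_urls_css) {
  # introduce ourselves once to the host (respects robots.txt)
  session_bow <- bow(cycling_stats_url)
  # scrape the stage overview for every edition and combine into one tibble
  map(so_paths, \(so_path) scrape_overview(session_bow, so_path, stages_urls_css)) |>
    list_rbind()
}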
df_stages_overview_raw <- stages_overview_raw(cycling_stats_url, so_paths, stages_urls_css)
The result looks like this:
# A tibble: 888 × 3
href desc url
<chr> <chr> <chr>
1 team/us-postal-service-2000 US Postal Service https://www.proc…
2 team/team-telekom-2000 Team Telekom https://www.proc…
3 team/festina-lotus-2000 Festina - Lotus https://www.proc…
4 team/festina-lotus-2000 Festina - Lotus https://www.proc…
5 team/kelme-costa-blanca-2000 Kelme - Costa Blanca https://www.proc…
6 team/polti-2000 Polti https://www.proc…
7 team/kelme-costa-blanca-2000 Kelme - Costa Blanca https://www.proc…
8 team/kelme-costa-blanca-2000 Kelme - Costa Blanca https://www.proc…
9 team/banesto-2000 Banesto https://www.proc…
10 team/mapei-quickstep-2000 Mapei - Quickstep https://www.proc…
# ℹ 878 more rows
Preprocess the stages overview. Only keep rows with a description and extract the year of the edition from href:
stages_overview <- function(df_stages_overview_raw, cycling_stats_url) {
  df_stages_overview_raw |>
    filter(desc != "") |>
    rename(url_stages_overview = url) |>
    mutate(
      year = parse_integer(
        map_chr(str_split(url_stages_overview, "/"), \(x) x[length(x)])),
      url_stage = str_glue("{cycling_stats_url}{href}")) |>
    filter(year <= 2023)
}
df_stages_overview <- stages_overview(df_stages_overview_raw, cycling_stats_url)
Filter for time trial stages. To do so, keep only rows whose description contains the following keyword:
[1] "\\(ITT\\)"
After filtering, further process the description to keep the string short and simple:
stages_itt <- function(df_stages_overview, time_trial_regex) {
  df_stages_overview |>
    filter(str_detect(desc, time_trial_regex)) |>
    distinct(href, .keep_all = TRUE) |>
    mutate(
      desc = str_remove_all(desc, str_glue("{time_trial_regex}|Stage|\\|")),
      desc = str_squish(desc),
      desc = str_glue("{desc} ({year})"))
}
df_stages_itt <- stages_itt(df_stages_overview, time_trial_regex)
# A tibble: 40 × 5
href desc url_stages_overview year url_stage
<chr> <glu> <chr> <int> <glue>
1 race/tour-de-france/2000… 1 Fu… https://www.procyc… 2000 https://…
2 race/tour-de-france/2000… 19 F… https://www.procyc… 2000 https://…
3 race/tour-de-france/2001… 11 G… https://www.procyc… 2001 https://…
4 race/tour-de-france/2001… 18 M… https://www.procyc… 2001 https://…
5 race/tour-de-france/2002… 9 La… https://www.procyc… 2002 https://…
6 race/tour-de-france/2002… 19 R… https://www.procyc… 2002 https://…
7 race/tour-de-france/2003… 12 G… https://www.procyc… 2003 https://…
8 race/tour-de-france/2003… 19 P… https://www.procyc… 2003 https://…
9 race/tour-de-france/2004… 16 B… https://www.procyc… 2004 https://…
10 race/tour-de-france/2004… 19 B… https://www.procyc… 2004 https://…
# ℹ 30 more rows
Pull all the links into a vector for later analysis:
stages_paths <- pull(df_stages_itt, href)
For every time trial, scrape the result of the stage using the following helper function. In the scraped HTML, look for the table defined by the CSS selector 'div.result-cont:nth-child(5) > div:nth-child(1) > table:nth-child(1)'.
scrape_stage <- function(session_bow, stage_path, stage_tbl_css) {
  stage_url <- nod(session_bow, path = stage_path)
  stage_html <- scrape(stage_url)
  html_nodes(stage_html, stage_tbl_css) |>
    html_table() |>
    first() |>
    clean_names() |>
    mutate(rnk = row_number(), url_stage = stage_url$url)
}
Again, first bow to the host and then apply the function from the code chunk above repeatedly.
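Neither stage_tbl_css nor the stage() wrapper appears in the chunks above. A sketch consistent with the prose (the selector is the one quoted earlier; the wrapper bows once and then maps scrape_stage over all stage paths) might look like this; the wrapper's body is an assumption:
# CSS selector for the result table, as quoted in the text above
stage_tbl_css <- "div.result-cont:nth-child(5) > div:nth-child(1) > table:nth-child(1)"

stage <- function(cycling_stats_url, stages_paths, stage_tbl_css) {
  # introduce ourselves once, then politely scrape every time trial result
  session_bow <- bow(cycling_stats_url)
  map(stages_paths, \(stage_path) scrape_stage(session_bow, stage_path, stage_tbl_css)) |>
    list_rbind()
}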
df_stage <- stage(cycling_stats_url, stages_paths, stage_tbl_css)
# A tibble: 6,569 × 15
rnk gc timelag bib h2h specialty rider age team uci
<int> <int> <chr> <int> <lgl> <chr> <chr> <int> <chr> <int>
1 1 1 +0:00 127 NA TT MILLAR… 23 Cofi… NA
2 2 2 +0:02 1 NA TT ARMSTR… 28 US P… NA
3 3 3 +0:13 51 NA Sprint JALABE… 31 O.N.… NA
4 4 4 +0:14 61 NA TT ULLRIC… 26 Team… NA
5 5 5 +0:17 52 NA GC CAÑADA… 25 O.N.… NA
6 6 6 +0:20 11 NA TT ZÜLLE … 31 Bane… NA
7 7 7 +0:21 3 NA TT EKIMOV… 34 US P… NA
8 8 8 +0:27 72 NA Classic BORGHE… 31 Merc… NA
9 9 9 +0:33 4 NA TT HAMILT… 29 US P… NA
10 10 10 +0:36 43 NA Classic DEKKER… 29 Rabo… NA
# ℹ 6,559 more rows
# ℹ 5 more variables: pnt <int>, x <lgl>, time <chr>, avg <dbl>,
# url_stage <chr>
Calculate the time delta to the winner's time for each rider:
time_delta <- function(df_stage) {
  df_stage |>
    transmute(
      url_stage, rider,
      time_delta = case_when(
        # the stage winner has no gap
        rnk == 1 ~ seconds(0),
        # if the time contains a comma, remove it and parse minutes:seconds
        str_detect(time, ",") ~ ms(str_remove_all(time, ",")),
        # otherwise parse only the second half of the time string
        TRUE ~ ms(str_sub(time, start = (str_length(time) / 2) + 1)))) |>
    # drop rows with unparsable or negative deltas
    filter(!is.na(time_delta), time_delta >= 0)
}
df_time_delta <- time_delta(df_stage)
# A tibble: 6,548 × 3
url_stage rider time_delta
<chr> <chr> <Period>
1 https://www.procyclingstats.com/race/tour-de-fran… MILL… 0S
2 https://www.procyclingstats.com/race/tour-de-fran… ARMS… 2S
3 https://www.procyclingstats.com/race/tour-de-fran… JALA… 13S
4 https://www.procyclingstats.com/race/tour-de-fran… ULLR… 14S
5 https://www.procyclingstats.com/race/tour-de-fran… CAÑA… 16S
6 https://www.procyclingstats.com/race/tour-de-fran… ZÜLL… 20S
7 https://www.procyclingstats.com/race/tour-de-fran… EKIM… 21S
8 https://www.procyclingstats.com/race/tour-de-fran… BORG… 27S
9 https://www.procyclingstats.com/race/tour-de-fran… HAMI… 33S
10 https://www.procyclingstats.com/race/tour-de-fran… DEKK… 36S
# ℹ 6,538 more rows
Extract time for each winner:
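The winner_time helper is also not reproduced here. A rough sketch, under the assumption that the winner's row carries the full stage time as an "h:m:s" string, could be:
winner_time <- function(df_stage) {
  df_stage |>
    filter(rnk == 1) |>                            # one row per stage: the winner
    transmute(url_stage, winner_time = hms(time))  # assumes an "h:m:s" time string
}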
df_winner_time <- winner_time(df_stage)
Combine everything into one data frame. Calculate total time for each rider:
total_time <- function(df_time_delta, df_winner_time, df_stages_itt) {
  df_time_delta |>
    inner_join(df_winner_time, by = "url_stage") |>
    transmute(
      url_stage, rider,
      time = winner_time + time_delta,
      diff_winner_perc = (time - winner_time) / winner_time) |>
    left_join(df_stages_itt, by = "url_stage")
}
df_total_time <- total_time(df_time_delta, df_winner_time, df_stages_itt)
Summarise data per stage. Use this summary to arrange the description as an ordered factor. Exclude winner rows from the data:
total_time_summary <- function(df_total_time) {
  df_total_time_summary <- df_total_time |>
    filter(diff_winner_perc != 0) |>
    group_by(desc) |>
    summarise(median_diff_winner_perc = median(diff_winner_perc)) |>
    arrange(desc(median_diff_winner_perc)) |>
    mutate(desc = fct_inorder(desc))
  df_total_time |>
    filter(diff_winner_perc != 0) |>
    mutate(desc = fct_rev(factor(desc, df_total_time_summary$desc)))
}
df_total_time_summary <- total_time_summary(df_total_time)
Visualize data. Arrange the stages based on the summary calculated in the preceding code chunk:
vis_total_time <- function(df_total_time_summary) {
  set.seed(10923)
  df_total_time_summary |>
    ggplot(aes(y = desc, x = diff_winner_perc)) +
    geom_beeswarm(alpha = 0.2) +
    geom_boxplot() +
    theme_light() +
    theme(axis.text.x = element_text(angle = 90)) +
    scale_x_continuous(labels = label_percent()) +
    labs(y = "Stage", x = "Difference to Winner Time [%]")
}
gg_total_time <- vis_total_time(df_total_time_summary)
The top stages are almost all (except for Cap Découverte) mountain time trials. One can see that this year's time trial stands out from the rest nonetheless!
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/duju211/tour_tt, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
During (2024, Feb. 14). Datannery: Scraping Tour de France Data. Retrieved from https://www.datannery.com/posts/scraping-tour-de-france-data/
BibTeX citation
@misc{during2024scraping,
  author = {During, Julian},
  title = {Datannery: Scraping Tour de France Data},
  url = {https://www.datannery.com/posts/scraping-tour-de-france-data/},
  year = {2024}
}