Scraping and visualising Vendée Globe Data.
Vendeé Globe is a famous single-handed non-stop yacht race across the globe. Last year ‘Boris Herrman’ was in the news in Germany for ‘almost’ winning the ‘Vendée Globe’.
This year ‘Boris Herrman’ is one of the favorites for this years race. Let’s plot his progress in this years race compared to the other competitors.
In this post the following libraries are used. Most functions originate from the ‘tidyverse’ (Wickham et al. (2019)) suite of packages. Also behind the scenes ‘targets’ (Landau (2021b), Landau (2021a)) is used to manage the workflow. If you want to reproduce this analysis, you either go through the post step by step or perform the following steps:
renv::restore()
(Ushey and Wickham (2024))targets::tar_make()
Other packages will be cited where they are used. To avoid nasty conflicts, resolve them (Wickham (2023)) right here at the start.
library(rnaturalearthdata)
library(rnaturalearth)
library(tarchetypes)
library(conflicted)
library(tidyverse)
library(jsonlite)
library(janitor)
library(targets)
library(assertr)
library(parzer)
library(readxl)
library(withr)
library(rvest)
library(httr2)
library(renv)
library(fs)
library(sf)
conflicts_prefer(dplyr::filter)
The data lives as several .xlsx
files under the following location:
base_url <- "https://www.vendeeglobe.org/classement"
On this site there is a dropdown with all available “versions”.
They are listed inside a json
element.
Detect this element and extract them from the parsed json
(Ooms (2014)):
selector <- function(base_url) {
sel_json <- read_html(base_url) |>
html_element(".selector") |>
html_attr("x-data") |>
str_extract("\\[\\{.+\\}\\]")
as_tibble(parse_json(sel_json, simplifyVector = TRUE))
}
df_selector <- selector(base_url)
glimpse(df_selector)
Rows: 324
Columns: 2
$ id <chr> "560", "559", "557", "556", "555", "554", "552", "551"…
$ label <chr> "03 janvier 2025 - 11h00 (FR)", "03 janvier 2025 - 07h…
Some of these versions are in fact defunct. Filter those versions.
Links to the classification excel files are encoded in the above mentioned labels. Extract the time stamp from these labels. Before doing so, make sure that every id is unique (Fischetti (2023)):
times <- function(df_selector) {
df_selector |>
assert(is_uniq, id) |>
mutate(
label = str_remove(label, " \\(FR\\)$"),
time_stamp = dmy_hm(label, tz = "Europe/Paris"),
time_utc = as_datetime(time_stamp, tz = "UTC"),
.keep = "unused")
}
Sometimes the parsing of the timestamps is resulting in different results. This is because of the different naming of the locale on the operating system. Define locale based on operating system used:
loc <- if (Sys.info()[["sysname"]] == "Windows") {
"fr"
} else {
"fr_FR.UTF-8"
}
Execute the function with the correct locale. Set the locale temporarily (Hester et al. (2024)) to the one defined above:
df_times <- with_locale(c(LC_TIME = loc), times(df_selector_filtered))
Now we can build the URLs for the classification data:
excel_urls <- function(df_times) {
df_times |>
assert(not_na, time_utc) |>
mutate(
date = format(date(time_utc), "%Y%m%d"),
time = str_c(
str_pad(hour(time_utc), width = 2, pad = "0"),
str_pad(minute(time_utc), width = 2, pad = "0"),
str_pad(second(time_utc), width = 2, pad = "0"))) |>
mutate(
time_stamp,
url = as.character(str_glue(
"https://www.vendeeglobe.org/sites/default/files/ranking/",
"vendeeglobe_leaderboard_{date}_{time}.xlsx")),
.keep = "none")
}
df_excel_urls <- excel_urls(df_times)
Let’s get a quick overview of the data:
glimpse(df_excel_urls)
Rows: 321
Columns: 2
$ time_stamp <dttm> 2025-01-03 11:00:00, 2025-01-03 07:00:00, 2025-0…
$ url <chr> "https://www.vendeeglobe.org/sites/default/files/…
Extract the URLs from the above data frame:
urls_excel_files <- pull(df_excel_urls, url)
urls_excel <- urls_excel_files
Map over all the URLs performing the following tasks:
excel <- function(excel_url) {
class_req <- req_throttle(request(excel_url), rate = 12 / 60)
class_down <- req_perform(class_req, path = file_temp())
read_excel(class_down$body[[1]], skip = 4) |>
clean_names() |>
mutate(url = excel_url)
}
df_excel <- map_df(urls_excel, \(x) excel(x))
As an end result we get all the information from the excel files in one data frame:
glimpse(df_excel)
Rows: 14,124
Columns: 21
$ x1 <chr> "1", "2", "3", "4", "5", "6", "7", "8",…
$ x2 <chr> "FRAFRA 79", "FRAFRA 24", "FRAFRA 112",…
$ x3 <chr> "Charlie DalinMACIF Santé Prévoyance", …
$ heure_fr_hour_fr <chr> "10:30 FR\r\n", "10:30 FR\r\n", "10:30 …
$ latitude_latitude <chr> "16°54.61'S", "17°48.80'S", "19°55.56'S…
$ longitude_longitude <chr> "29°42.91'W", "30°02.87'W", "38°07.08'W…
$ cap_heading_7 <chr> "11°", "357°", "69°", "80°", "4°", "73°…
$ vitesse_speed_8 <chr> "12.3 kts", "11.2 kts", "11.5 kts", "13…
$ vmg_vmg_9 <chr> "12.3 kts", "10.5 kts", "7.9 kts", "7.9…
$ distance_distance_10 <chr> "6.2 nm", "5.3 nm", "5.8 nm", "6.7 nm",…
$ cap_heading_11 <chr> "11°", "1°", "69°", "53°", "14°", "40°"…
$ vitesse_speed_12 <chr> "11.3 kts", "10.4 kts", "11.0 kts", "9.…
$ vmg_vmg_13 <chr> "11.3 kts", "10.0 kts", "7.6 kts", "8.5…
$ distance_distance_14 <chr> "45.3 nm", "41.5 nm", "44.0 nm", "38.1 …
$ cap_heading_15 <chr> "1°", "355°", "29°", "31°", "9°", "1°",…
$ vitesse_speed_16 <chr> "7.2 kts", "6.1 kts", "7.7 kts", "8.7 k…
$ vmg_vmg_17 <chr> "7.0 kts", "5.7 kts", "7.6 kts", "8.7 k…
$ distance_distance_18 <chr> "173.5 nm", "145.6 nm", "183.9 nm", "20…
$ x19 <chr> "4152.1 nm", "4209.5 nm", "4482.5 nm", …
$ x20 <chr> "0.0 nm", "57.4 nm", "330.4 nm", "1328.…
$ url <chr> "https://www.vendeeglobe.org/sites/defa…
Define columns that hold KPI information:
kpi_cols <- c(vitesse = "vitesse_speed_8", vmg = "vmg_vmg_9",
cap_heading = "cap_heading_7", distance_to_finish = "x19",
distance_to_leader = "x20")
Preprocess values from excel. Parse latitude / longitude data with the help of the ‘parzer’ (Chamberlain and Sagouis (2021)) package. After various additional data wrangling steps, turn into a ‘sf’ object (Pebesma (2018)):
class_vendee <- function(df_excel, df_excel_urls, kpi_cols) {
df_class <- df_excel |>
filter(!is.na(latitude_latitude)) |>
mutate(
id = x1, name = x3, url,
rank = as.integer(x1),
nationality = str_trim(str_remove(x2, "\\d+")),
lat = parse_lat(latitude_latitude),
lon = parse_lon(longitude_longitude),
across(all_of(kpi_cols), \(x) parse_number(x)), .keep = "none") |>
left_join(df_excel_urls, by = join_by(url), relationship = "many-to-one")
df_class |>
verify(is_uniq(time_stamp, id)) |>
assert(not_na, everything()) |>
st_as_sf(coords = c("lon", "lat"), crs = st_crs(4326))
}
sf_class_vendee <- class_vendee(df_excel, df_excel_urls, kpi_cols)
Turn from ‘point’ data into ‘line’ data. Use the str_wrap_dateline
function for better plotting in the next step. Based on the position of
the competitor we either get a ‘LINESTRING’ or ‘MULTILINESTRING’:
class_vendee_line <- function(sf_class_vendee) {
sf_class_vendee |>
group_by(name) |>
summarise(geometry = st_combine(geometry), .groups = "drop") |>
st_cast("LINESTRING") |>
st_wrap_dateline()
}
sf_class_vendee_line <- class_vendee_line(sf_class_vendee)
Determine the current leader of the race:
We want to take a closer look at the current leader of the race and Boris Herrmann. Filter the ‘sf’ object for these teams. Once for the most current point in time and once for all points in time:
Before plotting the data, we need to download the world data. Do this in the following chunk (Massicotte and South (2023), South, Michael, and Massicotte (2024)):
sf_world <- ne_countries(scale = "medium", returnclass = "sf")
Define plotting function:
vis_class_vendee_line <- function(sf_class_vendee_line,
sf_class_vendee_highlight,
sf_class_vendee_line_hightlight, sf_world) {
vendee_bbox <- st_bbox(sf_class_vendee_line)
ggplot(data = sf_world, aes(group = name)) +
geom_sf() +
geom_sf(data = sf_class_vendee_line, alpha = 0.2) +
geom_sf(
data = sf_class_vendee_line_hightlight, mapping = aes(color = name)) +
geom_sf(
data = sf_class_vendee_highlight, mapping = aes(color = name)) +
coord_sf(
xlim = c(vendee_bbox[["xmin"]], vendee_bbox[["xmax"]]),
ylim = c(vendee_bbox[["ymin"]], vendee_bbox[["ymax"]])) +
theme_minimal() +
theme(legend.position = "bottom") +
labs(
title = "Vendeé Globe 2024",
subtitle = str_glue(
"Last known Timestamp: {unique(sf_class_vendee_highlight$time_stamp)}"),
color = "Name / Team")
}
gg_class_vendee_line <- vis_class_vendee_line(sf_class_vendee_line, sf_class_vendee_highlight,
sf_class_vendee_line_hightlight, sf_world)
We can now take a look at the current standings:
gg_class_vendee_line
And there you go! All the available data about the race plotted on a world map. As you can see ‘Boris Herrmann’ is behind the leading group by quite a margin. But (fingers crossed) 4th place seems to be in reach. Let’s see what happens in the final stretch of the race by updating our targets pipeline regurlarly in the coming days :-).
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://codeberg.org/duju211/vendee_globe, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".
For attribution, please cite this work as
During (2025, Jan. 3). Datannery: Vendée Globe. Retrieved from https://www.datannery.com/posts/vende-globe/
BibTeX citation
@misc{during2025vendée, author = {During, Julian}, title = {Datannery: Vendée Globe}, url = {https://www.datannery.com/posts/vende-globe/}, year = {2025} }