Datannery: Vendée Globe

Julian During

Idea

Vendeé Globe is a famous single-handed non-stop yacht race across the globe. Last year ‘Boris Herrman’ was in the news in Germany for ‘almost’ winning the ‘Vendée Globe’.

This year ‘Boris Herrman’ is one of the favorites for this years race. Let’s plot his progress in this years race compared to the other competitors.

Data

In this post the following libraries are used. Most functions originate from the ‘tidyverse’ (Wickham et al. (2019)) suite of packages. Also behind the scenes ‘targets’ (Landau (2021b), Landau (2021a)) is used to manage the workflow. If you want to reproduce this analysis, you either go through the post step by step or perform the following steps:

Clone the repository
Run renv::restore() (Ushey and Wickham (2024))
Run targets::tar_make()

Other packages will be cited where they are used. To avoid nasty conflicts, resolve them (Wickham (2023)) right here at the start.

library(rnaturalearthdata)
library(rnaturalearth)
library(tarchetypes)
library(conflicted)
library(tidyverse)
library(jsonlite)
library(janitor)
library(targets)
library(assertr)
library(parzer)
library(readxl)
library(withr)
library(rvest)
library(httr2)
library(renv)
library(fs)
library(sf)

conflicts_prefer(dplyr::filter)

The data lives as several .xlsx files under the following location:

base_url <- "https://www.vendeeglobe.org/classement"

On this site there is a dropdown with all available “versions”. They are listed inside a json element. Detect this element and extract them from the parsed json (Ooms (2014)):

selector <- function(base_url) {
  sel_json <- read_html(base_url) |>
    html_element(".selector") |>
    html_attr("x-data") |>
    str_extract("\\[\\{.+\\}\\]")

  as_tibble(parse_json(sel_json, simplifyVector = TRUE))
}

df_selector <- selector(base_url)
glimpse(df_selector)

Rows: 324
Columns: 2
$ id    <chr> "560", "559", "557", "556", "555", "554", "552", "551"…
$ label <chr> "03 janvier 2025 - 11h00 (FR)", "03 janvier 2025 - 07h…

Some of these versions are in fact defunct. Filter those versions.

df_selector_filtered <- filter(df_selector, !id %in% c("74", "94", "96"))

Links to the classification excel files are encoded in the above mentioned labels. Extract the time stamp from these labels. Before doing so, make sure that every id is unique (Fischetti (2023)):

times <- function(df_selector) {
  df_selector |>
    assert(is_uniq, id) |>
    mutate(
      label = str_remove(label, " \\(FR\\)$"),
      time_stamp = dmy_hm(label, tz = "Europe/Paris"),
      time_utc = as_datetime(time_stamp, tz = "UTC"),
      .keep = "unused")
}

Sometimes the parsing of the timestamps is resulting in different results. This is because of the different naming of the locale on the operating system. Define locale based on operating system used:

loc <- if (Sys.info()[["sysname"]] == "Windows") {
     "fr"
 } else {
     "fr_FR.UTF-8"
 }

Execute the function with the correct locale. Set the locale temporarily (Hester et al. (2024)) to the one defined above:

df_times <- with_locale(c(LC_TIME = loc), times(df_selector_filtered))

Now we can build the URLs for the classification data:

excel_urls <- function(df_times) {
  df_times |>
    assert(not_na, time_utc) |>
    mutate(
      date = format(date(time_utc), "%Y%m%d"),
      time = str_c(
        str_pad(hour(time_utc), width = 2, pad = "0"),
        str_pad(minute(time_utc), width = 2, pad = "0"),
        str_pad(second(time_utc), width = 2, pad = "0"))) |>
    mutate(
      time_stamp,
      url = as.character(str_glue(
        "https://www.vendeeglobe.org/sites/default/files/ranking/",
        "vendeeglobe_leaderboard_{date}_{time}.xlsx")),
      .keep = "none")
}

df_excel_urls <- excel_urls(df_times)

Let’s get a quick overview of the data:

glimpse(df_excel_urls)

Rows: 321
Columns: 2
$ time_stamp <dttm> 2025-01-03 11:00:00, 2025-01-03 07:00:00, 2025-0…
$ url        <chr> "https://www.vendeeglobe.org/sites/default/files/…

Extract the URLs from the above data frame:

urls_excel_files <- pull(df_excel_urls, url)
urls_excel <- urls_excel_files

Map over all the URLs performing the following tasks:

Download (Wickham (2024)) the excel file
- Ensure that repeated calls never exceed the rate defined in ‘robots.txt’
Read in the Excel (Wickham and Bryan (2023)) file
Clean column names (Firke (2023)) for better handling later on

excel <- function(excel_url) {
  class_req <- req_throttle(request(excel_url), rate = 12 / 60)

  class_down <- req_perform(class_req, path = file_temp())

  read_excel(class_down$body[[1]], skip = 4) |>
    clean_names() |>
    mutate(url = excel_url)
}

df_excel <- map_df(urls_excel, \(x) excel(x))

As an end result we get all the information from the excel files in one data frame:

glimpse(df_excel)

Rows: 14,124
Columns: 21
$ x1                   <chr> "1", "2", "3", "4", "5", "6", "7", "8",…
$ x2                   <chr> "FRAFRA 79", "FRAFRA 24", "FRAFRA 112",…
$ x3                   <chr> "Charlie DalinMACIF Santé Prévoyance", …
$ heure_fr_hour_fr     <chr> "10:30 FR\r\n", "10:30 FR\r\n", "10:30 …
$ latitude_latitude    <chr> "16°54.61'S", "17°48.80'S", "19°55.56'S…
$ longitude_longitude  <chr> "29°42.91'W", "30°02.87'W", "38°07.08'W…
$ cap_heading_7        <chr> "11°", "357°", "69°", "80°", "4°", "73°…
$ vitesse_speed_8      <chr> "12.3 kts", "11.2 kts", "11.5 kts", "13…
$ vmg_vmg_9            <chr> "12.3 kts", "10.5 kts", "7.9 kts", "7.9…
$ distance_distance_10 <chr> "6.2 nm", "5.3 nm", "5.8 nm", "6.7 nm",…
$ cap_heading_11       <chr> "11°", "1°", "69°", "53°", "14°", "40°"…
$ vitesse_speed_12     <chr> "11.3 kts", "10.4 kts", "11.0 kts", "9.…
$ vmg_vmg_13           <chr> "11.3 kts", "10.0 kts", "7.6 kts", "8.5…
$ distance_distance_14 <chr> "45.3 nm", "41.5 nm", "44.0 nm", "38.1 …
$ cap_heading_15       <chr> "1°", "355°", "29°", "31°", "9°", "1°",…
$ vitesse_speed_16     <chr> "7.2 kts", "6.1 kts", "7.7 kts", "8.7 k…
$ vmg_vmg_17           <chr> "7.0 kts", "5.7 kts", "7.6 kts", "8.7 k…
$ distance_distance_18 <chr> "173.5 nm", "145.6 nm", "183.9 nm", "20…
$ x19                  <chr> "4152.1 nm", "4209.5 nm", "4482.5 nm", …
$ x20                  <chr> "0.0 nm", "57.4 nm", "330.4 nm", "1328.…
$ url                  <chr> "https://www.vendeeglobe.org/sites/defa…

Define columns that hold KPI information:

kpi_cols <- c(vitesse = "vitesse_speed_8", vmg = "vmg_vmg_9", 
     cap_heading = "cap_heading_7", distance_to_finish = "x19", 
     distance_to_leader = "x20")

Preprocess values from excel. Parse latitude / longitude data with the help of the ‘parzer’ (Chamberlain and Sagouis (2021)) package. After various additional data wrangling steps, turn into a ‘sf’ object (Pebesma (2018)):

class_vendee <- function(df_excel, df_excel_urls, kpi_cols) {
  df_class <- df_excel |>
    filter(!is.na(latitude_latitude)) |>
    mutate(
      id = x1, name = x3, url,
      rank = as.integer(x1),
      nationality = str_trim(str_remove(x2, "\\d+")),
      lat = parse_lat(latitude_latitude),
      lon = parse_lon(longitude_longitude),
      across(all_of(kpi_cols), \(x) parse_number(x)), .keep = "none") |>
    left_join(df_excel_urls, by = join_by(url), relationship = "many-to-one")

  df_class |>
    verify(is_uniq(time_stamp, id)) |>
    assert(not_na, everything()) |>
    st_as_sf(coords = c("lon", "lat"), crs = st_crs(4326))
}

sf_class_vendee <- class_vendee(df_excel, df_excel_urls, kpi_cols)

Turn from ‘point’ data into ‘line’ data. Use the str_wrap_dateline function for better plotting in the next step. Based on the position of the competitor we either get a ‘LINESTRING’ or ‘MULTILINESTRING’:

class_vendee_line <- function(sf_class_vendee) {
  sf_class_vendee |>
    group_by(name) |>
    summarise(geometry = st_combine(geometry), .groups = "drop") |>
    st_cast("LINESTRING") |>
    st_wrap_dateline()
}

sf_class_vendee_line <- class_vendee_line(sf_class_vendee)

Determine the current leader of the race:

df_leader <- filter(sf_class_vendee, time_stamp == max(time_stamp), 
     rank == min(rank))

We want to take a closer look at the current leader of the race and Boris Herrmann. Filter the ‘sf’ object for these teams. Once for the most current point in time and once for all points in time:

highlight_teams <- c("Boris HerrmannMalizia - Seaexplorer", df_leader[["name"]])
sf_class_vendee_highlight <- filter(sf_class_vendee, name %in% highlight_teams, 
     time_stamp == max(time_stamp))
sf_class_vendee_line_hightlight <- filter(sf_class_vendee_line, name %in% highlight_teams)

Before plotting the data, we need to download the world data. Do this in the following chunk (Massicotte and South (2023), South, Michael, and Massicotte (2024)):

sf_world <- ne_countries(scale = "medium", returnclass = "sf")

Define plotting function:

vis_class_vendee_line <- function(sf_class_vendee_line,
                                  sf_class_vendee_highlight,
                                  sf_class_vendee_line_hightlight, sf_world) {
  vendee_bbox <- st_bbox(sf_class_vendee_line)

  ggplot(data = sf_world, aes(group = name)) +
    geom_sf() +
    geom_sf(data = sf_class_vendee_line, alpha = 0.2) +
    geom_sf(
      data = sf_class_vendee_line_hightlight, mapping = aes(color = name)) +
    geom_sf(
      data = sf_class_vendee_highlight, mapping = aes(color = name)) +
    coord_sf(
      xlim = c(vendee_bbox[["xmin"]], vendee_bbox[["xmax"]]),
      ylim = c(vendee_bbox[["ymin"]], vendee_bbox[["ymax"]])) +
    theme_minimal() +
    theme(legend.position = "bottom") +
    labs(
      title = "Vendeé Globe 2024",
      subtitle = str_glue(
        "Last known Timestamp: {unique(sf_class_vendee_highlight$time_stamp)}"),
      color = "Name / Team")
}

gg_class_vendee_line <- vis_class_vendee_line(sf_class_vendee_line, sf_class_vendee_highlight, 
     sf_class_vendee_line_hightlight, sf_world)

Result

We can now take a look at the current standings:

gg_class_vendee_line

And there you go! All the available data about the race plotted on a world map. As you can see ‘Boris Herrmann’ is behind the leading group by quite a margin. But (fingers crossed) 4th place seems to be in reach. Let’s see what happens in the final stretch of the race by updating our targets pipeline regurlarly in the coming days :-).

Chamberlain, Scott, and Alban Sagouis. 2021. Parzer: Parse Messy Geographic Coordinates. https://CRAN.R-project.org/package=parzer.

Firke, Sam. 2023. Janitor: Simple Tools for Examining and Cleaning Dirty Data. https://CRAN.R-project.org/package=janitor.

Fischetti, Tony. 2023. Assertr: Assertive Programming for r Analysis Pipelines. https://CRAN.R-project.org/package=assertr.

Hester, Jim, Lionel Henry, Kirill Müller, Kevin Ushey, Hadley Wickham, and Winston Chang. 2024. Withr: Run Code ’with’ Temporarily Modified Global State. https://CRAN.R-project.org/package=withr.

Landau, William Michael. 2021a. Tarchetypes: Archetypes for Targets.

———. 2021b. “The Targets r Package: A Dynamic Make-Like Function-Oriented Pipeline Toolkit for Reproducibility and High-Performance Computing.” Journal of Open Source Software 6 (57): 2959. https://doi.org/10.21105/joss.02959.

Massicotte, Philippe, and Andy South. 2023. Rnaturalearth: World Map Data from Natural Earth. https://CRAN.R-project.org/package=rnaturalearth.

Ooms, Jeroen. 2014. “The Jsonlite Package: A Practical and Consistent Mapping Between JSON Data and r Objects.” arXiv:1403.2805 [Stat.CO]. https://arxiv.org/abs/1403.2805.

Pebesma, Edzer. 2018. “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal 10 (1): 439–46. https://doi.org/10.32614/RJ-2018-009.

South, Andy, Schramm Michael, and Philippe Massicotte. 2024. Rnaturalearthdata: World Vector Map Data from Natural Earth Used in ’Rnaturalearth’. https://CRAN.R-project.org/package=rnaturalearthdata.

Ushey, Kevin, and Hadley Wickham. 2024. Renv: Project Environments. https://CRAN.R-project.org/package=renv.

Wickham, Hadley. 2023. Conflicted: An Alternative Conflict Resolution Strategy. https://CRAN.R-project.org/package=conflicted.

———. 2024. Httr2: Perform HTTP Requests and Process the Responses. https://CRAN.R-project.org/package=httr2.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, and Jennifer Bryan. 2023. Readxl: Read Excel Files. https://CRAN.R-project.org/package=readxl.

Vendée Globe

Idea

Data

Result

References

Corrections

Reuse

Citation