| Title: | Access and Query French Government Open Data from data.gouv.fr |
|---|---|
| Description: | Provides functions to search, retrieve metadata, and download datasets from https://data.gouv.fr, the official French government open data portal. Includes tools for querying datasets with filtering capabilities, automatic caching of downloaded resources, and flexible access methods using both direct CSV downloads and the data.gouv.fr tabular API. |
| Authors: | David Dorchies [aut, cre] (ORCID: <https://orcid.org/0000-0002-6595-7984>) |
| Maintainer: | David Dorchies <[email protected]> |
| License: | file LICENSE |
| Version: | 0.1.0 |
| Built: | 2026-05-25 14:52:50 UTC |
| Source: | https://forge.inrae.fr/umr-g-eau/datagouvfr |
Convert list provided by the APIs into a tibble

    convert_list_to_tibble(l)

| Argument | Description |
|---|---|
| l | a [list] provided by the API (see [query_api]) |
This function is used internally by all the data retrieval functions to convert the data returned by a call to [query_api].
A [tibble::tibble] with one row per record and one column per field.
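The "one row per record, one column per field" layout can be illustrated with a minimal base R sketch. It is not the package's implementation: the `records` list, its field names and its values are hypothetical, and a plain data frame stands in for the tibble.

```r
# Hypothetical API payload: a list holding one named list per record.
records <- list(
  list(LAMBX = 2760, LAMBY = 18120, value = 12.4),
  list(LAMBX = 2840, LAMBY = 18200, value = 11.9)
)

# Turn each record into a one-row data frame, then stack the rows.
rows <- lapply(records, function(rec) as.data.frame(rec, stringsAsFactors = FALSE))
df <- do.call(rbind, rows)

df
```

If the tibble package is installed, `tibble::as_tibble(df)` converts the result to a tibble.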

    # Get last meteo data around Espelette from the API (Lambert coords are in hm)
    df <- query_api(resource_id = get_latest_sim2_resource_id(),
                    LAMBX__greater = 2750,
                    LAMBX__less = 3040,
                    LAMBY__greater = 18100,
                    LAMBY__less = 18400)
    df

    # Get last meteo data around Espelette from the CSV file (Lambert coords are in hm)
    df <- query_csv(resource_id = get_latest_sim2_resource_id(),
                    LAMBX__greater = 2750,
                    LAMBX__less = 3040,
                    LAMBY__greater = 18100,
                    LAMBY__less = 18400)
    df

    # Get last meteo data around Espelette from the API if available (Lambert coords are in hm)
    df <- query(resource_metadata = get_latest_sim2_resource_id(metadata = TRUE),
                LAMBX__greater = 2750,
                LAMBX__less = 3040,
                LAMBY__greater = 18100,
                LAMBY__less = 18400)
    df

Download and filter data depending on their available format:

    download_resource(
      resource_id,
      resource_metadata = get_resource_metadata(resource_id),
      url_pattern = "https://www.data.gouv.fr/fr/datasets/r/%s",
      cache_dir = Sys.getenv("DATAGOUVFR_CACHE_DIR", file.path(dirname(tempdir()), "datagouvfr")),
      force_download = FALSE
    )

    query(resource_metadata, ..., force_download = FALSE)

    query_api(
      resource_id,
      ...,
      url_pattern = "https://tabular-api.data.gouv.fr/api/resources/%s/data/",
      raw_format = FALSE
    )

    query_csv(
      resource_id,
      resource_metadata = get_resource_metadata(resource_id),
      ...,
      url_pattern = "https://www.data.gouv.fr/fr/datasets/r/%s",
      cache_dir = Sys.getenv("DATAGOUVFR_CACHE_DIR", file.path(dirname(tempdir()), "datagouvfr")),
      force_download = FALSE
    )

| Argument | Description |
|---|---|
| resource_id | resource ID (see [get_resources_id()]) |
| resource_metadata | resource metadata (one item of the list returned by the function [get_resources_metadata]) |
| url_pattern | URL pattern to get data from the API (injected in [sprintf] with the resource ID to complete the URL) |
| cache_dir | folder where resources are downloaded. It uses the value stored in the environment variable 'DATAGOUVFR_CACHE_DIR', or a 'datagouvfr' folder in the system temporary folder if the variable is not defined |
| force_download | force the download instead of using the cache (for 'query_csv') |
| ... | filter parameters (see details) |
| raw_format | if 'TRUE', the API response is returned as a [list] instead of being formatted as a [tibble] |
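The default value of 'cache_dir' relies on the `unset` fallback of `Sys.getenv()`. The sketch below reproduces that expression outside the package to show both cases; the custom folder name is hypothetical.

```r
# Default used in the signatures above: a "datagouvfr" folder located in the
# parent of the R session temporary directory.
default_dir <- file.path(dirname(tempdir()), "datagouvfr")

# Variable not defined: Sys.getenv() falls back to the default.
invisible(Sys.unsetenv("DATAGOUVFR_CACHE_DIR"))
cache_unset <- Sys.getenv("DATAGOUVFR_CACHE_DIR", default_dir)

# Variable defined: its value is used (hypothetical folder for illustration).
Sys.setenv(DATAGOUVFR_CACHE_DIR = file.path(tempdir(), "my-datagouvfr-cache"))
cache_set <- Sys.getenv("DATAGOUVFR_CACHE_DIR", default_dir)

# Clean up so the rest of the session is not affected.
invisible(Sys.unsetenv("DATAGOUVFR_CACHE_DIR"))
```

Setting the variable in `.Renviron` to a persistent folder is one way to keep downloaded files across R sessions, since the temporary folder may be cleaned by the system.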
- 'query_csv': download and cache a tabular file in CSV format and filter it
- 'query_api': directly query the [data.gouv.fr tabular API](https://www.data.gouv.fr/en/dataservices/api-tabulaire-data-gouv-fr-beta/)
- 'query': automatically launch 'query_csv' or 'query_api' depending on the availability of the tabular API given by the resource metadata
'...' are filter parameters that depend on the resource retrieved. Available filters are (replace 'column_name' with the name of the column):
- exact value: 'column_name__exact=value'
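To make the 'column_name__operator' naming concrete, here is a minimal sketch that applies such named arguments to a local data frame. It is not the package's code: `apply_filters` and the `cells` table are hypothetical, only the 'exact', 'greater' and 'less' operators seen in this manual are handled, and treating 'greater' and 'less' as inclusive bounds is an assumption.

```r
# Hypothetical table standing in for a downloaded resource.
cells <- data.frame(
  LAMBX = c(2700, 2760, 2900, 3100),
  LAMBY = c(18150, 18200, 18300, 18350)
)

# Apply filters passed as arguments named "<column>__<operator>".
# Assumption: "greater" and "less" are inclusive bounds in this sketch.
apply_filters <- function(data, ...) {
  filters <- list(...)
  for (key in names(filters)) {
    parts <- strsplit(key, "__", fixed = TRUE)[[1]]
    column <- parts[1]
    operator <- parts[2]
    value <- filters[[key]]
    keep <- switch(
      operator,
      exact = data[[column]] == value,
      greater = data[[column]] >= value,
      less = data[[column]] <= value,
      stop("Unsupported operator in this sketch: ", operator)
    )
    data <- data[keep, , drop = FALSE]
  }
  data
}

selected <- apply_filters(cells, LAMBX__greater = 2750, LAMBX__less = 3040)
selected
```

Filters are applied one after the other, so several conditions on the same column combine into a range.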
'url_pattern' is the URL pattern of the API that data.gouv.fr itself queries to display the resources. It is injected in [sprintf] with the resource ID to complete the URL.
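The injection of the resource ID into 'url_pattern' can be reproduced with `sprintf()` alone. In the sketch below the resource ID is a hypothetical placeholder, and passing the filters as query-string parameters is an assumption about how the tabular API is called, not a description of the package internals.

```r
# Default pattern of query_api(), taken from the usage section.
url_pattern <- "https://tabular-api.data.gouv.fr/api/resources/%s/data/"

# Hypothetical placeholder, not a real resource ID.
resource_id <- "my-resource-id"

# "%s" is replaced by the resource ID.
url <- sprintf(url_pattern, resource_id)

# Assumption: filters are sent as "name=value" pairs in the query string.
filters <- list(LAMBX__greater = 2750, LAMBX__less = 3040)
query_string <- paste(names(filters), unlist(filters), sep = "=", collapse = "&")
full_url <- paste0(url, "?", query_string)

full_url
```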
A [tibble] containing the requested data, or a [list] if 'query_api' is called with 'raw_format = TRUE'.

    # Get last meteo data around Espelette from the API (Lambert coords are in hm)
    df <- query_api(resource_id = get_latest_sim2_resource_id(),
                    LAMBX__greater = 2750,
                    LAMBX__less = 3040,
                    LAMBY__greater = 18100,
                    LAMBY__less = 18400)
    df

    # Get last meteo data around Espelette from the CSV file (Lambert coords are in hm)
    df <- query_csv(resource_id = get_latest_sim2_resource_id(),
                    LAMBX__greater = 2750,
                    LAMBX__less = 3040,
                    LAMBY__greater = 18100,
                    LAMBY__less = 18400)
    df

    # Get last meteo data around Espelette from the API if available (Lambert coords are in hm)
    df <- query(resource_metadata = get_latest_sim2_resource_id(metadata = TRUE),
                LAMBX__greater = 2750,
                LAMBX__less = 3040,
                LAMBY__greater = 18100,
                LAMBY__less = 18400)
    df

This function fetches the dataset id from the web page base_url/dataset.

    get_dataset_id(
      dataset,
      base_url = "https://www.data.gouv.fr/fr/datasets",
      url = file.path(base_url, dataset, "informations")
    )

| Argument | Description |
|---|---|
| dataset | path of the dataset |
| base_url | URL of the data.gouv.fr datasets repository |
| url | complete URL of the dataset page (by default base_url/dataset/informations) |
The dataset ID
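The default value of 'url' is assembled with `file.path()`. The short sketch below evaluates that default expression on its own, using the dataset path from this manual, to show which page is requested; it does not perform any network access.

```r
# Defaults taken from the usage section of get_dataset_id().
base_url <- "https://www.data.gouv.fr/fr/datasets"
dataset <- "donnees-changement-climatique-sim-quotidienne"

# file.path() joins its arguments with "/".
url <- file.path(base_url, dataset, "informations")

url
```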

    # Get the ID of the SIM2 dataset
    get_dataset_id("donnees-changement-climatique-sim-quotidienne")

Get the latest resource id of a dataset

    get_latest_sim2_resource_id(
      resources_metadata = get_resources_metadata(dataset_id),
      dataset_id = "6569b27598256cc583c917a7",
      metadata = FALSE
    )

| Argument | Description |
|---|---|
| resources_metadata | resources metadata in which to look for the latest available resource |
| dataset_id | dataset ID (see [get_dataset_id()]; the SIM2 dataset ID is used by default) |
| metadata | [logical] return the complete resource metadata instead of only the resource ID |
The latest resource ID or metadata [list] depending on 'metadata' argument.
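The effect of the 'metadata' argument on the return value can be sketched on a mock metadata list. Everything here is hypothetical: the `resources` list, its `id` and `last_modified` fields, the `pick_latest` helper, and above all the selection rule (most recent `last_modified`), which is an assumption and may differ from what the package does.

```r
# Hypothetical resources metadata: one named list per resource.
resources <- list(
  list(id = "res-a", last_modified = "2024-01-05T08:00:00"),
  list(id = "res-b", last_modified = "2025-03-10T08:00:00"),
  list(id = "res-c", last_modified = "2023-11-20T08:00:00")
)

# Assumed rule: the latest resource is the most recently modified one.
pick_latest <- function(resources, metadata = FALSE) {
  stamps <- vapply(resources, function(r) r$last_modified, character(1))
  modified <- as.POSIXct(stamps, format = "%Y-%m-%dT%H:%M:%S", tz = "UTC")
  latest <- resources[[which.max(as.numeric(modified))]]
  # Return the whole metadata item or only its ID.
  if (metadata) latest else latest$id
}

latest_id <- pick_latest(resources)
latest_meta <- pick_latest(resources, metadata = TRUE)
```

With `metadata = TRUE` the result is a [list] that can be passed on as 'resource_metadata', as in the [query] example of this manual.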

    get_latest_sim2_resource_id()

Get dataset or resources metadata from dataset ID

    get_resources_metadata(
      dataset_id,
      api_pattern = file.path("https://www.data.gouv.fr/api/2/datasets/%s/resources",
                              "?page=1&type=main&page_size=6&q=")
    )

    get_resource_metadata(
      resource_id,
      api_pattern = "https://www.data.gouv.fr/api/2/datasets/resources/%s/"
    )

    get_dataset_metadata(
      dataset_id,
      api_pattern = "https://www.data.gouv.fr/api/2/datasets/%s/"
    )

| Argument | Description |
|---|---|
| dataset_id | Dataset ID (see [get_dataset_id()]) |
| api_pattern | API pattern to get resources metadata (see details) |
| resource_id | Resource ID |
'api_pattern' is the URL pattern of the API that data.gouv.fr itself queries to display the resources. It is injected in [sprintf] with the dataset ID (or the resource ID for 'get_resource_metadata') to complete the URL.
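The default 'api_pattern' values can be expanded by hand to see which endpoints are requested. The sketch below only builds the URLs, without calling the API, using the SIM2 dataset ID given as default elsewhere in this manual.

```r
# SIM2 dataset ID, taken from the default arguments in this manual.
dataset_id <- "6569b27598256cc583c917a7"

# Default pattern of get_resources_metadata(): endpoint plus query string.
resources_pattern <- file.path(
  "https://www.data.gouv.fr/api/2/datasets/%s/resources",
  "?page=1&type=main&page_size=6&q="
)
resources_url <- sprintf(resources_pattern, dataset_id)

# Default pattern of get_dataset_metadata().
dataset_url <- sprintf("https://www.data.gouv.fr/api/2/datasets/%s/", dataset_id)

resources_url
dataset_url
```

The default query string asks for the first page with a page size of 6, so a custom 'api_pattern' is presumably the place to change these values if a dataset holds more resources.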
A list of metadata

    # Get metadata from SIM2 daily dataset
    dataset_id <- get_dataset_id("donnees-changement-climatique-sim-quotidienne")
    dataset_id
    dataset_metadata <- get_dataset_metadata(dataset_id)
    str(dataset_metadata)
    resources_metadata <- get_resources_metadata(dataset_id)
    str(resources_metadata)

Get SIM2 data from a period and a rectangular window

    get_sim2_data(
      date_start = as.Date("1958-08-01"),
      date_end = Sys.Date(),
      ...,
      sim2_selected_meta = get_sim2_resources_metadata_from_date(
        date_start = date_start,
        date_end = date_end,
        sim2_metadata = get_resources_metadata("6569b27598256cc583c917a7")
      ),
      cache_dir = Sys.getenv("DATAGOUVFR_CACHE_DIR", file.path(dirname(tempdir()), "datagouvfr"))
    )

| Argument | Description |
|---|---|
| date_start | Start date of the period |
| date_end | End date of the period |
| ... | Parameters passed to [query] |
| sim2_selected_meta | A tibble with the metadata of the SIM2 resources to download (see [get_sim2_resources_metadata_from_date]) |
| cache_dir | folder where resources are downloaded. It uses the value stored in the environment variable 'DATAGOUVFR_CACHE_DIR', or a 'datagouvfr' folder in the system temporary folder if the variable is not defined |
Be careful: because of the way the data are structured, each downloaded CSV file contains data for the whole of France. 10 years of data correspond to about 1.1 GB to download. However, these CSV files are downloaded only once and stored in the folder defined by the parameter 'cache_dir'.
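As a rough order of magnitude, the figure above can be extrapolated to a longer period. The sketch assumes that the download size grows linearly with the number of years, which is a simplification: the files are split by period, so the real volume depends on which resources overlap the request. The end date is fixed for illustration.

```r
# From the note above: about 1.1 GB for 10 years of data.
gb_per_year <- 1.1 / 10

# Default start date of get_sim2_data() and an illustrative fixed end date.
date_start <- as.Date("1958-08-01")
date_end <- as.Date("2025-08-01")

# Length of the period in years, then linear extrapolation of the size.
years <- as.numeric(difftime(date_end, date_start, units = "days")) / 365.25
estimated_gb <- years * gb_per_year

round(estimated_gb, 1)
```

Under these assumptions, requesting the whole archive with the default 'date_start' means several gigabytes of downloads on the first call.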
A [tibble] with one row per time step and per cell.

    # Get meteorological data of the last 3 months on Espelette territory
    data <- get_sim2_data(
      date_start = lubridate::`%m-%`(Sys.Date(), months(3)),
      LAMBX__greater = 2750,
      LAMBX__less = 3040,
      LAMBY__greater = 18100,
      LAMBY__less = 18400
    )
    summary(data)

This function is particularly adapted for the SIM2 dataset which has resources classified by periods.
The function [get_sim2_resources_periods] returns the periods corresponding to a list of SIM2 resources.

    get_sim2_resources_metadata_from_date(
      date_start = as.Date("1958-08-01"),
      date_end = Sys.Date(),
      sim2_metadata = get_resources_metadata("6569b27598256cc583c917a7")
    )

    get_sim2_resources_periods(
      sim2_metadata = get_resources_metadata("6569b27598256cc583c917a7")
    )

| Argument | Description |
|---|---|
| date_start | Start date of the period |
| date_end | End date of the period |
| sim2_metadata | Metadata of the SIM2 dataset (see [get_resources_metadata()]; the SIM2 dataset is used by default) |
The selected resource IDs. It also contains an attribute '"periods"' which contains the start and end dates of each resource.
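Selecting resources "classified by periods" amounts to an interval overlap test. The sketch below shows that test on a hypothetical `periods` table (resource names and dates are invented); it is one plausible way to do the selection, not necessarily the one used by the package.

```r
# Hypothetical periods covered by four resources.
periods <- data.frame(
  resource_id = c("res-1958-1959", "res-1980-1989", "res-1990-1999", "res-2020-latest"),
  start = as.Date(c("1958-08-01", "1980-01-01", "1990-01-01", "2020-01-01")),
  end = as.Date(c("1959-12-31", "1989-12-31", "1999-12-31", "2025-06-30")),
  stringsAsFactors = FALSE
)

# Requested period.
date_start <- as.Date("1988-06-01")
date_end <- as.Date("1992-06-01")

# A resource is needed when its period overlaps the requested one:
# it starts before the request ends and ends after the request starts.
overlaps <- periods$start <= date_end & periods$end >= date_start
selected <- periods$resource_id[overlaps]

selected
```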

    # What periods are covered by each SIM2 resource?
    str(get_sim2_resources_periods())

    # Select resources for data since 1990
    metadata <- get_sim2_resources_metadata_from_date(date_start = as.Date("1990-01-01"))
    names(metadata)
    str(lapply(metadata, attr, which = "period"))
