feat: Add ODIAC dataset #2771

reach-junaidsyed · 2025-05-04T05:17:46Z

Description

This PR adds support for the ODIAC (Open-source Data Inventory for Anthropogenic CO2) fossil fuel CO2 emissions dataset to TorchGeo.

Motivation

The ODIAC dataset provides valuable high-resolution (1km, monthly) global CO2 emission data, which is crucial for climate change research, atmospheric modeling, and environmental machine learning applications. Adding it to TorchGeo makes this important dataset easily accessible within the PyTorch ecosystem for geospatial tasks.

Dataset Details

Source: ODIAC Project at NIES, Japan
Data: Monthly global fossil fuel CO2 emissions (tonnes C / year / cell)
Format: 1km GeoTIFF files (.tif), downloaded as individual compressed archives (.tif.gz) per month.
Resolution: ~1km (0.00833 degrees)
CRS: EPSG:4326 (WGS 84)
License: CC BY 4.0 (Based on their terms)

Implementation

Created a new dataset class torchgeo.datasets.ODIAC inheriting from torchgeo.datasets.RasterDataset.
Supports selecting a specific ODIAC version (e.g., 2023, 2022), years, and months. Defaults to the latest supported version and all available years/months for that version.
Implements automated downloading (download=True) of the required monthly .tif.gz files directly from the source URLs.
Includes MD5 checksum verification (checksum=True) for downloaded .tif.gz files (checksums populated based on provided lists).
Handles automatic extraction (gzip) of downloaded files into a structured directory format (<root>/<year>/<filename.tif>).
Uses the standard RasterDataset machinery for file indexing (R-tree), querying (__getitem__), CRS/resolution handling, and caching. The index is built manually in _build_index to handle the specific file structure and date parsing.
Includes a plot method using the 'magma' colormap for visualizing emission intensity.
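
For reviewers who want to picture the extraction step, here is a minimal standard-library sketch of decompressing one monthly archive into the <root>/<year>/<filename.tif> layout described above. The function name `extract_monthly_archive` and its signature are hypothetical and not taken from the PR's code; the actual implementation in torchgeo/datasets/odiac.py may differ.

```python
import gzip
import shutil
from pathlib import Path


def extract_monthly_archive(archive: Path, root: Path, year: int) -> Path:
    """Decompress one .tif.gz archive into <root>/<year>/<filename.tif>.

    Hypothetical helper for illustration only; not the PR's actual code.
    """
    target_dir = root / str(year)
    target_dir.mkdir(parents=True, exist_ok=True)
    # Dropping the final ".gz" suffix leaves the GeoTIFF filename.
    target = target_dir / archive.with_suffix("").name
    if not target.exists():  # skip files extracted on a previous run
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

Streaming with `shutil.copyfileobj` avoids loading a whole global raster into memory, which matters for 1km global grids.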

Checklist

Implemented the dataset extending RasterDataset (which extends GeoDataset).
Added the dataset definition to torchgeo/datasets/odiac.py.
Added an import alias (from .odiac import ODIAC) and added ODIAC to __all__ in torchgeo/datasets/__init__.py.
Added a tests/data/odiac/data.py script that generates fake test data with the correct directory structure and filenames.
Added appropriate unit tests in tests/datasets/test_odiac.py.
Added the dataset to docs/api/datasets.rst under the heading ODIAC CO2 Emissions.
Added the dataset metadata to docs/api/datasets/geo_datasets.csv.
Ran linters (pre-commit run --all-files) and addressed issues.
Built documentation locally (cd docs && make html) and verified rendering.

Known Issues / Limitations

The md5s dictionary currently contains checksums primarily for the ODIAC2023 and ODIAC2022 monthly files. Checksums for all historical versions/years/months are not included but can be added if required. The download will proceed without checksum verification if a specific checksum is missing.
~~Local documentation build (make html) might show unrelated warnings originating from the pytorch-sphinx-theme submodule demo files.~~
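
The missing-checksum fallback described above can be sketched as follows. This is an illustration under assumptions, not the PR's code: the helper names `md5_of` and `verify_if_known` are hypothetical, and keying the `md5s` dictionary by filename is assumed.

```python
import hashlib
from pathlib import Path
from typing import Mapping


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_if_known(path: Path, md5s: Mapping[str, str]) -> bool:
    """Return True if verified, False if no checksum is recorded.

    Raises RuntimeError on a mismatch. Assumes md5s is keyed by filename.
    """
    expected = md5s.get(path.name)
    if expected is None:
        return False  # no checksum recorded: proceed without verification
    if md5_of(path) != expected:
        raise RuntimeError(f"MD5 mismatch for {path.name}")
    return True
```

Returning False rather than silently passing lets the caller log which files went unverified.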

References

Dataset Homepage: https://db.cger.nies.go.jp/dataset/ODIAC/
Oda, T., & Maksyutov, S. (2011). Atmospheric Chemistry and Physics, 11(2), 543-556. https://doi.org/10.5194/acp-11-543-2011
Oda, T., et al. (2018). Earth System Science Data, 10(1), 87-107. https://doi.org/10.5194/essd-10-87-2018

reach-junaidsyed · 2025-05-04T05:35:07Z

@microsoft-github-policy-service agree

adamjstewart · 2025-05-04T09:48:48Z

Thanks for the PR!

A few thoughts on first glance:

I don't know if it's worth listing the MD5 of every single file. If the dataset is only distributed one file at a time, it might not be worth trying to checksum.
There seems to be a lot of custom RasterDataset logic here where you are manually populating your R-tree index. Note that in GeoDataset: rtree -> geopandas #2747, we are replacing R-tree with geopandas, so most of this logic would need to change. Furthermore, it isn't clear to me if any of this custom logic is actually needed. Is there a reason why the example in https://torchgeo.readthedocs.io/en/latest/tutorials/custom_raster_dataset.html isn't sufficient for your dataset?
There are quite a lot of lines of code that aren't currently covered by unit tests. Many of these are custom error handling logic that could be removed once you conform to the existing RasterDataset definition. Let me know if you need help with other parts of code coverage.
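
For context, the tutorial's approach relies on declarative class attributes such as filename_glob, filename_regex, and date_format rather than a hand-built index. The sketch below uses only the standard library to show how a regex with a named date group and a strptime format could recover the month from a filename. The filename pattern is an assumption for illustration and should be checked against the real ODIAC file names; `parse_month` is a hypothetical helper, not a TorchGeo function.

```python
import re
from datetime import datetime

# Assumed ODIAC-style naming (hypothetical): odiac<version>_1km_excl_intl_<YYMM>.tif
filename_glob = "odiac*_1km_excl_intl_*.tif"
filename_regex = r"^odiac(?P<version>\d{4})_1km_excl_intl_(?P<date>\d{4})\.tif$"
date_format = "%y%m"  # two-digit year followed by two-digit month


def parse_month(filename: str) -> datetime:
    """Return the first day of the month encoded in the filename."""
    match = re.match(filename_regex, filename)
    if match is None:
        raise ValueError(f"Unexpected filename: {filename}")
    return datetime.strptime(match.group("date"), date_format)
```

If the real filenames fit a pattern like this, the same three values could be set as class attributes on a RasterDataset subclass, leaving index construction to the base class.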

Will review in more detail once these major points have been addressed.

reach-junaidsyed · 2025-05-04T20:59:26Z

Thanks for the comments. I'll look into it!

Junaid Syed added 2 commits May 3, 2025 21:51

feat: Add ODIAC dataset

75d46f3

feat: Add ODIAC dataset

d6c17fa

github-actions bot added the documentation, datasets, and testing labels May 4, 2025

adamjstewart added this to the 0.8.0 milestone May 4, 2025
