feat: Add ODIAC dataset #2771

Open: 2 commits to merge into base branch main

@reach-junaidsyed commented May 4, 2025

Description

This PR introduces support for the ODIAC (Open-Data Inventory for Anthropogenic Carbon dioxide) Fossil Fuel CO2 Emissions dataset to TorchGeo.

Motivation

The ODIAC dataset provides high-resolution (1km, monthly) global CO2 emission data used in climate change research, atmospheric modeling, and environmental machine learning. Adding it to TorchGeo makes the dataset accessible within the PyTorch ecosystem for geospatial tasks.

Dataset Details

  • Source: ODIAC Project at NIES, Japan
  • Data: Monthly global fossil fuel CO2 emissions (tonnes C / year / cell)
  • Format: 1km GeoTIFF files (.tif), downloaded as individual compressed archives (.tif.gz) per month.
  • Resolution: ~1km (0.00833 degrees)
  • CRS: EPSG:4326 (WGS 84)
  • License: CC BY 4.0 (Based on their terms)
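As a rough check on the resolution figures above, the sketch below converts the cell size to kilometres. It assumes the stated 0.00833 degrees is a rounded 30 arc-seconds and uses a spherical approximation with the WGS 84 equatorial radius; the function name `cell_width_km` and the 360 x 180 degree extent are illustrative assumptions, not part of the PR.

```python
import math

# Assumption: 0.00833 degrees is a rounded 30 arc-seconds (1/120 degree).
ARCSEC_PER_DEGREE = 3600
cell_deg = 30 / ARCSEC_PER_DEGREE

# Length of one degree of longitude at the equator, in km,
# from the WGS 84 equatorial radius (6378.137 km).
KM_PER_DEGREE_EQUATOR = 2 * math.pi * 6378.137 / 360


def cell_width_km(lat_deg: float) -> float:
    """East-west width of one cell at a given latitude (spherical approximation)."""
    return cell_deg * KM_PER_DEGREE_EQUATOR * math.cos(math.radians(lat_deg))


# Number of cells needed to cover a full 360 x 180 degree extent (assumed extent).
cols = round(360 / cell_deg)
rows = round(180 / cell_deg)

print(f"cell size: {cell_deg:.5f} degrees")
print(f"width at equator: {cell_width_km(0):.3f} km")
print(f"width at 60N: {cell_width_km(60):.3f} km")
print(f"grid: {cols} x {rows} cells")
```

The "~1km" label therefore holds near the equator; in EPSG:4326 the east-west cell width shrinks with the cosine of latitude.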

Implementation

  • Created a new dataset class torchgeo.datasets.ODIAC inheriting from torchgeo.datasets.RasterDataset.
  • Supports selecting a specific ODIAC version (e.g., 2023, 2022), years, and months. Defaults to the latest supported version and all available years/months for that version.
  • Implements automated downloading (download=True) of the required monthly .tif.gz files directly from the source URLs.
  • Includes MD5 checksum verification (checksum=True) for downloaded .tif.gz files (checksums populated based on provided lists).
  • Handles automatic extraction (gzip) of downloaded files into a structured directory format (<root>/<year>/<filename.tif>).
  • Uses the standard RasterDataset machinery for file indexing (R-tree), querying (__getitem__), CRS/resolution handling, and caching. The index is built manually in _build_index to handle the specific file structure and date parsing.
  • Includes a plot method using the 'magma' colormap for visualizing emission intensity.
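The extraction step described above could look roughly like the following. This is a minimal standard-library sketch, not the code in the PR: the helper name `extract_month` and the filename pattern (a trailing YYMM stamp before `.tif.gz`) are assumptions made for illustration.

```python
import gzip
import re
import shutil
from datetime import datetime
from pathlib import Path

# Hypothetical filename pattern: a trailing YYMM stamp before ".tif.gz".
# This layout is an assumption for illustration, not taken from the PR.
FILENAME_RE = re.compile(r"^(?P<stem>.+_(?P<date>\d{4}))\.tif\.gz$")


def extract_month(archive: Path, root: Path) -> Path:
    """Decompress one .tif.gz archive into <root>/<year>/<filename.tif>."""
    match = FILENAME_RE.match(archive.name)
    if match is None:
        raise ValueError(f"unexpected archive name: {archive.name}")
    # "%y%m" maps a two-digit year such as "22" to 2022.
    year = datetime.strptime(match["date"], "%y%m").year
    target = root / str(year) / f"{match['stem']}.tif"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the decompressed bytes so a large raster never sits fully in memory.
    with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

Streaming with `shutil.copyfileobj` keeps memory use flat, which matters for global 1km rasters.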

Checklist

  • Implemented the dataset extending RasterDataset (which extends GeoDataset).
  • Added the dataset definition to torchgeo/datasets/odiac.py.
  • Added an import alias (from .odiac import ODIAC) and added ODIAC to __all__ in torchgeo/datasets/__init__.py.
  • Added a tests/data/odiac/data.py script that generates fake test data with the correct directory structure and filenames.
  • Added appropriate unit tests in tests/datasets/test_odiac.py.
  • Added the dataset to docs/api/datasets.rst under the heading ODIAC CO2 Emissions.
  • Added the dataset metadata to docs/api/datasets/geo_datasets.csv.
  • Ran linters (pre-commit run --all-files) and addressed issues.
  • Built documentation locally (cd docs && make html) and verified rendering.

Known Issues / Limitations

  • The md5s dictionary currently contains checksums primarily for the ODIAC2023 and ODIAC2022 monthly files. Checksums for all historical versions/years/months are not included but can be added if required. The download will proceed without checksum verification if a specific checksum is missing.
  • Local documentation build (make html) might show unrelated warnings originating from the pytorch-sphinx-theme submodule demo files.
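The "proceed without verification when a checksum is missing" behavior from the first item can be sketched as below. The names `MD5S`, `md5_of`, and `verify` are hypothetical and only illustrate the idea; they are not the identifiers used in the PR.

```python
import hashlib
from pathlib import Path
from typing import Optional

# Hypothetical lookup table; real entries would map archive names to MD5 hex digests.
MD5S: dict[str, str] = {}


def md5_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: Path, md5s: dict[str, str]) -> Optional[bool]:
    """Return True or False when a checksum is known, None when it is missing."""
    expected = md5s.get(path.name)
    if expected is None:
        return None  # no checksum on record, so verification is skipped
    return md5_of(path) == expected
```

Returning `None` rather than `True` for a missing entry lets a caller distinguish "verified" from "not checked", for example to log a warning.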

@github-actions bot added the documentation, datasets, and testing labels May 4, 2025
@reach-junaidsyed
Author

@microsoft-github-policy-service agree

@adamjstewart added this to the 0.8.0 milestone May 4, 2025
@adamjstewart
Collaborator

Thanks for the PR!

A few thoughts on first glance:

  • I don't know if it's worth listing the MD5 of every single file. If the dataset is only distributed one file at a time, it might not be worth trying to checksum.
  • There seems to be a lot of custom RasterDataset logic here where you are manually populating your R-tree index. Note that in GeoDataset: rtree -> geopandas #2747, we are replacing R-tree with geopandas, so most of this logic would need to change. Furthermore, it isn't clear to me if any of this custom logic is actually needed. Is there a reason why the example in https://torchgeo.readthedocs.io/en/latest/tutorials/custom_raster_dataset.html isn't sufficient for your dataset?
  • There are quite a lot of lines of code that aren't currently covered by unit tests. Many of these are custom error handling logic that could be removed once you conform to the existing RasterDataset definition. Let me know if you need help with other parts of code coverage.

Will review in more detail once these major points have been addressed.
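For context on the second point: the linked tutorial configures a RasterDataset subclass through class attributes such as filename_glob, filename_regex, and date_format, with the regex exposing a named `date` group. The sketch below only illustrates that regex-plus-date-format idea with the standard library; it does not import TorchGeo, and the filename layout, the attribute values, and the helper `parse_month` are assumptions rather than anything taken from the PR.

```python
import re
from datetime import datetime
from fnmatch import fnmatch

# Hypothetical values in the style of RasterDataset class attributes.
# The filename layout is an assumption for illustration.
filename_glob = "odiac*_1km_excl_intl_*.tif"
filename_regex = r"^odiac(?P<version>\d{4})_1km_excl_intl_(?P<date>\d{4})"
date_format = "%y%m"


def parse_month(filename: str) -> datetime:
    """Extract the month a file covers from its name."""
    if not fnmatch(filename, filename_glob):
        raise ValueError(f"{filename} does not match {filename_glob}")
    match = re.match(filename_regex, filename)
    if match is None:
        raise ValueError(f"{filename} does not match the filename regex")
    return datetime.strptime(match.group("date"), date_format)


print(parse_month("odiac2023_1km_excl_intl_2201.tif"))
```

If the real filenames follow a regular pattern like this, the date could in principle be derived from the name alone, which is the kind of case the declarative attributes are meant to cover.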

@reach-junaidsyed
Author

Thanks for the comments. I'll look into it!
