GeoDataset: rtree -> geopandas #2747

Merged
merged 93 commits into from
May 19, 2025

Conversation

adamjstewart
Collaborator

@adamjstewart adamjstewart commented Apr 21, 2025

This PR migrates our GeoDataset indices from rtree to geopandas, and is part of ongoing work to add time series support to TorchGeo: #2382.

Progress

  • Dataset Base Classes
    • GeoDataset
    • RasterDataset
    • VectorDataset
    • IntersectionDataset
    • UnionDataset
  • Custom Datasets
    • AgriFieldNet
    • Chesapeake CVPR
    • EDDMapS
    • EnviroAtlas
    • GBIF
    • GlobBiomass
    • iNaturalist
    • L7 Irish
    • LandCover.ai
    • MMFlood
    • Open Buildings
    • South Africa Crop Type
  • Splitters
    • random_bbox_assignment
    • random_bbox_splitting
    • random_grid_cell_assignment
    • roi_split
    • time_series_split
  • Samplers
    • GeoSampler
    • RandomGeoSampler
    • GridGeoSampler
    • PreChippedGeoSampler
    • BatchGeoSampler
    • RandomBatchGeoSampler
  • Dependencies
    • pyproject.toml
    • requirements.txt
  • Documentation
    • related libraries
    • tutorials
  • Testing
    • Code coverage
    • Silence warnings

Backwards-Incompatible Changes

I've tried to keep backwards-incompatible changes in this particular PR to a minimum, but there were a few that were unavoidable or minor:

  • GeoDataset now uses pyproj CRS instead of rasterio CRS
  • GeoDataset no longer provides a default __init__ method
  • Point datasets (EDDMapS, GBIF, iNaturalist) now return a GeoDataFrame instead of a list of bounding boxes
  • Point datasets (EDDMapS, GBIF, iNaturalist) no longer handle partial timestamp information
  • BoundingBox now takes datetime objects instead of floats for mint/maxt
  • roi_split now includes geometries like LineString and Point with zero-area overlap

Some of these changes could be undone if necessary, but I expect to have many more backwards-incompatible changes in future PRs.
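For downstream code that still stores float timestamps, the BoundingBox change above roughly means converting those floats to datetime objects before building a bounding box. A minimal stdlib sketch, assuming the legacy floats are POSIX seconds in UTC; the helper name `legacy_to_datetime` is hypothetical and not part of TorchGeo:

```python
from datetime import datetime, timezone


def legacy_to_datetime(timestamp: float) -> datetime:
    """Hypothetical helper: POSIX seconds (assumed UTC) -> naive datetime."""
    aware = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    # Drop tzinfo so the result compares cleanly with other naive datetimes.
    return aware.replace(tzinfo=None)


# Old-style float bounds converted to the datetime objects that
# mint/maxt now expect.
mint = legacy_to_datetime(0.0)
maxt = legacy_to_datetime(1_600_000_000.0)
print(mint, maxt)
```

Whether naive or timezone-aware datetimes are the better fit depends on how the dataset's own timestamps are parsed, so treat the `replace(tzinfo=None)` step as a design choice rather than a requirement.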

Open Questions

  1. How should we handle VectorDataset now that we can store arbitrary Polygons in the index?
  2. Can we mix IntervalIndex, PeriodIndex, and DatetimeIndex?
  3. Can we mix date, datetime, pd.Timestamp, pd.Period?
  4. Can we mix Point, Polygon, etc.?
  5. CRS compatibility between rasterio, fiona, geopandas, and pyproj?
  6. Should spatial intersection with zero area count (point, edge)?
  7. Should we define a standard set of column names, or support arbitrary additional columns (e.g., cloud percent)?
  8. Can we unify GeoDataset and NonGeoDataset (and TileDataset)?
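To make question 6 concrete, here is a small shapely sketch with invented geometries. It shows that `intersects` already reports True for edge and point contact, so excluding zero-area overlap needs an explicit area check. The helper `overlaps_with_area` is hypothetical, one possible policy rather than what TorchGeo does:

```python
from shapely.geometry import Point, box

tile = box(0, 0, 1, 1)
neighbor = box(1, 0, 2, 1)  # shares only the edge x = 1 with tile
corner = Point(0, 0)        # lies on the boundary of tile

# The shared edge is a LineString, so its area is zero.
edge_overlap = tile.intersection(neighbor)


def overlaps_with_area(a, b) -> bool:
    """Hypothetical stricter test: require a positive-area intersection."""
    return a.intersection(b).area > 0


print(tile.intersects(neighbor), tile.intersects(corner))
print(edge_overlap.geom_type, edge_overlap.area)
print(overlaps_with_area(tile, neighbor))
```

A strict area test would also drop every Point and LineString geometry, which matters for the point datasets listed earlier.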

Motivation

Switching from rtree to geopandas is no small feat. We made this decision after carefully considering several alternatives, because geopandas supports the following features that rtree does not (items in bold are critical for time series integration):

  • Array features:
    • Integer indexing
    • Integer slicing
  • List features:
    • spatial aggregation: with index.dissolve
    • temporal aggregation: with index.groupby
  • Set features:
    • membership test
    • uniqueness: with pandas.unique
    • faster intersection/union
  • Database features:
    • multithreading
    • pickling
    • redundancy
  • Geospatial features:
    • reprojection: with index.to_crs
    • arbitrary shapely Geometrys: including Points and non-rectangular Polygons
  • Geotemporal features:
    • datetime64 objects
  • Visualization features:
    • plotting
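
As a rough illustration of several features listed above (integer indexing and slicing, spatial and temporal filtering, and reprojection), here is a minimal sketch of a geopandas-backed index. The file names, footprints, dates, and the `filepath` column are made up for illustration, and this is not necessarily how GeoDataset lays out its index internally:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# One time interval per (hypothetical) file, stored as an IntervalIndex.
datetimes = pd.IntervalIndex.from_arrays(
    pd.to_datetime(["2020-01-01", "2020-06-01", "2021-01-01"]),
    pd.to_datetime(["2020-05-31", "2020-12-31", "2021-12-31"]),
    closed="both",
    name="datetime",
)
index = gpd.GeoDataFrame(
    {"filepath": ["a.tif", "b.tif", "c.tif"]},
    index=datetimes,
    geometry=[box(0, 0, 10, 10), box(10, 0, 20, 10), box(5, 5, 15, 15)],
    crs="EPSG:4326",
)

# Array features: integer indexing and slicing.
first_row = index.iloc[0]
first_two = index.iloc[:2]

# Spatial filter: rows whose geometry intersects a region of interest.
roi = box(0, 0, 4, 4)
spatial_hits = index.iloc[index.intersects(roi).to_numpy()]

# Temporal filter: rows whose interval overlaps a query interval.
query = pd.Interval(
    pd.Timestamp("2020-03-01"), pd.Timestamp("2020-07-01"), closed="both"
)
temporal_hits = index.iloc[index.index.overlaps(query)]

# Geospatial feature: reproject every geometry in one call.
reprojected = index.to_crs("EPSG:3857")

print(spatial_hits["filepath"].tolist())
print(temporal_hits["filepath"].tolist())
```

Because both filters are boolean masks over the same rows, they can be combined with `&` to express a spatiotemporal query.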

@adamjstewart adamjstewart added the backwards-incompatible Changes that are not backwards compatible label Apr 21, 2025
@adamjstewart adamjstewart added this to the 0.8.0 milestone Apr 21, 2025
@github-actions github-actions bot added documentation Improvements or additions to documentation datasets Geospatial or benchmark datasets dependencies Packaging and dependencies labels Apr 21, 2025
@adamjstewart adamjstewart mentioned this pull request Apr 21, 2025
37 tasks
@github-actions github-actions bot added testing Continuous integration testing and removed testing Continuous integration testing labels Apr 22, 2025
@@ -354,10 +301,10 @@ class RasterDataset(GeoDataset):
date_format = '%Y%m%d'

#: Minimum timestamp if not in filename
-mint: float = 0
+mint: datetime = pd.Timestamp.min
Collaborator Author

Pandas defaults to datetime64[ns], which has a much smaller range than datetime:

>>> from datetime import datetime
>>> import pandas as pd
>>> pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145224193')
>>> pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
>>> datetime.min
datetime.datetime(1, 1, 1, 0, 0)
>>> datetime.max
datetime.datetime(9999, 12, 31, 23, 59, 59, 999999)

I can't imagine this being an issue, but something to be aware of.
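
The bounds quoted above follow from storing nanoseconds since the Unix epoch in a signed 64-bit integer. A small stdlib-only sketch of that arithmetic, truncated to microseconds because `datetime` cannot represent nanoseconds (assuming, as the pandas documentation describes, that the most negative int64 value is reserved for NaT):

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1  # largest nanosecond offset; -2**63 is assumed to be NaT

# Floor-divide nanoseconds down to whole microseconds for datetime.
upper = EPOCH + timedelta(microseconds=INT64_MAX // 1000)
lower = EPOCH + timedelta(microseconds=-INT64_MAX // 1000)

print(lower)  # 1677-09-21 00:12:43.145224
print(upper)  # 2262-04-11 23:47:16.854775
```

That is roughly 292 years on either side of 1970, which is why defaults such as `datetime.min` fall outside the nanosecond range.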

Collaborator Author

I may actually convert the types directly to pd.Timestamp, since datetime compatibility is not great...

@adamjstewart
Collaborator Author

adamjstewart commented Apr 25, 2025

I mostly have the dataset base classes working now. However, the following datasets are either custom or hacky, making the porting a bit tricky. I would appreciate help from the people who wrote these datasets:

@isaaccorley maybe you can help with some of these as well.

@calebrob6
Member

First question from me is about speed -- do we have a benchmark script to understand before/after lookup speed?

@adamjstewart
Collaborator Author

@calebrob6
Member

It looks like this PR isn't at the point where I can run the torchgeo IO benchmarks (I switched to this branch, tried, and got TypeError: BoundingBox.__init__() missing 2 required positional arguments: 'mint' and 'maxt' but I'm guessing that's expected right now). Do the GeopandasDataset and RTreeDataset implementations from the other repo mimic the difference between how we currently index in GeoDataset vs. this proposed implementation?

@adamjstewart
Collaborator Author

That's the goal. There will likely be minor differences but we can correct those as we go.

@adrianboguszewski
Contributor

What is required here from my side? Checking if the dataset works after changes?

@adamjstewart
Collaborator Author

@adrianboguszewski LandCover.ai Geo doesn't look too bad, but we need to update the __getitem__ method to replace any R-tree specific stuff with the new syntax. If you want, you can open a PR off of my branch, push directly to my branch, or simply share what the code change would look like for the new geopandas backend. If you're busy, I can try to take a stab at it and you can double check that the changes work with the real dataset.

@github-actions github-actions bot added the samplers Samplers for indexing datasets label Apr 29, 2025
@adrianboguszewski
Contributor

@adamjstewart yeah, little busy, so would be great if you create a PR and I test it :)

@sfalkena
Contributor

@adamjstewart, @adrianboguszewski, let me take up landcover.ai dataset to get me into shape in adapting dataset classes

@adamjstewart
Collaborator Author

@calebrob6 @nilsleh @sfalkena @lccol @isaaccorley I would like to have this PR completed in the next couple days. If you're able to finish converting the remaining custom datasets to geopandas by then let me know. If not, I can start working on them, as everything else is now done.

@github-actions github-actions bot added the scripts Training and evaluation scripts label May 4, 2025
Contributor

@yichiac yichiac left a comment

Error in l7irish

Collaborator

@isaaccorley isaaccorley left a comment

I've never been so happy to see so many lines of code deleted

@adamjstewart adamjstewart merged commit 166e55a into microsoft:main May 19, 2025
20 checks passed
@adamjstewart adamjstewart deleted the deps/geopandas branch May 19, 2025 09:51
Labels
backwards-incompatible Changes that are not backwards compatible datasets Geospatial or benchmark datasets dependencies Packaging and dependencies documentation Improvements or additions to documentation samplers Samplers for indexing datasets scripts Training and evaluation scripts testing Continuous integration testing
7 participants