[Enhancement] Add name-based alignment workflow for EarthDaily Sentinel-2 biophysical products by robmarkcole · Pull Request #626 · allenai/rslearn

robmarkcole · 2026-04-27T09:49:50Z

Close #625

… capabilities and update documentation

favyen2 · 2026-04-27T20:30:31Z

@claude can you review this PR?

claude · 2026-04-27T20:30:47Z

Claude finished @favyen2's task in 2m 45s —— View job

PR Review: Biophysical Source Layer Matching

Gather context and read relevant files
Review changed files
Post review feedback

Overall this is a clean, well-motivated feature. The design pattern — adding a get_items_for_windows hook to the base class and overriding it in Biophysical — is a good extensibility point. The tests are solid and the documentation is thorough. A few items worth addressing before merging:

Issues

1. DataSources.md drops the EarthDaily credential note (docs/DataSources.md)

The original table entry read "requires EarthDaily credentials"; the new entry removes that. Credentials are still required and this is important context for users comparing data sources. Suggest: "LAI/FAPAR/FCOVER, can match exact Sentinel-2 source scenes, requires EarthDaily credentials".

Fix this →

2. Unreachable None guard in _format_match_source_item_name (earthdaily.py:1338)

if self.match_source_item_template is None:
    raise ValueError("match_source_item_template is not configured")

_format_match_source_item_name is only called from get_items_for_windows when self.match_source_layer is not None. In __init__, when match_source_layer is not None, match_source_item_template is guaranteed to be set (either by the user or defaulted to "{source_item_name}_{variable_upper}"). So this check can never trigger — it's dead code that may mislead readers.

3. Redundant length check in Biophysical.get_items_for_windows (earthdaily.py)

# In Biophysical.get_items_for_windows:
if len(windows) != len(geometries):
    raise ValueError("windows and geometries must have the same length")

The base class get_items_for_windows (called via super() in the match_source_layer is None branch) already performs this check. The override also performs it in the match_source_layer is not None branch. The check in Biophysical is fine, but the check in the base class only adds value when calling super() — removing one of the duplicates (or keeping both with a comment) would avoid confusion about which check is authoritative.

4. Redundant bounds check in get_items_for_windows (earthdaily.py:1436)

if source_layer_data.group_time_ranges is not None and group_idx < len(
    source_layer_data.group_time_ranges
):

WindowLayerData.__init__ already validates that len(group_time_ranges) == len(serialized_item_groups) (raises ValueError otherwise), so group_idx < len(source_layer_data.group_time_ranges) is always True when group_time_ranges is not None. The simplified form if source_layer_data.group_time_ranges is not None: is clearer.

5. Whole-group skip when any item in the source group is missing (earthdaily.py:1423-1431)

if missing_item_names:
    logger.warning(...)
    continue  # skips entire source group

If a source group contains multiple Sentinel-2 items and even one has no corresponding biophysical product, the entire matched group is discarded. This may be the right behavior (biophysical products typically correspond 1-to-1 with scenes, and a partial group is ambiguous), but it's worth documenting the group semantics here or in the docstring — especially since a user could construct a config with max_matches > 1 and be surprised when all groups are silently dropped.

Positive Notes

The KeyError fix in get_item_by_name when collection.get_item() returns None (line 313) is a good defensive improvement — turns a confusing AttributeError into a clear KeyError that is caught properly in get_items_for_windows.
The string.Formatter().parse() approach to detect whether source_product_id is actually used in the template before requiring it to be non-None is smart.
The test using monkeypatch to verify get_items_for_windows is actually called (instead of get_items) is a good integration point.
Error messages are consistently descriptive with actionable guidance for users (e.g., mentioning that the source layer must appear earlier in the config).

favyen2 · 2026-04-27T20:50:06Z

I think I understand the use case, but I think it is too specific while requiring too many use cases to support at this time. Instead I would suggest either:

Offer a data source that provides both Sentinel-2 images and the derived biophysical products together. So an item has both image assets and the derived assets.
Or, include a separate script in an example that takes items from one layer and uses them to somehow make items in a different layer. It could be based on changing the item name, or (might be more general) it could be looking for an item in the second layer with a very similar timestamp. Then users can run the script for different sorts of layers, and it can be documented in the example for EarthDaily Sentinel-2 + biophysical, and can also be linked from the biophysical documentation.

robmarkcole · 2026-04-28T08:35:22Z

I agree option 2 is the better fit.

The new workflow keeps Sentinel-2 and biophysical products as separate EarthDaily data sources. Users prepare the dataset normally, with separate layers for sentinel2, lai, fapar, and/or fcover. After rslearn dataset prepare and before materialization, they run docs/examples/align_item_groups_by_name.py

The script rewrites the prepared biophysical item groups so they follow the Sentinel-2 item group order. It matches by EarthDaily’s deterministic item naming pattern:

S2C_31TEJ_20250424_0_L2A -> S2C_31TEJ_20250424_0_L2A_LAI

By default, the script uses: {reference_item_name}_{target_layer_upper}
so target layers named lai, fapar, and fcover map to _LAI, _FAPAR, and _FCOVER. Then users run rslearn dataset materialize to materialize the aligned layers.

…m groups utility added

…and data type

favyen2

This looks good, just need to revert changes to data_source.py, manage.py, and test_manage.py.

…usage in prepare_dataset_windows

Enhance EarthDaily Biophysical data source with source layer matching…

e099757

… capabilities and update documentation

robmarkcole changed the title ~~Enhance EarthDaily Biophysical data source with source layer matching capabilities and update documentation~~ [Enhancement] Enhance EarthDaily Biophysical data source with source layer matching capabilities and update documentation Apr 27, 2026

Merge branch 'master' into issue-625

b41e84c

robmarkcole added 2 commits April 28, 2026 09:59

Refactor EarthDaily data source documentation and examples; align ite…

73bb304

…m groups utility added

narrowed the PR workflow to name-based matching only.

d12f683

robmarkcole changed the title ~~[Enhancement] Enhance EarthDaily Biophysical data source with source layer matching capabilities and update documentation~~ [Enhancement] Add name-based alignment workflow for EarthDaily Sentinel-2 biophysical products Apr 28, 2026

Update EarthDaily Sentinel-2 example: correct data source class path …

c1d34bd

…and data type

favyen2 reviewed Apr 28, 2026

View reviewed changes

Comment thread rslearn/data_sources/data_source.py Outdated

Refactor DataSource methods: remove get_items_for_windows and update …

37b44c8

…usage in prepare_dataset_windows

favyen2 approved these changes Apr 28, 2026

View reviewed changes

robmarkcole merged commit 7038748 into master Apr 28, 2026
6 checks passed

robmarkcole deleted the issue-625 branch April 28, 2026 15:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Enhancement] Add name-based alignment workflow for EarthDaily Sentinel-2 biophysical products#626

[Enhancement] Add name-based alignment workflow for EarthDaily Sentinel-2 biophysical products#626
robmarkcole merged 6 commits intomasterfrom
issue-625

robmarkcole commented Apr 27, 2026

Uh oh!

favyen2 commented Apr 27, 2026

Uh oh!

claude Bot commented Apr 27, 2026 •

edited

Loading

Uh oh!

favyen2 commented Apr 27, 2026 •

edited

Loading

Uh oh!

robmarkcole commented Apr 28, 2026 •

edited

Loading

Uh oh!

favyen2 left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

robmarkcole commented Apr 27, 2026

Uh oh!

favyen2 commented Apr 27, 2026

Uh oh!

claude Bot commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review: Biophysical Source Layer Matching

Issues

Positive Notes

Uh oh!

favyen2 commented Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

robmarkcole commented Apr 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

favyen2 left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

claude Bot commented Apr 27, 2026 •

edited

Loading

favyen2 commented Apr 27, 2026 •

edited

Loading

robmarkcole commented Apr 28, 2026 •

edited

Loading