Skip to content

[Enhancement] Add name-based alignment workflow for EarthDaily Sentinel-2 biophysical products#626

Merged
robmarkcole merged 6 commits intomasterfrom
issue-625
Apr 28, 2026
Merged

[Enhancement] Add name-based alignment workflow for EarthDaily Sentinel-2 biophysical products#626
robmarkcole merged 6 commits intomasterfrom
issue-625

Conversation

@robmarkcole
Copy link
Copy Markdown
Collaborator

Close #625

@robmarkcole robmarkcole changed the title Enhance EarthDaily Biophysical data source with source layer matching capabilities and update documentation [Enhancement] Enhance EarthDaily Biophysical data source with source layer matching capabilities and update documentation Apr 27, 2026
@favyen2
Copy link
Copy Markdown
Collaborator

favyen2 commented Apr 27, 2026

@claude can you review this PR?

@claude
Copy link
Copy Markdown

claude Bot commented Apr 27, 2026

Claude finished @favyen2's task in 2m 45s —— View job


PR Review: Biophysical Source Layer Matching

  • Gather context and read relevant files
  • Review changed files
  • Post review feedback

Overall this is a clean, well-motivated feature. The design pattern — adding a get_items_for_windows hook to the base class and overriding it in Biophysical — is a good extensibility point. The tests are solid and the documentation is thorough. A few items worth addressing before merging:


Issues

1. DataSources.md drops the EarthDaily credential note (docs/DataSources.md)

The original table entry read "requires EarthDaily credentials"; the new entry removes that. Credentials are still required and this is important context for users comparing data sources. Suggest: "LAI/FAPAR/FCOVER, can match exact Sentinel-2 source scenes, requires EarthDaily credentials".

Fix this →

2. Unreachable None guard in _format_match_source_item_name (earthdaily.py:1338)

if self.match_source_item_template is None:
    raise ValueError("match_source_item_template is not configured")

_format_match_source_item_name is only called from get_items_for_windows when self.match_source_layer is not None. In __init__, when match_source_layer is not None, match_source_item_template is guaranteed to be set (either by the user or defaulted to "{source_item_name}_{variable_upper}"). So this check can never trigger — it's dead code that may mislead readers.

3. Redundant length check in Biophysical.get_items_for_windows (earthdaily.py)

# In Biophysical.get_items_for_windows:
if len(windows) != len(geometries):
    raise ValueError("windows and geometries must have the same length")

The base class get_items_for_windows (called via super() in the match_source_layer is None branch) already performs this check. The override also performs it in the match_source_layer is not None branch. The check in Biophysical is fine, but the check in the base class only adds value when calling super() — removing one of the duplicates (or keeping both with a comment) would avoid confusion about which check is authoritative.

4. Redundant bounds check in get_items_for_windows (earthdaily.py:1436)

if source_layer_data.group_time_ranges is not None and group_idx < len(
    source_layer_data.group_time_ranges
):

WindowLayerData.__init__ already validates that len(group_time_ranges) == len(serialized_item_groups) (raises ValueError otherwise), so group_idx < len(source_layer_data.group_time_ranges) is always True when group_time_ranges is not None. The simplified form if source_layer_data.group_time_ranges is not None: is clearer.

5. Whole-group skip when any item in the source group is missing (earthdaily.py:1423-1431)

if missing_item_names:
    logger.warning(...)
    continue  # skips entire source group

If a source group contains multiple Sentinel-2 items and even one has no corresponding biophysical product, the entire matched group is discarded. This may be the right behavior (biophysical products typically correspond 1-to-1 with scenes, and a partial group is ambiguous), but it's worth documenting the group semantics here or in the docstring — especially since a user could construct a config with max_matches > 1 and be surprised when all groups are silently dropped.


Positive Notes

  • The KeyError fix in get_item_by_name when collection.get_item() returns None (line 313) is a good defensive improvement — turns a confusing AttributeError into a clear KeyError that is caught properly in get_items_for_windows.
  • The string.Formatter().parse() approach to detect whether source_product_id is actually used in the template before requiring it to be non-None is smart.
  • The test using monkeypatch to verify get_items_for_windows is actually called (instead of get_items) is a good integration point.
  • Error messages are consistently descriptive with actionable guidance for users (e.g., mentioning that the source layer must appear earlier in the config).

@favyen2
Copy link
Copy Markdown
Collaborator

favyen2 commented Apr 27, 2026

I think I understand the use case, but I think it is too specific while requiring too many use cases to support at this time. Instead I would suggest either:

  1. Offer a data source that provides both Sentinel-2 images and the derived biophysical products together. So an item has both image assets and the derived assets.
  2. Or, include a separate script in an example that takes items from one layer and uses them to somehow make items in a different layer. It could be based on changing the item name, or (might be more general) it could be looking for an item in the second layer with a very similar timestamp. Then users can run the script for different sorts of layers, and it can be documented in the example for EarthDaily Sentinel-2 + biophysical, and can also be linked from the biophysical documentation.

@robmarkcole
Copy link
Copy Markdown
Collaborator Author

robmarkcole commented Apr 28, 2026

I agree option 2 is the better fit.

The new workflow keeps Sentinel-2 and biophysical products as separate EarthDaily data sources. Users prepare the dataset normally, with separate layers for sentinel2, lai, fapar, and/or fcover. After rslearn dataset prepare and before materialization, they run docs/examples/align_item_groups_by_name.py

The script rewrites the prepared biophysical item groups so they follow the Sentinel-2 item group order. It matches by EarthDaily’s deterministic item naming pattern:

S2C_31TEJ_20250424_0_L2A -> S2C_31TEJ_20250424_0_L2A_LAI

By default, the script uses: {reference_item_name}_{target_layer_upper}
so target layers named lai, fapar, and fcover map to _LAI, _FAPAR, and _FCOVER. Then users run rslearn dataset materialize to materialize the aligned layers.

@robmarkcole robmarkcole changed the title [Enhancement] Enhance EarthDaily Biophysical data source with source layer matching capabilities and update documentation [Enhancement] Add name-based alignment workflow for EarthDaily Sentinel-2 biophysical products Apr 28, 2026
Copy link
Copy Markdown
Collaborator

@favyen2 favyen2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good, just need to revert changes to data_source.py, manage.py, and test_manage.py.

Comment thread rslearn/data_sources/data_source.py Outdated
@robmarkcole robmarkcole merged commit 7038748 into master Apr 28, 2026
6 checks passed
@robmarkcole robmarkcole deleted the issue-625 branch April 28, 2026 15:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

2 participants