Enhancement to Defining Train and Validation Splits docs #290

@robmarkcole

In the main README, under "Defining Train and Validation Splits", two approaches are listed for splitting the dataset. However, only the second (tags) approach has an example.

I attempted approach one (Assign windows to different groups) using a script that copied data from windows/global_soil_moisture/... into windows/train|val|test directories on a completely different drive.

I then assumed I could configure the data module like this:

data:
  class_path: rslearn.train.data_module.RslearnDataModule
  init_args:
    path: "/my-new-drive-location" # contains windows/train|val|test
    ...
    train_config:
      groups: ["train"]
      transforms: []
    val_config:
      groups: ["val"]
    test_config:
      groups: ["test"]

But this results in a long and confusing error from the data module. The root cause is that each window's metadata.json hard-codes the original group (global_soil_moisture), so it no longer matches the train|val|test directory the window now lives in. I worked around this with a code change on my branch (below). I think an example of approach one in the docs would stop others from taking the path I did. Alternatively, the layout above could be supported directly, since many datasets come in this format.
Thanks

# In ModelDataset
    def _serialize_item(self, example: Window) -> dict[str, Any]:
        data = example.get_metadata()
        # Persist the actual on-disk location because some datasets relocate windows
        # without updating the group stored in metadata. Using a relative path keeps
        # backward compatibility with existing config files while allowing us to
        # reconstruct the window even if the metadata `group` no longer matches the
        # directory hierarchy (e.g. splits moved from a single group into
        # train/val/test folders).
        try:
            relative_path = example.path.relative_to(self.dataset.path)
        except ValueError:
            # Path may fall outside the dataset root (e.g. remote URLs); fall back to
            # storing the absolute path in that case.
            relative_path = example.path
        data["relative_path"] = str(relative_path)
        return data

    def _deserialize_item(self, d: dict[str, Any]) -> Window:
        if "relative_path" in d:
            window_path = self.dataset.path / d["relative_path"]
        else:
            window_path = Window.get_window_root(
                self.dataset.path, d["group"], d["name"]
            )
        return Window.from_metadata(window_path, d)
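
For anyone who hits the same problem but would rather not patch rslearn, a possible alternative is to rewrite the group field in each metadata.json after moving the windows, so that it agrees with the directory the window sits in. The following is only a sketch: sync_group_with_directory is a hypothetical helper, not part of rslearn, and it assumes the layout <dataset root>/windows/<group>/<window name>/metadata.json with a top-level "group" key, which is what the snippets above suggest. I have not checked whether rslearn keeps group-dependent state anywhere else.

```python
import json
from pathlib import Path


def sync_group_with_directory(dataset_root: Path) -> int:
    """Make each window's metadata "group" match its group directory on disk.

    Hypothetical helper, not part of rslearn. Assumes the layout
    <dataset_root>/windows/<group>/<window_name>/metadata.json and that
    metadata.json is a JSON object with a top-level "group" key.

    Returns the number of metadata files that were rewritten.
    """
    changed = 0
    windows_dir = dataset_root / "windows"
    for metadata_path in sorted(windows_dir.glob("*/*/metadata.json")):
        # The directory two levels above metadata.json is the group directory.
        group_on_disk = metadata_path.parent.parent.name
        metadata = json.loads(metadata_path.read_text())
        if metadata.get("group") != group_on_disk:
            metadata["group"] = group_on_disk
            metadata_path.write_text(json.dumps(metadata))
            changed += 1
    return changed
```

Usage would be something like sync_group_with_directory(Path("/my-new-drive-location")), run once after the copy script. It modifies files in place, so it is probably worth trying on a backup copy first.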
