Description
In the main README, under "Defining Train and Validation Splits", two approaches are listed for splitting the dataset; however, only the second (tags) approach has an example.
I attempted approach one (assign windows to different groups) using a script that copied data from windows/global_soil_moisture/... into a windows/train|val|test layout on a completely different drive.
I then assumed I could use a config like this:

    data:
      class_path: rslearn.train.data_module.RslearnDataModule
      init_args:
        path: "/my-new-drive-location" # contains windows/train|val|test
        ...
        train_config:
          groups: ["train"]
          transforms: []
        val_config:
          groups: ["val"]
        test_config:
          groups: ["test"]

But this results in a long and confusing error from the datamodule. The root cause is that each window's metadata.json hard-codes the original group (global_soil_moisture), so the stored group no longer matches the directory the window lives in. I addressed this with a code change on my branch (below). I think an example of approach one in the README would prevent others from taking the path I did. Alternatively, the layout above could be supported directly, since many datasets come in this format.
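For anyone wanting to confirm they are hitting the same root cause, a quick check is to compare the group stored in each metadata.json with the directory the window sits in. This is only a sketch: it assumes a local dataset laid out as windows/<group>/<window_name>/metadata.json with a "group" key, and `find_group_mismatches` is a hypothetical helper, not part of rslearn.

```python
from __future__ import annotations

import json
from pathlib import Path


def find_group_mismatches(dataset_root: Path) -> list[tuple[Path, str | None]]:
    """Return (window_dir, stored_group) for every window whose metadata
    group differs from the name of the group directory it lives in."""
    mismatches = []
    # Assumed layout: <dataset_root>/windows/<group>/<window_name>/metadata.json
    pattern = "*/*/metadata.json"
    for metadata_path in sorted((dataset_root / "windows").glob(pattern)):
        window_dir = metadata_path.parent
        stored_group = json.loads(metadata_path.read_text()).get("group")
        if stored_group != window_dir.parent.name:
            mismatches.append((window_dir, stored_group))
    return mismatches
```

Running this against the relocated dataset root should list every window that still carries the old group name; an empty list means the metadata and directory layout agree.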
Thanks
    # In ModelDataset
    def _serialize_item(self, example: Window) -> dict[str, Any]:
        data = example.get_metadata()
        # Persist the actual on-disk location because some datasets relocate windows
        # without updating the group stored in metadata. Using a relative path keeps
        # backward compatibility with existing config files while allowing us to
        # reconstruct the window even if the metadata `group` no longer matches the
        # directory hierarchy (e.g. splits moved from a single group into
        # train/val/test folders).
        try:
            relative_path = example.path.relative_to(self.dataset.path)
        except ValueError:
            # Path may fall outside the dataset root (e.g. remote URLs); fall back to
            # storing the absolute path in that case.
            relative_path = example.path
        data["relative_path"] = str(relative_path)
        return data

    def _deserialize_item(self, d: dict[str, Any]) -> Window:
        if "relative_path" in d:
            window_path = self.dataset.path / d["relative_path"]
        else:
            window_path = Window.get_window_root(
                self.dataset.path, d["group"], d["name"]
            )
        return Window.from_metadata(window_path, d)
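
A workaround that needs no change to rslearn is to rewrite the group at copy time, so that metadata.json agrees with the new directory. The sketch below assumes a local filesystem, the windows/<group>/<window_name>/metadata.json layout, and a "group" key in the metadata. `copy_window_to_group` is a hypothetical helper, and I have not checked whether rslearn records the group anywhere else.

```python
import json
import shutil
from pathlib import Path


def copy_window_to_group(window_dir: Path, dest_root: Path, new_group: str) -> Path:
    """Copy one window directory to dest_root/windows/<new_group>/<name> and
    update the group recorded in the copied metadata.json to match."""
    dest_dir = dest_root / "windows" / new_group / window_dir.name
    # copytree creates any missing parent directories and copies all files
    # in the window (metadata plus any layer data) unchanged.
    shutil.copytree(window_dir, dest_dir)

    metadata_path = dest_dir / "metadata.json"
    metadata = json.loads(metadata_path.read_text())
    # Keep the stored group consistent with the new directory hierarchy.
    metadata["group"] = new_group
    metadata_path.write_text(json.dumps(metadata))
    return dest_dir
```

A split script would call this once per window with "train", "val", or "test" as `new_group`; the original windows are left untouched.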