
test: add dataloader checkpoint integration test for retrieval recipes #1800

Open
oliverholworthy wants to merge 2 commits into main from test/retrieval-dataloader-checkpoint

Summary

  • Adds an integration test validating that the bi-encoder and cross-encoder StatefulDataLoaders correctly save and restore iteration state across checkpoint boundaries
  • Ensures resumed training skips already-seen samples by verifying batch-level continuity after checkpoint restore
  • Follows the same pattern as test_megatron_dataset_checkpoint.py but targets the retrieval build_dataloader path
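
For readers unfamiliar with the property being tested, here is a minimal sketch of the batch-level continuity check described above. It is not the test added in this PR: `ResumableLoader` and `check_resume_continuity` are hypothetical names, and the toy loader only imitates the `state_dict()` / `load_state_dict()` protocol in pure Python rather than using the retrieval `build_dataloader` path.

```python
from typing import Iterator, List, Tuple


class ResumableLoader:
    """Toy stand-in for a stateful dataloader (hypothetical, for illustration only)."""

    def __init__(self, data: List[int], batch_size: int) -> None:
        self.data = data
        self.batch_size = batch_size
        self._next_index = 0  # index of the first sample not yet yielded

    def __iter__(self) -> Iterator[List[int]]:
        while self._next_index < len(self.data):
            start = self._next_index
            # Advance the position before yielding so that a checkpoint taken
            # right after a batch is consumed points at the next unseen sample.
            self._next_index = min(start + self.batch_size, len(self.data))
            yield self.data[start:self._next_index]

    def state_dict(self) -> dict:
        return {"next_index": self._next_index}

    def load_state_dict(self, state: dict) -> None:
        self._next_index = state["next_index"]


def check_resume_continuity(
    num_samples: int = 10, batch_size: int = 2, steps_before_ckpt: int = 2
) -> Tuple[List[List[int]], List[List[int]], dict]:
    data = list(range(num_samples))

    # Reference run: iterate the whole dataset without interruption.
    expected = list(ResumableLoader(data, batch_size))

    # Interrupted run: consume a few batches, then checkpoint the loader state.
    loader = ResumableLoader(data, batch_size)
    iterator = iter(loader)
    seen = [next(iterator) for _ in range(steps_before_ckpt)]
    checkpoint = loader.state_dict()

    # Resume in a fresh loader, as a restarted training job would.
    resumed = ResumableLoader(data, batch_size)
    resumed.load_state_dict(checkpoint)
    seen.extend(resumed)

    return expected, seen, checkpoint


expected, seen, checkpoint = check_resume_continuity()
assert seen == expected  # no batch repeated and none skipped across the restore
```

The comparison against an uninterrupted reference run is what catches both failure modes at once: a loader that restarts from the beginning would repeat batches, and one that over-advances would drop them.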

Test plan

  • test_bi_encoder_dataloader_checkpoint — passes locally with torchrun --nproc-per-node=1
  • test_cross_encoder_dataloader_checkpoint — passes locally with torchrun --nproc-per-node=1
  • CI validation

Run locally:

HF_HOME="" uv run torchrun --nproc-per-node=1 -m pytest \
    tests/functional_tests/training/test_retrieval_dataloader_checkpoint.py -vs \
    --override-ini="addopts="

Validates that bi-encoder and cross-encoder StatefulDataLoaders correctly
save and restore iteration state across checkpoint boundaries, ensuring
resumed training skips already-seen samples.

Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>
Signed-off-by: Oliver Holworthy <oholworthy@nvidia.com>
Signed-off-by: Oliver Holworthy <1216955+oliverholworthy@users.noreply.github.com>