fix(msmarco): address PR #46 review comments and code review findings#52
Open
ashwin2002 wants to merge 1 commit into
Open
fix(msmarco): address PR #46 review comments and code review findings#52ashwin2002 wants to merge 1 commit into
ashwin2002 wants to merge 1 commit into
Conversation
Step-based partitioning and correctness fixes for loadMSMARCODataset: - Replace flat even-split with MSMARCOEmbeddingProduct.getSteps() loop, matching CLI (MSMARCOLoader) and REST SIFT (loadSIFTDataset) behaviour - Add start_offset bounds check against STEPS so out-of-range values fail fast with IllegalArgumentException instead of AIOOBE - Add end_offset bounds check against steps[last] to prevent AIOOBE in the outer while loop when caller sends value beyond dataset size - Add end_offset <= start_offset guard to replace the deleted totalDocs <= 0 check, restoring the silent no-op diagnostic - Cap workers per range with effectivePool = min(poolSize, end-start) to prevent step=0 integer division that silently created zero-range workers when doc count was smaller than processConcurrency - Stage all WorkLoadGenerate instances in a local map before flushing to loader_tasks so a mid-loop failure leaves no orphaned tasks - Remove redundant ws.embeddingFilePath post-construction assignment already set by the WorkLoadSettings constructor - Strip trailing whitespace from three blank lines in SDKClientPool introduced by the PR Used Claude Code for code generation. Model used: claude-sonnet-4-6.
There was a problem hiding this comment.
Pull request overview
Addresses prior PR #46 review feedback by aligning the REST loadMSMARCODataset() partitioning with the CLI MSMARCOLoader step-based scheme, hardening offset validation, fixing a divide-by-zero edge case when doc count is smaller than the pool, and staging tasks atomically before publishing to the global loader_tasks map. Also drops a redundant embeddingFilePath assignment and cleans trailing whitespace in SDKClientPool.
Changes:
- Replace the flat even-split with a
MSMARCOEmbeddingProduct.getSteps()overlap loop and addstart_offset/end_offset/empty-range bounds checks that throwIllegalArgumentExceptionearly. - Cap workers via
effectivePool = min(poolSize, end-start), stageWorkLoadGenerateinstances in a local map and flush viaputAll, and remove the duplicatews.embeddingFilePathassignment. - Whitespace-only cleanup in
SDKClientPool.release_client().
Reviewed changes
Copilot reviewed 1 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/main/java/RestServer/TaskRequest.java | Rewrites MSMARCO partitioning to step-loop form, adds offset validation, atomic task staging, new task-name composition. |
| src/main/java/couchbase/sdk/SDKClientPool.java | Strips trailing whitespace from three blank lines in release_client(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+1061
to
+1062
| String task_name = "MSMARCOTask_" + TaskRequest.task_id.incrementAndGet() + k + "_" + ws.dr.create_s | ||
| + "_" + ws.dr.create_e; |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Step-based partitioning and correctness fixes for loadMSMARCODataset:
Used Claude Code for code generation.
Model used: claude-sonnet-4-6.