(dataset) Add shard index remapping functionality #656

joellidin · 2025-11-14T13:53:58Z

Implement shard remapping to redirect shard 7 to shard 13, expanding
dataset access beyond the original 10-shard limit.

- Increase max_dataset_idx from 10 to 13 to support higher shards
- Add remap_shard_index() static method to handle shard 7 -> 13
- Update prepare_shard() to use remapped indices for file location
- Update create_dataset() to pass remapped index to dataset init
- Enhance logging to show both original and remapped indices
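
For illustration, a minimal sketch of the remapping helper described in this list. It assumes the method sits on ShardedDatasetManager as a static method; the class body is a stripped-down stand-in, not the actual contents of src/tplr/sharded_dataset.py.

```python
class ShardedDatasetManager:
    """Stand-in for illustration only; the real class carries more state."""

    @staticmethod
    def remap_shard_index(shard_index: int) -> int:
        """Return the shard index to read from for a requested shard index."""
        # Shard 7 is redirected to shard 13; every other index passes through.
        if shard_index == 7:
            return 13
        return shard_index
```

Because the helper is static, it could be called without a manager instance, for example `ShardedDatasetManager.remap_shard_index(7)`.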

Description

Related Issue(s)

Closes #[issue number]

Type of Change

Feature (adding new functionality)
Fix (resolving a bug or issue)
Docs (documentation updates)
Refactor (code changes that don't affect functionality)
Maintenance (dependency updates or other maintenance)
Tests (adding or improving tests)
Breaking change (fix or feature with incompatible API changes)
Other: _____

Branch Naming

My branch follows the project's naming convention (e.g., feature/add-new-capability)

Commit Messages

My commits are small, atomic, and have proper commit messages
Commit messages are in imperative mood with a capitalized summary under 50 chars

Code Quality

I've performed a self-review of my code
I've added appropriate docstrings following the project's conventions
I've added proper logging where necessary (without trailing periods)
I've applied linting and formatting with Ruff
My code generates no new warnings

Testing

I've added tests for new functionality or bug fixes
All tests pass locally with my changes
Test coverage has not decreased

Documentation

I've updated documentation to reflect my changes
I've updated comments in hard-to-understand areas

If this is a breaking change

Screenshots/Examples

Additional Notes

Summary by CodeRabbit

Enhancements
- Increased maximum dataset capacity to support larger datasets (max index extended).
- Improved shard indexing and routing: shard index 7 is remapped to shard 13.
- Enhanced diagnostic logging now reports both the original and remapped shard indexes for clearer troubleshooting.

coderabbitai · 2025-11-14T13:54:19Z

Walkthrough

This change adds a shard-index remapping rule to ShardedDatasetManager (maps 7 → 13), raises max_dataset_idx from 10 to 14, and applies the remapped index in prepare_shard and create_dataset with updated log messages showing original → remapped values.

Changes

Cohort / File(s)	Summary
Shard index remapping and usage `src/tplr/sharded_dataset.py`	Added `remap_shard_index(shard_index: int) -> int` (maps 7→13); increased `max_dataset_idx` to 14; `prepare_shard` and `create_dataset` now remap the requested shard before calling `locate_shards` / creating `SharedShardedDataset`; logging updated to show original and remapped shard indices.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Manager as ShardedDatasetManager
    participant Storage as ShardLocator
    participant Dataset as SharedShardedDataset

    Client->>Manager: request prepare_shard(original_idx)
    Note right of Manager: remap_shard_index(original_idx)
    Manager-->>Manager: remapped_idx = remap_shard_index(original_idx)
    Manager->>Storage: locate_shards(remapped_idx)
    Storage-->>Manager: shard_files
    Manager->>Dataset: create dataset(remapped_idx, shard_files)
    Dataset-->>Manager: dataset_handle
    Manager-->>Client: return dataset_handle (logs original->remapped)
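
The flow in the diagram can be sketched roughly as follows. This is an assumption-laden outline, not the merged code: `locate_shards` here is a hypothetical stand-in with an invented file-naming scheme, and the log wording is illustrative.

```python
import logging
from typing import List

logger = logging.getLogger("sharded_dataset_sketch")


def remap_shard_index(shard_index: int) -> int:
    # Rule from the PR: 7 -> 13, all other indices unchanged.
    return 13 if shard_index == 7 else shard_index


def locate_shards(shard_index: int) -> List[str]:
    # Hypothetical stand-in: the real lookup and file layout are not shown here.
    return [f"shard_{shard_index:06d}.npy"]


def prepare_shard(original_idx: int) -> List[str]:
    # Remap first, then use only the remapped index for file location.
    remapped_idx = remap_shard_index(original_idx)
    logger.info("Preparing shard %d (remapped to %d)", original_idx, remapped_idx)
    return locate_shards(remapped_idx)
```

The point of the ordering is that callers keep using the original index, while storage lookups only ever see the remapped one.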

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Changes are localized to one file and follow a consistent pattern.
Pay attention to logging text and any tests or callers that assume the previous shard numbering.
Check any other code that constructs SharedShardedDataset directly for consistency.

Possibly related PRs

v2.1.11 #636 — Modifies shard-index handling in the same file; ensures self.shard_index is set before dataset creation.
fix/shard switching at new run #635 — Adjusts shard-index initialization and synchronization within the sharded dataset management flow.

Poem

🐰
A seven-hop turned thirteen in moonlight,
I mapped the path and logged it right,
Fourteen now the max to seek,
Shards aligned — the future's neat and sleek!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically summarizes the main change: adding shard index remapping functionality to redirect shard 7 to 13.
Description check	✅ Passed	The PR description provides a clear summary of changes at the top but does not complete the template sections (Type of Change, Branch Naming, Commit Messages, Code Quality, Testing, Documentation are all unchecked).
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee43059 and fb48377.

📒 Files selected for processing (1)

src/tplr/sharded_dataset.py (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/tplr/sharded_dataset.py

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)

GitHub Check: test (3.11)
GitHub Check: test (3.12)

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

src/tplr/sharded_dataset.py (1)

226-236: Shard remapping helper is correct; consider configurability if more remaps are expected

The implementation of remap_shard_index does exactly what the docstring says (map 7 → 13, others unchanged) and keeps the logic centralized, which is good.

If you anticipate adding more remaps or changing them per‑run, consider making this mapping data‑driven (e.g., a dict or config/env‑driven rule) so that you don’t need to touch code for future remapping tweaks.
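
One way the data-driven variant suggested here might look, as a sketch only. The `SHARD_REMAP` name and the optional `remap` parameter are hypothetical and do not appear in the PR.

```python
from typing import Mapping, Optional

# Hypothetical module-level table; adding a remap means adding an entry here.
SHARD_REMAP: Mapping[int, int] = {7: 13}


def remap_shard_index(
    shard_index: int, remap: Optional[Mapping[int, int]] = None
) -> int:
    """Look up shard_index in the remap table, falling back to the index itself."""
    table = SHARD_REMAP if remap is None else remap
    return table.get(shard_index, shard_index)
```

Passing an explicit table (for instance one loaded from configuration) would let a run change its remaps without a code change, at the cost of one more setting to keep consistent across nodes.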

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 128ee9b and ee43059.

📒 Files selected for processing (1)

src/tplr/sharded_dataset.py (3 hunks)

🔇 Additional comments (1)

src/tplr/sharded_dataset.py (1)

224-225: Confirm max_dataset_idx value vs shard layout and remap semantics

max_dataset_idx = 14 means logical shard indices will cycle over [0, 13]. With remap_shard_index mapping 7 -> 13, physical shard 13 will be used for both logical shard 7 and logical shard 13, and physical shard 7 will never be touched (if it exists). The PR description mentions increasing max_dataset_idx from 10 to 13, which suggests a possible off‑by‑one or intent mismatch.

Please double‑check that:

Shards are actually available up to index 13 on disk/R2, and

The intended behavior is to alias both logical 7 and logical 13 to physical 13, rather than to only redirect 7 within a 0–9 range.

If the intent was “max logical shard index is 12”, this should be 13 instead of 14. Otherwise, a short comment explaining the aliasing would reduce confusion.
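
The aliasing concern can be checked with a small enumeration. This sketch assumes, as the comment does, that logical indices run over range(max_dataset_idx); the helper names are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def remap_shard_index(shard_index: int) -> int:
    # Rule from the PR: 7 -> 13, all other indices unchanged.
    return 13 if shard_index == 7 else shard_index


def physical_usage(max_dataset_idx: int) -> Dict[int, List[int]]:
    """Map each physical shard to the logical indices that resolve to it."""
    usage: Dict[int, List[int]] = defaultdict(list)
    for logical in range(max_dataset_idx):
        usage[remap_shard_index(logical)].append(logical)
    return dict(usage)


def aliased_shards(max_dataset_idx: int) -> Dict[int, List[int]]:
    # Physical shards reached from more than one logical index.
    usage = physical_usage(max_dataset_idx)
    return {phys: logs for phys, logs in usage.items() if len(logs) > 1}


def unused_shards(max_dataset_idx: int) -> List[int]:
    # Indices in the logical range that no logical index resolves to.
    usage = physical_usage(max_dataset_idx)
    return sorted(set(range(max_dataset_idx)) - set(usage))
```

Under these assumptions, a value of 14 makes physical shard 13 serve logical 7 and logical 13, while a value of 13 leaves shard 13 reachable only through logical 7. In both cases physical shard 7 is never read.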

codecov · 2025-11-14T13:58:26Z

Codecov Report

❌ Patch coverage is 28.57143% with 5 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/tplr/sharded_dataset.py	28.57%	5 Missing ⚠️

❌ Your patch status has failed because the patch coverage (28.57%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (57.89%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.

@@            Coverage Diff             @@
##              dev     #656      +/-   ##
==========================================
- Coverage   57.91%   57.89%   -0.02%     
==========================================
  Files          27       27              
  Lines        4890     4895       +5     
==========================================
+ Hits         2832     2834       +2     
- Misses       2058     2061       +3
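
The percentages in this report follow from the hit and line counts. A short sketch of the arithmetic; the patch figure of 2 covered lines out of 7 is inferred from "5 lines missing" at 28.57143%, and the reported values are matched only to within display rounding.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage of lines hit."""
    return 100.0 * hits / lines


base = coverage_pct(2832, 4890)   # dev branch, reported as 57.91%
head = coverage_pct(2834, 4895)   # this PR, reported as 57.89%
delta = head - base               # reported as -0.02%
patch = coverage_pct(2, 7)        # inferred: 7 changed lines, 5 missing
```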

Files with missing lines	Coverage Δ
src/tplr/sharded_dataset.py	`22.48% <28.57%> (+0.70%)`	⬆️

joellidin requested a review from shivam-MBZUAI November 14, 2025 13:54

coderabbitai bot reviewed Nov 14, 2025

joellidin force-pushed the feat/remap-shards branch from ee43059 to fb48377 Compare November 14, 2025 14:00

joellidin merged commit 56b5a86 into dev Nov 14, 2025
6 of 8 checks passed

joellidin deleted the feat/remap-shards branch November 14, 2025 14:23

coderabbitai bot mentioned this pull request Nov 14, 2025

v2.1.17 #658

Merged

21 tasks
