@joellidin
Collaborator

@joellidin joellidin commented Nov 14, 2025

Implement shard remapping to redirect shard 7 to shard 13, expanding
dataset access beyond the original 10-shard limit.

  • Increase max_dataset_idx from 10 to 13 to support higher shards
  • Add remap_shard_index() static method to handle shard 7 -> 13
  • Update prepare_shard() to use remapped indices for file location
  • Update create_dataset() to pass remapped index to dataset init
  • Enhance logging to show both original and remapped indices
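
A minimal sketch of the remapping helper described in the bullets above. The method name and the 7 -> 13 rule come from this description; the class name `ShardRemapSketch` is hypothetical, and the body is an assumption about how such a helper could look, not the merged code.

```python
class ShardRemapSketch:
    """Hypothetical stand-in for the manager class that owns the helper."""

    @staticmethod
    def remap_shard_index(shard_index: int) -> int:
        """Return the shard index to use for file location.

        Shard 7 is redirected to shard 13; every other index is unchanged.
        """
        # Only one remap rule is described in this PR: 7 -> 13
        if shard_index == 7:
            return 13
        return shard_index
```

Because the helper is static, callers can apply it without a manager instance, for example `ShardRemapSketch.remap_shard_index(7)`.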

Description

Related Issue(s)

  • Closes #[issue number]

Type of Change

  • Feature (adding new functionality)
  • Fix (resolving a bug or issue)
  • Docs (documentation updates)
  • Refactor (code changes that don't affect functionality)
  • Maintenance (dependency updates or other maintenance)
  • Tests (adding or improving tests)
  • Breaking change (fix or feature with incompatible API changes)
  • Other: _____

Branch Naming

  • My branch follows the project's naming convention (e.g., feature/add-new-capability)

Commit Messages

  • My commits are small, atomic, and have proper commit messages
  • Commit messages are in imperative mood with a capitalized summary under 50 chars

Code Quality

  • I've performed a self-review of my code
  • I've added appropriate docstrings following the project's conventions
  • I've added proper logging where necessary (without trailing periods)
  • I've applied linting and formatting with Ruff
  • My code generates no new warnings

Testing

  • I've added tests for new functionality or bug fixes
  • All tests pass locally with my changes
  • Test coverage has not decreased

Documentation

  • I've updated documentation to reflect my changes
  • I've updated comments in hard-to-understand areas

If this is a breaking change

Screenshots/Examples

Additional Notes

Summary by CodeRabbit

  • Enhancements
    • Increased maximum dataset capacity to support larger datasets (max index extended).
    • Improved shard indexing and routing: requests for shard 7 are remapped to shard 13 when locating files and creating the dataset.
    • Enhanced diagnostic logging now reports both the original and remapped shard indexes for clearer troubleshooting.

@coderabbitai

coderabbitai bot commented Nov 14, 2025

Walkthrough

This change adds a shard-index remapping rule to ShardedDatasetManager (maps 7 → 13), raises max_dataset_idx from 10 to 14, and applies the remapped index in prepare_shard and create_dataset with updated log messages showing original → remapped values.

Changes

Cohort / File(s) | Summary
--- | ---
Shard index remapping and usage (src/tplr/sharded_dataset.py) | Added remap_shard_index(shard_index: int) -> int (maps 7→13); increased max_dataset_idx to 14; prepare_shard and create_dataset now remap the requested shard before calling locate_shards / creating SharedShardedDataset; logging updated to show original and remapped shard indices.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant Manager as ShardedDatasetManager
    participant Storage as ShardLocator
    participant Dataset as SharedShardedDataset

    Client->>Manager: request prepare_shard(original_idx)
    Note right of Manager: remap_shard_index(original_idx)
    Manager-->>Manager: remapped_idx = remap_shard_index(original_idx)
    Manager->>Storage: locate_shards(remapped_idx)
    Storage-->>Manager: shard_files
    Manager->>Dataset: create dataset(remapped_idx, shard_files)
    Dataset-->>Manager: dataset_handle
    Manager-->>Client: return dataset_handle (logs original->remapped)
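
The first half of the diagram above (remap, then locate) can be sketched as follows. This is an illustration under stated assumptions: `locate_shards` here is a hypothetical stand-in with made-up file names, and the log wording is invented for the example; only the order of operations is taken from the diagram.

```python
import logging

logger = logging.getLogger("shard_flow_sketch")


def remap_shard_index(shard_index: int) -> int:
    """Map logical shard 7 to physical shard 13; leave others unchanged."""
    return 13 if shard_index == 7 else shard_index


def locate_shards(shard_index: int) -> list[str]:
    """Hypothetical locator: the real file naming is not shown in this PR."""
    return [f"shard_{shard_index}.tokens", f"shard_{shard_index}.ids"]


def prepare_shard(original_idx: int) -> tuple[int, list[str]]:
    """Remap the requested index first, then locate files for the result."""
    remapped_idx = remap_shard_index(original_idx)
    # Log both values so a reader can trace which physical shard was used
    logger.info("Preparing shard %d (remapped to %d)", original_idx, remapped_idx)
    return remapped_idx, locate_shards(remapped_idx)
```

Returning the remapped index alongside the files keeps the caller from recomputing it when the dataset is created afterwards.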

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

  • Changes are localized to one file and follow a consistent pattern.
  • Pay attention to logging text and any tests or callers that assume the previous shard numbering.
  • Check any other code that constructs SharedShardedDataset directly for consistency.

Possibly related PRs

  • v2.1.11 #636 — Modifies shard-index handling in the same file; ensures self.shard_index is set before dataset creation.
  • fix/shard switching at new run #635 — Adjusts shard-index initialization and synchronization within the sharded dataset management flow.

Poem

🐰
A seven-hop turned thirteen in moonlight,
I mapped the path and logged it right,
Fourteen now the max to seek,
Shards aligned — the future's neat and sleek!

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name | Status | Explanation
--- | --- | ---
Title check | ✅ Passed | The title clearly and specifically summarizes the main change: adding shard index remapping functionality to redirect shard 7 to 13.
Description check | ✅ Passed | The PR description provides a clear summary of changes at the top but does not complete the template sections (Type of Change, Branch Naming, Commit Messages, Code Quality, Testing, Documentation are all unchecked).
Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee43059 and fb48377.

📒 Files selected for processing (1)
  • src/tplr/sharded_dataset.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/tplr/sharded_dataset.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: test (3.11)
  • GitHub Check: test (3.12)

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/tplr/sharded_dataset.py (1)

226-236: Shard remapping helper is correct; consider configurability if more remaps are expected

The implementation of remap_shard_index does exactly what the docstring says (map 7 → 13, others unchanged) and keeps the logic centralized, which is good.

If you anticipate adding more remaps or changing them per‑run, consider making this mapping data‑driven (e.g., a dict or config/env‑driven rule) so that you don’t need to touch code for future remapping tweaks.
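
One possible shape for the data-driven variant suggested above, as a sketch only. The `"src:dst,src:dst"` spec format, `DEFAULT_REMAP`, and `parse_remap` are all hypothetical names introduced for illustration; nothing in this PR defines them.

```python
from typing import Mapping, Optional

# Default rule taken from this PR: logical shard 7 -> physical shard 13
DEFAULT_REMAP: Mapping[int, int] = {7: 13}


def parse_remap(spec: str) -> dict[int, int]:
    """Parse a spec such as "7:13,2:11" into {7: 13, 2: 11}.

    Hypothetical format; an empty string yields an empty mapping.
    """
    remap: dict[int, int] = {}
    for pair in spec.split(","):
        pair = pair.strip()
        if not pair:
            continue  # tolerate empty specs and trailing commas
        src, dst = pair.split(":")
        remap[int(src)] = int(dst)
    return remap


def remap_shard_index(
    shard_index: int, remap: Optional[Mapping[int, int]] = None
) -> int:
    """Look up the shard in the mapping; unmapped indices pass through."""
    table = DEFAULT_REMAP if remap is None else remap
    return table.get(shard_index, shard_index)
```

With this shape, a future remap becomes a configuration change, for example `remap_shard_index(2, parse_remap("7:13,2:11"))`, rather than a code edit.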

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 128ee9b and ee43059.

📒 Files selected for processing (1)
  • src/tplr/sharded_dataset.py (3 hunks)
🔇 Additional comments (1)
src/tplr/sharded_dataset.py (1)

224-225: Confirm max_dataset_idx value vs shard layout and remap semantics

max_dataset_idx = 14 means logical shard indices will cycle over [0, 13]. With remap_shard_index mapping 7 -> 13, physical shard 13 will be used for both logical shard 7 and logical shard 13, and physical shard 7 will never be touched (if it exists). The PR description mentions increasing max_dataset_idx from 10 to 13, which suggests a possible off‑by‑one or intent mismatch.

Please double‑check that:

  • Shards are actually available up to index 13 on disk/R2, and
  • The intended behavior is to alias both logical 7 and logical 13 to physical 13, rather than to only redirect 7 within a 0–9 range.

If the intent was “max logical shard index is 12”, this should be 13 instead of 14. Otherwise, a short comment explaining the aliasing would reduce confusion.
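
The aliasing described above can be checked with a short sketch. It assumes, as this comment does, that logical indices cycle over `range(max_dataset_idx)`; the names `MAX_DATASET_IDX` and `physical_usage` are hypothetical.

```python
from collections import Counter

MAX_DATASET_IDX = 14  # value reported in the walkthrough for this PR


def remap_shard_index(shard_index: int) -> int:
    """Map logical shard 7 to physical shard 13; leave others unchanged."""
    return 13 if shard_index == 7 else shard_index


def physical_usage(max_dataset_idx: int) -> Counter:
    """Count how often each physical shard is hit over one full cycle."""
    return Counter(remap_shard_index(i) for i in range(max_dataset_idx))


usage = physical_usage(MAX_DATASET_IDX)
# Physical shard 13 serves logical 7 and logical 13; physical 7 is never used
aliased = [shard for shard, count in usage.items() if count > 1]
unused = [i for i in range(MAX_DATASET_IDX) if i not in usage]
```

With a limit of 13 instead of 14, logical index 13 would never be requested, so physical shard 13 would be reached only through the remap from 7.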

@codecov

codecov bot commented Nov 14, 2025

Codecov Report

❌ Patch coverage is 28.57143% with 5 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
--- | --- | ---
src/tplr/sharded_dataset.py | 28.57% | 5 Missing ⚠️

❌ Your patch status has failed because the patch coverage (28.57%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project status has failed because the head coverage (57.89%) is below the target coverage (85.00%). You can increase the head coverage or adjust the target coverage.
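
The percentages in this report can be reproduced from the hit and line counts shown in the coverage diff. A patch figure of 28.57% with 5 missing lines is consistent with 2 covered lines out of 7. Truncating to two decimals is an assumption made here because it matches the reported 57.89%; it is inferred from these numbers only and is not a statement about how Codecov computes its display values.

```python
def percent_truncated(hits: int, lines: int) -> float:
    """Coverage percentage truncated (not rounded) to two decimals.

    Integer arithmetic avoids floating-point surprises before the final divide.
    """
    return (hits * 10_000 // lines) / 100


# Patch: 5 of 7 changed lines are missing coverage, so 2 are covered
patch = percent_truncated(2, 7)
# Base and head totals as listed in the coverage diff (Hits / Lines)
base = percent_truncated(2832, 4890)
head = percent_truncated(2834, 4895)
```

All three values fall well below the 85.00% target, which is why both the patch and project statuses are reported as failed.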

Impacted file tree graph

@@            Coverage Diff             @@
##              dev     #656      +/-   ##
==========================================
- Coverage   57.91%   57.89%   -0.02%     
==========================================
  Files          27       27              
  Lines        4890     4895       +5     
==========================================
+ Hits         2832     2834       +2     
- Misses       2058     2061       +3     
Files with missing lines | Coverage Δ
--- | ---
src/tplr/sharded_dataset.py | 22.48% <28.57%> (+0.70%) ⬆️
@joellidin joellidin merged commit 56b5a86 into dev Nov 14, 2025
6 of 8 checks passed
@joellidin joellidin deleted the feat/remap-shards branch November 14, 2025 14:23
@coderabbitai coderabbitai bot mentioned this pull request Nov 14, 2025
21 tasks