-
Notifications
You must be signed in to change notification settings - Fork 49
(dataset) Add shard index remapping functionality #656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThis change adds a shard-index remapping rule to Changes
Sequence Diagram(s)sequenceDiagram
actor Client
participant Manager as ShardedDatasetManager
participant Storage as ShardLocator
participant Dataset as SharedShardedDataset
Client->>Manager: request prepare_shard(original_idx)
Note right of Manager: remap_shard_index(original_idx)
Manager-->>Manager: remapped_idx = remap_shard_index(original_idx)
Manager->>Storage: locate_shards(remapped_idx)
Storage-->>Manager: shard_files
Manager->>Dataset: create dataset(remapped_idx, shard_files)
Dataset-->>Manager: dataset_handle
Manager-->>Client: return dataset_handle (logs original->remapped)
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes
Possibly related PRs
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/tplr/sharded_dataset.py (1)
226-236: Shard remapping helper is correct; consider configurability if more remaps are expectedThe implementation of
remap_shard_indexdoes exactly what the docstring says (map 7 → 13, others unchanged) and keeps the logic centralized, which is good.If you anticipate adding more remaps or changing them per‑run, consider making this mapping data‑driven (e.g., a dict or config/env‑driven rule) so that you don’t need to touch code for future remapping tweaks.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/tplr/sharded_dataset.py(3 hunks)
🔇 Additional comments (1)
src/tplr/sharded_dataset.py (1)
224-225: Confirmmax_dataset_idxvalue vs shard layout and remap semantics
max_dataset_idx = 14means logical shard indices will cycle over[0, 13]. Withremap_shard_indexmapping7 -> 13, physical shard 13 will be used for both logical shard 7 and logical shard 13, and physical shard 7 will never be touched (if it exists). The PR description mentions increasingmax_dataset_idxfrom 10 to 13, which suggests a possible off‑by‑one or intent mismatch.Please double‑check that:
- Shards are actually available up to index 13 on disk/R2, and
- The intended behavior is to alias both logical 7 and logical 13 to physical 13, rather than to only redirect 7 within a 0–9 range.
If the intent was “max logical shard index is 12”, this should be
13instead of14. Otherwise, a short comment explaining the aliasing would reduce confusion.
Codecov Report❌ Patch coverage is
❌ Your patch status has failed because the patch coverage (28.57%) is below the target coverage (85.00%). You can increase the patch coverage or adjust the target coverage. @@ Coverage Diff @@
## dev #656 +/- ##
==========================================
- Coverage 57.91% 57.89% -0.02%
==========================================
Files 27 27
Lines 4890 4895 +5
==========================================
+ Hits 2832 2834 +2
- Misses 2058 2061 +3
🚀 New features to boost your workflow:
|
Implement shard remapping to redirect shard 7 to shard 13, expanding dataset access beyond the original 10-shard limit. - Increase max_dataset_idx from 10 to 13 to support higher shards - Add remap_shard_index() static method to handle shard 7 -> 13 - Update prepare_shard() to use remapped indices for file location - Update create_dataset() to pass remapped index to dataset init - Enhance logging to show both original and remapped indices
ee43059 to
fb48377
Compare
Implement shard remapping to redirect shard 7 to shard 13, expanding
dataset access beyond the original 10-shard limit.
Description
Related Issue(s)
Type of Change
Branch Naming
Commit Messages
Code Quality
Testing
Documentation
If this is a breaking change
Screenshots/Examples
Additional Notes
Summary by CodeRabbit