fix: make data shard downloads more robust with retry logic and validation #663
Conversation
Walkthrough

This PR implements robust shard downloads with retry logic, exponential backoff, and file validation. It adds configurable parameters to `ShardedDatasetManager`, introduces async helper methods for retry-enabled preparation, enhances error handling in dataset creation and swapping, adds comprehensive unit tests, and provides detailed documentation.
Sequence Diagram

```mermaid
sequenceDiagram
participant App as Application
participant SDM as ShardedDatasetManager
participant Retry as Retry Loop
participant S3 as S3/Bucket
participant FS as File System
participant Val as Validation
App->>SDM: create_dataset()
SDM->>SDM: await _prepare_shard_with_retry()
loop Retry Loop (max_download_retries)
Retry->>FS: _validate_shard_files()
alt Files valid
Retry-->>SDM: Files exist & non-zero
else Files missing/invalid
Retry->>S3: Download tokens & ids
S3-->>FS: Transfer files
Retry->>Val: _validate_shard_files()
alt Validation passes
Val-->>Retry: ✓ Valid
else Validation fails
Retry->>Retry: exponential backoff
Note over Retry: Retry with delay
end
end
end
alt Download succeeded
Retry-->>SDM: return True
SDM->>SDM: barrier sync (multi-process)
SDM-->>App: Dataset ready
else Download failed
Retry-->>SDM: return False
SDM-->>App: RuntimeError: Unable to prepare shard
end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (5 passed)
Actionable comments posted: 2
🧹 Nitpick comments (3)
src/tplr/sharded_dataset.py (1)
505-507: Consider adding a deprecation warning at runtime.

The docstring marks this method as deprecated, but callers won't see warnings at runtime. Consider adding `warnings.warn()` to alert users.

```diff
+import warnings
+
 async def download_files(
     self,
     bucket: tplr.schemas.Bucket,
     tokens_file: os.PathLike,
     ids_file: os.PathLike,
 ) -> asyncio.TaskGroup:
     """Downloads the shard and its indices.

     DEPRECATED: Use _download_files_with_validation instead for better error handling.
     ...
     """
+    warnings.warn(
+        "download_files is deprecated, use _download_files_with_validation instead",
+        DeprecationWarning,
+        stacklevel=2
+    )
     return await asyncio.gather(
```

tests/unit/test_robust_shard_download.py (1)
305-307: Replace deprecated `asyncio.coroutine` pattern.

The `asyncio.coroutine(lambda: ...)()` pattern is deprecated since Python 3.8 and will be removed in Python 3.12+. Use an async function instead.

```diff
-    dataset_manager.upcoming_dataset = asyncio.create_task(
-        asyncio.coroutine(lambda: True)()
-    )
+    async def return_true():
+        return True
+    dataset_manager.upcoming_dataset = asyncio.create_task(return_true())
```

Apply the same pattern to lines 324-326 and 349-351.
docs/robust_shard_downloads.md (1)
100-104: Add language specification to fenced code block.

The code block showing log output should specify a language (e.g., `text` or `log`) per markdownlint rules.

````diff
-```
+```text
 ERROR: Timeout downloading shard 5 after 600s (attempt 2/3)
 INFO: Retrying in 2s...
 ERROR: Failed to download shard 5 after 3 attempts. Files: /path/to/train_000005.npy, /path/to/sample_ids_000005.npy
````

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7a67c5a4142e919b8226445e63e90f23df9bd5d and 123767cc1f6d048053bdde4ecaf2b1bb88f83f07.

📒 Files selected for processing (3)

- `docs/robust_shard_downloads.md` (1 hunks)
- `src/tplr/sharded_dataset.py` (7 hunks)
- `tests/unit/test_robust_shard_download.py` (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

src/tplr/sharded_dataset.py (1)
- src/tplr/comms.py (2): `get_own_bucket` (220-278), `s3_get_object` (477-630)

tests/unit/test_robust_shard_download.py (1)
- src/tplr/sharded_dataset.py (8): `SharedShardedDataset` (58-233), `_validate_shard_files` (404-434), `_download_files_with_validation` (436-497), `_prepare_shard_with_retry` (323-402), `initialize_datasets` (581-602), `swap_datasets` (604-677), `create_dataset` (527-579), `remap_shard_index` (282-291)

🪛 markdownlint-cli2 (0.18.1)

docs/robust_shard_downloads.md
100-100: Fenced code blocks should have a language specified
(MD040, fenced-code-language)

🔇 Additional comments (13)

src/tplr/sharded_dataset.py (7)

247-276: LGTM! Configurable retry and timeout parameters.
The new parameters `max_download_retries` and `download_timeout` provide good flexibility for different network conditions. The defaults (3 retries, 600s timeout) are reasonable.

293-321: LGTM! Clean delegation to retry-enabled preparation.
The `prepare_shard` method cleanly delegates to `_prepare_shard_with_retry`, maintaining backward compatibility while adding robust download handling.

323-402: LGTM! Robust retry implementation with exponential backoff.
The retry logic is well-implemented:
- Pre-download validation avoids unnecessary downloads
- Exponential backoff (1s, 2s, 4s) prevents thundering herd issues
- Clear error messages with attempt counts and file paths
- Proper separation of timeout vs general exception handling

404-434: LGTM! Validation is straightforward and handles edge cases.
The validation method correctly checks for existence and non-zero size. The exception handling with warning log and graceful `False` return is appropriate.

436-497: LGTM! Concurrent download with proper exception handling.
The implementation correctly:
- Downloads both files concurrently for performance
- Uses `return_exceptions=True` to handle partial failures gracefully
- Validates files after download to ensure integrity
- Logs clear error messages for each failure mode

547-569: LGTM! Robust await-based dataset creation with proper synchronization.
The implementation correctly:
- Has rank 0 perform the download while others wait
- Uses barrier synchronization for multi-process setups
- Validates files before creating the dataset
- Raises clear `RuntimeError` on failure
The error handling prevents the system from entering a broken state.

623-652: Verify non-rank-0 behavior when background preparation fails.
When `success` is `False` (line 627), the synchronous retry only executes for rank 0 (line 633). If rank 0's retry succeeds, all processes continue normally. However, if `self.rank != 0` and background preparation failed, the code proceeds without any retry or validation. This could cause `initialize_datasets` to fail later when non-rank-0 processes attempt to use files that rank 0 may still be downloading. Clarify whether non-rank-0 processes should wait for rank 0's retry completion, fail explicitly with a clear error, or rely on downstream synchronization mechanisms.

tests/unit/test_robust_shard_download.py (5)

21-50: LGTM! Well-structured test fixtures.
The fixtures provide good isolation:
- `mock_comms` creates a properly mocked communication object
- `temp_dataset_path` provides an isolated temporary directory
- `dataset_manager` combines both with proper environment setup

53-96: LGTM! Good coverage for validation scenarios.
The validation tests cover all key scenarios: valid files, missing files, and zero-size files.

99-160: LGTM! Download validation tests cover success and failure modes.
Good coverage of:
- Successful download with file creation
- Download returning `None`
- Download raising exceptions

360-408: LGTM! Dataset creation tests cover key scenarios.
Good coverage including:
- Successful creation with properly structured files
- Download failure handling
- Validation failure after download
The `torch.distributed.barrier` patch is appropriate for single-process testing.

411-451: LGTM! Remapping and configuration tests are clear and complete.
The tests validate shard remapping logic and configurable parameters effectively.

docs/robust_shard_downloads.md (1)

1-233: LGTM! Comprehensive and well-structured documentation.
The documentation effectively covers:
- Problem statement and motivation
- Solution components with code examples
- Configuration options and usage patterns
- Migration guide for existing deployments
- Testing instructions
The code examples align with the actual implementation.
Actionable comments posted: 1
♻️ Duplicate comments (1)
docs/robust_shard_downloads.md (1)
237-237: Update placeholder PR reference to #663.

The PR reference still contains the placeholder and should be updated to the actual PR number before merging.
```diff
-- [PR #XXX](https://github.com/one-covenant/templar/pull/XXX): Implementation PR
+- [PR #663](https://github.com/one-covenant/templar/pull/663): Implementation PR
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
- `docs/robust_shard_downloads.md` (1 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/robust_shard_downloads.md
100-100: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
- Format code with ruff to pass CI checks
- Add missing R2_DATASET_WRITE credentials to test conftest
- Add DATASET_BINS_PATH environment variable for tests
Actionable comments posted: 2
🧹 Nitpick comments (2)
src/tplr/sharded_dataset.py (2)
404-434: Consider adding basic content validation for enhanced robustness.

The existence and size checks are good baseline validation. For additional resilience against corrupted downloads (e.g., partial writes, network corruption), consider validating the numpy file header or computing a quick checksum. This would catch files that have correct size but invalid content.

```diff
 def _validate_shard_files(
     self, tokens_file: os.PathLike, ids_file: os.PathLike
 ) -> bool:
     try:
         if not os.path.exists(tokens_file) or not os.path.exists(ids_file):
             return False

         # Check file sizes are non-zero
         tokens_size = os.path.getsize(tokens_file)
         ids_size = os.path.getsize(ids_file)

         if tokens_size == 0 or ids_size == 0:
             tplr.logger.warning(
                 f"Shard files exist but have zero size: tokens={tokens_size}, ids={ids_size}"
             )
             return False

+        # Quick header validation for numpy files
+        for filepath in [tokens_file, ids_file]:
+            if str(filepath).endswith('.npy'):
+                with open(filepath, 'rb') as f:
+                    magic = f.read(6)
+                    if magic != b'\x93NUMPY':
+                        tplr.logger.warning(f"Invalid numpy header in {filepath}")
+                        return False
+
         return True
```
499-525: Consider adding runtime deprecation warning.

The docstring deprecation notice is good, but callers may not notice it. Consider adding a runtime warning to alert developers using this method.

```diff
 async def download_files(
     self,
     bucket: tplr.schemas.Bucket,
     tokens_file: os.PathLike,
     ids_file: os.PathLike,
 ) -> asyncio.TaskGroup:
     """Downloads the shard and its indices.

     DEPRECATED: Use _download_files_with_validation instead for better error handling.
     ...
     """
+    import warnings
+    warnings.warn(
+        "download_files is deprecated, use _download_files_with_validation instead",
+        DeprecationWarning,
+        stacklevel=2,
+    )
     return await asyncio.gather(
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `src/tplr/sharded_dataset.py` (7 hunks)
- `tests/conftest.py` (1 hunks)
- `tests/unit/test_robust_shard_download.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/unit/test_robust_shard_download.py
🔇 Additional comments (5)
tests/conftest.py (1)
34-38: LGTM! The new mock environment variables follow the existing pattern and appropriately support the robust shard download test scenarios.
src/tplr/sharded_dataset.py (4)
247-276: LGTM! The new configuration parameters have sensible defaults and are well-documented. The 600s timeout per attempt and 3 retry attempts provide reasonable resilience for large shard downloads.
323-402: Well-structured retry logic with appropriate error handling. The exponential backoff implementation is correct (1s, 2s, 4s) and the separation of timeout vs. general exceptions provides clear diagnostics. The early-exit on valid existing files avoids unnecessary downloads.
436-497: LGTM! Good use of `asyncio.gather` with `return_exceptions=True` to handle concurrent downloads gracefully. The result checking for both exceptions and `None` values covers the relevant failure modes, and post-download validation ensures file integrity.
660-676: LGTM! Good defensive cleanup logic: only rank 0 deletes to avoid races, handles missing files gracefully, and continues operation even if deletion fails. The explicit `del old_dataset` helps ensure timely memory release.
The CI workflow sets environment variables from secrets, but if secrets are not configured they become empty strings rather than being undefined. `os.environ.setdefault()` doesn't override existing keys, even empty ones, so the mock values were never applied. Changed to use a helper function that treats an empty value as unset and fills in mock values for testing when needed. This allows tests to run both locally and in CI without requiring actual R2 credentials.
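A minimal sketch of what such a helper can look like (the name `set_mock_env` matches the conftest helper discussed below; the actual implementation may differ):

```python
import os

def set_mock_env(key: str, mock_value: str) -> None:
    # Treat an unset key *or* an empty string (e.g. a missing CI secret)
    # as "not configured" and fall back to the mock value.
    if not os.environ.get(key):
        os.environ[key] = mock_value
```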
@joellidin can you please check the PR?
Critical fixes for robust shard download implementation:

1. **Distributed Training Deadlock (CRITICAL)**
   - Fixed deadlock when rank 0 download fails before barrier
   - Now all ranks reach barrier before any exception is raised
   - Prevents non-rank-0 processes from waiting indefinitely
2. **Spurious Warning in Multi-Process Training**
   - Fixed false warnings from non-rank-0 processes
   - Changed 'if not success' to 'if success is False'
   - Non-rank-0 dummy tasks return None (not False)
3. **Test Environment Setup**
   - Mock HuggingFace tokenizer to avoid gated model access
   - Add bittensor compatibility shim (bt.wallet = bt.Wallet)
   - Set mock HF_TOKEN for CI environment
4. **Test Logic Improvement**
   - Fixed retry counting in test_prepare_shard_retry_on_failure
   - Now properly tracks attempt pairs (tokens + ids per attempt)
   - Assertion matches actual behavior (3 attempts, not 4+ calls)

These fixes ensure the code works correctly in distributed training environments and tests pass in CI without requiring credentials.
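For illustration, a rough sketch of the barrier-before-raise pattern from point 1 (the `download_fn` callback is hypothetical; the real flow lives in `create_dataset` and is discussed in the review below):

```python
import torch.distributed as dist

def prepare_shard_rank_safe(rank: int, shard_index: int, download_fn) -> None:
    """Rank 0 downloads; every rank hits the barrier before any error is raised."""
    download_success, download_error = True, None
    if rank == 0:
        try:
            download_success = download_fn(shard_index)
        except Exception as exc:
            download_success, download_error = False, exc
    # All ranks reach the barrier even when the download failed, so
    # non-rank-0 processes cannot block forever on a rank that already failed.
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.barrier()
    if download_success is False:
        raise RuntimeError(f"Unable to prepare shard {shard_index}") from download_error
```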
Actionable comments posted: 1
🧹 Nitpick comments (5)
tests/unit/test_robust_shard_download.py (1)
310-362: Avoid `asyncio.coroutine` in new swap tests.

`asyncio.coroutine(lambda: True/False)` is deprecated and will eventually be removed. It's safer and clearer to use a small async helper instead, e.g.:

```diff
-    dataset_manager.upcoming_dataset = asyncio.create_task(
-        asyncio.coroutine(lambda: True)()
-    )
+    async def _background_ok():
+        return True
+    dataset_manager.upcoming_dataset = asyncio.create_task(_background_ok())
```

(and similarly for the `False` case).

tests/conftest.py (2)
12-41: Env helper semantics are reasonable; be aware of empty-string override
`set_mock_env` only sets a variable when `os.environ.get(key)` is falsy, so real, non-empty values from the environment are preserved (good for CI/runtime configuration). Just note that an explicitly empty string will be treated as "unset" and replaced by the mock value; if you ever rely on empty strings as a meaningful setting, you may want a stricter check (e.g., `if key not in os.environ:`).
52-83: Pre-import mocking and fixtures look solid, but are quite centralised. The combination of:
- pre-import tokenizer mocking (
AutoTokenizer.from_pretrained),- the bittensor wallet compatibility shim,
- loading
hparamsunder those mocks, and- constructing realistic
Comms/Validatorfixturesgives tests a robust, production-like environment while avoiding external dependencies. The trade-off is that a lot of test plumbing now lives in a single conftest, so future changes in
tplr/neuronsinternals will likely need updates here.Consider adding brief comments near the
with patch("transformers.AutoTokenizer.from_pretrained", ...)block and thecomms_instance/mock_validatorfixtures explaining which parts of the production initialization they are mirroring; that will make future refactors safer.Also applies to: 173-208, 253-296
src/tplr/sharded_dataset.py (2)
323-498: Retry + validation flow is sound; consider wiring timeout through and tightening validationThe
_prepare_shard_with_retry/_download_files_with_validationstack looks correct:
- pre-checks for existing valid files short-circuit unnecessary downloads,
- failed downloads,
Nonereturns, and exceptions are all treated as failure,- retries use exponential backoff with clear logging, and
- final failure logs include the concrete file paths.
Two possible follow-ups:
- Propagate
download_timeoutintos3_get_objectcallsRight now,
download_timeoutonly wraps_download_files_with_validationviaasyncio.wait_for, while eachs3_get_objectcall still uses its own default timeout. If you want the configured timeout to control per-object S3 ops, you could do:- self.comms.s3_get_object( - tokens_file, - bucket, - load_data=False, - show_progress=True, - ), + self.comms.s3_get_object( + tokens_file, + bucket, + timeout=self.download_timeout, + load_data=False, + show_progress=True, + ),(and similarly for the IDs file), assuming the signature matches
s3_get_objectas implemented.
- Stronger validation for obviously truncated data (optional)
_validate_shard_filescurrently only checks existence and non-zero size. Since you knowsequence_lengthand token dtype, you could optionally add a quick check thattokens_sizeis divisible bysequence_length * sizeof(uint32)or even that the inferred sample count roughly matchessample_idslength. That would catch some corrupt/truncated shards earlier, beforeSharedShardedDataset.mmap_tokens_and_idsraises.These are nice-to-haves; the current logic is already a clear improvement over the previous behavior.
499-525: Deprecated `download_files` docstring and return type are slightly misleading
download_filesis now marked deprecated in the docstring (good), but the annotation-> asyncio.TaskGroupdoesn’t match the actual return type:await asyncio.gather(...)returns a list/tuple of results, not aTaskGroup. To avoid confusion while keeping the method for backward compatibility, consider:
- updating the return annotation to
-> list[object](or similar), and/or- adding a
DeprecationWarningin the body if it’s still being used elsewhere.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- `src/tplr/sharded_dataset.py` (7 hunks)
- `tests/conftest.py` (1 hunks)
- `tests/unit/test_robust_shard_download.py` (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
src/tplr/sharded_dataset.py (1)
src/tplr/comms.py (2)
`get_own_bucket` (220-278), `s3_get_object` (477-630)
tests/unit/test_robust_shard_download.py (1)
src/tplr/sharded_dataset.py (7)
SharedShardedDataset(58-233)_validate_shard_files(404-434)_download_files_with_validation(436-497)initialize_datasets(589-610)swap_datasets(612-686)create_dataset(527-587)remap_shard_index(282-291)
🔇 Additional comments (4)
tests/unit/test_robust_shard_download.py (2)
221-262: Retry test now correctly models per-attempt behavior. The updated
attempt_countlogic (incrementing only on thetrain_key) now treats each pair of token/id downloads as a single attempt, so both files consistently fail for attempts 1–2 and succeed together on attempt 3. This resolves the earlier inconsistency where individuals3_get_objectcalls were counted as attempts rather than full download cycles.
408-422: Test expectation doesn't match current `create_dataset` behavior. In
test_create_dataset_validation_failure, you expect aRuntimeErrormatching"files do not exist or are invalid". However, with the current implementation:
_download_files_with_validationtreats zero-sized files as a download failure (returnsFalseafter validation fails)._prepare_shard_with_retrythen returnsFalse, andcreate_datasetraises the earlier"Failed to download shard {shard_index} after {max_download_retries} attempts..."error, never reaching the later"files do not exist or are invalid"branch.Adjust the test assertion to match the actual
"Failed to download shard"error message, or if you intend a distinct "invalid after download" failure mode, refactor_prepare_shard_with_retry/create_datasetto differentiate these cases and update both code and test accordingly.src/tplr/sharded_dataset.py (2)
546-577: `create_dataset` DDP flow now avoids the earlier deadlock, with an extra safety check. The new
`download_success`/`download_error` pattern ensures:
- rank 0 always reaches the
dist.barrier(whenworld_size > 1) even if the download fails, and- the
RuntimeErroris raised only after the barrier, so non-rank-0 processes don’t hang waiting.The final
_validate_shard_filescall before constructingSharedShardedDatasetis a useful extra guard against races where files disappear or get corrupted between download and dataset creation. Overall the flow looks correct.
247-277: Validatemax_download_retriesanddownload_timeoutparameters to prevent misconfigurationThe constructor accepts
max_download_retriesanddownload_timeoutbut stores them without validation. Callers could pass non-positive values:
max_download_retries <= 0would cause the retry loop to skip all attemptsdownload_timeout <= 0could causeasyncio.wait_for()to fail immediatelyAdd validation in
__init__to enforce sane minimums (e.g.,max(1, max_download_retries)andmax(1, download_timeout)) or raiseValueErroron invalid inputs to fail fast.
|
@joellidin can you please check the PR? |
I am at a conference. Please give me a few days. |
- Apply global patches for transformers.AutoTokenizer and AutoConfig
- Prevents gated HuggingFace repo access during test collection
- Fixes error when test_comms.py calls load_hparams() at module level
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/conftest.py (2)
3-18: Global AutoTokenizer mock works, but consider session-scoped teardown and avoiding duplicate imports. The global patch of
`transformers.AutoTokenizer.from_pretrained` to `_mock_tokenizer` is a clean way to avoid network/gated-model issues during collection, and the mock surface (pad/eos/bos/vocab/encode/decode) looks sufficient for most tests. However:
- The patcher is started at import time and never stopped, so any test that might want a real
`AutoTokenizer` (or a differently configured mock) has no straightforward escape hatch. `MagicMock`/`patch` are imported twice (Line 3 and Line 61), which is harmless but noisy. Consider:
- Converting this into a
session-scoped fixture thatyields and calls_tokenizer_patcher.stop()in teardown, so specific tests can temporarily unpatch if needed.- Consolidating the
unittest.mockimports in one place.These are optional cleanups; behavior is otherwise fine.
Also applies to: 61-61
70-75: Bittensor compatibility shim is pragmatic; add a small guard for unusual versionsThe
if not hasattr(bt, "wallet"): bt.wallet = bt.Walletpattern is a pragmatic way to bridgebt.walletvsbt.Walletnaming differences so that laterbittensor.walletpatching works in tests. One minor edge case: if a future/older bittensor build exposes neitherwalletnorWallet, this line will raiseAttributeError.If you want extra robustness, you could guard on
hasattr(bt, "Wallet")as well, or usegetattr(bt, "Wallet", None)and only alias when it’s present:-if not hasattr(bt, "wallet"): - bt.wallet = bt.Wallet +if not hasattr(bt, "wallet"): + wallet_cls = getattr(bt, "Wallet", None) + if wallet_cls is not None: + bt.wallet = wallet_clsNot strictly required if your supported bittensor versions all provide
Wallet, but it will make the tests more forward/backward compatible.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/conftest.py(2 hunks)
🔇 Additional comments (2)
tests/conftest.py (2)
30-35: Environment mock helper and defaults look correct
`set_mock_env` nicely avoids clobbering pre-existing env vars while still backfilling sensible defaults for the various `R2_*`, `DATASET_BINS_PATH`, and `HF_TOKEN` values. The `if not os.environ.get(key)` condition matches the docstring ("not set or is empty"), and centralizing this logic should make future test env tweaks easier. No changes needed here.
Also applies to: 38-41, 44-58
82-85: TPLR/Validator imports aligned with tokenizer + env mockingImporting
tplr.comms,tplr.compress,tplr, andValidatorat module scope (after env setup and the global tokenizer patch) ensures that:
- `tplr.load_hparams()` observes the mocked tokenizer and mock env values, avoiding network access and real bucket credentials.
- Downstream tests can rely on a consistent, preinitialized configuration.
This wiring matches the stated testing goals for robust shard downloads; I don’t see issues here.
Also applies to: 87-87
- Adds mock bt.config to handle newer bittensor API
- Fixes test_evaluator.py collection error
- All 416 tests now collect successfully
Actionable comments posted: 2
🧹 Nitpick comments (1)
tests/conftest.py (1)
94-102: Module-level hparams loading may have side effects.Loading
hparamsat module level (line 102) executes during test collection, which could:
- Slow down test collection
- Cause issues if
`load_hparams()` has side effects or depends on runtime state
- Make debugging harder if it fails during collection
If
hparamsis only needed by specific tests, consider moving it to a fixture.
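For example, a lazily evaluated fixture along these lines would defer the load until a test actually requests it (a sketch only, assuming `tplr.load_hparams()` as referenced elsewhere in this review):

```python
import pytest
import tplr

@pytest.fixture(scope="session")
def hparams():
    # Loaded once per session, on first use, instead of at import/collection time.
    return tplr.load_hparams()
```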
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
tests/conftest.py(2 hunks)
🔇 Additional comments (4)
tests/conftest.py (4)
38-43: Good defensive design for environment variable mocking.The
set_mock_envhelper properly checks if the variable is already set before applying the mock value, preventing accidental overwrite of real credentials in development environments.
129-316: Well-structured pytest fixtures.The fixtures are properly designed with:
- Appropriate scopes (session for config, function for instances)
- Clear separation of concerns (model, comms, validator)
- Detailed setup matching production expectations (totalks, shapes)
- Proper use of async for
comms_instanceThe session-wide
`mock_config` with `autouse=True` ensures consistent test environment across all tests.
19-26: Verify that session-wide patches don't interfere with other test modules.The patchers are started at module level and never stopped, making them active for the entire test session. While this prevents HuggingFace API calls, verify that:
- No other test modules expect the real
`AutoTokenizer` or `AutoConfig`
- The hardcoded mock values (e.g.,
vocab_size=50000) are suitable for all testsDocument this session-wide behavior if intentional.
78-87: Reconsider scope of bittensor compatibility workaround.This code is in
tests/conftest.py, meaning the monkey-patching is scoped to the test environment only and does not affect production code. However, verify whether:
- These patches are necessary for the bittensor version in use, or if pinning a compatible version would be simpler
bt.walletandbt.configare actually accessed in tests after this patch- Using targeted mocks in individual test fixtures would be more maintainable than global module modification
The MagicMock for
bt.configis permissive and might obscure real compatibility issues in tests.
```python
# Global tokenizer mock that will be applied before test collection
_mock_tokenizer = MagicMock()
_mock_tokenizer.pad_token_id = 0
_mock_tokenizer.eos_token_id = 1
_mock_tokenizer.bos_token_id = 2
_mock_tokenizer.vocab_size = 50000
_mock_tokenizer.encode = MagicMock(return_value=[1, 2, 3])
_mock_tokenizer.decode = MagicMock(return_value="test")
```
Potential test pollution from mutable global mock.
The global _mock_tokenizer is a mutable MagicMock shared across all tests. If any test modifies this object (e.g., changing vocab_size, resetting mocks, or modifying return values), all subsequent tests will observe those changes, leading to test interdependencies and flaky behavior.
Consider making the mock immutable or documenting that tests must not modify it. Alternatively, use a fresh mock instance per test via a fixture.
```python
# Example: If test A does this:
_mock_tokenizer.vocab_size = 100
# Test B will see vocab_size=100 instead of the expected 50000
```

🤖 Prompt for AI Agents
In tests/conftest.py around lines 5 to 12 the global _mock_tokenizer is a
mutable MagicMock shared across all tests causing test pollution; replace it
with a function-scoped pytest fixture that creates and returns a fresh MagicMock
per test with the same attributes (pad_token_id, eos_token_id, bos_token_id,
vocab_size, and encode/decode return values), update tests to accept that
fixture instead of importing the global, and remove or deprecate the global
variable; alternatively, if keeping a global is required, wrap it in an
immutable proxy or deep-freeze its attributes and clearly document it cannot be
modified.
| set_mock_env("R2_DATASET_READ_SECRET_ACCESS_KEY", "mock-dataset-read-secret-key") | ||
| set_mock_env("R2_DATASET_WRITE_ACCESS_KEY_ID", "mock-dataset-write-key-id") | ||
| set_mock_env("R2_DATASET_WRITE_SECRET_ACCESS_KEY", "mock-dataset-write-secret-key") | ||
| set_mock_env("DATASET_BINS_PATH", "/tmp/test-dataset") |
Hardcoded path may cause issues in parallel test execution.
The hardcoded path /tmp/test-dataset could cause conflicts if multiple test processes run in parallel (e.g., via pytest -n auto). Consider using a session-scoped temporary directory or adding a process ID suffix.
```python
# Example: Use pytest's tmp_path_factory or add PID suffix
import os
set_mock_env("DATASET_BINS_PATH", f"/tmp/test-dataset-{os.getpid()}")
# Or use a fixture-based approach with tmp_path_factory
```

🤖 Prompt for AI Agents
In tests/conftest.py around line 64, the test sets DATASET_BINS_PATH to a
hardcoded /tmp/test-dataset which can collide across parallel pytest workers;
change this to a unique location by using a session-scoped temporary directory
or appending a process-specific suffix (e.g., PID or worker id) so each test
process gets its own path, and update set_mock_env to receive that generated
path (or create the tmp dir via pytest's tmp_path_factory in a fixture and set
the env there).
@joellidin please check the PR when you are available; I will close it for now, but will reopen it anytime you want.
Description
This PR implements comprehensive improvements to shard download reliability, addressing all requirements from Issue #600. The current implementation of shard downloads is not fully reliable - downloads can fail or stall, and retries do not always resolve the issue. Since shards are swapped only occasionally, failures can leave the system in a broken state.
Solution implemented:
- Configurable `max_download_retries` and `download_timeout` for different network conditions

Related Issue(s)
Type of Change
Branch Naming
`fix/robust-shard-downloads-issue-600`

Commit Messages
Code Quality
Testing
Testing Note: Local test execution is blocked by an existing bittensor compatibility issue in the codebase (`bt.wallet` vs `bt.Wallet` in `chain.py`). However, the new code passes a syntax check (`python -m py_compile`).

Documentation
- `docs/robust_shard_downloads.md` with comprehensive usage guide

Changes Made
Modified Files
`src/tplr/sharded_dataset.py` (+250 lines)
- `_prepare_shard_with_retry()` - Retry logic with exponential backoff
- `_validate_shard_files()` - File existence and size validation
- `_download_files_with_validation()` - Download with result checking
- `create_dataset()` - Await-based approach with validation
- `swap_datasets()` - Graceful failure handling with fallback
- New parameters: `max_download_retries`, `download_timeout`

New Files
- `tests/unit/test_robust_shard_download.py` (500+ lines)
- `docs/robust_shard_downloads.md`

Acceptance Criteria Met
All criteria from Issue #600:
✅ Shard downloads succeed reliably without leaving system in broken state
✅ System fails gracefully with clear logs if download cannot complete
✅ Await-based direct download mode supported in `create_dataset()`

Backward Compatibility
✅ Fully backward compatible - No breaking changes
Performance Impact
Example Usage
Basic (no changes required)
Custom Configuration
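A hedged sketch of what a custom configuration could look like, assuming `ShardedDatasetManager` accepts the two new parameters as keyword arguments alongside its existing ones (the import path and omitted constructor arguments are assumptions, not taken from the PR):

```python
from tplr.sharded_dataset import ShardedDatasetManager

# Tolerate slower or flakier networks than the defaults (3 retries, 600s per attempt).
manager = ShardedDatasetManager(
    # ... the manager's existing arguments go here ...
    max_download_retries=5,   # try each shard download up to 5 times
    download_timeout=1200,    # allow 20 minutes per attempt
)
```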
Screenshots/Examples
Success Case
Retry Case
Failure Case (clear error)
Additional Notes
Key Implementation Details
Retry Strategy:
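The retry strategy described in this PR uses exponential backoff (1s, 2s, 4s) between attempts. A self-contained illustration of that pattern, not the actual `_prepare_shard_with_retry` implementation:

```python
import asyncio

async def retry_with_backoff(attempt, max_retries: int = 3) -> bool:
    """Run `attempt()` up to max_retries times, sleeping 1s, 2s, 4s... between tries."""
    for n in range(1, max_retries + 1):
        if await attempt():
            return True
        if n < max_retries:
            await asyncio.sleep(2 ** (n - 1))  # exponential backoff: 1s, 2s, 4s
    return False
```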
Validation Flow:
Error Handling Hierarchy:
Future Enhancements
Potential improvements for future PRs:
Reviewer Notes
Key areas to review:
- `_prepare_shard_with_retry()` - Exponential backoff implementation
- `_validate_shard_files()` - File size checks
- `swap_datasets()` - Fallback behavior
- `test_robust_shard_download.py` - All scenarios covered

Summary by CodeRabbit
New Features
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.