🐛 fix(repository): restore slow-path bucket read dropped by shard migration#429
Merged
Conversation
…ration
batch_get_entity_and_buckets() classified BatchGetItem response items with
`sk.startswith(SK_BUCKET)` ("#BUCKET#"), but bucket state records use
SK="#STATE" since the per-shard migration (GHSA-76rv). The filter never
matched, so every bucket was silently dropped: the acquire() slow path
treated existing buckets as new, the conditional write failed, and acquire()
raised RateLimitExceeded with retry_after=0.0 instead of refilling.
This broke wait-then-acquire refill recovery: with speculative_writes=True
the refill fallback fails (masked in production by the aggregator Lambda);
with speculative_writes=False or --no-aggregator it is fully broken.
Latent since v0.7.0 (663b687, else->elif startswith(SK_BUCKET)); activated in
v0.10.1 (94a129a) when bucket SK changed to "#STATE" without updating this
response-classification filter.
Add regression tests the suite lacked: direct batch_get_entity_and_buckets
coverage (incl. the no-#META trigger) and a behavioral wait-then-acquire test
on both speculative and non-speculative paths. Also note a TODO to make
_now_ms() the single injectable clock seam.
Fixes #428
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #429 +/- ##
==========================================
+ Coverage 92.39% 92.43% +0.03%
==========================================
Files 37 37
Lines 7854 7854
==========================================
+ Hits 7257 7260 +3
+ Misses 597 594 -3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. |
5 tasks
…very Exercises the real-DynamoDB code path the unit tests (moto) and the production aggregator both masked for the bug fixed in 7e61fd3: - integration (test_repository.py): batch_get_entity_and_buckets() returns existing buckets against real DynamoDB, incl. the no-#META trigger. The sibling batch_get_buckets() was already covered here; the method that broke was not — that gap let the regression ship. - e2e (test_localstack.py): exhaust -> wait -> acquire succeeds on the no-aggregator stack (client refill-recovery slow path). - benchmark (test_localstack.py): a recovery stress loop over multiple entities. All three run on no-aggregator (minimal) stacks on purpose, so the aggregator cannot refill out-of-band and mask the client path that broke. Verified RED before the fix / GREEN after, against a live LocalStack. Refs #428 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
3 tasks
sodre
added a commit
that referenced
this pull request
Jun 26, 2026
…ration (backport to 0.10.x) (#431) ## Summary Backports the slow-path bucket-read fix from #429 (merged to `main`) to the `release/0.10.x` maintenance branch for a **v0.10.3** patch release. **Bug:** `Repository.batch_get_entity_and_buckets()` — the `acquire()` slow-path / refill-recovery read — classified `BatchGetItem` response items using the stale `sk.startswith(schema.SK_BUCKET)` (`"#BUCKET#"`) prefix. Bucket state records use `SK == "#STATE"` (`sk_state()`) since the per-shard partition-key migration ([GHSA-76rv-2r9v-c5m6](GHSA-76rv-2r9v-c5m6)), so the filter never matched and every bucket was silently dropped. The slow path then treated existing buckets as new, the conditional write failed, and `acquire()` raised `RateLimitExceeded` with `retry_after=0.0` instead of refilling. **Effect:** wait-then-acquire refill recovery was broken. - With `speculative_writes=True`, it breaks the refill fallback (masked in production only when the aggregator Lambda refills out-of-band). - With `speculative_writes=False` or `--no-aggregator`, it is fully broken. This bug shipped in **v0.10.1** and is present in **v0.10.2**; this PR delivers the fix as **v0.10.3**. **Fix:** one line — `elif sk.startswith(schema.SK_BUCKET):` → `elif sk == schema.sk_state():` (async source `repository.py`; regenerated into `sync_repository.py`). Cherry-picked cleanly from `main` commits `7e61fd3` (fix + unit tests) and `8123fea` (LocalStack tests), with `-x` provenance recorded. ## Test plan All verified locally on this branch against live LocalStack: - [x] **unit (moto):** `batch_get_entity_and_buckets` returns existing buckets incl. no-`#META` trigger; wait-then-acquire recovery on speculative + non-speculative paths (async + generated sync) — 8 passed. - [x] **integration/e2e/benchmark (LocalStack, no-aggregator stacks):** direct read coverage, exhaust→wait→acquire recovery, recovery stress loop — 4 passed. - [x] **sync-generation:** no drift; `ruff` + `mypy` clean. ## References - Original fix: #429 - Bug report: #428 (already closed on `main`; referenced here as context only) This PR targets the `0.10.x` maintenance line. 🤖 Generated with [Claude Code](https://claude.ai/code)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes a latent bug in
Repository.batch_get_entity_and_buckets()(theacquire()slow-path / refill-recovery read) that silently dropped every bucket from theBatchGetItemresponse.sk.startswith(schema.SK_BUCKET)("#BUCKET#"). Since the per-shard partition-key migration (GHSA-76rv-2r9v-c5m6), bucket state records useSK == "#STATE"(sk_state()), so the filter never matched and every bucket was discarded. The slow path then treated existing buckets as new, the conditional write failed, andacquire()raisedRateLimitExceededwithretry_after=0.0instead of refilling.speculative_writes=Trueit breaks the refill fallback (masked in production only when the aggregator Lambda refills out-of-band); withspeculative_writes=Falseor--no-aggregatorit is fully broken.663b687, where the robustelsebranch was narrowed toelif sk.startswith(SK_BUCKET)); activated in v0.10.1 (commit94a129a, "migrate bucket items to per-shard partition keys") which changed the bucket SK to"#STATE"but missed this one response-classification filter. The siblingbatch_get_buckets()has no such filter, which is why cascade/parent reads andget_buckets()still worked. Still present onmain(v0.10.2).batch_get_entity_and_buckets—elif sk.startswith(schema.SK_BUCKET):→elif sk == schema.sk_state():(async sourcerepository.py; regenerated intosync_repository.py).TODO(clock-seam)atRepository._now_ms()noting thatlimiter.py/lease.pyuse inlineint(time.time() * 1000)instead of routing through_now_ms(), so it is not the single injectable clock; follow-up to put_now_ms()onRepositoryProtocoland route both through it.Test plan
Added regression tests the suite lacked (verified RED before the fix, GREEN after):
tests/unit/test_repository.py:test_batch_get_entity_and_buckets_returns_existing_buckets,test_batch_get_entity_and_buckets_finds_bucket_without_meta(the no-#METAreal-world trigger)tests/unit/test_limiter.py:TestRateLimiterRefillRecovery— exhaust → wait → acquire succeeds, on both speculative and non-speculative paths; asserts the rejection reports a realretry_after(catches theretry_after=0.0symptom)test_sync_repository.py,test_sync_limiter.py)mypyclean;ruff check/formatclean; sync-generation drift: none; pre-push patch coverage: 100%Fixes #428
🤖 Generated with Claude Code