🐛 fix(repository): restore slow-path bucket read dropped by shard migration by sodre · Pull Request #429 · zeroae/zae-limiter

sodre · 2026-06-26T01:55:50Z

Summary

Fixes a latent bug in Repository.batch_get_entity_and_buckets() (the acquire() slow-path / refill-recovery read) that silently dropped every bucket from the BatchGetItem response.

Bug: The method classified response items with sk.startswith(schema.SK_BUCKET) ("#BUCKET#"). Since the per-shard partition-key migration (GHSA-76rv-2r9v-c5m6), bucket state records use SK == "#STATE" (sk_state()), so the filter never matched and every bucket was discarded. The slow path then treated existing buckets as new, the conditional write failed, and acquire() raised RateLimitExceeded with retry_after=0.0 instead of refilling.
Effect: wait-then-acquire refill recovery was broken. With speculative_writes=True it breaks the refill fallback (masked in production only when the aggregator Lambda refills out-of-band); with speculative_writes=False or --no-aggregator it is fully broken.
When it broke: latent since v0.7.0 (commit 663b687, where the robust else branch was narrowed to elif sk.startswith(SK_BUCKET)); activated in v0.10.1 (commit 94a129a, "migrate bucket items to per-shard partition keys") which changed the bucket SK to "#STATE" but missed this one response-classification filter. The sibling batch_get_buckets() has no such filter, which is why cascade/parent reads and get_buckets() still worked. Still present on main (v0.10.2).
Fix: one line in batch_get_entity_and_buckets — elif sk.startswith(schema.SK_BUCKET): → elif sk == schema.sk_state(): (async source repository.py; regenerated into sync_repository.py).
Also: added a TODO(clock-seam) at Repository._now_ms() noting that limiter.py/lease.py use inline int(time.time() * 1000) instead of routing through _now_ms(), so it is not the single injectable clock; follow-up to put _now_ms() on RepositoryProtocol and route both through it.

Test plan

Added regression tests the suite lacked (verified RED before the fix, GREEN after):

tests/unit/test_repository.py: test_batch_get_entity_and_buckets_returns_existing_buckets, test_batch_get_entity_and_buckets_finds_bucket_without_meta (the no-#META real-world trigger)
tests/unit/test_limiter.py: TestRateLimiterRefillRecovery — exhaust → wait → acquire succeeds, on both speculative and non-speculative paths; asserts the rejection reports a real retry_after (catches the retry_after=0.0 symptom)
Sync twins regenerated (test_sync_repository.py, test_sync_limiter.py)
New tests: 8 passed (async + sync); full unit suite: 2623 passed
mypy clean; ruff check/format clean; sync-generation drift: none; pre-push patch coverage: 100%

Fixes #428

🤖 Generated with Claude Code

…ration batch_get_entity_and_buckets() classified BatchGetItem response items with `sk.startswith(SK_BUCKET)` ("#BUCKET#"), but bucket state records use SK="#STATE" since the per-shard migration (GHSA-76rv). The filter never matched, so every bucket was silently dropped: the acquire() slow path treated existing buckets as new, the conditional write failed, and acquire() raised RateLimitExceeded with retry_after=0.0 instead of refilling. This broke wait-then-acquire refill recovery: with speculative_writes=True the refill fallback fails (masked in production by the aggregator Lambda); with speculative_writes=False or --no-aggregator it is fully broken. Latent since v0.7.0 (663b687, else->elif startswith(SK_BUCKET)); activated in v0.10.1 (94a129a) when bucket SK changed to "#STATE" without updating this response-classification filter. Add regression tests the suite lacked: direct batch_get_entity_and_buckets coverage (incl. the no-#META trigger) and a behavioral wait-then-acquire test on both speculative and non-speculative paths. Also note a TODO to make _now_ms() the single injectable clock seam. Fixes #428 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov · 2026-06-26T02:01:23Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.43%. Comparing base (4f7cd02) to head (8123fea).
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #429      +/-   ##
==========================================
+ Coverage   92.39%   92.43%   +0.03%     
==========================================
  Files          37       37              
  Lines        7854     7854              
==========================================
+ Hits         7257     7260       +3     
+ Misses        597      594       -3

Flag	Coverage Δ
doctest	`29.14% <0.00%> (ø)`
e2e	`42.92% <100.00%> (+0.56%)`	⬆️
integration	`51.34% <100.00%> (+0.58%)`	⬆️
unit	`92.23% <100.00%> (+0.03%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

…very Exercises the real-DynamoDB code path the unit tests (moto) and the production aggregator both masked for the bug fixed in 7e61fd3: - integration (test_repository.py): batch_get_entity_and_buckets() returns existing buckets against real DynamoDB, incl. the no-#META trigger. The sibling batch_get_buckets() was already covered here; the method that broke was not — that gap let the regression ship. - e2e (test_localstack.py): exhaust -> wait -> acquire succeeds on the no-aggregator stack (client refill-recovery slow path). - benchmark (test_localstack.py): a recovery stress loop over multiple entities. All three run on no-aggregator (minimal) stacks on purpose, so the aggregator cannot refill out-of-band and mask the client path that broke. Verified RED before the fix / GREEN after, against a live LocalStack. Refs #428 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ration (backport to 0.10.x) (#431) ## Summary Backports the slow-path bucket-read fix from #429 (merged to `main`) to the `release/0.10.x` maintenance branch for a **v0.10.3** patch release. **Bug:** `Repository.batch_get_entity_and_buckets()` — the `acquire()` slow-path / refill-recovery read — classified `BatchGetItem` response items using the stale `sk.startswith(schema.SK_BUCKET)` (`"#BUCKET#"`) prefix. Bucket state records use `SK == "#STATE"` (`sk_state()`) since the per-shard partition-key migration ([GHSA-76rv-2r9v-c5m6](GHSA-76rv-2r9v-c5m6)), so the filter never matched and every bucket was silently dropped. The slow path then treated existing buckets as new, the conditional write failed, and `acquire()` raised `RateLimitExceeded` with `retry_after=0.0` instead of refilling. **Effect:** wait-then-acquire refill recovery was broken. - With `speculative_writes=True`, it breaks the refill fallback (masked in production only when the aggregator Lambda refills out-of-band). - With `speculative_writes=False` or `--no-aggregator`, it is fully broken. This bug shipped in **v0.10.1** and is present in **v0.10.2**; this PR delivers the fix as **v0.10.3**. **Fix:** one line — `elif sk.startswith(schema.SK_BUCKET):` → `elif sk == schema.sk_state():` (async source `repository.py`; regenerated into `sync_repository.py`). Cherry-picked cleanly from `main` commits `7e61fd3` (fix + unit tests) and `8123fea` (LocalStack tests), with `-x` provenance recorded. ## Test plan All verified locally on this branch against live LocalStack: - [x] **unit (moto):** `batch_get_entity_and_buckets` returns existing buckets incl. no-`#META` trigger; wait-then-acquire recovery on speculative + non-speculative paths (async + generated sync) — 8 passed. - [x] **integration/e2e/benchmark (LocalStack, no-aggregator stacks):** direct read coverage, exhaust→wait→acquire recovery, recovery stress loop — 4 passed. - [x] **sync-generation:** no drift; `ruff` + `mypy` clean. ## References - Original fix: #429 - Bug report: #428 (already closed on `main`; referenced here as context only) This PR targets the `0.10.x` maintenance line. 🤖 Generated with [Claude Code](https://claude.ai/code)

sodre added this to the v1.0.0 milestone Jun 26, 2026

sodre added the area/limiter Core rate limiting logic label Jun 26, 2026

sodre marked this pull request as ready for review June 26, 2026 02:14

sodre mentioned this pull request Jun 26, 2026

♻️ Make Repository._now_ms() the single injectable clock seam #430

Open

5 tasks

sodre merged commit ced59de into main Jun 26, 2026
16 checks passed

sodre deleted the fix/batch-get-bucket-sk-filter branch June 26, 2026 03:00

sodre mentioned this pull request Jun 26, 2026

🐛 fix(repository): restore slow-path bucket read dropped by shard migration (backport to 0.10.x) #431

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

🐛 fix(repository): restore slow-path bucket read dropped by shard migration#429

🐛 fix(repository): restore slow-path bucket read dropped by shard migration#429
sodre merged 2 commits into
mainfrom
fix/batch-get-bucket-sk-filter

sodre commented Jun 26, 2026 •

edited

Loading

Uh oh!

codecov Bot commented Jun 26, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

sodre commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

codecov Bot commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sodre commented Jun 26, 2026 •

edited

Loading

codecov Bot commented Jun 26, 2026 •

edited

Loading