Skip to content

🐛 fix(repository): restore slow-path bucket read dropped by shard migration#429

Merged
sodre merged 2 commits into
mainfrom
fix/batch-get-bucket-sk-filter
Jun 26, 2026
Merged

🐛 fix(repository): restore slow-path bucket read dropped by shard migration#429
sodre merged 2 commits into
mainfrom
fix/batch-get-bucket-sk-filter

Conversation

@sodre

@sodre sodre commented Jun 26, 2026

Copy link
Copy Markdown
Member

Summary

Fixes a latent bug in Repository.batch_get_entity_and_buckets() (the acquire() slow-path / refill-recovery read) that silently dropped every bucket from the BatchGetItem response.

  • Bug: The method classified response items with sk.startswith(schema.SK_BUCKET) ("#BUCKET#"). Since the per-shard partition-key migration (GHSA-76rv-2r9v-c5m6), bucket state records use SK == "#STATE" (sk_state()), so the filter never matched and every bucket was discarded. The slow path then treated existing buckets as new, the conditional write failed, and acquire() raised RateLimitExceeded with retry_after=0.0 instead of refilling.
  • Effect: wait-then-acquire refill recovery was broken. With speculative_writes=True it breaks the refill fallback (masked in production only when the aggregator Lambda refills out-of-band); with speculative_writes=False or --no-aggregator it is fully broken.
  • When it broke: latent since v0.7.0 (commit 663b687, where the robust else branch was narrowed to elif sk.startswith(SK_BUCKET)); activated in v0.10.1 (commit 94a129a, "migrate bucket items to per-shard partition keys") which changed the bucket SK to "#STATE" but missed this one response-classification filter. The sibling batch_get_buckets() has no such filter, which is why cascade/parent reads and get_buckets() still worked. Still present on main (v0.10.2).
  • Fix: one line in batch_get_entity_and_bucketselif sk.startswith(schema.SK_BUCKET):elif sk == schema.sk_state(): (async source repository.py; regenerated into sync_repository.py).
  • Also: added a TODO(clock-seam) at Repository._now_ms() noting that limiter.py/lease.py use inline int(time.time() * 1000) instead of routing through _now_ms(), so it is not the single injectable clock; follow-up to put _now_ms() on RepositoryProtocol and route both through it.

Test plan

Added regression tests the suite lacked (verified RED before the fix, GREEN after):

  • tests/unit/test_repository.py: test_batch_get_entity_and_buckets_returns_existing_buckets, test_batch_get_entity_and_buckets_finds_bucket_without_meta (the no-#META real-world trigger)
  • tests/unit/test_limiter.py: TestRateLimiterRefillRecovery — exhaust → wait → acquire succeeds, on both speculative and non-speculative paths; asserts the rejection reports a real retry_after (catches the retry_after=0.0 symptom)
  • Sync twins regenerated (test_sync_repository.py, test_sync_limiter.py)
  • New tests: 8 passed (async + sync); full unit suite: 2623 passed
  • mypy clean; ruff check/format clean; sync-generation drift: none; pre-push patch coverage: 100%

Fixes #428

🤖 Generated with Claude Code

…ration

batch_get_entity_and_buckets() classified BatchGetItem response items with
`sk.startswith(SK_BUCKET)` ("#BUCKET#"), but bucket state records use
SK="#STATE" since the per-shard migration (GHSA-76rv). The filter never
matched, so every bucket was silently dropped: the acquire() slow path
treated existing buckets as new, the conditional write failed, and acquire()
raised RateLimitExceeded with retry_after=0.0 instead of refilling.

This broke wait-then-acquire refill recovery: with speculative_writes=True
the refill fallback fails (masked in production by the aggregator Lambda);
with speculative_writes=False or --no-aggregator it is fully broken.

Latent since v0.7.0 (663b687, else->elif startswith(SK_BUCKET)); activated in
v0.10.1 (94a129a) when bucket SK changed to "#STATE" without updating this
response-classification filter.

Add regression tests the suite lacked: direct batch_get_entity_and_buckets
coverage (incl. the no-#META trigger) and a behavioral wait-then-acquire test
on both speculative and non-speculative paths. Also note a TODO to make
_now_ms() the single injectable clock seam.

Fixes #428

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sodre sodre added this to the v1.0.0 milestone Jun 26, 2026
@sodre sodre added the area/limiter Core rate limiting logic label Jun 26, 2026
@codecov

codecov Bot commented Jun 26, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.43%. Comparing base (4f7cd02) to head (8123fea).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #429      +/-   ##
==========================================
+ Coverage   92.39%   92.43%   +0.03%     
==========================================
  Files          37       37              
  Lines        7854     7854              
==========================================
+ Hits         7257     7260       +3     
+ Misses        597      594       -3     
Flag Coverage Δ
doctest 29.14% <0.00%> (ø)
e2e 42.92% <100.00%> (+0.56%) ⬆️
integration 51.34% <100.00%> (+0.58%) ⬆️
unit 92.23% <100.00%> (+0.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

@sodre sodre marked this pull request as ready for review June 26, 2026 02:14
…very

Exercises the real-DynamoDB code path the unit tests (moto) and the production
aggregator both masked for the bug fixed in 7e61fd3:

- integration (test_repository.py): batch_get_entity_and_buckets() returns
  existing buckets against real DynamoDB, incl. the no-#META trigger. The
  sibling batch_get_buckets() was already covered here; the method that broke
  was not — that gap let the regression ship.
- e2e (test_localstack.py): exhaust -> wait -> acquire succeeds on the
  no-aggregator stack (client refill-recovery slow path).
- benchmark (test_localstack.py): a recovery stress loop over multiple entities.

All three run on no-aggregator (minimal) stacks on purpose, so the aggregator
cannot refill out-of-band and mask the client path that broke. Verified RED
before the fix / GREEN after, against a live LocalStack.

Refs #428

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@sodre sodre merged commit ced59de into main Jun 26, 2026
16 checks passed
@sodre sodre deleted the fix/batch-get-bucket-sk-filter branch June 26, 2026 03:00
sodre added a commit that referenced this pull request Jun 26, 2026
…ration (backport to 0.10.x) (#431)

## Summary

Backports the slow-path bucket-read fix from #429 (merged to `main`) to
the `release/0.10.x` maintenance branch for a **v0.10.3** patch release.

**Bug:** `Repository.batch_get_entity_and_buckets()` — the `acquire()`
slow-path / refill-recovery read — classified `BatchGetItem` response
items using the stale `sk.startswith(schema.SK_BUCKET)` (`"#BUCKET#"`)
prefix. Bucket state records use `SK == "#STATE"` (`sk_state()`) since
the per-shard partition-key migration
([GHSA-76rv-2r9v-c5m6](GHSA-76rv-2r9v-c5m6)),
so the filter never matched and every bucket was silently dropped. The
slow path then treated existing buckets as new, the conditional write
failed, and `acquire()` raised `RateLimitExceeded` with
`retry_after=0.0` instead of refilling.

**Effect:** wait-then-acquire refill recovery was broken.
- With `speculative_writes=True`, it breaks the refill fallback (masked
in production only when the aggregator Lambda refills out-of-band).
- With `speculative_writes=False` or `--no-aggregator`, it is fully
broken.

This bug shipped in **v0.10.1** and is present in **v0.10.2**; this PR
delivers the fix as **v0.10.3**.

**Fix:** one line — `elif sk.startswith(schema.SK_BUCKET):` → `elif sk
== schema.sk_state():` (async source `repository.py`; regenerated into
`sync_repository.py`).

Cherry-picked cleanly from `main` commits `7e61fd3` (fix + unit tests)
and `8123fea` (LocalStack tests), with `-x` provenance recorded.

## Test plan

All verified locally on this branch against live LocalStack:

- [x] **unit (moto):** `batch_get_entity_and_buckets` returns existing
buckets incl. no-`#META` trigger; wait-then-acquire recovery on
speculative + non-speculative paths (async + generated sync) — 8 passed.
- [x] **integration/e2e/benchmark (LocalStack, no-aggregator stacks):**
direct read coverage, exhaust→wait→acquire recovery, recovery stress
loop — 4 passed.
- [x] **sync-generation:** no drift; `ruff` + `mypy` clean.

## References

- Original fix: #429
- Bug report: #428 (already closed on `main`; referenced here as context
only)

This PR targets the `0.10.x` maintenance line.

🤖 Generated with [Claude Code](https://claude.ai/code)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/limiter Core rate limiting logic

Projects

None yet

Development

Successfully merging this pull request may close these issues.

🐛 Fix acquire() refill recovery: slow-path read drops all buckets

1 participant