🐛 fix(repository): restore slow-path bucket read dropped by shard migration (backport to 0.10.x) #431

Merged
sodre merged 2 commits into release/0.10.x from backport/0.10.3-slow-path-fix on Jun 26, 2026
@sodre sodre commented Jun 26, 2026

Summary

Backports the slow-path bucket-read fix from #429 (merged to main) to the release/0.10.x maintenance branch for a v0.10.3 patch release.

Bug: Repository.batch_get_entity_and_buckets() — the acquire() slow-path / refill-recovery read — classified BatchGetItem response items using the stale sk.startswith(schema.SK_BUCKET) ("#BUCKET#") prefix. Bucket state records use SK == "#STATE" (sk_state()) since the per-shard partition-key migration (GHSA-76rv-2r9v-c5m6), so the filter never matched and every bucket was silently dropped. The slow path then treated existing buckets as new, the conditional write failed, and acquire() raised RateLimitExceeded with retry_after=0.0 instead of refilling.

Effect: wait-then-acquire refill recovery was broken.

  • With speculative_writes=True, it breaks the refill fallback (masked in production only when the aggregator Lambda refills out-of-band).
  • With speculative_writes=False or --no-aggregator, it is fully broken.
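The failure chain described above can be illustrated with a toy in-memory model. Everything in it is an assumption made for illustration: `slow_path_acquire`, the dict-backed `table`, the parameter names, and the refill arithmetic are hypothetical stand-ins, not the library's implementation. Only the observable outcome (`RateLimitExceeded` with `retry_after=0.0` when an existing bucket is treated as new) is taken from the description.

```python
class RateLimitExceeded(Exception):
    """Toy stand-in for the library exception; only retry_after is modeled."""

    def __init__(self, retry_after: float) -> None:
        super().__init__(f"rate limit exceeded, retry_after={retry_after}")
        self.retry_after = retry_after


def slow_path_acquire(table, key, buckets_read, now_ms,
                      capacity=10, refill_per_s=10.0, consume=1):
    """Hypothetical slow path: refill a known bucket, or create a missing one.

    `table` plays the role of the DynamoDB table. `buckets_read` is what the
    batch read reported back (empty when the response filter drops buckets).
    """
    bucket = buckets_read.get(key)
    if bucket is None:
        # The bucket looks new, so write it with a "must not exist" condition.
        if key in table:
            # Conditional write fails: the record existed all along.
            raise RateLimitExceeded(retry_after=0.0)
        table[key] = {"tokens": capacity - consume, "ts_ms": now_ms}
        return table[key]["tokens"]

    # The bucket was read correctly: top it up for elapsed time, then consume.
    elapsed_s = (now_ms - bucket["ts_ms"]) / 1000.0
    tokens = min(capacity, bucket["tokens"] + elapsed_s * refill_per_s)
    if tokens < consume:
        raise RateLimitExceeded(retry_after=(consume - tokens) / refill_per_s)
    table[key] = {"tokens": tokens - consume, "ts_ms": now_ms}
    return table[key]["tokens"]


# An exhausted bucket is already stored at t=0.
table = {"user-1": {"tokens": 0, "ts_ms": 0}}

# Filter dropped the bucket: the read reports nothing, so the create fails.
try:
    slow_path_acquire(table, "user-1", buckets_read={}, now_ms=1_000)
    dropped_retry_after = None
except RateLimitExceeded as exc:
    dropped_retry_after = exc.retry_after

# Filter kept the bucket: one second of refill lets the acquire succeed.
remaining = slow_path_acquire(
    table, "user-1", buckets_read=dict(table), now_ms=1_000
)
```

In this model the caller waited a full second, yet the dropped-bucket branch still raises with a zero `retry_after`, which is the symptom reported for wait-then-acquire.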

This bug shipped in v0.10.1 and is present in v0.10.2; this PR delivers the fix as v0.10.3.

Fix: one line: elif sk.startswith(schema.SK_BUCKET): → elif sk == schema.sk_state(): (in the async source repository.py; regenerated into sync_repository.py).
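A minimal sketch of why the predicate change matters, assuming simplified response items (plain dicts rather than DynamoDB's typed attribute values). The `classify` helper, the `PK` values, and the assumption that entity metadata lives under `SK == "#META"` are illustrative only; the `"#BUCKET#"` prefix and the `"#STATE"` sort key come from the description above.

```python
SK_BUCKET = "#BUCKET#"  # legacy bucket prefix named in this PR
SK_META = "#META"       # assumed sort key of the entity metadata record


def sk_state() -> str:
    """Sort key of bucket state records after the per-shard migration."""
    return "#STATE"


def classify(items, is_bucket):
    """Split response items into (entity, buckets) using the given predicate."""
    entity, buckets = None, []
    for item in items:
        sk = item["SK"]
        if sk == SK_META:
            entity = item
        elif is_bucket(sk):
            buckets.append(item)
        # Anything else falls through and is silently ignored.
    return entity, buckets


def stale_is_bucket(sk: str) -> bool:
    # Pre-fix predicate: matches only the legacy "#BUCKET#..." keys.
    return sk.startswith(SK_BUCKET)


def fixed_is_bucket(sk: str) -> bool:
    # Post-fix predicate: exact match on the current state sort key.
    return sk == sk_state()


# Hypothetical response: one metadata record and one bucket state record.
items = [
    {"PK": "shard-0", "SK": "#META"},
    {"PK": "shard-0", "SK": "#STATE", "tokens": 3},
]

_, stale_buckets = classify(items, stale_is_bucket)
_, fixed_buckets = classify(items, fixed_is_bucket)
```

With the stale predicate the `"#STATE"` record matches neither branch and disappears without an error, so `stale_buckets` is empty while `fixed_buckets` holds the one state record.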

Cherry-picked cleanly from main commits 7e61fd3 (fix + unit tests) and 8123fea (LocalStack tests), with -x provenance recorded.

Test plan

All verified locally on this branch against live LocalStack:

  • unit (moto): batch_get_entity_and_buckets returns existing buckets incl. no-#META trigger; wait-then-acquire recovery on speculative + non-speculative paths (async + generated sync) — 8 passed.
  • integration/e2e/benchmark (LocalStack, no-aggregator stacks): direct read coverage, exhaust→wait→acquire recovery, recovery stress loop — 4 passed.
  • sync-generation: no drift; ruff + mypy clean.
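The exhaust→wait→acquire shape of the recovery tests can be sketched without any AWS dependency. This is not the project's test code: `FakeClock`, `TokenBucket`, and `try_acquire` are hypothetical names, and the injectable `now_ms` callable is an assumption chosen so the "wait" step needs no real sleep.

```python
class FakeClock:
    """Manually advanced millisecond clock for deterministic tests."""

    def __init__(self) -> None:
        self.ms = 0

    def now_ms(self) -> int:
        return self.ms

    def advance(self, ms: int) -> None:
        self.ms += ms


class TokenBucket:
    """Toy token bucket that reads time through an injected callable."""

    def __init__(self, capacity: float, refill_per_s: float, now_ms) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._now_ms = now_ms
        self._tokens = capacity
        self._last_ms = now_ms()

    def _refill(self) -> None:
        now = self._now_ms()
        elapsed_s = (now - self._last_ms) / 1000.0
        self._tokens = min(
            self.capacity, self._tokens + elapsed_s * self.refill_per_s
        )
        self._last_ms = now

    def try_acquire(self, n: float = 1) -> float:
        """Return 0.0 on success, otherwise the seconds to wait before retrying."""
        self._refill()
        if self._tokens >= n:
            self._tokens -= n
            return 0.0
        return (n - self._tokens) / self.refill_per_s


clock = FakeClock()
bucket = TokenBucket(capacity=2, refill_per_s=4.0, now_ms=clock.now_ms)

# Exhaust: both tokens are consumed at t=0.
bucket.try_acquire()
bucket.try_acquire()

# Wait: the third attempt is rejected and reports how long to wait.
wait_s = bucket.try_acquire()
clock.advance(int(wait_s * 1000))

# Acquire: after the reported wait, the bucket has refilled enough.
recovered = bucket.try_acquire()
```

A regression of the kind fixed here would show up in the last step: the acquire after the wait would fail instead of returning success.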

References

This PR targets the 0.10.x maintenance line.

🤖 Generated with Claude Code

sodre and others added 2 commits June 25, 2026 23:14
…ration

batch_get_entity_and_buckets() classified BatchGetItem response items with
`sk.startswith(SK_BUCKET)` ("#BUCKET#"), but bucket state records use
SK="#STATE" since the per-shard migration (GHSA-76rv). The filter never
matched, so every bucket was silently dropped: the acquire() slow path
treated existing buckets as new, the conditional write failed, and acquire()
raised RateLimitExceeded with retry_after=0.0 instead of refilling.

This broke wait-then-acquire refill recovery: with speculative_writes=True
the refill fallback fails (masked in production by the aggregator Lambda);
with speculative_writes=False or --no-aggregator it is fully broken.

Latent since v0.7.0 (663b687, else->elif startswith(SK_BUCKET)); activated in
v0.10.1 (94a129a) when bucket SK changed to "#STATE" without updating this
response-classification filter.

Add regression tests the suite lacked: direct batch_get_entity_and_buckets
coverage (incl. the no-#META trigger) and a behavioral wait-then-acquire test
on both speculative and non-speculative paths. Also note a TODO to make
_now_ms() the single injectable clock seam.

Fixes #428

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 7e61fd3)
…very

Exercises the real-DynamoDB code path the unit tests (moto) and the production
aggregator both masked for the bug fixed in 7e61fd3:

- integration (test_repository.py): batch_get_entity_and_buckets() returns
  existing buckets against real DynamoDB, incl. the no-#META trigger. The
  sibling batch_get_buckets() was already covered here; the method that broke
  was not — that gap let the regression ship.
- e2e (test_localstack.py): exhaust -> wait -> acquire succeeds on the
  no-aggregator stack (client refill-recovery slow path).
- benchmark (test_localstack.py): a recovery stress loop over multiple entities.

All three run on no-aggregator (minimal) stacks on purpose, so the aggregator
cannot refill out-of-band and mask the client path that broke. Verified RED
before the fix / GREEN after, against a live LocalStack.

Refs #428

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
(cherry picked from commit 8123fea)
@sodre sodre added the area/limiter Core rate limiting logic label Jun 26, 2026
@sodre sodre marked this pull request as ready for review June 26, 2026 03:20
codecov Bot commented Jun 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 92.26%. Comparing base (4eb1a2b) to head (d2c2325).
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@                Coverage Diff                 @@
##           release/0.10.x     #431      +/-   ##
==================================================
+ Coverage           92.22%   92.26%   +0.03%     
==================================================
  Files                  36       36              
  Lines                7830     7830              
==================================================
+ Hits                 7221     7224       +3     
+ Misses                609      606       -3     
Flag          Coverage Δ
doctest       29.04% <0.00%> (ø)
e2e           42.92% <100.00%> (+0.57%) ⬆️
integration   51.18% <100.00%> (+0.60%) ⬆️
unit          92.05% <100.00%> (+0.03%) ⬆️

@sodre sodre merged commit 7d18595 into release/0.10.x Jun 26, 2026
16 of 20 checks passed
@sodre sodre deleted the backport/0.10.3-slow-path-fix branch June 26, 2026 03:36