Skip to content

fix(cloud): fail exhausted payout retries (#10628)#10633

Merged
lalalune merged 5 commits into
developfrom
fix/10628-payout-retry-exhausted
Jul 1, 2026
Merged

fix(cloud): fail exhausted payout retries (#10628)#10633
lalalune merged 5 commits into
developfrom
fix/10628-payout-retry-exhausted

Conversation

@lalalune

@lalalune lalalune commented Jul 1, 2026

Copy link
Copy Markdown
Member

Summary

Evidence

Tracked evidence: .github/issue-evidence/10628-payout-retry-exhausted/README.md

Validation run after sync with origin/develop:

ELIZA_SKIP_ARTIFACT_SYNC=1 bun install --frozen-lockfile
git diff --check
bun run --cwd packages/cloud/shared test src/lib/services/__tests__/payout-stale-lock-recovery.test.ts
bun run --cwd packages/cloud/shared lint
bun run --cwd packages/cloud/shared typecheck
bun run verify

Results:

  • bun install --frozen-lockfile: passed, no dependency changes.
  • git diff --check: passed.
  • Focused payout test: passed with 9 tests, 0 failures, 45 expectations.
  • packages/cloud/shared lint: passed.
  • packages/cloud/shared typecheck: passed.
  • Root bun run verify: passed, including 474/474 Turbo lint/typecheck tasks and final dist-path consumer checks.

N/A

  • Android capture: backend-only packages/cloud/shared payout processor change; no app, native, Android, or UI surface changed.
  • Screenshot/screen recording: backend-only service path.
  • Real LLM trajectory: payout processor path is not model-backed.
  • Solana stale-lock recovery: covered in this PR by escalating stale network='solana' processing rows with no broadcast_tx_hash to failed + requires_review instead of re-approving them. This intentionally takes the safe manual-review posture from Payout: retry-exhausted redemptions orphan in 'approved' (stuck, no alert) + Solana lacks nonce fencing on recovery #10628 rather than adding a second Solana fence mechanism.

Fixes #10628

Additional validation after f8fb3e0a5be

  • bun test /Users/shawwalters/eliza-workspace/milady/eliza-10200-runner-plan/packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts from /tmp: passed with 11 tests, 0 failures, 63 expectations.
  • bunx biome check packages/cloud/shared/src/lib/services/payout-processor.ts packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts: passed.
  • git diff --check: passed.
  • bun run --cwd packages/cloud/shared lint: passed.
  • bun run --cwd packages/cloud/shared typecheck: passed.
  • bun run verify: passed end-to-end, including 474/474 Turbo tasks and 28 dist-path consumer configs.

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your trial has ended. Reactivate Greptile to resume code reviews.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 16fcaa5d-1741-4c86-8570-b2a4e611f85e

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/10628-payout-retry-exhausted

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@lalalune

lalalune commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Reviewed — the fix is correct. Traced the boundary logic in both paths:

  • markFailed(retryable=true): re-approves only while retry_count < MAX_RETRY_ATTEMPTS - 1 (states 0,1). So a redemption gets exactly 3 selectable attempts (retry_count 0→1→2), and the failure that occurs at retry_count=2 falls through to the failed + requires_review + broadcast_tx_hash=null + alert branch (retry_count clamped to 3 via LEAST(...+1, MAX)). The old orphan state (approved / retry_count=3, unselectable, never alerted) can no longer be reached.
  • Stale-lock recovery branch: symmetric — < MAX_RETRY_ATTEMPTS - 1 re-approves, >= MAX_RETRY_ATTEMPTS - 1 fails, so a recovery strike that would push retry_count to the ceiling fails+flags rather than re-approving into the same unselectable state.

Both regression tests ((g) markFailed exhaustion, (f2) stale-lock exhaustion) assert the full terminal row shape (failed/retry_count=3/requires_review=true/processing_started_at=null/broadcast_tx_hash=null), zero raw sends, and exactly one alert — the right invariants. broadcast_tx_hash stays cleared so the never-broadcast safety property #10600 established is preserved.

Money path, so leaving the merge to a maintainer per protocol. No changes requested.

@lalalune

lalalune commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Reviewed #10633 at 7e61eb1a359d8f86e49944ad45f5191546a25431. The fix addresses both retry-exhaustion paths I expected: stale processing rows with no broadcast hash now fail when the recovery strike reaches the retry ceiling, and ordinary retryable pre-broadcast failures now fail with requires_review = true on the final strike instead of returning to an unselectable approved state. Broadcast-hash rows remain protected from auto retry.

Validation notes:

  • git diff --check origin/develop...HEAD passed.
  • bunx @biomejs/biome check packages/cloud/shared/src/lib/services/payout-processor.ts packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts passed.
  • First bun test packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts was not valid evidence because the suite logged PGlite setup skipped due missing local package exports.
  • After building prerequisites with bun run --cwd packages/core prebuild, bun run --cwd packages/core build:node, bun run --cwd packages/security build, and bun run --cwd plugins/plugin-sql build, the same payout test ran the real PGlite cases and passed all 9 tests, including the new Payout: retry-exhausted redemptions orphan in 'approved' (stuck, no alert) + Solana lacks nonce fencing on recovery #10628 final retry cases.
  • bun run --cwd packages/cloud/shared typecheck passed after those local package exports were built.

I did not merge because GitHub checks are still queued/unstable on the PR.

@lalalune

lalalune commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Reviewed the payout-state logic against processBatch's selection predicate (status = approved AND retry_count < MAX_RETRY_ATTEMPTS, MAX=3). The fix is correct:

  • Reset-to-approved (both stale-lock recovery and handleFailure retryable) increments retry_count and is gated to retry_count < MAX - 1, so an approved row's retry_count can only ever reach MAX - 1 (=2) — it can never land on >= MAX and be silently unselectable. The orphaned-approved-at-the-ceiling window from Payout: retry-exhausted redemptions orphan in 'approved' (stuck, no alert) + Solana lacks nonce fencing on recovery #10628 is closed.
  • Trace at MAX=3: attempts at retry_count 0→1→2 (approved/selectable), and the failure at retry_count==2 matches 0 reset rows → fails with retry_count = LEAST(3, MAX), requires_review = true. Exactly 3 attempts then a reviewable failed. The stale-recovery >= MAX-1 exhausted branch is disjoint from the < MAX-1 reapprove branch and runs after it (reapproved rows leave processing), so no double-count.

Two minor, non-blocking notes:

  1. Unhandled-rejection asymmetry on the alert. The stale-recovery exhausted path fires void payoutAlertsService.sendAlert(...) (fire-and-forget), while handleFailure awaits it. If the alert transport throws, the void path becomes an unhandled rejection. Consider await (it's already inside a summary block) or a .catch() for parity.

  2. Fallback failed update lacks a status guard. In handleFailure, when the retryable reset matches 0 rows the fallback .set({status:"failed"...}).where(eq(id)) has no status = 'processing' guard. Safe under the current per-worker processing-lock model (and no worse than the reset it replaces), but adding eq(status,'processing') would be defense-in-depth against a concurrent state transition writing failed over a row another path already advanced.

Correctness looks solid to merge; the two notes are hardening nits.

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your trial has ended. Reactivate Greptile to resume code reviews.

@NubsCarson

Copy link
Copy Markdown
Member

I picked up the two hardening notes from the review thread and pushed d66e924563 to this PR branch.

What changed:

  • Recovery alerts for exhausted/stuck payout rows are now awaited instead of fire-and-forget, so alert transport failures cannot become unhandled promise rejections.
  • Retryable markFailed updates now require the row to still be status = 'processing' before resetting to approved or falling through to terminal failed, preventing a concurrent state advance from being overwritten.
  • Added real-PGlite regression (h) final retryable failure does not overwrite a row no longer processing.
  • Also merged current develop (ad3d5f0943) into the branch before the patch, so the PR is current after [codex] test(app): align cloud startup smoke with first-run chooser #10627/fix(plugin-sql): scope PGlite managers by agent #10639.

Validation on current head d66e924563:

  • git diff --check origin/develop...HEAD && git diff --check
  • bunx @biomejs/biome check packages/cloud/shared/src/lib/services/payout-processor.ts packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts
  • bun run --cwd packages/core prebuild && bun run --cwd packages/core build:node && bun run --cwd packages/security build && bun run --cwd plugins/plugin-sql build
  • bun test packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts (10 tests passed; real PGlite path, including the new guard regression)
  • bun run --cwd packages/cloud/shared lint
  • bun run --cwd packages/cloud/shared typecheck

GitHub runners restarted and are queued/UNSTABLE; no local failure from the pushed hardening patch.

@lalalune

lalalune commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Re-reviewed the updated head d66e9245639a822a06c2f4584c3c56a390e5f3be after the two hardening commits.

The new delta is correct: exhausted/stuck alert sends are awaited, and retryable final-failure updates now require the row to still be status = 'processing' before moving it back to approved or terminal failed, so a concurrent completion is not overwritten.

Validation on the updated branch:

  • Fast-forwarded local worktree to d66e924563.
  • bun run --cwd packages/core build passed.
  • bunx @biomejs/biome check packages/cloud/shared/src/lib/services/payout-processor.ts packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts passed.
  • bun test packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts passed: 10 tests, including the new (h) final retryable failure does not overwrite a row no longer processing case.

@lalalune

lalalune commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Small issue-hygiene update: I changed the PR body footer from Fixes #10628 to Refs #10628.

Reason: this PR fixes the confirmed retry-exhausted approved orphan path, but #10628 also tracks the separate Solana nonce/fencing design question. The issue should remain open after this PR merges unless that second item is split out or resolved.

I also checked the currently surfaced check-pr-title failure: it is a stale CANCELLED run from the earlier head, not a new title-policy failure from this body edit.

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your trial has ended. Reactivate Greptile to resume code reviews.

@lalalune

lalalune commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Follow-up pushed: f8fb3e0a5be now covers the second #10628 concern as well as retry exhaustion.

What changed:

  • Stale network = 'solana' payout rows with status = 'processing', an expired processing lock, and no broadcast_tx_hash now transition to failed + requires_review instead of being re-approved.
  • EVM stale rows keep the existing safe retry/retry-ceiling behavior.
  • Added a PGlite processBatch() regression case proving the Solana row is escalated, not broadcast/re-approved, and an alert is sent.

Validation after the push:

  • bun test /Users/shawwalters/eliza-workspace/milady/eliza-10200-runner-plan/packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts from /tmp: passed, 11 tests / 63 expectations.
  • bunx biome check packages/cloud/shared/src/lib/services/payout-processor.ts packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts: passed.
  • git diff --check: passed.
  • bun run --cwd packages/cloud/shared lint: passed.
  • bun run --cwd packages/cloud/shared typecheck: passed.
  • bun run verify: passed end to end, including 474/474 Turbo tasks and 28 dist-path consumer configs.

Evidence N/A: backend-only payout processor change; no UI/native/LLM surface, so screenshot, screen recording, Android/iOS capture, and live model trajectory are not applicable.

@lalalune

lalalune commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Review (8-agent PR sweep): CODE CORRECT + fund-safe — MERGE-AFTER-OPS-SIGN-OFF (money path).

Traced the state machine against processBatch's selector (status='approved' AND retry_count < MAX_RETRY_ATTEMPTS). The fix is sound: the retryable-failure path is gated to retry_count < MAX-1 and increments on reset, so an approved row can never reach retry_count >= MAX (the orphan window from #10628 is closed); at the ceiling it transitions (approved,2) → (failed,3, requires_review=true) + alert. Concurrency-safe: both failing UPDATEs guard status='processing', so a row a concurrent worker moved to completed is not overwritten (verified by the "does not overwrite a row no longer processing" test). Alerts fire only after the DB transition succeeds. Solana stale locks escalate to failed (fail-closed; no nonce fence like EVM) — conservative and correct. No double-pay / lost-funds path introduced.

No code fixes needed. Being a money path, leaving the actual merge for ops sign-off (Solana-escalation + alert-frequency + the requires_review workflow).

@lalalune

lalalune commented Jul 1, 2026

Copy link
Copy Markdown
Member Author

Re-reviewed current head f8fb3e0a5be3 after the Solana stale-lock escalation was added.

The new behavior is the right direction for the remaining #10628 money-path concern: stale network = 'solana' processing rows with no broadcast_tx_hash are not re-approved, because there is no EVM-style nonce fence if a slow worker later resumes and signs with a fresh blockhash. EVM stale rows keep the bounded retry behavior and now fail for review when the recovery strike reaches the retry ceiling.

Validation in /tmp/eliza-pr-10633-new:

  • bun install --frozen-lockfile --ignore-scripts
  • git diff --check origin/develop...HEAD
  • bunx @biomejs/biome check packages/cloud/shared/src/lib/services/payout-processor.ts packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts .github/issue-evidence/10628-payout-retry-exhausted/README.md
  • First bun test packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts exposed a setup-only pass: PGlite cases self-skipped because @elizaos/core was not built in the fresh worktree.
  • Ran bun run build:core -> 64/64 tasks passed.
  • Reran bun test packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts -> real PGlite path executed, 11/11 tests passed, 63 assertions.

The currently surfaced check-pr-title failure is still a stale CANCELLED run, not a code/test failure on this head.

@lalalune lalalune changed the title fix(cloud): fail exhausted payout retries fix(cloud): fail exhausted payout retries (#10628) Jul 1, 2026

@greptile-apps greptile-apps Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your trial has ended. Reactivate Greptile to resume code reviews.

@NubsCarson

Copy link
Copy Markdown
Member

Refreshed this branch onto current origin/develop (0a3fd649de) and pushed merge commit 0e35bb9dd2 so the payout retry fix is no longer 30 commits behind.

Validation after refresh:

  • git diff --name-status origin/develop...HEAD scoped to payout-processor.ts, payout-stale-lock-recovery.test.ts, and the Payout: retry-exhausted redemptions orphan in 'approved' (stuck, no alert) + Solana lacks nonce fencing on recovery #10628 evidence README.
  • git diff --check origin/develop...HEAD passed.
  • bunx @biomejs/biome check packages/cloud/shared/src/lib/services/payout-processor.ts packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts .github/issue-evidence/10628-payout-retry-exhausted/README.md passed.
  • bun run --cwd packages/cloud/shared typecheck passed.
  • bun test --coverage-reporter=lcov --coverage-dir=.tmp/coverage-10628-refresh packages/cloud/shared/src/lib/services/__tests__/payout-stale-lock-recovery.test.ts passed: 11 tests, 63 expects.

This still looks like a money-path change that should wait for ops/owner sign-off before merge, but the branch itself is refreshed and locally validated.

@lalalune lalalune merged commit 1fb0e40 into develop Jul 1, 2026
14 of 44 checks passed
@lalalune lalalune deleted the fix/10628-payout-retry-exhausted branch July 1, 2026 04:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Payout: retry-exhausted redemptions orphan in 'approved' (stuck, no alert) + Solana lacks nonce fencing on recovery

2 participants