Skip to content

fix(api/history): gate approval-expiry sweep on Lambda runtime#1232

Open
cristim wants to merge 2 commits into
mainfrom
fix/cor-06-fix
Open

fix(api/history): gate approval-expiry sweep on Lambda runtime#1232
cristim wants to merge 2 commits into
mainfrom
fix/cor-06-fix

Conversation

@cristim

@cristim cristim commented Jun 11, 2026

Copy link
Copy Markdown
Member

Problem (COR-06, #1170)

GET /api/history expired stale pending/notified approvals in a best-effort background goroutine spawned just before the response returns. The "context.Background() ensures the transitions are not cancelled" guarantee does not hold on Lambda: the execution environment freezes as soon as the response is out, so the goroutine is unreliably suspended mid-sweep. Rows can stay "pending" indefinitely, and a thawed goroutine doing DB writes during a later request confuses latency and log attribution. The same package already handles this correctly for the SWR cache (riUtilizationCache carries isLambda and skips its background refresh).

Fix

internal/api/handler_history.go:

  • expireStaleExecutions (renamed from expireStaleExecutionsAsync, which is no longer always async) now gates on runtime.IsLambda(), the exact detection helper the SWR cache gate uses. On Lambda the sweep (a handful of cheap UPDATEs) runs synchronously before the handler returns; on long-running servers it stays asynchronous so the read response is never blocked on the transitions.
  • The sweep body is extracted into expireStaleExecutionsSweep, shared by both branches; it keeps context.Background() so neither request cancellation nor a Lambda request deadline aborts the best-effort transitions.
  • The GET stays a pure read in both modes (issue fix(api): getHistory GET writes (expireIfStale) + drops empty-account rows for scoped users (#621 regression) #1032 invariant): the response carries the pre-transition status; the next load reflects "expired".

Test evidence

New TestHandler_getHistory_ExpireIfStale_LambdaGuard in internal/api/handler_history_test.go covers both branches by flipping AWS_LAMBDA_RUNTIME_API with t.Setenv:

  • Lambda sub-test: asserts TransitionExecutionStatus has already fired when getHistory returns (no channel wait). Confirmed it FAILS pre-fix (git stash of the handler change, -count=5: 5/5 failures with expected: 1, actual: 0) and passes post-fix.
  • Non-Lambda sub-test: blocks the transition behind a release channel and asserts getHistory still returns while it is blocked, proving the sweep remains asynchronous off Lambda; no time.Sleep, channel-synchronized per project convention.

Verification: go build ./... clean; go test ./internal/api/ -race -count=1 1630 passed; go vet ./internal/api/ clean.

Closes #1170

@cristim cristim added triaged Item has been triaged priority/p3 Polish / idea / may never ship severity/medium Moderate harm urgency/eventually No deadline impact/few Limited audience effort/s Hours type/bug Defect labels Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Warning

Review limit reached

@cristim, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 23 minutes and 58 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f270e39f-4c4b-43fe-b4b5-99aa55bad055

📥 Commits

Reviewing files that changed from the base of the PR and between 451a70f and d1cbf16.

📒 Files selected for processing (2)
  • internal/api/handler_history.go
  • internal/api/handler_history_test.go
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/cor-06-fix

Comment @coderabbitai help to get the list of available commands.

@cristim

cristim commented Jun 11, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@cristim

cristim commented Jun 19, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 19, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cristim added 2 commits June 26, 2026 17:42
The GET /api/history handler expired stale pending/notified approvals
in a best-effort background goroutine spawned right before the response
returns. On Lambda the execution environment freezes as soon as the
response is out, so the goroutine is unreliably suspended mid-sweep:
rows can stay "pending" indefinitely and a thawed goroutine doing DB
writes during a later request confuses latency and log attribution.

Mirror the SWR cache's isLambda gate (ri_utilization_cache.go) using
the same runtime.IsLambda detection helper: on Lambda the sweep (a
handful of cheap UPDATEs) runs synchronously before the handler
returns; on long-running servers it stays asynchronous so the read
response is never blocked on the transitions.

Regression tests cover both branches by flipping AWS_LAMBDA_RUNTIME_API
via t.Setenv: the Lambda sub-test asserts the transition fired before
getHistory returned (fails pre-fix), and the non-Lambda sub-test blocks
the transition and asserts getHistory still returns (sweep remains
asynchronous).

Closes #1170
Trailing reference to "the async sweep" was left over from the
expireStaleExecutionsAsync -> expireStaleExecutions rename in the
prior commit. Now the sweep is sync on Lambda and async on servers,
so the comment names both modes explicitly.

No behavioural change.
@cristim

cristim commented Jun 26, 2026

Copy link
Copy Markdown
Member Author

Adversarial review (PR #1232)

Reviewed against the five risk surfaces from the brief (runtime detection, gate placement, concurrency safety, ctx-cancel handling, test coverage). The fix is sound; one rebase hygiene issue addressed in this push, one docstring nit cleaned up, nothing blocking.

Verified

  • Runtime detection: runtime.IsLambda() reads AWS_LAMBDA_RUNTIME_API != "". AWS sets this for every Lambda runtime (managed, container image, custom, extensions), and is not spoofable from outside the sandbox in any way that matters for this gate. Same helper the SWR cache uses (handler.go:131, app.go:278, scheduler.go:146), so the rule "where SWR-style background goroutines are unsafe" stays consistent across the codebase. No flimsy user-agent / hostname signal. ✓
  • Gate placement: inside expireStaleExecutions (the sweep function), not at the handler entry. Only caller is getHistory, but should anything else start invoking the sweep later, the gate still applies. ✓
  • Concurrency safety on Lambda: multiple concurrent invocations can target the same stale row, but TransitionExecutionStatus is a single UPDATE ... WHERE status IN ('pending','notified') (store_postgres.go:900) — the loser's rowcount is 0 and the resulting ErrExecutionNotInExpectedStatus is logged-and-skipped by the sweep's Warnf. Two simultaneous GETs cannot double-transition the same execution. The DB row is the lock; no in-process leader election needed. ✓
  • feedback_ctx_cancel_terminal.md: applies to fan-out loops that accumulate lastErr and continue past ctx.Err(). The sweep does neither — it iterates once, no lastErr, errors per-row go to Warnf. Not violated. ✓
  • feedback_ctx_aware_patterns.md: ctx-aware sleep in retry loops. No retries, no sleep. Not applicable. ✓
  • context.Background() deliberate choice: documented (handler_history.go:227-231) — on servers the goroutine outlives the request; on Lambda the request ctx carries the Lambda execution deadline and the best-effort transitions should not abort because the caller's deadline is close to expiring. Worst case on Lambda: timeout fires mid-sweep, remaining stale items get picked up by the next GET (the sweep is itself idempotent). ✓
  • Lambda sub-test: t.Setenv("AWS_LAMBDA_RUNTIME_API", "127.0.0.1:9001") flips the gate on. Asserts TransitionExecutionStatus already fired by the time getHistory returns (no channel wait, no sleep). The PR description's pre-fix run (git stash + -count=5) showed 5/5 failures pre-fix, passes post-fix — meets the §6 "test must fail on the broken commit" bar.
  • Non-Lambda sub-test: t.Setenv("AWS_LAMBDA_RUNTIME_API", "") flips it off. Uses release/swept channels with a 5s watchdog; getHistory must return while the mocked transition is blocked, proving the sweep stays async off-Lambda. No time.Sleep, no require.* in worker goroutines (feedback_no_sleep_in_tests.md, feedback_require_in_goroutines.md). The Lambda subtest registers t.Cleanup(func(){ mockStore.AssertExpectations(t) }); the non-Lambda subtest calls it inline (acceptable since it runs after the <-swept synchronization point).
  • getHistory invariant unchanged: response carries the pre-transition status in BOTH modes — assert on "pending" in both subtests (lines 1898, 1958). The GET stays a pure read (issue fix(api): getHistory GET writes (expireIfStale) + drops empty-account rows for scoped users (#621 regression) #1032), the row flips to "expired" on the next load. ✓
  • Approval-expiry semantics: idempotent. The sweep transitions pending|notified -> expired via the FROM-status-guarded UPDATE; re-running on the same execution affects 0 rows and logs a warn. No deletion, no double-revoke, no extra email. Safe to re-run on Lambda concurrent invocations and across timeout-induced partial sweeps. ✓
  • Local dev / test impact: gated-off, expected. The non-Lambda subtest pins this (AWS_LAMBDA_RUNTIME_API cleared via t.Setenv to be deterministic regardless of outer env), and there are no real DB writes from a goroutine in unit tests because the mock guards TransitionExecutionStatus.
  • No magic strings or hardcoded values introduced (feedback_no_hardcoded_magic_values.md, feedback_sdk_enum_string_literals.md): the FROM-statuses {"pending","notified"} and the target "expired" are pre-existing strings re-used from the old async path; the only new symbol added is runtime.IsLambda(), which is the agreed gate primitive.

Fixes pushed

  • Rebase onto current main to clear migration-collision noise (matches feedback_project_migration_number_collisions.md). The PR was based on an older main where audit_actor_stamps was at 000074_*; main has since been renumbered to 000077_* (PR fix(db): renumber audit_actor_stamps migration to clear 000074 collision #1261, commit 5894580f3). The PR diff vs main was carrying a spurious 000077_audit_actor_stamps => 000074_audit_actor_stamps rename that left two 000074_* files in the branch's tree (audit_actor_stamps + repair_partial_migration_058_067). GitHub's 3-way merge ref happened to resolve this cleanly — pre-commit Check for conflicting migration numbers Passed and migrate up in CI ran 74 (repair) and 77 (audit) just fine — but anyone checking out the branch locally hit the collision, and the diff was noisy for review. Rebased on origin/main (451a70f73); branch tree now contains only the audit_actor_stamps file at 077 (unchanged from main) and the in-scope handler files. Force-push with --force-with-lease.
  • Docstring follow-through after the sweep rename (d1cbf16d8): isStaleExecution still referred to "the async sweep" after the rename expireStaleExecutionsAsync -> expireStaleExecutions. Updated to "the expire sweep (sync on Lambda, async on servers)" so the comment reflects the new bi-modal behaviour.

UNSTABLE state

All five most recent CI runs on main itself are failure (gh run list --workflow="CI - Build & Test" --branch=main --limit=5):

Not blocking.

Verdict

LGTM. CR re-ping posted (@coderabbitai review).

@cristim

cristim commented Jun 26, 2026

Copy link
Copy Markdown
Member Author

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

@cristim Thanks for the thorough adversarial pass — this is very helpful. The validation around the Lambda gate, placement inside expireStaleExecutions, idempotent transition behavior, and the preserved getHistory pure-read invariant all make sense. I’ll re-review the latest push now.

🐇✅


✏️ Learnings added
Learnt from: cristim
URL: https://github.com/LeanerCloud/CUDly/pull/1232

Timestamp: 2026-06-26T15:48:00.087Z
Learning: In this repository, background or SWR-style best-effort work that is unsafe after a response on AWS Lambda should be gated with `runtime.IsLambda()` (which checks `AWS_LAMBDA_RUNTIME_API != ""`), matching the existing pattern used by the SWR cache helpers in `handler.go`, `app.go`, and `scheduler.go`.

Learnt from: cristim
URL: https://github.com/LeanerCloud/CUDly/pull/1232

Timestamp: 2026-06-26T15:48:00.087Z
Learning: For `internal/api/handler_history.go`, `getHistory` must remain a pure read per issue `#1032`: approval-expiry transitions may be triggered as best-effort follow-up work, but the response must reflect the pre-transition status, and `"expired"` should only be visible on a subsequent load.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

cristim added a commit that referenced this pull request Jun 26, 2026
Rebased onto current main, which had already promoted audit_actor_stamps
to 000077 (5894580, the renumber landed in PR #1232 review). The COR-02
nested-rollup migration was added at 000077 in this PR, which now collides
with main's 000077_audit_actor_stamps and trips the pre-commit
"Check for conflicting migration numbers" hook (CI's pre-commit run was
failing on this branch with `Duplicate migration number(s) found: 000077`).

Rename to 000078:
- internal/database/postgres/migrations/000078_monthly_summary_nested_rollup.up.sql
- internal/database/postgres/migrations/000078_monthly_summary_nested_rollup.down.sql
- internal/database/postgres/migrations/000078_monthly_summary_nested_rollup_test.go

Body unchanged except for the leading `-- 000077:` / `-- 000077 down:`
header comments updated to 000078; the body-comment references to
`000067/000074` (the historical flat-AVG definitions) stay as-is because
they cite past migrations, not the file being renamed.

Verified:
- ls internal/database/postgres/migrations/*.up.sql | cut -c1-6 | sort | uniq -d
  prints nothing.
- go build ./... succeeds.
- go test ./internal/analytics/... passes 186/186.

Refs #1151 (COR-02).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

effort/s Hours impact/few Limited audience priority/p3 Polish / idea / may never ship severity/medium Moderate harm triaged Item has been triaged type/bug Defect urgency/eventually No deadline

Projects

None yet

Development

Successfully merging this pull request may close these issues.

COR-06: Best-effort approval-expiry goroutine has no Lambda guard, unlike the SWR cache pattern

1 participant