fix(query): auto-recover from context-overflow errors#1169
Conversation
Tag the three places that surface a 'context window exceeded' assistant message (Anthropic PTL, OpenAI-shim context_overflow category, Anthropic 500 with context keywords) with apiError: 'context_overflow' and add isContextOverflowMessage helper. Lets the query-loop recovery branch in the follow-up commit catch all three via a single predicate instead of duplicating string matchers, and keeps the content-prefix fallback so older sites that didn't get the tag are still recognised. Refs Gitlawb#1105
When a request fails because the conversation exceeds the provider context window, run a single auto-compact + retry instead of surfacing the error and stopping the turn. Covers external builds (no reactiveCompact / contextCollapse compiled in) and OpenAI-shim providers like Codex / GPT-5.5 that surface the limit through a 500 with context-overflow keywords rather than the Anthropic prompt-too-long path. Withholds the error in the streaming loop (parallel to the existing prompt-too-long withholding), runs compactConversation, replaces messagesForQuery with the post-compact summary, and continues the loop. Gated by hasAttemptedContextOverflowRecovery so a single turn cannot loop compact -> error -> compact forever, and the autocompact 3-strike circuit breaker in autoCompact.ts handles deeper recursion if the post-compact retry overflows again. Resets on each fresh tool round at the next_turn site so subsequent turns get a clean recovery attempt. Closes Gitlawb#1105
gnanam1990
left a comment
There was a problem hiding this comment.
Local checks on febcf78:
bun run build && node dist/cli.mjs --version— ✅ builds, prints0.10.0 (OpenClaude)bun test src/services/api/errors.test.ts— ✅ 6 passbun test src/services/compact src/services/api— ✅ 528 pass when isolated (the 1 fail inopenaiShim.test.tsreproduces only under multi-file ordering; in isolation: 94/0 — pre-existing test-state-leak, unrelated)tsc --noEmit— no new errors introduced (the 10 pre-existing failures on main remain unchanged)
Code review:
- The one-shot
hasAttemptedContextOverflowRecoveryguard correctly mirrorshasAttemptedReactiveCompact— gated at the message-withhold site and the recovery branch, and reset at the same fresh-tool-round sites. No infinite-loop surface. - Reusing
compactConversation(..., isAutoCompact=true)is a nice call — gets the existing 3-strike circuit breaker for free if the post-compact retry also overflows. - The
isContextOverflowMessagefallback by content-prefix is a sensible safety net for older emit sites that didn't carry theapiError: 'context_overflow'tag. Tests cover all three sources (PTL, OpenAI-shim, Anthropic-500) plus the rejection cases. - No red flags (no
tengu_*, noUSER_TYPE === 'ant', no new network calls, no new deps, no CI diff).
LGTM.
jatmn
left a comment
There was a problem hiding this comment.
Findings
- [P1] The new overflow recovery path also applies to compact/session-memory forks
src/query.ts:1300
This branch currently runs even whenquerySourceis'compact'or'session_memory'. Those flows already have specialized oversized-context handling, and the compact worker has its own prompt-too-long retry path. If a compact fork hits a non-PTLcontext_overflow, this code can re-entercompactConversation()from insidequeryLoopusing the forked compact prompt asmessagesForQuery/forkContextMessagesinstead of the original oversized conversation. That risks compacting the compact worker's own prompt rather than the real conversation payload, and it bypasses the dedicated compaction retry logic. Please guard this branch the same way the existing oversized-context logic guards compact/session-memory sources, and let those specialized callers handle their own recovery.
techbrewboss
left a comment
There was a problem hiding this comment.
Review summary
The context-overflow retry is valuable for the main user loop, but I think this needs one guard before merge: the new recovery path should not run for compact/session-memory fork queries. Those flows already have specialized oversized-context handling, and this branch can recursively invoke compaction from inside the compact worker with the fork prompt rather than the original conversation.
Findings
src/query.ts:1300- Guard context-overflow recovery away from compact/session-memory query sources.
Impact:isWithheldContextOverflowcurrently applies regardless ofquerySource, so aquerySource === 'compact'or'session_memory'fork that hits a non-PTL context overflow will enter this branch and callcompactConversation(messagesForQuery, ..., forkContextMessages: messagesForQuery). In those forked flows,messagesForQueryis the compact/session-memory worker prompt, not the original oversized conversation payload. That can compact the worker's prompt, bypass the dedicated compact retry behavior, and produce a misleading post-compact retry instead of letting the specialized caller handle the failure.
Suggested fix: mirror the existing compact/session-memory exclusion used by the pre-flight blocking-limit path and/or the specialized recovery paths, e.g. requirequerySource !== 'compact' && querySource !== 'session_memory'before settingisWithheldContextOverflowtrue.
Validation
I reviewed the PR diff and checked the surrounding query-loop logic locally. The existing blocking-limit path explicitly skips compact/session-memory sources because those are forked agents that inherit the full conversation and need their dedicated handlers; the new context-overflow recovery branch currently lacks the same guard.
Per @jatmn and @techbrewboss review on Gitlawb#1169: the new isWithheldContextOverflow branch was running regardless of querySource. Compact and session_memory forks pass the worker prompt as messagesForQuery, so recovering here would call compactConversation() with the worker prompt as forkContextMessages — bypassing the dedicated compact retry path and producing a misleading post-compact retry of the worker prompt rather than the real conversation. Mirror the existing pre-flight blocking-limit exclusion (~query.ts:691) and let the specialized fork callers handle their own oversized-context recovery.
|
Pushed No new query-loop test added — there's no Thanks @jatmn @techbrewboss for the catch. |
jatmn
left a comment
There was a problem hiding this comment.
Thanks for following up on the earlier review. The recovery branch now skips compact and session_memory sources, but I found one remaining issue in the companion withhold path.
Findings
- [P1] Don't withhold context-overflow errors for compact/session-memory forks
src/query.ts:923
The recovery branch now has thequerySource !== 'compact' && querySource !== 'session_memory'guard, but the streaming withhold condition above it still withholds anyisContextOverflowMessage(message)regardless ofquerySource. For a compact or session-memory fork that returns acontext_overflowAPI error, this hides the message from the stream, then the guarded recovery branch correctly skips it, and the generic API-error early return later exits withreason: 'completed'without yielding the original error. That means the specialized compact/session-memory caller does not get the diagnostic/retry path it needs. Please apply the same query-source exclusion to the withhold condition, or otherwise ensure the skipped recovery path surfaces the original error.
Blockers
Non-Blocking
Looks Good
Verdict: Changes Requested — withhold path needs the same query-source exclusion as the recovery branch. |
techbrewboss
left a comment
There was a problem hiding this comment.
Review summary
The auto-recovery path is useful and the new context-overflow classifier looks good, but there is still one guard mismatch that should block merge. The recovery branch now skips compact/session-memory fork queries, while the earlier streaming withhold path still hides the same errors for those fork sources.
Findings
src/query.ts:922- Don’t withhold context-overflow errors for compact/session-memory forks.
Impact: The recovery branch now correctly skipsquerySource === 'compact'and'session_memory', but the streaming withhold condition still hides anyisContextOverflowMessage(message)before that branch runs. For compact/session-memory forks, that means the originalcontext_overflowAPI error is withheld, the guarded recovery branch skips it, and the later generic API-error return exits without yielding the diagnostic to the specialized caller.
Suggested fix: Add the samequerySource !== 'compact' && querySource !== 'session_memory'guard to the withhold condition, or otherwise ensure the skipped recovery path surfaces the original error.
Validation
Reviewed the PR metadata, full diff, existing reviews, references/openclaude.md, and the surrounding query-loop logic. I also ran bun test src/services/api/errors.test.ts in a detached PR worktree using the repo’s installed dependencies: 6 pass / 0 fail. No malicious or suspicious behavior found.
|
Confirmed @jatmn's and @techbrewboss's finding against the code — they've independently landed on the same real issue. The recovery branch correctly carries the |
Summary
context_overflowcategory, Anthropic 500-with-context-keywords) via a newisContextOverflowMessagehelper and anapiError: 'context_overflow'tag on the assistant message.reactiveCompact/contextCollapse) and OpenAI-shim providers like Codex / GPT-5.5 that surface the limit through a 500 rather than the Anthropic PTL path.Closes #1105.
Impact
/compactor/newmanually.State.hasAttemptedContextOverflowRecoveryfield carried through every continue site inqueryLoop; resets at the same boundaries ashasAttemptedReactiveCompact(new tool round, continuation nudge, token-budget continuation). One-shot per turn; the existing autocompact 3-strike circuit breaker (autoCompact.ts:274) handles deeper recursion if the post-compact retry overflows again. The new branch sits AFTER the existingreactiveCompact/contextCollapsebranches, so internal builds keep their existing recovery and only fall through to this path if neither matched.Testing
bun run buildbun run smoke— fails onmaintoo (missing optional@orama/orama), not caused by this change.bun test src/services/api/errors.test.ts(new),bun test src/services/api/ src/services/compact/(529 pass / 0 fail),bun test src/__tests__(51 pass / 0 fail).bun run typecheck— no new errors in the touched files; remaining errors are pre-existing onmain.Notes
context_overflowcategory test only.