
Fix hallucinated FINAL() answers with Claude 4.6 models #127

Open

huiwenn wants to merge 1 commit into alexzhang13:main from huiwenn:fix/skip-premature-final-with-code-blocks

Conversation


@huiwenn huiwenn commented Feb 27, 2026

Problem

Claude 4.6 models (Sonnet and Opus) behave differently from earlier models when generating RLM responses. Instead of generating a code block and stopping to wait for execution feedback, these models continue generating past the code block — they hallucinate what they think the execution output will be, reason over that hallucinated output, and then commit to a FINAL() answer, all within a single turn.

For example, when asked "Which are the top 10 most spending customers?" with real customer data in context, Sonnet 4.6 would:

  1. Generate `print(context)` inside a ```repl``` block (correct)
  2. Continue generating completely fabricated output (e.g. "Alice: $1200, Bob: $450, Kevin: $5200" — none of which exist in the actual data)
  3. Reason over that fabricated output and provide FINAL(1. Kevin: $5200, 2. Hannah: $4500, ...) — a confidently wrong answer based on hallucinated data

The code blocks are executed correctly and produce the real results, but because find_final_answer() finds the FINAL() in the same response, the completion loop exits immediately. The real execution results are never fed back to the model, and the hallucinated answer is returned as-is.

This is a fundamental mismatch: the RLM loop assumes models will stop after generating a code block and wait for feedback, but Claude 4.6 models do not — they eagerly generate the full response, including predicted outputs and final answers, in one shot.

Fix

Skip text-based FINAL() detection when the response contains code blocks. If code blocks are present, the model may have hallucinated the output, so we discard any FINAL() found in the response text and instead feed back the real execution results via format_iteration(). The model then sees the actual data and provides the correct answer in the next turn.

This does not affect:

  • FINAL_VAR() from code execution — still works, since it retrieves real variables from the REPL
  • FINAL() in text-only responses (no code blocks) — still works, since the model is answering based on results it received in prior turns
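The check can be sketched as follows (a minimal illustration of the rule described above, not the exact patch; the regexes are assumptions about how code blocks and FINAL() are matched):

```python
import re

CODE_BLOCK_RE = re.compile(r"```repl\b.*?```", re.DOTALL)
FINAL_RE = re.compile(r"FINAL\((.*?)\)", re.DOTALL)

def find_text_final(response: str):
    # If the response contains a code block, any FINAL() in the text may
    # be based on hallucinated output: discard it so the loop feeds the
    # real execution results back to the model instead.
    if CODE_BLOCK_RE.search(response):
        return None
    match = FINAL_RE.search(response)
    return match.group(1) if match else None
```

With a code block present the text-level FINAL() is ignored; in a text-only response (the model answering from results it already saw) it is honored as before.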

Verification

  • All 234 existing unit tests pass
  • Live-tested with Sonnet 4.6: model hallucinated fake customer data in iteration 1, self-corrected after seeing real execution results in iteration 2, and returned the correct answer (Eve Davis at $31,445.20 as top spender) in iteration 3

When models like Sonnet 4.6 generate code blocks and FINAL() in the
same response, the FINAL() is based on hallucinated execution output
rather than actual results. Skip text-based FINAL() detection when
code blocks are present, forcing real execution results to be fed
back so the model can answer correctly in the next turn.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rolson24 commented Mar 2, 2026

Can't you just use a stop token at the end of the repl block? Set the stop token to be something like "```\n"
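A client-side illustration of the stop-sequence idea (the real mechanism would be the API's stop-sequence parameter cutting generation at the closing fence; this toy function just shows the effect on the earlier hallucinated response):

```python
def apply_stop_sequence(text: str, stop: str) -> str:
    # Client-side illustration of what an API-side stop sequence does:
    # the generation is cut at the first occurrence of the stop string,
    # so nothing after the closing code fence survives.
    idx = text.find(stop)
    return text[:idx] if idx != -1 else text

response = (
    "```repl\nprint(context)\n```\n"
    "Alice: $1200, Bob: $450\n"
    "FINAL(Kevin: $5200)"
)
# Stopping at the closing fence drops the hallucinated output and FINAL().
truncated = apply_stop_sequence(response, "```\n")
```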

@alexzhang13
Owner

This is a weird problem that's quite model dependent, let me think on it more. Ideally we'd like to have less hacky solutions to this...
