
Fix hallucinated FINAL() answers with Claude 4.6 models #127

Open

huiwenn wants to merge 1 commit into alexzhang13:main from huiwenn:fix/skip-premature-final-with-code-blocks

Conversation


@huiwenn huiwenn commented Feb 27, 2026

Problem

Claude 4.6 models (Sonnet and Opus) behave differently from earlier models when generating RLM responses. Instead of generating a code block and stopping to wait for execution feedback, these models continue generating past the code block — they hallucinate what they think the execution output will be, reason over that hallucinated output, and then commit to a FINAL() answer, all within a single turn.

For example, when asked "Which are the top 10 most spending customers?" with real customer data in context, Sonnet 4.6 would:

  1. Generate `print(context)` inside a ```repl``` block (correct)
  2. Continue generating completely fabricated output (e.g. "Alice: $1200, Bob: $450, Kevin: $5200" — none of which exist in the actual data)
  3. Reason over that fabricated output and provide FINAL(1. Kevin: $5200, 2. Hannah: $4500, ...) — a confidently wrong answer based on hallucinated data

The code blocks are executed correctly and produce the real results, but because find_final_answer() finds the FINAL() in the same response, the completion loop exits immediately. The real execution results are never fed back to the model, and the hallucinated answer is returned as-is.

This is a fundamental mismatch: the RLM loop assumes models will stop after generating a code block and wait for feedback, but Claude 4.6 models do not — they eagerly generate the full response, including predicted outputs and final answers, in one shot.

Fix

Skip text-based FINAL() detection when the response contains code blocks. If code blocks are present, the model may have hallucinated the output, so we discard any FINAL() found in the response text and instead feed back the real execution results via format_iteration(). The model then sees the actual data and provides the correct answer in the next turn.

This does not affect:

  • FINAL_VAR() from code execution — still works, since it retrieves real variables from the REPL
  • FINAL() in text-only responses (no code blocks) — still works, since the model is answering based on results it received in prior turns
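The check can be sketched as follows (a minimal illustration of the rule described above, not the exact patch; the regexes are assumptions about how code blocks and FINAL() are matched):

```python
import re

CODE_BLOCK_RE = re.compile(r"```repl\b.*?```", re.DOTALL)
FINAL_RE = re.compile(r"FINAL\((.*?)\)", re.DOTALL)

def find_text_final(response: str):
    # If the response contains a code block, any FINAL() in the text may
    # be based on hallucinated output: discard it so the loop feeds the
    # real execution results back to the model instead.
    if CODE_BLOCK_RE.search(response):
        return None
    match = FINAL_RE.search(response)
    return match.group(1) if match else None
```

With a code block present the text-level FINAL() is ignored; in a text-only response (the model answering from results it already saw) it is honored as before.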

Verification

  • All 234 existing unit tests pass
  • Live-tested with Sonnet 4.6: model hallucinated fake customer data in iteration 1, self-corrected after seeing real execution results in iteration 2, and returned the correct answer (Eve Davis at $31,445.20 as top spender) in iteration 3

When models like Sonnet 4.6 generate code blocks and FINAL() in the
same response, the FINAL() is based on hallucinated execution output
rather than actual results. Skip text-based FINAL() detection when
code blocks are present, forcing real execution results to be fed
back so the model can answer correctly in the next turn.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rolson24 commented Mar 2, 2026

Can't you just use a stop token at the end of the repl block? Set the stop token to be something like "```\n"
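A client-side illustration of the stop-sequence idea (the real mechanism would be the API's stop-sequence parameter cutting generation at the closing fence; this toy function just shows the effect on the earlier hallucinated response):

```python
def apply_stop_sequence(text: str, stop: str) -> str:
    # Client-side illustration of what an API-side stop sequence does:
    # the generation is cut at the first occurrence of the stop string,
    # so nothing after the closing code fence survives.
    idx = text.find(stop)
    return text[:idx] if idx != -1 else text

response = (
    "```repl\nprint(context)\n```\n"
    "Alice: $1200, Bob: $450\n"
    "FINAL(Kevin: $5200)"
)
# Stopping at the closing fence drops the hallucinated output and FINAL().
truncated = apply_stop_sequence(response, "```\n")
```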

@alexzhang13
Owner

This is a weird problem that's quite model dependent, let me think on it more. Ideally we'd like to have less hacky solutions to this...
