
Thinking capability fast-follow improvements#4829

Open
sarth6 wants to merge 11 commits into pydantic:main from sarth6:thinking-cap-improvements

Conversation

@sarth6
Contributor

@sarth6 sarth6 commented Mar 24, 2026

Closes #4829

Summary

  • Fix correctness bugs in unified thinking translation across providers:
    • Anthropic: adaptive thinking now only activates for thinking=True or 'medium', not all effort levels
    • Bedrock: move _translate_thinking() call outside if model_settings: guard (empty dict after stripping is falsy)
    • Cerebras: add thinking=True branch that sets disable_reasoning=False
    • Groq: restructure if-chain to check thinking is False first → return 'hidden'
  • Strip thinking from model_settings after resolution in prepare_request() so downstream providers don't see it
  • Rename _get_thinking_param / _get_reasoning_effort → _translate_thinking for consistency across all providers
  • Extract xAI inline effort map to module-level XAI_EFFORT_MAP
  • Update docs/thinking.md provider mapping tables for accuracy
  • Add comprehensive test suite (tests/test_thinking.py) covering:
    • Base thinking resolution via prepare_request()
    • Per-provider translation behavior (Anthropic, OpenAI Chat/Responses, Google, Groq, Bedrock, OpenRouter, Cerebras, xAI)
    • Thinking capability integration and serialization
    • Profile detection for thinking-capable models
    • No-mutation and no-leakage regression tests
    • Merge model settings with thinking fields (in test_settings.py)
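The Groq restructuring in the summary can be sketched as follows. This is a hedged illustration, not pydantic-ai's actual code: the function name, the `ThinkingValue` alias, and the choice of `'parsed'` for the enabled case are stand-ins; the key point is that `thinking is False` is checked first so it maps to Groq's `'hidden'` reasoning format instead of being swallowed by a later branch.

```python
from typing import Literal, Optional, Union

# Illustrative stand-ins for the unified thinking value and Groq's
# reasoning_format ('hidden' | 'raw' | 'parsed').
ThinkingValue = Union[bool, Literal['minimal', 'low', 'medium', 'high', 'xhigh'], None]
ReasoningFormat = Literal['hidden', 'raw', 'parsed']

def translate_thinking(thinking: ThinkingValue) -> Optional[ReasoningFormat]:
    # Check `thinking is False` FIRST: a truthiness check would treat False
    # like "unset" and never produce 'hidden'.
    if thinking is False:
        return 'hidden'  # closest Groq equivalent to "disabled"
    if thinking is None:
        return None  # leave the provider default untouched
    # True or any effort level: request visible reasoning.
    return 'parsed'
```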

Test plan

  • pytest tests/test_thinking.py tests/test_settings.py — all 123 tests pass
  • ruff check passes on modified files
  • CI lint + typecheck + full test suite

🤖 Generated with Claude Code

Review of the merged thinking-cap PR identified correctness bugs,
missing implementations, and gaps in test coverage. This plan
documents the findings and proposes fixes across 5 phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size: M Medium PR (101-500 weighted lines) chore labels Mar 24, 2026
@sarth6 sarth6 marked this pull request as ready for review March 24, 2026 22:48
@sarth6
Contributor Author

sarth6 commented Mar 24, 2026

@DouweM I put together a plan here on areas where we could potentially improve the thinking capability. Let me know what you think, or we can go over it on Thursday, unless you think it's not high priority.

devin-ai-integration[bot]

This comment was marked as resolved.

@DouweM
Collaborator

DouweM commented Mar 25, 2026

Related issue (for human and AI reference) about OpenRouter supported thinking efforts: #4708

PLAN.md (outdated)

> values. Fix to round DOWN (conservative = cheaper, safer):
> - `True` → `'low'` (match xAI SDK default)
> - `'low'` → `'low'`
> - `'medium'` → `'low'` (round down)
Collaborator

I don't feel strongly about this one, do you? As long as we document what it means, I think either is fine.

Contributor Author

Missed this one, yeah I will revert - ideally we map medium to the most "reasonable" default, but that's unclear here - we would need to see data on xai agent thinking usage to really know the answer to that.

- Anthropic: adaptive models now only use adaptive for True/'medium'; other
  effort levels fall through to budget-based thinking
- xAI: round down ambiguous effort levels (True/'medium' → 'low') for
  conservative defaults
- Groq: thinking=False now sends 'hidden' (closest to disable semantics)
- Cerebras: thinking=True/effort explicitly sets disable_reasoning=False
- Strip 'thinking' key from model_settings after resolution to prevent
  downstream code from reading the raw unresolved value
- Rename all provider thinking methods to _translate_thinking() for
  consistency and grep-ability
- Add comprehensive tests for all fixes
- Update docs thinking table to reflect corrected mappings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
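The Anthropic fix described in the commit message above can be sketched roughly as follows. Everything here is illustrative: the function name, the `{'type': 'adaptive'}` / `{'type': 'enabled'}` shapes, and especially the budget numbers are invented stand-ins, not pydantic-ai's or Anthropic's actual values. The behavior it shows is the one the commit describes: adaptive thinking only for `True`/`'medium'`; other effort levels fall through to budget-based thinking.

```python
from typing import Optional

# Invented budgets for illustration only.
EFFORT_BUDGETS = {'minimal': 1024, 'low': 2048, 'high': 16384, 'xhigh': 32768}

def anthropic_thinking_param(thinking, model_is_adaptive: bool) -> Optional[dict]:
    if thinking is False or thinking is None:
        return None
    # Adaptive only activates for thinking=True or 'medium' (the bug fix):
    if model_is_adaptive and (thinking is True or thinking == 'medium'):
        return {'type': 'adaptive'}
    # Other effort levels (and non-adaptive models) use an explicit budget;
    # 8192 is an assumed default for True/'medium' on non-adaptive models.
    budget = EFFORT_BUDGETS.get(thinking, 8192)
    return {'type': 'enabled', 'budget_tokens': budget}
```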
devin-ai-integration[bot]

This comment was marked as resolved.

sarth6 and others added 2 commits March 25, 2026 14:33
Tests that import provider-specific modules (google.genai, anthropic,
openai, groq) need pytest.importorskip() guards to skip gracefully
when running under pydantic-ai-slim without optional dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix Bedrock bug: _translate_thinking was inside `if model_settings:`
  guard; after stripping 'thinking', empty dict was falsy, skipping the
  call. Moved outside the guard and use `settings` (always truthy).
- Make settings stripping more robust: normalize empty dict to None after
  removing 'thinking' key, preventing falsy-dict bugs in any provider.
- Add "tokens" unit to bare number in docs table for clarity.
- Clarify Cerebras comment: False is caught by prior elif branch.
- Add test for thinking-only settings returning None after stripping.
- Groq model name change (qwen-qwq-32b → qwen/qwen3-32b) is correct:
  qwen-qwq is deprecated, qwen/qwen3-32b is current on Groq.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
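The Bedrock falsy-dict bug and its fix, as described in the commit message above, can be illustrated with a minimal sketch (the helper name is hypothetical): after removing the `'thinking'` key, an empty dict is falsy, so any downstream `if model_settings:` guard would silently skip work; normalizing the empty dict to `None` makes truthiness checks behave consistently.

```python
from typing import Optional

def strip_thinking(model_settings: Optional[dict]) -> Optional[dict]:
    if not model_settings or 'thinking' not in model_settings:
        return model_settings
    stripped = {k: v for k, v in model_settings.items() if k != 'thinking'}
    # Normalize {} to None: an empty dict is falsy, which is exactly what
    # made the Bedrock `if model_settings:` guard skip _translate_thinking.
    return stripped or None
```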
devin-ai-integration[bot]

This comment was marked as resolved.

@DouweM
Collaborator

DouweM commented Mar 25, 2026

Didn't mean to close this! I did create some conflicts in #4846 though 😅

Remove section dividers, excessive comments, and verbose docstrings from
test_thinking.py. Remove explanatory comments from model source files.
Move XAI_EFFORT_MAP to module level for better testability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

Resolve conflicts in cerebras.py, groq.py, xai.py, and test_thinking.py.
Take upstream's xAI effort mapping values (True -> 'high', medium -> 'high')
and Groq's restructured if-chain while keeping our _translate_thinking rename
and module-level XAI_EFFORT_MAP extraction.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
devin-ai-integration[bot]

This comment was marked as resolved.

@sarth6 sarth6 changed the title Unified thinking capability: improvement plan Fix unified thinking translation bugs and add comprehensive tests Mar 26, 2026
…ngs.py

- Remove Phase references from test docstrings
- Merge TestAnthropicAdaptiveEffortLevels into TestAnthropicThinkingTranslation
- Merge TestCerebrasThinkingTrue into TestCerebrasThinkingTranslation
- Move TestMergeModelSettingsThinking to test_settings.py (1:1 file correspondence)
- Fix stale snapshot in test_adaptive_model_with_effort_level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sarth6 sarth6 changed the title Fix unified thinking translation bugs and add comprehensive tests Thinking capability fast-follow improvements Mar 26, 2026
sarth6 and others added 4 commits March 26, 2026 15:49
Replace `if budget is not None:` guards (which had unreachable else
branches) with explicit `assert budget is not None` assertions, making
tests stricter while eliminating dead code branches.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use Literal['low', 'high'] instead of str for XAI_EFFORT_MAP values
- Convert per-test importorskip calls to autouse fixtures (Bedrock, OpenRouter,
  Cerebras, xAI) matching the pattern used by Anthropic/OpenAI/Google/Groq
- Remove redundant importorskip calls from tests whose class already has an
  autouse fixture
- Add reason= parameter consistently per codebase convention

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the removal of module docstring, section dividers (# Helpers,
# 1. Base class..., # 2. Per-provider..., etc.), function docstrings,
and inline explanatory comments that existed before this PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@sarth6 sarth6 requested a review from DouweM March 27, 2026 14:33
@sarth6
Contributor Author

sarth6 commented Mar 27, 2026

@DouweM ready for review! yeah you're right - anthropic adaptive ended up being the only true bug, most of everything else is improving typing/testing, and nitpicking to adhere to the plan (like renaming to _translate_thinking)

@sarth6
Contributor Author

sarth6 commented Mar 27, 2026

Also fyi the CI failure appears to be unrelated

Comment on lines +58 to +66
```python
XAI_EFFORT_MAP: dict[ThinkingLevel, Literal['low', 'high']] = {
    True: 'high',
    'minimal': 'low',
    'low': 'low',
    'medium': 'high',
    'high': 'high',
    'xhigh': 'high',
}
"""Maps unified thinking values to xAI reasoning_effort. xAI only supports 'low' and 'high'."""
```
Contributor

This constant is only used in one place (_create_chat at line 570) and was previously defined inline there. Moving it to module level goes against the project guideline to scope single-use helpers and constants to their usage site to reduce namespace pollution and prevent accidental reuse of implementation details.

I'd suggest keeping the map inline where it's used, as it was before. This also eliminates the need for the TestXaiEffortMap test class, which is essentially testing a constant rather than behavior.

Contributor Author

Sounds good, will adjust, but I'll keep my Literal['low', 'high'] change as it's a typing improvement that aligns with the underlying xAI SDK

Comment on lines +779 to +785
```diff
 if model_settings and 'thinking' in model_settings:
     thinking_value = model_settings['thinking']
     if self.profile.supports_thinking or self.profile.thinking_always_enabled:
-        if thinking_value is False and self.profile.thinking_always_enabled:
-            pass  # Silent ignore: model always thinks, can't disable
-        else:
+        if not (thinking_value is False and self.profile.thinking_always_enabled):
             params = replace(params, thinking=thinking_value)
     # else: silent ignore for unsupported models
     stripped = {k: v for k, v in model_settings.items() if k != 'thinking'}
     model_settings = cast(ModelSettings, stripped) if stripped else None
```
Contributor

@DouweM — This is a new behavioral change: after resolving the unified thinking setting, the key is stripped from model_settings via a dict comprehension + cast. The intent is to prevent downstream providers from seeing thinking as an unknown key, which is reasonable.

A couple of things worth considering:

  1. The cast(ModelSettings, stripped) creates a plain dict that satisfies ModelSettings structurally but isn't actually a TypedDict instance. This is fine at runtime but worth being aware of.
  2. This adds a dict comprehension on every prepare_request call when thinking is present. Not a performance concern in practice, but worth a sanity check that this is the pattern you want going forward — vs, say, having providers ignore unknown keys or having thinking be a ModelRequestParameters-only field.

If this approach is approved, the logic looks correct — the no-mutation and no-leakage tests cover it well.
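The reviewer's point 1 can be shown with a small self-contained sketch. `ModelSettings` here is a simplified stand-in for pydantic-ai's real TypedDict (which has many more keys); the point is that `cast()` has no runtime effect, and the comprehension result is a plain dict that merely satisfies the TypedDict structurally.

```python
from typing import Optional, TypedDict, cast

class ModelSettings(TypedDict, total=False):  # simplified stand-in
    temperature: float
    thinking: bool

def strip_thinking(settings: ModelSettings) -> Optional[ModelSettings]:
    stripped = {k: v for k, v in settings.items() if k != 'thinking'}
    # cast() is a no-op at runtime; it only tells the type checker that this
    # plain dict may be treated as ModelSettings (structural typing).
    return cast(ModelSettings, stripped) if stripped else None
```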

Contributor Author

The approach was discussed in the call (came from the transcript) - prevents downstream model code from accidentally reading the raw unresolved value

```diff
 params = ModelRequestParameters(thinking=True)
 settings: ModelSettings = {}
-result = AnthropicModel._get_thinking_param(adaptive_model, settings, params)
+result = AnthropicModel._translate_thinking(adaptive_model, settings, params)
```
Contributor

All per-provider test classes in this file call private _translate_thinking methods directly (e.g., AnthropicModel._translate_thinking, OpenAIChatModel._translate_thinking, etc.). The project's test guidelines say to test through public APIs rather than private methods prefixed with _, because it ties tests to implementation details and makes refactoring harder.

I understand this was the pre-existing pattern before the rename, but since these tests are being significantly reworked and expanded, it would be a good opportunity to test through prepare_request (the public entry point) instead. The integration-style tests at the bottom of the file (TestThinkingIntegration) already demonstrate this approach. The TestPrepareRequestThinkingResolution class also tests through the public API — ideally all per-provider translation tests would follow the same pattern.

This doesn't need to block the PR, but it's worth considering for a follow-up since the current approach means any rename of the internal method (like this very PR!) requires updating dozens of test call sites.

Contributor Author


@DouweM what do you think?

```diff
     return model_response

-def _get_reasoning_effort(
+def _translate_thinking(
```
Contributor


The rename from descriptive provider-specific names (_get_reasoning_effort, _get_reasoning, _get_thinking_config, _get_reasoning_format) to the uniform _translate_thinking across all providers loses some useful information. For example, this method returns a ReasoningEffort | Omit (a string), while OpenAIResponsesModel._translate_thinking below returns a Reasoning | Omit (a dict), and GoogleModel._translate_thinking returns a ThinkingConfigDict | None. The old names hinted at what each method returned.

Since these are internal methods and never called polymorphically (each provider calls its own), the consistency benefit is modest, while the descriptive names helped readers understand each provider's translation target. This is ultimately a judgment call, but @DouweM's earlier comment on the plan suggests he wasn't strongly in favor of this rename either — worth confirming it's desired.

Contributor Author


Douwe already confirmed he was fine with the rename: #4829 (comment)
