Thinking capability fast-follow improvements #4829
sarth6 wants to merge 11 commits into pydantic:main from
Conversation
Review of the merged thinking-cap PR identified correctness bugs, missing implementations, and gaps in test coverage. This plan documents the findings and proposes fixes across 5 phases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@DouweM I put together a plan here on areas we could potentially improve the thinking capability, let me know what you think, or we can go over it on Thursday! unless you think it's not high priority

Related issue (for human and AI reference) about OpenRouter-supported thinking efforts: #4708
PLAN.md (Outdated)

```
values. Fix to round DOWN (conservative = cheaper, safer):
- `True` → `'low'` (match xAI SDK default)
- `'low'` → `'low'`
- `'medium'` → `'low'` (round down)
```
I don't feel strongly about this one, do you? As long as we document what it means, I think either is fine.
Missed this one, yeah I will revert. Ideally we'd map medium to the most "reasonable" default, but that's unclear here; we would need to see data on xAI agent thinking usage to really know the answer to that.
- Anthropic: adaptive models now only use adaptive for True/'medium'; other effort levels fall through to budget-based thinking
- xAI: round down ambiguous effort levels (True/'medium' → 'low') for conservative defaults
- Groq: thinking=False now sends 'hidden' (closest to disable semantics)
- Cerebras: thinking=True/effort explicitly sets disable_reasoning=False
- Strip 'thinking' key from model_settings after resolution to prevent downstream code from reading the raw unresolved value
- Rename all provider thinking methods to _translate_thinking() for consistency and grep-ability
- Add comprehensive tests for all fixes
- Update docs thinking table to reflect corrected mappings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests that import provider-specific modules (google.genai, anthropic, openai, groq) need pytest.importorskip() guards to skip gracefully when running under pydantic-ai-slim without optional dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix Bedrock bug: _translate_thinking was inside the `if model_settings:` guard; after stripping 'thinking', the empty dict was falsy, skipping the call. Moved outside the guard and use `settings` (always truthy).
- Make settings stripping more robust: normalize empty dict to None after removing the 'thinking' key, preventing falsy-dict bugs in any provider.
- Add "tokens" unit to the bare number in the docs table for clarity.
- Clarify Cerebras comment: False is caught by the prior elif branch.
- Add test for thinking-only settings returning None after stripping.
- Groq model name change (qwen-qwq-32b → qwen/qwen3-32b) is correct: qwen-qwq is deprecated, qwen/qwen3-32b is current on Groq.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
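The Bedrock bug above comes down to Python treating an empty dict as falsy. A minimal, self-contained sketch of the pitfall and the fix (the function name here is a simplified stand-in, not the actual pydantic-ai code):

```python
def strip_thinking(model_settings: dict) -> dict:
    """Remove the resolved 'thinking' key from a settings dict."""
    return {k: v for k, v in model_settings.items() if k != 'thinking'}


# If 'thinking' was the only key, stripping leaves an empty dict.
settings = strip_thinking({'thinking': True})  # -> {} (empty, falsy)

ran = False
if settings:     # buggy guard: {} is falsy, so this block is skipped
    ran = True   # stand-in for the _translate_thinking() call
assert ran is False  # translation silently skipped: the bug

# Fix per the commit: normalize the empty dict to None so no code path
# ever branches on a falsy-but-non-None settings dict.
settings = strip_thinking({'thinking': True}) or None
assert settings is None
```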
Didn't mean to close this! I did create some conflicts in #4846 though 😅
Remove section dividers, excessive comments, and verbose docstrings from test_thinking.py. Remove explanatory comments from model source files. Move XAI_EFFORT_MAP to module level for better testability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in cerebras.py, groq.py, xai.py, and test_thinking.py. Take upstream's xAI effort mapping values (True -> 'high', medium -> 'high') and Groq's restructured if-chain while keeping our _translate_thinking rename and module-level XAI_EFFORT_MAP extraction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ngs.py
- Remove Phase references from test docstrings
- Merge TestAnthropicAdaptiveEffortLevels into TestAnthropicThinkingTranslation
- Merge TestCerebrasThinkingTrue into TestCerebrasThinkingTranslation
- Move TestMergeModelSettingsThinking to test_settings.py (1:1 file correspondence)
- Fix stale snapshot in test_adaptive_model_with_effort_level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `if budget is not None:` guards (which had unreachable else branches) with explicit `assert budget is not None` assertions, making tests stricter while eliminating dead code branches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
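The pattern in this commit is easy to see in miniature. A hypothetical before/after sketch (the `budget` helpers here are stand-ins, not the real test code):

```python
from typing import Optional


def check_budget_loose(budget: Optional[int]) -> bool:
    # Before: the guard has a fall-through that is unreachable in the
    # real tests, so an unexpected None would silently pass.
    if budget is not None:
        return budget > 0
    return True  # dead branch that hides a None


def check_budget_strict(budget: Optional[int]) -> bool:
    # After: the assert fails loudly if budget is ever None,
    # and eliminates the dead branch.
    assert budget is not None
    return budget > 0


assert check_budget_loose(None) is True   # None slips through
assert check_budget_strict(1024) is True

failed = False
try:
    check_budget_strict(None)
except AssertionError:
    failed = True
assert failed  # the strict version catches the None
```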
- Use Literal['low', 'high'] instead of str for XAI_EFFORT_MAP values
- Convert per-test importorskip calls to autouse fixtures (Bedrock, OpenRouter, Cerebras, xAI) matching the pattern used by Anthropic/OpenAI/Google/Groq
- Remove redundant importorskip calls from tests whose class already has an autouse fixture
- Add reason= parameter consistently per codebase convention

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the removal of module docstring, section dividers (# Helpers, # 1. Base class..., # 2. Per-provider..., etc.), function docstrings, and inline explanatory comments that existed before this PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@DouweM ready for review! yeah you're right - Anthropic adaptive ended up being the only true bug, most of everything else is improving typing/testing, and nitpicking to adhere to the plan (like renaming to `_translate_thinking`)
Also fyi the CI failure appears to be unrelated |
```python
XAI_EFFORT_MAP: dict[ThinkingLevel, Literal['low', 'high']] = {
    True: 'high',
    'minimal': 'low',
    'low': 'low',
    'medium': 'high',
    'high': 'high',
    'xhigh': 'high',
}
"""Maps unified thinking values to xAI reasoning_effort. xAI only supports 'low' and 'high'."""
```
This constant is only used in one place (_create_chat at line 570) and was previously defined inline there. Moving it to module level goes against the project guideline to scope single-use helpers and constants to their usage site to reduce namespace pollution and prevent accidental reuse of implementation details.
I'd suggest keeping the map inline where it's used, as it was before. This also eliminates the need for the TestXaiEffortMap test class, which is essentially testing a constant rather than behavior.
Sounds good, will adjust, but I'll keep my `Literal['low', 'high']` change as it's a typing improvement that aligns with the underlying xAI SDK
```diff
 if model_settings and 'thinking' in model_settings:
     thinking_value = model_settings['thinking']
-    if self.profile.supports_thinking or self.profile.thinking_always_enabled:
-        if thinking_value is False and self.profile.thinking_always_enabled:
-            pass  # Silent ignore: model always thinks, can't disable
-        else:
+    if not (thinking_value is False and self.profile.thinking_always_enabled):
         params = replace(params, thinking=thinking_value)
-    # else: silent ignore for unsupported models
+    stripped = {k: v for k, v in model_settings.items() if k != 'thinking'}
+    model_settings = cast(ModelSettings, stripped) if stripped else None
```
@DouweM: This is a new behavioral change: after resolving the unified thinking setting, the key is stripped from model_settings via a dict comprehension + cast. The intent is to prevent downstream providers from seeing thinking as an unknown key, which is reasonable.

A couple of things worth considering:

- The `cast(ModelSettings, stripped)` creates a plain `dict` that satisfies `ModelSettings` structurally but isn't actually a `TypedDict` instance. This is fine at runtime but worth being aware of.
- This adds a dict comprehension on every `prepare_request` call when `thinking` is present. Not a performance concern in practice, but worth a sanity check that this is the pattern you want going forward, versus, say, having providers ignore unknown keys or having `thinking` be a `ModelRequestParameters`-only field.

If this approach is approved, the logic looks correct; the no-mutation and no-leakage tests cover it well.
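The first point is easy to verify in isolation: `typing.cast()` is a purely static annotation, so at runtime the stripped value stays a plain `dict`. A minimal sketch, using a hypothetical `MySettings` as a stand-in for `ModelSettings`:

```python
from typing import TypedDict, cast


class MySettings(TypedDict, total=False):
    # Hypothetical stand-in for ModelSettings, not the real definition.
    thinking: bool
    temperature: float


raw = {'thinking': True, 'temperature': 0.2}
stripped = {k: v for k, v in raw.items() if k != 'thinking'}
settings = cast(MySettings, stripped) if stripped else None

assert type(settings) is dict           # cast() changes nothing at runtime
assert settings == {'temperature': 0.2}
assert 'thinking' not in settings       # downstream code can't read the raw value
```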
The approach was discussed in the call (it came from the transcript); it prevents downstream model code from accidentally reading the raw unresolved value
```diff
 params = ModelRequestParameters(thinking=True)
 settings: ModelSettings = {}
-result = AnthropicModel._get_thinking_param(adaptive_model, settings, params)
+result = AnthropicModel._translate_thinking(adaptive_model, settings, params)
```
All per-provider test classes in this file call private _translate_thinking methods directly (e.g., AnthropicModel._translate_thinking, OpenAIChatModel._translate_thinking, etc.). The project's test guidelines say to test through public APIs rather than private methods prefixed with _, because it ties tests to implementation details and makes refactoring harder.
I understand this was the pre-existing pattern before the rename, but since these tests are being significantly reworked and expanded, it would be a good opportunity to test through prepare_request (the public entry point) instead. The integration-style tests at the bottom of the file (TestThinkingIntegration) already demonstrate this approach. The TestPrepareRequestThinkingResolution class also tests through the public API — ideally all per-provider translation tests would follow the same pattern.
This doesn't need to block the PR, but it's worth considering for a follow-up since the current approach means any rename of the internal method (like this very PR!) requires updating dozens of test call sites.
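For illustration, testing through the public entry point might look like the following sketch. The classes are simplified stand-ins, not the real pydantic-ai API, but they show why a private-method rename would leave such a test untouched:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RequestParams:
    # Stand-in for ModelRequestParameters.
    thinking: Optional[bool] = None


class FakeModel:
    def _translate_thinking(self, value: bool) -> bool:
        # Private helper: tests that call this directly break on rename.
        return value

    def prepare_request(self, settings: dict, params: RequestParams) -> RequestParams:
        # Public entry point that delegates to the private helper.
        if 'thinking' in settings:
            return replace(params, thinking=self._translate_thinking(settings['thinking']))
        return params


# Test via the public API only: renaming _translate_thinking would not
# require touching this assertion.
result = FakeModel().prepare_request({'thinking': True}, RequestParams())
assert result.thinking is True
```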
```diff
     return model_response

-    def _get_reasoning_effort(
+    def _translate_thinking(
```
The rename from descriptive provider-specific names (_get_reasoning_effort, _get_reasoning, _get_thinking_config, _get_reasoning_format) to the uniform _translate_thinking across all providers loses some useful information. For example, this method returns a ReasoningEffort | Omit (a string), while OpenAIResponsesModel._translate_thinking below returns a Reasoning | Omit (a dict), and GoogleModel._translate_thinking returns a ThinkingConfigDict | None. The old names hinted at what each method returned.
Since these are internal methods and never called polymorphically (each provider calls its own), the consistency benefit is modest, while the descriptive names helped readers understand each provider's translation target. This is ultimately a judgment call, but @DouweM's earlier comment on the plan suggests he wasn't strongly in favor of this rename either — worth confirming it's desired.
Douwe already confirmed he was fine with the rename: #4829 (comment)
Closes #4829
Summary
- Anthropic: adaptive models now only use adaptive thinking for `thinking=True` or `'medium'`, not all effort levels
- Bedrock: moved the `_translate_thinking()` call outside the `if model_settings:` guard (empty dict after stripping is falsy)
- Cerebras: added an explicit `thinking=True` branch that sets `disable_reasoning=False`
- Groq: check `thinking is False` first and return `'hidden'`
- Strip `thinking` from `model_settings` after resolution in `prepare_request()` so downstream providers don't see it
- Rename `_get_thinking_param`/`_get_reasoning_effort` → `_translate_thinking` for consistency across all providers
- Type `XAI_EFFORT_MAP` values as `Literal['low', 'high']`
- Add tests (`tests/test_thinking.py`) covering the above, with resolution tested through `prepare_request()`; merge-settings tests moved to `test_settings.py`

Test plan

- `pytest tests/test_thinking.py tests/test_settings.py`: all 123 tests pass
- `ruff check` passes on modified files

🤖 Generated with Claude Code