Thinking capability fast-follow improvements #4829
sarth6 wants to merge 11 commits into pydantic:main from
Conversation
Review of the merged thinking-cap PR identified correctness bugs, missing implementations, and gaps in test coverage. This plan documents the findings and proposes fixes across 5 phases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@DouweM I put together a plan here on areas we could potentially improve the thinking capability, let me know what you think, or we can go over it on Thursday! unless you think it's not high priority

Related issue (for human and AI reference) about OpenRouter-supported thinking efforts: #4708
PLAN.md (Outdated)

```
values. Fix to round DOWN (conservative = cheaper, safer):
- `True` → `'low'` (match xAI SDK default)
- `'low'` → `'low'`
- `'medium'` → `'low'` (round down)
```
I don't feel strongly about this one, do you? As long as we document what it means, I think either is fine.
Missed this one, yeah I will revert. Ideally we'd map medium to the most "reasonable" default, but that's unclear here; we would need to see data on xAI agent thinking usage to really know the answer to that.
- Anthropic: adaptive models now only use adaptive for True/'medium'; other effort levels fall through to budget-based thinking
- xAI: round down ambiguous effort levels (True/'medium' → 'low') for conservative defaults
- Groq: thinking=False now sends 'hidden' (closest to disable semantics)
- Cerebras: thinking=True/effort explicitly sets disable_reasoning=False
- Strip 'thinking' key from model_settings after resolution to prevent downstream code from reading the raw unresolved value
- Rename all provider thinking methods to _translate_thinking() for consistency and grep-ability
- Add comprehensive tests for all fixes
- Update docs thinking table to reflect corrected mappings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Tests that import provider-specific modules (google.genai, anthropic, openai, groq) need pytest.importorskip() guards to skip gracefully when running under pydantic-ai-slim without optional dependencies. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix Bedrock bug: _translate_thinking was inside the `if model_settings:` guard; after stripping 'thinking', the empty dict was falsy, skipping the call. Moved outside the guard and use `settings` (always truthy).
- Make settings stripping more robust: normalize empty dict to None after removing the 'thinking' key, preventing falsy-dict bugs in any provider.
- Add "tokens" unit to the bare number in the docs table for clarity.
- Clarify Cerebras comment: False is caught by the prior elif branch.
- Add test for thinking-only settings returning None after stripping.
- Groq model name change (qwen-qwq-32b → qwen/qwen3-32b) is correct: qwen-qwq is deprecated, qwen/qwen3-32b is current on Groq.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
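The Bedrock bug above comes down to Python treating an empty dict as falsy. A minimal, self-contained sketch of the pitfall and the fix (the function name here is a simplified stand-in, not the actual pydantic-ai code):

```python
def strip_thinking(model_settings: dict) -> dict:
    """Remove the resolved 'thinking' key from a settings dict."""
    return {k: v for k, v in model_settings.items() if k != 'thinking'}


# If 'thinking' was the only key, stripping leaves an empty dict.
settings = strip_thinking({'thinking': True})  # -> {} (empty, falsy)

ran = False
if settings:     # buggy guard: {} is falsy, so this block is skipped
    ran = True   # stand-in for the _translate_thinking() call
assert ran is False  # translation silently skipped: the bug

# Fix per the commit: normalize the empty dict to None so no code path
# ever branches on a falsy-but-non-None settings dict.
settings = strip_thinking({'thinking': True}) or None
assert settings is None
```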
Didn't mean to close this! I did create some conflicts in #4846 though 😅
Remove section dividers, excessive comments, and verbose docstrings from test_thinking.py. Remove explanatory comments from model source files. Move XAI_EFFORT_MAP to module level for better testability. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in cerebras.py, groq.py, xai.py, and test_thinking.py. Take upstream's xAI effort mapping values (True -> 'high', medium -> 'high') and Groq's restructured if-chain while keeping our _translate_thinking rename and module-level XAI_EFFORT_MAP extraction. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ngs.py
- Remove Phase references from test docstrings
- Merge TestAnthropicAdaptiveEffortLevels into TestAnthropicThinkingTranslation
- Merge TestCerebrasThinkingTrue into TestCerebrasThinkingTranslation
- Move TestMergeModelSettingsThinking to test_settings.py (1:1 file correspondence)
- Fix stale snapshot in test_adaptive_model_with_effort_level

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace `if budget is not None:` guards (which had unreachable else branches) with explicit `assert budget is not None` assertions, making tests stricter while eliminating dead code branches. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
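The pattern in this commit is easy to see in miniature. A hypothetical before/after sketch (the `budget` helpers here are stand-ins, not the real test code):

```python
from typing import Optional


def check_budget_loose(budget: Optional[int]) -> bool:
    # Before: the guard has a fall-through that is unreachable in the
    # real tests, so an unexpected None would silently pass.
    if budget is not None:
        return budget > 0
    return True  # dead branch that hides a None


def check_budget_strict(budget: Optional[int]) -> bool:
    # After: the assert fails loudly if budget is ever None,
    # and eliminates the dead branch.
    assert budget is not None
    return budget > 0


assert check_budget_loose(None) is True   # None slips through
assert check_budget_strict(1024) is True

failed = False
try:
    check_budget_strict(None)
except AssertionError:
    failed = True
assert failed  # the strict version catches the None
```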
- Use Literal['low', 'high'] instead of str for XAI_EFFORT_MAP values
- Convert per-test importorskip calls to autouse fixtures (Bedrock, OpenRouter, Cerebras, xAI) matching the pattern used by Anthropic/OpenAI/Google/Groq
- Remove redundant importorskip calls from tests whose class already has an autouse fixture
- Add reason= parameter consistently per codebase convention

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reverts the removal of module docstring, section dividers (# Helpers, # 1. Base class..., # 2. Per-provider..., etc.), function docstrings, and inline explanatory comments that existed before this PR. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@DouweM ready for review! yeah you're right - Anthropic adaptive ended up being the only true bug, most of everything else is improving typing/testing, and nitpicking to adhere to the plan (like renaming to `_translate_thinking`)
Also fyi the CI failure appears to be unrelated |
```python
XAI_EFFORT_MAP: dict[ThinkingLevel, Literal['low', 'high']] = {
    True: 'high',
    'minimal': 'low',
    'low': 'low',
    'medium': 'high',
    'high': 'high',
    'xhigh': 'high',
}
"""Maps unified thinking values to xAI reasoning_effort. xAI only supports 'low' and 'high'."""
```
This constant is only used in one place (_create_chat at line 570) and was previously defined inline there. Moving it to module level goes against the project guideline to scope single-use helpers and constants to their usage site to reduce namespace pollution and prevent accidental reuse of implementation details.
I'd suggest keeping the map inline where it's used, as it was before. This also eliminates the need for the TestXaiEffortMap test class, which is essentially testing a constant rather than behavior.
Sounds good, will adjust, but I'll keep my `Literal['low', 'high']` change as it's a typing improvement that aligns with the underlying xAI SDK
```diff
 if model_settings and 'thinking' in model_settings:
     thinking_value = model_settings['thinking']
-    if self.profile.supports_thinking or self.profile.thinking_always_enabled:
-        if thinking_value is False and self.profile.thinking_always_enabled:
-            pass  # Silent ignore: model always thinks, can't disable
-        else:
+    if not (thinking_value is False and self.profile.thinking_always_enabled):
         params = replace(params, thinking=thinking_value)
-    # else: silent ignore for unsupported models
+    stripped = {k: v for k, v in model_settings.items() if k != 'thinking'}
+    model_settings = cast(ModelSettings, stripped) if stripped else None
```
@DouweM: This is a new behavioral change: after resolving the unified thinking setting, the key is stripped from model_settings via a dict comprehension + cast. The intent is to prevent downstream providers from seeing thinking as an unknown key, which is reasonable.

A couple of things worth considering:

- The `cast(ModelSettings, stripped)` creates a plain `dict` that satisfies `ModelSettings` structurally but isn't actually a `TypedDict` instance. This is fine at runtime but worth being aware of.
- This adds a dict comprehension on every `prepare_request` call when `thinking` is present. Not a performance concern in practice, but worth a sanity check that this is the pattern you want going forward, versus, say, having providers ignore unknown keys or having `thinking` be a `ModelRequestParameters`-only field.

If this approach is approved, the logic looks correct; the no-mutation and no-leakage tests cover it well.
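The first point is easy to verify in isolation: `typing.cast()` is a purely static annotation, so at runtime the stripped value stays a plain `dict`. A minimal sketch, using a hypothetical `MySettings` as a stand-in for `ModelSettings`:

```python
from typing import TypedDict, cast


class MySettings(TypedDict, total=False):
    # Hypothetical stand-in for ModelSettings, not the real definition.
    thinking: bool
    temperature: float


raw = {'thinking': True, 'temperature': 0.2}
stripped = {k: v for k, v in raw.items() if k != 'thinking'}
settings = cast(MySettings, stripped) if stripped else None

assert type(settings) is dict           # cast() changes nothing at runtime
assert settings == {'temperature': 0.2}
assert 'thinking' not in settings       # downstream code can't read the raw value
```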
The approach was discussed in the call (it came from the transcript); it prevents downstream model code from accidentally reading the raw unresolved value
```diff
 params = ModelRequestParameters(thinking=True)
 settings: ModelSettings = {}
-result = AnthropicModel._get_thinking_param(adaptive_model, settings, params)
+result = AnthropicModel._translate_thinking(adaptive_model, settings, params)
```
All per-provider test classes in this file call private _translate_thinking methods directly (e.g., AnthropicModel._translate_thinking, OpenAIChatModel._translate_thinking, etc.). The project's test guidelines say to test through public APIs rather than private methods prefixed with _, because it ties tests to implementation details and makes refactoring harder.
I understand this was the pre-existing pattern before the rename, but since these tests are being significantly reworked and expanded, it would be a good opportunity to test through prepare_request (the public entry point) instead. The integration-style tests at the bottom of the file (TestThinkingIntegration) already demonstrate this approach. The TestPrepareRequestThinkingResolution class also tests through the public API — ideally all per-provider translation tests would follow the same pattern.
This doesn't need to block the PR, but it's worth considering for a follow-up since the current approach means any rename of the internal method (like this very PR!) requires updating dozens of test call sites.
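For illustration, testing through the public entry point might look like the following sketch. The classes are simplified stand-ins, not the real pydantic-ai API, but they show why a private-method rename would leave such a test untouched:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RequestParams:
    # Stand-in for ModelRequestParameters.
    thinking: Optional[bool] = None


class FakeModel:
    def _translate_thinking(self, value: bool) -> bool:
        # Private helper: tests that call this directly break on rename.
        return value

    def prepare_request(self, settings: dict, params: RequestParams) -> RequestParams:
        # Public entry point that delegates to the private helper.
        if 'thinking' in settings:
            return replace(params, thinking=self._translate_thinking(settings['thinking']))
        return params


# Test via the public API only: renaming _translate_thinking would not
# require touching this assertion.
result = FakeModel().prepare_request({'thinking': True}, RequestParams())
assert result.thinking is True
```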
```diff
     return model_response

-    def _get_reasoning_effort(
+    def _translate_thinking(
```
The rename from descriptive provider-specific names (_get_reasoning_effort, _get_reasoning, _get_thinking_config, _get_reasoning_format) to the uniform _translate_thinking across all providers loses some useful information. For example, this method returns a ReasoningEffort | Omit (a string), while OpenAIResponsesModel._translate_thinking below returns a Reasoning | Omit (a dict), and GoogleModel._translate_thinking returns a ThinkingConfigDict | None. The old names hinted at what each method returned.
Since these are internal methods and never called polymorphically (each provider calls its own), the consistency benefit is modest, while the descriptive names helped readers understand each provider's translation target. This is ultimately a judgment call, but @DouweM's earlier comment on the plan suggests he wasn't strongly in favor of this rename either — worth confirming it's desired.
Douwe already confirmed he was fine with the rename: #4829 (comment)
Closes #4829
Summary
- Anthropic: adaptive models now only use adaptive thinking for `thinking=True` or `'medium'`, not all effort levels
- Bedrock: moved the `_translate_thinking()` call outside the `if model_settings:` guard (empty dict after stripping is falsy)
- Cerebras: added an explicit `thinking=True` branch that sets `disable_reasoning=False`
- Groq: check `thinking is False` first and return `'hidden'`
- Strip `thinking` from `model_settings` after resolution in `prepare_request()` so downstream providers don't see it
- Rename `_get_thinking_param`/`_get_reasoning_effort` → `_translate_thinking` for consistency across all providers
- Type `XAI_EFFORT_MAP` values as `Literal['low', 'high']`
- Add tests (`tests/test_thinking.py`) covering the above, with resolution tested through `prepare_request()`; merge-settings tests moved to `test_settings.py`

Test plan

- `pytest tests/test_thinking.py tests/test_settings.py`: all 123 tests pass
- `ruff check` passes on modified files

🤖 Generated with Claude Code