
WIP: streaming + claude reasoning fix #1073

Closed

leonardmq wants to merge 13 commits into main from leonard/kil-420-adapter-add-streaming

Conversation


@leonardmq leonardmq commented Feb 20, 2026

What does this PR do?

Changes:

  • add streaming proof of concept
  • add thinking_level on Claude models
  • add flags in LiteLLM adapter to enable thinking for Claude models (OpenRouter requires a different config than Anthropic; they conflict)
  • integration test verifying the streaming events and response completion

Checklists

  • Tests have been run locally and passed
  • New tests have been added to any work in /lib

Summary by CodeRabbit

  • New Features

    • Real-time streaming support across model adapters, with chunk callback hooks for viewing streamed responses as they arrive
    • New "thinking" model variants for the Claude family, enabling enhanced reasoning modes
  • Tests

    • Comprehensive streaming tests and fixtures validating chunk streaming, final response assembly, re-iteration, and edge cases
  • Dependencies

    • Updated underlying model client library to a newer compatible version

@leonardmq leonardmq marked this pull request as draft February 20, 2026 10:21
@gemini-code-assist

Summary of Changes

Hello @leonardmq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the system's interaction with Claude models by enabling and improving reasoning capabilities and introducing a robust streaming mechanism. The changes allow for real-time processing of model outputs and ensure that reasoning steps are properly captured, particularly for models accessed via LiteLLM. This significantly improves the user experience for long-running model calls and provides more detailed insights into the model's thought process.

Highlights

  • Streaming Proof of Concept: Introduced a new StreamingCompletion class to provide an async iterable wrapper around litellm.acompletion for handling streaming responses, allowing chunks to be processed as they arrive and the final assembled response to be accessed after iteration.
  • Claude Model Reasoning Capabilities: Added reasoning_capable=True and thinking_level='medium' flags to various Claude models (Haiku, Sonnet, Opus) across both Anthropic and OpenRouter providers in the ml_model_list.
  • LiteLLM Adapter Enhancements: Modified the LiteLLM adapter to integrate the new streaming functionality, passing an on_chunk callback for real-time processing of streamed data. Also implemented specific logic for OpenRouter to correctly handle reasoning_effort parameters.
  • Comprehensive Integration Tests: Added new integration tests to verify the functionality of streaming events and response completion for Claude models, ensuring that reasoning content, tool calls, and final output are correctly captured and processed during streaming.
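The StreamingCompletion pattern described in the highlights can be sketched with a stubbed chunk source standing in for litellm.acompletion. The class below is an illustrative stand-in, not the PR's implementation: the real wrapper iterates a litellm stream and assembles the final response via litellm.stream_chunk_builder, as noted in the comments.

```python
import asyncio

class StreamingCompletionSketch:
    """Illustrative stand-in for the StreamingCompletion wrapper: yields
    chunks as they arrive, then exposes the assembled response after
    iteration. A plain list stubs the provider stream here."""

    def __init__(self, chunks):
        self._chunks = chunks
        self._response = None

    async def __aiter__(self):
        collected = []
        for chunk in self._chunks:  # the real wrapper does `async for` over litellm
            collected.append(chunk)
            yield chunk
        # the real wrapper builds this via litellm.stream_chunk_builder(collected)
        self._response = "".join(collected)

    @property
    def response(self):
        if self._response is None:
            raise RuntimeError("Access .response only after iteration completes")
        return self._response

async def demo():
    stream = StreamingCompletionSketch(["Hel", "lo", "!"])
    parts = [chunk async for chunk in stream]
    return parts, stream.response
```

The key design point the highlights describe: the caller consumes chunks incrementally, yet still gets a single assembled response object afterwards, guarded so that premature access fails loudly.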


Changelog
  • libs/core/kiln_ai/adapters/litellm_utils/litellm_streaming.py
    • Added StreamingCompletion class for async iteration over LiteLLM streaming responses.
  • libs/core/kiln_ai/adapters/litellm_utils/test_litellm_streaming.py
    • Added unit tests for StreamingCompletion covering chunk yielding, response handling, and argument forwarding.
  • libs/core/kiln_ai/adapters/ml_model_list.py
    • Updated Claude model definitions to include reasoning_capable=True and thinking_level='medium' for various models.
  • libs/core/kiln_ai/adapters/model_adapters/base_adapter.py
    • Introduced StreamCallback type for streaming functionality.
    • Modified invoke, _run_returning_run_output, and _run methods to accept an on_chunk parameter for streaming.
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
    • Refactored acompletion_checking_response to use StreamingCompletion for handling streaming calls.
    • Removed direct litellm import, now using StreamingCompletion.
    • Implemented logic to pass on_chunk callback to the streaming process.
    • Added conditional logic to build_extra_body to include allowed_openai_params for OpenRouter when thinking_level is set.
  • libs/core/kiln_ai/adapters/model_adapters/test_base_adapter.py
    • Updated MockAdapter._run and several test methods to accept **kwargs to support the new on_chunk parameter.
    • Added new test cases for TestStreamCallback to verify on_chunk forwarding and default behavior.
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
    • Updated mock litellm.acompletion calls to use patch.object on LiteLlmAdapter.acompletion_checking_response for consistency with streaming changes.
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py
    • Added new file with extensive integration tests for LiteLLM streaming, including ChunkRenderer and ChunkRawRenderer for verifying streamed content, reasoning, and tool calls across various Claude models.
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_tools.py
    • Modified mock litellm.acompletion calls to use patch.object on LiteLlmAdapter.acompletion_checking_response.
    • Added a test to ensure the on_chunk handler functions correctly end-to-end with real models.
  • libs/core/kiln_ai/adapters/model_adapters/test_saving_adapter_results.py
    • Updated MockAdapter._run method signature to accept **kwargs.
  • libs/core/kiln_ai/adapters/model_adapters/test_structured_output.py
    • Updated MockAdapter._run method signature to accept **kwargs.
    • Modified mock litellm.acompletion calls to use patch.object on LiteLlmAdapter.acompletion_checking_response.
  • libs/core/kiln_ai/adapters/test_prompt_adaptors.py
    • Updated mock litellm.acompletion calls to use patch.object on LiteLlmAdapter.acompletion_checking_response.
  • libs/core/kiln_ai/adapters/test_prompt_builders.py
    • Updated MockAdapter._run method signature to accept **kwargs.
  • libs/core/kiln_ai/datamodel/test_basemodel.py
    • Updated MockAdapter._run method signature to accept **kwargs.
  • libs/core/pyproject.toml
    • Updated litellm dependency to >=1.81.13.
    • Updated typer-slim dependency to >=0.21.2.
  • uv.lock
    • Updated lock file entries for annotated-doc, huggingface-hub, importlib-metadata, litellm, and tokenizers to newer versions.
    • Updated typer-slim entry to 0.21.2 and added its new dependency on annotated-doc.


coderabbitai bot commented Feb 20, 2026

Important

Review skipped

Ignore keyword(s) found in the title.

⛔ Ignored keywords (2)
  • WIP
  • Draft

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: a9d99f24-84f3-4e13-9c0e-3a8f1e56aad8


Walkthrough

Adds an async StreamingCompletion wrapper and threads an on_chunk StreamCallback through adapter layers to support litellm streaming; updates litellm dependency, model metadata, and extensive tests to validate chunking, callback propagation, and final response assembly.

Changes

  • Streaming Core (libs/core/kiln_ai/adapters/litellm_utils/litellm_streaming.py): New StreamingCompletion async iterator that calls litellm.acompletion(stream=True, ...), yields ModelResponseStream chunks, collects chunks, builds the final response via litellm.stream_chunk_builder, and exposes .response after iteration.
  • Streaming Unit Tests (libs/core/kiln_ai/adapters/litellm_utils/test_litellm_streaming.py): Tests for chunk ordering, response finalization, premature .response access error, stream kwarg normalization, arg passthrough, builder invocation, re-iteration reset, and empty-stream behavior.
  • Adapter Interfaces (libs/core/kiln_ai/adapters/model_adapters/base_adapter.py): Adds a StreamCallback type alias and an on_chunk parameter to invoke, invoke_returning_run_output, _run_returning_run_output, and the abstract _run signature.
  • LiteLlmAdapter Streaming (libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py): Integrates StreamingCompletion, propagates on_chunk through _run_model_turn and _run, invokes the callback per chunk, obtains the final response from the stream wrapper, and adds OpenRouter-specific allowed-params handling.
  • MCP Adapter Streaming (libs/core/kiln_ai/adapters/model_adapters/mcp_adapter.py): Threads on_chunk through MCPAdapter public/private methods and updates imports to expose StreamCallback.
  • Tests: Adapter Mocks Updated (libs/core/kiln_ai/adapters/model_adapters/test_base_adapter.py, .../test_saving_adapter_results.py, .../test_prompt_builders.py, libs/core/kiln_ai/datamodel/test_basemodel.py): Updated test MockAdapter _run signatures to accept **kwargs and added TestStreamCallback tests to assert on_chunk propagation.
  • LiteLLM Adapter Tests Updated/Added (libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py, test_litellm_adapter_tools.py, test_structured_output.py, test_prompt_adaptors.py): Replaced direct litellm.acompletion patches with LiteLlmAdapter.acompletion_checking_response mocking and added streaming-aware assertions; introduced a new streaming test module with chunk renderers.
  • Streaming Integration Tests (libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py): New comprehensive streaming integration tests with ChunkRendererAbstract, ChunkRenderer, ChunkRawRenderer, fixtures, and multiple streaming validation scenarios across providers.
  • Model List Updates (libs/core/kiln_ai/adapters/ml_model_list.py): Adds multiple Claude "thinking" model enum entries and KilnModel configurations with thinking_level="medium" and related provider entries.
  • Dependency (libs/core/pyproject.toml): Bumps the litellm dependency constraint from >=1.80.9 to >=1.81.13.
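The StreamCallback alias added to base_adapter.py is, per the walkthrough, an awaitable callback over stream chunks. A sketch of what such an alias typically looks like follows; the exact definition is not shown in this page, so the `Any` parameter is an assumption (the real alias is typed against litellm's ModelResponseStream).

```python
import asyncio
from typing import Any, Awaitable, Callable

# Hypothetical shape of the alias; the real one uses ModelResponseStream
# from litellm rather than Any.
StreamCallback = Callable[[Any], Awaitable[None]]

seen: list[Any] = []

async def collect_chunk(chunk: Any) -> None:
    # A trivial callback: record each chunk as it arrives.
    seen.append(chunk)

callback: StreamCallback = collect_chunk
asyncio.run(callback("delta"))
```

Typing the callback as awaitable lets adapters `await on_chunk(chunk)` inside their streaming loops without blocking the event loop.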

Sequence Diagram(s)

sequenceDiagram
    participant Client as Client
    participant BaseAdapter as BaseAdapter
    participant LiteLlmAdapter as LiteLlmAdapter
    participant StreamingCompletion as StreamingCompletion
    participant litellm as litellm
    participant on_chunk as on_chunk Callback

    Client->>BaseAdapter: invoke(input, on_chunk=callback)
    BaseAdapter->>LiteLlmAdapter: _run(input, on_chunk=callback)
    LiteLlmAdapter->>LiteLlmAdapter: _run_model_turn(on_chunk=callback)
    LiteLlmAdapter->>StreamingCompletion: __aiter__() / create wrapper
    StreamingCompletion->>litellm: acompletion(..., stream=True)
    litellm-->>StreamingCompletion: async iterator of chunks
    loop For each chunk
        StreamingCompletion->>StreamingCompletion: collect chunk
        StreamingCompletion-->>LiteLlmAdapter: yield chunk
        LiteLlmAdapter->>on_chunk: await callback(chunk)
        on_chunk-->>LiteLlmAdapter: callback awaited
    end
    StreamingCompletion->>litellm: stream_chunk_builder(collected_chunks)
    litellm-->>StreamingCompletion: final assembled response
    StreamingCompletion-->>LiteLlmAdapter: expose .response
    LiteLlmAdapter-->>BaseAdapter: return RunOutput with final response
    BaseAdapter-->>Client: complete TaskRun
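The loop in the diagram corresponds to roughly this adapter-side call pattern. The stub class below stands in for the StreamingCompletion wrapper; names follow the walkthrough, but this is a sketch, not the PR's code.

```python
import asyncio

class FakeStream:
    """Stub standing in for the StreamingCompletion wrapper in the diagram."""
    def __init__(self, chunks):
        self._chunks = chunks
        self.response = None

    async def __aiter__(self):
        collected = []
        for chunk in self._chunks:
            collected.append(chunk)
            yield chunk
        self.response = "".join(collected)  # stands in for stream_chunk_builder

async def run_with_streaming(stream, on_chunk=None):
    # Adapter-side loop from the diagram: forward each chunk to the
    # callback as it arrives, then read the final assembled response.
    async for chunk in stream:
        if on_chunk is not None:
            await on_chunk(chunk)
    return stream.response

received = []

async def collect(chunk):
    received.append(chunk)

final = asyncio.run(run_with_streaming(FakeStream(["a", "b", "c"]), collect))
```

Note that when `on_chunk` is None the loop still drains the stream, since the final response only becomes available after full iteration.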

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~65 minutes

Suggested labels

codex

Suggested reviewers

  • scosman
  • sfierro

"I'm a rabbit in the code-laden glen,
Hopping bytes and streaming again,
Chunks arrive like carrots in line,
Callbacks nibble each tasty time,
Final response—hooray!—now it's mine." 🐇✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 23.47%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
  • Title check (❓ Inconclusive): The title 'WIP: streaming + claude reasoning fix' is partially related to the changeset but is marked as WIP and lacks specificity about the main changes. Resolution: replace it with a more specific, non-draft title that clearly describes the main contribution, such as 'Add streaming support and Claude thinking models configuration'.

✅ Passed checks (1 passed)

  • Description check (✅ Passed): The PR description covers key changes and includes completed checklists, but lacks details on the implementation approach, and the related issues section is empty.


@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR introduces streaming support for litellm completions and fixes an issue with Claude models missing reasoning. The changes are well-implemented, with a new StreamingCompletion wrapper to handle streaming logic cleanly. The on_chunk callback is plumbed through the adapter stack correctly. The fix for Claude models by adding allowed_openai_params for OpenRouter is a good, well-commented solution for a provider-specific issue. The addition of unit tests for the new streaming utility and extensive integration tests for various models demonstrates thoroughness. I've left one minor suggestion in a test helper to make it more robust. Overall, this is a great PR.

@leonardmq leonardmq changed the title fix: claude models missing reasoning WIP: streaming + claude reasoning fix Feb 20, 2026
@leonardmq leonardmq force-pushed the leonard/kil-420-adapter-add-streaming branch from aafe5f3 to 446d4d1 Compare February 20, 2026 10:28
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (1)

292-310: ⚠️ Potential issue | 🟠 Major

Always-streaming path creates a silent usage tracking regression for non-Claude providers.

acompletion_checking_response now unconditionally routes all completions through StreamingCompletion, whereas it previously used non-streaming litellm.acompletion directly. Before this change, response._hidden_params["response_cost"] and response.usage were reliably populated for all providers. Now, usage data depends on whether each provider includes it in streaming chunks — providers that only include usage in non-streaming responses, or emit it in ways stream_chunk_builder doesn't correctly reassemble, will silently return Usage() with all None fields.

This is an intentional change to support Claude extended thinking (which requires streaming), but it affects all providers. Usage tracking is not tested with streaming responses, creating a gap between the feature change and test coverage.

Recommended actions:

  1. Either preserve the non-streaming path when on_chunk is None:
if on_chunk is not None:
    stream = StreamingCompletion(**kwargs)
    async for chunk in stream:
        await on_chunk(chunk)
    response = stream.response
else:
    import litellm
    response = await litellm.acompletion(**kwargs)
  2. Or add an explicit test verifying usage_from_response correctly populates tokens and cost from a streamed response for at least one non-Claude provider (e.g., OpenAI GPT-4).

The build_extra_body thinking_level fix (reasoning_effort + allowed_openai_params) is correct.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py` around lines
292 - 310, The change in acompletion_checking_response routes all completions
through StreamingCompletion causing usage/_hidden_params["response_cost"] to be
missing for providers that only populate usage on the non-streaming path;
restore the previous non-streaming behavior when no chunk handler is provided or
add tests to ensure streaming reconstructs usage. Update
acompletion_checking_response so that if on_chunk is None it calls
litellm.acompletion(**kwargs) and assigns that to response, otherwise use
StreamingCompletion(**kwargs) and iterate chunks; alternatively add a test that
verifies usage_from_response (or response._hidden_params["response_cost"]) is
correctly populated for a non-Claude provider (e.g., OpenAI GPT-4) when using
StreamingCompletion/stream_chunk_builder. Ensure references to
StreamingCompletion, litellm.acompletion, acompletion_checking_response, and
usage_from_response are used to locate/edit the code and tests.
🧹 Nitpick comments (2)
libs/core/kiln_ai/adapters/model_adapters/base_adapter.py (1)

7-8: litellm dependency introduced in BaseAdapter.

from litellm.types.utils import ModelResponseStream couples the abstract base class (and all its non-litellm subclasses) to the litellm package. Consider moving StreamCallback to a standalone streaming_types.py module (using Any or a protocol) so non-litellm adapters don't carry a transitive litellm dependency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/base_adapter.py` around lines 7 -
8, BaseAdapter currently imports ModelResponseStream from litellm which forces a
transitive litellm dependency; create a new standalone module streaming_types.py
that defines a StreamCallback type (either as typing.Any or a lightweight
typing.Protocol matching ModelResponseStream’s public API) and export any
minimal streaming types there, then update
libs/core/kiln_ai/adapters/model_adapters/base_adapter.py to import
StreamCallback from streaming_types instead of ModelResponseStream and update
any references in BaseAdapter to use StreamCallback; leave direct litellm
imports only in adapters that actually need litellm-specific ModelResponseStream
and adjust their imports accordingly.
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py (1)

77-92: Double getattr in render_chunk for reasoning_content.

Lines 85-87 call getattr(chunk.choices[0].delta, "reasoning_content", None) twice — once in the elif condition and once to assign text. The second call is redundant.

♻️ Proposed simplification
-            elif getattr(chunk.choices[0].delta, "reasoning_content", None) is not None:
-                text = getattr(chunk.choices[0].delta, "reasoning_content", None)
-                if text is not None:
-                    self.render_reasoning(text)
+            elif (text := getattr(chunk.choices[0].delta, "reasoning_content", None)) is not None:
+                self.render_reasoning(text)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py`
around lines 77 - 92, In render_chunk, avoid calling
getattr(chunk.choices[0].delta, "reasoning_content", None) twice: first evaluate
and store it in a local variable (e.g., reasoning_text) then use that variable
both for the truthy check and for passing to render_reasoning; update the branch
under render_chunk where it currently has the two getattr calls so the condition
checks the stored reasoning_text and the subsequent call passes that same
variable to render_reasoning.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@libs/core/kiln_ai/adapters/ml_model_list.py`:
- Line 1348: The provider entries that set thinking_level (e.g., the
dict/constructor call containing thinking_level="medium") must also set
reasoning_capable=True so Claude providers use single-turn reasoning and extract
"thinking" output; update every provider config that currently only sets
thinking_level (lines referenced include the occurrences at
thinking_level="medium" and the other positions) to include
reasoning_capable=True. For Anthropic providers where applicable, instead add or
also set anthropic_extended_thinking=True. Locate the provider definitions in
ml_model_list.py (the entries that include thinking_level) and add the
appropriate boolean flag to those same dict/constructor calls.


@leonardmq leonardmq force-pushed the leonard/kil-420-adapter-add-streaming branch from 446d4d1 to 9642085 Compare February 20, 2026 10:32

github-actions bot commented Feb 20, 2026

📊 Coverage Report

Overall Coverage: 91%

Diff: origin/main...HEAD

  • libs/core/kiln_ai/adapters/litellm_utils/litellm_streaming.py (100%)
  • libs/core/kiln_ai/adapters/model_adapters/base_adapter.py (100%)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (33.3%): Missing lines 301-305,422
  • libs/core/kiln_ai/adapters/model_adapters/mcp_adapter.py (80.0%): Missing lines 86

Summary

  • Total: 48 lines
  • Missing: 7 lines
  • Coverage: 85%

Line-by-line

View line-by-line diff coverage

libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py

Lines 297-309

  297 
  298     async def acompletion_checking_response(
  299         self, on_chunk: StreamCallback | None = None, **kwargs
  300     ) -> Tuple[ModelResponse, Choices]:
! 301         stream = StreamingCompletion(**kwargs)
! 302         async for chunk in stream:
! 303             if on_chunk is not None:
! 304                 await on_chunk(chunk)
! 305         response = stream.response
  306 
  307         if (
  308             not isinstance(response, ModelResponse)
  309             or not response.choices

Lines 418-426

  418 
  419             # anthropic does not need allowed_openai_params, and we get an error if we pass it in
  420             # but openrouter for example does need it or throws an error
  421             if provider.name == ModelProviderName.openrouter:
! 422                 extra_body["allowed_openai_params"] = ["reasoning_effort"]
  423 
  424         if provider.require_openrouter_reasoning:
  425             # https://openrouter.ai/docs/use-cases/reasoning-tokens
  426             extra_body["reasoning"] = {

libs/core/kiln_ai/adapters/model_adapters/mcp_adapter.py

Lines 82-90

  82         input: InputType,
  83         input_source: DataSource | None = None,
  84         on_chunk: StreamCallback | None = None,
  85     ) -> TaskRun:
! 86         run_output, _ = await self.invoke_returning_run_output(
  87             input, input_source, on_chunk=on_chunk
  88         )
  89         return run_output


@leonardmq

@coderabbitai review


coderabbitai bot commented Feb 20, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (9)
libs/core/kiln_ai/adapters/litellm_utils/litellm_streaming.py (1)

30-36: Re-iterating the same instance silently makes a duplicate API call.

Every async for chunk in stream: invocation calls __aiter__, which resets _response/_iterated and issues a fresh litellm.acompletion. Since this is intended as a single-use wrapper (matching the pattern in acompletion_checking_response), this should either be documented in the class docstring or enforced with a guard:

def __aiter__(self):
    if self._iterated:
        raise RuntimeError("StreamingCompletion has already been iterated; create a new instance.")
    ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/litellm_utils/litellm_streaming.py` around lines
30 - 36, The wrapper silently allows re-iteration causing duplicate
litellm.acompletion calls; update the StreamingCompletion iterator logic (the
__aiter__ method) to enforce single-use by checking self._iterated and raising a
RuntimeError("StreamingCompletion has already been iterated; create a new
instance.") if True, and only call litellm.acompletion and set
self._response/_iterated on the first iteration; alternatively add this
single-use behavior note to the class docstring for clarity.
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py (4)

76-91: render_chunk assumes chunk.choices is non-empty.

Line 77 accesses chunk.choices[0] without a guard. While litellm streaming chunks should always contain at least one choice, a defensive check would prevent a confusing IndexError if an unexpected chunk shape arrives.

Suggested guard
     async def render_chunk(self, chunk: litellm.ModelResponseStream):
+        if not chunk.choices:
+            return
         if chunk.choices[0].finish_reason is not None:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py`
around lines 76 - 91, The render_chunk function assumes chunk.choices[0] exists
and can raise IndexError for empty choices; add a defensive guard at the top of
render_chunk (in the render_chunk method) that checks if not chunk.choices (or
len(chunk.choices) == 0) and handle that case (e.g., call
self.render_unknown(chunk) or return) before referencing chunk.choices[0]; then
proceed with the existing logic for tool_calls, reasoning_content, content, and
finish_reason.

127-153: StructuredOutputMode.unknown is fragile — prefer default.

Line 139 uses StructuredOutputMode.unknown, which raises ValueError("Structured output mode is unknown.") in response_format_options(). It currently works only because the task has no output schema, so has_structured_output() short-circuits. If anyone later adds an output schema to the task fixture, these tests will fail with an opaque error. Using StructuredOutputMode.default would be safer and more representative of real usage.

Suggested fix
-                    structured_output_mode=StructuredOutputMode.unknown,
+                    structured_output_mode=StructuredOutputMode.default,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py`
around lines 127 - 153, The test fixture uses StructuredOutputMode.unknown which
can later raise ValueError in response_format_options() if the task gains an
output schema; update the adapter_factory fixture to use
StructuredOutputMode.default instead of .unknown when constructing the
LiteLlmAdapter’s LiteLlmConfig/KilnAgentRunConfigProperties so
has_structured_output() won’t short-circuit and tests remain stable (look for
adapter_factory, LiteLlmAdapter, LiteLlmConfig, KilnAgentRunConfigProperties,
response_format_options).

97-107: Dead current_block_type field in ChunkRawRenderer.

Line 100 initializes self.current_block_type but it's never read or written afterwards. Looks like a copy-paste artifact from ChunkRenderer.

Fix
 class ChunkRawRenderer(ChunkRendererAbstract):
     def __init__(self):
         self.chunks: list[litellm.ModelResponseStream] = []
-        self.current_block_type: str | None = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py`
around lines 97 - 107, The ChunkRawRenderer class defines an unused field
current_block_type (set in __init__) that is never read or mutated—remove this
dead field to clean up the class; update the __init__ of ChunkRawRenderer to
only initialize self.chunks and leave render_chunk and get_stream_text unchanged
(methods: ChunkRawRenderer.__init__, ChunkRawRenderer.render_chunk,
ChunkRawRenderer.get_stream_text).

156-172: Duplicated parametrize lists across 4 tests.

The same 11-entry model/provider list is copy-pasted in all four @pytest.mark.parametrize decorators. Extract it to a module-level constant to reduce maintenance burden and risk of drift.

Suggested approach
+STREAMING_TEST_MODELS = [
+    ("claude_sonnet_4_5_thinking", ModelProviderName.openrouter),
+    ("claude_sonnet_4_5_thinking", ModelProviderName.anthropic),
+    # ... all entries ...
+    ("minimax_m2_5", ModelProviderName.openrouter),
+]
+
 @pytest.mark.paid
-@pytest.mark.parametrize(
-    "model_id,provider_name",
-    [
-        ("claude_sonnet_4_5_thinking", ModelProviderName.openrouter),
-        ...
-    ],
-)
+@pytest.mark.parametrize("model_id,provider_name", STREAMING_TEST_MODELS)
 async def test_acompletion_streaming_response(...):

Also applies to: 271-287, 360-376, 389-405

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py`
around lines 156 - 172, Extract the repeated 11-entry list used in the
pytest.mark.parametrize(...) decorators into a single module-level constant
(e.g., SUPPORTED_MODEL_PROVIDER_PAIRS) and replace each duplicated inline list
with that constant in the four tests that parametrize "model_id,provider_name";
keep the parameter names ("model_id", "provider_name") and the
pytest.mark.parametrize call but pass SUPPORTED_MODEL_PROVIDER_PAIRS as the
second argument to avoid copy-paste drift.
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_tools.py (1)

494-511: Dead litellm.acompletion patches remain in _run_model_turn tests.

In test_run_model_turn_parallel_tools, test_run_model_turn_sequential_tools, and test_run_model_turn_max_tool_calls_exceeded, the patch("litellm.acompletion", ...) is now unreachable because acompletion_checking_response (which is also patched) is what would call it. These patches are harmless but misleading dead code.

Example cleanup for test_run_model_turn_parallel_tools
     with patch.object(
         litellm_adapter, "cached_available_tools", return_value=[multiply_spy, add_spy]
     ):
-        with patch(
-            "litellm.acompletion",
-            side_effect=[mock_response, final_response],
-        ):
-            with patch.object(
-                litellm_adapter, "build_completion_kwargs", return_value={}
-            ):
-                with patch.object(
-                    litellm_adapter,
-                    "acompletion_checking_response",
-                    side_effect=[
-                        (mock_response, mock_response.choices[0]),
-                        (final_response, final_response.choices[0]),
-                    ],
-                ):
-                    result = await litellm_adapter._run_model_turn(
-                        provider, prior_messages, None, False
-                    )
+        with patch.object(
+            litellm_adapter, "build_completion_kwargs", return_value={}
+        ):
+            with patch.object(
+                litellm_adapter,
+                "acompletion_checking_response",
+                side_effect=[
+                    (mock_response, mock_response.choices[0]),
+                    (final_response, final_response.choices[0]),
+                ],
+            ):
+                result = await litellm_adapter._run_model_turn(
+                    provider, prior_messages, None, False
+                )

Also applies to: 615-633, 703-717

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_tools.py`
around lines 494 - 511, Remove the dead patch of "litellm.acompletion" from the
three tests where it is never reached because acompletion_checking_response is
patched to provide responses; specifically, edit
test_run_model_turn_parallel_tools, test_run_model_turn_sequential_tools, and
test_run_model_turn_max_tool_calls_exceeded to delete the with
patch("litellm.acompletion", ...) context managers (the
side_effect=[mock_response, final_response] blocks) and keep the existing
patches for litellm_adapter.acompletion_checking_response,
litellm_adapter.cached_available_tools, and
litellm_adapter.build_completion_kwargs so the tests remain functionally
identical but without the misleading, unreachable patch.
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (2)

96-103: on_chunk is forwarded to every tool-call iteration turn.

In _run_model_turn, the on_chunk callback fires for every inner loop iteration (tool-call turns), not just the final content turn. Consumers will receive interleaved chunks from reasoning, tool-call deltas, and final content across multiple model calls. This is probably fine for streaming UIs but worth documenting so callers know chunks aren't scoped to a single logical response.

Also applies to: 125-127, 189-191, 224-231

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py` around lines 96
- 103, The on_chunk callback passed into _run_model_turn is currently forwarded
to every inner tool-call iteration, causing streaming chunks from reasoning,
tool deltas, and final content to be interleaved; either explicitly document
this behavior in the _run_model_turn docstring (and the same places noted for
the other iterations) or change the forwarding so on_chunk is only invoked for
the final content turn (e.g., pass None or a no-op for tool-call iterations and
only pass the real on_chunk when emitting the final response); update the
comments/docstrings for the related inner-loop call sites so callers know chunks
are not scoped to a single logical response.
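
If scoping ever becomes desirable, the gating can be sketched in miniature. Everything here is illustrative (make_turn_runner, run_turn, and is_final_turn are invented names, not the adapter's real API); it only shows the shape of suppressing the callback on tool-call turns:

```python
from typing import Callable, Optional

def make_turn_runner(on_chunk: Optional[Callable[[str], None]]):
    """Return a turn runner that forwards chunks only on the final turn."""
    def run_turn(chunks: list[str], is_final_turn: bool) -> str:
        # Suppress the callback for tool-call turns so consumers never see
        # interleaved tool deltas; only the final content turn streams out.
        callback = on_chunk if is_final_turn else None
        for chunk in chunks:
            if callback is not None:
                callback(chunk)
        return "".join(chunks)
    return run_turn

received: list[str] = []
run_turn = make_turn_runner(received.append)
run_turn(["{tool", " delta}"], is_final_turn=False)  # not forwarded
run_turn(["final ", "answer"], is_final_turn=True)   # forwarded
# received now holds only the final-turn chunks
```

Whether to gate or just document the interleaving is a product decision; the sketch above merely shows that the change is mechanical either way.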

298-316: Always-streaming design is tested and working, but consider a non-streaming fallback and verify logprobs coverage.

The concern about stream_chunk_builder preserving response fields is partially validated by existing tests. test_litellm_adapter_streaming.py extensively verifies that reasoning_content and tool_calls are correctly reassembled during streaming. However, two gaps remain:

  1. Logprobs are untested in streaming: The codebase tracks supports_logprobs as a model capability, but there are no tests verifying logprobs are preserved through stream_chunk_builder reassembly. If logprobs are needed downstream during streaming, add coverage.

  2. Known upstream limitation: Some LiteLLM providers emit reasoning in streamed delta.content but do not provide structured reasoning_content during streaming—only in non-streaming responses. This is a provider-specific issue that may silently affect certain model/provider combinations.

Optional: Consider keeping a non-streaming path when on_chunk is None to avoid the streaming overhead (stream_chunk_builder reassembly) for callers that don't need live callbacks. This maintains backward compatibility for non-streaming callers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py` around lines
298 - 316, The acompletion_checking_response method currently always builds a
StreamingCompletion and reassembles chunks even when no on_chunk callback is
provided; change it to use a non-streaming path when on_chunk is None by calling
the synchronous/non-streaming completion API (or awaiting a single-shot
completion) instead of instantiating StreamingCompletion, and ensure the
returned ModelResponse/Choices still include logprobs by preserving whatever
field/flag the adapter uses (see supports_logprobs and stream_chunk_builder
reassembly logic) so streamed and non-streamed responses have equivalent
logprobs and reasoning_content/tool_calls; add tests exercising logprobs through
streaming reassembly (test_litellm_adapter_streaming.py) and a new non-streaming
case to verify behavior when on_chunk is None.
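
The optional fast path is small. The names below (fetch_completion, call_once, call_streaming) are assumptions for illustration, not part of the adapter or litellm; the sketch only shows dispatching on `on_chunk is None` so non-streaming callers skip chunk reassembly entirely:

```python
import asyncio
from typing import Awaitable, Callable, Optional

async def fetch_completion(
    call_streaming: Callable[[], Awaitable[list[str]]],
    call_once: Callable[[], Awaitable[str]],
    on_chunk: Optional[Callable[[str], None]] = None,
) -> str:
    if on_chunk is None:
        # No live consumer: one non-streaming request, no reassembly overhead.
        return await call_once()
    chunks = await call_streaming()
    for chunk in chunks:
        on_chunk(chunk)  # deliver chunks to the caller as they arrive
    return "".join(chunks)

async def demo() -> tuple[str, str, list[str]]:
    async def call_once() -> str:
        return "full response"

    async def call_streaming() -> list[str]:
        return ["full ", "response"]

    seen: list[str] = []
    plain = await fetch_completion(call_streaming, call_once)
    streamed = await fetch_completion(call_streaming, call_once, seen.append)
    return plain, streamed, seen

plain, streamed, seen = asyncio.run(demo())
```

Both paths return the same assembled text; only the streaming path invokes the callback, which is the equivalence the suggested tests would pin down.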
libs/core/kiln_ai/adapters/model_adapters/mcp_adapter.py (1)

49-78: on_chunk is accepted but never forwarded to the tool execution.

The on_chunk parameter is threaded through the call chain but silently ignored in _run (Line 77 – tool.run(...) receives no chunk callback). This is fine for interface conformance with BaseAdapter._run, but worth a brief inline comment so future readers know streaming isn't supported for MCP tools yet.

Suggested comment
+        # Note: on_chunk is accepted for interface conformance but MCP tools
+        # do not support streaming, so it is intentionally unused here.
         result = await tool.run(context=None, **tool_kwargs)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/model_adapters/mcp_adapter.py` around lines 49 -
78, The _run method currently accepts an on_chunk callback but never forwards or
documents it; update McpAdapter._run to either forward on_chunk into the tool
execution if the tool supports streaming or, if streaming isn't supported for
MCP tools yet, add a brief inline comment just above the tool.run(...) call
explaining that on_chunk is intentionally ignored (mentioning the parameters
_run and on_chunk and the call site tool.run) so future readers understand this
limitation and don't assume a bug.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@libs/core/kiln_ai/adapters/litellm_utils/litellm_streaming.py`:
- Around line 48-60: The async iterator __aiter__ in litellm_streaming.py can be
aborted leaving _iterated False and _response unset and never closing the
underlying litellm stream; wrap the streaming logic in a try/finally: create the
stream as before, iterate and yield chunks inside try, and in finally always
call stream.aclose() if stream exists, set self._response =
litellm.stream_chunk_builder(chunks) (even if empty) and self._iterated = True
so stream.response works after interruption; ensure any exceptions are re-raised
after finalization so behavior is preserved.

---

Duplicate comments:
In `@libs/core/kiln_ai/adapters/ml_model_list.py`:
- Around line 1304-1326: The Claude “Thinking” model providers set
thinking_level="medium" but lack explicit reasoning flags; search the model
adapter logic that consumes
thinking_level/reasoning_capable/anthropic_extended_thinking (look for usages of
thinking_level, reasoning_capable, anthropic_extended_thinking in the
model_adapters) and then update the KilnModel provider entries (the KilnModel
instance for ModelName.claude_4_5_haiku_thinking and the other listed Claude
"Thinking" KilnModel blocks) to include reasoning_capable=True on providers that
support single-call reasoning and anthropic_extended_thinking=True only on the
ModelProviderName.anthropic provider entries; apply the same flag changes to the
other referenced blocks (around lines 1394-1439, 1459-1479, 1606-1651,
1691-1731) so the adapter will choose the single-call reasoning path where
appropriate.

---

Nitpick comments:
In `@libs/core/kiln_ai/adapters/litellm_utils/litellm_streaming.py`:
- Around line 30-36: The wrapper silently allows re-iteration causing duplicate
litellm.acompletion calls; update the StreamingCompletion iterator logic (the
__aiter__ method) to enforce single-use by checking self._iterated and raising a
RuntimeError("StreamingCompletion has already been iterated; create a new
instance.") if True, and only call litellm.acompletion and set
self._response/_iterated on the first iteration; alternatively add this
single-use behavior note to the class docstring for clarity.

In `@libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py`:
- Around line 96-103: The on_chunk callback passed into _run_model_turn is
currently forwarded to every inner tool-call iteration, causing streaming chunks
from reasoning, tool deltas, and final content to be interleaved; either
explicitly document this behavior in the _run_model_turn docstring (and the same
places noted for the other iterations) or change the forwarding so on_chunk is
only invoked for the final content turn (e.g., pass None or a no-op for
tool-call iterations and only pass the real on_chunk when emitting the final
response); update the comments/docstrings for the related inner-loop call sites
so callers know chunks are not scoped to a single logical response.
- Around line 298-316: The acompletion_checking_response method currently always
builds a StreamingCompletion and reassembles chunks even when no on_chunk
callback is provided; change it to use a non-streaming path when on_chunk is
None by calling the synchronous/non-streaming completion API (or awaiting a
single-shot completion) instead of instantiating StreamingCompletion, and ensure
the returned ModelResponse/Choices still include logprobs by preserving whatever
field/flag the adapter uses (see supports_logprobs and stream_chunk_builder
reassembly logic) so streamed and non-streamed responses have equivalent
logprobs and reasoning_content/tool_calls; add tests exercising logprobs through
streaming reassembly (test_litellm_adapter_streaming.py) and a new non-streaming
case to verify behavior when on_chunk is None.

In `@libs/core/kiln_ai/adapters/model_adapters/mcp_adapter.py`:
- Around line 49-78: The _run method currently accepts an on_chunk callback but
never forwards or documents it; update McpAdapter._run to either forward
on_chunk into the tool execution if the tool supports streaming or, if streaming
isn't supported for MCP tools yet, add a brief inline comment just above the
tool.run(...) call explaining that on_chunk is intentionally ignored (mentioning
the parameters _run and on_chunk and the call site tool.run) so future readers
understand this limitation and don't assume a bug.

In `@libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_streaming.py`:
- Around line 76-91: The render_chunk function assumes chunk.choices[0] exists
and can raise IndexError for empty choices; add a defensive guard at the top of
render_chunk (in the render_chunk method) that checks if not chunk.choices (or
len(chunk.choices) == 0) and handle that case (e.g., call
self.render_unknown(chunk) or return) before referencing chunk.choices[0]; then
proceed with the existing logic for tool_calls, reasoning_content, content, and
finish_reason.
- Around line 127-153: The test fixture uses StructuredOutputMode.unknown which
can later raise ValueError in response_format_options() if the task gains an
output schema; update the adapter_factory fixture to use
StructuredOutputMode.default instead of .unknown when constructing the
LiteLlmAdapter’s LiteLlmConfig/KilnAgentRunConfigProperties so
has_structured_output() won’t short-circuit and tests remain stable (look for
adapter_factory, LiteLlmAdapter, LiteLlmConfig, KilnAgentRunConfigProperties,
response_format_options).
- Around line 97-107: The ChunkRawRenderer class defines an unused field
current_block_type (set in __init__) that is never read or mutated—remove this
dead field to clean up the class; update the __init__ of ChunkRawRenderer to
only initialize self.chunks and leave render_chunk and get_stream_text unchanged
(methods: ChunkRawRenderer.__init__, ChunkRawRenderer.render_chunk,
ChunkRawRenderer.get_stream_text).
- Around line 156-172: Extract the repeated 11-entry list used in the
pytest.mark.parametrize(...) decorators into a single module-level constant
(e.g., SUPPORTED_MODEL_PROVIDER_PAIRS) and replace each duplicated inline list
with that constant in the four tests that parametrize "model_id,provider_name";
keep the parameter names ("model_id", "provider_name") and the
pytest.mark.parametrize call but pass SUPPORTED_MODEL_PROVIDER_PAIRS as the
second argument to avoid copy-paste drift.

In `@libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_tools.py`:
- Around line 494-511: Remove the dead patch of "litellm.acompletion" from the
three tests where it is never reached because acompletion_checking_response is
patched to provide responses; specifically, edit
test_run_model_turn_parallel_tools, test_run_model_turn_sequential_tools, and
test_run_model_turn_max_tool_calls_exceeded to delete the with
patch("litellm.acompletion", ...) context managers (the
side_effect=[mock_response, final_response] blocks) and keep the existing
patches for litellm_adapter.acompletion_checking_response,
litellm_adapter.cached_available_tools, and
litellm_adapter.build_completion_kwargs so the tests remain functionally
identical but without the misleading, unreachable patch.

Comment on lines +48 to +60
async def __aiter__(self) -> AsyncIterator[ModelResponseStream]:
self._response = None
self._iterated = False

chunks: list[ModelResponseStream] = []
stream = await litellm.acompletion(*self._args, stream=True, **self._kwargs)

async for chunk in stream:
chunks.append(chunk)
yield chunk

self._response = litellm.stream_chunk_builder(chunks)
self._iterated = True
⚠️ Potential issue | 🟠 Major

Missing try/finally leaves response inaccessible after any interrupted iteration.

If the consumer's async for body raises (e.g., an on_chunk callback throws) or if the streaming call itself fails, Python sends GeneratorExit into the generator at the yield point. The two lines after the loop never execute, so _iterated stays False and stream.response will always raise RuntimeError — callers cannot distinguish "not yet started" from "stream failed".

Additionally, the litellm stream object (a CustomStreamWrapper) won't have .aclose() called implicitly when the generator is abandoned without exhaustion.

🐛 Proposed fix — `try/finally` for guaranteed state finalization
     async def __aiter__(self) -> AsyncIterator[ModelResponseStream]:
         self._response = None
         self._iterated = False

         chunks: list[ModelResponseStream] = []
-        stream = await litellm.acompletion(*self._args, stream=True, **self._kwargs)
-
-        async for chunk in stream:
-            chunks.append(chunk)
-            yield chunk
-
-        self._response = litellm.stream_chunk_builder(chunks)
-        self._iterated = True
+        try:
+            stream = await litellm.acompletion(*self._args, stream=True, **self._kwargs)
+            async for chunk in stream:
+                chunks.append(chunk)
+                yield chunk
+        finally:
+            self._response = litellm.stream_chunk_builder(chunks) if chunks else None
+            self._iterated = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@libs/core/kiln_ai/adapters/litellm_utils/litellm_streaming.py` around lines
48 - 60, The async iterator __aiter__ in litellm_streaming.py can be aborted
leaving _iterated False and _response unset and never closing the underlying
litellm stream; wrap the streaming logic in a try/finally: create the stream as
before, iterate and yield chunks inside try, and in finally always call
stream.aclose() if stream exists, set self._response =
litellm.stream_chunk_builder(chunks) (even if empty) and self._iterated = True
so stream.response works after interruption; ensure any exceptions are re-raised
after finalization so behavior is preserved.

@leonardmq
Collaborator Author

Replaced with PR that handles multiturn + streaming + protocols: #1107

@leonardmq leonardmq closed this Mar 8, 2026