Refactor LLM chats: separate streaming logic and enforce strict typing #12

Open

s-alexey wants to merge 1 commit into ci from genai

Conversation

@s-alexey (Contributor) commented Jan 7, 2026

Major refactoring of the LLM interaction layer, significantly enhancing the llm.prompt method to establish it as the primary, unified entry point for all model communications. The goal is to abstract away model-specific logic, providing seamless support for structured outputs, automatic tool calling, and vision capabilities across all integrated models.

This simplifies task definitions and enhances the user experience by providing a consistent, high-level API.

  • Enhanced llm.prompt with automatic tool calling:

    • The llm.prompt method has been upgraded to manage the entire conversation turn, now including a built-in, multi-step tool-calling loop.
    • When tools are provided, prompt automatically orchestrates the interaction: it invokes the LLM, executes the requested tools, sends the results back, and repeats this cycle until a final answer is generated (a sketch of this loop follows the list). This eliminates the need for manual tool-handling logic in task definitions.
  • Automatic tool-calling emulation:

    • For models that lack native tool-calling support, a new emulation layer transparently provides this functionality by wrapping requests with structured prompts, making the feature available across all models.
  • Refactored Actor model:

    • The llms.py module has been streamlined. API-specific logic has been moved into dedicated actors/genai.py and actors/openai.py modules.
    • Experimental streaming functionality is now encapsulated in separate classes (e.g., StreamingGoogleGenAI, StreamingOpenAIResponsesAPI) to isolate it from the core API used for scheduled runs.
  • Improved vision and image support:

    • Enhanced support for multimodal inputs, particularly for the Gemini API. The framework now correctly handles image content, including captions and various data formats (URLs and base64).
  • New agentic assertion:

    • Added assert_tool_was_invoked to allow testing and evaluation of agentic behavior by verifying that a specific tool was used during a task.
  • Updated examples and tests:

    • Revised examples to demonstrate the simplified llm.prompt API for tool use.
    • Added a comprehensive suite of API integration tests (test_api_integration.py) that runs against live OpenAI, Google, and Model Proxy endpoints (when API keys are available) to ensure cross-model consistency.
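
For orientation, here is a minimal, self-contained Python sketch of the multi-step tool-calling loop described above. Every name in it (ToolCall, Message, call_model, the tools dict, max_turns) is an illustrative stand-in, not this PR's actual classes or signatures.

# Sketch only: illustrates the orchestration that llm.prompt performs when
# tools are supplied; the real implementation lives in the refactored actors.
import dataclasses
from typing import Any, Callable


@dataclasses.dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]


@dataclasses.dataclass
class Message:
    role: str                                 # "user", "assistant" or "tool"
    content: str
    tool_calls: list[ToolCall] | None = None


def prompt(
    call_model: Callable[[list[Message]], Message],
    tools: dict[str, Callable[..., str]],
    messages: list[Message],
    max_turns: int = 8,
) -> Message:
    """Invoke the model, run any requested tools, feed the results back,
    and repeat until the model returns a plain answer."""
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append(reply)
        if not reply.tool_calls:              # no tool requested -> final answer
            return reply
        for call in reply.tool_calls:         # execute each requested tool
            result = tools[call.name](**call.arguments)
            messages.append(Message(role="tool", content=result))
    raise RuntimeError("tool-calling loop exceeded max_turns")

With a loop of this shape inside the framework, task definitions only pass tools to llm.prompt; the new assert_tool_was_invoked assertion can then verify after the run that a given tool was actually used.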

@s-alexey s-alexey requested a review from dolaameng January 7, 2026 16:04
@s-alexey s-alexey added the wip Work in progress label Jan 7, 2026
@dolaameng (Contributor) left a comment

Thanks for the refactoring, which is really helpful and makes things clearer! Left some questions/comments to understand more.

We can keep iterating on it.

"""
Defines a chat agent that interacts with a Large Language Model (LLM).

Design Note:
Contributor:
nit: shall we update here too?

Contributor Author:
Good idea, I'll get back to that once the internal API is more stable.



@dataclasses.dataclass
class FunctionCall:
Contributor:
Shall we also consider thought signatures for reasoning models? Context here. Or is that already covered by LLMMessage.thinking?

Contributor Author:
I would love to have it as well, perhaps in a separate PR as this one is already overcomplicated

temperature: float | None = 0,
seed: int = 0,
tools: list[Any] | None = None,
) -> LLMMessage[str]:
Contributor:
Being explicit is nice! I wonder if we also want to keep the flexibility for other parameters, e.g., the config.

return result


class ModelProxyOpenAI(OpenAI):
Contributor:
Do we still support streaming output for Model Proxy?

Contributor Author:
We should, but I would rather wait until they fully support the responses API so we don't have to implement it twice.

@s-alexey force-pushed the genai branch 3 times, most recently from 74c099d to 9b62a2d, on January 14, 2026 16:00

answer = response.content
response._meta.update(chat=chat, schema=schema, raw_content=answer, **kwargs)
response._meta.update(
Contributor:
Do we still use _meta?

Contributor Author:
I would keep it for now, just to be safe

@develra left a comment

LGTM - as someone fairly unfamiliar with this code, it's hard for me to tell whether the test coverage is sufficient to be confident that these changes are safe. It would be good to think through what might break as a result of these changes and make sure we have test coverage for it, especially given the somewhat sensitive timing of a new launch.

{
"role": message.sender.role
if message.sender.role != "tool"
else "system", # TODO: Remove this renaming once ModelProxy supports tools

Looking at this TODO - do we know if that is on the roadmap for ModelProxy?

temperature: float | None = 0,
seed: int = 0,
tools: list[Any] | None = None,
) -> LLMMessage[str]:
Contributor:
For invoke, shall we return LLMMessage[T] for image output etc?

Contributor Author:
I don't know yet. Should we have a separate method for image generation instead?

tool_calls: list[FunctionCall] | None = None
usage: Usage | None = None

def add_chunk(self, chunk: str):
Contributor:
Curious: what are the different uses of this new method versus Message.stream?


@dataclasses.dataclass
class LLMMessage(messages.Message[T]):
content: T
Contributor:
Looking at the add_chunk method, do we essentially assume content is always a string? If so, shall we just rename it to text?

We could then add another field, e.g. image, for image output in the future.

content: T
_status: utils.Status = utils.Status.RUNNING
thinking: str | None = None
tool_calls: list[FunctionCall] | None = None
Contributor:
I think these fields would also be useful on Message, don't you think?

Contributor:
Discussed with @s-alexey that it would be great for us to test more of the existing examples to avoid backward incompatibility.

Major refactor of the LLM chat architecture to improve code organization,
maintainability, and type safety.

Key Changes:
- Split `LLMChat` subclasses into distinct Non-Streaming and Streaming
  implementations. Streaming logic (primarily for notebooks) was
  complicating the core classes; this split makes primary actors more
  concise and less error-prone.
- Moved provider-specific implementations into separate files:
  `openai.py` and `genai.py`.
- Replaced the generic `LLMResponse` with a strictly typed version,
  specifically enforcing types for `tool_usage` and `token_usage`.
- Updated `invoke` method to accept explicit arguments.
- Migrated OpenAI integration from the `completion` API to the more
  user-friendly `responses` API.

Testing:
- Added coverage for common use cases using real APIs (tests run
  conditionally if environment keys are present).
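
To make the "strictly typed" change concrete, here is a minimal sketch assuming only the field names visible in the diff above (FunctionCall, Usage, thinking, tool_calls, usage); the field types and defaults are assumptions, and the PR's real classes may differ.

# Illustrative sketch of a strictly typed response message; not the PR's code.
import dataclasses
from typing import Any


@dataclasses.dataclass
class FunctionCall:
    name: str
    arguments: dict[str, Any]       # parsed tool arguments (assumed shape)


@dataclasses.dataclass
class Usage:
    input_tokens: int = 0           # token accounting fields are assumptions
    output_tokens: int = 0


@dataclasses.dataclass
class LLMResponse:
    content: str
    thinking: str | None = None
    tool_calls: list[FunctionCall] | None = None   # typed tool usage
    usage: Usage | None = None                     # typed token usage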
