
Added basic message support for /v1/responses api #82

Merged
madclaws merged 3 commits into main from feat/responses-api
Feb 2, 2026

Conversation

madclaws (Member) commented Feb 2, 2026

No description provided.

coderabbitai bot commented Feb 2, 2026

📝 Walkthrough

This pull request introduces a new Responses API endpoint (/v1/responses) that enables streaming and non-streaming response generation with persistent storage. The implementation adds two new request/response schemas (ResponsesRequest and ResponsesResponse), integrates a responses cache in the backend runtime, and adds helper utilities for prepending previous responses and calculating token usage. The Rust client is updated to call the new endpoint with adjusted response parsing. Dependency versions are also bumped for pathspec, protobuf, and pytokens.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant API as /v1/responses<br/>Endpoint
    participant Backend as Backend Runtime<br/>(mlx.py)
    participant Cache as Responses<br/>Cache
    participant Model as Language<br/>Model

    Client->>API: POST ResponsesRequest<br/>(with optional previous_response_id)
    API->>Backend: generate_response_chat[_stream]<br/>(ResponsesRequest)
    
    alt with previous_response_id
        Backend->>Cache: fetch previous response
        Cache-->>Backend: previous response text
        Backend->>Backend: _prepend_previous_response
    end
    
    Backend->>Model: generate with input<br/>(possibly prepended)
    
    alt streaming=true
        loop for each token
            Model-->>Backend: token chunk
            Backend-->>Client: StreamingResponse<br/>(delta chunk)
        end
        Backend->>Backend: _calc_usage<br/>(token count)
        Backend->>Backend: _store_response<br/>(ResponsesResponse)
        Backend->>Cache: persist ResponsesResponse
        Backend-->>Client: final chunk with metrics
    else streaming=false
        Model-->>Backend: complete output
        Backend->>Backend: _calc_usage<br/>(token count)
        Backend->>Backend: _store_response<br/>(ResponsesResponse)
        Backend->>Cache: persist ResponsesResponse
        Backend-->>Client: ResponsesResponse<br/>(with usage & id)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Added token metrics with the model response #80: Modifies server/backend/mlx.py streaming response paths and server/schemas.py generation metrics, sharing overlapping changes to response handling and metrics collection for streamed responses.
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 inconclusive

❌ Inconclusive (1)
  • Description check ❓ — No description was provided by the author, making it impossible to evaluate relevance to the changeset. Resolution: add a pull request description explaining the purpose and scope of the /v1/responses API implementation and its integration with the existing system.

✅ Passed (2)
  • Title check — The title accurately describes the main change: adding support for a new /v1/responses API endpoint with message handling across multiple files.
  • Docstring coverage — 82.35%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tiles/src/runtime/mlx.rs (1)

537-550: ⚠️ Potential issue | 🟡 Minor

Remove the unused messages field from the request body.

The ResponsesRequest schema does not include a messages field. The /v1/responses endpoint handler only uses request.input to process user input. The messages field will be ignored by the server or cause validation errors and should be removed.

server/api.py (1)

49-58: ⚠️ Potential issue | 🟠 Major

Global mutable state in /start will leak conversations across concurrent clients.

_messages and _memory_path are module-level globals overwritten on each call to /start. Any concurrent or sequential clients will interfere—if Client A calls /start, then Client B calls /start, then Client A calls /chat, Client A's request will use Client B's system prompt and memory path. The StartRequest schema has no session or conversation ID, and there is no session middleware or context management to isolate per-client state. Pass state as part of request/response (e.g., include conversation_id in StartRequest and carry it through subsequent endpoints), or use context variables to bind globals to async request context.
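A minimal sketch of the context-variable approach suggested above, which binds state to the current async request context instead of a module-level global (all names here are illustrative, not taken from server/api.py):

```python
import contextvars

# Per-context replacement for the module-level _messages global.
_messages_var: contextvars.ContextVar = contextvars.ContextVar("messages")

def start_session(system_prompt: str) -> None:
    """Bind a fresh message list to the current (async) context only.

    Concurrent clients each get their own binding, so one client's /start
    no longer overwrites another's conversation state.
    """
    _messages_var.set([{"role": "system", "content": system_prompt}])

def current_messages() -> list:
    # Each request context sees only the list bound by its own /start call;
    # an empty list is returned if /start was never called in this context.
    return _messages_var.get([])
```

In an ASGI server each incoming request runs in its own context copy, so the binding made in one request does not leak into another.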

🤖 Fix all issues with AI agents
In `@server/api.py`:
- Around line 89-105: The create_chat_response endpoint lacks error handling for
both streaming and non-streaming paths: wrap the call to
runtime.backend.generate_response_chat_stream (stream branch) in a try/except
that catches exceptions, logs the error, and returns a StreamingResponse that
yields a structured error payload and appropriate status/headers (matching the
behavior used in /v1/chat/completions); similarly wrap the await
runtime.backend.generate_response_chat (non-stream branch) in try/except and
return a proper JSON error response on failure. Ensure you reference
create_chat_response, runtime.backend.generate_response_chat_stream, and
runtime.backend.generate_response_chat when applying the fixes.

In `@server/backend/mlx.py`:
- Around line 255-264: The code currently swallows all exceptions when attaching
metrics to resp and when storing resp into the _responses dict; update both
try/except blocks to catch Exception as e and log the error with context
(include response_id and metrics and a short message) rather than silently
passing, and consider failing fast (re-raise) or returning an error response if
storing into _responses fails so callers relying on previous_response_id can
detect the failure; locate the blocks handling resp.metadata["metrics"] and
_responses[response_id] in mlx.py and replace bare excepts with logging of e and
contextual identifiers.
- Around line 27-28: The module-level _responses: Dict[str, ResponsesResponse]
is an unbounded in-memory cache and needs a size/eviction policy; replace it
with a bounded cache (e.g., an LRU or TTL-backed structure) so entries are
evicted when capacity is exceeded or expired. Locate the _responses declaration
in mlx.py and swap the plain dict for a bounded cache implementation (for
example use collections.OrderedDict with manual LRU eviction, or
cachetools.LRUCache/TTLCache) and update any code that reads/writes _responses
to use the chosen cache API while preserving keys of type str and values of
ResponsesResponse.
🧹 Nitpick comments (7)
tiles/src/runtime/mlx.rs (2)

576-577: Remove commented-out debug code.

The commented println! statement should be removed before merging.

🧹 Remove debug artifact
             // Parse JSON
             let v: Value = serde_json::from_str(data).unwrap();
-            // println!("{:?}", v);
             // Check for metrics in the response

582-594: Consider adding error handling for unexpected response structure.

The deep JSON path v["output"][0]["content"][0]["text"] silently fails if the structure doesn't match. While safe, this makes debugging harder when the server response format changes.

💡 Suggested improvement for debuggability
-            if let Some(delta) = v["output"][0]["content"][0]["text"].as_str() {
+            let delta = v["output"]
+                .get(0)
+                .and_then(|o| o.get("content"))
+                .and_then(|c| c.get(0))
+                .and_then(|t| t.get("text"))
+                .and_then(|t| t.as_str());
+            
+            if let Some(delta) = delta {
                 accumulated.push_str(delta);
server/runtime.py (1)

1-3: Consider using a more specific type or Protocol.

Using Any provides minimal type safety. If the backend has a known interface, consider defining a Protocol or abstract base class.

from typing import Protocol, Optional

class Backend(Protocol):
    # Define expected methods here
    pass

backend: Optional[Backend] = None
server/backend/mlx.py (4)

222-229: Consider logging tokenization failures.

The fallback to zero tokens is reasonable, but logging the exception would help diagnose tokenizer issues.

📝 Add debug logging
 def _calc_usage(runner: MLXRunner, input_text: str, generated_text: str) -> Dict[str, int]:
     """Calculate token usage using the runner tokenizer; fall back to zeros on error."""
     try:
         input_tokens = len(runner.tokenizer.encode(input_text))
         output_tokens = len(runner.tokenizer.encode(generated_text))
         return {"input_tokens": input_tokens, "output_tokens": output_tokens}
-    except Exception:
+    except Exception as e:
+        logger.debug(f"Token counting failed, using fallback: {e}")
         return {"input_tokens": 0, "output_tokens": 0}

326-327: Token counting assumes one token per yield.

The output_tokens += 1 assumes each yielded string is exactly one token. If generate_streaming ever batches tokens or yields partial content, this count will be inaccurate. Consider using the tokenizer for accurate counts in the final chunk.
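A sketch of that suggestion: tokenize the accumulated text once at the end of the stream rather than incrementing per yielded chunk. The assumption that tokenizer.encode returns a list of token ids matches the _calc_usage helper above; the function name itself is hypothetical:

```python
def count_output_tokens(tokenizer, chunks: list) -> int:
    """Return a token count that stays accurate when chunks hold multiple tokens."""
    text = "".join(chunks)
    try:
        # One encode over the full text, instead of +1 per yield.
        return len(tokenizer.encode(text))
    except Exception:
        # Fall back to the current per-yield approximation.
        return len(chunks)
```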


354-366: Consider logging the exception before yielding error chunk.

While the broad exception catch is acceptable for a streaming endpoint, logging the full traceback would aid debugging.

📝 Add exception logging
     except Exception as e:
+        logger.exception(f"Error during response generation: {e}")
         error_chunk = {
             "id": response_id,

455-460: TTFT metric is approximate for batch generation.

For non-streaming generation, ttft_ms is set to total generation time, which isn't the true "time to first token". This is acceptable as an approximation but could be documented.

📝 Add clarifying comment
         metrics_obj = {
+            # Note: For batch generation, ttft_ms equals total time (no streaming)
             "ttft_ms": generation_time * 1000.0,

Comment thread server/api.py
Comment on lines +89 to +105
@app.post("/v1/responses")
async def create_chat_response(request: ResponsesRequest):
    """
    Create a response with openResponse format
    """

    global _messages

    if request.stream:
        # Streaming response
        return StreamingResponse(
            runtime.backend.generate_response_chat_stream(request),
            media_type="text/plain",
            headers={"Cache-Control": "no-cache"},
        )
    else:
        return await runtime.backend.generate_response_chat(request)

⚠️ Potential issue | 🟡 Minor

Add error handling for /v1/responses to avoid broken streams.
Unlike /v1/chat/completions, exceptions here will bubble up and may abruptly terminate streaming connections without a structured error.

🔧 Suggested fix (parity with chat completions)
 @app.post("/v1/responses")
 async def create_chat_response(request: ResponsesRequest):
     """
     Create a response with openResponse format
     """
 
     global _messages
 
-    if request.stream:
-        # Streaming response
-        return StreamingResponse(
-            runtime.backend.generate_response_chat_stream(request),
-            media_type="text/plain",
-            headers={"Cache-Control": "no-cache"},
-        )
-    else:
-        return await runtime.backend.generate_response_chat(request)
+    try:
+        if request.stream:
+            # Streaming response
+            return StreamingResponse(
+                runtime.backend.generate_response_chat_stream(request),
+                media_type="text/plain",
+                headers={"Cache-Control": "no-cache"},
+            )
+        return await runtime.backend.generate_response_chat(request)
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))

Comment thread server/backend/mlx.py
Comment on lines +27 to +28
# Store generated responses for follow-up support (previous_response_id)
_responses: Dict[str, ResponsesResponse] = {}


⚠️ Potential issue | 🟠 Major

In-memory response cache grows unbounded.

The _responses dictionary has no eviction policy. For long-running servers, this will lead to memory exhaustion as responses accumulate.

💾 Consider adding cache size limit
+from collections import OrderedDict
+
+_MAX_RESPONSES_CACHE = 1000  # Configurable limit
+
 # Store generated responses for follow-up support (previous_response_id)
-_responses: Dict[str, ResponsesResponse] = {}
+_responses: OrderedDict[str, ResponsesResponse] = OrderedDict()
+
+def _cache_response(response_id: str, response: ResponsesResponse) -> None:
+    """Add response to cache with LRU eviction."""
+    _responses[response_id] = response
+    while len(_responses) > _MAX_RESPONSES_CACHE:
+        _responses.popitem(last=False)  # Remove oldest

Comment thread server/backend/mlx.py
Comment on lines +255 to +264
if metrics:
    try:
        resp.metadata["metrics"] = metrics
    except Exception:
        pass
try:
    _responses[response_id] = resp
except Exception:
    pass
return resp


⚠️ Potential issue | 🟡 Minor

Silent exception handling hides potential bugs.

The try-except-pass blocks at lines 256-259 and 260-263 silently swallow all exceptions. If storing fails, subsequent requests using previous_response_id will silently fail to find the response.

🔧 Add logging for debugging
     if metrics:
         try:
             resp.metadata["metrics"] = metrics
-        except Exception:
-            pass
+        except Exception as e:
+            logger.warning(f"Failed to attach metrics to response {response_id}: {e}")
     try:
         _responses[response_id] = resp
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"Failed to store response {response_id}: {e}")
     return resp
🧰 Tools
🪛 Ruff (0.14.14)

[error] 258-259: try-except-pass detected, consider logging the exception

(S110)


[warning] 258-258: Do not catch blind exception: Exception

(BLE001)


[error] 262-263: try-except-pass detected, consider logging the exception

(S110)


[warning] 262-262: Do not catch blind exception: Exception

(BLE001)


madclaws merged commit ef02a60 into main Feb 2, 2026
3 of 4 checks passed
madclaws deleted the feat/responses-api branch February 2, 2026 17:51
codecov bot commented Feb 2, 2026

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Files with missing lines     Patch %   Lines
tiles/src/runtime/mlx.rs     0.00%     1 Missing ⚠️

