
Added basic message support for /v1/responses api #82

Merged
madclaws merged 3 commits into main from feat/responses-api
Feb 2, 2026

Conversation

madclaws (Member) commented Feb 2, 2026

No description provided.

coderabbitai bot commented Feb 2, 2026

📝 Walkthrough

This pull request introduces a new Responses API endpoint (/v1/responses) that enables streaming and non-streaming response generation with persistent storage. The implementation adds two new request/response schemas (ResponsesRequest and ResponsesResponse), integrates a responses cache in the backend runtime, and adds helper utilities for prepending previous responses and calculating token usage. The Rust client is updated to call the new endpoint with adjusted response parsing. Dependency versions are also bumped for pathspec, protobuf, and pytokens.

Sequence Diagram(s)

sequenceDiagram
    actor Client
    participant API as /v1/responses<br/>Endpoint
    participant Backend as Backend Runtime<br/>(mlx.py)
    participant Cache as Responses<br/>Cache
    participant Model as Language<br/>Model

    Client->>API: POST ResponsesRequest<br/>(with optional previous_response_id)
    API->>Backend: generate_response_chat[_stream]<br/>(ResponsesRequest)
    
    alt with previous_response_id
        Backend->>Cache: fetch previous response
        Cache-->>Backend: previous response text
        Backend->>Backend: _prepend_previous_response
    end
    
    Backend->>Model: generate with input<br/>(possibly prepended)
    
    alt streaming=true
        loop for each token
            Model-->>Backend: token chunk
            Backend-->>Client: StreamingResponse<br/>(delta chunk)
        end
        Backend->>Backend: _calc_usage<br/>(token count)
        Backend->>Backend: _store_response<br/>(ResponsesResponse)
        Backend->>Cache: persist ResponsesResponse
        Backend-->>Client: final chunk with metrics
    else streaming=false
        Model-->>Backend: complete output
        Backend->>Backend: _calc_usage<br/>(token count)
        Backend->>Backend: _store_response<br/>(ResponsesResponse)
        Backend->>Cache: persist ResponsesResponse
        Backend-->>Client: ResponsesResponse<br/>(with usage & id)
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Added token metrics with the model response #80: Modifies server/backend/mlx.py streaming response paths and server/schemas.py generation metrics, sharing overlapping changes to response handling and metrics collection for streamed responses.
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 inconclusive

❌ Inconclusive (1)
  • Description check ❓ — No description was provided by the author, making it impossible to evaluate relevance to the changeset. Resolution: add a pull request description explaining the purpose and scope of the /v1/responses API implementation and its integration with the existing system.

✅ Passed (2)
  • Title check — The title accurately describes the main change: adding support for a new /v1/responses API endpoint with message handling across multiple files.
  • Docstring coverage — 82.35%, which meets the required threshold of 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tiles/src/runtime/mlx.rs (1)

537-550: ⚠️ Potential issue | 🟡 Minor

Remove the unused messages field from the request body.

The ResponsesRequest schema does not include a messages field. The /v1/responses endpoint handler only uses request.input to process user input. The messages field will be ignored by the server or cause validation errors and should be removed.

server/api.py (1)

49-58: ⚠️ Potential issue | 🟠 Major

Global mutable state in /start will leak conversations across concurrent clients.

_messages and _memory_path are module-level globals overwritten on each call to /start. Any concurrent or sequential clients will interfere—if Client A calls /start, then Client B calls /start, then Client A calls /chat, Client A's request will use Client B's system prompt and memory path. The StartRequest schema has no session or conversation ID, and there is no session middleware or context management to isolate per-client state. Pass state as part of request/response (e.g., include conversation_id in StartRequest and carry it through subsequent endpoints), or use context variables to bind globals to async request context.
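A minimal sketch of the context-variable approach suggested above, which binds state to the current async request context instead of a module-level global (all names here are illustrative, not taken from server/api.py):

```python
import contextvars

# Per-context replacement for the module-level _messages global.
_messages_var: contextvars.ContextVar = contextvars.ContextVar("messages")

def start_session(system_prompt: str) -> None:
    """Bind a fresh message list to the current (async) context only.

    Concurrent clients each get their own binding, so one client's /start
    no longer overwrites another's conversation state.
    """
    _messages_var.set([{"role": "system", "content": system_prompt}])

def current_messages() -> list:
    # Each request context sees only the list bound by its own /start call;
    # an empty list is returned if /start was never called in this context.
    return _messages_var.get([])
```

In an ASGI server each incoming request runs in its own context copy, so the binding made in one request does not leak into another.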

🤖 Fix all issues with AI agents
In `@server/api.py`:
- Around line 89-105: The create_chat_response endpoint lacks error handling for
both streaming and non-streaming paths: wrap the call to
runtime.backend.generate_response_chat_stream (stream branch) in a try/except
that catches exceptions, logs the error, and returns a StreamingResponse that
yields a structured error payload and appropriate status/headers (matching the
behavior used in /v1/chat/completions); similarly wrap the await
runtime.backend.generate_response_chat (non-stream branch) in try/except and
return a proper JSON error response on failure. Ensure you reference
create_chat_response, runtime.backend.generate_response_chat_stream, and
runtime.backend.generate_response_chat when applying the fixes.

In `@server/backend/mlx.py`:
- Around line 255-264: The code currently swallows all exceptions when attaching
metrics to resp and when storing resp into the _responses dict; update both
try/except blocks to catch Exception as e and log the error with context
(include response_id and metrics and a short message) rather than silently
passing, and consider failing fast (re-raise) or returning an error response if
storing into _responses fails so callers relying on previous_response_id can
detect the failure; locate the blocks handling resp.metadata["metrics"] and
_responses[response_id] in mlx.py and replace bare excepts with logging of e and
contextual identifiers.
- Around line 27-28: The module-level _responses: Dict[str, ResponsesResponse]
is an unbounded in-memory cache and needs a size/eviction policy; replace it
with a bounded cache (e.g., an LRU or TTL-backed structure) so entries are
evicted when capacity is exceeded or expired. Locate the _responses declaration
in mlx.py and swap the plain dict for a bounded cache implementation (for
example use collections.OrderedDict with manual LRU eviction, or
cachetools.LRUCache/TTLCache) and update any code that reads/writes _responses
to use the chosen cache API while preserving keys of type str and values of
ResponsesResponse.
🧹 Nitpick comments (7)
tiles/src/runtime/mlx.rs (2)

576-577: Remove commented-out debug code.

The commented println! statement should be removed before merging.

🧹 Remove debug artifact
             // Parse JSON
             let v: Value = serde_json::from_str(data).unwrap();
-            // println!("{:?}", v);
             // Check for metrics in the response

582-594: Consider adding error handling for unexpected response structure.

The deep JSON path v["output"][0]["content"][0]["text"] silently fails if the structure doesn't match. While safe, this makes debugging harder when the server response format changes.

💡 Suggested improvement for debuggability
-            if let Some(delta) = v["output"][0]["content"][0]["text"].as_str() {
+            let delta = v["output"]
+                .get(0)
+                .and_then(|o| o.get("content"))
+                .and_then(|c| c.get(0))
+                .and_then(|t| t.get("text"))
+                .and_then(|t| t.as_str());
+            
+            if let Some(delta) = delta {
                 accumulated.push_str(delta);
server/runtime.py (1)

1-3: Consider using a more specific type or Protocol.

Using Any provides minimal type safety. If the backend has a known interface, consider defining a Protocol or abstract base class.

from typing import Protocol, Optional

class Backend(Protocol):
    # Define expected methods here
    pass

backend: Optional[Backend] = None
server/backend/mlx.py (4)

222-229: Consider logging tokenization failures.

The fallback to zero tokens is reasonable, but logging the exception would help diagnose tokenizer issues.

📝 Add debug logging
 def _calc_usage(runner: MLXRunner, input_text: str, generated_text: str) -> Dict[str, int]:
     """Calculate token usage using the runner tokenizer; fall back to zeros on error."""
     try:
         input_tokens = len(runner.tokenizer.encode(input_text))
         output_tokens = len(runner.tokenizer.encode(generated_text))
         return {"input_tokens": input_tokens, "output_tokens": output_tokens}
-    except Exception:
+    except Exception as e:
+        logger.debug(f"Token counting failed, using fallback: {e}")
         return {"input_tokens": 0, "output_tokens": 0}

326-327: Token counting assumes one token per yield.

The output_tokens += 1 assumes each yielded string is exactly one token. If generate_streaming ever batches tokens or yields partial content, this count will be inaccurate. Consider using the tokenizer for accurate counts in the final chunk.
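A sketch of that suggestion: tokenize the accumulated text once at the end of the stream rather than incrementing per yielded chunk. The assumption that tokenizer.encode returns a list of token ids matches the _calc_usage helper above; the function name itself is hypothetical:

```python
def count_output_tokens(tokenizer, chunks: list) -> int:
    """Return a token count that stays accurate when chunks hold multiple tokens."""
    text = "".join(chunks)
    try:
        # One encode over the full text, instead of +1 per yield.
        return len(tokenizer.encode(text))
    except Exception:
        # Fall back to the current per-yield approximation.
        return len(chunks)
```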


354-366: Consider logging the exception before yielding error chunk.

While the broad exception catch is acceptable for a streaming endpoint, logging the full traceback would aid debugging.

📝 Add exception logging
     except Exception as e:
+        logger.exception(f"Error during response generation: {e}")
         error_chunk = {
             "id": response_id,

455-460: TTFT metric is approximate for batch generation.

For non-streaming generation, ttft_ms is set to total generation time, which isn't the true "time to first token". This is acceptable as an approximation but could be documented.

📝 Add clarifying comment
         metrics_obj = {
+            # Note: For batch generation, ttft_ms equals total time (no streaming)
             "ttft_ms": generation_time * 1000.0,

Comment thread server/api.py
Comment on lines +89 to +105
@app.post("/v1/responses")
async def create_chat_response(request: ResponsesRequest):
    """
    Create a response with openResponse format
    """

    global _messages

    if request.stream:
        # Streaming response
        return StreamingResponse(
            runtime.backend.generate_response_chat_stream(request),
            media_type="text/plain",
            headers={"Cache-Control": "no-cache"},
        )
    else:
        return await runtime.backend.generate_response_chat(request)

⚠️ Potential issue | 🟡 Minor

Add error handling for /v1/responses to avoid broken streams.
Unlike /v1/chat/completions, exceptions here will bubble up and may abruptly terminate streaming connections without a structured error.

🔧 Suggested fix (parity with chat completions)
 @app.post("/v1/responses")
 async def create_chat_response(request: ResponsesRequest):
     """
     Create a response with openResponse format
     """
 
     global _messages
 
-    if request.stream:
-        # Streaming response
-        return StreamingResponse(
-            runtime.backend.generate_response_chat_stream(request),
-            media_type="text/plain",
-            headers={"Cache-Control": "no-cache"},
-        )
-    else:
-        return await runtime.backend.generate_response_chat(request)
+    try:
+        if request.stream:
+            # Streaming response
+            return StreamingResponse(
+                runtime.backend.generate_response_chat_stream(request),
+                media_type="text/plain",
+                headers={"Cache-Control": "no-cache"},
+            )
+        return await runtime.backend.generate_response_chat(request)
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))

Comment thread server/backend/mlx.py
Comment on lines +27 to +28
# Store generated responses for follow-up support (previous_response_id)
_responses: Dict[str, ResponsesResponse] = {}


⚠️ Potential issue | 🟠 Major

In-memory response cache grows unbounded.

The _responses dictionary has no eviction policy. For long-running servers, this will lead to memory exhaustion as responses accumulate.

💾 Consider adding cache size limit
+from collections import OrderedDict
+
+_MAX_RESPONSES_CACHE = 1000  # Configurable limit
+
 # Store generated responses for follow-up support (previous_response_id)
-_responses: Dict[str, ResponsesResponse] = {}
+_responses: OrderedDict[str, ResponsesResponse] = OrderedDict()
+
+def _cache_response(response_id: str, response: ResponsesResponse) -> None:
+    """Add response to cache with LRU eviction."""
+    _responses[response_id] = response
+    while len(_responses) > _MAX_RESPONSES_CACHE:
+        _responses.popitem(last=False)  # Remove oldest

Comment thread server/backend/mlx.py
Comment on lines +255 to +264
if metrics:
    try:
        resp.metadata["metrics"] = metrics
    except Exception:
        pass
try:
    _responses[response_id] = resp
except Exception:
    pass
return resp


⚠️ Potential issue | 🟡 Minor

Silent exception handling hides potential bugs.

The try-except-pass blocks at lines 256-259 and 260-263 silently swallow all exceptions. If storing fails, subsequent requests using previous_response_id will silently fail to find the response.

🔧 Add logging for debugging
     if metrics:
         try:
             resp.metadata["metrics"] = metrics
-        except Exception:
-            pass
+        except Exception as e:
+            logger.warning(f"Failed to attach metrics to response {response_id}: {e}")
     try:
         _responses[response_id] = resp
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"Failed to store response {response_id}: {e}")
     return resp
🧰 Tools
🪛 Ruff (0.14.14)

[error] 258-259: try-except-pass detected, consider logging the exception

(S110)


[warning] 258-258: Do not catch blind exception: Exception

(BLE001)


[error] 262-263: try-except-pass detected, consider logging the exception

(S110)


[warning] 262-262: Do not catch blind exception: Exception

(BLE001)


madclaws merged commit ef02a60 into main Feb 2, 2026
3 of 4 checks passed
madclaws deleted the feat/responses-api branch February 2, 2026 17:51
codecov bot commented Feb 2, 2026

Codecov Report

❌ Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Files with missing lines     Patch %   Lines
tiles/src/runtime/mlx.rs     0.00%     1 Missing ⚠️

