Added basic message support for /v1/responses api #82
Conversation
📝 Walkthrough

This pull request introduces a new Responses API endpoint (

Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Client
    participant API as /v1/responses<br/>Endpoint
    participant Backend as Backend Runtime<br/>(mlx.py)
    participant Cache as Responses<br/>Cache
    participant Model as Language<br/>Model
    Client->>API: POST ResponsesRequest<br/>(with optional previous_response_id)
    API->>Backend: generate_response_chat[_stream]<br/>(ResponsesRequest)
    alt with previous_response_id
        Backend->>Cache: fetch previous response
        Cache-->>Backend: previous response text
        Backend->>Backend: _prepend_previous_response
    end
    Backend->>Model: generate with input<br/>(possibly prepended)
    alt streaming=true
        loop for each token
            Model-->>Backend: token chunk
            Backend-->>Client: StreamingResponse<br/>(delta chunk)
        end
        Backend->>Backend: _calc_usage<br/>(token count)
        Backend->>Backend: _store_response<br/>(ResponsesResponse)
        Backend->>Cache: persist ResponsesResponse
        Backend-->>Client: final chunk with metrics
    else streaming=false
        Model-->>Backend: complete output
        Backend->>Backend: _calc_usage<br/>(token count)
        Backend->>Backend: _store_response<br/>(ResponsesResponse)
        Backend->>Cache: persist ResponsesResponse
        Backend-->>Client: ResponsesResponse<br/>(with usage & id)
    end
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tiles/src/runtime/mlx.rs (1)

537-550: ⚠️ Potential issue | 🟡 Minor

Remove the unused `messages` field from the request body.

The `ResponsesRequest` schema does not include a `messages` field. The `/v1/responses` endpoint handler only uses `request.input` to process user input. The `messages` field will be ignored by the server or cause validation errors, and should be removed.

server/api.py (1)
49-58: ⚠️ Potential issue | 🟠 Major

Global mutable state in `/start` will leak conversations across concurrent clients.

`_messages` and `_memory_path` are module-level globals overwritten on each call to `/start`. Any concurrent or sequential clients will interfere: if Client A calls `/start`, then Client B calls `/start`, then Client A calls `/chat`, Client A's request will use Client B's system prompt and memory path. The `StartRequest` schema has no session or conversation ID, and there is no session middleware or context management to isolate per-client state. Pass state as part of the request/response (e.g., include a conversation_id in `StartRequest` and carry it through subsequent endpoints), or use context variables to bind state to the async request context.
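The conversation_id approach could be sketched as follows. All names here (`_sessions`, `start_session`, `get_messages`) are illustrative, not identifiers from the PR:

```python
import uuid
from typing import Dict, List

# Hypothetical session store keyed by conversation_id.
_sessions: Dict[str, List[dict]] = {}

def start_session(system_prompt: str) -> str:
    """Create an isolated conversation and return its id; clients pass the id
    back on subsequent calls instead of relying on module-level globals."""
    conversation_id = uuid.uuid4().hex
    _sessions[conversation_id] = [{"role": "system", "content": system_prompt}]
    return conversation_id

def get_messages(conversation_id: str) -> List[dict]:
    """Look up a conversation's message list, failing loudly on unknown ids."""
    try:
        return _sessions[conversation_id]
    except KeyError:
        raise KeyError(f"unknown conversation_id: {conversation_id}")
```

Two clients each get their own id, so one client's `/start` can no longer clobber another's system prompt.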
🤖 Fix all issues with AI agents
In `@server/api.py`:
- Around line 89-105: The create_chat_response endpoint lacks error handling for
both streaming and non-streaming paths: wrap the call to
runtime.backend.generate_response_chat_stream (stream branch) in a try/except
that catches exceptions, logs the error, and returns a StreamingResponse that
yields a structured error payload and appropriate status/headers (matching the
behavior used in /v1/chat/completions); similarly wrap the await
runtime.backend.generate_response_chat (non-stream branch) in try/except and
return a proper JSON error response on failure. Ensure you reference
create_chat_response, runtime.backend.generate_response_chat_stream, and
runtime.backend.generate_response_chat when applying the fixes.
In `@server/backend/mlx.py`:
- Around line 255-264: The code currently swallows all exceptions when attaching
metrics to resp and when storing resp into the _responses dict; update both
try/except blocks to catch Exception as e and log the error with context
(include response_id and metrics and a short message) rather than silently
passing, and consider failing fast (re-raise) or returning an error response if
storing into _responses fails so callers relying on previous_response_id can
detect the failure; locate the blocks handling resp.metadata["metrics"] and
_responses[response_id] in mlx.py and replace bare excepts with logging of e and
contextual identifiers.
- Around line 27-28: The module-level _responses: Dict[str, ResponsesResponse]
is an unbounded in-memory cache and needs a size/eviction policy; replace it
with a bounded cache (e.g., an LRU or TTL-backed structure) so entries are
evicted when capacity is exceeded or expired. Locate the _responses declaration
in mlx.py and swap the plain dict for a bounded cache implementation (for
example use collections.OrderedDict with manual LRU eviction, or
cachetools.LRUCache/TTLCache) and update any code that reads/writes _responses
to use the chosen cache API while preserving keys of type str and values of
ResponsesResponse.
🧹 Nitpick comments (7)
tiles/src/runtime/mlx.rs (2)
576-577: Remove commented-out debug code.

The commented `println!` statement should be removed before merging.

🧹 Remove debug artifact

```diff
 // Parse JSON
 let v: Value = serde_json::from_str(data).unwrap();
-// println!("{:?}", v);
 // Check for metrics in the response
```
582-594: Consider adding error handling for unexpected response structure.

The deep JSON path `v["output"][0]["content"][0]["text"]` silently fails if the structure doesn't match. While safe, this makes debugging harder when the server response format changes.

💡 Suggested improvement for debuggability

```diff
-if let Some(delta) = v["output"][0]["content"][0]["text"].as_str() {
+let delta = v["output"]
+    .get(0)
+    .and_then(|o| o.get("content"))
+    .and_then(|c| c.get(0))
+    .and_then(|t| t.get("text"))
+    .and_then(|t| t.as_str());
+
+if let Some(delta) = delta {
     accumulated.push_str(delta);
```

server/runtime.py (1)
1-3: Consider using a more specific type or Protocol.

Using `Any` provides minimal type safety. If the backend has a known interface, consider defining a Protocol or abstract base class.

```python
from typing import Protocol, Optional

class Backend(Protocol):
    # Define expected methods here
    pass

backend: Optional[Backend] = None
```

server/backend/mlx.py (4)
222-229: Consider logging tokenization failures.

The fallback to zero tokens is reasonable, but logging the exception would help diagnose tokenizer issues.

📝 Add debug logging

```diff
 def _calc_usage(runner: MLXRunner, input_text: str, generated_text: str) -> Dict[str, int]:
     """Calculate token usage using the runner tokenizer; fall back to zeros on error."""
     try:
         input_tokens = len(runner.tokenizer.encode(input_text))
         output_tokens = len(runner.tokenizer.encode(generated_text))
         return {"input_tokens": input_tokens, "output_tokens": output_tokens}
-    except Exception:
+    except Exception as e:
+        logger.debug(f"Token counting failed, using fallback: {e}")
         return {"input_tokens": 0, "output_tokens": 0}
```
326-327: Token counting assumes one token per yield.

The `output_tokens += 1` assumes each yielded string is exactly one token. If `generate_streaming` ever batches tokens or yields partial content, this count will be inaccurate. Consider using the tokenizer for accurate counts in the final chunk.
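One way to do that is to re-tokenize the accumulated text when emitting the final chunk, keeping the per-yield counter only as a fallback. A sketch under that assumption (function and parameter names here are illustrative, not from the PR):

```python
def final_usage(tokenizer, input_text: str, accumulated: str, streamed_count: int) -> dict:
    """Recompute output tokens from the full accumulated text; fall back to
    the per-yield counter if the tokenizer fails."""
    try:
        output_tokens = len(tokenizer.encode(accumulated))
    except Exception:
        output_tokens = streamed_count  # per-yield estimate as fallback
    try:
        input_tokens = len(tokenizer.encode(input_text))
    except Exception:
        input_tokens = 0
    return {"input_tokens": input_tokens, "output_tokens": output_tokens}
```

This stays correct even if the generator starts yielding multi-token chunks, since the count is derived from the final text rather than the number of yields.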
354-366: Consider logging the exception before yielding the error chunk.

While the broad exception catch is acceptable for a streaming endpoint, logging the full traceback would aid debugging.

📝 Add exception logging

```diff
 except Exception as e:
+    logger.exception(f"Error during response generation: {e}")
     error_chunk = {
         "id": response_id,
```
455-460: TTFT metric is approximate for batch generation.

For non-streaming generation, `ttft_ms` is set to total generation time, which isn't the true "time to first token". This is acceptable as an approximation but could be documented.

📝 Add clarifying comment

```diff
 metrics_obj = {
+    # Note: For batch generation, ttft_ms equals total time (no streaming)
     "ttft_ms": generation_time * 1000.0,
```
```python
@app.post("/v1/responses")
async def create_chat_response(request: ResponsesRequest):
    """
    Create a response with openResponse format
    """
    global _messages

    if request.stream:
        # Streaming response
        return StreamingResponse(
            runtime.backend.generate_response_chat_stream(request),
            media_type="text/plain",
            headers={"Cache-Control": "no-cache"},
        )
    else:
        return await runtime.backend.generate_response_chat(request)
```
Add error handling for /v1/responses to avoid broken streams.
Unlike /v1/chat/completions, exceptions here will bubble up and may abruptly terminate streaming connections without a structured error.
🔧 Suggested fix (parity with chat completions)
```diff
 @app.post("/v1/responses")
 async def create_chat_response(request: ResponsesRequest):
     """
     Create a response with openResponse format
     """
     global _messages
-    if request.stream:
-        # Streaming response
-        return StreamingResponse(
-            runtime.backend.generate_response_chat_stream(request),
-            media_type="text/plain",
-            headers={"Cache-Control": "no-cache"},
-        )
-    else:
-        return await runtime.backend.generate_response_chat(request)
+    try:
+        if request.stream:
+            # Streaming response
+            return StreamingResponse(
+                runtime.backend.generate_response_chat_stream(request),
+                media_type="text/plain",
+                headers={"Cache-Control": "no-cache"},
+            )
+        return await runtime.backend.generate_response_chat(request)
+    except Exception as e:
+        raise HTTPException(status_code=500, detail=str(e))
```

🤖 Prompt for AI Agents
In `@server/api.py` around lines 89 - 105, The create_chat_response endpoint lacks
error handling for both streaming and non-streaming paths: wrap the call to
runtime.backend.generate_response_chat_stream (stream branch) in a try/except
that catches exceptions, logs the error, and returns a StreamingResponse that
yields a structured error payload and appropriate status/headers (matching the
behavior used in /v1/chat/completions); similarly wrap the await
runtime.backend.generate_response_chat (non-stream branch) in try/except and
return a proper JSON error response on failure. Ensure you reference
create_chat_response, runtime.backend.generate_response_chat_stream, and
runtime.backend.generate_response_chat when applying the fixes.
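For the streaming branch specifically, `HTTPException` cannot change the status code once the response body has started; an alternative is to wrap the generator itself so failures surface as a final structured chunk. A minimal sketch of that idea (names are illustrative, not from the PR):

```python
import json
from typing import AsyncIterator

async def safe_stream(source: AsyncIterator[str], response_id: str) -> AsyncIterator[str]:
    """Pass chunks through; on failure, emit a structured error chunk instead
    of abruptly closing the connection."""
    try:
        async for chunk in source:
            yield chunk
    except Exception as e:
        yield json.dumps(
            {"id": response_id, "error": {"message": str(e), "type": "server_error"}}
        )
```

The endpoint would then pass `safe_stream(runtime.backend.generate_response_chat_stream(request), response_id)` to `StreamingResponse`, so clients always receive a parseable final chunk.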
```python
# Store generated responses for follow-up support (previous_response_id)
_responses: Dict[str, ResponsesResponse] = {}
```
In-memory response cache grows unbounded.
The _responses dictionary has no eviction policy. For long-running servers, this will lead to memory exhaustion as responses accumulate.
💾 Consider adding cache size limit

```diff
+from collections import OrderedDict
+
+_MAX_RESPONSES_CACHE = 1000  # Configurable limit
+
 # Store generated responses for follow-up support (previous_response_id)
-_responses: Dict[str, ResponsesResponse] = {}
+_responses: OrderedDict[str, ResponsesResponse] = OrderedDict()
+
+def _cache_response(response_id: str, response: ResponsesResponse) -> None:
+    """Add response to cache with LRU eviction."""
+    _responses[response_id] = response
+    while len(_responses) > _MAX_RESPONSES_CACHE:
+        _responses.popitem(last=False)  # Remove oldest
```

🤖 Prompt for AI Agents
In `@server/backend/mlx.py` around lines 27 - 28, The module-level _responses:
Dict[str, ResponsesResponse] is an unbounded in-memory cache and needs a
size/eviction policy; replace it with a bounded cache (e.g., an LRU or
TTL-backed structure) so entries are evicted when capacity is exceeded or
expired. Locate the _responses declaration in mlx.py and swap the plain dict for
a bounded cache implementation (for example use collections.OrderedDict with
manual LRU eviction, or cachetools.LRUCache/TTLCache) and update any code that
reads/writes _responses to use the chosen cache API while preserving keys of
type str and values of ResponsesResponse.
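If adding a `cachetools` dependency is undesirable, a stdlib-only bounded cache along the lines the prompt describes might look like this (the class name and the maxsize/TTL defaults are illustrative assumptions, not values from the PR):

```python
import time
from collections import OrderedDict
from typing import Any, Optional, Tuple

class BoundedResponseCache:
    """LRU cache with optional TTL expiry for stored responses."""

    def __init__(self, maxsize: int = 1000, ttl: Optional[float] = None):
        self.maxsize = maxsize
        self.ttl = ttl
        self._data: "OrderedDict[str, Tuple[float, Any]]" = OrderedDict()

    def put(self, key: str, value: Any) -> None:
        self._data[key] = (time.monotonic(), value)
        self._data.move_to_end(key)  # most recently used goes last
        while len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

    def get(self, key: str) -> Optional[Any]:
        item = self._data.get(key)
        if item is None:
            return None
        stored_at, value = item
        if self.ttl is not None and time.monotonic() - stored_at > self.ttl:
            del self._data[key]  # expired
            return None
        self._data.move_to_end(key)  # refresh recency on read
        return value
```

Reads and writes of `_responses` would go through `put`/`get`, keeping `str` keys and `ResponsesResponse` values as before.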
```python
if metrics:
    try:
        resp.metadata["metrics"] = metrics
    except Exception:
        pass
try:
    _responses[response_id] = resp
except Exception:
    pass
return resp
```
Silent exception handling hides potential bugs.
The try-except-pass blocks at lines 256-259 and 260-263 silently swallow all exceptions. If storing fails, subsequent requests using previous_response_id will silently fail to find the response.
🔧 Add logging for debugging

```diff
 if metrics:
     try:
         resp.metadata["metrics"] = metrics
-    except Exception:
-        pass
+    except Exception as e:
+        logger.warning(f"Failed to attach metrics to response {response_id}: {e}")
 try:
     _responses[response_id] = resp
-except Exception:
-    pass
+except Exception as e:
+    logger.warning(f"Failed to store response {response_id}: {e}")
 return resp
```

🧰 Tools
🪛 Ruff (0.14.14)
[error] 258-259: try-except-pass detected, consider logging the exception
(S110)
[warning] 258-258: Do not catch blind exception: Exception
(BLE001)
[error] 262-263: try-except-pass detected, consider logging the exception
(S110)
[warning] 262-262: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@server/backend/mlx.py` around lines 255 - 264, The code currently swallows
all exceptions when attaching metrics to resp and when storing resp into the
_responses dict; update both try/except blocks to catch Exception as e and log
the error with context (include response_id and metrics and a short message)
rather than silently passing, and consider failing fast (re-raise) or returning
an error response if storing into _responses fails so callers relying on
previous_response_id can detect the failure; locate the blocks handling
resp.metadata["metrics"] and _responses[response_id] in mlx.py and replace bare
excepts with logging of e and contextual identifiers.
Codecov Report

❌ Patch coverage is

📢 Thoughts on this report? Let us know!
No description provided.