
feat(api): Add Responses API - Non-streaming endpoint with structured output#66

Closed
bxff wants to merge 4 commits into tilesprivacy:main from bxff:feature/responses-api

Conversation

@bxff
Contributor

@bxff bxff commented Jan 14, 2026

Title: feat(api): Add Responses API (Phase 2) - Non-streaming endpoint with structured output

Description:
This PR introduces the non-streaming /v1/responses endpoint, delivering complete, structured responses with full OpenAI API compatibility—including usage statistics and JSON schema enforcement.

Dependencies: This PR is stacked on #64 and leverages its foundational infrastructure:

In short: #64 built the engine; this PR adds a dashboard.

Changes Made:

  • API Schema: Added ResponseRequest and ResponseResponse models supporting messages, temperature, max_tokens, stop, and response_format (with json_schema)
  • Backend Logic: Implemented generate_response() in MLX backend using efficient batch generation, plus proper token usage tracking (prompt, completion, total)
  • New Endpoint: POST /v1/responses with error handling and structured logging
  • Refactor: Improved runtime backend initialization for better testability
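The request surface described above can be sketched roughly as follows. This is an illustrative dataclass, not the actual Pydantic model from the PR; the field names follow the description, but the defaults are assumptions:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional

# Illustrative shape of the new request model. The real code uses Pydantic;
# field names follow the PR description, while the defaults are assumptions.
@dataclass
class ResponseRequest:
    messages: List[Dict[str, str]]
    temperature: float = 0.7          # assumed default
    max_tokens: Optional[int] = None
    stop: Optional[List[str]] = None
    response_format: Optional[Dict[str, Any]] = None  # e.g. {"type": "json_schema", ...}

req = ResponseRequest(messages=[{"role": "user", "content": "Hi"}])
```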

Testing:

Added automated tests in server/tests/test_responses_api.py:

  1. Simple chat completion: Validates response structure and accurate usage stats
  2. Structured JSON output: Confirms json_schema enforcement returns parseable JSON

Both tests pass cleanly using TestClient, replacing manual curl verification with a reproducible test suite.

Example response:

{
  "id": "chatcmpl-590676b9-b112-4252-96b7-cb457fabe529",
  "object": "chat.completion",
  "created": 1768411992,
  "model": "driaforall/mem-agent-mlx-4bit",
  "choices": [...],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 5,
    "total_tokens": 20
  }
}

bxff added 3 commits January 14, 2026 20:27
This cleanup phase focuses on making the new mlx-engine implementation more
resilient and easier to maintain long-term.

Relative Imports and Packaging:
- switched to package-relative imports across the mlx_engine directory
- removed runtime sys.path hacks in the runner
- cleaned up broken vision model entries and pruned unused source files

Better System Integration:
- used psutil for actual RAM detection instead of hardcoded guesses
- added torch and psutil as direct dependencies
- improved memory check logic with proper headroom calculation

Robustness Guards:
- fixed a bug in dill.py that accidentally cleared the original tokenizer's state
- added safety checks for zero-height images and empty config lists
- implemented hasattr guards for model-specific weight patching
- moved vision add-ons to a formal BaseVisionAddOn interface

General Polishes:
- aligned docstring parameters with actual function signatures
- fixed a typo in the prompt processing comments
- preserved EOS token order during deduplication
- added necessary checks to the model path loading in tests
@coderabbitai

coderabbitai Bot commented Jan 14, 2026

📝 Walkthrough

Walkthrough

Adds a non-streaming /v1/responses API endpoint and ResponseRequest/ResponseResponse schemas, and implements a full MLX-based generation engine under server/backend/mlx_engine. The engine provides ModelKit/VisionModelKit loaders, KV-cache management, speculative decoding, multimodal vision add-ons, generation (streaming and batch), token utilities, and many helpers (quantization, image processing, stop-string detection, logging, etc.). MLXRunner and mlx_backend are updated to delegate to the engine; new tests exercise inference, LoRA adapter handling, and the responses API. Dependency pins and utilities to disable HF downloads and adjust tokenizers were added.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant API as FastAPI /v1/responses
    participant Backend as mlx_backend.generate_response
    participant Engine as MLX Engine (create_generator)
    participant Model as ModelKit / VisionModelKit

    Client->>API: POST /v1/responses (ResponseRequest)
    API->>Backend: create_response(request)
    Backend->>Engine: generate_response(request) / create_generator(...)
    Engine->>Model: load_model / process_prompt
    Model-->>Engine: processed prompt tokens (+embeddings)
    Engine->>Model: stream_generate / generate_batch (with json_schema/stop strings)
    loop token generation
        Model->>Engine: sampled tokens / top_logprobs
        Engine->>Engine: apply processors (stop, repetition, speculative)
    end
    Engine-->>Backend: full chat.completion payload
    Backend-->>API: HTTP 200 JSON response
    API-->>Client: response body (chat.completion)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • v0.3.0 + Linux compatibility #30 — touches server API and backend/schema surface similar to this PR (overlapping changes to server/api.py and server/schemas.py).
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 55.90%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check ✅ Passed: The PR title 'feat(api): Add Responses API - Non-streaming endpoint with structured output' is clear, concise, and directly describes the main addition: a new non-streaming API endpoint with structured output support.
  • Description check ✅ Passed: The PR description is comprehensive and directly related to the changeset. It clearly explains the purpose (non-streaming /v1/responses endpoint), dependencies (PR #64), changes made (API schema, backend logic, new endpoint, refactoring), and testing approach with specific examples.



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 9

🤖 Fix all issues with AI agents
In `@server/api.py`:
- Around line 84-92: In create_response, replace the current error handling that
calls logger.error and re-raises HTTPException without chaining: when catching
Exception as e, call logger.exception("Error in generate_response") to log the
traceback, and re-raise the HTTPException using "raise
HTTPException(status_code=500, detail=str(e)) from e" so the exception is
properly chained; locate the try/except around runtime.backend.generate_response
in create_response and make these two changes.
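A minimal sketch of the suggested change, assuming the structure described above (the HTTPException class and backend call below are self-contained stand-ins, not the FastAPI originals):

```python
import logging

logger = logging.getLogger("app")

class HTTPException(Exception):
    # Stand-in for fastapi.HTTPException so the sketch is self-contained.
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def backend_generate_response(request):
    # Placeholder for runtime.backend.generate_response; always fails here
    # so the error path can be exercised.
    raise ValueError("model failure")

def create_response(request):
    try:
        return backend_generate_response(request)
    except Exception as e:
        logger.exception("Error in generate_response")  # logs the full traceback
        # `from e` chains the original exception onto the HTTP error.
        raise HTTPException(status_code=500, detail=str(e)) from e
```

With `from e`, the caught exception is available as `__cause__` on the re-raised error, so tracebacks show both failures instead of hiding the root cause.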

In `@server/backend/mlx_backend.py`:
- Around line 182-183: The async function generate_response currently calls the
blocking runner.generate_batch synchronously and should not block the event
loop; change the blocking call to run in a thread by wrapping it with
asyncio.to_thread (e.g., result = await asyncio.to_thread(runner.generate_batch,
...)), add an import for asyncio if missing, keep the function async and return
the same Dict[str, Any] from the awaited result, or alternatively make
generate_response a regular synchronous function if the API layer already
dispatches to threads—choose one approach and apply it consistently to the
runner.generate_batch invocation and any other blocking calls inside
generate_response.
- Around line 216-223: The current stop-sequence loop using request.stop /
stop_sequences and breaking on the first matched stop can truncate at the wrong
place; instead compute the earliest occurrence among all stop sequences and
slice response_text at that minimal positive index. Iterate over stop_sequences,
for each use a position lookup (e.g., find()) to get its index in response_text
if present, track the smallest non-negative index, and after the loop set
response_text = response_text[:min_index] only if a match was found; keep the
existing request.stop handling and variable names (request.stop, stop_sequences,
response_text).
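The first comment's fix can be sketched like this; `generate_batch` below is a blocking stand-in for the real runner method, not the PR's code:

```python
import asyncio
import time
from typing import Any, Dict

def generate_batch(prompt: str) -> Dict[str, Any]:
    # Blocking stand-in for runner.generate_batch.
    time.sleep(0.01)
    return {"text": prompt.upper()}

async def generate_response(prompt: str) -> Dict[str, Any]:
    # asyncio.to_thread moves the blocking call onto a worker thread,
    # so the event loop keeps serving other requests meanwhile.
    return await asyncio.to_thread(generate_batch, prompt)

result = asyncio.run(generate_response("hello"))
```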
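The stop-sequence fix amounts to taking the minimum match position over all sequences; a small sketch (the function name is illustrative):

```python
from typing import List, Optional

def truncate_at_earliest_stop(response_text: str, stop_sequences: List[str]) -> str:
    # Find the earliest occurrence among ALL stop sequences, rather than
    # cutting at whichever sequence happens to match first in iteration order.
    min_index: Optional[int] = None
    for seq in stop_sequences:
        idx = response_text.find(seq)
        if idx >= 0 and (min_index is None or idx < min_index):
            min_index = idx
    return response_text[:min_index] if min_index is not None else response_text
```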

In
`@server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py`:
- Around line 50-55: The code may raise AttributeError when accessing
config.vision_config.image_token_id if vision_config is missing or None; update
the logic around image_token_index assignment (the block that references config,
image_token_index, vision_config, and image_token_id) to first verify that
config has a non-None vision_config (e.g., check hasattr(config,
"vision_config") and config.vision_config is not None or use getattr with
default) before reading image_token_id, and fall back to None if neither
image_token_index nor a valid vision_config.image_token_id exists.
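The defensive lookup could look like this sketch; the config objects here are stand-ins for the real model configs:

```python
from types import SimpleNamespace

def resolve_image_token_id(config, image_token_index=None):
    # Prefer an explicit image_token_index; otherwise fall back to
    # vision_config.image_token_id only when vision_config actually exists.
    if image_token_index is not None:
        return image_token_index
    vision_config = getattr(config, "vision_config", None)
    if vision_config is None:
        return None
    return getattr(vision_config, "image_token_id", None)

cfg_with = SimpleNamespace(vision_config=SimpleNamespace(image_token_id=32000))
cfg_without = SimpleNamespace(vision_config=None)
```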

In `@server/backend/mlx_engine/stop_string_processor.py`:
- Around line 154-163: The method _check_partial_text_match currently constructs
and returns a StopStringProcessorResult which is inconsistent with
_check_incomplete_utf8 and _check_full_text_match; change its return type
annotation to Optional[_StoppingCriteriaResult] and return an instance of
_StoppingCriteriaResult (with status="partial_match" and stop_string=None)
instead of StopStringProcessorResult so the function signature and returned
object type match the other check methods.

In `@server/backend/mlx_engine/utils/image_utils.py`:
- Around line 66-78: The resize logic in the block that checks max_size may
produce a dimension that still exceeds max_size because it only constrains the
larger side; update the calculation to compute a single scale = min(max_size[0]
/ original_width, max_size[1] / original_height, 1.0) and then set new_width =
int(original_width * scale) and new_height = int(original_height * scale) before
calling img.resize(..., PIL.Image.LANCZOS) and logging via logger.info; use the
existing variables img, original_width, original_height, max_size, aspect_ratio
(if needed) and keep the logging message but report the final new dimensions.
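The single-scale computation the comment describes, as a standalone sketch (PIL is omitted; only the dimension math is shown, and the function name is illustrative):

```python
from typing import Tuple

def fit_within(original_width: int, original_height: int,
               max_size: Tuple[int, int]) -> Tuple[int, int]:
    # One scale factor constrains BOTH sides at once; the 1.0 term
    # prevents upscaling images that already fit.
    scale = min(max_size[0] / original_width,
                max_size[1] / original_height,
                1.0)
    return int(original_width * scale), int(original_height * scale)
```

For example, a 2000x1000 image against a 1024x1024 limit scales by 0.512 to 1024x512, so neither side exceeds the cap.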

In `@server/backend/mlx_engine/utils/logger.py`:
- Around line 14-15: The module currently sets the process-wide flag
logging.raiseExceptions = False which suppresses logging errors for all loggers;
remove that global assignment and instead handle logger-specific failures
locally—either wrap mlx_engine log emission points in try/except or attach a
custom logging.Handler (override handleError) to the mlx_engine logger to
swallow or log handler-specific exceptions; update logger.py to stop mutating
logging.raiseExceptions and implement one of these local error-handling
strategies for the mlx_engine logger.
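The handler-local approach might look like this sketch (the handler class name is illustrative):

```python
import logging

class QuietHandler(logging.StreamHandler):
    # Overriding handleError keeps emit() failures local to this handler,
    # instead of mutating the process-wide logging.raiseExceptions flag.
    def handleError(self, record: logging.LogRecord) -> None:
        pass  # swallow; optionally count or debug-log the failure here

engine_logger = logging.getLogger("mlx_engine")
engine_logger.addHandler(QuietHandler())
```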

In `@server/pyproject.toml`:
- Around line 10-17: The dependency list mixes exact pins and unpinned packages
and uses an outdated outlines pin; update outlines==1.1.1 to outlines==1.2.9,
choose and apply a consistent pinning strategy for all top-level packages (e.g.,
pin torch, psutil, sentencepiece, pillow, transformers, mlx-vlm, mlx-lm to
specific versions or move them to a separate constraints file), and audit
whether torch is required given mlx-lm/mlx-vlm (if not required, remove the
torch entry); ensure any changes reference the package names shown (outlines,
outlines-core, torch, psutil, sentencepiece, pillow, transformers, mlx-vlm,
mlx-lm) so the pyproject.toml dependency section is consistent and reproducible.

In `@server/tests/test_responses_api.py`:
- Around line 21-26: get_model_path can return None for the path, so calling
model_path.exists() may raise AttributeError; update the test to check for a
falsy model_path before calling .exists() (e.g. replace the current if-block
with a guard like "if not model_path or not model_path.exists(): print(...);
return") so the test skips when model_path is None or missing; reference:
get_model_path and the model_path variable in the test.
🧹 Nitpick comments (47)
server/backend/mlx_engine/utils/fix_mistral_pre_tokenizer.py (2)

41-51: Fragile reliance on private attributes.

Accessing tokenizer._tokenizer and tokenizer._tokenizer._tokenizer.pre_tokenizer couples this code tightly to the internal implementation of TokenizerWrapper and LlamaTokenizer. These private attributes may change without notice in future library versions.

Consider adding a comment documenting the library versions this was tested against, or wrapping the attribute access with hasattr checks for graceful degradation.

🛠️ Suggested defensive check
+    # Access internal tokenizer structure - tested with mlx-lm X.X and transformers X.X
     if not isinstance(tokenizer._tokenizer, LlamaTokenizer):
         logger.info(f"Tokenizer is of type {type(tokenizer._tokenizer)}. Skipping fix.")
         return
+    if not hasattr(tokenizer._tokenizer, '_tokenizer') or not hasattr(tokenizer._tokenizer._tokenizer, 'pre_tokenizer'):
+        logger.warning("Tokenizer internal structure unexpected. Skipping fix.")
+        return

52-54: Broad exception catch is acceptable here.

The static analysis flags Exception as too broad. In this context, it's reasonable since multiple exception types are possible (file I/O errors, JSON parsing errors, attribute errors from internal tokenizer changes, etc.), and the function is designed as a best-effort fix with non-fatal failure mode.

For improved debuggability, consider logging at DEBUG level for expected exceptions and WARNING for unexpected ones, or catching specific exceptions separately.

server/tests/test_lora.py (3)

7-8: Consider using pytest with proper package discovery instead of sys.path manipulation.

The sys.path.append hack is fragile. Consider configuring pytest with proper package discovery via pyproject.toml or a conftest.py that handles imports cleanly.


25-25: Remove print statements from tests.

Print statements in tests add noise. Test frameworks (unittest, pytest) already report pass/fail status. Remove these debug prints.

Proposed fix
         mock_load.assert_called_once_with(
             model_path, adapter_path=adapter_path
         )
-        print("LoRA path propagation verification (MLXRunner -> engine): PASSED")

And similarly for line 43:

                 self.assertEqual(kwargs.get("adapter_path"), Path("/tmp/fake_adapter"))
-                print("ModelKit LoRA path propagation (ModelKit -> mlx_lm): PASSED")

36-43: Consider using a single mock for config loading.

The nested patch context managers for Path.read_text and json.loads could be simplified. Also, the model_kit variable appears unused but its construction is the test's purpose (triggers the mock call). Consider adding _ = model_kit or a comment to clarify intent.

Clarify intent
-                model_kit = ModelKit(Path("/tmp/fake_model"), adapter_path="/tmp/fake_adapter")
+                # Construction triggers mlx_lm.load call being tested
+                _ = ModelKit(Path("/tmp/fake_model"), adapter_path="/tmp/fake_adapter")
server/backend/mlx_engine/vision_model_kit/_transformers_compatibility.py (3)

13-36: File modifications could corrupt config on partial write failure.

Writing to the same file that was just read risks data loss if the write fails mid-operation. Consider writing to a temporary file first, then atomically renaming, or at minimum preserving the original JSON formatting with indent.

Safer write with formatting preserved
     preprocessor_config["image_processor_type"] = "Qwen2VLImageProcessor"
-    with open(model_path / "preprocessor_config.json", "w") as f:
-        json.dump(preprocessor_config, f)
+    with open(model_path / "preprocessor_config.json", "w") as f:
+        json.dump(preprocessor_config, f, indent=2)

15-18: Redundant file read—preprocessor_config.json is read twice.

The file is read at line 15-16 to check image_processor_type, then read again at lines 32-33 to modify. Consider reading once and reusing the data.

Consolidate file reads
 def fix_qwen2_5_vl_image_processor(model_path: Path):
     try:
         with open(model_path / "preprocessor_config.json", "r") as f:
-            image_processor_type = json.load(f)["image_processor_type"]
+            preprocessor_config = json.load(f)
+        image_processor_type = preprocessor_config.get("image_processor_type")
         with open(model_path / "config.json", "r") as f:
             model_type = json.load(f)["model_type"]
-    except:  # noqa: E722
+    except (FileNotFoundError, KeyError, json.JSONDecodeError):
         return

     if not (
         image_processor_type == "Qwen2_5_VLImageProcessor"
         and model_type == "qwen2_5_vl"
     ):
         return

     # Replace image_processor_type with Qwen2VLImageProcessor
     logger.warning(
         "Replacing `image_processor_type` with Qwen2VLImageProcessor in preprocessor_config.json"
     )
-    with open(model_path / "preprocessor_config.json", "r") as f:
-        preprocessor_config = json.load(f)
     preprocessor_config["image_processor_type"] = "Qwen2VLImageProcessor"
     with open(model_path / "preprocessor_config.json", "w") as f:
-        json.dump(preprocessor_config, f)
+        json.dump(preprocessor_config, f, indent=2)

50-51: Accessing dict key solely to check existence—use in operator instead.

Line 51 reads the JSON and accesses ["size"] only to trigger a KeyError if missing. This is indirect; use explicit key checking for clarity.

Explicit key check
         with open(model_path / "preprocessor_config.json", "r") as f:
-            json.load(f)["size"]
+            preprocessor_config = json.load(f)
+        if "size" not in preprocessor_config:
+            return
     except:  # noqa: E722
         return
+    # ... then use preprocessor_config directly below instead of re-reading
server/backend/mlx_engine/utils/image_utils.py (1)

91-97: Unused loop variable i.

The static analysis correctly identifies that i is not used in this loop. Since only the image object is needed, use _ to indicate intentionally unused variable.

♻️ Proposed fix
-        for i, img in enumerate(resized_images):
+        for img in resized_images:
             new_img = PIL.Image.new("RGB", (max_width, max_height), (0, 0, 0))
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)

39-43: Avoid using list comprehension for side effects.

Line 41 uses a list comprehension purely for its side effects (add_token), discarding the resulting list. This is non-idiomatic and allocates memory unnecessarily.

♻️ Proposed fix
     detokenizer = processor.detokenizer
     detokenizer.reset()
-    [detokenizer.add_token(token) for token in prompt_tokens]
+    for token in prompt_tokens:
+        detokenizer.add_token(token)
     detokenizer.finalize()
     prompt = detokenizer.text
server/backend/mlx_engine/external/datasets/dill.py (1)

113-116: Remove unused noqa directive.

The `# noqa: E721` comment on line 115 is flagged as unused by Ruff (RUF100). The `type(obj) is not str` comparison is the idiomatic way to check for an exact type match, and modern linters don't flag it.

♻️ Proposed fix
     def memoize(self, obj):
         # Don't memoize strings since two identical strings can have different Python ids
-        if type(obj) is not str:  # noqa: E721
+        if type(obj) is not str:
             dill.Pickler.memoize(self, obj)
server/backend/mlx_engine/stop_string_processor.py (1)

56-59: Consider making stop_strings immutable or document mutation behavior.

The check at line 56 is defensive since __init__ validates non-empty stop_strings. However, since self.stop_strings is a mutable list, external code could modify it post-construction. Consider using tuple instead of list for immutability, or document the expected behavior.

Optional: Use tuple for immutability
-        self.stop_strings = stop_strings
+        self.stop_strings = tuple(stop_strings)
server/backend/mlx_runner.py (2)

160-168: Token existence check may give false positives for multi-token sequences.

The check tokenizer.encode(token, ...) returns non-empty if the token can be encoded, but this doesn't guarantee the token exists as a single vocabulary entry. A token like <|end|> might encode to multiple IDs. Consider checking if the encoded result matches a single known token ID.

This is likely acceptable for stop string matching since the StopStringProcessor handles multi-token sequences, but worth noting.


420-435: Memory calculation assumes unified memory architecture.

The function mixes MLX active memory with system RAM from psutil. This works correctly on Apple Silicon's unified memory, but the function name and docstring don't clarify this assumption. Consider documenting that this is designed for unified memory systems.

 def check_memory_available(required_gb: float) -> bool:
-    """Pre-flight check before model loading using actual system RAM."""
+    """Pre-flight check before model loading using actual system RAM.
+    
+    Note: Assumes unified memory architecture (e.g., Apple Silicon) where
+    GPU memory and system RAM share the same pool.
+    """
server/schemas.py (1)

80-86: Consider making usage required for consistency.

The usage field is Optional here but required in CompletionResponse. Based on the backend implementation in mlx_backend.py, usage is always returned with prompt_tokens, completion_tokens, and total_tokens. Consider aligning with CompletionResponse for API consistency.

Optional: Make usage required
 class ResponseResponse(BaseModel):
     id: str
     object: str = "chat.completion"
     created: int
     model: str
     choices: List[Dict[str, Any]]
-    usage: Optional[Dict[str, int]] = None
+    usage: Dict[str, int]
server/backend/mlx_engine/utils/logger.py (1)

17-18: Minor: Comment refers to "root logger" but this is a named logger.

The comment says "Configure root logger for mlx_engine" but logging.getLogger("mlx_engine") creates a named logger, not the root logger. Consider updating to "Configure logger for mlx_engine" for accuracy.

server/backend/mlx_engine/utils/top_logprobs.py (2)

1-3: Import ordering: local imports should come after third-party.

Per PEP 8 convention, third-party imports (mlx.core) should come before local imports (.token).

Suggested reordering
-from .token import Token
-
 import mlx.core as mx
+
+from .token import Token

23-28: Add strict=True to zip() for defensive coding.

While the three lists are derived from the same source and guaranteed to be equal length, adding strict=True documents this expectation and will catch any future refactoring errors. This also addresses the Ruff B905 hint.

Proposed fix
     return [
         Token(id=int(idx), text=txt, logprob=float(prob))
         for idx, txt, prob in zip(
-            top_indices.tolist(), text_list, top_logprob_values.tolist()
+            top_indices.tolist(), text_list, top_logprob_values.tolist(), strict=True
         )
     ]
server/runtime.py (2)

2-4: Unused logger: imported but never used.

The logging module is imported and a logger is created, but it's never used in this file. Consider removing it or using it to log backend selection (see next comment).


16-17: Consider logging a warning when no backend is available.

Returning None silently on unsupported platforms may lead to confusing downstream errors. A warning would help diagnose deployment issues.

Proposed enhancement
+import sys
+import logging
+
+logger = logging.getLogger("app")
+
 def get_backend():
     """
     Dynamically choose which backend should be used depending on the OS 
     """
     if sys.platform == "darwin":
         from .backend import mlx_backend
         return mlx_backend
     elif sys.platform.startswith("linux"):
         from .backend import linux
         return linux
     else:
+        logger.warning(f"No backend available for platform: {sys.platform}")
         return None
server/tests/test_inference.py (4)

5-6: Consider using proper package installation instead of sys.path manipulation.

The sys.path.append hack is brittle and can cause import issues. Since the PR summary mentions removing runtime sys.path hacks, consider installing the package in editable mode (pip install -e .) or using pytest's pythonpath configuration.


16-19: Prefix unused variables with underscore.

Per static analysis, model_name and commit_hash are unpacked but never used.

Suggested fix
-        model_path, model_name, commit_hash = get_model_path(model_spec)
+        model_path, _model_name, _commit_hash = get_model_path(model_spec)

11-19: Consider converting to proper pytest test with pytest.skip().

The early return pattern makes this more of a manual script than a pytest-compatible test. For proper CI integration:

Suggested approach
+import pytest
+
 def test_inference():
     model_spec = "driaforall/mem-agent-mlx-4bit"
-    print(f"Testing inference with model: {model_spec}")
-    
-    try:
-        model_path, model_name, commit_hash = get_model_path(model_spec)
-        if model_path is None or not model_path.exists():
-            print(f"Error: Model {model_spec} not found in cache. Please download it first.")
-            return
+    model_path, _, _ = get_model_path(model_spec)
+    if model_path is None or not model_path.exists():
+        pytest.skip(f"Model {model_spec} not found in cache")

44-44: Move import json to top of file.

The json module is imported mid-function. Moving it to the top-level imports improves readability and follows Python conventions.

server/tests/test_responses_api.py (1)

5-6: Same sys.path hack as test_inference.py.

Consider addressing both test files together when refactoring to proper package installation or pytest configuration.

server/backend/mlx_engine/__init__.py (1)

5-12: Consider sorting __all__ alphabetically.

Static analysis (RUF022) suggests sorting for consistency. This is a minor style preference.

Sorted version
 __all__ = [
+    "create_generator",
+    "is_draft_model_compatible",
+    "load_draft_model",
     "load_model",
-    "load_draft_model",
-    "is_draft_model_compatible",
+    "tokenize",
     "unload_draft_model",
-    "create_generator",
-    "tokenize",
 ]
server/backend/mlx_backend.py (2)

117-124: Duplicated json_schema extraction logic.

This block is nearly identical to lines 194-201 in generate_response. Consider extracting to a helper function.

Suggested helper
def _extract_json_schema(response_format: Optional[Dict[str, Any]]) -> Optional[str]:
    """Extract JSON schema string from response_format dict."""
    if not response_format:
        return None
    if response_format.get("type") == "json_schema":
        schema_info = response_format.get("json_schema", {})
        return json.dumps(schema_info.get("schema", {}))
    elif response_format.get("type") == "json_object":
        return "{}"
    return None

257-259: Token count is a rough approximation.

The * 1.3 heuristic can be significantly inaccurate (subword tokenizers vary widely). Since the runner has access to the actual tokenizer via model_kit.tokenizer, consider using it for accurate usage reporting:

Suggested approach
def count_tokens(runner: MLXRunner, text: str) -> int:
    """Count tokens using the model's tokenizer."""
    if runner.model_kit and hasattr(runner.model_kit, 'tokenizer'):
        return len(runner.model_kit.tokenizer.encode(text))
    # Fallback approximation
    return int(len(text.split()) * 1.3)

Then update the call sites in generate_response to pass the runner.

server/backend/mlx_engine/utils/prompt_processing.py (1)

9-16: Add return type annotation for consistency.

The function parameters have type hints, but the return type is missing. Based on the implementation, it returns an mx.array.

Suggested fix
 def process_prompt_text_only(
     prompt_tokens: mx.array,
     cache_wrapper: CacheWrapper,
     generate_args: dict,
     draft_model: Optional[nn.Module],
     speculative_decoding_toggle: Optional[bool],
     prompt_progress_callback: Optional[Callable[[float], bool]],
-):
+) -> mx.array:
server/backend/mlx_engine/utils/disable_hf_download.py (1)

25-27: Redundant patch: both lines modify the same object.

huggingface_hub and sys.modules["huggingface_hub"] reference the same module object, so line 27 is redundant. The comment suggests wanting to catch "other imports," but all imports of huggingface_hub resolve to the same cached module in sys.modules.

Simplified version
 def patch_huggingface_hub():
     """
     Patch the huggingface_hub module to use our local-only snapshot_download.
     This ensures that any import of snapshot_download from huggingface_hub
     will use our wrapped version.
     """
     huggingface_hub.snapshot_download = snapshot_download
-    # Also patch the module in sys.modules to ensure any other imports get our version
-    sys.modules["huggingface_hub"].snapshot_download = snapshot_download
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py (1)

55-63: Prefix unused variable with underscore.

other_model_inputs is unpacked but never used, as flagged by static analysis.

Suggested fix
-        input_ids, pixel_values, attention_mask, other_model_inputs = (
+        input_ids, pixel_values, attention_mask, _other_model_inputs = (
             common_process_prompt_with_images(
                 prompt_tokens=prompt_tokens,
                 images_b64=images_b64,
                 processor=self.processor,
                 config=self.config,
                 max_size=max_size,
             )
         )
server/backend/mlx_engine/utils/eot_tokens.py (1)

54-59: Avoid shadowing the built-in id.

Using id as a variable name shadows Python's built-in function.

Suggested fix
-    for id in tokenizer.eos_token_ids:
-        text = tokenizer.decode(id)
+    for token_id in tokenizer.eos_token_ids:
+        text = tokenizer.decode(token_id)
         # Specific override for RNJ-1
-        if model_kit.model_type == "gemma3_text" and id == 1 and text == '"':
+        if model_kit.model_type == "gemma3_text" and token_id == 1 and text == '"':
             continue
-        temp_tokens.add(id)
+        temp_tokens.add(token_id)
server/backend/mlx_engine/utils/progress_decorators.py (1)

19-19: Minor: Replace en-dash with hyphen-minus in docstrings.

Ruff flagged ambiguous en-dash characters (–) in docstrings at lines 19, 52, and 82. While not a functional issue, using the standard hyphen-minus (-) improves consistency.

Suggested fix
-        callback: A callback that accepts progress (0.0–100.0) and returns
+        callback: A callback that accepts progress (0.0-100.0) and returns

Apply the same fix to lines 52 and 82.

server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py (1)

57-65: Prefix unused variables with underscore.

The attention_mask and other_model_inputs variables are unpacked but never used, as flagged by static analysis.

Suggested fix
-        input_ids, pixel_values, attention_mask, other_model_inputs = (
+        input_ids, pixel_values, _attention_mask, _other_model_inputs = (
             common_process_prompt_with_images(
                 prompt_tokens=prompt_tokens,
                 images_b64=images_b64,
                 processor=self.processor,
                 config=self.config,
                 max_size=max_size,
             )
         )
server/backend/mlx_engine/model_kit/vision_add_ons/mistral3.py (1)

120-124: Consider making the model path check more robust.

The string matching approach works but is fragile if the model path changes slightly. Consider also checking for partial matches or making this configurable.

Alternative implementation
     `@staticmethod`
     def _is_lmstudio_mistral_3_2_small(model_path: Path) -> bool:
-        return "lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX" in str(
-            model_path
-        )
+        path_str = str(model_path).lower()
+        return "mistral-small-3.2" in path_str and "lmstudio" in path_str
server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py (4)

101-107: Use dict literal instead of zip for single key-value pair.

The zip() with a single-element list is overly complex and triggers a static analysis warning about missing strict= parameter.

Suggested fix
             if outputs.cross_attention_states is not None:
-                self.language_model_kwargs = {
-                    k: v
-                    for k, v in zip(
-                        ["cross_attention_states"], [outputs.cross_attention_states]
-                    )
-                }
+                self.language_model_kwargs = {
+                    "cross_attention_states": outputs.cross_attention_states
+                }

140-146: Use exception chaining for better debugging.

When re-raising with a new exception or bare re-raise, use proper chaining to preserve the stack trace.

Suggested fix
         except ValueError as e:
             # Create a friendly error message if a user tries to use mllama without images
             if "Cross attention states must be provided for layer" in str(e):
                 raise ValueError(
                     "Using this model without any images attached is not supported yet."
-                )
-            raise e
+                ) from e
+            raise

170-173: Avoid list comprehension for side effects.

Using a list comprehension solely for its side effects is non-idiomatic Python. Use a regular for loop instead.

Suggested fix
             detokenizer.reset()
-            [detokenizer.add_token(token) for token in prompt_tokens]
+            for token in prompt_tokens:
+                detokenizer.add_token(token)
             detokenizer.finalize()

186-191: Use exception chaining for the image requirement error.

Same issue as line 143-146: use from e for proper exception chaining.

Suggested fix
         except ValueError as e:
             if "`images` are expected as arguments" in str(e):
                 raise ValueError(
                     "Using this model without any images attached is not supported yet."
-                )
-            raise e
+                ) from e
+            raise
server/backend/mlx_engine/vision_model_kit/vision_model_kit.py (2)

24-31: Consider using instance attributes instead of class-level defaults.

Class-level attribute declarations with mutable defaults can cause unexpected sharing between instances. While None is safe, using explicit instance initialization in __init__ would be clearer.

Suggested refactor in __init__
     def __init__(
         self,
         model_path: Path,
         vocab_only: bool,
         trust_remote_code: bool,
     ):
+        self.config = None
+        self.model_weights = None
+        self.processor = None
+        self.has_processed_prompt = False
         fix_qwen2_5_vl_image_processor(model_path)
         fix_qwen2_vl_preprocessor(model_path)

And remove the class-level declarations at lines 24-31.


145-156: Prefix unused path parameter with underscore for clarity.

The path parameter in is_draft_model_compatible and load_draft_model is unused since speculative decoding is not supported. Prefixing with underscore documents this intentionally.

Suggested fix
-    def is_draft_model_compatible(self, path: str | Path) -> bool:
+    def is_draft_model_compatible(self, _path: str | Path) -> bool:
         return False

-    def load_draft_model(self, path: str | Path) -> None:
+    def load_draft_model(self, _path: str | Path) -> None:
         raise ValueError(
             "Speculative decoding is not currently supported for vision models"
         )
server/backend/mlx_engine/cache_wrapper.py (1)

89-90: Remove redundant self-assignments.

These lines assign variables to themselves and serve no purpose.

🔧 Suggested fix
     def _find_common_prefix(
         current_tokens: mx.array, prompt_tokens: mx.array, num_tokens_to_exclude: int
     ) -> int:
         """
         Determine the common prefix length between the current tokens and the prompt tokens.

         Args:
             current_tokens (mx.array): The cached tokens (self.tokens).
             prompt_tokens (mx.array): The prompt tokens.
             num_tokens_to_exclude (int): The minimum length of the remaining prompt tokens array.

         Returns:
             int: The length of the common prefix.
         """
-        prompt_tokens = prompt_tokens
-        current_tokens = current_tokens
         # Find the minimum length between the two arrays
         min_length = min(len(current_tokens), len(prompt_tokens))
server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py (1)

143-156: Potential KeyError if quantization is present but missing expected keys.

On line 148, config_dict["quantization"] assumes the quantization dict exists when get_class_predicate is called. While this is guarded by the outer check on line 138, if quantization is set to an unexpected value (e.g., empty dict or malformed), accessing config_dict["quantization"][p] on line 149 could behave unexpectedly.

🔧 Safer access pattern
         def get_class_predicate(p, m):
             # Always skip vision and audio models
             if skip_multimodal_module(p) and skip_vision:
                 return False
             # Handle custom per layer quantizations
-            if p in config_dict["quantization"]:
-                return config_dict["quantization"][p]
+            if p in quantization:
+                return quantization[p]
             if not hasattr(m, "to_quantized"):
                 return False
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py (2)

104-112: Prefix unused variables with underscores.

attention_mask and other_model_inputs are unpacked but never used. Prefix them with underscores to indicate they are intentionally unused.

🔧 Suggested fix
-        input_ids, pixel_values, attention_mask, other_model_inputs = (
+        input_ids, pixel_values, _attention_mask, _other_model_inputs = (
             common_process_prompt_with_images(
                 prompt_tokens=prompt_tokens,
                 images_b64=images_b64,
                 processor=self.processor,
                 config=self.config,
                 max_size=max_size,
             )
         )

113-113: Consider replacing assert with explicit error handling.

Assertions can be disabled in production (e.g., with python -O). If input_ids being None is a genuine error condition, use an explicit check with a descriptive exception.

🔧 Suggested fix
-        assert input_ids is not None
+        if input_ids is None:
+            raise ValueError("Failed to process prompt: input_ids is None")
server/backend/mlx_engine/model_kit/model_kit.py (1)

35-57: Annotate VISION_ADD_ON_MAP with ClassVar.

The static analysis correctly identifies that mutable class attributes should be annotated with typing.ClassVar to clarify they are class-level constants, not instance attributes.

🔧 Suggested fix
-from typing import Callable, Optional, List, Tuple
+from typing import Callable, ClassVar, Optional, List, Tuple, Type

 class ModelKit:
     """
     Collection of objects and methods that are needed for operating a model.
     ...
     """

-    VISION_ADD_ON_MAP = {
+    VISION_ADD_ON_MAP: ClassVar[dict[str, Type[BaseVisionAddOn]]] = {
         "gemma3": Gemma3VisionAddOn,
         "gemma3n": Gemma3nVisionAddOn,
         "mistral3": Mistral3VisionAddOn,
         "pixtral": PixtralVisionAddOn,
     }
server/backend/mlx_engine/generate.py (1)

111-111: Avoid wrapping the arguments to any() in a list.

The list wrapper adds an unnecessary allocation here; a plain tuple (or a generator expression) works just as well.

🔧 Suggested fix
-        if any([kv_bits, kv_group_size, quantized_kv_start]):
+        if any((kv_bits, kv_group_size, quantized_kv_start)):
📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9181c9a and 4545a11.

⛔ Files ignored due to path filters (1)
  • server/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (43)
  • server/api.py
  • server/backend/mlx_backend.py
  • server/backend/mlx_engine/__init__.py
  • server/backend/mlx_engine/cache_wrapper.py
  • server/backend/mlx_engine/external/datasets/dill.py
  • server/backend/mlx_engine/generate.py
  • server/backend/mlx_engine/model_kit/__init__.py
  • server/backend/mlx_engine/model_kit/model_kit.py
  • server/backend/mlx_engine/model_kit/patches/gemma3n.py
  • server/backend/mlx_engine/model_kit/vision_add_ons/base.py
  • server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py
  • server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py
  • server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py
  • server/backend/mlx_engine/model_kit/vision_add_ons/mistral3.py
  • server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py
  • server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py
  • server/backend/mlx_engine/processors/repetition_penalty_processor.py
  • server/backend/mlx_engine/stop_string_processor.py
  • server/backend/mlx_engine/utils/disable_hf_download.py
  • server/backend/mlx_engine/utils/eot_tokens.py
  • server/backend/mlx_engine/utils/fix_mistral_pre_tokenizer.py
  • server/backend/mlx_engine/utils/image_utils.py
  • server/backend/mlx_engine/utils/kv_cache_quantization.py
  • server/backend/mlx_engine/utils/logger.py
  • server/backend/mlx_engine/utils/outlines_transformer_tokenizer.py
  • server/backend/mlx_engine/utils/progress_decorators.py
  • server/backend/mlx_engine/utils/prompt_processing.py
  • server/backend/mlx_engine/utils/register_models.py
  • server/backend/mlx_engine/utils/set_seed.py
  • server/backend/mlx_engine/utils/speculative_decoding.py
  • server/backend/mlx_engine/utils/token.py
  • server/backend/mlx_engine/utils/top_logprobs.py
  • server/backend/mlx_engine/vision_model_kit/_transformers_compatibility.py
  • server/backend/mlx_engine/vision_model_kit/vision_model_kit.py
  • server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py
  • server/backend/mlx_runner.py
  • server/main.py
  • server/pyproject.toml
  • server/runtime.py
  • server/schemas.py
  • server/tests/test_inference.py
  • server/tests/test_lora.py
  • server/tests/test_responses_api.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}

⚙️ CodeRabbit configuration file

Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.

Files:

  • server/pyproject.toml
🧬 Code graph analysis (18)
server/backend/mlx_engine/utils/outlines_transformer_tokenizer.py (1)
server/backend/mlx_engine/external/datasets/dill.py (2)
  • Hasher (27-54)
  • hash (44-45)
server/backend/mlx_engine/utils/prompt_processing.py (1)
server/backend/mlx_engine/cache_wrapper.py (5)
  • CacheWrapper (25-334)
  • prompt_progress_callback (291-292)
  • set_draft_model (231-261)
  • unset_draft_model (263-268)
  • update_cache (270-328)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py (5)
server/backend/mlx_engine/model_kit/vision_add_ons/base.py (2)
  • BaseVisionAddOn (7-34)
  • compute_embeddings (19-34)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
  • common_process_prompt_with_images (19-79)
server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py (1)
  • load_vision_addon (191-257)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py (1)
  • compute_embeddings (96-159)
server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py (1)
  • compute_embeddings (49-91)
server/backend/mlx_engine/utils/progress_decorators.py (1)
server/backend/mlx_engine/cache_wrapper.py (1)
  • StopPromptProcessing (19-22)
server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py (2)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
  • common_process_prompt_with_images (19-79)
server/backend/mlx_engine/vision_model_kit/vision_model_kit.py (2)
  • language_model (159-160)
  • record_sampled_token (142-143)
server/backend/mlx_engine/model_kit/vision_add_ons/base.py (4)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py (1)
  • compute_embeddings (47-87)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py (1)
  • compute_embeddings (96-159)
server/backend/mlx_engine/model_kit/vision_add_ons/mistral3.py (1)
  • compute_embeddings (57-118)
server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py (1)
  • compute_embeddings (49-91)
server/backend/mlx_backend.py (2)
server/schemas.py (4)
  • ChatMessage (15-17)
  • ChatCompletionRequest (20-31)
  • downloadRequest (65-66)
  • ResponseRequest (69-77)
server/backend/mlx_runner.py (3)
  • _format_conversation (246-263)
  • generate_batch (242-244)
  • get_effective_max_tokens (139-147)
server/backend/mlx_engine/model_kit/__init__.py (1)
server/backend/mlx_engine/model_kit/patches/gemma3n.py (1)
  • apply_patches (58-65)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
server/backend/mlx_engine/utils/image_utils.py (2)
  • convert_to_pil (10-15)
  • custom_resize (18-100)
server/backend/mlx_engine/utils/top_logprobs.py (1)
server/backend/mlx_engine/utils/token.py (1)
  • Token (6-14)
server/backend/mlx_engine/utils/eot_tokens.py (2)
server/backend/mlx_engine/model_kit/model_kit.py (1)
  • ModelKit (22-236)
server/backend/mlx_engine/vision_model_kit/vision_model_kit.py (1)
  • VisionModelKit (19-160)
server/tests/test_inference.py (3)
server/backend/mlx_runner.py (3)
  • MLXRunner (55-337)
  • generate_streaming (170-240)
  • get_memory_usage (265-280)
server/cache_utils.py (1)
  • get_model_path (173-191)
server/backend/mlx_engine/external/datasets/dill.py (1)
  • dumps (134-138)
server/backend/mlx_engine/utils/fix_mistral_pre_tokenizer.py (2)
server/backend/mlx_engine/generate.py (1)
  • tokenize (477-490)
server/backend/mlx_engine/model_kit/model_kit.py (1)
  • tokenize (136-140)
server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py (1)
server/backend/mlx_engine/model_kit/patches/gemma3n.py (1)
  • from_dict (25-36)
server/api.py (2)
server/schemas.py (2)
  • ChatMessage (15-17)
  • ResponseRequest (69-77)
server/backend/mlx_backend.py (1)
  • generate_response (182-245)
server/backend/mlx_runner.py (4)
server/backend/mlx_engine/generate.py (2)
  • create_generator (146-474)
  • tokenize (477-490)
server/backend/mlx_engine/model_kit/model_kit.py (1)
  • tokenize (136-140)
server/backend/mlx_engine/cache_wrapper.py (1)
  • StopPromptProcessing (19-22)
server/reasoning_utils.py (3)
  • ReasoningExtractor (15-171)
  • StreamingReasoningParser (245-438)
  • finalize (424-438)
server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py (6)
server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py (2)
  • process_prompt_with_images (154-206)
  • language_model (213-214)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
  • common_process_prompt_with_images (19-79)
server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py (1)
  • load_vision_addon (191-257)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py (1)
  • compute_embeddings (47-87)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py (1)
  • compute_embeddings (96-159)
server/backend/mlx_engine/model_kit/vision_add_ons/mistral3.py (1)
  • compute_embeddings (57-118)
server/backend/mlx_engine/vision_model_kit/vision_model_kit.py (3)
server/backend/mlx_engine/model_kit/model_kit.py (9)
  • ModelKit (22-236)
  • _vocab_only_init (59-63)
  • _full_model_init (65-112)
  • process_prompt (142-178)
  • is_cross_prompt_cache_active (180-185)
  • record_token_to_cache (187-188)
  • is_draft_model_compatible (203-216)
  • load_draft_model (218-227)
  • unload_draft_model (229-236)
server/backend/mlx_engine/vision_model_kit/_transformers_compatibility.py (2)
  • fix_qwen2_5_vl_image_processor (8-36)
  • fix_qwen2_vl_preprocessor (39-60)
server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py (3)
  • VisionModelWrapper (13-214)
  • process_prompt_with_images (154-206)
  • language_model (213-214)
🪛 Ruff (0.14.11)
server/backend/mlx_engine/utils/set_seed.py

39-39: Avoid specifying long messages outside the exception class

(TRY003)
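The TRY003 findings above and below all point at the same pattern. As a minimal sketch of the fix the rule suggests (the class and function names here are illustrative, not from this codebase): move the long message into a dedicated exception class so call sites stay short.

```python
class SeedOutOfRangeError(ValueError):
    """Raised when a random seed falls outside the supported range."""

    def __init__(self, seed: int):
        # The long message lives inside the exception class, not at the raise site
        super().__init__(
            f"Seed {seed} is out of range; expected a non-negative 64-bit integer."
        )
        self.seed = seed


def set_seed(seed: int) -> None:
    if seed < 0 or seed >= 2**64:
        # Call site no longer carries a long inline message (satisfies TRY003)
        raise SeedOutOfRangeError(seed)
```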

server/backend/mlx_engine/utils/prompt_processing.py

18-18: Avoid specifying long messages outside the exception class

(TRY003)


27-30: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py

55-55: Unpacked variable other_model_inputs is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

server/backend/mlx_engine/utils/kv_cache_quantization.py

29-29: Avoid specifying long messages outside the exception class

(TRY003)


42-42: Avoid specifying long messages outside the exception class

(TRY003)


44-46: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/utils/progress_decorators.py

19-19: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)


52-52: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)


82-82: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF002)

server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py

61-61: Unused method argument: input_embeddings

(ARG002)


104-106: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)


143-145: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


143-145: Avoid specifying long messages outside the exception class

(TRY003)


146-146: Use raise without specifying exception name

Remove exception name

(TRY201)


188-190: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)


188-190: Avoid specifying long messages outside the exception class

(TRY003)


191-191: Use raise without specifying exception name

Remove exception name

(TRY201)

server/backend/mlx_engine/__init__.py

5-12: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

server/backend/mlx_engine/cache_wrapper.py

245-245: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/stop_string_processor.py

35-35: Avoid specifying long messages outside the exception class

(TRY003)


38-38: Avoid specifying long messages outside the exception class

(TRY003)


41-41: Avoid specifying long messages outside the exception class

(TRY003)


93-93: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/external/datasets/dill.py

30-30: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


109-109: Do not catch blind exception: Exception

(BLE001)
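A hedged sketch of the pattern the BLE001 findings push toward (function name and path handling are illustrative only): catch the specific errors the operation can raise, and log with the traceback preserved rather than swallowing everything.

```python
import logging

logger = logging.getLogger(__name__)


def load_payload(path: str):
    # Catch only the errors this operation can plausibly raise (fixes BLE001),
    # and use logger.exception so the traceback is kept (per TRY400).
    try:
        with open(path, "rb") as f:
            return f.read()
    except OSError:
        logger.exception("Failed to load payload from %s", path)
        return None
```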


115-115: Unused noqa directive (unused: E721)

Remove unused noqa directive

(RUF100)

server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py

38-38: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/utils/top_logprobs.py

25-27: zip() without an explicit strict= parameter

Add explicit value for parameter strict=

(B905)

server/backend/mlx_engine/utils/eot_tokens.py

14-14: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)


14-14: String contains ambiguous ｜ (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?

(RUF001)


63-65: Avoid specifying long messages outside the exception class

(TRY003)

server/tests/test_inference.py

16-16: Unpacked variable model_name is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


16-16: Unpacked variable commit_hash is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


58-58: Do not catch blind exception: Exception

(BLE001)


61-61: Do not catch blind exception: Exception

(BLE001)

server/backend/mlx_engine/utils/fix_mistral_pre_tokenizer.py

52-52: Do not catch blind exception: Exception

(BLE001)

server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py

33-33: Avoid specifying long messages outside the exception class

(TRY003)


105-107: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py

104-104: Unpacked variable attention_mask is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


104-104: Unpacked variable other_model_inputs is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

server/api.py

89-89: Consider moving this statement to an else block

(TRY300)


90-90: Do not catch blind exception: Exception

(BLE001)


91-91: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


92-92: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

server/backend/mlx_engine/generate.py

112-114: Avoid specifying long messages outside the exception class

(TRY003)


116-116: Avoid specifying long messages outside the exception class

(TRY003)


313-315: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_runner.py

72-72: Avoid specifying long messages outside the exception class

(TRY003)


76-76: Consider moving this statement to an else block

(TRY300)


98-99: try-except-pass detected, consider logging the exception

(S110)
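For the S110 try-except-pass findings, one possible shape of the fix (helper name is hypothetical): keep the error suppressed but record it, so swallowed failures stay visible when troubleshooting.

```python
import logging

logger = logging.getLogger(__name__)


def close_quietly(resource) -> None:
    # Instead of a bare `except Exception: pass` (S110), log the swallowed
    # error at debug level with its traceback attached.
    try:
        resource.close()
    except Exception:
        logger.debug("Ignoring error while closing resource", exc_info=True)
```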


98-98: Do not catch blind exception: Exception

(BLE001)


123-123: Avoid specifying long messages outside the exception class

(TRY003)


136-137: try-except-pass detected, consider logging the exception

(S110)


136-136: Do not catch blind exception: Exception

(BLE001)


166-167: try-except-pass detected, consider logging the exception

(S110)


166-166: Do not catch blind exception: Exception

(BLE001)


186-186: Avoid specifying long messages outside the exception class

(TRY003)


249-249: Avoid specifying long messages outside the exception class

(TRY003)


270-270: Do not catch blind exception: Exception

(BLE001)


416-416: Do not catch blind exception: Exception

(BLE001)


426-426: Do not catch blind exception: Exception

(BLE001)


432-432: Consider moving this statement to an else block

(TRY300)


433-433: Do not catch blind exception: Exception

(BLE001)

server/tests/test_responses_api.py

29-29: Do not catch blind exception: Exception

(BLE001)


97-97: Do not catch blind exception: Exception

(BLE001)

server/tests/test_lora.py

16-16: Probable insecure usage of temporary file or directory: "/tmp/fake_model"

(S108)


17-17: Probable insecure usage of temporary file or directory: "/tmp/fake_adapter"

(S108)


38-38: Local variable model_kit is assigned to but never used

Remove assignment to unused variable model_kit

(F841)


38-38: Probable insecure usage of temporary file or directory: "/tmp/fake_model"

(S108)


38-38: Probable insecure usage of temporary file or directory: "/tmp/fake_adapter"

(S108)


42-42: Probable insecure usage of temporary file or directory: "/tmp/fake_adapter"

(S108)

server/backend/mlx_engine/utils/speculative_decoding.py

21-23: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py

57-57: Unpacked variable attention_mask is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


57-57: Unpacked variable other_model_inputs is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

server/backend/mlx_engine/model_kit/model_kit.py

35-40: Mutable class attributes should be annotated with typing.ClassVar

(RUF012)


171-173: Avoid specifying long messages outside the exception class

(TRY003)


222-222: Avoid specifying long messages outside the exception class

(TRY003)


224-224: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/vision_model_kit/vision_model_kit.py

106-106: Unused method argument: prompt_progress_callback

(ARG002)


107-107: Unused method argument: generate_args

(ARG002)


109-109: Unused method argument: speculative_decoding_toggle

(ARG002)


145-145: Unused method argument: path

(ARG002)


148-148: Unused method argument: path

(ARG002)


149-151: Avoid specifying long messages outside the exception class

(TRY003)


154-156: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/utils/image_utils.py

49-51: Avoid specifying long messages outside the exception class

(TRY003)


53-53: Avoid specifying long messages outside the exception class

(TRY003)


92-92: Loop control variable i not used within loop body

(B007)

server/backend/mlx_engine/utils/disable_hf_download.py

10-10: Unused function argument: args

(ARG001)


10-10: Unused function argument: kwargs

(ARG001)


14-16: Avoid specifying long messages outside the exception class

(TRY003)

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment thread server/api.py Outdated
Comment thread server/backend/mlx_backend.py
Comment thread server/backend/mlx_backend.py Outdated
Comment thread server/backend/mlx_engine/stop_string_processor.py
Comment thread server/backend/mlx_engine/utils/image_utils.py
Comment thread server/backend/mlx_engine/utils/logger.py Outdated
Comment thread server/pyproject.toml Outdated
Comment thread server/tests/test_responses_api.py

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@server/tests/test_responses_api.py`:
- Around line 97-100: The broad except Exception as e block currently prints the
error and traceback but does not re-raise, causing pytest to treat failures as
passes; change the handler to either re-raise the exception (raise) after
printing or call pytest.fail(...) with the exception message so the test fails
properly; modify the except Exception as e block in the test (the try/except
around the test logic) to re-raise or call pytest.fail(str(e)) so failures are
propagated to pytest.
- Around line 60-62: The test currently prints failure and returns in the
non-200 branch which lets pytest treat the test as passing; import pytest at the
top of the test file and change the early-return behavior in the failure branch
of the test_responses_api test (the block that prints f"Test FAILED:
{response.text}") to call pytest.fail(...) or raise an assertion (e.g.,
pytest.fail with a message including response.status_code and response.text) so
pytest marks the test as failed.
🧹 Nitpick comments (10)
server/pyproject.toml (1)

6-19: Version pinning improvements look good; consider moving black to dev dependencies.

The consistent explicit pinning across all dependencies and the outlines upgrade to 1.2.9 address previous review feedback—nice work.

However, black is a code formatter and doesn't belong in runtime dependencies. It should be moved to optional/dev dependencies:

 dependencies = [
     "fastapi==0.119.0",
     "uvicorn==0.38.0",
     "mlx-lm==0.28.3",
     "mlx-vlm==0.3.9",
     "outlines==1.2.9",
     "outlines-core==0.1.26",
     "psutil==7.2.1",
     "sentencepiece==0.2.1",
     "pillow==12.1.0",
     "transformers==4.49.0",
-    "black==25.9.0",
     "huggingface-hub==0.35.3",
 ]
+
+[project.optional-dependencies]
+dev = [
+    "black==25.9.0",
+]
server/backend/mlx_engine/utils/image_utils.py (4)

10-15: Consider adding error handling for invalid image data.

The function will raise a generic exception if base64 decoding fails or the image data is corrupted. Wrapping with a try-except to provide a clearer error message (e.g., indicating which image index failed) could improve debuggability.

♻️ Optional enhancement
 def convert_to_pil(images_b64: List[str]) -> list[PIL.Image.Image]:
     """Convert a list of base64 strings to PIL Images"""
-    return [
-        PIL.Image.open(BytesIO(base64.b64decode(img))).convert("RGB")
-        for img in images_b64 or []
-    ]
+    result = []
+    for i, img in enumerate(images_b64 or []):
+        try:
+            result.append(PIL.Image.open(BytesIO(base64.b64decode(img))).convert("RGB"))
+        except Exception as e:
+            raise ValueError(f"Failed to decode image at index {i}: {e}") from e
+    return result

43-44: Docstring references sys.stderr but implementation uses logger.

The side effects note mentions "Writes progress and status messages to sys.stderr" but the implementation uses logger.info() and logger.warning().

📝 Proposed fix
     Side effects:
-        Writes progress and status messages to sys.stderr.
+        Logs progress and status messages via the module logger.

64-64: Remove unused variable aspect_ratio.

This variable is computed but never used after the resize logic was refactored to use scale-based calculations.

♻️ Proposed fix
-        aspect_ratio = img.width / img.height

89-89: Remove the unused loop variable i.

The loop variable i is not used within the loop body.

♻️ Proposed fix
-        for i, img in enumerate(resized_images):
+        for img in resized_images:
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)

41-41: Avoid list comprehension for side effects.

Using a list comprehension purely for side effects (calling add_token) is not idiomatic Python and unnecessarily allocates a list of None values. Use a regular for loop instead.

♻️ Proposed fix
-    [detokenizer.add_token(token) for token in prompt_tokens]
+    for token in prompt_tokens:
+        detokenizer.add_token(token)
server/backend/mlx_engine/stop_string_processor.py (3)

24-45: Consider copying stop_strings to prevent external mutation.

Storing stop_strings by reference means external modifications to the list after initialization could cause unexpected behavior. A shallow copy would provide defensive isolation.

Proposed fix
-        self.stop_strings = stop_strings
+        self.stop_strings = list(stop_strings)

85-90: Return a copy of token_id_buffer to prevent accidental mutation.

Returning self.token_id_buffer directly exposes internal state. If the caller modifies the returned list, it would affect the processor's buffer. Returning a copy is safer.

Proposed fix
         elif result.status == "full_stop":
             return StopStringProcessorResult(
                 status="full_stop",
                 stop_string=result.stop_string,
-                stop_tokens=self.token_id_buffer,
+                stop_tokens=list(self.token_id_buffer),
             )

110-111: Docstring incorrectly states return type.

The docstring says it returns StopStringProcessorResult but the method signature and implementation return _StoppingCriteriaResult.

Proposed fix
         Returns:
-            StopStringProcessorResult indicating match status and stop string if matched
+            _StoppingCriteriaResult indicating match status and stop string if matched
server/tests/test_responses_api.py (1)

85-95: Consider explicit JSON parsing error handling and consistent failure reporting.

Two minor concerns:

  1. Line 89: If json.loads(content) fails, the generic exception handler at line 97 will catch it, but a more specific json.JSONDecodeError catch would provide clearer diagnostics.
  2. Line 95: Similar to the earlier case, printing "FAILED" without raising won't cause pytest to report a failure (though it's at the end of the function anyway).
Proposed improvement for JSON parsing
         if response.status_code == 200:
             data = response.json()
             print(f"Response: {json.dumps(data, indent=2)}")
             content = data["choices"][0]["message"]["content"]
-            parsed = json.loads(content)
+            try:
+                parsed = json.loads(content)
+            except json.JSONDecodeError as e:
+                pytest.fail(f"Response content is not valid JSON: {content!r}, error: {e}")
             print(f"Parsed JSON: {parsed}")
             assert parsed["capital"].lower() == "berlin"
             assert parsed["country"].lower() == "germany"
             print("JSON schema test PASSED")
         else:
-            print(f"JSON schema test FAILED: {response.text}")
+            pytest.fail(f"JSON schema test FAILED: {response.text}")
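The same idea can be sketched without a pytest dependency: a hypothetical helper that converts an unparseable model response into a loud, descriptive failure rather than a bare `JSONDecodeError`:

```python
import json

def parse_model_json(content):
    """Hypothetical helper: fail loudly when a model response that
    should be JSON (e.g. under a json_schema response_format) is not
    parseable, instead of surfacing a bare JSONDecodeError."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        raise AssertionError(
            f"Response content is not valid JSON: {content!r} ({e})"
        ) from e
```
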
📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4545a11 and c0a72b8.

📒 Files selected for processing (8)
  • server/api.py
  • server/backend/mlx_backend.py
  • server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py
  • server/backend/mlx_engine/stop_string_processor.py
  • server/backend/mlx_engine/utils/image_utils.py
  • server/backend/mlx_engine/utils/logger.py
  • server/pyproject.toml
  • server/tests/test_responses_api.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/backend/mlx_backend.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}

⚙️ CodeRabbit configuration file

Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.

Files:

  • server/pyproject.toml
🧬 Code graph analysis (3)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
server/backend/mlx_engine/utils/image_utils.py (2)
  • convert_to_pil (10-15)
  • custom_resize (18-97)
server/api.py (2)
server/schemas.py (2)
  • ChatMessage (15-17)
  • ResponseRequest (69-77)
server/backend/mlx_backend.py (1)
  • generate_response (183-253)
server/tests/test_responses_api.py (3)
server/backend/mlx_backend.py (1)
  • get_or_load_model (31-80)
server/cache_utils.py (1)
  • get_model_path (173-191)
server/backend/mlx_engine/external/datasets/dill.py (1)
  • dumps (134-138)
🪛 Ruff (0.14.11)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py

38-38: Avoid specifying long messages outside the exception class

(TRY003)

server/backend/mlx_engine/stop_string_processor.py

35-35: Avoid specifying long messages outside the exception class

(TRY003)


38-38: Avoid specifying long messages outside the exception class

(TRY003)


41-41: Avoid specifying long messages outside the exception class

(TRY003)


93-93: Avoid specifying long messages outside the exception class

(TRY003)

server/api.py

89-89: Consider moving this statement to an else block

(TRY300)

server/backend/mlx_engine/utils/image_utils.py

49-51: Avoid specifying long messages outside the exception class

(TRY003)


53-53: Avoid specifying long messages outside the exception class

(TRY003)


64-64: Local variable aspect_ratio is assigned to but never used

Remove assignment to unused variable aspect_ratio

(F841)


89-89: Loop control variable i not used within loop body

(B007)

server/tests/test_responses_api.py

29-29: Do not catch blind exception: Exception

(BLE001)


97-97: Do not catch blind exception: Exception

(BLE001)

🔇 Additional comments (19)
server/backend/mlx_engine/utils/logger.py (2)

12-17: LGTM! Local error suppression addresses prior feedback.

The custom MLXEngineStreamHandler correctly overrides handleError to suppress handler-specific exceptions locally, avoiding the process-wide side effect of setting logging.raiseExceptions = False. This is the recommended approach for library-scoped error handling.


20-42: LGTM! Well-structured logger setup.

The configuration follows best practices:

  • Namespaced logger avoids polluting the root logger
  • Handler clearing prevents duplicate output on re-import
  • propagate = False isolates mlx_engine logs from the application's root logger while still allowing child loggers to inherit this configuration
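The pattern described above can be sketched as follows; names like `QuietStreamHandler` and `make_library_logger` are illustrative, not the module's actual identifiers:

```python
import logging

class QuietStreamHandler(logging.StreamHandler):
    """Hypothetical sketch: suppress emit() errors for this handler
    only, instead of flipping the process-wide
    logging.raiseExceptions flag."""
    def handleError(self, record):
        pass  # swallow handler-specific errors locally

def make_library_logger(name="mlx_engine"):
    logger = logging.getLogger(name)      # namespaced, not the root logger
    logger.handlers.clear()               # avoid duplicates on re-import
    handler = QuietStreamHandler()
    handler.setFormatter(
        logging.Formatter("%(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False              # keep records out of the root logger
    return logger
```
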
server/backend/mlx_engine/utils/image_utils.py (2)

1-7: LGTM!

Imports and logger setup are appropriate for the module's purpose.


66-75: Resize logic correctly respects both max dimensions.

The implementation now properly calculates a single scale factor using min(scale_w, scale_h, 1.0) to ensure both dimensions fit within max_size while preserving aspect ratio.
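A minimal sketch of that single-scale-factor logic, assuming integer pixel dimensions (the function name is hypothetical):

```python
def fit_within(width, height, max_size):
    """Scale (width, height) so both dimensions fit within max_size,
    preserving aspect ratio and never upscaling."""
    scale = min(max_size / width, max_size / height, 1.0)
    return round(width * scale), round(height * scale)
```

The `1.0` term ensures images already within bounds are returned unchanged rather than enlarged.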

server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (4)

1-9: LGTM!

Imports are appropriate and well-organized.


12-16: LGTM!

Clean NamedTuple definition providing a well-typed return structure for the processed image prompt data.


50-59: Robust handling of config variations.

The image_token_index resolution now properly guards against missing or None vision_config attributes before accessing image_token_id.


61-83: LGTM!

The prepare_inputs call and extraction of processed inputs into the ProcessedImagePrompt return value are well-structured.

server/backend/mlx_engine/stop_string_processor.py (7)

1-9: LGTM!

Clean imports and well-defined type literals. The REPLACEMENT_CHAR constant correctly uses U+FFFD for detecting incomplete UTF-8 sequences.
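The U+FFFD mechanism is easy to demonstrate: decoding a truncated multi-byte UTF-8 sequence with `errors="replace"` yields the replacement character at the cut point, which is how a processor can tell a token boundary fell mid-character:

```python
# "é" encodes to two bytes in UTF-8: b'\xc3\xa9'.
two_bytes = "é".encode("utf-8")

# Decoding only the first byte produces U+FFFD (the replacement char),
# signalling an incomplete sequence; the full decode round-trips cleanly.
partial = two_bytes[:1].decode("utf-8", errors="replace")
complete = two_bytes.decode("utf-8")
```
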


12-18: LGTM!

Well-documented result type with clear field semantics.


95-97: LGTM!

Good use of an inner NamedTuple to separate internal result representation from the public API.


128-131: LGTM!

Correctly detects incomplete UTF-8 sequences by checking for the replacement character at the end of the decoded string.


133-151: LGTM!

Correctly identifies the earliest-occurring stop string, ensuring consistent behavior when multiple stop strings could match.
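The earliest-match selection can be sketched in a few lines (a simplified stand-in for the method under review, not its actual implementation):

```python
def earliest_stop(text, stop_strings):
    """Hypothetical sketch: return the stop string whose first
    occurrence starts earliest in text, or None if none match."""
    matches = [(text.find(s), s) for s in stop_strings if s in text]
    return min(matches)[1] if matches else None
```

Taking `min` over `(index, string)` pairs picks the match that begins first, regardless of the order stop strings were supplied in.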


154-163: LGTM!

The return type inconsistency has been addressed: this method now correctly returns _StoppingCriteriaResult, consistent with the other check methods.


165-177: LGTM!

Clean implementation of suffix-prefix overlap detection with early termination via any().
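The overlap check can be sketched as follows (a simplified illustration of the technique, not the module's actual code):

```python
def ends_with_partial_stop(text, stop_strings):
    """Hypothetical sketch: True when text ends with a proper prefix
    of any stop string, i.e. generation may be mid-way through one
    and more tokens are needed before deciding."""
    return any(
        text.endswith(s[:i])
        for s in stop_strings
        for i in range(1, len(s))
    )
```

`any()` short-circuits on the first overlap found; the range stops before `len(s)` so a complete stop string is handled by the full-match path, not this one.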

server/api.py (2)

3-3: LGTM!

The import of ResponseRequest is correctly added to support the new endpoint.


84-92: LGTM! Error handling improvements from previous review have been addressed.

The endpoint correctly delegates to the backend, uses logger.exception for full traceback logging, and properly chains the exception with from e. The static analysis hint (TRY300) about moving the return to an else block is a minor style preference that doesn't improve clarity here.

server/tests/test_responses_api.py (2)

1-11: LGTM!

Imports and path setup are appropriate for the test file structure.


20-31: LGTM! Previous review feedback addressed.

The model path check now properly guards against None values before calling .exists(). The graceful skip pattern for missing models is appropriate for integration tests.



Comment on lines +60 to +62
else:
print(f"Test FAILED: {response.text}")
return

⚠️ Potential issue | 🟡 Minor

Early return on failure won't cause pytest to report test failure.

When the status code is not 200, the test prints the failure but returns without raising an exception or assertion. Pytest will report this as a passed test.

Proposed fix
         else:
             print(f"Test FAILED: {response.text}")
-            return
+            pytest.fail(f"Expected status 200, got {response.status_code}: {response.text}")

You'll need to add import pytest at the top of the file.

🤖 Prompt for AI Agents
In `@server/tests/test_responses_api.py` around lines 60 - 62, The test currently
prints failure and returns in the non-200 branch which lets pytest treat the
test as passing; import pytest at the top of the test file and change the
early-return behavior in the failure branch of the test_responses_api test (the
block that prints f"Test FAILED: {response.text}") to call pytest.fail(...) or
raise an assertion (e.g., pytest.fail with a message including
response.status_code and response.text) so pytest marks the test as failed.

Comment on lines +97 to +100
except Exception as e:
print(f"Test FAILED with exception: {type(e).__name__}: {e}")
import traceback
traceback.print_exc()

⚠️ Potential issue | 🟡 Minor

Exception handler swallows failures - test will pass in pytest.

The outer exception handler catches all exceptions, prints them, but doesn't re-raise. This means any exception during the test will result in pytest reporting the test as passed.

Proposed fix
     except Exception as e:
         print(f"Test FAILED with exception: {type(e).__name__}: {e}")
         import traceback
         traceback.print_exc()
+        raise  # Re-raise so pytest reports the failure
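The swallow-versus-propagate difference can be isolated in a small sketch; both wrappers below are hypothetical illustrations of the pattern:

```python
def swallowed(fn):
    """Anti-pattern: the failure is printed but never propagates,
    so a test runner records a pass."""
    try:
        fn()
    except Exception as e:
        print(f"Test FAILED with exception: {type(e).__name__}: {e}")

def propagated(fn):
    """Fixed: re-raise after logging so the runner sees the failure."""
    try:
        fn()
    except Exception as e:
        print(f"Test FAILED with exception: {type(e).__name__}: {e}")
        raise
```
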
🧰 Tools
🪛 Ruff (0.14.11)

97-97: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@server/tests/test_responses_api.py` around lines 97 - 100, The broad except
Exception as e block currently prints the error and traceback but does not
re-raise, causing pytest to treat failures as passes; change the handler to
either re-raise the exception (raise) after printing or call pytest.fail(...)
with the exception message so the test fails properly; modify the except
Exception as e block in the test (the try/except around the test logic) to
re-raise or call pytest.fail(str(e)) so failures are propagated to pytest.

@madclaws madclaws closed this Feb 2, 2026