feat(api): Add Responses API - Non-streaming endpoint with structured output#66
bxff wants to merge 4 commits into tilesprivacy:main
Conversation
…tput and LoRA support
This cleanup phase focuses on making the new mlx-engine implementation more resilient and easier to maintain long-term.

Relative Imports and Packaging:
- Switched to package-relative imports across the mlx_engine directory
- Removed runtime sys.path hacks in the runner
- Cleaned up broken vision model entries and pruned unused source files

Better System Integration:
- Used psutil for actual RAM detection instead of hardcoded guesses
- Added torch and psutil as direct dependencies
- Improved memory check logic with proper headroom calculation

Robustness Guards:
- Fixed a bug in dill.py that accidentally cleared the original tokenizer's state
- Added safety checks for zero-height images and empty config lists
- Implemented hasattr guards for model-specific weight patching
- Moved vision add-ons to a formal BaseVisionAddOn interface

General Polish:
- Aligned docstring parameters with actual function signatures
- Fixed a typo in the prompt processing comments
- Preserved EOS token order during deduplication
- Added necessary checks to the model path loading in tests
…ctured completions
📝 Walkthrough

Adds a non-streaming /v1/responses API endpoint and ResponseRequest/ResponseResponse schemas, and implements a full MLX-based generation engine under server/backend/mlx_engine. The engine provides ModelKit/VisionModelKit loaders, KV-cache management, speculative decoding, multimodal vision add-ons, generation (streaming and batch), token utilities, and many helpers (quantization, image processing, stop-string detection, logging, etc.). MLXRunner and mlx_backend are updated to delegate to the engine; new tests exercise inference, LoRA adapter handling, and the responses API. Dependency pins and utilities to disable HF downloads and adjust tokenizers were added.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant API as FastAPI /v1/responses
    participant Backend as mlx_backend.generate_response
    participant Engine as MLX Engine (create_generator)
    participant Model as ModelKit / VisionModelKit
    Client->>API: POST /v1/responses (ResponseRequest)
    API->>Backend: create_response(request)
    Backend->>Engine: generate_response(request) / create_generator(...)
    Engine->>Model: load_model / process_prompt
    Model-->>Engine: processed prompt tokens (+embeddings)
    Engine->>Model: stream_generate / generate_batch (with json_schema/stop strings)
    loop token generation
        Model->>Engine: sampled tokens / top_logprobs
        Engine->>Engine: apply processors (stop, repetition, speculative)
    end
    Engine-->>Backend: full chat.completion payload
    Backend-->>API: HTTP 200 JSON response
    API-->>Client: response body (chat.completion)
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 9
🤖 Fix all issues with AI agents
In `@server/api.py`:
- Around lines 84-92: In create_response, the except block calls logger.error and re-raises HTTPException without chaining. Change it to call logger.exception("Error in generate_response") so the traceback is logged, and re-raise with `raise HTTPException(status_code=500, detail=str(e)) from e` so the original exception is properly chained. The relevant try/except wraps runtime.backend.generate_response inside create_response.
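A minimal sketch of the suggested pattern. The `HTTPException` class below is a stand-in so the snippet runs without FastAPI installed; the real endpoint would import it from fastapi, and `generate_response` here stands in for the backend call:

```python
import logging

logger = logging.getLogger("app")

class HTTPException(Exception):
    # Stand-in for fastapi.HTTPException so this sketch is self-contained
    def __init__(self, status_code: int, detail: str):
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail

def create_response(generate_response, request):
    try:
        return generate_response(request)
    except Exception as e:
        # logger.exception records the full traceback automatically;
        # "from e" chains the original exception as __cause__
        logger.exception("Error in generate_response")
        raise HTTPException(status_code=500, detail=str(e)) from e
```

With chaining, the original error stays reachable as `exc.__cause__`, so logs and debuggers show both failures instead of only the generic 500.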
In `@server/backend/mlx_backend.py`:
- Around lines 182-183: The async function generate_response calls the blocking runner.generate_batch synchronously, which blocks the event loop. Either wrap the call with asyncio.to_thread (e.g. `result = await asyncio.to_thread(runner.generate_batch, ...)`, adding `import asyncio` if missing) while keeping the function async and returning the same Dict[str, Any] from the awaited result, or make generate_response a regular synchronous function if the API layer already dispatches to threads. Apply whichever approach you choose consistently to the runner.generate_batch invocation and any other blocking calls inside generate_response.
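The asyncio.to_thread option can be sketched as follows; `generate_batch` here is a hypothetical stand-in for the blocking runner call, not the real engine API:

```python
import asyncio
import time
from typing import Any, Dict

def generate_batch(prompt: str) -> Dict[str, Any]:
    # Hypothetical stand-in for the blocking runner.generate_batch call
    time.sleep(0.05)
    return {"text": prompt.upper()}

async def generate_response(prompt: str) -> Dict[str, Any]:
    # Run the blocking call in a worker thread so the event loop
    # can keep serving other requests in the meantime
    return await asyncio.to_thread(generate_batch, prompt)

result = asyncio.run(generate_response("hello"))
```

Because `to_thread` returns an awaitable, the function signature and return type stay unchanged; only the blocking call moves off the loop.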
- Around lines 216-223: The stop-sequence loop over request.stop / stop_sequences breaks on the first matching stop string, which can truncate at the wrong place when a later entry in the list occurs earlier in the text. Instead, compute the earliest occurrence among all stop sequences: for each entry in stop_sequences, use response_text.find() to get its index if present, track the smallest non-negative index, and only after the loop slice response_text = response_text[:min_index] if a match was found. Keep the existing request.stop handling and variable names (request.stop, stop_sequences, response_text).
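The earliest-occurrence logic might look like this (the helper name is illustrative; the real code would operate on the existing variables inline):

```python
from typing import Optional, Sequence

def truncate_at_earliest_stop(response_text: str,
                              stop_sequences: Sequence[str]) -> str:
    # Track the smallest index at which any stop sequence occurs
    min_index: Optional[int] = None
    for stop in stop_sequences:
        idx = response_text.find(stop)
        if idx != -1 and (min_index is None or idx < min_index):
            min_index = idx
    # Slice only if some stop sequence actually matched
    return response_text[:min_index] if min_index is not None else response_text
```

Slicing once at the minimum index guarantees the cut happens at the first stop appearing in the text, regardless of the order of entries in the list.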
In `@server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py`:
- Around lines 50-55: The code may raise AttributeError when accessing config.vision_config.image_token_id if vision_config is missing or None. Before reading image_token_id, verify that config has a non-None vision_config (e.g. check `hasattr(config, "vision_config") and config.vision_config is not None`, or use getattr with a default), and fall back to None when neither image_token_index nor a valid vision_config.image_token_id exists.
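A sketch of the guarded lookup; the helper name and the SimpleNamespace configs are illustrative stand-ins for the real model config objects:

```python
from types import SimpleNamespace
from typing import Optional

def resolve_image_token_index(config,
                              image_token_index: Optional[int] = None) -> Optional[int]:
    # Prefer an explicitly provided image_token_index
    if image_token_index is not None:
        return image_token_index
    # Only touch image_token_id when vision_config exists and is not None
    vision_config = getattr(config, "vision_config", None)
    if vision_config is not None:
        return getattr(vision_config, "image_token_id", None)
    return None

# Hypothetical configs standing in for real model configs
with_vision = SimpleNamespace(vision_config=SimpleNamespace(image_token_id=7))
without_vision = SimpleNamespace(vision_config=None)
```

The two `getattr` calls with defaults cover both failure modes: an absent attribute and an attribute explicitly set to None.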
In `@server/backend/mlx_engine/stop_string_processor.py`:
- Around lines 154-163: The method _check_partial_text_match constructs and returns a StopStringProcessorResult, which is inconsistent with _check_incomplete_utf8 and _check_full_text_match. Change its return type annotation to Optional[_StoppingCriteriaResult] and return a _StoppingCriteriaResult (with status="partial_match" and stop_string=None) so the function signature and returned object type match the other check methods.
In `@server/backend/mlx_engine/utils/image_utils.py`:
- Around lines 66-78: The resize logic in the max_size block only constrains the larger side, so the other dimension may still exceed max_size. Compute a single `scale = min(max_size[0] / original_width, max_size[1] / original_height, 1.0)`, then set `new_width = int(original_width * scale)` and `new_height = int(original_height * scale)` before calling img.resize(..., PIL.Image.LANCZOS). Keep the logger.info message but report the final new dimensions; use the existing variables img, original_width, original_height, and max_size.
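The single-scale computation can be sketched in isolation (the PIL resize call is omitted, and `fit_within` is an illustrative helper name):

```python
from typing import Tuple

def fit_within(original_width: int, original_height: int,
               max_size: Tuple[int, int]) -> Tuple[int, int]:
    # One scale factor bounds both dimensions at once; the 1.0 term
    # prevents upscaling images that already fit
    scale = min(max_size[0] / original_width,
                max_size[1] / original_height,
                1.0)
    return int(original_width * scale), int(original_height * scale)
```

Because the same scale is applied to both sides, the aspect ratio is preserved and neither dimension can exceed its bound (up to integer truncation).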
In `@server/backend/mlx_engine/utils/logger.py`:
- Around lines 14-15: The module sets the process-wide flag `logging.raiseExceptions = False`, which suppresses logging errors for all loggers in the process. Remove that global assignment and handle logger-specific failures locally instead: either wrap mlx_engine log emission points in try/except, or attach a custom logging.Handler (overriding handleError) to the mlx_engine logger to swallow or log handler-specific exceptions.
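The handler-based option could be sketched like this; `SwallowErrorsHandler` is an illustrative name, not an existing class in the codebase:

```python
import logging

class SwallowErrorsHandler(logging.StreamHandler):
    """Handler whose emission failures never propagate."""

    def handleError(self, record: logging.LogRecord) -> None:
        # Silently drop errors raised while emitting this record,
        # without touching the process-wide logging.raiseExceptions flag
        pass

engine_logger = logging.getLogger("mlx_engine")
engine_logger.addHandler(SwallowErrorsHandler())
```

Other loggers in the process keep the default behavior, since `logging.raiseExceptions` is left at its default.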
In `@server/pyproject.toml`:
- Around lines 10-17: The dependency list mixes exact pins with unpinned packages and uses an outdated outlines pin. Update outlines==1.1.1 to outlines==1.2.9, apply one consistent pinning strategy for all top-level packages (pin torch, psutil, sentencepiece, pillow, transformers, mlx-vlm, and mlx-lm to specific versions, or move them to a separate constraints file), and audit whether torch is still required given mlx-lm/mlx-vlm; if not, remove the torch entry. The goal is a pyproject.toml dependency section (outlines, outlines-core, torch, psutil, sentencepiece, pillow, transformers, mlx-vlm, mlx-lm) that is consistent and reproducible.
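One way the pinned section could look. Only the outlines version is named in this review; every `X.Y.Z` below is a deliberate placeholder to be replaced with the tested release, not a real version:

```toml
[project]
dependencies = [
    "outlines==1.2.9",      # version named in this review
    "outlines-core==X.Y.Z", # placeholder: pin to the tested release
    "psutil==X.Y.Z",        # placeholder
    "sentencepiece==X.Y.Z", # placeholder
    "pillow==X.Y.Z",        # placeholder
    "transformers==X.Y.Z",  # placeholder
    "mlx-vlm==X.Y.Z",       # placeholder
    "mlx-lm==X.Y.Z",        # placeholder
    # torch omitted pending the audit of whether mlx-lm/mlx-vlm require it
]
```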
In `@server/tests/test_responses_api.py`:
- Around lines 21-26: get_model_path can return None for the path, so calling model_path.exists() may raise AttributeError. Guard against a falsy model_path before calling .exists(), e.g. replace the current if-block with `if not model_path or not model_path.exists(): print(...); return`, so the test skips when model_path is None or missing.
🧹 Nitpick comments (47)
server/backend/mlx_engine/utils/fix_mistral_pre_tokenizer.py (2)
41-51: Fragile reliance on private attributes. Accessing `tokenizer._tokenizer` and `tokenizer._tokenizer._tokenizer.pre_tokenizer` couples this code tightly to the internal implementation of `TokenizerWrapper` and `LlamaTokenizer`. These private attributes may change without notice in future library versions. Consider adding a comment documenting the library versions this was tested against, or wrapping the attribute access with `hasattr` checks for graceful degradation.

🛠️ Suggested defensive check

```diff
+    # Access internal tokenizer structure - tested with mlx-lm X.X and transformers X.X
     if not isinstance(tokenizer._tokenizer, LlamaTokenizer):
         logger.info(f"Tokenizer is of type {type(tokenizer._tokenizer)}. Skipping fix.")
         return
+    if not hasattr(tokenizer._tokenizer, '_tokenizer') or not hasattr(tokenizer._tokenizer._tokenizer, 'pre_tokenizer'):
+        logger.warning("Tokenizer internal structure unexpected. Skipping fix.")
+        return
```
52-54: Broad exception catch is acceptable here. The static analysis flags `Exception` as too broad. In this context it's reasonable, since multiple exception types are possible (file I/O errors, JSON parsing errors, attribute errors from internal tokenizer changes, etc.), and the function is designed as a best-effort fix with a non-fatal failure mode. For improved debuggability, consider logging at DEBUG level for expected exceptions and WARNING for unexpected ones, or catching specific exceptions separately.

server/tests/test_lora.py (3)
7-8: Consider using pytest with proper package discovery instead of sys.path manipulation. The `sys.path.append` hack is fragile. Consider configuring pytest with proper package discovery via `pyproject.toml` or a `conftest.py` that handles imports cleanly.
25-25: Remove print statements from tests. Print statements in tests add noise. Test frameworks (unittest, pytest) already report pass/fail status. Remove these debug prints.
Proposed fix

```diff
     mock_load.assert_called_once_with(
         model_path, adapter_path=adapter_path
     )
-    print("LoRA path propagation verification (MLXRunner -> engine): PASSED")
```

And similarly for line 43:

```diff
     self.assertEqual(kwargs.get("adapter_path"), Path("/tmp/fake_adapter"))
-    print("ModelKit LoRA path propagation (ModelKit -> mlx_lm): PASSED")
```
36-43: Consider using a single mock for config loading. The nested `patch` context managers for `Path.read_text` and `json.loads` could be simplified. Also, the `model_kit` variable appears unused, but its construction is the test's purpose (it triggers the mock call). Consider adding `_ = model_kit` or a comment to clarify intent.

Clarify intent

```diff
-    model_kit = ModelKit(Path("/tmp/fake_model"), adapter_path="/tmp/fake_adapter")
+    # Construction triggers the mlx_lm.load call being tested
+    _ = ModelKit(Path("/tmp/fake_model"), adapter_path="/tmp/fake_adapter")
```

server/backend/mlx_engine/vision_model_kit/_transformers_compatibility.py (3)
13-36: File modifications could corrupt config on partial write failure. Writing to the same file that was just read risks data loss if the write fails mid-operation. Consider writing to a temporary file first and then atomically renaming, or at minimum preserving the original JSON formatting with `indent`.

Safer write with formatting preserved

```diff
     preprocessor_config["image_processor_type"] = "Qwen2VLImageProcessor"
     with open(model_path / "preprocessor_config.json", "w") as f:
-        json.dump(preprocessor_config, f)
+        json.dump(preprocessor_config, f, indent=2)
```
15-18: Redundant file read: `preprocessor_config.json` is read twice. The file is read at lines 15-16 to check `image_processor_type`, then read again at lines 32-33 to modify. Consider reading once and reusing the data.

Consolidate file reads

```diff
 def fix_qwen2_5_vl_image_processor(model_path: Path):
     try:
         with open(model_path / "preprocessor_config.json", "r") as f:
-            image_processor_type = json.load(f)["image_processor_type"]
+            preprocessor_config = json.load(f)
+            image_processor_type = preprocessor_config.get("image_processor_type")
         with open(model_path / "config.json", "r") as f:
             model_type = json.load(f)["model_type"]
-    except:  # noqa: E722
+    except (FileNotFoundError, KeyError, json.JSONDecodeError):
         return
     if not (
         image_processor_type == "Qwen2_5_VLImageProcessor"
         and model_type == "qwen2_5_vl"
     ):
         return
     # Replace image_processor_type with Qwen2VLImageProcessor
     logger.warning(
         "Replacing `image_processor_type` with Qwen2VLImageProcessor in preprocessor_config.json"
     )
-    with open(model_path / "preprocessor_config.json", "r") as f:
-        preprocessor_config = json.load(f)
     preprocessor_config["image_processor_type"] = "Qwen2VLImageProcessor"
     with open(model_path / "preprocessor_config.json", "w") as f:
-        json.dump(preprocessor_config, f)
+        json.dump(preprocessor_config, f, indent=2)
```
50-51: Accessing a dict key solely to check existence; use the `in` operator instead. Line 51 reads the JSON and accesses `["size"]` only to trigger a `KeyError` if missing. This is indirect; use explicit key checking for clarity.

Explicit key check

```diff
     with open(model_path / "preprocessor_config.json", "r") as f:
-        json.load(f)["size"]
+        preprocessor_config = json.load(f)
+        if "size" not in preprocessor_config:
+            return
 except:  # noqa: E722
     return
+# ... then use preprocessor_config directly below instead of re-reading
```

server/backend/mlx_engine/utils/image_utils.py (1)
91-97: Unused loop variable `i`. The static analysis correctly identifies that `i` is not used in this loop. Since only the image object is needed, drop `enumerate` to avoid the intentionally unused variable.

♻️ Proposed fix

```diff
-    for i, img in enumerate(resized_images):
+    for img in resized_images:
         new_img = PIL.Image.new("RGB", (max_width, max_height), (0, 0, 0))
```

server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
39-43: Avoid using list comprehension for side effects. Line 41 uses a list comprehension purely for its side effects (`add_token`), discarding the resulting list. This is non-idiomatic and allocates memory unnecessarily.

♻️ Proposed fix

```diff
     detokenizer = processor.detokenizer
     detokenizer.reset()
-    [detokenizer.add_token(token) for token in prompt_tokens]
+    for token in prompt_tokens:
+        detokenizer.add_token(token)
     detokenizer.finalize()
     prompt = detokenizer.text
```

server/backend/mlx_engine/external/datasets/dill.py (1)
113-116: Remove unused `noqa` directive. The `# noqa: E721` comment on line 115 is flagged as unused by Ruff (RUF100). The `type(obj) is not str` comparison using `is` is actually the idiomatic way to check an exact type match, and modern linters don't flag this.

♻️ Proposed fix

```diff
     def memoize(self, obj):
         # Don't memoize strings since two identical strings can have different Python ids
-        if type(obj) is not str:  # noqa: E721
+        if type(obj) is not str:
             dill.Pickler.memoize(self, obj)
```

server/backend/mlx_engine/stop_string_processor.py (1)
56-59: Consider making `stop_strings` immutable or document mutation behavior. The check at line 56 is defensive, since `__init__` validates non-empty `stop_strings`. However, since `self.stop_strings` is a mutable list, external code could modify it post-construction. Consider using `tuple` instead of `list` for immutability, or document the expected behavior.

Optional: Use tuple for immutability

```diff
-        self.stop_strings = stop_strings
+        self.stop_strings = tuple(stop_strings)
```

server/backend/mlx_runner.py (2)
160-168: Token existence check may give false positives for multi-token sequences. The check `tokenizer.encode(token, ...)` returns non-empty if the token can be encoded, but this doesn't guarantee the token exists as a single vocabulary entry. A token like `<|end|>` might encode to multiple IDs. Consider checking whether the encoded result matches a single known token ID. This is likely acceptable for stop string matching, since the StopStringProcessor handles multi-token sequences, but it is worth noting.

420-435: Memory calculation assumes unified memory architecture. The function mixes MLX active memory with system RAM from psutil. This works correctly on Apple Silicon's unified memory, but the function name and docstring don't clarify this assumption. Consider documenting that this is designed for unified memory systems.

```diff
 def check_memory_available(required_gb: float) -> bool:
-    """Pre-flight check before model loading using actual system RAM."""
+    """Pre-flight check before model loading using actual system RAM.
+
+    Note: Assumes unified memory architecture (e.g., Apple Silicon) where
+    GPU memory and system RAM share the same pool.
+    """
```
server/schemas.py (1)

80-86: Consider making `usage` required for consistency. The `usage` field is `Optional` here but required in `CompletionResponse`. Based on the backend implementation in `mlx_backend.py`, `usage` is always returned with `prompt_tokens`, `completion_tokens`, and `total_tokens`. Consider aligning with `CompletionResponse` for API consistency.

Optional: Make usage required

```diff
 class ResponseResponse(BaseModel):
     id: str
     object: str = "chat.completion"
     created: int
     model: str
     choices: List[Dict[str, Any]]
-    usage: Optional[Dict[str, int]] = None
+    usage: Dict[str, int]
```

server/backend/mlx_engine/utils/logger.py (1)
17-18: Minor: Comment refers to "root logger" but this is a named logger. The comment says "Configure root logger for mlx_engine" but `logging.getLogger("mlx_engine")` creates a named logger, not the root logger. Consider updating it to "Configure logger for mlx_engine" for accuracy.

server/backend/mlx_engine/utils/top_logprobs.py (2)
1-3: Import ordering: local imports should come after third-party. Per PEP 8 convention, third-party imports (`mlx.core`) should come before local imports (`.token`).

Suggested reordering

```diff
-from .token import Token
-
 import mlx.core as mx
+
+from .token import Token
```

23-28: Add `strict=True` to `zip()` for defensive coding. While the three lists are derived from the same source and guaranteed to be equal length, adding `strict=True` documents this expectation and will catch any future refactoring errors. This also addresses the Ruff B905 hint.

Proposed fix

```diff
     return [
         Token(id=int(idx), text=txt, logprob=float(prob))
         for idx, txt, prob in zip(
-            top_indices.tolist(), text_list, top_logprob_values.tolist()
+            top_indices.tolist(), text_list, top_logprob_values.tolist(), strict=True
         )
     ]
```

server/runtime.py (2)
2-4: Unused logger: imported but never used. The `logging` module is imported and a logger is created, but it's never used in this file. Consider removing it, or using it to log backend selection (see next comment).

16-17: Consider logging a warning when no backend is available. Returning `None` silently on unsupported platforms may lead to confusing downstream errors. A warning would help diagnose deployment issues.

Proposed enhancement

```diff
+import sys
+import logging
+
+logger = logging.getLogger("app")
+
 def get_backend():
     """
     Dynamically choose which backend should be used depending on the OS
     """
     if sys.platform == "darwin":
         from .backend import mlx_backend
         return mlx_backend
     elif sys.platform.startswith("linux"):
         from .backend import linux
         return linux
     else:
+        logger.warning(f"No backend available for platform: {sys.platform}")
         return None
```

server/tests/test_inference.py (4)
5-6: Consider using proper package installation instead of sys.path manipulation. The `sys.path.append` hack is brittle and can cause import issues. Since the PR summary mentions removing runtime sys.path hacks, consider installing the package in editable mode (`pip install -e .`) or using pytest's `pythonpath` configuration.

16-19: Prefix unused variables with underscore. Per static analysis, `model_name` and `commit_hash` are unpacked but never used.

Suggested fix

```diff
-    model_path, model_name, commit_hash = get_model_path(model_spec)
+    model_path, _model_name, _commit_hash = get_model_path(model_spec)
```

11-19: Consider converting to a proper pytest test with `pytest.skip()`. The early `return` pattern makes this more of a manual script than a pytest-compatible test. For proper CI integration:

Suggested approach

```diff
+import pytest
+
 def test_inference():
     model_spec = "driaforall/mem-agent-mlx-4bit"
-    print(f"Testing inference with model: {model_spec}")
-
-    try:
-        model_path, model_name, commit_hash = get_model_path(model_spec)
-        if model_path is None or not model_path.exists():
-            print(f"Error: Model {model_spec} not found in cache. Please download it first.")
-            return
+    model_path, _, _ = get_model_path(model_spec)
+    if model_path is None or not model_path.exists():
+        pytest.skip(f"Model {model_spec} not found in cache")
```

44-44: Move `import json` to the top of the file. The `json` module is imported mid-function. Moving it to the top-level imports improves readability and follows Python conventions.

server/tests/test_responses_api.py (1)
5-6: Same `sys.path` hack as `test_inference.py`. Consider addressing both test files together when refactoring to proper package installation or pytest configuration.
server/backend/mlx_engine/__init__.py (1)
5-12: Consider sorting `__all__` alphabetically. Static analysis (RUF022) suggests sorting for consistency. This is a minor style preference.

Sorted version

```diff
 __all__ = [
+    "create_generator",
+    "is_draft_model_compatible",
+    "load_draft_model",
     "load_model",
-    "load_draft_model",
-    "is_draft_model_compatible",
+    "tokenize",
     "unload_draft_model",
-    "create_generator",
-    "tokenize",
 ]
```

server/backend/mlx_backend.py (2)
117-124: Duplicated json_schema extraction logic. This block is nearly identical to lines 194-201 in `generate_response`. Consider extracting a helper function.

Suggested helper

```python
def _extract_json_schema(response_format: Optional[Dict[str, Any]]) -> Optional[str]:
    """Extract JSON schema string from response_format dict."""
    if not response_format:
        return None
    if response_format.get("type") == "json_schema":
        schema_info = response_format.get("json_schema", {})
        return json.dumps(schema_info.get("schema", {}))
    elif response_format.get("type") == "json_object":
        return "{}"
    return None
```

257-259: Token count is a rough approximation. The `* 1.3` heuristic can be significantly inaccurate (subword tokenizers vary widely). Since the runner has access to the actual tokenizer via `model_kit.tokenizer`, consider using it for accurate usage reporting:

Suggested approach

```python
def count_tokens(runner: MLXRunner, text: str) -> int:
    """Count tokens using the model's tokenizer."""
    if runner.model_kit and hasattr(runner.model_kit, 'tokenizer'):
        return len(runner.model_kit.tokenizer.encode(text))
    # Fallback approximation
    return int(len(text.split()) * 1.3)
```

Then update the call sites in `generate_response` to pass the runner.

server/backend/mlx_engine/utils/prompt_processing.py (1)
9-16: Add return type annotation for consistency. The function parameters have type hints, but the return type is missing. Based on the implementation, it returns an `mx.array`.

Suggested fix

```diff
 def process_prompt_text_only(
     prompt_tokens: mx.array,
     cache_wrapper: CacheWrapper,
     generate_args: dict,
     draft_model: Optional[nn.Module],
     speculative_decoding_toggle: Optional[bool],
     prompt_progress_callback: Optional[Callable[[float], bool]],
-):
+) -> mx.array:
```

server/backend/mlx_engine/utils/disable_hf_download.py (1)
25-27: Redundant patch: both lines modify the same object. `huggingface_hub` and `sys.modules["huggingface_hub"]` reference the same module object, so line 27 is redundant. The comment suggests wanting to catch "other imports," but all imports of `huggingface_hub` resolve to the same cached module in `sys.modules`.

Simplified version

```diff
 def patch_huggingface_hub():
     """
     Patch the huggingface_hub module to use our local-only snapshot_download.
     This ensures that any import of snapshot_download from huggingface_hub
     will use our wrapped version.
     """
     huggingface_hub.snapshot_download = snapshot_download
-    # Also patch the module in sys.modules to ensure any other imports get our version
-    sys.modules["huggingface_hub"].snapshot_download = snapshot_download
```

server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py (1)
55-63: Prefix unused variable with underscore. `other_model_inputs` is unpacked but never used, as flagged by static analysis.

Suggested fix

```diff
-        input_ids, pixel_values, attention_mask, other_model_inputs = (
+        input_ids, pixel_values, attention_mask, _other_model_inputs = (
             common_process_prompt_with_images(
                 prompt_tokens=prompt_tokens,
                 images_b64=images_b64,
                 processor=self.processor,
                 config=self.config,
                 max_size=max_size,
             )
         )
```

server/backend/mlx_engine/utils/eot_tokens.py (1)
54-59: Avoid shadowing the built-in `id`. Using `id` as a variable name shadows Python's built-in function.

Suggested fix

```diff
-    for id in tokenizer.eos_token_ids:
-        text = tokenizer.decode(id)
+    for token_id in tokenizer.eos_token_ids:
+        text = tokenizer.decode(token_id)
         # Specific override for RNJ-1
-        if model_kit.model_type == "gemma3_text" and id == 1 and text == '"':
+        if model_kit.model_type == "gemma3_text" and token_id == 1 and text == '"':
             continue
-        temp_tokens.add(id)
+        temp_tokens.add(token_id)
```

server/backend/mlx_engine/utils/progress_decorators.py (1)
19-19: Minor: Replace en-dash with hyphen-minus in docstrings. Ruff flagged ambiguous en-dash characters (`–`) in docstrings at lines 19, 52, and 82. While not a functional issue, using a standard hyphen-minus (`-`) improves consistency.

Suggested fix

```diff
-        callback: A callback that accepts progress (0.0–100.0) and returns
+        callback: A callback that accepts progress (0.0-100.0) and returns
```

Apply the same fix to lines 52 and 82.
server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py (1)
57-65: Prefix unused variables with underscore. The `attention_mask` and `other_model_inputs` variables are unpacked but never used, as flagged by static analysis.

Suggested fix

```diff
-        input_ids, pixel_values, attention_mask, other_model_inputs = (
+        input_ids, pixel_values, _attention_mask, _other_model_inputs = (
             common_process_prompt_with_images(
                 prompt_tokens=prompt_tokens,
                 images_b64=images_b64,
                 processor=self.processor,
                 config=self.config,
                 max_size=max_size,
             )
         )
```

server/backend/mlx_engine/model_kit/vision_add_ons/mistral3.py (1)
120-124: Consider making the model path check more robust. The string matching approach works but is fragile if the model path changes slightly. Consider also checking for partial matches or making this configurable.

Alternative implementation

```diff
     @staticmethod
     def _is_lmstudio_mistral_3_2_small(model_path: Path) -> bool:
-        return "lmstudio-community/Mistral-Small-3.2-24B-Instruct-2506-MLX" in str(
-            model_path
-        )
+        path_str = str(model_path).lower()
+        return "mistral-small-3.2" in path_str and "lmstudio" in path_str
```

server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py (4)
101-107: Use a dict literal instead of zip for a single key-value pair. The `zip()` with a single-element list is overly complex and triggers a static analysis warning about the missing `strict=` parameter.

Suggested fix

```diff
         if outputs.cross_attention_states is not None:
-            self.language_model_kwargs = {
-                k: v
-                for k, v in zip(
-                    ["cross_attention_states"], [outputs.cross_attention_states]
-                )
-            }
+            self.language_model_kwargs = {
+                "cross_attention_states": outputs.cross_attention_states
+            }
```

140-146: Use exception chaining for better debugging. When re-raising with a new exception or a bare re-raise, use proper chaining to preserve the stack trace.

Suggested fix

```diff
         except ValueError as e:
             # Create a friendly error message if a user tries to use mllama without images
             if "Cross attention states must be provided for layer" in str(e):
                 raise ValueError(
                     "Using this model without any images attached is not supported yet."
-                )
-            raise e
+                ) from e
+            raise
```

170-173: Avoid list comprehension for side effects. Using a list comprehension solely for its side effects is non-idiomatic Python. Use a regular for loop instead.

Suggested fix

```diff
         detokenizer.reset()
-        [detokenizer.add_token(token) for token in prompt_tokens]
+        for token in prompt_tokens:
+            detokenizer.add_token(token)
         detokenizer.finalize()
```

186-191: Use exception chaining for the image requirement error. Same issue as lines 140-146: use `from e` for proper exception chaining.

Suggested fix

```diff
         except ValueError as e:
             if "`images` are expected as arguments" in str(e):
                 raise ValueError(
                     "Using this model without any images attached is not supported yet."
-                )
-            raise e
+                ) from e
+            raise
```

server/backend/mlx_engine/vision_model_kit/vision_model_kit.py (2)
24-31: Consider using instance attributes instead of class-level defaults. Class-level attribute declarations with mutable defaults can cause unexpected sharing between instances. While `None` is safe, explicit instance initialization in `__init__` would be clearer.

Suggested refactor in __init__

```diff
     def __init__(
         self,
         model_path: Path,
         vocab_only: bool,
         trust_remote_code: bool,
     ):
+        self.config = None
+        self.model_weights = None
+        self.processor = None
+        self.has_processed_prompt = False
         fix_qwen2_5_vl_image_processor(model_path)
         fix_qwen2_vl_preprocessor(model_path)
```

And remove the class-level declarations at lines 24-31.

145-156: Prefix the unused path parameter with an underscore for clarity. The `path` parameter in `is_draft_model_compatible` and `load_draft_model` is unused, since speculative decoding is not supported. Prefixing with an underscore documents this as intentional.

Suggested fix

```diff
-    def is_draft_model_compatible(self, path: str | Path) -> bool:
+    def is_draft_model_compatible(self, _path: str | Path) -> bool:
         return False

-    def load_draft_model(self, path: str | Path) -> None:
+    def load_draft_model(self, _path: str | Path) -> None:
         raise ValueError(
             "Speculative decoding is not currently supported for vision models"
         )
```

server/backend/mlx_engine/cache_wrapper.py (1)
89-90: Remove redundant self-assignments. These lines assign variables to themselves and serve no purpose.

🔧 Suggested fix

```diff
 def _find_common_prefix(
     current_tokens: mx.array, prompt_tokens: mx.array, num_tokens_to_exclude: int
 ) -> int:
     """
     Determine the common prefix length between the current tokens and the prompt tokens.

     Args:
         current_tokens (mx.array): The cached tokens (self.tokens).
         prompt_tokens (mx.array): The prompt tokens.
         num_tokens_to_exclude (int): The minimum length of the remaining prompt tokens array.

     Returns:
         int: The length of the common prefix.
     """
-    prompt_tokens = prompt_tokens
-    current_tokens = current_tokens
     # Find the minimum length between the two arrays
     min_length = min(len(current_tokens), len(prompt_tokens))
```

server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py (1)
143-156: Potential `KeyError` if `quantization` is present but missing expected keys. On line 148, `config_dict["quantization"]` assumes the quantization dict exists when `get_class_predicate` is called. While this is guarded by the outer check on line 138, if `quantization` is set to an unexpected value (e.g., an empty or malformed dict), accessing `config_dict["quantization"][p]` on line 149 could behave unexpectedly.

🔧 Safer access pattern

```diff
     def get_class_predicate(p, m):
         # Always skip vision and audio models
         if skip_multimodal_module(p) and skip_vision:
             return False
         # Handle custom per layer quantizations
-        if p in config_dict["quantization"]:
-            return config_dict["quantization"][p]
+        if p in quantization:
+            return quantization[p]
         if not hasattr(m, "to_quantized"):
             return False
```

server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py (2)
104-112: Prefix unused variables with underscores.
`attention_mask` and `other_model_inputs` are unpacked but never used. Prefix them with underscores to indicate they are intentionally unused.
🔧 Suggested fix
-    input_ids, pixel_values, attention_mask, other_model_inputs = (
+    input_ids, pixel_values, _attention_mask, _other_model_inputs = (
         common_process_prompt_with_images(
             prompt_tokens=prompt_tokens,
             images_b64=images_b64,
             processor=self.processor,
             config=self.config,
             max_size=max_size,
         )
     )
113-113: Consider replacing `assert` with explicit error handling. Assertions can be disabled in production (e.g., with `python -O`). If `input_ids` being `None` is a genuine error condition, use an explicit check with a descriptive exception.
🔧 Suggested fix
-    assert input_ids is not None
+    if input_ids is None:
+        raise ValueError("Failed to process prompt: input_ids is None")
server/backend/mlx_engine/model_kit/model_kit.py (1)
35-57: Annotate `VISION_ADD_ON_MAP` with `ClassVar`. The static analysis correctly identifies that mutable class attributes should be annotated with `typing.ClassVar` to clarify they are class-level constants, not instance attributes.
🔧 Suggested fix
+from typing import Callable, ClassVar, Optional, List, Tuple, Type
-from typing import Callable, Optional, List, Tuple

 class ModelKit:
     """
     Collection of objects and methods that are needed for operating a model.
     ...
     """

-    VISION_ADD_ON_MAP = {
+    VISION_ADD_ON_MAP: ClassVar[dict[str, Type[BaseVisionAddOn]]] = {
         "gemma3": Gemma3VisionAddOn,
         "gemma3n": Gemma3nVisionAddOn,
         "mistral3": Mistral3VisionAddOn,
         "pixtral": PixtralVisionAddOn,
     }
server/backend/mlx_engine/generate.py (1)
111-111: Use a generator expression instead of a list in `any()`. Creating a list is unnecessary when `any()` can short-circuit with a generator expression.
🔧 Suggested fix
- if any([kv_bits, kv_group_size, quantized_kv_start]):
+ if any((kv_bits, kv_group_size, quantized_kv_start)):
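For context, the short-circuit benefit only materializes with a generator (or, for three plain flags, a simple `or` chain); this toy demonstration uses illustrative names, not the codebase's:

```python
def make_probe():
    """Track which values any() actually evaluates."""
    seen = []

    def probe(x):
        seen.append(x)
        return x

    return probe, seen


probe, seen = make_probe()
assert any(probe(v) for v in [0, 3, 5])  # generator: stops at first truthy value
assert seen == [0, 3]                    # 5 was never evaluated

probe, seen = make_probe()
assert any([probe(v) for v in [0, 3, 5]])  # list: fully built before any() runs
assert seen == [0, 3, 5]
```

Note that for the three scalar arguments in the diff above, the plain `if kv_bits or kv_group_size or quantized_kv_start:` form is equivalent and simplest.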
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`server/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (43)
- server/api.py
- server/backend/mlx_backend.py
- server/backend/mlx_engine/__init__.py
- server/backend/mlx_engine/cache_wrapper.py
- server/backend/mlx_engine/external/datasets/dill.py
- server/backend/mlx_engine/generate.py
- server/backend/mlx_engine/model_kit/__init__.py
- server/backend/mlx_engine/model_kit/model_kit.py
- server/backend/mlx_engine/model_kit/patches/gemma3n.py
- server/backend/mlx_engine/model_kit/vision_add_ons/base.py
- server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py
- server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py
- server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py
- server/backend/mlx_engine/model_kit/vision_add_ons/mistral3.py
- server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py
- server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py
- server/backend/mlx_engine/processors/repetition_penalty_processor.py
- server/backend/mlx_engine/stop_string_processor.py
- server/backend/mlx_engine/utils/disable_hf_download.py
- server/backend/mlx_engine/utils/eot_tokens.py
- server/backend/mlx_engine/utils/fix_mistral_pre_tokenizer.py
- server/backend/mlx_engine/utils/image_utils.py
- server/backend/mlx_engine/utils/kv_cache_quantization.py
- server/backend/mlx_engine/utils/logger.py
- server/backend/mlx_engine/utils/outlines_transformer_tokenizer.py
- server/backend/mlx_engine/utils/progress_decorators.py
- server/backend/mlx_engine/utils/prompt_processing.py
- server/backend/mlx_engine/utils/register_models.py
- server/backend/mlx_engine/utils/set_seed.py
- server/backend/mlx_engine/utils/speculative_decoding.py
- server/backend/mlx_engine/utils/token.py
- server/backend/mlx_engine/utils/top_logprobs.py
- server/backend/mlx_engine/vision_model_kit/_transformers_compatibility.py
- server/backend/mlx_engine/vision_model_kit/vision_model_kit.py
- server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py
- server/backend/mlx_runner.py
- server/main.py
- server/pyproject.toml
- server/runtime.py
- server/schemas.py
- server/tests/test_inference.py
- server/tests/test_lora.py
- server/tests/test_responses_api.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}
⚙️ CodeRabbit configuration file
Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.
Files:
server/pyproject.toml
🧬 Code graph analysis (18)
server/backend/mlx_engine/utils/outlines_transformer_tokenizer.py (1)
server/backend/mlx_engine/external/datasets/dill.py (2)
Hasher(27-54), hash(44-45)
server/backend/mlx_engine/utils/prompt_processing.py (1)
server/backend/mlx_engine/cache_wrapper.py (5)
CacheWrapper(25-334), prompt_progress_callback(291-292), set_draft_model(231-261), unset_draft_model(263-268), update_cache(270-328)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py (5)
server/backend/mlx_engine/model_kit/vision_add_ons/base.py (2)
BaseVisionAddOn(7-34), compute_embeddings(19-34)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
common_process_prompt_with_images(19-79)
server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py (1)
load_vision_addon(191-257)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py (1)
compute_embeddings(96-159)
server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py (1)
compute_embeddings(49-91)
server/backend/mlx_engine/utils/progress_decorators.py (1)
server/backend/mlx_engine/cache_wrapper.py (1)
StopPromptProcessing(19-22)
server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py (2)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
common_process_prompt_with_images(19-79)
server/backend/mlx_engine/vision_model_kit/vision_model_kit.py (2)
language_model(159-160), record_sampled_token(142-143)
server/backend/mlx_engine/model_kit/vision_add_ons/base.py (4)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py (1)
compute_embeddings(47-87)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py (1)
compute_embeddings(96-159)
server/backend/mlx_engine/model_kit/vision_add_ons/mistral3.py (1)
compute_embeddings(57-118)
server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py (1)
compute_embeddings(49-91)
server/backend/mlx_backend.py (2)
server/schemas.py (4)
ChatMessage(15-17), ChatCompletionRequest(20-31), downloadRequest(65-66), ResponseRequest(69-77)
server/backend/mlx_runner.py (3)
_format_conversation(246-263), generate_batch(242-244), get_effective_max_tokens(139-147)
server/backend/mlx_engine/model_kit/__init__.py (1)
server/backend/mlx_engine/model_kit/patches/gemma3n.py (1)
apply_patches(58-65)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
server/backend/mlx_engine/utils/image_utils.py (2)
convert_to_pil(10-15), custom_resize(18-100)
server/backend/mlx_engine/utils/top_logprobs.py (1)
server/backend/mlx_engine/utils/token.py (1)
Token(6-14)
server/backend/mlx_engine/utils/eot_tokens.py (2)
server/backend/mlx_engine/model_kit/model_kit.py (1)
ModelKit(22-236)
server/backend/mlx_engine/vision_model_kit/vision_model_kit.py (1)
VisionModelKit(19-160)
server/tests/test_inference.py (3)
server/backend/mlx_runner.py (3)
MLXRunner(55-337), generate_streaming(170-240), get_memory_usage(265-280)
server/cache_utils.py (1)
get_model_path(173-191)
server/backend/mlx_engine/external/datasets/dill.py (1)
dumps(134-138)
server/backend/mlx_engine/utils/fix_mistral_pre_tokenizer.py (2)
server/backend/mlx_engine/generate.py (1)
tokenize(477-490)
server/backend/mlx_engine/model_kit/model_kit.py (1)
tokenize(136-140)
server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py (1)
server/backend/mlx_engine/model_kit/patches/gemma3n.py (1)
from_dict(25-36)
server/api.py (2)
server/schemas.py (2)
ChatMessage(15-17), ResponseRequest(69-77)
server/backend/mlx_backend.py (1)
generate_response(182-245)
server/backend/mlx_runner.py (4)
server/backend/mlx_engine/generate.py (2)
create_generator(146-474), tokenize(477-490)
server/backend/mlx_engine/model_kit/model_kit.py (1)
tokenize(136-140)
server/backend/mlx_engine/cache_wrapper.py (1)
StopPromptProcessing(19-22)
server/reasoning_utils.py (3)
ReasoningExtractor(15-171), StreamingReasoningParser(245-438), finalize(424-438)
server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py (6)
server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py (2)
process_prompt_with_images(154-206), language_model(213-214)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
common_process_prompt_with_images(19-79)
server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py (1)
load_vision_addon(191-257)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py (1)
compute_embeddings(47-87)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py (1)
compute_embeddings(96-159)
server/backend/mlx_engine/model_kit/vision_add_ons/mistral3.py (1)
compute_embeddings(57-118)
server/backend/mlx_engine/vision_model_kit/vision_model_kit.py (3)
server/backend/mlx_engine/model_kit/model_kit.py (9)
ModelKit(22-236), _vocab_only_init(59-63), _full_model_init(65-112), process_prompt(142-178), is_cross_prompt_cache_active(180-185), record_token_to_cache(187-188), is_draft_model_compatible(203-216), load_draft_model(218-227), unload_draft_model(229-236)
server/backend/mlx_engine/vision_model_kit/_transformers_compatibility.py (2)
fix_qwen2_5_vl_image_processor(8-36), fix_qwen2_vl_preprocessor(39-60)
server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py (3)
VisionModelWrapper(13-214), process_prompt_with_images(154-206), language_model(213-214)
🪛 Ruff (0.14.11)
server/backend/mlx_engine/utils/set_seed.py
39-39: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/utils/prompt_processing.py
18-18: Avoid specifying long messages outside the exception class
(TRY003)
27-30: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3.py
55-55: Unpacked variable other_model_inputs is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
server/backend/mlx_engine/utils/kv_cache_quantization.py
29-29: Avoid specifying long messages outside the exception class
(TRY003)
42-42: Avoid specifying long messages outside the exception class
(TRY003)
44-46: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/utils/progress_decorators.py
19-19: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
52-52: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
82-82: Docstring contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF002)
server/backend/mlx_engine/vision_model_kit/vision_model_wrapper.py
61-61: Unused method argument: input_embeddings
(ARG002)
104-106: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
143-145: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
143-145: Avoid specifying long messages outside the exception class
(TRY003)
146-146: Use raise without specifying exception name
Remove exception name
(TRY201)
188-190: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
188-190: Avoid specifying long messages outside the exception class
(TRY003)
191-191: Use raise without specifying exception name
Remove exception name
(TRY201)
server/backend/mlx_engine/__init__.py
5-12: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
server/backend/mlx_engine/cache_wrapper.py
245-245: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/stop_string_processor.py
35-35: Avoid specifying long messages outside the exception class
(TRY003)
38-38: Avoid specifying long messages outside the exception class
(TRY003)
41-41: Avoid specifying long messages outside the exception class
(TRY003)
93-93: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/external/datasets/dill.py
30-30: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
109-109: Do not catch blind exception: Exception
(BLE001)
115-115: Unused noqa directive (unused: E721)
Remove unused noqa directive
(RUF100)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py
38-38: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/utils/top_logprobs.py
25-27: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
server/backend/mlx_engine/utils/eot_tokens.py
14-14: String contains ambiguous | (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?
(RUF001)
14-14: String contains ambiguous | (FULLWIDTH VERTICAL LINE). Did you mean | (VERTICAL LINE)?
(RUF001)
63-65: Avoid specifying long messages outside the exception class
(TRY003)
server/tests/test_inference.py
16-16: Unpacked variable model_name is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
16-16: Unpacked variable commit_hash is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
58-58: Do not catch blind exception: Exception
(BLE001)
61-61: Do not catch blind exception: Exception
(BLE001)
server/backend/mlx_engine/utils/fix_mistral_pre_tokenizer.py
52-52: Do not catch blind exception: Exception
(BLE001)
server/backend/mlx_engine/model_kit/vision_add_ons/load_utils.py
33-33: Avoid specifying long messages outside the exception class
(TRY003)
105-107: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/model_kit/vision_add_ons/gemma3n.py
104-104: Unpacked variable attention_mask is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
104-104: Unpacked variable other_model_inputs is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
server/api.py
89-89: Consider moving this statement to an else block
(TRY300)
90-90: Do not catch blind exception: Exception
(BLE001)
91-91: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
92-92: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
server/backend/mlx_engine/generate.py
112-114: Avoid specifying long messages outside the exception class
(TRY003)
116-116: Avoid specifying long messages outside the exception class
(TRY003)
313-315: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_runner.py
72-72: Avoid specifying long messages outside the exception class
(TRY003)
76-76: Consider moving this statement to an else block
(TRY300)
98-99: try-except-pass detected, consider logging the exception
(S110)
98-98: Do not catch blind exception: Exception
(BLE001)
123-123: Avoid specifying long messages outside the exception class
(TRY003)
136-137: try-except-pass detected, consider logging the exception
(S110)
136-136: Do not catch blind exception: Exception
(BLE001)
166-167: try-except-pass detected, consider logging the exception
(S110)
166-166: Do not catch blind exception: Exception
(BLE001)
186-186: Avoid specifying long messages outside the exception class
(TRY003)
249-249: Avoid specifying long messages outside the exception class
(TRY003)
270-270: Do not catch blind exception: Exception
(BLE001)
416-416: Do not catch blind exception: Exception
(BLE001)
426-426: Do not catch blind exception: Exception
(BLE001)
432-432: Consider moving this statement to an else block
(TRY300)
433-433: Do not catch blind exception: Exception
(BLE001)
server/tests/test_responses_api.py
29-29: Do not catch blind exception: Exception
(BLE001)
97-97: Do not catch blind exception: Exception
(BLE001)
server/tests/test_lora.py
16-16: Probable insecure usage of temporary file or directory: "/tmp/fake_model"
(S108)
17-17: Probable insecure usage of temporary file or directory: "/tmp/fake_adapter"
(S108)
38-38: Local variable model_kit is assigned to but never used
Remove assignment to unused variable model_kit
(F841)
38-38: Probable insecure usage of temporary file or directory: "/tmp/fake_model"
(S108)
38-38: Probable insecure usage of temporary file or directory: "/tmp/fake_adapter"
(S108)
42-42: Probable insecure usage of temporary file or directory: "/tmp/fake_adapter"
(S108)
server/backend/mlx_engine/utils/speculative_decoding.py
21-23: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/model_kit/vision_add_ons/pixtral.py
57-57: Unpacked variable attention_mask is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
57-57: Unpacked variable other_model_inputs is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
server/backend/mlx_engine/model_kit/model_kit.py
35-40: Mutable class attributes should be annotated with typing.ClassVar
(RUF012)
171-173: Avoid specifying long messages outside the exception class
(TRY003)
222-222: Avoid specifying long messages outside the exception class
(TRY003)
224-224: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/vision_model_kit/vision_model_kit.py
106-106: Unused method argument: prompt_progress_callback
(ARG002)
107-107: Unused method argument: generate_args
(ARG002)
109-109: Unused method argument: speculative_decoding_toggle
(ARG002)
145-145: Unused method argument: path
(ARG002)
148-148: Unused method argument: path
(ARG002)
149-151: Avoid specifying long messages outside the exception class
(TRY003)
154-156: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/utils/image_utils.py
49-51: Avoid specifying long messages outside the exception class
(TRY003)
53-53: Avoid specifying long messages outside the exception class
(TRY003)
92-92: Loop control variable i not used within loop body
(B007)
server/backend/mlx_engine/utils/disable_hf_download.py
10-10: Unused function argument: args
(ARG001)
10-10: Unused function argument: kwargs
(ARG001)
14-16: Avoid specifying long messages outside the exception class
(TRY003)
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
… backend consistency
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@server/tests/test_responses_api.py`:
- Around line 97-100: The broad except Exception as e block currently prints the
error and traceback but does not re-raise, causing pytest to treat failures as
passes; change the handler to either re-raise the exception (raise) after
printing or call pytest.fail(...) with the exception message so the test fails
properly; modify the except Exception as e block in the test (the try/except
around the test logic) to re-raise or call pytest.fail(str(e)) so failures are
propagated to pytest.
- Around line 60-62: The test currently prints failure and returns in the
non-200 branch which lets pytest treat the test as passing; import pytest at the
top of the test file and change the early-return behavior in the failure branch
of the test_responses_api test (the block that prints f"Test FAILED:
{response.text}") to call pytest.fail(...) or raise an assertion (e.g.,
pytest.fail with a message including response.status_code and response.text) so
pytest marks the test as failed.
🧹 Nitpick comments (10)
server/pyproject.toml (1)
6-19: Version pinning improvements look good; consider moving `black` to dev dependencies. The consistent explicit pinning across all dependencies and the `outlines` upgrade to 1.2.9 address previous review feedback; nice work. However, `black` is a code formatter and doesn't belong in runtime dependencies. It should be moved to optional/dev dependencies:

 dependencies = [
     "fastapi==0.119.0",
     "uvicorn==0.38.0",
     "mlx-lm==0.28.3",
     "mlx-vlm==0.3.9",
     "outlines==1.2.9",
     "outlines-core==0.1.26",
     "psutil==7.2.1",
     "sentencepiece==0.2.1",
     "pillow==12.1.0",
     "transformers==4.49.0",
-    "black==25.9.0",
     "huggingface-hub==0.35.3",
 ]
+
+[project.optional-dependencies]
+dev = [
+    "black==25.9.0",
+]
server/backend/mlx_engine/utils/image_utils.py (4)
10-15: Consider adding error handling for invalid image data. The function will raise a generic exception if base64 decoding fails or the image data is corrupted. Wrapping with a try-except to provide a clearer error message (e.g., indicating which image index failed) could improve debuggability.
♻️ Optional enhancement
 def convert_to_pil(images_b64: List[str]) -> list[PIL.Image.Image]:
     """Convert a list of base64 strings to PIL Images"""
-    return [
-        PIL.Image.open(BytesIO(base64.b64decode(img))).convert("RGB")
-        for img in images_b64 or []
-    ]
+    result = []
+    for i, img in enumerate(images_b64 or []):
+        try:
+            result.append(PIL.Image.open(BytesIO(base64.b64decode(img))).convert("RGB"))
+        except Exception as e:
+            raise ValueError(f"Failed to decode image at index {i}: {e}") from e
+    return result
43-44: Docstring references sys.stderr but the implementation uses the logger. The side-effects note mentions "Writes progress and status messages to sys.stderr" but the implementation uses `logger.info()` and `logger.warning()`.
📝 Proposed fix
    Side effects:
-        Writes progress and status messages to sys.stderr.
+        Logs progress and status messages via the module logger.
64-64: Remove the unused variable `aspect_ratio`. This variable is computed but never used after the resize logic was refactored to use scale-based calculations.
♻️ Proposed fix
- aspect_ratio = img.width / img.height
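The scale-based calculation mentioned above can be sketched as a minimal standalone function (the real `custom_resize` in `image_utils.py` handles more cases; this is only an illustration of the math):

```python
def fit_within(width: int, height: int, max_size: int) -> tuple[int, int]:
    """Scale dimensions so both fit within max_size, preserving the
    aspect ratio and never upscaling (scale is capped at 1.0)."""
    scale = min(max_size / width, max_size / height, 1.0)
    return round(width * scale), round(height * scale)


assert fit_within(4000, 2000, 1000) == (1000, 500)  # downscaled, ratio kept
assert fit_within(300, 200, 1000) == (300, 200)     # already fits, untouched
```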
89-89: Unused loop variable should be `_`. The loop variable `i` is not used within the loop body.
♻️ Proposed fix
- for i, img in enumerate(resized_images):
+ for img in resized_images:
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
41-41: Avoid a list comprehension for side effects. Using a list comprehension purely for side effects (calling `add_token`) is not idiomatic Python and unnecessarily allocates a list of `None` values. Use a regular for loop instead.
♻️ Proposed fix
-    [detokenizer.add_token(token) for token in prompt_tokens]
+    for token in prompt_tokens:
+        detokenizer.add_token(token)
server/backend/mlx_engine/stop_string_processor.py (3)
24-45: Consider copying `stop_strings` to prevent external mutation. Storing `stop_strings` by reference means external modifications to the list after initialization could cause unexpected behavior. A shallow copy would provide defensive isolation.
Proposed fix
-        self.stop_strings = stop_strings
+        self.stop_strings = list(stop_strings)
85-90: Return a copy of `token_id_buffer` to prevent accidental mutation. Returning `self.token_id_buffer` directly exposes internal state. If the caller modifies the returned list, it would affect the processor's buffer. Returning a copy is safer.
Proposed fix
elif result.status == "full_stop":
    return StopStringProcessorResult(
        status="full_stop",
        stop_string=result.stop_string,
-       stop_tokens=self.token_id_buffer,
+       stop_tokens=list(self.token_id_buffer),
    )
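The aliasing risk the comment describes is easy to demonstrate with a toy class (names here are illustrative, not the processor's actual API):

```python
class BufferHolder:
    def __init__(self):
        self._buffer = [101, 102]

    def aliased(self):
        return self._buffer        # caller shares the internal list

    def copied(self):
        return list(self._buffer)  # shallow copy isolates internals


h = BufferHolder()
h.aliased().append(999)
assert h._buffer == [101, 102, 999]  # internal state was mutated by the caller

h = BufferHolder()
h.copied().append(999)
assert h._buffer == [101, 102]       # internals untouched
```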
110-111: Docstring states the wrong return type. The docstring says it returns `StopStringProcessorResult` but the method signature and implementation return `_StoppingCriteriaResult`.
Proposed fix
    Returns:
-        StopStringProcessorResult indicating match status and stop string if matched
+        _StoppingCriteriaResult indicating match status and stop string if matched
server/tests/test_responses_api.py (1)
85-95: Consider explicit JSON parsing error handling and consistent failure reporting. Two minor concerns:
- Line 89: If `json.loads(content)` fails, the generic exception handler at line 97 will catch it, but a more specific `json.JSONDecodeError` catch would provide clearer diagnostics.
- Line 95: Similar to the earlier case, printing "FAILED" without raising won't cause pytest to report a failure (though it's at the end of the function anyway).
Proposed improvement for JSON parsing
 if response.status_code == 200:
     data = response.json()
     print(f"Response: {json.dumps(data, indent=2)}")
     content = data["choices"][0]["message"]["content"]
-    parsed = json.loads(content)
+    try:
+        parsed = json.loads(content)
+    except json.JSONDecodeError as e:
+        pytest.fail(f"Response content is not valid JSON: {content!r}, error: {e}")
     print(f"Parsed JSON: {parsed}")
     assert parsed["capital"].lower() == "berlin"
     assert parsed["country"].lower() == "germany"
     print("JSON schema test PASSED")
 else:
-    print(f"JSON schema test FAILED: {response.text}")
+    pytest.fail(f"JSON schema test FAILED: {response.text}")
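The parse-or-fail pattern above can also be factored into a small helper; this sketch raises `AssertionError` so it works outside pytest too (in a test, `pytest.fail(...)` would serve the same purpose):

```python
import json


def parse_json_or_fail(content: str):
    """Parse model output, surfacing a clear failure on invalid JSON."""
    try:
        return json.loads(content)
    except json.JSONDecodeError as e:
        raise AssertionError(f"Response content is not valid JSON: {content!r} ({e})")


assert parse_json_or_fail('{"capital": "Berlin"}')["capital"] == "Berlin"

try:
    parse_json_or_fail("not json")
    raised = False
except AssertionError:
    raised = True
assert raised  # invalid JSON is reported as a test failure, not swallowed
```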
📜 Review details
Configuration used: Path: .coderabbit.yml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
- server/api.py
- server/backend/mlx_backend.py
- server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py
- server/backend/mlx_engine/stop_string_processor.py
- server/backend/mlx_engine/utils/image_utils.py
- server/backend/mlx_engine/utils/logger.py
- server/pyproject.toml
- server/tests/test_responses_api.py
🚧 Files skipped from review as they are similar to previous changes (1)
- server/backend/mlx_backend.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.{rs,toml}
⚙️ CodeRabbit configuration file
Review the Rust code for conformity with best practices in Rust, Systems programming. Highlight any deviations.
Files:
server/pyproject.toml
🧬 Code graph analysis (3)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (1)
server/backend/mlx_engine/utils/image_utils.py (2)
convert_to_pil(10-15), custom_resize(18-97)
server/api.py (2)
server/schemas.py (2)
ChatMessage(15-17), ResponseRequest(69-77)
server/backend/mlx_backend.py (1)
generate_response(183-253)
server/tests/test_responses_api.py (3)
server/backend/mlx_backend.py (1)
get_or_load_model(31-80)
server/cache_utils.py (1)
get_model_path(173-191)
server/backend/mlx_engine/external/datasets/dill.py (1)
dumps(134-138)
🪛 Ruff (0.14.11)
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py
38-38: Avoid specifying long messages outside the exception class
(TRY003)
server/backend/mlx_engine/stop_string_processor.py
35-35: Avoid specifying long messages outside the exception class
(TRY003)
38-38: Avoid specifying long messages outside the exception class
(TRY003)
41-41: Avoid specifying long messages outside the exception class
(TRY003)
93-93: Avoid specifying long messages outside the exception class
(TRY003)
server/api.py
89-89: Consider moving this statement to an else block
(TRY300)
server/backend/mlx_engine/utils/image_utils.py
49-51: Avoid specifying long messages outside the exception class
(TRY003)
53-53: Avoid specifying long messages outside the exception class
(TRY003)
64-64: Local variable aspect_ratio is assigned to but never used
Remove assignment to unused variable aspect_ratio
(F841)
89-89: Loop control variable i not used within loop body
(B007)
server/tests/test_responses_api.py
29-29: Do not catch blind exception: Exception
(BLE001)
97-97: Do not catch blind exception: Exception
(BLE001)
🔇 Additional comments (19)
server/backend/mlx_engine/utils/logger.py (2)
12-17: LGTM! Local error suppression addresses prior feedback. The custom `MLXEngineStreamHandler` correctly overrides `handleError` to suppress handler-specific exceptions locally, avoiding the process-wide side effect of setting `logging.raiseExceptions = False`. This is the recommended approach for library-scoped error handling.
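A condensed sketch of that pattern (the handler and logger names below are illustrative, not the engine's exact code):

```python
import logging


class QuietStreamHandler(logging.StreamHandler):
    """Swallow emit() failures for this handler only, instead of flipping
    the process-wide logging.raiseExceptions switch."""

    def handleError(self, record):
        pass  # deliberately ignore handler-level errors


engine_logger = logging.getLogger("demo.mlx_engine")
engine_logger.handlers.clear()      # no duplicate handlers on re-import
engine_logger.addHandler(QuietStreamHandler())
engine_logger.propagate = False     # keep engine logs out of the root logger
engine_logger.info("engine logger configured")
```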
20-42: LGTM! Well-structured logger setup. The configuration follows best practices:
- Namespaced logger avoids polluting the root logger
- Handler clearing prevents duplicate output on re-import
- `propagate = False` isolates mlx_engine logs from the application's root logger while still allowing child loggers to inherit this configuration
server/backend/mlx_engine/utils/image_utils.py (2)
1-7: LGTM! Imports and logger setup are appropriate for the module's purpose.
66-75: Resize logic correctly respects both max dimensions. The implementation now properly calculates a single scale factor using `min(scale_w, scale_h, 1.0)` to ensure both dimensions fit within max_size while preserving aspect ratio.
server/backend/mlx_engine/model_kit/vision_add_ons/process_prompt_with_images.py (4)
1-9: LGTM! Imports are appropriate and well-organized.
12-16: LGTM! Clean NamedTuple definition providing a well-typed return structure for the processed image prompt data.
50-59: Robust handling of config variations. The `image_token_index` resolution now properly guards against missing or `None` `vision_config` attributes before accessing `image_token_id`.
61-83: LGTM! The `prepare_inputs` call and extraction of processed inputs into the `ProcessedImagePrompt` return value are well-structured.
server/backend/mlx_engine/stop_string_processor.py (7)
1-9: LGTM! Clean imports and well-defined type literals. The `REPLACEMENT_CHAR` constant correctly uses U+FFFD for detecting incomplete UTF-8 sequences.
12-18: LGTM! Well-documented result type with clear field semantics.
95-97: LGTM! Good use of an inner NamedTuple to separate internal result representation from the public API.
128-131: LGTM! Correctly detects incomplete UTF-8 sequences by checking for the replacement character at the end of the decoded string.
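That detection hinges on a property of incremental UTF-8 decoding: a truncated multi-byte sequence decodes to U+FFFD, so a trailing replacement character means "wait for more bytes":

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD

euro = "€".encode("utf-8")   # three bytes: b'\xe2\x82\xac'
truncated = euro[:2]         # sequence cut mid-character

partial_text = truncated.decode("utf-8", errors="replace")
assert partial_text.endswith(REPLACEMENT_CHAR)  # incomplete: keep buffering

full_text = euro.decode("utf-8", errors="replace")
assert not full_text.endswith(REPLACEMENT_CHAR)  # complete sequence is clean
```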
133-151: LGTM! Correctly identifies the earliest-occurring stop string, ensuring consistent behavior when multiple stop strings could match.
154-163: LGTM! The return type inconsistency has been addressed: this method now correctly returns `_StoppingCriteriaResult`, consistent with the other check methods.
165-177: LGTM! Clean implementation of suffix-prefix overlap detection with early termination via `any()`.
server/api.py (2)
3-3: LGTM! The import of `ResponseRequest` is correctly added to support the new endpoint.
84-92: LGTM! Error handling improvements from the previous review have been addressed. The endpoint correctly delegates to the backend, uses `logger.exception` for full traceback logging, and properly chains the exception with `from e`. The static analysis hint (TRY300) about moving the return to an else block is a minor style preference that doesn't improve clarity here.
server/tests/test_responses_api.py (2)
1-11: LGTM! Imports and path setup are appropriate for the test file structure.
20-31: LGTM! Previous review feedback addressed. The model path check now properly guards against `None` values before calling `.exists()`. The graceful skip pattern for missing models is appropriate for integration tests.
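The guard order matters here: check for `None` before touching `Path.exists()`. A hypothetical reduction of the pattern (the helper name is made up for illustration):

```python
from pathlib import Path


def should_skip(model_path) -> bool:
    """True when the integration test should be skipped, not failed."""
    return model_path is None or not Path(model_path).exists()


assert should_skip(None)                      # no path resolved at all
assert should_skip("/nonexistent/model/dir")  # path resolved but absent
assert not should_skip(".")                   # model present: run the test
```

In a pytest test this pairs naturally with `pytest.skip("model not available")` rather than a failure.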
else:
    print(f"Test FAILED: {response.text}")
    return
Early return on failure won't cause pytest to report test failure.
When the status code is not 200, the test prints the failure but returns without raising an exception or assertion. Pytest will report this as a passed test.
Proposed fix
else:
print(f"Test FAILED: {response.text}")
- return
+ pytest.fail(f"Expected status 200, got {response.status_code}: {response.text}")
You'll need to add `import pytest` at the top of the file.
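Why the bare `return` slips past pytest: a test only fails when an exception propagates out of the test function. A self-contained illustration (the harness below is a crude stand-in for pytest's outcome handling, and all names are made up):

```python
def run_like_pytest(test_fn) -> str:
    """A test 'fails' only if an exception escapes it."""
    try:
        test_fn()
        return "passed"
    except Exception:
        return "failed"


def swallowing_test():
    status_code = 500
    if status_code != 200:
        print("Test FAILED")  # printed, but no exception escapes
        return


def raising_test():
    status_code = 500
    assert status_code == 200, "Expected status 200"  # pytest.fail(...) also raises


assert run_like_pytest(swallowing_test) == "passed"  # silently green
assert run_like_pytest(raising_test) == "failed"     # properly red
```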
🤖 Prompt for AI Agents
In `@server/tests/test_responses_api.py` around lines 60 - 62, The test currently
prints failure and returns in the non-200 branch which lets pytest treat the
test as passing; import pytest at the top of the test file and change the
early-return behavior in the failure branch of the test_responses_api test (the
block that prints f"Test FAILED: {response.text}") to call pytest.fail(...) or
raise an assertion (e.g., pytest.fail with a message including
response.status_code and response.text) so pytest marks the test as failed.
except Exception as e:
    print(f"Test FAILED with exception: {type(e).__name__}: {e}")
    import traceback
    traceback.print_exc()
Exception handler swallows failures - test will pass in pytest.
The outer exception handler catches all exceptions, prints them, but doesn't re-raise. This means any exception during the test will result in pytest reporting the test as passed.
Proposed fix
```diff
 except Exception as e:
     print(f"Test FAILED with exception: {type(e).__name__}: {e}")
     import traceback
     traceback.print_exc()
+    raise  # Re-raise so pytest reports the failure
```
🧰 Tools
🪛 Ruff (0.14.11)
97-97: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@server/tests/test_responses_api.py` around lines 97 - 100: the broad except
Exception as e block currently prints the error and traceback but does not
re-raise, causing pytest to treat failures as passes. Change the handler to
either re-raise the exception (bare raise) after printing, or call
pytest.fail(...) with the exception message, so the failure is propagated to
pytest and the test is reported as failed.
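A minimal sketch of the re-raise fix: a bare `raise` inside the handler keeps the original exception (and its traceback) so pytest sees the failure, while still printing the diagnostic output first.

```python
import traceback


def run_flaky_step():
    try:
        1 / 0  # placeholder for the real test logic
    except Exception as e:
        print(f"Test FAILED with exception: {type(e).__name__}: {e}")
        traceback.print_exc()
        raise  # bare raise re-raises the same exception with its traceback
```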
Title: feat(api): Add Responses API (Phase 2) - Non-streaming endpoint with structured output
Description:
This PR introduces the non-streaming `/v1/responses` endpoint, delivering complete, structured responses with full OpenAI API compatibility, including usage statistics and JSON schema enforcement.

Dependencies: This PR is stacked on #64 and leverages its foundational infrastructure:

- `MLXRunner.generate_batch()` from feat(server): transition Mac backend to mlx-engine with structured output and LoRA support #64 for core inference
- Model management (`get_or_load_model`, caching)
- The `runtime.backend` abstraction established in that PR

In short: #64 built the engine; this PR adds a dashboard.
Changes Made:

- `ResponseRequest` and `ResponseResponse` models supporting `messages`, `temperature`, `max_tokens`, `stop`, and `response_format` (with `json_schema`)
- `generate_response()` in the MLX backend using efficient batch generation, plus proper token usage tracking (prompt, completion, total)
- `POST /v1/responses` with error handling and structured logging

Testing:
Added automated tests in `server/tests/test_responses_api.py`:

- `json_schema` enforcement returns parseable JSON

Both tests pass cleanly using `TestClient`, replacing manual `curl` verification with a reproducible test suite.

Example response:
```json
{
  "id": "chatcmpl-590676b9-b112-4252-96b7-cb457fabe529",
  "object": "chat.completion",
  "created": 1768411992,
  "model": "driaforall/mem-agent-mlx-4bit",
  "choices": [...],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 5,
    "total_tokens": 20
  }
}
```
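The usage arithmetic in the example (15 prompt + 5 completion = 20 total) can be reproduced with a small sketch. The field names follow the OpenAI chat-completion envelope shown above, but the helper itself is hypothetical, not the PR's actual implementation:

```python
import time
import uuid


def build_response(model: str, content: str,
                   prompt_tokens: int, completion_tokens: int) -> dict:
    """Assemble an OpenAI-style chat.completion envelope (hypothetical helper)."""
    return {
        "id": f"chatcmpl-{uuid.uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": content},
            "finish_reason": "stop",
        }],
        "usage": {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            # total is always the sum of the two counts
            "total_tokens": prompt_tokens + completion_tokens,
        },
    }
```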