
Commit c010de5

Authored by raullenchai, Your Name, and claude
fix: auto-install Python + Metal shader warmup (#43)
fix: auto-install Python + Metal shader warmup on startup

* P0 — install.sh: if no Python 3.10+ and no Homebrew is found, automatically download standalone Python from python-build-standalone (no sudo needed). Eliminates the #1 install blocker for users without Homebrew.
* P0 — first-request hang: add a warmup step after model load that runs one forward pass to trigger Metal shader compilation. Prints "Warming up (compiling Metal shaders)..." so users know what's happening, and prevents the first real request from hanging for 5+ minutes.

fix: strip think tags from Anthropic endpoint + disk space check

* P2: Think tags leaked through the Anthropic /v1/messages endpoint because it bypassed the reasoning parser entirely. Both streaming and non-streaming paths now use the reasoning parser to separate reasoning from content, emitting only content to Anthropic clients.
* P1: Add a disk space check before model download — query HuggingFace for the model repo size and warn if available disk is insufficient. Skips silently for local/cached models.

fix: standalone Python URL + move warmup to lifespan hook

* P0: The hardcoded python-build-standalone URL pointed at the old indygreg repo, which now 404s. Updated to astral-sh/python-build-standalone with cpython 3.12.13 (release 20260320), verified accessible.
* P2: Metal shader warmup ran in the CLI before the batched/hybrid engines were started (they start in the FastAPI lifespan hook). Moved warmup into the lifespan hook so it runs after engine.start() for all engine types.

fix: add generate_warmup() to BatchedEngine and HybridEngine

* Both engines inherited the no-op base generate_warmup(), so Metal shader warmup in the lifespan hook was silently skipped for --continuous-batching and hybrid modes. Now both engines override it with a real forward pass, matching SimpleEngine's implementation.

Co-authored-by: Your Name <you@example.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 669f6e0 commit c010de5

File tree: 7 files changed, +206 −9 lines

install.sh (16 additions, 4 deletions)

@@ -46,15 +46,27 @@ done
 
 if [ -z "$PYTHON" ]; then
     echo ""
-    echo " Python 3.10+ not found."
+    echo " Python 3.10+ not found. Installing automatically..."
     if command -v brew &>/dev/null; then
         echo " Installing Python 3.12 via Homebrew..."
         brew install python@3.12
         PYTHON="python3.12"
     else
-        echo " Please install Python 3.10+ from https://www.python.org/downloads/"
-        echo " Or install Homebrew first: https://brew.sh"
-        exit 1
+        # Download standalone Python — no Homebrew or sudo needed
+        STANDALONE_DIR="${HOME}/.rapid-mlx-python"
+        PY_VERSION="3.12.13"
+        PY_BUILD="20260320"
+        PY_URL="https://github.com/astral-sh/python-build-standalone/releases/download/${PY_BUILD}/cpython-${PY_VERSION}+${PY_BUILD}-aarch64-apple-darwin-install_only.tar.gz"
+        echo " Downloading Python ${PY_VERSION} (standalone, no sudo needed)..."
+        mkdir -p "$STANDALONE_DIR"
+        curl -fsSL "$PY_URL" | tar xz -C "$STANDALONE_DIR" --strip-components=1
+        PYTHON="${STANDALONE_DIR}/bin/python3"
+        if ! "$PYTHON" --version &>/dev/null; then
+            echo " Error: Failed to install standalone Python."
+            echo " Please install Python 3.10+ from https://www.python.org/downloads/"
+            exit 1
+        fi
+        echo " Installed Python $("$PYTHON" --version 2>&1) to $STANDALONE_DIR"
     fi
 fi
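The script above gates on "Python 3.10+" but its fallback verification only checks that the interpreter runs at all. A minimal standalone sketch of the version test itself (the `PY`, `ver`, `major`, and `minor` names here are illustrative, not variables from install.sh):

```shell
#!/bin/sh
# Hypothetical version gate mirroring install.sh's 3.10+ requirement.
PY="${PY:-python3}"
ver="$("$PY" -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
major="${ver%%.*}"
minor="${ver#*.}"
if [ "$major" -gt 3 ] || { [ "$major" -eq 3 ] && [ "$minor" -ge 10 ]; }; then
    echo "OK: Python $ver meets the 3.10+ requirement"
else
    echo "Too old: Python $ver"
fi
```

Doing the comparison on the numeric components (rather than string-comparing "3.9" vs "3.10") avoids the classic lexicographic trap where "3.9" sorts after "3.10".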

vllm_mlx/cli.py (77 additions, 0 deletions)

@@ -16,6 +16,78 @@
 import sys
 
 
+def _check_disk_space(model_name: str) -> None:
+    """Check if there's enough disk space to download the model.
+
+    Queries HuggingFace for model repo size and compares with available space.
+    Warns (but does not block) if disk space is insufficient.
+    Skips silently if the model is already local or if the check fails.
+    """
+    import os
+    from pathlib import Path
+
+    # Skip if model is a local path that already exists
+    if os.path.exists(model_name):
+        return
+
+    # Check if model is already cached by huggingface_hub
+    try:
+        from huggingface_hub import try_to_load_from_cache
+
+        # Quick check: see if config.json is cached (implies model is downloaded)
+        cached = try_to_load_from_cache(model_name, "config.json")
+        if isinstance(cached, str) and os.path.exists(cached):
+            return
+    except Exception:
+        pass
+
+    # Query HuggingFace API for model size
+    try:
+        from huggingface_hub import model_info
+
+        info = model_info(model_name, files_metadata=True)
+        # safetensors_total or siblings file sizes
+        model_size_bytes = 0
+        if hasattr(info, "safetensors") and info.safetensors:
+            # Total size from safetensors metadata
+            params = info.safetensors
+            if hasattr(params, "total"):
+                # This is parameter count, not file size — use siblings instead
+                pass
+        # Sum file sizes from siblings
+        if hasattr(info, "siblings") and info.siblings:
+            for sibling in info.siblings:
+                if hasattr(sibling, "size") and sibling.size:
+                    model_size_bytes += sibling.size
+
+        if model_size_bytes == 0:
+            return  # Can't determine size, skip check
+
+        # Get available disk space
+        cache_dir = Path.home() / ".cache" / "huggingface"
+        stat = os.statvfs(str(cache_dir) if cache_dir.exists() else str(Path.home()))
+        available_bytes = stat.f_bavail * stat.f_frsize
+
+        model_size_gb = model_size_bytes / (1024**3)
+        available_gb = available_bytes / (1024**3)
+
+        # Need ~10% extra for temp files during download
+        required_bytes = int(model_size_bytes * 1.1)
+
+        if available_bytes < required_bytes:
+            print()
+            print(
+                f" Warning: Model requires ~{model_size_gb:.1f} GB "
+                f"but only {available_gb:.1f} GB available on disk."
+            )
+            print(
+                " The download may fail. Free up disk space or choose a smaller model."
+            )
+            print()
+    except Exception:
+        pass  # Non-critical — don't block startup on check failure
+
+
 def serve_command(args):
     """Start the OpenAI-compatible server."""
     import logging

@@ -203,6 +275,9 @@ def serve_command(args):
     else:
         print("Mode: Simple (maximum throughput)")
 
+    # Check disk space before downloading model
+    _check_disk_space(args.model)
+
     # Load model with unified server
     load_model(
         args.model,

@@ -224,6 +299,8 @@ def serve_command(args):
     )
 
     # Start server
+    # Note: Metal shader warmup runs in the FastAPI lifespan hook (server.py)
+    # so it works for all engine types including batched/hybrid which start later.
     print()
     host_display = "localhost" if args.host == "0.0.0.0" else args.host
     print(f" Ready: http://{host_display}:{args.port}/v1")
vllm_mlx/engine/base.py (8 additions, 0 deletions)

@@ -69,6 +69,14 @@ def preserve_native_tool_format(self) -> bool:
     def preserve_native_tool_format(self, value: bool) -> None:
         self._preserve_native_tool_format = value
 
+    def generate_warmup(self) -> None:
+        """Run a minimal generation to compile Metal shaders.
+
+        This prevents the first real request from hanging for minutes
+        while shaders compile on-demand.
+        """
+        pass  # Subclasses may override
+
     @abstractmethod
     async def start(self) -> None:
         """Start the engine (load model if not loaded)."""

vllm_mlx/engine/batched.py (14 additions, 0 deletions)

@@ -186,6 +186,20 @@ def tokenizer(self) -> Any:
             return getattr(self._processor, "tokenizer", self._processor)
         return self._tokenizer
 
+    def generate_warmup(self) -> None:
+        """Run a minimal forward pass to compile Metal shaders."""
+        if not self._loaded or self._model is None or self._is_mllm:
+            return
+        try:
+            import mlx.core as mx
+
+            tokens = self._tokenizer.encode("Hi")
+            input_ids = mx.array([tokens])
+            self._model(input_ids)
+            mx.eval(mx.zeros(1))
+        except Exception:
+            pass  # Non-fatal
+
     async def start(self) -> None:
         """Start the engine (load model if not loaded)."""
         if self._loaded:

vllm_mlx/engine/hybrid.py (8 additions, 0 deletions)

@@ -127,6 +127,14 @@ def tokenizer(self) -> Any:
         """Get the tokenizer."""
         return self._shared_tokenizer
 
+    def generate_warmup(self) -> None:
+        """Run a minimal forward pass to compile Metal shaders."""
+        # Delegate to the simple engine if available (it has the model loaded)
+        if self._simple is not None:
+            self._simple.generate_warmup()
+        elif self._batched is not None:
+            self._batched.generate_warmup()
+
     async def start(self) -> None:
         """Start the engine (load shared model and initialize sub-engines)."""
         if self._loaded:

vllm_mlx/engine/simple.py (18 additions, 0 deletions)

@@ -101,6 +101,24 @@ def tokenizer(self) -> Any:
             return getattr(self._model, "processor", None)
         return self._model.tokenizer
 
+    def generate_warmup(self) -> None:
+        """Run a minimal generation to compile Metal shaders."""
+        if not self._loaded or self._model is None or self._is_mllm:
+            return
+        try:
+            import mlx.core as mx
+
+            model = self._model
+            tokenizer = model.tokenizer
+            # Encode a short prompt and generate 1 token
+            tokens = tokenizer.encode("Hi")
+            input_ids = mx.array([tokens])
+            # Run one forward pass to trigger shader compilation
+            model.model(input_ids)
+            mx.eval(mx.zeros(1))
+        except Exception:
+            pass  # Non-fatal
+
     async def start(self) -> None:
         """Start the engine (load model if not loaded)."""
         if self._loaded:

vllm_mlx/server.py (65 additions, 5 deletions)

@@ -327,6 +327,23 @@ async def lifespan(app: FastAPI):
     if _engine is not None and hasattr(_engine, "_loaded") and not _engine._loaded:
         await _engine.start()
 
+    # Warmup: generate one token to trigger Metal shader compilation.
+    # Runs here (not in CLI) so all engine types are fully started first.
+    if _engine is not None:
+        import time as _time
+
+        logger.info("Warming up (compiling Metal shaders)...")
+        _warmup_start = _time.monotonic()
+        try:
+            import mlx.core as mx
+
+            _engine.generate_warmup()
+            mx.eval(mx.zeros(1))  # Force sync
+        except Exception as e:
+            logger.debug(f"Warmup failed (non-fatal): {e}")
+        _warmup_secs = _time.monotonic() - _warmup_start
+        logger.info(f"Warmup complete ({_warmup_secs:.1f}s)")
+
     # Load persisted cache from disk (AFTER engine start — AsyncEngineCore must exist)
     if _engine is not None and hasattr(_engine, "load_cache_from_disk"):
         _load_prefix_cache_from_disk()

@@ -2204,10 +2221,10 @@ async def create_anthropic_message(
                 output.text, openai_request
             )
 
-        # Clean output text
+        # Clean output text — strip think tags so Anthropic clients get pure content
         final_content = None
         if cleaned_text:
-            final_content = clean_output_text(cleaned_text)
+            final_content = strip_thinking_tags(clean_output_text(cleaned_text))
 
         # Determine finish reason
         finish_reason = "tool_calls" if tool_calls else output.finish_reason

@@ -2370,10 +2387,15 @@ async def _stream_anthropic_messages(
         }
         yield f"event: content_block_start\ndata: {json.dumps(content_block_start)}\n\n"
 
-        # Stream content deltas
+        # Stream content deltas — use reasoning parser to strip think tags
         accumulated_text = ""
+        accumulated_raw = ""
         completion_tokens = 0
 
+        # Reset reasoning parser state for this stream
+        if _reasoning_parser:
+            _reasoning_parser.reset_state()
+
         async for output in engine.stream_chat(messages=messages, **chat_kwargs):
             delta_text = output.new_text

@@ -2382,8 +2404,25 @@
                 completion_tokens = output.completion_tokens
 
             if delta_text:
-                # Filter special tokens
-                content = strip_special_tokens(delta_text)
+                content = None
+
+                # Use reasoning parser to separate reasoning from content
+                if _reasoning_parser:
+                    previous_raw = accumulated_raw
+                    accumulated_raw += delta_text
+                    delta_msg = _reasoning_parser.extract_reasoning_streaming(
+                        previous_raw, accumulated_raw, delta_text
+                    )
+                    if delta_msg is not None:
+                        # Only emit content, discard reasoning for Anthropic clients
+                        content = delta_msg.content
+                else:
+                    # No reasoning parser — pass through with special token filter
+                    content = strip_special_tokens(delta_text)
+
+                if content:
+                    # Filter special tokens from parser output too
+                    content = strip_special_tokens(content)
 
                 if content:
                     accumulated_text += content

@@ -2394,6 +2433,27 @@
         }
         yield f"event: content_block_delta\ndata: {json.dumps(delta_event)}\n\n"
 
+        # Handle reasoning parser finalization (e.g. no-tag correction)
+        if _reasoning_parser and accumulated_raw:
+            final_msg = (
+                _reasoning_parser.finalize_streaming(accumulated_raw)
+                if hasattr(_reasoning_parser, "finalize_streaming")
+                else None
+            )
+            if final_msg and final_msg.content:
+                # Emit corrected content (model didn't use think tags at all)
+                content = strip_special_tokens(final_msg.content)
+                if content:
+                    accumulated_text = content  # Replace accumulated
+                    delta_event = {
+                        "type": "content_block_delta",
+                        "index": 0,
+                        "delta": {"type": "text_delta", "text": content},
+                    }
+                    yield f"event: content_block_delta\ndata: {json.dumps(delta_event)}\n\n"
+            # Reset parser state for next request
+            _reasoning_parser.reset_state()
+
         # Check for tool calls in accumulated text
         _, tool_calls = _parse_tool_calls_with_parser(accumulated_text, openai_request)
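The streaming hunk keeps both the raw accumulated text and the filtered output because a closing think tag can arrive split across deltas, so each delta must be classified against the full raw text. A simplified, self-contained sketch of that idea (this is not the project's reasoning parser; `content_after_think` and `stream_content` are hypothetical names, and only a literal `<think>...</think>` pair is handled):

```python
def content_after_think(raw: str) -> str:
    """Return only the text after the closing think tag.

    If the think block is still open (or a partial close tag has arrived),
    nothing is visible yet; if no think block exists, everything is.
    """
    close = "</think>"
    idx = raw.rfind(close)
    if idx != -1:
        return raw[idx + len(close):]
    return "" if "<think>" in raw else raw


def stream_content(deltas):
    """Yield only the newly visible content for each incoming delta."""
    raw = ""
    emitted = ""
    for d in deltas:
        raw += d
        visible = content_after_think(raw)
        # Emit just the suffix we haven't sent yet
        new = visible[len(emitted):] if visible.startswith(emitted) else visible
        emitted = visible
        if new:
            yield new


# "</think>" is split across two deltas; only post-think text is emitted.
chunks = list(stream_content(["<think>plan", "ning</th", "ink>Hello", " world"]))
# chunks == ["Hello", " world"]
```

This is why the real diff tracks `accumulated_raw` separately from `accumulated_text`: a per-delta regex would miss tags that straddle chunk boundaries.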
