
Commit 9efd2f5

Lloyd authored and committed
gemma-4-31B-it: hot-patch transformers CUDA→numpy crash on image requests
Multimodal requests to gemma-4-31B-it return HTTP 500 with `TypeError: can't convert cuda:0 device type tensor to numpy` for inputs that SGLang decodes to a GPU tensor (video data-URLs, broken image URLs, etc.). The crash is a bare `image.numpy()` call on a CUDA tensor at transformers/image_processing_backends.py:458, reached via the gemma4 image processor.

`--disable-fast-image-processor` (added in v0.0.196) only closed the generic fast-processor path. This second path is unaffected because the tensor is already on the GPU upstream of that flag.

Fix: wrap `sglang serve` in a shell that sed-patches the line to `image.cpu().numpy()` before launch. `.cpu()` is a no-op on CPU tensors, so the patched line is safe for inputs that never left the CPU. The sed pattern no longer matches once the line is patched, so re-running it on every restart changes nothing. This avoids rebuilding the pinned SGLang image; all serve flags (incl. `--disable-fast-image-processor`) are unchanged.

Verified: valid images already return 200; video/broken-URL inputs reproduce the 500 on both backends pre-patch. See nearai/infra#156.
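The restart-safety claim can be checked without a container. The sketch below is illustrative only: it runs the commit's substitution twice against a scratch file that stands in for the real transformers module (the one-line file content is an assumption, not the actual source), and it writes through a temporary file instead of using `sed -i` so that it also runs on non-GNU sed.

```shell
# Scratch file standing in for image_processing_backends.py (illustrative content).
target=$(mktemp)
printf '    image = image.numpy()\n' > "$target"

apply_patch() {
    # Same substitution as the commit, without -i for portability.
    sed 's/image = image\.numpy()/image = image.cpu().numpy()/' "$1" > "$1.tmp" \
        && mv "$1.tmp" "$1"
}

apply_patch "$target"
first=$(cat "$target")    # contents after the first run

apply_patch "$target"
second=$(cat "$target")   # contents after the second run

echo "$first"
[ "$first" = "$second" ] && echo "second run changed nothing"
```

The second run is a no-op because the patched text `image = image.cpu().numpy()` no longer contains the substring `image = image.numpy()` that the pattern looks for.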
1 parent f8ad79e commit 9efd2f5

1 file changed

Lines changed: 34 additions & 21 deletions

small-models.yaml

@@ -507,27 +507,40 @@ services:
     # https://lmsysorg.mintlify.app/cookbook/autoregressive/Google/Gemma4
     image: lmsysorg/sglang:gemma4@sha256:87cecd3c9f4d17632c44b2d7cd1a20c50377c42b461d9ca39b153b4bb2b6e6ae
     container_name: model-sg-gemma-4-31b
-    command: >
-      sglang serve
-      --model-path google/gemma-4-31B-it
-      --revision ba74f5b6c647c0911554e50278d6f6f4477f9010
-      --tp 2
-      --reasoning-parser gemma4
-      --tool-call-parser gemma4
-      --mem-fraction-static 0.85
-      --max-running-requests 64
-      --chunked-prefill-size 8192
-      --num-continuous-decode-steps 5
-      --enable-mixed-chunk
-      --disable-fast-image-processor
-      --model-loader-extra-config '{"enable_multithread_load": "true", "num_threads": 64}'
-      --port 8000
-      --host 0.0.0.0
-      --enable-cache-report
-      --enable-metrics
-      --trust-remote-code
-      --log-requests-level 0
-      --served-model-name google/gemma-4-31B-it
+    # The command wraps `sglang serve` in a shell that first hot-patches the
+    # transformers gemma4 image processor baked into the image. A bare
+    # `image.numpy()` on a CUDA tensor crashes multimodal (image) requests with
+    #   TypeError: can't convert cuda:0 device type tensor to numpy
+    # for inputs SGLang decodes to a GPU tensor (video data-URLs, broken image
+    # URLs, etc). `--disable-fast-image-processor` does NOT cover this path — the
+    # tensor is already on GPU upstream of that flag. `.cpu()` is a no-op on CPU
+    # tensors, so the patch is safe and idempotent. See nearai/infra#156.
+    command:
+      - /bin/sh
+      - -c
+      - |
+        sed -i 's/image = image\.numpy()/image = image.cpu().numpy()/' \
+          /usr/local/lib/python3.12/dist-packages/transformers/image_processing_backends.py
+        exec sglang serve \
+          --model-path google/gemma-4-31B-it \
+          --revision ba74f5b6c647c0911554e50278d6f6f4477f9010 \
+          --tp 2 \
+          --reasoning-parser gemma4 \
+          --tool-call-parser gemma4 \
+          --mem-fraction-static 0.85 \
+          --max-running-requests 64 \
+          --chunked-prefill-size 8192 \
+          --num-continuous-decode-steps 5 \
+          --enable-mixed-chunk \
+          --disable-fast-image-processor \
+          --model-loader-extra-config '{"enable_multithread_load": "true", "num_threads": 64}' \
+          --port 8000 \
+          --host 0.0.0.0 \
+          --enable-cache-report \
+          --enable-metrics \
+          --trust-remote-code \
+          --log-requests-level 0 \
+          --served-model-name google/gemma-4-31B-it
     volumes:
       - hugginface_cache:/root/.cache/huggingface
       - kernel_cache:/root/.cache/deep_gemm
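One caveat with the hot-patch as written: `sed` exits 0 even when the pattern matches nothing, so if a later image changes that line the server would start unpatched without any signal. A minimal sketch of a guard follows. The helper name `patch_or_fail` and the scratch-file contents are hypothetical, and this is not part of the commit.

```shell
# Hypothetical helper: apply the substitution, then succeed only if the
# patched line is actually present in the file afterwards.
patch_or_fail() {
    file=$1
    sed 's/image = image\.numpy()/image = image.cpu().numpy()/' "$file" > "$file.tmp" \
        && mv "$file.tmp" "$file"
    grep -q 'image = image\.cpu()\.numpy()' "$file"
}

# One scratch file with the expected line, one where upstream has drifted.
good=$(mktemp);    printf 'image = image.numpy()\n' > "$good"
drifted=$(mktemp); printf 'image = np.asarray(image)\n' > "$drifted"

patch_or_fail "$good"    && good_status=patched    || good_status=unpatched
patch_or_fail "$drifted" && drifted_status=patched || drifted_status=unpatched
echo "$good_status $drifted_status"
```

In the compose command, a check of this kind could sit between the `sed` line and `exec sglang serve`, joined with `&&`, so that the container fails fast instead of serving with the bug still present. Whether failing fast is preferable to serving unpatched is a deployment decision the commit does not address.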
