gemma-4-31B-it: hot-patch transformers CUDA→numpy crash on image requests #52
Merged
Multimodal requests to gemma-4-31B-it return HTTP 500 with `TypeError: can't convert cuda:0 device type tensor to numpy` for inputs SGLang decodes to a GPU tensor (video data-URLs, broken image URLs, etc).
The crash is a bare `image.numpy()` on a CUDA tensor at transformers/image_processing_backends.py:458, reached via the gemma4 image processor. `--disable-fast-image-processor` (added in v0.0.196) only closed the generic fast-processor path; this second path is unaffected because the tensor is already on GPU upstream of that flag.
Wrap `sglang serve` in a shell that sed-patches the line to `image.cpu().numpy()` before launch. `.cpu()` is a no-op on CPU tensors, so the patch is idempotent and safe across restarts. This avoids rebuilding the pinned SGLang image; all serve flags (incl. `--disable-fast-image-processor`) are unchanged.
Verified: valid images already return 200; video/broken-URL inputs reproduce the 500 on both backends pre-patch. See nearai/infra#156.
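The idempotency claim above can be checked on a scratch file. This is a minimal sketch, not the wrapper from the PR: the temp file, the sample line, and the `patch_backends` helper are hypothetical stand-ins, and the substitution pattern is an assumption about what the real `sed` expression looks like. It writes through a temp copy rather than `sed -i` only to stay portable.

```shell
#!/bin/sh
# Sketch only: patches a scratch file, not the real transformers module.
set -eu

BACKENDS="$(mktemp)"   # hypothetical stand-in for image_processing_backends.py
printf '%s\n' '        image = image.numpy()' > "$BACKENDS"

patch_backends() {
    # Rewrite a bare image.numpy() to image.cpu().numpy().
    # In a basic regex, "\." is a literal dot and "()" are literal parentheses.
    sed 's/image\.numpy()/image.cpu().numpy()/' "$1" > "$1.new"
    mv "$1.new" "$1"
}

patch_backends "$BACKENDS"
FIRST="$(cat "$BACKENDS")"

# Second run, as after a container restart: "image.cpu().numpy()" does not
# contain "image.numpy()", so nothing matches and the file is unchanged.
patch_backends "$BACKENDS"
SECOND="$(cat "$BACKENDS")"

echo "$FIRST"
echo "$SECOND"
```

Both `echo` lines print the patched form, which is why re-running the wrapper on an already patched container is harmless.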
The sed patch could silently no-op (path moved on a python/transformers bump, image repin, or pattern change), and `sed` returns 0 on no-match. sglang would then start unpatched and resume 500ing on image requests with no signal: text traffic stays green and the error rate barely moves. Add a post-sed `grep` guard that aborts startup (exit 1) unless the fixed `image.cpu().numpy()` form is present. Checking for the fixed form (rather than whether sed changed something) also tolerates a future image that already carries the upstream fix. Use `$$BACKENDS` so docker compose doesn't interpolate it.
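The reasoning behind the guard can be illustrated with two scratch files. This is a sketch under stated assumptions: `guard_patched`, the file contents, and the variable names are hypothetical, and the real guard in the PR may be written differently. It shows that `sed` exits 0 when nothing matches, while a fixed-string `grep` for the patched form separates the two cases.

```shell
#!/bin/sh
# Sketch only: scratch files stand in for the real transformers module.
set -u

guard_patched() {
    # Succeeds only if the fixed form is present (-F: fixed string, -q: quiet).
    grep -qF 'image.cpu().numpy()' "$1"
}

MOVED="$(mktemp)"   # hypothetical: the target line moved or was reworded
FIXED="$(mktemp)"   # hypothetical: the image already carries the upstream fix
printf '%s\n' 'arr = to_numpy(image)' > "$MOVED"
printf '%s\n' 'image = image.cpu().numpy()' > "$FIXED"

# sed exits 0 even though nothing matched, so its status cannot be the check.
sed 's/image\.numpy()/image.cpu().numpy()/' "$MOVED" > /dev/null
SED_STATUS=$?

guard_patched "$MOVED"; MOVED_STATUS=$?
guard_patched "$FIXED"; FIXED_STATUS=$?

echo "sed=$SED_STATUS moved=$MOVED_STATUS fixed=$FIXED_STATUS"
```

In a wrapper, a line such as `guard_patched "$BACKENDS" || exit 1` placed before `exec sglang serve` would stop the container from starting unpatched.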
PierreLeGuen (Contributor) approved these changes on May 30, 2026 and left a comment:
Reviewed the PR diff and surrounding Gemma service config in `small-models.yaml`. Verified the rendered Compose command, the pinned image entrypoint/config, and the baked transformers commit/source path containing the targeted `image.numpy()` line. Ran `git diff --check` and `docker compose config --quiet` for all YAML files; all passed. CI validate checks are green. I did not run a live GPU/model request locally.
Problem
`google/gemma-4-31B-it` returns HTTP 500 on a subset of multimodal (image) requests with `TypeError: can't convert cuda:0 device type tensor to numpy`.
Live on both backends (gpu11 + gpu02), ~1000 errors / 15 min. Valid images succeed; the crash fires for inputs SGLang decodes to a GPU tensor (video data-URLs, broken image URLs). Tracking: nearai/infra#156.
Root cause
A bare `image.numpy()` on a CUDA tensor at `transformers/image_processing_backends.py:458`, reached via the gemma4 image processor (`processing_gemma4.py` → `image_processing_pil_gemma4.py`). The earlier `--disable-fast-image-processor` (v0.0.196) only closed the generic fast-processor path; this path is unaffected because the tensor is already on GPU upstream of that flag.
Fix
Wrap `sglang serve` in a shell that `sed`-patches the offending line to `image.cpu().numpy()` before launch.
- `.cpu()` is a no-op on CPU tensors, so the patch is idempotent and safe across restarts (the pattern no longer matches once patched).
- Avoids rebuilding the pinned `lmsysorg/sglang:gemma4@sha256:87cecd…` image.
- `serve` flags unchanged (incl. `--disable-fast-image-processor`); `exec` replaces the wrapper shell with sglang, so signals forwarded by the init process (`init: true`) reach it directly.
Verification
… (`compose/up` with `force_recreate: true`); a plain `up` won't recreate it.
Follow-up (not in this PR)
Upstream fix belongs in transformers (gemma4 image processor should `.cpu()` before `.numpy()`); this is a deploy-side hotfix until the image is rebuilt/repinned on a fixed transformers.