perf(multimodal): avoid feature hashing for pad values #1030
Conversation
**Gemini Code Assist** (summary of changes): This pull request optimizes multimodal request processing by shifting from feature-tensor hashing to stable payload hashing. By computing hashes while the raw input bytes are available, the system significantly reduces host-side request latency and avoids redundant, computationally expensive operations on large feature tensors. The existing feature-hash mechanism is preserved as a fallback, ensuring backward compatibility while providing a measurable performance boost.
```python
if os.path.exists(source):
    return Image.open(source).convert("RGB")
with Image.open(source) as image:
```
The code decodes the media from the path first, then reopens the same path to compute the hash. If the file changes between those two reads, the feature and the hash may come from different contents, producing an incorrect radix cache key.
Please read the file bytes once and use the same bytes for both decoding and hashing.
Thanks, fixed in commit 9b2f20dd. Local file inputs now create a temporary snapshot while streaming the source file in chunks. The hash is computed over the exact bytes copied to that snapshot, and image/video/audio/Qwen-video decoding reads from the same snapshot.
```python
def _combine_mm_hashes(self, hashes: list[int | None], modality: str) -> int | None:
    valid_hashes = [value for value in hashes if value is not None]
    if not valid_hashes:
```
_combine_mm_hashes() silently drops None hashes. If any payload is processed successfully but its hash is unavailable, the combined hash may only cover part of the multimodal input instead of falling back to feature hashing.
I think this should return None whenever any input hash is None, so set_pad_value() can use the existing feature-hash fallback.
_combine_mm_hashes() now returns None if any input hash is unavailable, so MultimodalDataItem.set_pad_value() falls back to the existing feature hashing path instead of building a partial payload hash. Also, added a focused unit test for this case.
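The fixed all-or-nothing behavior can be sketched as follows. This is a hedged illustration only; the real method is on the tokenizer class, also takes a `modality` argument, and the fold used here is an assumption:

```python
from functools import reduce


def combine_mm_hashes(hashes: list[int | None]) -> int | None:
    """Return a single combined hash, or None if any per-payload hash is missing.

    Returning None (rather than combining only the available hashes) lets the
    caller fall back to the existing feature-hash path instead of building a
    partial payload hash that covers only part of the multimodal input.
    """
    if not hashes or any(h is None for h in hashes):
        return None
    # Fold the per-payload hashes into one order-sensitive value.
    return reduce(lambda acc, h: hash((acc, h)), hashes)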
LGTM
@hashkanna please fix the CI error
Force-pushed 94cf2b7 to 770d860
@pathfinder-pf The failing job is the known flake in `test_bench_serving_dense_tp_4` that #1077 just fixed. Rebasing onto main to pick up the fix.
Motivation
Fixes #1029.
Single-image VLM requests currently derive `MultimodalDataItem.pad_value` by hashing the processed feature tensor when no precomputed hash is present. In a v6e-4 profile of `Qwen/Qwen2.5-VL-7B-Instruct`, this tokenizer-side feature hashing showed up as a measurable host-side request cost:

| Call | Mean | Total |
| --- | --- | --- |
| `MultimodalDataItem.set_pad_value` | 3412.899 us | 221.838 ms |
| `hash_feature` | 3316.494 us | 215.572 ms |

The tokenizer already has access to the raw image/video/audio payload bytes before feature preprocessing. This PR carries a stable content hash into `MultimodalDataItem.hash`, allowing `set_pad_value()` to avoid walking large processed feature tensors on the request path.

Related inspiration: Modal's multimodal inference optimization write-up discusses the same general principle of carrying stable media-content identity through preprocessing instead of repeatedly hashing large processed tensors on the request path: https://modal.com/blog/boosting-multimodal-inference-performance-by-greater-than-10-with-a-single-python-dictionary
Modifications
- Compute a stable payload content hash in the tokenizer and store it in `MultimodalDataItem.hash` before `set_pad_value()` runs.
- Keep `hash_feature` as the fallback for callers without a precomputed hash.

Accuracy Tests
No model output changes are expected. This changes the way multimodal pad values are derived for cache-key differentiation, while preserving the existing feature-hash fallback.
Local validation:
```shell
uv run --project python python -m unittest sgl_jax.test.multimodal.test_multimodal_pad_value_hash
python3 -m py_compile python/sgl_jax/srt/multimodal/manager/multimodal_tokenizer.py python/sgl_jax/test/multimodal/test_multimodal_pad_value_hash.py
ruff check python/sgl_jax/srt/multimodal/manager/multimodal_tokenizer.py python/sgl_jax/test/multimodal/test_multimodal_pad_value_hash.py
pre-commit run --all-files --show-diff-on-failure
```

TPU validation:
Benchmarking and Profiling
Profiling confirmation on v6e-4 with `Qwen/Qwen2.5-VL-7B-Instruct` (before vs. after):

| Metric | Before | After |
| --- | --- | --- |
| Requests completed | 64/64 HTTP 200 | 64/64 HTTP 200 |
| `hash_feature` calls | 65 | 0 |
| `hash_feature` mean | 3316.494 us | n/a |
| `hash_feature` total | 215.572 ms | n/a |
| `MultimodalDataItem.set_pad_value` calls | 65 | 65 |
| `MultimodalDataItem.set_pad_value` mean | 3412.899 us | 31.320 us |
| `MultimodalDataItem.set_pad_value` total | 221.838 ms | 2.036 ms |
| Median request latency | 0.3174 s | 0.3115 s |
| p95 request latency | 0.3247 s | 0.3180 s |

The optimized run had one request-level outlier at 8.32436 s; median and p95 remained stable.

Checklist