[VLM] Reduce GPU memory footprint of CUDA IPC MM feature transport#22662
Open
yhyang201 wants to merge 3 commits into sgl-project:main from
Conversation
Collaborator (Author):

/tag-and-rerun-ci
Motivation
Under `SGLANG_USE_CUDA_IPC_TRANSPORT=1`, each non-source TP rank was creating a full CUDA context on the producer GPU (~500 MiB per rank, ~1.6 GiB at TP=4) because `_new_shared_cuda(*handle)` routes through `CUDAGuard(handle[0])`, and `handle[0]` is the producer's device index.

This PR:

- Rewrites `handle[0]` to the consumer's own device before opening the IPC handle. `cudaIpcOpenMemHandle` uses `cudaIpcMemLazyEnablePeerAccess`, so the producer's memory is mapped via P2P without touching the producer GPU's context. This is applied to both pooled and non-pooled paths; `_pool_storage_cache` is keyed by the consumer device.
- Divides the `MmItemMemoryPool` budget by `tokenizer_worker_num` (with a 128 MiB floor) so `SGLANG_MM_FEATURE_CACHE_MB` becomes the total budget across workers.
- Lowers the default `SGLANG_MM_FEATURE_CACHE_MB` from 4 GiB to 1 GiB and adds a one-shot WARNING when the pool cannot fit a tensor.

Before:

After:

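The device-remapping idea above can be sketched as follows. This is an illustrative sketch, not the PR's actual code: the function name is hypothetical, and it assumes `handle` is the tuple produced on the producer rank by PyTorch's `storage._share_cuda_()`, whose first element is the source device index.

```python
def remap_ipc_handle_device(handle, consumer_device):
    """Return a copy of a CUDA IPC handle tuple with its device index
    replaced by the consumer's own device, so opening the handle does
    not initialize a CUDA context on the producer GPU."""
    # handle[0] is the producer's device index; everything else
    # (memory handle bytes, sizes, offsets, ...) is passed through.
    return (consumer_device,) + tuple(handle[1:])
```

On the consumer rank, the patched tuple would then be passed to `torch.UntypedStorage._new_shared_cuda(*patched)`, so that `CUDAGuard` activates the consumer's device while `cudaIpcOpenMemHandle` maps the producer's memory lazily via peer access.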
Modifications
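The per-worker pool sizing described in the motivation could look roughly like this. Function and constant names here are hypothetical, assuming only the behavior stated above: the `SGLANG_MM_FEATURE_CACHE_MB` budget (new default 1 GiB) is split evenly across tokenizer workers with a 128 MiB floor.

```python
DEFAULT_MM_FEATURE_CACHE_MB = 1024  # new default: 1 GiB total budget
MIN_POOL_MB = 128                   # per-worker floor

def per_worker_pool_mb(total_budget_mb, tokenizer_worker_num):
    """Split the total MM feature cache budget evenly across tokenizer
    workers, never going below the 128 MiB floor."""
    share = total_budget_mb // max(1, tokenizer_worker_num)
    return max(MIN_POOL_MB, share)
```

For example, with the 1 GiB default and 4 tokenizer workers each `MmItemMemoryPool` would get 256 MiB; at 8 or more workers the 128 MiB floor applies.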
Accuracy Tests
Speed Tests and Profiling
Checklist
Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci