[VLM] Reduce GPU memory footprint of CUDA IPC MM feature transport#22662

Open
yhyang201 wants to merge 3 commits into sgl-project:main from yhyang201:fix-cuda-ipc-memory
Conversation


yhyang201 (Collaborator) commented Apr 13, 2026

Motivation

SGLANG_USE_CUDA_IPC_TRANSPORT=1 python -m sglang.launch_server --model-path Qwen/Qwen3-VL-4B-Instruct --tp 4 --trust-remote-code

Under SGLANG_USE_CUDA_IPC_TRANSPORT=1, each non-source TP rank was creating a full CUDA context on the producer GPU (~500 MiB per rank, ~1.6 GiB at TP=4) because _new_shared_cuda(*handle) routes through CUDAGuard(handle[0]), and handle[0] is the producer's device index.
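
Rewriting the device index in the rebuild tuple before opening the handle avoids this. A minimal sketch of that step (the helper name is hypothetical, not SGLang's actual code; only the assumption that the device index sits at position 0 of the tuple matches the description above):

```python
# Hypothetical sketch of the device-rewrite step. PyTorch's CUDA IPC
# rebuild tuple carries the *producer's* device index at position 0,
# so opening it unmodified runs under CUDAGuard(handle[0]) and
# initializes a full CUDA context on the producer GPU from every
# consumer rank.

def rewrite_handle_device(handle: tuple, consumer_device: int) -> tuple:
    """Return a copy of the IPC rebuild tuple with element 0 replaced
    by the consumer rank's own device index, so the subsequent
    cudaIpcOpenMemHandle maps the producer's memory via lazy peer
    access instead of touching the producer GPU's context."""
    return (consumer_device,) + tuple(handle[1:])


# Example: a handle minted on GPU 0, opened by the rank on GPU 3.
rewritten = rewrite_handle_device((0, b"ipc-handle-bytes", 4096), 3)
```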

  • Rewrite handle[0] to the consumer's own device before opening the IPC handle. cudaIpcOpenMemHandle uses cudaIpcMemLazyEnablePeerAccess, so the producer's memory is mapped via P2P without touching the producer GPU's context. Applied to both pooled and non-pooled paths; _pool_storage_cache is keyed by consumer device.
  • Scale MmItemMemoryPool by tokenizer_worker_num (128 MiB floor) so SGLANG_MM_FEATURE_CACHE_MB is the total budget across workers.
  • Reduce default SGLANG_MM_FEATURE_CACHE_MB from 4 GiB to 1 GiB; add a one-shot WARNING when the pool cannot fit a tensor.
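
The pool-sizing rule in the last two bullets can be sketched as follows (an illustrative reconstruction, not SGLang's actual code; function and constant names are assumptions):

```python
# Illustrative sketch of the sizing rule described above:
# SGLANG_MM_FEATURE_CACHE_MB is the *total* budget, split across
# tokenizer workers, with a 128 MiB per-worker floor.

DEFAULT_MM_FEATURE_CACHE_MB = 1024  # new default: 1 GiB (was 4 GiB)
MIN_POOL_MB = 128                   # per-worker floor

def per_worker_pool_mb(total_budget_mb: int, tokenizer_worker_num: int) -> int:
    """Divide the total MM-feature cache budget across workers,
    never dropping below the 128 MiB floor."""
    return max(MIN_POOL_MB, total_budget_mb // max(1, tokenizer_worker_num))
```

For example, with the new 1 GiB default and 4 tokenizer workers, each worker's pool is 256 MiB; with 16 workers the floor kicks in and each worker still gets 128 MiB.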

Before: (GPU memory screenshot omitted)

After: (GPU memory screenshot omitted)

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@yhyang201 yhyang201 changed the title [VLM] Fix [VLM] Reduce GPU memory footprint of CUDA IPC MM feature transport Apr 13, 2026
@yhyang201

/tag-and-rerun-ci
