Process-Scoped GPU Memory Accounting #1204
divyanshsinghvi wants to merge 8 commits into vllm-project:main
Conversation
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
cc: @tzhouam Should I also remove the sequential-initialization logic in omni_stage.py in this PR, or send a follow-up PR after this one is merged?
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 8e08010228
vllm_omni/worker/base.py
```python
visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
if visible_devices:
    try:
        physical_indices = [int(x.strip()) for x in visible_devices.split(",") if x.strip()]
        if local_rank < len(physical_indices):
```
Handle UUID/MIG CUDA_VISIBLE_DEVICES mappings
When CUDA_VISIBLE_DEVICES is set to UUIDs or MIG IDs (common in Kubernetes and multi-tenant setups), parsing with int(...) fails and this helper falls back to local_rank. That means nvmlDeviceGetHandleByIndex is called with a physical index that does not correspond to the visible device list, so the process memory query can target the wrong GPU. In those environments, the available KV cache calculation can be significantly wrong and lead to over-allocation. Consider supporting UUID/MIG strings (e.g., use nvmlDeviceGetHandleByUUID when parsing fails) to keep the NVML lookup aligned with CUDA device visibility.
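A minimal sketch of the kind of fallback this suggestion describes, assuming pynvml is the NVML binding in use. The helper name `_nvml_handle_for_local_rank` is hypothetical, and older pynvml releases may expect a bytes UUID rather than a str; this is an illustration, not the PR's implementation.

```python
import os
import pynvml

def _nvml_handle_for_local_rank(local_rank: int):
    """Resolve an NVML device handle that matches CUDA device visibility."""
    pynvml.nvmlInit()
    visible_devices = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    if visible_devices:
        entries = [x.strip() for x in visible_devices.split(",") if x.strip()]
        if local_rank < len(entries):
            entry = entries[local_rank]
            try:
                # Plain integer entries map directly to physical indices.
                return pynvml.nvmlDeviceGetHandleByIndex(int(entry))
            except ValueError:
                # UUID/MIG-style entries (e.g. "GPU-..." or "MIG-...") are
                # resolved by UUID so the NVML lookup stays aligned with
                # CUDA device visibility.
                return pynvml.nvmlDeviceGetHandleByUUID(entry)
    # No usable mapping: fall back to treating local_rank as a physical index.
    return pynvml.nvmlDeviceGetHandleByIndex(local_rank)
```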
vllm_omni/worker/base.py
```python
                local_rank,
                physical_device,
                e,
            )
            return 0
```
Avoid treating NVML failure as zero process memory
If nvmlInit or the NVML query fails, _get_process_gpu_memory returns 0, and the caller then treats the full pre-load requested_memory as available. In environments where NVML is unavailable or permission-restricted but CUDA is functional (e.g., some containers), this overestimates KV cache capacity because model weights/activations already occupy GPU memory after profile_run, which can lead to OOM during initialization. A safer fallback would be to use a global free-memory query or the profiling result when NVML errors occur, rather than assuming zero process usage.
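A hedged sketch of the safer fallback this comment proposes: when NVML is unavailable, estimate process usage from a global free-memory query instead of returning 0. The function name `_get_process_gpu_memory_or_fallback` is hypothetical; `torch.cuda.mem_get_info` is used here purely for illustration and is not the PR's code.

```python
import pynvml
import torch

def _get_process_gpu_memory_or_fallback(handle, pid: int, device: int = 0) -> int:
    """Return this process's GPU memory usage in bytes, with a conservative fallback."""
    try:
        for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
            if proc.pid == pid:
                return proc.usedGpuMemory or 0
        return 0
    except pynvml.NVMLError:
        # NVML unavailable or permission-restricted: estimate from the global
        # free/total numbers rather than assuming zero process usage, so the
        # KV-cache sizing stays conservative instead of over-allocating.
        free_bytes, total_bytes = torch.cuda.mem_get_info(device)
        return total_bytes - free_bytes
```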
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
Nice work! I went through the code, and here’s my understanding:
Would it make sense to wrap the sequential init into a helper function and add a (mock) check for process-level memory availability?
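For illustration only, a rough sketch of the kind of helper being asked about; the names (`sequential_stage_init`, `has_free_gpu_memory`) and the `initialize()` method on stages are hypothetical and not part of the PR.

```python
import torch

def has_free_gpu_memory(required_bytes: int, device: int = 0) -> bool:
    """Mock-style check: does the device currently have enough free memory?"""
    free_bytes, _total = torch.cuda.mem_get_info(device)
    return free_bytes >= required_bytes

def sequential_stage_init(stages, required_bytes_per_stage: int):
    """Initialize stages one by one, gating each on a process-level memory check."""
    for stage in stages:
        if not has_free_gpu_memory(required_bytes_per_stage):
            raise RuntimeError(f"Not enough free GPU memory to initialize stage {stage}")
        stage.initialize()  # assumes each stage exposes an initialize() method
```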
```diff
-class GPUGenerationWorker(OmniWorkerMixin, GPUWorker):
+class GPUGenerationWorker(OmniWorkerMixin, OmniGPUWorkerBase):
```
@princepride please check whether the diffusion worker needs to inherit from this OmniGPUWorkerBase
I just created a wrapper over GPUWorker so that I can patch the corresponding functions. If there is a cleaner approach, I can switch to that.
Unfortunately, DiffusionWorker cannot inherit from it due to interface differences. However, I wonder if we could extract the NVML util functions into an independent module, so that DiffusionWorker could also use it to assess the GPU memory usage of the current process.
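A hedged sketch of what such a shared utility module might look like (the module path and function name are hypothetical), so that both GPUGenerationWorker and DiffusionWorker could query per-process usage without sharing a worker base class.

```python
# hypothetical shared module, e.g. vllm_omni/utils/nvml_utils.py
import os
import pynvml

def process_gpu_memory_bytes(device_index: int, pid: int | None = None) -> int:
    """GPU memory (bytes) used by `pid` on the given device; 0 if the process is not found."""
    pid = os.getpid() if pid is None else pid
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
        return next((p.usedGpuMemory or 0 for p in procs if p.pid == pid), 0)
    finally:
        pynvml.nvmlShutdown()

# Both workers could then call process_gpu_memory_bytes(local_rank)
# instead of inheriting from a common OmniGPUWorkerBase.
```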
Would extracting an abstraction work? All workers would inherit from this abstraction.
Yes, I think it's a good idea.
What's inside the worker base?
Do you mean the vLLM one or the Omni wrapper (which inherits from it, only patching the function and keeping the other functionality the same)?
The Omni wrapper; the diffusion worker can't inherit from the vLLM GPU worker.
Yes, I saw the code, and diffusion_worker.py doesn't inherit from GPUWorker.
I abstracted the functions but didn't introduce them into the diffusion worker, since I think the diffusion worker currently requires the cuMemAllocator from vLLM to get memory usage.
The functionality for the vLLM engine looks good to me.
I have double-checked the code for the AR part's available memory determination and the dummy run in the diffusion engine. Since the diffusion engine's dummy run does not record memory usage, I think the current design should work even for mixed-structured models like Bagel.
Wow, this feature is very important to us, isn't it? The current stage loading is too slow, and we need to pass in additional parameters to extend the stage-initialization timeout.
Signed-off-by: Divyansh Singhvi <divyanshsinghvi@gmail.com>
LGTM, marked as ready.
Purpose
#1147
Test Plan
Test Result