docs: add 8 new FAQ entries covering GPU virtualization, scheduling, and ecosystem integration (#416) #426

Merged: rootsongjc merged 1 commit into Project-HAMi:master from mesutoezdil:docs/faq-entries-416 on Jun 11, 2026

mesutoezdil (Contributor) commented May 29, 2026
Adds 8 new FAQ entries to docs/faq/faq.md covering the three topic areas defined in the issue. All questions were sourced from the research compiled in #415.

New entries

GPU virtualization model

  • How does HAMi enforce GPU memory and compute limits? Explains the libvgpu.so CUDA API interception mechanism, what it covers, and what it does not (DinD, direct driver API calls). Links to GPU Virtualization.
  • HAMi vGPU vs NVIDIA MIG. Side-by-side comparison table covering hardware requirements, isolation mechanism, enforcement strength, granularity, and dynamic reconfiguration. Guidance on when to use each.
  • Why does nvidia-smi inside a container show less memory than the host? Explains that this is intentional: libvgpu.so intercepts memory query calls and returns the allocated limit instead of the physical total.
  • Why is my gpumem limit not enforced? Covers the four root causes: CUDA_DISABLE_CONTROL, Docker-in-Docker, direct NVML/driver API calls, and misconfigured container runtime.
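The interception idea behind these four entries can be illustrated with a toy model. This is a conceptual sketch only, not libvgpu.so's implementation: `FakeDevice` and `LimitedView` are hypothetical names, and the sizes are arbitrary. It shows a wrapper that rejects allocations beyond a limit and reports the limit as the total, and why a caller that reaches the device directly is not constrained.

```python
class OutOfMemory(Exception):
    """Raised when an allocation would exceed the available memory."""


class FakeDevice:
    """Hypothetical stand-in for a physical GPU; not a real driver binding."""

    def __init__(self, total_mib):
        self.total_mib = total_mib
        self.used_mib = 0

    def alloc(self, mib):
        if self.used_mib + mib > self.total_mib:
            raise OutOfMemory("physical memory exhausted")
        self.used_mib += mib

    def mem_info(self):
        # Returns (free, total) in MiB for the whole device.
        return (self.total_mib - self.used_mib, self.total_mib)


class LimitedView:
    """Toy interception layer: all calls made through it see only the limit."""

    def __init__(self, device, limit_mib):
        self.device = device
        self.limit_mib = limit_mib
        self.used_mib = 0

    def alloc(self, mib):
        # Enforce the per-container limit before touching the device.
        if self.used_mib + mib > self.limit_mib:
            raise OutOfMemory("container limit exceeded")
        self.device.alloc(mib)
        self.used_mib += mib

    def mem_info(self):
        # Report the allocated limit as the total, not the physical size.
        return (self.limit_mib - self.used_mib, self.limit_mib)
```

A caller that uses `LimitedView` sees a 4096 MiB device even when the underlying `FakeDevice` has 24576 MiB. A caller that invokes `FakeDevice.alloc` directly skips the check entirely, which is the toy analogue of the unenforced paths listed above.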

Scheduling interaction

  • Does HAMi replace kube-scheduler or run alongside it? Explains the extender model, the MutatingWebhook schedulerName assignment, and the impact on non-HAMi pods (none). Includes a note on multi-replica leader election.
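The schedulerName assignment described in this entry can be sketched as a pure function over a pod manifest. This is a simplified illustration, not HAMi's webhook code: the resource names in `MANAGED_RESOURCES` and the value of `SCHEDULER_NAME` are assumptions chosen for the example, and a real admission webhook returns a JSON patch rather than a modified object.

```python
import copy

# Assumed values for illustration; check the deployed configuration.
MANAGED_RESOURCES = {"nvidia.com/gpu", "nvidia.com/gpumem", "nvidia.com/gpucores"}
SCHEDULER_NAME = "hami-scheduler"


def requests_managed_resource(pod):
    """Return True if any container lists a managed resource under limits."""
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        if MANAGED_RESOURCES.intersection(limits):
            return True
    return False


def mutate(pod):
    """Assign the scheduler name only to pods that request managed resources.

    Other pods are returned unchanged, so they stay on the default scheduler.
    """
    if not requests_managed_resource(pod):
        return pod
    patched = copy.deepcopy(pod)
    patched["spec"]["schedulerName"] = SCHEDULER_NAME
    return patched
```

Under these assumptions, a pod whose container sets `nvidia.com/gpumem` under `resources.limits` comes back with `spec.schedulerName` set, while a pod with only CPU and memory limits is returned as the same object.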

Ecosystem integration

  • HAMi with vLLM multi-GPU tensor parallelism. Documents the NCCL segfault issue (CUDA_DEVICE_MEMORY_SHARED_CACHE per-container, fixed in v2.7.0), single-GPU usage, and Volcano multi-pod setup. Links to issues #1764 and #1853.
  • HAMi with NVIDIA GPU Operator and DCGM. Explains the device plugin conflict and how to disable GPU Operator's device plugin. Notes that DCGM Exporter is unaffected.
  • Prometheus and Grafana monitoring. Covers the metrics endpoint, key metric names, scrape config, and importing the bundled static/grafana/gpu-dashboard.json dashboard.

Closes #416.
Refs #415.

netlify Bot commented May 29, 2026

Deploy Preview for project-hami ready!

Name Link
🔨 Latest commit 337648d
🔍 Latest deploy log https://app.netlify.com/projects/project-hami/deploys/6a2135240832dd00082fe628
😎 Deploy Preview https://deploy-preview-426--project-hami.netlify.app

@hami-robot hami-robot Bot added the size/L label May 29, 2026
mesutoezdil (Contributor, Author) commented

done @rootsongjc

@mesutoezdil mesutoezdil force-pushed the docs/faq-entries-416 branch from 24a8fb2 to 359c2cc on June 4, 2026 07:13
rootsongjc (Contributor) commented

I think this may be too long for an FAQ.

rootsongjc (Contributor) commented

Also, some of the FAQs could be moved to the Concept document or to other documents, or referenced from existing documents on the website, instead of putting everything in the FAQ, which makes it difficult to maintain later on.

@mesutoezdil mesutoezdil force-pushed the docs/faq-entries-416 branch from 359c2cc to c03cd3c on June 4, 2026 08:09
@hami-robot hami-robot Bot added size/M and removed size/L labels Jun 4, 2026
mesutoezdil (Contributor, Author) commented

> Also, some of the FAQs could be moved to the Concept document or to other documents, or referenced from existing documents on the website, instead of putting everything in the FAQ, which makes it difficult to maintain later on.

ok now?

… pages

Signed-off-by: mesutoezdil <mesudozdil@gmail.com>
@mesutoezdil mesutoezdil force-pushed the docs/faq-entries-416 branch from c03cd3c to 337648d on June 4, 2026 08:19

windsonsea (Member) left a comment

/approve

hami-robot Bot (Contributor) commented Jun 9, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mesutoezdil, windsonsea

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hami-robot hami-robot Bot added the approved label Jun 9, 2026
@rootsongjc rootsongjc merged commit 5776a63 into Project-HAMi:master Jun 11, 2026
10 of 11 checks passed
Development

Successfully merging this pull request may close these issues.

[docs/faq] Write new FAQ entries covering GPU virtualization, scheduling, and ecosystem integration

3 participants