[Bugfix] Guard cache hugepage allocation in Kubernetes #965

Draft
dante159753 wants to merge 2 commits into ModelEngine-Group:develop from dante159753:cache-memory-logs

@dante159753

Contributor

What changed

  • Added a cache_use_hugepage CacheStore option, defaulting to false.
  • Kept io_direct independent from explicit hugepage allocation: direct I/O host buffers no longer try MAP_HUGETLB unless cache_use_hugepage: true is set.
  • When hugepage allocation is disabled, Ascend direct-I/O host buffers use anonymous mmap with MADV_NOHUGEPAGE. When it is enabled, they keep the existing behavior: explicit hugetlb allocation that tries 1GiB pages first, then 2MiB pages, and finally falls back to anonymous mmap with hugepage advice.
  • Included startup diagnostics around cache buffer allocation and documented the new config in the example and PipelineStore guide.
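
For reviewers who want the allocation order in one place, here is a minimal Linux-only sketch of the behavior described above. It is not the code in this PR: `CacheBufferOptions`, `TryHugetlb`, and `AllocHostBuffer` are hypothetical names, and the use of `MADV_HUGEPAGE` for "hugepage advice" is an assumption.

```cpp
#include <sys/mman.h>

#include <cstddef>
#include <cstdio>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26  // Linux ABI value; defined here only if the libc headers omit it
#endif

// Hypothetical options struct mirroring the two config keys named in this PR.
struct CacheBufferOptions {
    bool io_direct = false;
    bool cache_use_hugepage = false;  // new option, off by default
};

// Try one explicit hugetlb mapping; shift is log2 of the page size (30 = 1GiB, 21 = 2MiB).
static void* TryHugetlb(std::size_t size, int shift) {
    const std::size_t page = std::size_t{1} << shift;
    if (size == 0 || size % page != 0) return nullptr;  // conservative: require aligned sizes
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | (shift << MAP_HUGE_SHIFT);
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? nullptr : p;
}

// Returns a mapped buffer of `size` bytes, or nullptr on failure.
void* AllocHostBuffer(std::size_t size, const CacheBufferOptions& opt) {
    if (opt.cache_use_hugepage) {
        // Opt-in path: 1GiB hugetlb, then 2MiB hugetlb.
        if (void* p = TryHugetlb(size, 30)) return p;
        if (void* p = TryHugetlb(size, 21)) return p;
    }
    // Default path, and last resort for the opt-in path: plain anonymous mmap.
    void* p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        std::perror("mmap");
        return nullptr;
    }
    // Advice is best-effort; kernels built without THP reject it, so the result is ignored.
    (void)madvise(p, size, opt.cache_use_hugepage ? MADV_HUGEPAGE : MADV_NOHUGEPAGE);
    return p;
}
```

With default options the function never passes `MAP_HUGETLB`, which is the property this PR relies on for pods without usable hugepage quota.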

Why

In Kubernetes deployments, hugepage (hugetlb) resources can be visible inside a pod without being usable by it, either because the pod cgroup has no hugepage quota or because the runtime has not mounted or configured hugepages correctly. In that state, UCM's previous default direct-I/O path attempted explicit hugetlb allocation during service startup. The attempt could fail outright or proceed into an unstable allocation path, and either outcome could abort vLLM engine initialization.

The safer default is to avoid explicit hugepage allocation unless the operator opts in after confirming the pod has usable hugepage resources.
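
As a rough illustration of what "confirming usable hugepage resources" could look like, the sketch below reads the pod's cgroup v2 hugetlb limit instead of node-wide counters. It is not part of this PR. The function names are hypothetical, and the default path assumes cgroup v2 with the hugetlb controller visible at `/sys/fs/cgroup` inside the container. A nonzero limit only shows that quota exists; it does not guarantee that free pages are available on the node.

```cpp
#include <cstdint>
#include <exception>
#include <fstream>
#include <limits>
#include <optional>
#include <string>

// Parse the contents of a cgroup v2 `hugetlb.<size>.max` file:
// either the literal "max" (unlimited) or a byte count.
std::optional<std::uint64_t> ParseHugetlbMax(const std::string& text) {
    std::string s = text;
    while (!s.empty() && (s.back() == '\n' || s.back() == ' ')) s.pop_back();
    if (s == "max") return std::numeric_limits<std::uint64_t>::max();
    if (s.empty() || s.find_first_not_of("0123456789") != std::string::npos) {
        return std::nullopt;  // not a plain decimal number
    }
    try {
        return std::stoull(s);
    } catch (const std::exception&) {
        return std::nullopt;  // value out of range
    }
}

// True if the cgroup limit permits at least `need_bytes` of 2MiB hugepages.
// A missing or unreadable file is treated as "no quota".
bool PodHasHugetlbQuota(std::uint64_t need_bytes,
                        const std::string& path = "/sys/fs/cgroup/hugetlb.2MB.max") {
    std::ifstream in(path);
    std::string line;
    if (!in || !std::getline(in, line)) return false;
    const auto limit = ParseHugetlbMax(line);
    return limit.has_value() && *limit >= need_bytes;
}
```

An operator could run a check like this before setting `cache_use_hugepage: true`. A limit of `0`, which is what a pod that requested no hugepages would typically report, is a signal to leave the option off.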

Impact

  • Existing configs keep working without adding the new option.
  • io_direct: true still enables direct I/O, but does not imply explicit hugetlb allocation by default.
  • Operators that want the old hugepage behavior can set cache_use_hugepage: true.
  • Startup logs now make cache buffer allocation size, direct-I/O mode, hugepage mode, and allocation failures easier to diagnose.

Verification

  • git diff --check
  • codespell via pre-commit during commit
  • C++ build/tests were not run locally because cmake is not installed in this workspace.
