Description
Some HPC applications may want to use hugepages (2 MiB / 1 GiB page sizes) to reduce TLB pressure.
Container platforms already provide examples of hugepage support:
- https://kubernetes.io/docs/tasks/manage-hugepages/scheduling-hugepages/
Some references on hugepages:
- https://repost.aws/knowledge-center/configure-hugepages-ec2-linux-instance
- https://help.ubuntu.com/community/KVM%20-%20Using%20Hugepages
- https://docs.oracle.com/database/121/UNXAR/appi_vlm.htm#UNXAR391
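As a quick check of which hugepage pools a Linux host currently exposes, here is a minimal sketch that reads the kernel's sysfs entries. The function name `list_hugepage_pools` is hypothetical and not part of Backend.AI; the sketch only assumes the standard `/sys/kernel/mm/hugepages` layout.

```python
from pathlib import Path


def list_hugepage_pools(base="/sys/kernel/mm/hugepages"):
    """Return {page_size_kib: (total_pages, free_pages)} for each pool found.

    Hypothetical helper for illustration; returns an empty dict when the
    directory does not exist (e.g. hugepages unsupported or sysfs not mounted).
    """
    pools = {}
    root = Path(base)
    if not root.is_dir():
        return pools
    for entry in root.iterdir():
        name = entry.name
        # Directory names look like "hugepages-2048kB" or "hugepages-1048576kB".
        if not (name.startswith("hugepages-") and name.endswith("kB")):
            continue
        size_kib = int(name[len("hugepages-"):-len("kB")])
        total = int((entry / "nr_hugepages").read_text())
        free = int((entry / "free_hugepages").read_text())
        pools[size_kib] = (total, free)
    return pools


if __name__ == "__main__":
    for size_kib, (total, free) in sorted(list_hugepage_pools().items()):
        print(f"{size_kib} KiB pages: total={total}, free={free}")
```

A host with no preallocated hugepages typically reports a total of 0 for each pool, which is one way to tell that hugepages still need to be enabled there.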
We need to explicitly enable hugepages on part of our testing infrastructure and implement an option such as:
backend.ai create -r mem=16G --resource-opt shm=1G,huge-2Mi=512M,huge-1Gi=4G ...
or,
backend.ai create -r mem=16G -r mem.huge-2m=512M -r mem.huge-1g=1G --resource-opt shm=1G ...
The first option (resource-opt) does not prevent overlapping usage; it only allows containers to access hugepages, subject to limits.
The second option (resource-slot) treats hugepages as an accounted resource that cannot be shared between different containers. For consistency with MIG slots (cuda.mig-5g, ...), I've removed the trailing i (binary suffix) in the resource slot names.
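To make the two proposals concrete, the following is a rough sketch of how a resource-opt string could be turned into hugepage counts, and how an option key could map to a slot name without the trailing i. All names here (`parse_size`, `hugepage_counts`, `slot_name`) are hypothetical and are not Backend.AI's actual API. It assumes that sizes such as 512M are binary (MiB), that a request must be a whole multiple of the page size, and that the 2 MiB slot is named mem.huge-2m.

```python
# Hypothetical helpers for illustration only; not Backend.AI's actual API.
SIZE_UNITS = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}

# Page sizes in bytes for the option keys used in the proposal.
HUGEPAGE_SIZES = {"huge-2Mi": 2 * 2**20, "huge-1Gi": 2**30}


def parse_size(text):
    """Parse '512M' or '4G' into bytes, treating units as binary (assumption)."""
    text = text.strip()
    unit = text[-1].upper()
    if unit in SIZE_UNITS:
        return int(text[:-1]) * SIZE_UNITS[unit]
    return int(text)


def hugepage_counts(resource_opt):
    """Return {option_key: number_of_pages} for the hugepage entries only."""
    counts = {}
    for item in resource_opt.split(","):
        key, _, value = item.partition("=")
        key = key.strip()
        if key not in HUGEPAGE_SIZES:
            continue  # e.g. "shm" is handled elsewhere
        size = parse_size(value)
        page = HUGEPAGE_SIZES[key]
        if size % page != 0:
            raise ValueError(f"{key}={value} is not a multiple of the page size")
        counts[key] = size // page
    return counts


def slot_name(opt_key):
    """Map 'huge-2Mi' to 'mem.huge-2m' by dropping the binary suffix 'i'."""
    base = opt_key[:-1] if opt_key.endswith("i") else opt_key
    return "mem." + base.lower()


if __name__ == "__main__":
    print(hugepage_counts("shm=1G,huge-2Mi=512M,huge-1Gi=4G"))
    print(slot_name("huge-2Mi"), slot_name("huge-1Gi"))
```

Under the resource-slot option, the resulting page counts would additionally be subtracted from a per-host pool so that two containers cannot claim the same pages; under the resource-opt option they would only serve as per-container limits.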