Description
Some HPC applications may want to use hugepages (2 MiB or 1 GiB page sizes) to reduce TLB pressure.
In container runtimes, there are several examples to support hugepages:
Some references on hugepages:
- https://repost.aws/knowledge-center/configure-hugepages-ec2-linux-instance
- https://help.ubuntu.com/community/KVM%20-%20Using%20Hugepages
- https://access.redhat.com/solutions/36741
- https://docs.oracle.com/database/121/UNXAR/appi_vlm.htm#UNXAR391
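Before enabling hugepages on a test host, it may help to check what the kernel currently reports. The following is a minimal sketch, assuming a Linux host; the function name `parse_hugepage_info` is hypothetical and not part of Backend.AI. It only reads the `HugePages_*` counters and `Hugepagesize` from `/proc/meminfo`, which describe the default hugepage size; pools for other sizes are exposed under `/sys/kernel/mm/hugepages/`.

```python
from pathlib import Path


def parse_hugepage_info(meminfo_text: str) -> dict[str, int]:
    """Extract HugePages_* counters and Hugepagesize (in kB) from /proc/meminfo text."""
    info: dict[str, int] = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        key = key.strip()
        # Skip unrelated fields such as AnonHugePages (transparent hugepages).
        if key.startswith("HugePages_") or key == "Hugepagesize":
            # Values look like "     512" or "    2048 kB"; keep the number only.
            info[key] = int(rest.split()[0])
    return info


if __name__ == "__main__":
    path = Path("/proc/meminfo")
    if path.exists():
        print(parse_hugepage_info(path.read_text()))
```

A `HugePages_Total` of 0 would indicate that no pages of the default hugepage size are reserved on that host.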
We need to explicitly enable hugepages on part of our testing infrastructure and implement an option such as:
backend.ai create -r mem=16G --resource-opt shm=1G,huge-2Mi=512M,huge-1Gi=4G ...
or,
backend.ai create -r mem=16G -r mem.huge-2m=512M -r mem.huge-1g=1G --resource-opt shm=1G ...
The first option (resource-opt) does not prevent overlapping usage; it only allows containers to access hugepages, subject to limits.
The second option (resource-slot) treats hugepages as an accounted resource that cannot be shared between different containers. For consistency with MIG slots (`cuda.mig-5g`, ...), I've removed the trailing `i` (binary suffix) in the resource slot names.
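To illustrate what "accounted" means for the second option, here is a rough sketch of exclusive per-size bookkeeping. The `HugepageSlot` class and `parse_size` helper are hypothetical names made up for this example, not existing Backend.AI APIs, and the sketch assumes that suffixes like `M` and `G` denote binary multiples.

```python
_UNITS = {"K": 2**10, "M": 2**20, "G": 2**30}


def parse_size(text: str) -> int:
    """Convert strings such as '512M' or '4G' to bytes (assumed binary multiples)."""
    number, unit = text[:-1], text[-1].upper()
    return int(number) * _UNITS[unit]


class HugepageSlot:
    """Hypothetical accounted pool for one hugepage size on one agent."""

    def __init__(self, page_size: int, capacity: int) -> None:
        if capacity % page_size:
            raise ValueError("capacity must be a multiple of the page size")
        self.page_size = page_size
        self.capacity = capacity
        self.allocated = 0

    def allocate(self, amount: int) -> int:
        """Reserve `amount` bytes exclusively and return the number of pages."""
        if amount % self.page_size:
            raise ValueError("request must be a multiple of the page size")
        if self.allocated + amount > self.capacity:
            raise RuntimeError("insufficient hugepages")
        self.allocated += amount
        return amount // self.page_size
```

With a 1 GiB pool of 2 MiB pages, two containers requesting `512M` each would both succeed, and a third request would be rejected. Under the resource-opt model there is no such bookkeeping, so the same pages could be claimed by several containers.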