HAMi not working with OpenShift #961

Open
@meetzuber

Hi

I have installed HAMi on OpenShift 4.14, but I get errors when using a GPU in a pod. The logs are below.

nvidia-smi output

$ oc exec -it gpu-pod -n kube-system -- bash

(base) jovyan@gpu-pod:/$ nvidia-smi
[HAMI-core Msg(17:140395794540352:libvgpu.c:837)]: Initializing.....
[HAMI-core ERROR (pid:17 thread=140395794540352 multiprocess_memory_limit.c:702)]: Fail to open shrreg /usr/local/vgpu/41ad5d5c-4392-4bcc-8014-55a2851b94bc.cache: errno=13
[HAMI-core ERROR (pid:17 thread=140395794540352 multiprocess_memory_limit.c:707)]: Fail to init shrreg /usr/local/vgpu/41ad5d5c-4392-4bcc-8014-55a2851b94bc.cache: errno=9
[HAMI-core ERROR (pid:17 thread=140395794540352 multiprocess_memory_limit.c:711)]: Fail to write shrreg /usr/local/vgpu/41ad5d5c-4392-4bcc-8014-55a2851b94bc.cache: errno=9
[HAMI-core ERROR (pid:17 thread=140395794540352 multiprocess_memory_limit.c:714)]: Fail to reseek shrreg /usr/local/vgpu/41ad5d5c-4392-4bcc-8014-55a2851b94bc.cache: errno=9
[HAMI-core ERROR (pid:17 thread=140395794540352 multiprocess_memory_limit.c:724)]: Fail to lock shrreg /usr/local/vgpu/41ad5d5c-4392-4bcc-8014-55a2851b94bc.cache: errno=9
Segmentation fault (core dumped)
(base) jovyan@gpu-pod:/$
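
For readers triaging the output above, the errno values it reports (13 on the open, 9 on every later step) can be decoded with Python's standard `errno` module. This is only a minimal sketch: the `observed` mapping is transcribed by hand from the messages, and `describe` is a hypothetical helper, not part of HAMi. On Linux, 13 is EACCES and 9 is EBADF, which would be consistent with the open being refused and the later calls running against an invalid file descriptor. Whether an OpenShift security policy is the cause is an assumption this report does not confirm.

```python
import errno
import os

# errno values transcribed from the HAMi-core messages (step -> errno).
observed = {"open": 13, "init": 9, "write": 9, "reseek": 9, "lock": 9}


def describe(code: int) -> str:
    """Return the symbolic name and the OS message for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


for step, code in observed.items():
    print(f"{step:>7}: errno={code} -> {describe(code)}")
```

The symbolic names are stable across Linux systems; the message text from `os.strerror` depends on the C library.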

hami-device-plugin logs

I0327 10:31:47.732347 4106103 register.go:210] Successfully registered annotation. Next check in 30s seconds...
I0327 10:32:17.750992 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:32:20.175352 4106103 register.go:173] nvml registered device id=1, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:32:20.175476 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:32:22.450079 4106103 register.go:173] nvml registered device id=2, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:32:22.450210 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:32:24.756897 4106103 register.go:173] nvml registered device id=3, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:32:24.757041 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:32:27.050690 4106103 register.go:173] nvml registered device id=4, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:32:27.050816 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:32:29.311718 4106103 register.go:173] nvml registered device id=5, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:32:29.311849 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:32:31.589590 4106103 register.go:173] nvml registered device id=6, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:32:31.589732 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:32:34.124015 4106103 register.go:173] nvml registered device id=7, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:32:34.124096 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:32:36.461329 4106103 register.go:173] nvml registered device id=0, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:32:36.461380 4106103 register.go:180] "start working on the devices" devices=[{"id":"GPU-31518e05-67f8-316e-6bce-e010baa1cc88","index":1,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164","index":2,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1","index":3,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163","index":4,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-33b5739b-85b3-4279-071b-380e4f936328","index":5,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0","index":6,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c","index":7,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-d5ce9128-646c-649e-3008-e1eaf764f229","count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true}]
I0327 10:32:36.461401 4106103 util.go:65] "Fetching node" nodeName="gpu-node.example.com"
I0327 10:32:36.468642 4106103 util.go:81] "Successfully fetched node" nodeName="gpu-node.example.com"
I0327 10:32:36.468676 4106103 register.go:190] patch node with the following annos map[hami.io/node-handshake:Reported 2025-03-27 10:32:36.468663762 +0000 UTC m=+5397.075938854 hami.io/node-nvidia-register:GPU-31518e05-67f8-316e-6bce-e010baa1cc88,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,1,hami-core:GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,2,hami-core:GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,3,hami-core:GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,4,hami-core:GPU-33b5739b-85b3-4279-071b-380e4f936328,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,5,hami-core:GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,6,hami-core:GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,7,hami-core:GPU-d5ce9128-646c-649e-3008-e1eaf764f229,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,0,hami-core:]
I0327 10:32:36.484741 4106103 register.go:210] Successfully registered annotation. Next check in 30s seconds...
I0327 10:33:06.496438 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:08.835371 4106103 register.go:173] nvml registered device id=1, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:08.835500 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:11.158095 4106103 register.go:173] nvml registered device id=2, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:11.158231 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:13.461787 4106103 register.go:173] nvml registered device id=3, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:13.461934 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:15.741857 4106103 register.go:173] nvml registered device id=4, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:15.742003 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:18.170124 4106103 register.go:173] nvml registered device id=5, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:18.170291 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:20.596357 4106103 register.go:173] nvml registered device id=6, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:20.596526 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:22.886892 4106103 register.go:173] nvml registered device id=7, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:22.886998 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:25.160834 4106103 register.go:173] nvml registered device id=0, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:25.160892 4106103 register.go:180] "start working on the devices" devices=[{"id":"GPU-31518e05-67f8-316e-6bce-e010baa1cc88","index":1,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164","index":2,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1","index":3,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163","index":4,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-33b5739b-85b3-4279-071b-380e4f936328","index":5,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0","index":6,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c","index":7,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-d5ce9128-646c-649e-3008-e1eaf764f229","count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true}]
I0327 10:33:25.160921 4106103 util.go:65] "Fetching node" nodeName="gpu-node.example.com"
I0327 10:33:25.167385 4106103 util.go:81] "Successfully fetched node" nodeName="gpu-node.example.com"
I0327 10:33:25.167426 4106103 register.go:190] patch node with the following annos map[hami.io/node-handshake:Reported 2025-03-27 10:33:25.167412544 +0000 UTC m=+5445.774687636 hami.io/node-nvidia-register:GPU-31518e05-67f8-316e-6bce-e010baa1cc88,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,1,hami-core:GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,2,hami-core:GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,3,hami-core:GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,4,hami-core:GPU-33b5739b-85b3-4279-071b-380e4f936328,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,5,hami-core:GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,6,hami-core:GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,7,hami-core:GPU-d5ce9128-646c-649e-3008-e1eaf764f229,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,0,hami-core:]
I0327 10:33:25.187455 4106103 register.go:210] Successfully registered annotation. Next check in 30s seconds...
I0327 10:33:55.205125 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:57.488012 4106103 register.go:173] nvml registered device id=6, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:57.488188 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:33:59.807066 4106103 register.go:173] nvml registered device id=7, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:33:59.807149 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:02.085074 4106103 register.go:173] nvml registered device id=0, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:02.085196 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:04.402593 4106103 register.go:173] nvml registered device id=1, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:04.402728 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:06.629591 4106103 register.go:173] nvml registered device id=2, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:06.629697 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:08.885992 4106103 register.go:173] nvml registered device id=3, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:08.886106 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:11.211706 4106103 register.go:173] nvml registered device id=4, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:11.211827 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:13.539471 4106103 register.go:173] nvml registered device id=5, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:13.539522 4106103 register.go:180] "start working on the devices" devices=[{"id":"GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0","index":6,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c","index":7,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-d5ce9128-646c-649e-3008-e1eaf764f229","count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-31518e05-67f8-316e-6bce-e010baa1cc88","index":1,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164","index":2,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1","index":3,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163","index":4,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-33b5739b-85b3-4279-071b-380e4f936328","index":5,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true}]
I0327 10:34:13.539543 4106103 util.go:65] "Fetching node" nodeName="gpu-node.example.com"
I0327 10:34:13.547045 4106103 util.go:81] "Successfully fetched node" nodeName="gpu-node.example.com"
I0327 10:34:13.547094 4106103 register.go:190] patch node with the following annos map[hami.io/node-handshake:Reported 2025-03-27 10:34:13.547080338 +0000 UTC m=+5494.154355430 hami.io/node-nvidia-register:GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,6,hami-core:GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,7,hami-core:GPU-d5ce9128-646c-649e-3008-e1eaf764f229,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,0,hami-core:GPU-31518e05-67f8-316e-6bce-e010baa1cc88,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,1,hami-core:GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,2,hami-core:GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,3,hami-core:GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,4,hami-core:GPU-33b5739b-85b3-4279-071b-380e4f936328,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,5,hami-core:]
I0327 10:34:13.565924 4106103 register.go:210] Successfully registered annotation. Next check in 30s seconds...
I0327 10:34:43.583486 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:45.869590 4106103 register.go:173] nvml registered device id=6, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:45.869722 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:49.328332 4106103 register.go:173] nvml registered device id=7, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:49.328422 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:51.587314 4106103 register.go:173] nvml registered device id=0, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:51.587407 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:53.885241 4106103 register.go:173] nvml registered device id=1, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:53.885355 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:56.204354 4106103 register.go:173] nvml registered device id=2, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:56.204471 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:34:58.467132 4106103 register.go:173] nvml registered device id=3, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:34:58.467298 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:00.759884 4106103 register.go:173] nvml registered device id=4, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:00.760037 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:03.039625 4106103 register.go:173] nvml registered device id=5, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:03.039677 4106103 register.go:180] "start working on the devices" devices=[{"id":"GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0","index":6,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c","index":7,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-d5ce9128-646c-649e-3008-e1eaf764f229","count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-31518e05-67f8-316e-6bce-e010baa1cc88","index":1,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164","index":2,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1","index":3,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163","index":4,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-33b5739b-85b3-4279-071b-380e4f936328","index":5,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true}]
I0327 10:35:03.039700 4106103 util.go:65] "Fetching node" nodeName="gpu-node.example.com"
I0327 10:35:03.045728 4106103 util.go:81] "Successfully fetched node" nodeName="gpu-node.example.com"
I0327 10:35:03.045773 4106103 register.go:190] patch node with the following annos map[hami.io/node-handshake:Reported 2025-03-27 10:35:03.045759048 +0000 UTC m=+5543.653034140 hami.io/node-nvidia-register:GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,6,hami-core:GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,7,hami-core:GPU-d5ce9128-646c-649e-3008-e1eaf764f229,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,0,hami-core:GPU-31518e05-67f8-316e-6bce-e010baa1cc88,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,1,hami-core:GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,2,hami-core:GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,3,hami-core:GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,4,hami-core:GPU-33b5739b-85b3-4279-071b-380e4f936328,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,5,hami-core:]
I0327 10:35:03.066712 4106103 register.go:210] Successfully registered annotation. Next check in 30s seconds...
I0327 10:35:33.085397 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:35.700289 4106103 register.go:173] nvml registered device id=7, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:35.700372 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:37.989766 4106103 register.go:173] nvml registered device id=0, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:37.989876 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:40.280752 4106103 register.go:173] nvml registered device id=1, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:40.280865 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:42.583442 4106103 register.go:173] nvml registered device id=2, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:42.583559 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:44.852659 4106103 register.go:173] nvml registered device id=3, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:44.852777 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:47.162161 4106103 register.go:173] nvml registered device id=4, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:47.162316 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:50.641498 4106103 register.go:173] nvml registered device id=5, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:50.641617 4106103 register.go:144] MemoryScaling= 1 registeredmem= 81559
I0327 10:35:52.940602 4106103 register.go:173] nvml registered device id=6, memory=81559, type=NVIDIA H100 80GB HBM3, numa=0
I0327 10:35:52.940678 4106103 register.go:180] "start working on the devices" devices=[{"id":"GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c","index":7,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-d5ce9128-646c-649e-3008-e1eaf764f229","count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-31518e05-67f8-316e-6bce-e010baa1cc88","index":1,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164","index":2,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1","index":3,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163","index":4,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-33b5739b-85b3-4279-071b-380e4f936328","index":5,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true},{"id":"GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0","index":6,"count":10,"devmem":81559,"devcore":100,"type":"NVIDIA-NVIDIA H100 80GB HBM3","mode":"hami-core","health":true}]
I0327 10:35:52.940701 4106103 util.go:65] "Fetching node" nodeName="gpu-node.example.com"
I0327 10:35:52.947888 4106103 util.go:81] "Successfully fetched node" nodeName="gpu-node.example.com"
I0327 10:35:52.947938 4106103 register.go:190] patch node with the following annos map[hami.io/node-handshake:Reported 2025-03-27 10:35:52.947919606 +0000 UTC m=+5593.555194702 hami.io/node-nvidia-register:GPU-f33046e8-f8ce-e356-9bd5-dc371cd3440c,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,7,hami-core:GPU-d5ce9128-646c-649e-3008-e1eaf764f229,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,0,hami-core:GPU-31518e05-67f8-316e-6bce-e010baa1cc88,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,1,hami-core:GPU-994fe2ef-c8c0-a723-e6f9-67dcc3ca7164,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,2,hami-core:GPU-084020d3-3b49-b252-c6dd-9351a72bc7b1,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,3,hami-core:GPU-a3efefa9-8d9e-fcd7-eb38-dee51143d163,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,4,hami-core:GPU-33b5739b-85b3-4279-071b-380e4f936328,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,5,hami-core:GPU-c3699744-0b88-6a13-641d-4f12bed5c1e0,10,81559,100,NVIDIA-NVIDIA H100 80GB HBM3,0,true,6,hami-core:]
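
The `hami.io/node-nvidia-register` value in the device-plugin log is hard to read by eye, so here is a rough sketch of splitting it into per-GPU records. The field order (id, count, devmem, devcore, type, numa, health, index, mode) is an assumption, inferred by lining the annotation up with the JSON `devices` list and the `numa=0` lines in the same log. It is not taken from HAMi documentation. `RegisteredDevice` and `parse_register_annotation` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RegisteredDevice:
    id: str
    count: int
    devmem: int
    devcore: int
    type: str
    numa: int
    health: bool
    index: int
    mode: str


def parse_register_annotation(value: str) -> list[RegisteredDevice]:
    """Split a ':'-terminated, ','-separated annotation into device records."""
    devices = []
    for entry in value.split(":"):
        if not entry:
            continue  # the value ends with ':', which leaves an empty tail
        fields = entry.split(",")
        if len(fields) != 9:
            raise ValueError(f"expected 9 fields, got {len(fields)}: {entry!r}")
        gpu_id, count, devmem, devcore, gpu_type, numa, health, index, mode = fields
        devices.append(RegisteredDevice(
            id=gpu_id, count=int(count), devmem=int(devmem), devcore=int(devcore),
            type=gpu_type, numa=int(numa), health=(health == "true"),
            index=int(index), mode=mode,
        ))
    return devices


# Two entries copied from the log above.
sample = (
    "GPU-31518e05-67f8-316e-6bce-e010baa1cc88,10,81559,100,"
    "NVIDIA-NVIDIA H100 80GB HBM3,0,true,1,hami-core:"
    "GPU-d5ce9128-646c-649e-3008-e1eaf764f229,10,81559,100,"
    "NVIDIA-NVIDIA H100 80GB HBM3,0,true,0,hami-core:"
)
for dev in parse_register_annotation(sample):
    print(dev.index, dev.id, dev.devmem, dev.health)
```

Read this way, every registration cycle reports the same eight healthy GPUs with 81559 MiB each and only the ordering changes, so the registration side looks consistent in these logs.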

Environment:

  • HAMi version: 2.5.0
  • NVIDIA driver: 550.54.14

Labels: kind/bug (Something isn't working)
