HAMI-core ERROR: Device 0 OOM, but no OOM occurs when CUDA_DISABLE_CONTROL=true #1003

Open
@chinaran

What happened:

When running the stable-diffusion 2.1 model for text-to-image generation, the following error occurred: [HAMI-core ERROR (pid:25 thread=139892685178688 allocator.c:53)]: Device 0 OOM 27147995136 / 25757220864.

However, after adding CUDA_DISABLE_CONTROL=true, no OOM occurred, and images were successfully generated.
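
The two numbers in the error message can be reproduced from the detail logs in this report. The sketch below assumes the limit check is simply "current usage plus requested bytes must not exceed the limit"; that is inferred from the `_usage=... limit=... new_allocated=...` log lines, not taken from the HAMi-core source, and the function name `would_exceed` is hypothetical.

```python
# Values copied from the log lines in this report.
LIMIT = 25_757_220_864         # limit= in the allocator.c:51 lines
USAGE_BEFORE = 16_265_873_408  # usage after the 5473566720-byte allocation
REQUEST = 10_882_121_728       # bytesize of the last cuMemAlloc_v2 call


def would_exceed(usage: int, request: int, limit: int) -> bool:
    """Assumed check: the request is rejected when usage + request > limit."""
    return usage + request > limit


new_allocated = USAGE_BEFORE + REQUEST
print(new_allocated)                               # 27147995136
print(would_exceed(USAGE_BEFORE, REQUEST, LIMIT))  # True
print(new_allocated - LIMIT)                       # 1390774272 bytes over the limit
```

Under that assumption, the first number in `Device 0 OOM 27147995136 / 25757220864` is the usage that the final 10882121728-byte request would have produced, and the second is the configured limit.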

What you expected to happen:

No OOM should occur when using the stable-diffusion 2.1 model.

Additional observations:

  1. Referencing the discussion in "Device OOM error caused by a GPU memory usage calculation problem, which terminates prediction" #43, adding the ACTIVE_OOM_KILLER=0 environment variable did not resolve the OOM issue.

  2. After setting the LIBCUDA_LOG_LEVEL=3 environment variable, logs are continuously printed: userutil1=0 currentcores=6291456 total=6291456 limit=100 share=6291456. Can the logging frequency be optimized to avoid such rapid output?
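
Until the logging frequency is changed in the library, one possible workaround is to filter the watcher messages out of a saved pod log. This is a minimal sketch that only post-processes text; it does not change HAMi-core's behavior. The function name `drop_watcher_messages` is hypothetical, and the pattern is based solely on the log lines shown in this report.

```python
import re

# Matches one utilization-watcher message, including when it is appended
# to a progress-bar line. The message body is assumed to contain no "[".
WATCHER_RE = re.compile(
    r"\[HAMI-core Info\(\d+:\d+:multiprocess_utilization_watcher\.c:\d+\)\]:"
    r"[^\[\n]*"
)


def drop_watcher_messages(log_text: str) -> str:
    """Remove watcher messages and any lines left empty afterwards."""
    kept = []
    for line in log_text.splitlines():
        cleaned = WATCHER_RE.sub("", line).rstrip()
        if cleaned:  # also drops lines that were already blank
            kept.append(cleaned)
    return "\n".join(kept)


sample = (
    "[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: "
    "userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456\n"
    "\n"
    "[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: "
    "write last kernel time: 1744623474\n"
)
print(drop_watcher_messages(sample))
```

With the sample above, only the `write last kernel time` line remains. Memory-related lines such as the `allocator.c` and `memory.c` messages are left untouched.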

Anything else we need to know?:

  • The output of nvidia-smi -a on your host
[Image: nvidia-smi -a output]
  • The stable-diffusion pod container logs (LIBCUDA_LOG_LEVEL=3)
Detail Logs (`LIBCUDA_LOG_LEVEL=3`)

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=4 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623472
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda7561d48 bytesize=12582912
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=3095758848
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=3095758848 limit=25757220864 new_allocated=3108341760
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 12582912
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=3108341760
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 12582912 -> 3108341760
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3856400000 bytesize=12582912
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda75624f8 bytesize=12582912
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=3108341760
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=3108341760 limit=25757220864 new_allocated=3120924672
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 12582912
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=3120924672
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 12582912 -> 3120924672
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3857000000 bytesize=12582912
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda7561018 bytesize=866123776
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=3120924672
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=3120924672 limit=25757220864 new_allocated=3987048448
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 866123776
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=3987048448
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 866123776 -> 3987048448
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3822000000 bytesize=866123776
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda7562068 bytesize=3397386240
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=3987048448
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=3987048448 limit=25757220864 new_allocated=7384434688
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 3397386240
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=7384434688
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 3397386240 -> 7384434688
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3756000000 bytesize=3397386240
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda7561b68 bytesize=3397386240
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=7384434688
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=7384434688 limit=25757220864 new_allocated=10781820928
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 3397386240
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10781820928
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 3397386240 -> 10781820928
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f368a000000 bytesize=3397386240
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=7 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda75628d8 bytesize=2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10781820928
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=10781820928 limit=25757220864 new_allocated=10783918080
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10783918080
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 2097152 -> 10783918080
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3754800000 bytesize=2097152
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda75628d8 bytesize=2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10783918080
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=10783918080 limit=25757220864 new_allocated=10786015232
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10786015232
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 2097152 -> 10786015232
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3754a00000 bytesize=2097152
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda7561708 bytesize=2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10786015232
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=10786015232 limit=25757220864 new_allocated=10788112384
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10788112384
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 2097152 -> 10788112384
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3754c00000 bytesize=2097152
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda7561768 bytesize=2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10788112384
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=10788112384 limit=25757220864 new_allocated=10790209536
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10790209536
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 2097152 -> 10790209536
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3754e00000 bytesize=2097152
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda7561148 bytesize=2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10790209536
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=10790209536 limit=25757220864 new_allocated=10792306688
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 2097152
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10792306688
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 2097152 -> 10792306688
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3755000000 bytesize=2097152
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=7 currentcores=6291456 total=6291456 limit=100 share=6291456

2%|▏ | 1/50 [00:03<02:27, 3.01s/it][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=23 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=23 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=42 currentcores=6291456 total=6291456 limit=100 share=6291456

4%|▍ | 2/50 [00:03<01:06, 1.39s/it][HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623473
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=61 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=61 currentcores=6291456 total=6291456 limit=100 share=6291456

6%|▌ | 3/50 [00:03<00:40, 1.15it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=80 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=80 currentcores=6291456 total=6291456 limit=100 share=6291456

8%|▊ | 4/50 [00:03<00:28, 1.59it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=96 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

10%|█ | 5/50 [00:04<00:22, 2.03it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623474

12%|█▏ | 6/50 [00:04<00:18, 2.42it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

14%|█▍ | 7/50 [00:04<00:15, 2.76it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

16%|█▌ | 8/50 [00:04<00:13, 3.06it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

18%|█▊ | 9/50 [00:05<00:12, 3.28it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623475

20%|██ | 10/50 [00:05<00:11, 3.46it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

22%|██▏ | 11/50 [00:05<00:10, 3.59it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

24%|██▍ | 12/50 [00:05<00:10, 3.65it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

26%|██▌ | 13/50 [00:06<00:09, 3.74it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623476

28%|██▊ | 14/50 [00:06<00:09, 3.78it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

30%|███ | 15/50 [00:06<00:09, 3.81it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

32%|███▏ | 16/50 [00:06<00:08, 3.85it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

34%|███▍ | 17/50 [00:07<00:08, 3.86it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623477

36%|███▌ | 18/50 [00:07<00:08, 3.88it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

38%|███▊ | 19/50 [00:07<00:07, 3.88it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

40%|████ | 20/50 [00:07<00:07, 3.90it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

42%|████▏ | 21/50 [00:08<00:07, 3.90it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623478

44%|████▍ | 22/50 [00:08<00:07, 3.90it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

46%|████▌ | 23/50 [00:08<00:06, 3.90it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

48%|████▊ | 24/50 [00:08<00:06, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

50%|█████ | 25/50 [00:09<00:06, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623479

52%|█████▏ | 26/50 [00:09<00:06, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

54%|█████▍ | 27/50 [00:09<00:05, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

56%|█████▌ | 28/50 [00:09<00:05, 3.89it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

58%|█████▊ | 29/50 [00:10<00:05, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623480
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

60%|██████ | 30/50 [00:10<00:05, 3.92it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

62%|██████▏ | 31/50 [00:10<00:04, 3.92it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

64%|██████▍ | 32/50 [00:10<00:04, 3.92it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

66%|██████▌ | 33/50 [00:11<00:04, 3.92it/s][HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623481
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

68%|██████▊ | 34/50 [00:11<00:04, 3.92it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

70%|███████ | 35/50 [00:11<00:03, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

72%|███████▏ | 36/50 [00:11<00:03, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

74%|███████▍ | 37/50 [00:12<00:03, 3.90it/s][HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623482
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

76%|███████▌ | 38/50 [00:12<00:03, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

78%|███████▊ | 39/50 [00:12<00:02, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

80%|████████ | 40/50 [00:12<00:02, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

82%|████████▏ | 41/50 [00:13<00:02, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623483
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

84%|████████▍ | 42/50 [00:13<00:02, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

86%|████████▌ | 43/50 [00:13<00:01, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

88%|████████▊ | 44/50 [00:14<00:01, 3.92it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

90%|█████████ | 45/50 [00:14<00:01, 3.92it/s][HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623484
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

92%|█████████▏| 46/50 [00:14<00:01, 3.94it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

94%|█████████▍| 47/50 [00:14<00:00, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

96%|█████████▌| 48/50 [00:15<00:00, 3.91it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:257)]: write last kernel time: 1744623485

98%|█████████▊| 49/50 [00:15<00:00, 3.93it/s][HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=99 currentcores=6291456 total=6291456 limit=100 share=6291456

100%|██████████| 50/50 [00:15<00:00, 3.93it/s]
100%|██████████| 50/50 [00:15<00:00, 3.22it/s]
[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda7560b28 bytesize=5473566720
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=10792306688
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=10792306688 limit=25757220864 new_allocated=16265873408
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:383)]: add_gpu_device_memory:25 0 5473566720
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=16265873408
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:406)]: gpu_device_memory_added:25 0 5473566720 -> 16265873408
[HAMI-core Info(25:139892685178688:memory.c:143)]: res=0, cuMemAlloc_v2 success dptr=0x7f3542000000 bytesize=5473566720
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139892685178688:memory.c:138)]: into cuMemAllocing_v2 dptr=0x7ffda7560a88 bytesize=10882121728
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:289)]: get_gpu_memory_usage dev=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=1 host pid=0 i=0
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:296)]: dev=0 pid=25 host pid=17451 i=16265873408
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=16265873408 limit=25757220864 new_allocated=27147995136
[HAMI-core ERROR (pid:25 thread=139892685178688 allocator.c:53)]: Device 0 OOM 27147995136 / 25757220864
[HAMI-core Info(25:139892685178688:multiprocess_memory_limit.c:210)]: rm_quitted_process
2025-04-14 09:38:05,361 [mlserver.parallel] ERROR - An error occurred calling method 'predict' from model 'ran-diffusion-pgu-4090'.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/mlserver/parallel/worker.py", line 136, in _process_request
    return_value = await method(
  File "/opt/conda/lib/python3.10/site-packages/mlserver_diffusers/runtime.py", line 63, in predict
    prediction = self.__generate_image_with_cache(prompt[0], **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/mlserver_diffusers/runtime.py", line 74, in __generate_image_with_cache
    images = self._model(prompt, **kwargs)["images"]
  File "/opt/conda/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 1046, in __call__
    image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False, generator=generator)[
  File "/opt/conda/lib/python3.10/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 314, in decode
    decoded = self._decode(z).sample
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/autoencoders/autoencoder_kl.py", line 285, in _decode
    dec = self.decoder(z)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/autoencoders/vae.py", line 337, in forward
    sample = up_block(sample, latent_embeds)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/unets/unet_2d_blocks.py", line 2746, in forward
    hidden_states = resnet(hidden_states, temb=temb)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/diffusers/models/resnet.py", line 366, in forward
    hidden_states = self.conv2(hidden_states)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: CUDA error: unrecognized error code
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
INFO: 127.0.0.1:48510 - "POST /v2/models/ran-diffusion-pgu-4090/infer HTTP/1.1" 500 Internal Server Error
[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456

[HAMI-core Info(25:139887319181056:multiprocess_utilization_watcher.c:211)]: userutil1=98 currentcores=6291456 total=6291456 limit=100 share=6291456
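
For anyone reading the log, the accounting behind the OOM can be reproduced from the two `allocator.c:51` lines above. The sketch below parses those lines and recomputes the pending request size and the limit check. The function name `check_allocations` and the regex are illustrative only and are not part of HAMi-core. The strict `new_allocated > limit` comparison is an assumption inferred from the log, not taken from the HAMi-core source.

```python
import re

# Allocator lines copied verbatim from the container log above (allocator.c:51).
LOG = """\
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=10792306688 limit=25757220864 new_allocated=16265873408
[HAMI-core Info(25:139892685178688:allocator.c:51)]: _usage=16265873408 limit=25757220864 new_allocated=27147995136
"""

# Hypothetical pattern for this log format; not a HAMi-core API.
PATTERN = re.compile(r"_usage=(\d+) limit=(\d+) new_allocated=(\d+)")
GIB = 1024 ** 3


def check_allocations(log_text):
    """Return one (usage, request, limit, over_limit) tuple per allocator line."""
    results = []
    for match in PATTERN.finditer(log_text):
        usage, limit, new_allocated = map(int, match.groups())
        # The pending cuMemAlloc_v2 size is the difference between the
        # projected total and the usage recorded before the call.
        request = new_allocated - usage
        # Assumption: the request is rejected when the projected total
        # exceeds the limit (strict comparison inferred from the log).
        results.append((usage, request, limit, new_allocated > limit))
    return results


if __name__ == "__main__":
    for usage, request, limit, over in check_allocations(LOG):
        print(
            f"usage={usage / GIB:.2f} GiB "
            f"request={request / GIB:.2f} GiB "
            f"limit={limit / GIB:.2f} GiB "
            f"over_limit={over}"
        )
```

The recomputed request sizes match the `bytesize` values logged by `memory.c:138` (5473566720 and 10882121728 bytes). The second request, which the traceback suggests is issued during `vae.decode`, is the one that pushes the projected total past the 25757220864-byte limit.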

Environment:

  • HAMi version: v2.5.0
  • nvidia driver or other AI device driver version: 550.142
  • Containerd version: 1.7.23-4
  • Kernel version from uname -a: 3.10.0-1160.119.1.el7.x86_64
  • Others:
    • GPU 0: NVIDIA GeForce RTX 4090
    • CUDA Version: 12.4
