
Container requires manual eval of ldconfig #975


Description

@wokalski

What happened:
I am trying to run lmdeploy and it fails to start until /sbin/ldconfig is run manually inside the container:

~/dialo/master/k8s master* λ kubectl exec -it -n robot multi-intent-llm-5597c74c4-w8vp9 -- python -m lmdeploy.pytorch.check_env.triton_custom_add
[HAMI-core Msg(7:140513931132352:libvgpu.c:837)]: Initializing.....
[HAMI-core Msg(7:140513931132352:libvgpu.c:856)]: Initialized
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/lmdeploy/lmdeploy/pytorch/check_env/triton_custom_add.py", line 31, in <module>
    c = custom_add(a, b)
  File "/opt/lmdeploy/lmdeploy/pytorch/check_env/triton_custom_add.py", line 24, in custom_add
    _add_kernel[grid](a, b, c, size, BLOCK=BLOCK)
  File "/opt/py3/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/opt/py3/lib/python3.10/site-packages/triton/runtime/jit.py", line 607, in run
    device = driver.active.get_current_device()
  File "/opt/py3/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/opt/py3/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
  File "/opt/py3/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
  File "/opt/py3/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
    self.utils = CudaUtils()  # TODO: make static
  File "/opt/py3/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
  File "/opt/py3/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
  File "/opt/py3/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 45, in library_dirs
    return [libdevice_dir, *libcuda_dirs()]
  File "/opt/py3/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 39, in libcuda_dirs
    assert any(os.path.exists(os.path.join(path, 'libcuda.so.1')) for path in dirs), msg
AssertionError: libcuda.so cannot found!
Please make sure GPU is set up and then run "/sbin/ldconfig" (requires sudo) to refresh the linker cache.
[HAMI-core Msg(7:140513931132352:multiprocess_memory_limit.c:498)]: Calling exit handler 7
command terminated with exit code 1
~/dialo/master/k8s master* λ kubectl exec -it -n robot multi-intent-llm-5597c74c4-w8vp9 -- /sbin/ldconfig
~/dialo/master/k8s master* λ kubectl exec -it -n robot multi-intent-llm-5597c74c4-w8vp9 -- python -m lmdeploy.pytorch.check_env.triton_custom_add
[HAMI-core Msg(88:140412218805696:libvgpu.c:837)]: Initializing.....
[HAMI-core Msg(88:140412218805696:libvgpu.c:856)]: Initialized
Done.
[HAMI-core Msg(88:140412218805696:multiprocess_memory_limit.c:498)]: Calling exit handler 88
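
Running /sbin/ldconfig once inside the container refreshes the linker cache, after which Triton can locate libcuda.so.1 and the check passes. As a stopgap, the same refresh could be wired into the pod spec, for example (sketch only, not from HAMi or lmdeploy; the serve command below is a placeholder, not the image's real entrypoint):

# Illustrative workaround sketch: refresh the linker cache before the
# main process starts so Triton can find libcuda.so.1.
spec:
  containers:
    - name: multi-intent-llm
      image: openmmlab/lmdeploy:v0.7.2.post1-cu12
      command: ["/bin/bash", "-c"]
      args:
        # "lmdeploy serve api_server <model>" is a placeholder; substitute
        # the container's actual command and arguments.
        - /sbin/ldconfig && exec lmdeploy serve api_server <model>
      resources:
        limits:
          nvidia.com/gpu: 1   # assumed HAMi-managed GPU resource name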

What you expected to happen:
The container should start normally, without having to run /sbin/ldconfig manually.

How to reproduce it (as minimally and precisely as possible):
Run openmmlab/lmdeploy:v0.7.2.post1-cu12 in a pod deployed with HAMi, then run the Triton check module shown above inside the container.
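
A minimal reproducing pod could look like this (pod name, namespace, and the sleep command are illustrative; it assumes HAMi's default nvidia.com/gpu resource name):

apiVersion: v1
kind: Pod
metadata:
  name: lmdeploy-ldconfig-repro
  namespace: robot
spec:
  containers:
    - name: lmdeploy
      image: openmmlab/lmdeploy:v0.7.2.post1-cu12
      command: ["sleep", "infinity"]   # keep the container alive for kubectl exec
      resources:
        limits:
          nvidia.com/gpu: 1            # GPU allocated through HAMi

Then exec into the pod and run python -m lmdeploy.pytorch.check_env.triton_custom_add; it fails with the AssertionError above until /sbin/ldconfig is run manually.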

Environment:

  • HAMi version: v2.5.0
