
CPU memory leaking? #14

@JasonHe-WQ

Desc

On a single server with one Tesla T4 and 32 GiB of CPU memory, it always OOMs even while idle; I can't even wait for a single benchmark to complete.

Env

Bare-metal server
Driver version: 545.23.06, CUDA version: 12.3
OS: Ubuntu 22.04.4 LTS x86_64
Host: NUC9VXQNX K47173-406
Kernel: 6.5.0-41-generic
Ray: 2.34.0

More details

Running command

python -m sarathi.entrypoints.openai_server.api_server --model_name 01-ai/Yi-6B-200k --model_tensor_parallel_degree 1 --model_attention_backend fi_vattn --model_block_size 16


The sarathi server took over 25 GiB of CPU memory on a single Ray worker.
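
To reproduce the measurement, here is a minimal sketch (assuming psutil is installed; the PID is the ray::RayWorker from the report below and only illustrative) that polls the worker's resident memory while the engine initializes:

# Minimal sketch: poll the resident memory (RSS) of a Ray worker process.
# Assumes psutil is installed; PID 90411 is the ray::RayWorker from the
# OOM report below and is only an example -- replace it with the real PID.
import time
import psutil

WORKER_PID = 90411

proc = psutil.Process(WORKER_PID)
while True:
    rss_gib = proc.memory_info().rss / (1024 ** 3)
    print(f"RayWorker RSS: {rss_gib:.2f} GiB")
    time.sleep(5)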

Python traceback

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/root/vattention/sarathi-lean/sarathi/entrypoints/openai_server/api_server.py", line 125, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/root/vattention/sarathi-lean/sarathi/engine/async_llm_engine.py", line 274, in from_engine_args
    engine = super().from_engine_args(**kwargs)
  File "/root/vattention/sarathi-lean/sarathi/engine/llm_engine.py", line 17, in from_engine_args
    engine = BaseLLMEngine(*engine_configs)
  File "/root/vattention/sarathi-lean/sarathi/engine/base_llm_engine.py", line 110, in __init__
    self._init_cache()
  File "/root/vattention/sarathi-lean/sarathi/engine/base_llm_engine.py", line 226, in _init_cache
    output_all = self._run_workers(
  File "/root/vattention/sarathi-lean/sarathi/engine/base_llm_engine.py", line 425, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/auto_init_hook.py", line 21, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 2659, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/usr/local/lib/python3.10/dist-packages/ray/_private/worker.py", line 873, in get_objects
    raise value
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory.
Memory on the node (IP: 172.16.251.111, ID: 9c2b341f18d9991990f509490a5d70abf52bf71f36aefd5f7669e122) where the task (actor ID: 7a14d0306d676302c532723201000000, name=RayWorker.__init__, pid=90411, memory used=13.79GB) was running was 29.48GB / 31.01GB (0.950528), which exceeds the memory usage threshold of 0.95. Ray killed this worker (ID: 6b07b0e46521d0babdaca6dc30f724c7db89cc020135f2fa6062cb80) because it was the most recently scheduled task; to see more information about memory usage on this node, use `ray logs raylet.out -ip 172.16.251.111`. To see the logs of the worker, use `ray logs worker-6b07b0e46521d0babdaca6dc30f724c7db89cc020135f2fa6062cb80*out -ip 172.16.251.111. Top 10 memory users:
PID     MEM(GB) COMMAND
90411   13.79   ray::RayWorker.execute_method
89330   0.36    python -m sarathi.entrypoints.openai_server.api_server --model_name 01-ai/Yi-6B-200k --model_tensor_...
89353   0.10    /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2...
40794   0.07    /usr/libexec/fwupd/fwupd
89426   0.06    /usr/bin/python /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=127.0.0.1 ...
89540   0.05    /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address=...
89425   0.04    /usr/bin/python -u /usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/monitor.py --logs...
89566   0.04    ray::IDLE
89568   0.04    ray::IDLE
89570   0.04    ray::IDLE
Refer to the documentation on how to address the out of memory issue: https://docs.ray.io/en/latest/ray-core/scheduling/ray-oom-prevention.html. Consider provisioning more memory on this node or reducing task parallelism by requesting more CPUs per task. Set max_restarts and max_task_retries to enable retry when the task crashes due to OOM. To adjust the kill threshold, set the environment variable `RAY_memory_usage_threshold` when starting Ray. To disable worker killing, set the environment variable `RAY_memory_monitor_refresh_ms` to zero.
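
As a stopgap (this does not fix the underlying memory growth), the environment variables named in the message above can be set before Ray starts. A minimal sketch, assuming Ray is started locally from the same Python process so the spawned raylet inherits the variables; exporting the same variables in the shell before the api_server command should behave the same way:

# Stopgap sketch only: relax or disable Ray's memory monitor.
# These are the environment variables named in the OOM message above; they
# change when Ray kills workers, they do not fix the leak itself.
import os

# Raise the kill threshold above the 0.95 shown in the report ...
os.environ["RAY_memory_usage_threshold"] = "0.98"
# ... or disable worker killing entirely by turning off the monitor.
# os.environ["RAY_memory_monitor_refresh_ms"] = "0"

# Import and start Ray only after the variables are set so the raylet
# launched by ray.init() inherits them.
import ray

ray.init()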
