ray-less / rpc-less version for simpler debugging of smaller models #2202

vadimkantorov · 2025-06-25T10:35:31Z

vadimkantorov
Jun 25, 2025

Debugging distributed workers is hard, and even more so with ray.

Is it possible at all to have a sidekick, synchronous version (which also means single-worker/single-process with respect to FSDP) that fits on one GPU for interactive debugging, or a single-worker recipe where you can just drop in breakpoint()? Ideally it would reuse most of the code of the distributed version but execute in a single process, synchronously, so that interactive debugging is possible even from a terminal.

Of course, it wouldn't be 100% faithful with respect to the distributed aspects, but for small models it could be a helpful debugging test bed where you can just insert breakpoint().

eric-haibin-lin · 2025-06-29T03:35:00Z

eric-haibin-lin
Jun 29, 2025
Collaborator

Ray also supports breakpoint(); please see https://verl.readthedocs.io/en/latest/start/ray_debug_tutorial.html
Is there any other pain point of Ray that you dislike?
We did consider a non-Ray option. This was an attempt and it runs fine, with torchRPC as the backend. However, we noticed that torchRPC is not actively maintained by the PyTorch team right now, so it may introduce stability/performance issues in some corner cases, causing confusion for verl users. Furthermore, it introduces maintenance overhead for the repo, making it over-complicated relative to the benefit.
There may still be torch-native solutions, such as the design and goal of monarch (meta-pytorch/monarch#175), but currently they're not as mature.

7 replies

vadimkantorov Jul 4, 2025
Author

If it's possible to have a ray-less/rpc-less sidekick (ideally also a single-controller, parallel-less regime where the GRPO stages are just invoked sequentially via plain-old Python functions without RPC), it would be great. If it's even possible to share most of the code between the production-ready SPMD/rpc-full regime and the toy single-worker regime, even better.

For educational purposes, it's certainly beneficial to have a single-thread, rpc-less variant (e.g. might not even need to have). It should certainly be possible to play with a small 0.5B / 1.5B model on a single 80 GB GPU without parallel workers / rpc stuff.

Ideally, the only difference in the production and toy variant should be scalability/speed, but the obtained losses should be the same
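
A minimal sketch of what such a sequential, single-process regime could look like, with the GRPO stages as plain Python function calls. Everything here is hypothetical and not verl code: `ToyPolicy`, `rollout`, `reward`, `group_advantages`, `update` and `train_step` are illustrative names, the "model" is a single float, and the update rule is a toy stand-in for a real policy-gradient step. Only the group-relative reward normalization follows the usual GRPO recipe.

```python
import random
import statistics
from dataclasses import dataclass


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for the actor: one scalar 'weight'."""
    weight: float = 0.0


def rollout(policy, prompt, group_size, rng):
    # Sample a group of "responses" (plain floats here) for one prompt.
    return [policy.weight + rng.gauss(0.0, 1.0) for _ in range(group_size)]


def reward(prompt, response):
    # Toy reward: responses closer to the prompt value score higher.
    return -abs(response - prompt)


def group_advantages(rewards, eps=1e-6):
    # Group-relative advantage: normalize rewards within one group.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def update(policy, responses, advantages, lr=0.1):
    # Toy update (an assumption, not a real policy gradient):
    # move the weight toward responses with positive advantage.
    step = sum(a * (resp - policy.weight) for resp, a in zip(responses, advantages))
    policy.weight += lr * step / len(responses)


def train_step(policy, prompt, group_size=8, rng=None):
    # All stages run sequentially in one process, no RPC involved.
    if rng is None:
        rng = random.Random(0)
    responses = rollout(policy, prompt, group_size, rng)
    rewards = [reward(prompt, r) for r in responses]
    advantages = group_advantages(rewards)
    # breakpoint()  # uncomment to inspect any stage interactively
    update(policy, responses, advantages)
    return rewards
```

Because every stage is an ordinary function call, a `breakpoint()` anywhere in `train_step` drops into pdb in the terminal with all intermediate values in scope.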

vadimkantorov Jul 4, 2025
Author

Another issue I had is that, for some reason, Ray refuses to start up without any explanation:

vadimkantorov@delldevadim:/mnt/c/Users/vadim/testray$ RAY_TMPDIR=$PWD/tmp ray start --head
Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 192.168.0.24
Traceback (most recent call last):
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/_private/node.py", line 370, in __init__
    ray._private.services.wait_for_node(
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/_private/services.py", line 452, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /mnt/c/Users/vadim/testray/tmp/ray/session_2025-07-04_19-02-35_840062_30286/sockets/plasma_store in the list of object store socket names.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/vadimkantorov/.local/bin/ray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/scripts/scripts.py", line 2800, in main
    return cli()
           ^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/scripts/scripts.py", line 997, in start
    node = ray._private.node.Node(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/_private/node.py", line 375, in __init__
    raise Exception(
Exception: The current node timed out during startup. This could happen because some of the Ray processes failed to startup.

It then leaves dangling processes around, which lead to even more cryptic errors like the one below:

Usage stats collection is enabled. To disable this, add `--disable-usage-stats` to the command that starts the cluster, or run the following command: `ray disable-usage-stats` before starting the cluster. See https://docs.ray.io/en/master/cluster/usage-stats.html for more details.

Local node IP: 192.168.0.24
Traceback (most recent call last):
  File "/home/vadimkantorov/.local/bin/ray", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/scripts/scripts.py", line 2800, in main
    return cli()
           ^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/autoscaler/_private/cli_logger.py", line 823, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/scripts/scripts.py", line 997, in start
    node = ray._private.node.Node(
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/_private/node.py", line 364, in __init__
    self.start_head_processes()
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/_private/node.py", line 1460, in start_head_processes
    self._write_cluster_info_to_kv()
  File "/home/vadimkantorov/.local/lib/python3.12/site-packages/ray/_private/node.py", line 1415, in _write_cluster_info_to_kv
    assert curr_val == self._session_name.encode("utf-8"), (
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Session name session_2025-07-04_19-00-36_972347_28922 does not match persisted value b'session_2025-07-04_18-59-18_832073_27020'. Perhaps there was an error connecting to Redis.

eric-haibin-lin Jul 19, 2025
Collaborator

In general I agree that a single-process version of rollout & actor would greatly improve debuggability and hackability for educational purposes. However, it also comes with a maintenance cost. I guess it's possible to have a non-ray trainer similar to https://github.com/volcengine/verl/blob/main/verl/trainer/fsdp_sft_trainer.py, possibly reusing some of the actor/rollout components available in verl.

eric-haibin-lin Jul 19, 2025
Collaborator

The key question is what level of consistency we want to keep between the fully fledged distributed version and the simpler non-ray version. @vadimkantorov what are your thoughts on that?

vadimkantorov Jul 20, 2025
Author

I think that it's perfectly okay if not all features are supported in the serial "educational" single-process / single-node engine.

For consistency, it would be good to have some final benchmarks / curves that allow comparing run-time / loss values for the ray and non-ray trainer versions. Preferably, the config yamls themselves should also be accepted by the non-ray version (except that not all options will be respected). It should just be clear from the docs/warnings what exactly is not supported in the non-ray version.
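
A rough sketch of the "accept the same config, warn about what is ignored" idea, plus a tolerance check for comparing loss curves. The option paths in `UNSUPPORTED_PREFIXES` and the function names are placeholders chosen for illustration; which options a non-ray trainer would actually ignore is an open question here. The config is a plain nested dict, as one would get after loading a yaml file.

```python
import math
import warnings

# Hypothetical option paths that a single-process trainer would ignore.
UNSUPPORTED_PREFIXES = ("ray_init.", "trainer.nnodes", "trainer.n_gpus_per_node")


def flatten(cfg, prefix=""):
    """Flatten a nested dict into dotted keys: {'a': {'b': 1}} -> {'a.b': 1}."""
    flat = {}
    for key, value in cfg.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


def filter_for_single_process(cfg):
    """Return the supported flat options, warning about each ignored one."""
    supported = {}
    for path, value in flatten(cfg).items():
        if path.startswith(UNSUPPORTED_PREFIXES):
            warnings.warn(f"option {path!r} is ignored by the single-process trainer")
        else:
            supported[path] = value
    return supported


def losses_match(ray_losses, plain_losses, rel_tol=1e-3):
    """True if both runs logged the same number of steps with close losses."""
    return len(ray_losses) == len(plain_losses) and all(
        math.isclose(a, b, rel_tol=rel_tol) for a, b in zip(ray_losses, plain_losses)
    )
```

Warning instead of raising keeps existing yaml files usable unchanged, while still making the unsupported options visible in the logs.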

And it can also help/drive factoring out common helpers / functions / API to be reused (and maybe even some of these can be committed upstream to pytorch core or shared with torchtune/torchtitan). But if certain things do not make sense to be reused - it's okay for a research framework like Verl, as long as it's explained.

vadimkantorov · 2025-07-15T13:07:59Z

vadimkantorov
Jul 15, 2025
Author

It would also be easier to do precise memory control if there were a pipeline where actor/ref/rollout were just objects in a single process, using the same PyTorch allocator and sharing KV cache workspace (e.g. between rollout and ref), as mentioned by Unsloth in https://unsloth.ai/blog/grpo

2 replies

eric-haibin-lin Jul 19, 2025
Collaborator

I agree that a single process provides better memory control; that's why in the hybrid engine we put actor/rollout/ref together in one process: https://github.com/volcengine/verl/blob/main/verl/trainer/ppo/ray_trainer.py#L44

vadimkantorov Jul 20, 2025
Author

hmm, now curious what is hybrid engine... is the pytorch allocator shared between vllm/fsdp actor/fsdp ref? (btw, is it possible to also use vllm for ref? as gradients are not needed, only log_probs are needed)

Also, in a simpler set-up, maybe ref/actor model could be the same and just need to swap in/out weights? (if no extra-quantization/vllm-like-speedups is used for ref)
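
To illustrate the swap-in/swap-out idea, here is a toy sketch in pure Python. `ToyModel` and `swapped_weights` are hypothetical names; `ToyModel` only borrows the `state_dict()` / `load_state_dict()` method names from `torch.nn.Module`, so the same context manager shape could presumably be tried on a real module, but that is an assumption and not how verl handles the reference model.

```python
import copy
from contextlib import contextmanager


class ToyModel:
    """Hypothetical stand-in with torch.nn.Module-like method names."""

    def __init__(self, weights):
        self._weights = dict(weights)

    def state_dict(self):
        return self._weights

    def load_state_dict(self, weights):
        self._weights = dict(weights)

    def log_prob(self, token):
        # Toy "forward pass": look up a stored log-probability.
        return self._weights.get(token, 0.0)


@contextmanager
def swapped_weights(model, weights):
    """Temporarily load `weights` into `model`, restoring the originals on exit."""
    saved = copy.deepcopy(model.state_dict())
    model.load_state_dict(weights)
    try:
        yield model
    finally:
        # Restore the actor weights even if the body raised.
        model.load_state_dict(saved)


actor = ToyModel({"a": -0.1})
ref_weights = {"a": -0.5}

with swapped_weights(actor, ref_weights) as ref:
    ref_log_prob = ref.log_prob("a")   # computed with the reference weights

actor_log_prob = actor.log_prob("a")   # actor weights are back in place
```

Note that this still holds two copies of the weights at once (the saved copy and the loaded one); the saving would come from keeping only one model object and one set of activations or caches alive.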

vadimkantorov · 2025-07-22T14:22:07Z

vadimkantorov
Jul 22, 2025
Author

And a ray-less, single-process, sequential pipeline (which uses Python function calls instead of RPCs) would also serve as the up-to-date baseline (important for demonstrating performance benefits/features/scalability of distributed versions)

0 replies

sailfish009 · 2025-07-30T10:40:20Z

sailfish009
Jul 30, 2025

rl2 and uvg: the simplest and most convenient sources I've found. I really like their simplicity and how easy the source is to customize.

0 replies
