Conversation
@Shangwei-Li Shangwei-Li commented Dec 24, 2025

What does this PR do?

Because ray collective is not yet supported on NPU, this PR replaces the ray collective group with a vLLM stateless process group to support fully async training and the checkpoint engine.
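
For context, here is a minimal sketch of the substitution, assuming vLLM's `StatelessProcessGroup` API; the PR's actual wiring lives in its new `distributed_util.py`, and the values below are placeholders rather than the recipe's real configuration.

```python
# Illustrative before/after of the group setup (not the PR's exact code).
# Before: a ray collective group, which is not yet available on NPU:
#   from ray.util import collective
#   collective.init_collective_group(world_size, rank, backend="nccl", group_name="weight_sync")
# After: a vLLM stateless process group, created over TCP and independent of
# the default torch.distributed world.
from vllm.distributed.utils import StatelessProcessGroup

# Example values; in the recipe these come from the trainer/rollout launch config.
master_addr, master_port = "127.0.0.1", 29510
rank, world_size = 0, 2

pg = StatelessProcessGroup.create(
    host=master_addr, port=master_port, rank=rank, world_size=world_size
)
```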

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: https://github.com/volcengine/verl/pulls?q=fully+async+npu
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data, cfg, reward
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

We evaluated the reward and throughput of fully async training with Qwen3-30B-A3B.
With the FSDP backend, we achieved an even higher reward than a previous collocated run (the pink line is fully async; the others are collocated).
[Figure: training reward curves, fully async (pink) vs. collocated]
As for throughput, we measured more than a 4x per-rank throughput gain comparing 64-rank fully async training to 128-rank collocated training; with half the ranks, that makes the 64-rank fully async run roughly 2x faster than the 128-rank collocated one.
[Figure: throughput comparison, 64-rank fully async vs. 128-rank collocated]

Also verified with Qwen3-0.6B in the cases below:

  1. Megatron backend.
  2. Checkpoint engine disabled.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this
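
The usage example above is left as the template placeholder, so here is a hedged sketch of what weight synchronization over the stateless group can look like. It assumes vLLM's `PyNcclCommunicator` (CUDA/NCCL); an NPU deployment would substitute an HCCL-backed communicator, and the function names below are illustrative rather than the recipe's actual API.

```python
# Hedged sketch (illustrative names): broadcasting updated trainer weights to
# rollout workers over a previously created StatelessProcessGroup `pg`.
import torch
from vllm.distributed.device_communicators.pynccl import PyNcclCommunicator


def build_weight_sync_comm(pg, device: torch.device):
    # NCCL-backed communicator over the stateless group; on NPU the recipe
    # would use an HCCL-backed counterpart instead.
    return PyNcclCommunicator(pg, device=device)


def broadcast_state_dict(comm, state_dict: dict, src: int = 0):
    # The trainer (rank `src`) sends each parameter; rollout ranks receive
    # into tensors of matching shape/dtype, then load them into the engine.
    for name, tensor in state_dict.items():
        comm.broadcast(tensor, src=src)
```

The design point is that the stateless group is created out-of-band, so the trainer and rollout workers can form a dedicated weight-sync channel without disturbing their existing torch.distributed worlds.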

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all of the following items before requesting a review; otherwise, the reviewer may deprioritize this PR.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the checkpointing and parameter synchronization logic to enable fully asynchronous training on NPUs by replacing ray.util.collective with a vllm-based stateless process group. This is a significant and necessary change for NPU support. The implementation is mostly solid, introducing a new distributed_util.py and adapting existing components. However, I've identified a critical issue where the device is hardcoded to 'npu' in a couple of places, which would break compatibility with other hardware like GPUs. Addressing this will make the solution robust and hardware-agnostic.
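
One way to address the hardcoded device is to resolve the accelerator at runtime; a minimal sketch follows (verl ships its own device utilities, so this is illustrative only).

```python
# Minimal sketch of hardware-agnostic device resolution instead of a
# hardcoded "npu" string (illustrative; not the repo's actual helper).
import torch


def resolve_device_name() -> str:
    # torch.npu is only present once torch_npu has been imported.
    if hasattr(torch, "npu") and torch.npu.is_available():
        return "npu"
    if torch.cuda.is_available():
        return "cuda"
    return "cpu"
```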

@Shangwei-Li Shangwei-Li changed the title [recipe, ckpt] Support fully async on NPU [recipe, ckpt] WIP:Support fully async on NPU Dec 24, 2025
@Shangwei-Li Shangwei-Li marked this pull request as draft December 24, 2025 16:17
@Shangwei-Li Shangwei-Li force-pushed the fully_async_npu branch 2 times, most recently from a650c49 to 170feb6 Compare December 26, 2025 12:53
@Shangwei-Li Shangwei-Li marked this pull request as ready for review December 26, 2025 13:24
@Shangwei-Li Shangwei-Li changed the title [recipe, ckpt] WIP:Support fully async on NPU [recipe, megatron, fsdp] Support fully async on NPU Dec 26, 2025
@ArronHZG ArronHZG self-requested a review December 28, 2025 15:26
f" offload model to cpu cost {offload_duration} seconds"
)

@register(dispatch_mode=Dispatch.ONE_TO_ALL, blocking=False)
Collaborator

This function looks the same in the FSDP and Megatron workers; can we move it to a shared place and use it in both backends?

Contributor Author

This could easily be done once this recipe migrates to engine workers. If we moved it into the Worker class now, it would mean maintaining yet another class in the recipe.
