Model Engine
- Switch default to new model engine
- Mark legacy engine as deprecated
- Feature parity between the new and legacy model engines (LoRA/PEFT, etc.): [megatron] feat: Share actor and ref in LoRA #4673, [worker] fix: new engine saves megatron LoRA adapters checkpoints #4866, [worker] feat: New engine share actor and ref for LoRA #4867
Megatron
- Performance optimization
- Megatron dynamic CP [BREAKING][megatron] feat: support dynamic CP #5057
- MoE multi-modal model training
- Long context training: fine-grained activation recomputation/offload
VeOmni
TorchTitan #5306
Rollout Engine
- Improve rollout server profiling: [perf] feat: verl profiler system support Agent Loop scenario and integrate torch.profiler #4320
- New rollout engine: TensorRT-LLM [ray,rollout,trtllm] feat: Adding tensorrt_llm as new rollout engine #4665
- Separate vllm worker from trainer, sync by cuda ipc [BREAKING][worker, rollout, vllm] feat: implement vLLM colocated training-inference rollout with process separation #4280
- Router reply
- AgentLoop
- Refactor tool definition and registration
- Support multiple AgentLoopOutput for one sample: prompt switch, context compression, multi-agent, etc.
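The "multiple AgentLoopOutput for one sample" item above can be sketched roughly as follows. This is an illustrative sketch only, assuming a hypothetical `AgentLoopOutput` shape and a hypothetical `run_agent_loop` helper; it is not verl's actual API. The idea: when a prompt switch or context compression restarts generation, the loop emits a new output segment for the same sample instead of truncating the previous one.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: names below are illustrative, not verl's actual API.
@dataclass
class AgentLoopOutput:
    prompt_ids: List[int]
    response_ids: List[int]
    response_mask: List[int] = field(default_factory=list)

def run_agent_loop(sample_prompt_ids: List[int], max_len: int) -> List[AgentLoopOutput]:
    """Emit one AgentLoopOutput per context segment, e.g. after a
    context-compression step restarts generation with a fresh prompt."""
    outputs = []
    context = list(sample_prompt_ids)
    for step_response in ([7, 8, 9], [10, 11]):  # stand-in for two generation turns
        if len(context) + len(step_response) > max_len:
            # Compression: start a new segment with a shortened prompt
            # instead of truncating the earlier segment.
            context = context[-2:]
        outputs.append(AgentLoopOutput(
            prompt_ids=list(context),
            response_ids=list(step_response),
            response_mask=[1] * len(step_response),
        ))
        context += step_response
    return outputs
```

A multi-agent or prompt-switch scenario would populate the list the same way: one `AgentLoopOutput` per (prompt, response) segment belonging to the sample.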
Checkpoint Engine
- Add checkpoint engine abstract interface [ckpt] feat: add checkpoint-engine abstraction #4775
- Add NCCL, NIXL transport backends and more: [ckpt] feat: add Hccl ckpt engine backend #4885, [ckpt] feat: add kimi ckpt engine backend #4954
- Add checkpoint engine manager [ckpt] feat: add CheckpointEngineManager #5031
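To show what the checkpoint-engine abstraction above buys, here is a minimal sketch, assuming a hypothetical interface; the method names (`send_weights`/`recv_weights`) and the `InProcessEngine` backend are assumptions for illustration, not the interface added in #4775. The point is that trainer and rollout code program against one interface while backends (NCCL, HCCL, NIXL, etc.) vary.

```python
import abc

# Hypothetical sketch of a checkpoint-engine abstraction; method names are
# assumptions, not verl's actual interface.
class CheckpointEngine(abc.ABC):
    """Moves updated trainer weights to rollout workers over some transport."""

    @abc.abstractmethod
    def send_weights(self, state_dict: dict) -> None:
        ...

    @abc.abstractmethod
    def recv_weights(self) -> dict:
        ...

class InProcessEngine(CheckpointEngine):
    """Trivial backend for tests: the 'transport' is a local buffer. Real
    backends would wrap NCCL/HCCL broadcasts or NIXL RDMA transfers instead."""

    def __init__(self):
        self._buf = None

    def send_weights(self, state_dict: dict) -> None:
        self._buf = dict(state_dict)  # copy to mimic crossing a process boundary

    def recv_weights(self) -> dict:
        return self._buf
```

A manager layer (cf. #5031) would then select and coordinate the concrete backend per deployment.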
Trainer
- Online policy distillation
- New sync trainer with TransferQueue
- RFC: [RFC] PPOTrainer with TransferQueue Integration #5400
- New sync trainer with TransferQueue: [trainer] feat: add new trainer with TranferQueue #5401
- Fully async trainer
- Refactor one-step-off/fully async with model engine and checkpoint engine [fsdp, megatron] feat: refactor fully-async and one-step-off training to support multiple checkpoint engine backends #5029
- Remove PartialAgentLoop [rollout] feat: support auto resume on abort in FullyAsyncLLMServerManager #5430
- Standalone megatron worker group to recompute old_log_prob
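The one-step-off pattern mentioned above can be sketched in a few lines: rollout for step n+1 runs while the trainer consumes step n, bounding staleness to one step. This is a generic illustration with hypothetical names (`one_step_off`, `generate`, `train`), not verl's implementation.

```python
import queue
import threading

# Illustrative sketch of "one-step-off" training: the producer generates the
# batch for step n+1 while the consumer trains on step n.
def one_step_off(num_steps, generate, train):
    batches = queue.Queue(maxsize=1)  # depth 1 => at most one step of staleness

    def producer():
        for step in range(num_steps):
            batches.put(generate(step))  # blocks while one batch is already queued
        batches.put(None)  # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    results = []
    while (batch := batches.get()) is not None:
        results.append(train(batch))
    return results
```

A fully async trainer generalizes this by decoupling the two sides further (multiple in-flight rollouts, weight sync via a checkpoint engine) rather than a depth-1 queue.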
Speculative Decoding
- Support MTP SFT/RL training [megatron] feat: Using MTP in RL Training and Inference #4936 [megatron] feat: Support MTP training in SFT #4981
- [rfc]:add speculator training scripts and checkpoint support #4947
Ascend NPU
Model Support List