[feat] Refactor training framework into fastvideo/train #1159

Merged
jzhang38 merged 31 commits into hao-ai-lab:main from FoundationResearch:train-clean-refactor
Mar 9, 2026

Conversation

@alexzms (Collaborator) commented Mar 8, 2026

Summary

Introduces fastvideo/train, a refactored training framework that replaces the monolithic training/distillation pipelines with a modular, YAML-driven architecture.

Key design changes

  • _target_-based instantiation: Models and methods are selected via _target_ keys in YAML (e.g., fastvideo.train.models.wan.WanModel,
    fastvideo.train.methods.distribution_matching.dmd2.DMD2Method), making it easy to add new models/methods without modifying framework code.
  • Separated concerns: Models (models/), methods (methods/), callbacks (callbacks/), and the training loop (trainer.py) are fully decoupled. The trainer calls
    method.train_one_step() without knowing which method is running.
  • Callback system: Gradient clipping, validation, and EMA are now callbacks (callbacks/) rather than hardcoded in the training loop. Configured via the callbacks:
    section in YAML.
  • Structured config with defaults: TrainingConfig dataclass (utils/training_config.py) provides typed defaults for all training parameters. The fully-resolved config
    (with defaults filled in) is logged to W&B.
  • Checkpoint management: DCP-based save/resume with CheckpointManager, plus dcp_to_diffusers.py for converting checkpoints to Diffusers format.
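
The `_target_`-based instantiation described above can be sketched in a few lines. This is a minimal illustration of the Hydra-style convention, not the actual contents of fastvideo/train/utils/instantiate.py: the `_target_` key names a dotted import path and the remaining keys become constructor kwargs. The `instantiate` helper name and exact behavior here are assumptions.

```python
# Sketch of _target_-based instantiation (hypothetical helper, not the
# actual fastvideo/train/utils/instantiate.py implementation).
import importlib
from typing import Any


def instantiate(cfg: dict[str, Any]) -> Any:
    """Import the class named by `_target_` and call it with the remaining keys."""
    cfg = dict(cfg)  # don't mutate the caller's config
    module_path, _, class_name = cfg.pop("_target_").rpartition(".")
    cls = getattr(importlib.import_module(module_path), class_name)
    return cls(**cfg)


# A stdlib class stands in for a model/method class here:
frac = instantiate({"_target_": "fractions.Fraction",
                    "numerator": 3, "denominator": 4})
print(frac)  # 3/4
```

With this convention, adding a new model or method only requires a new class and a new `_target_` string in YAML; no framework code changes.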

Supported models & methods

Models:
  • Wan 2.1 (T2V 1.3B)
  • WanGame (incl. causal)

Methods:
  • DMD2 distillation
  • Self-forcing distillation
  • SFT finetuning
  • DFSFT (Diffusion Forcing SFT)

Bug fixes

  • CFG formula: Fixed real_score_guidance_scale in DMD2 and self-forcing to use the standard formula uncond + scale * (cond - uncond) instead of cond + scale * (cond - uncond) (which silently added +1 to the effective guidance scale).
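
The "+1" effect is easy to verify by algebra: `cond + s*(cond - uncond)` equals `uncond + (s + 1)*(cond - uncond)`, so the buggy form behaves like the standard formula at scale `s + 1`. Plain floats stand in for score tensors in this sketch:

```python
# Demonstrates why the old CFG formula silently added +1 to the
# effective guidance scale. Floats stand in for score tensors.
def cfg_fixed(uncond: float, cond: float, scale: float) -> float:
    return uncond + scale * (cond - uncond)   # standard CFG

def cfg_buggy(uncond: float, cond: float, scale: float) -> float:
    return cond + scale * (cond - uncond)     # old (buggy) form

uncond, cond, scale = 0.0, 1.0, 5.0
print(cfg_fixed(uncond, cond, scale))        # 5.0
print(cfg_buggy(uncond, cond, scale))        # 6.0
print(cfg_fixed(uncond, cond, scale + 1.0))  # 6.0 -- buggy == fixed at scale+1
```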

File structure

fastvideo/train/
trainer.py
models/{base, wan/, wangame/}
methods/{base, distribution_matching/, fine_tuning/}
callbacks/{callback, grad_clip, validation, ema}
entrypoint/{train, dcp_to_diffusers}
utils/{config, builder, training_config, checkpoint, dataloader, optimizer, tracking, ...}

Usage

torchrun --nproc_per_node=8 -m fastvideo.train.entrypoint.train \
    --config examples/distillation/refactor/distill_wan2.1_t2v_1.3B_dmd2.yaml
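
A config passed to the entrypoint follows the sections described above (models, method, callbacks, training). The fragment below is a hypothetical sketch of that layout; the `_target_` class paths for WanModel, DMD2Method, and GradNormClipCallback come from this PR, but every other key name and value is illustrative, not copied from the shipped example YAMLs:

```yaml
# Hypothetical config sketch -- key names other than _target_ are illustrative.
models:
  generator:
    _target_: fastvideo.train.models.wan.WanModel
    model_path: /path/to/wan2.1-t2v-1.3b

method:
  _target_: fastvideo.train.methods.distribution_matching.dmd2.DMD2Method
  real_score_guidance_scale: 3.5

callbacks:
  grad_clip:
    _target_: fastvideo.train.callbacks.grad_clip.GradNormClipCallback
    max_norm: 1.0

training:
  max_steps: 4000
  learning_rate: 1.0e-5
```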

Test plan

- DMD2 8-step distillation on Wan 2.1 T2V 1.3B matches legacy training loss curves
- VSA finetuning on Wan produces equivalent results to legacy pipeline
- Self-forcing distillation on WanGame runs without errors
- DFSFT on WanGame runs without errors
- Checkpoint save/resume round-trips correctly
- W&B logging shows fully-resolved config with defaults

Thanks to @jzhang38 for the extensive discussion, review, and code contributions!

@alexzms alexzms requested a review from jzhang38 March 8, 2026 00:42
@alexzms (Collaborator, Author) commented Mar 8, 2026

Related Issue: #1158

@gemini-code-assist (Contributor)

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant architectural overhaul to the training framework, moving from rigid, hardcoded pipelines to a highly modular and configurable system. The primary goal is to enhance flexibility and extensibility, allowing users to easily combine different models and training algorithms through declarative YAML configurations. This refactoring streamlines the process of experimenting with new models and methods, while also standardizing infrastructure concerns like distributed training, checkpointing, and validation.

Highlights

  • Modular Training Framework: Introduced a refactored training framework, fastvideo/train, that replaces monolithic pipelines with a modular, YAML-driven architecture for models, methods, and infrastructure.
  • target-based Instantiation: Implemented _target_ keys in YAML for easy instantiation of models and methods, allowing new components to be added without modifying core framework code.
  • Decoupled Components: Ensured full decoupling of models, methods, callbacks, and the training loop, promoting separation of concerns and flexibility.
  • Callback System: Integrated a flexible callback system for functionalities like gradient clipping, validation, and EMA, configurable via YAML.
  • Structured Configuration: Utilized a TrainingConfig dataclass for structured, typed defaults for all training parameters, with the fully-resolved config logged to W&B.
  • Checkpoint Management: Implemented DCP-based save/resume functionality with CheckpointManager and a utility for converting checkpoints to Diffusers format.
  • Bug Fix: CFG Formula: Corrected the real_score_guidance_scale formula in DMD2 and self-forcing methods to use the standard uncond + scale * (cond - uncond).


Changelog
  • examples/train/dfsft_wangame_causal_v3.yaml
    • Added a new YAML configuration for causal Diffusion-Forcing SFT on WanGame.
  • examples/train/distill_wan2.1_t2v_1.3B_dmd2.yaml
    • Added a new YAML configuration for DMD2 distillation on Wan 2.1 T2V.
  • examples/train/example.yaml
    • Added a comprehensive example YAML configuration file for the new training framework.
  • examples/train/finetune_wan2.1_t2v_1.3B_vsa_phase3.4_0.9sparsity.yaml
    • Added a new YAML configuration for VSA finetuning on Wan 2.1 T2V.
  • examples/train/finetune_wangame2.1_i2v_1.3B.yaml
    • Added a new YAML configuration for finetuning WanGame I2V.
  • examples/train/issue.md
    • Added an RFC document outlining the new training architecture for community discussion.
  • examples/train/rfc.md
    • Added an internal RFC document detailing the file structure and example YAML for the new training framework.
  • examples/train/run.sh
    • Added a shell script to launch training with the new YAML configurations.
  • examples/train/self_forcing_wangame_causal_v3.yaml
    • Added a new YAML configuration for causal Self-Forcing distillation on WanGame.
  • fastvideo/train/.style.yapf
    • Added a YAPF style configuration for consistent code formatting.
  • fastvideo/train/__init__.py
    • Initialized the fastvideo.train package.
  • fastvideo/train/callbacks/__init__.py
    • Initialized the fastvideo.train.callbacks package.
  • fastvideo/train/callbacks/callback.py
    • Defined the base Callback class and CallbackDict manager.
  • fastvideo/train/callbacks/ema.py
    • Implemented the EMACallback for exponential moving average updates.
  • fastvideo/train/callbacks/grad_clip.py
    • Implemented the GradNormClipCallback for gradient norm clipping.
  • fastvideo/train/callbacks/validation.py
    • Implemented a generic ValidationCallback for periodic inference validation.
  • fastvideo/train/entrypoint/__init__.py
    • Initialized the fastvideo.train.entrypoint package.
  • fastvideo/train/entrypoint/dcp_to_diffusers.py
    • Provided a script to convert DCP checkpoints to Diffusers format.
  • fastvideo/train/entrypoint/train.py
    • Implemented the main YAML-only training entrypoint.
  • fastvideo/train/methods/__init__.py
    • Initialized the fastvideo.train.methods package with lazy imports.
  • fastvideo/train/methods/base.py
    • Defined the abstract base class for training methods.
  • fastvideo/train/methods/consistency_model/__init__.py
    • Added a placeholder package for consistency model methods.
  • fastvideo/train/methods/distribution_matching/__init__.py
    • Initialized the fastvideo.train.methods.distribution_matching package.
  • fastvideo/train/methods/distribution_matching/dmd2.py
    • Implemented the DMD2 distillation training method.
  • fastvideo/train/methods/distribution_matching/self_forcing.py
    • Implemented the Self-Forcing distillation method for causal models.
  • fastvideo/train/methods/fine_tuning/__init__.py
    • Initialized the fastvideo.train.methods.fine_tuning package with lazy imports.
  • fastvideo/train/methods/fine_tuning/dfsft.py
    • Implemented the Diffusion-forcing SFT (DFSFT) training method.
  • fastvideo/train/methods/fine_tuning/finetune.py
    • Implemented the supervised fine-tuning (SFT) training method.
  • fastvideo/train/methods/knowledge_distillation/__init__.py
    • Added a placeholder package for knowledge distillation methods.
  • fastvideo/train/models/__init__.py
    • Initialized the fastvideo.train.models package.
  • fastvideo/train/models/base.py
    • Defined the abstract base classes for per-role model instances (ModelBase, CausalModelBase).
  • fastvideo/train/models/wan/__init__.py
    • Initialized the fastvideo.train.models.wan package.
  • fastvideo/train/models/wan/wan.py
    • Implemented the WanModel plugin for T2V models.
  • fastvideo/train/models/wan/wan_causal.py
    • Implemented the WanCausalModel plugin with streaming capabilities.
  • fastvideo/train/models/wangame/__init__.py
    • Initialized the fastvideo.train.models.wangame package.
  • fastvideo/train/models/wangame/wangame.py
    • Implemented the WanGameModel plugin for I2V models.
  • fastvideo/train/models/wangame/wangame_causal.py
    • Implemented the WanGameCausalModel plugin with streaming capabilities.
  • fastvideo/train/trainer.py
    • Implemented the core training loop logic.
  • fastvideo/train/utils/__init__.py
    • Initialized the fastvideo.train.utils package.
  • fastvideo/train/utils/builder.py
    • Provided functions to build training components from configuration.
  • fastvideo/train/utils/checkpoint.py
    • Managed checkpointing, saving, resuming, and cleanup using DCP.
  • fastvideo/train/utils/config.py
    • Defined RunConfig and provided utilities for parsing YAML configurations.
  • fastvideo/train/utils/dataloader.py
    • Provided functions to build parquet dataloaders for T2V and WanGame.
  • fastvideo/train/utils/instantiate.py
    • Provided utilities for _target_-based class instantiation.
  • fastvideo/train/utils/module_state.py
    • Provided a utility to set module trainability and mode.
  • fastvideo/train/utils/moduleloader.py
    • Provided functions to load specific model modules from paths.
  • fastvideo/train/utils/optimizer.py
    • Provided functions to build optimizers and learning rate schedulers.
  • fastvideo/train/utils/tracking.py
    • Provided functions to initialize and manage experiment trackers.
  • fastvideo/train/utils/training_config.py
    • Defined dataclasses for structured training configuration.
  • fastvideo/train/utils/validation.py
    • Provided utility functions for parsing validation-related configuration.
Activity
  • A new fastvideo/train directory was introduced, containing a refactored and modular training framework.
  • Core components for models, training methods, callbacks, and utilities were added, enabling a YAML-driven configuration approach.
  • New example YAML configurations were provided for various training scenarios, including DMD2 distillation, Self-Forcing, SFT, and DFSFT.
  • An RFC document detailing the new architecture was added for community review and discussion.

gemini-code-assist bot left a comment

Code Review

This pull request introduces a major and well-designed refactoring of the training framework, making it modular and YAML-driven. The separation of concerns into models, methods, and infrastructure is a significant improvement. The code is generally of high quality.

My review focuses on a few areas to improve portability and maintainability:

  • Hardcoded Paths: Several example configuration files and a shell script contain user-specific absolute paths, which should be replaced with placeholders or relative paths to make them portable.
  • Code Encapsulation: One of the entrypoint scripts imports private functions from another module, which could be refactored to improve encapsulation and reduce code duplication.
  • Documentation Formatting: There are minor markdown formatting issues in one of the documentation files.

Note: Security Review did not run due to the size of the PR.

Comment on lines +291 to +356
def _run_config_from_raw(
    raw: dict[str, Any],
) -> Any:
    """Reconstruct a RunConfig from a raw config dict.

    This mirrors ``load_run_config`` but operates on an
    already-parsed dict (from metadata.json) instead of
    reading from a YAML file.
    """
    from fastvideo.train.utils.config import (
        RunConfig,
        _build_training_config,
        _parse_pipeline_config,
        _require_mapping,
        _require_str,
    )

    models_raw = _require_mapping(
        raw.get("models"), where="models",
    )
    models: dict[str, dict[str, Any]] = {}
    for role_key, model_cfg_raw in models_raw.items():
        role_str = _require_str(
            role_key, where="models.<role>",
        )
        model_cfg = _require_mapping(
            model_cfg_raw,
            where=f"models.{role_str}",
        )
        models[role_str] = dict(model_cfg)

    method_raw = _require_mapping(
        raw.get("method"), where="method",
    )
    method = dict(method_raw)

    callbacks_raw = raw.get("callbacks", None)
    callbacks: dict[str, dict[str, Any]] = (
        _require_mapping(
            callbacks_raw, where="callbacks",
        )
        if callbacks_raw is not None
        else {}
    )

    pipeline_config = _parse_pipeline_config(
        raw, models=models,
    )

    training_raw = _require_mapping(
        raw.get("training"), where="training",
    )
    t = dict(training_raw)
    training = _build_training_config(
        t,
        models=models,
        pipeline_config=pipeline_config,
    )

    return RunConfig(
        models=models,
        method=method,
        training=training,
        callbacks=callbacks,
        raw=raw,
    )

Severity: medium

The function _run_config_from_raw and its use of private functions (e.g., _build_training_config, _parse_pipeline_config) from fastvideo.train.utils.config suggest a need for refactoring. Importing private members from other modules can lead to fragile code.

Consider one of the following approaches:

  1. Make the helper functions in fastvideo.train.utils.config public if they are intended for reuse.
  2. Refactor load_run_config to accept either a file path or a pre-loaded dictionary, which would eliminate the need for _run_config_from_raw and the private imports.
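
Option 2 can be sketched as a simple type dispatch. This is a hypothetical illustration, not the actual fastvideo code: `_parse_run_config` stands in for the existing private parsing pipeline, and json is used here only to keep the sketch stdlib-runnable (the real loader would use yaml.safe_load).

```python
# Sketch of load_run_config accepting either a config file path or a
# pre-parsed dict, removing the need for _run_config_from_raw.
# `_parse_run_config` is a hypothetical stand-in for the real parsing.
import json
from pathlib import Path
from typing import Any, Union


def load_run_config(source: Union[str, Path, dict[str, Any]]) -> dict[str, Any]:
    if isinstance(source, dict):
        raw = source  # e.g. loaded from metadata.json on resume
    else:
        # Real code would use yaml.safe_load on the YAML config file.
        raw = json.loads(Path(source).read_text())
    return _parse_run_config(raw)


def _parse_run_config(raw: dict[str, Any]) -> dict[str, Any]:
    # Placeholder for the real validation and RunConfig construction.
    return dict(raw)


cfg = load_run_config({"method": {"_target_": "pkg.Method"}})
print(cfg["method"]["_target_"])  # pkg.Method
```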

@jzhang38 jzhang38 added the go Trigger Buildkite CI label Mar 9, 2026
@jzhang38 jzhang38 merged commit bc27a03 into hao-ai-lab:main Mar 9, 2026
1 of 3 checks passed