Living catalog of all evaluation metrics for FastVideo-WorldModel video quality assessment. Each metric includes a detailed explanation, implementation status, usage instructions, and interpretation guide.
Last updated: 2026-03-02
| Metric | Category | Status | Location | Trust |
|---|---|---|---|---|
| FVD | Distribution | ✅ Implemented | benchmarks/fvd/ |
High |
| SSIM | Reference | ✅ Implemented | fastvideo/tests/ssim/ |
High |
| LPIPS | Perceptual | ✅ Implemented | scripts/lora_extraction/ |
Medium |
| Loss trajectory | Training signal | ✅ Implemented | W&B train_loss |
Medium |
| Grad norm stability | Training signal | ✅ Implemented | W&B grad_norm |
Medium |
| GameWorld Score | Multi-dim benchmark | 🟡 External | Matrix-Game repo | Low |
| Human preference | Gold standard | 🔴 Manual | N/A | Highest |
Category: Distribution-level quality metric
Status: ✅ Fully implemented in benchmarks/fvd/
Trust: High — standard protocol, I3D feature extractor
FVD measures the distance between the distribution of generated videos and a distribution of real/reference videos. It works by:
- Extracting spatiotemporal features from both real and generated video sets using a pretrained I3D (Inflated 3D ConvNet) model.
- Modeling each set of features as a multivariate Gaussian (mean + covariance).
- Computing the Fréchet distance between the two Gaussians.
Lower FVD = generated videos are more statistically similar to real videos.
- FVD is the de facto standard for benchmarking video generation models.
- It captures both visual quality (are individual frames realistic?) and temporal coherence (do frames flow naturally?).
- Matrix-Game 2.0, Open-Sora, and most video generation papers report FVD.
- Requires a large sample set (standard protocol uses 2048 videos) to produce stable statistics. Small sample sizes yield noisy results.
- Measures distributional similarity, not per-video quality. A model could have low FVD by generating a diverse set of "roughly okay" videos.
- The I3D model was trained on Kinetics-400 (human actions). It may be less sensitive to domain-specific artifacts in non-human-action videos (e.g., driving, game environments).
- Does not directly measure text-video alignment or action controllability.
# Programmatic
from benchmarks.fvd import compute_fvd_with_config, FVDConfig
config = FVDConfig.fvd2048_16f() # Standard: 2048 videos, 16 frames
results = compute_fvd_with_config('data/real/', 'outputs/gen/', config)
print(f"FVD: {results['fvd']:.2f}")# CLI
python -m benchmarks.fvd.cli \
--real-path data/real/ \
--gen-path outputs/gen/ \
--protocol fvd2048_16fPreset protocols:
| Protocol | Videos | Frames | Use Case |
|---|---|---|---|
fvd2048_16f |
2048 | 16 | Standard benchmark (papers) |
fvd2048_128f |
2048 | 128 | Long video evaluation |
quick_test |
100 | 16 | Fast dev iteration |
Feature extractors: i3d (default, standard), clip, videomae
| FVD Range | Interpretation |
|---|---|
| < 100 | Excellent — near-real quality |
| 100–300 | Good — competitive with SOTA |
| 300–600 | Fair — noticeable gap from real |
| > 600 | Poor — significant quality issues |
FVD values are dataset-dependent. Always compare against baselines evaluated on the same real video distribution.
Category: Per-frame reference comparison
Status: ✅ Implemented in fastvideo/tests/ssim/
Trust: High — used in CI regression tests
SSIM compares two images (or video frames) based on three components:
- Luminance: brightness similarity
- Contrast: dynamic range similarity
- Structure: spatial pattern similarity
The final score is a value in [0, 1] where 1.0 = identical.
- Used as a regression guard in CI: ensures model updates don't degrade visual output below a threshold.
- More perceptually meaningful than raw pixel MSE.
- Fast to compute — suitable for automated testing.
- Requires a pixel-aligned reference video. Cannot compare videos with different seeds, prompts, or angles.
- Operates per-frame — does not capture temporal coherence.
- Insensitive to some perceptual artifacts (color shifts, high-frequency noise).
pytest fastvideo/tests/ssim/ -vs| SSIM Range | Quality |
|---|---|
| > 0.90 | Excellent — very close to reference |
| 0.80–0.90 | Good — acceptable for most uses |
| 0.70–0.80 | Fair — noticeable differences |
| < 0.70 | Poor — significant divergence |
Category: Per-frame perceptual distance
Status: ✅ Implemented in scripts/lora_extraction/lora_inference_comparison.py
Trust: Medium — available but only used for LoRA comparison currently
LPIPS uses a pretrained neural network (AlexNet by default) to extract deep features from two images and computes the distance between them in feature space. Unlike SSIM, LPIPS correlates much more strongly with human perceptual judgments.
Lower LPIPS = more perceptually similar.
- Best available automated proxy for human visual judgments at the frame level.
- Captures semantic and structural differences that SSIM misses (e.g., texture changes, minor recoloring).
- Used for validating LoRA merge quality.
- Per-frame metric — no temporal awareness.
- Requires reference video (paired comparison only).
- Slightly slower than SSIM due to neural network forward pass.
python scripts/lora_extraction/lora_inference_comparison.py \
--base merged_model \
--ft path/to/finetuned \
--adapter NONE \
--output-dir results \
--prompt "A cat" \
--compute-lpips| LPIPS Range | Quality |
|---|---|
| < 0.10 | Excellent — nearly indistinguishable |
| 0.10–0.20 | Good — minor perceptual differences |
| 0.20–0.40 | Fair — noticeable differences |
| > 0.40 | Poor — clearly different |
Category: Training signal proxy
Status: ✅ Active (from W&B train_loss)
Trust: Medium — proxy, not direct quality measure
Tracks the training loss over time. A healthy training run shows:
- Decreasing loss over the first hundreds of steps.
- Stable gradient norms (no wild spikes).
- Consistent step times (no infrastructure issues).
- Cheapest evaluation signal — available in real-time from W&B.
- Critical for the 30-minute quality check workflow.
- At later training stages (when loss becomes meaningful), trajectory shape can predict final model quality.
The team's experience shows evaluation signals change during a project:
- Early stage: Loss may be flat or meaningless → focus on SSIM & visual inspection instead.
- Mid stage: Loss starts decreasing → trajectory shape becomes useful.
- Late stage: Loss is meaningful → can compare trajectories across runs.
This dynamic is a key insight from the team's workflow: don't over-rely on loss early; don't ignore it late.
Category: Training health diagnostic
Status: ✅ Active (from W&B grad_norm)
Trust: Medium — diagnostic, not quality metric
The magnitude of gradients during training. Stable grad norms indicate healthy optimization. Spikes or NaN values indicate training instability.
| Condition | Meaning |
|---|---|
| Stable ~0.3–0.5 | Normal training |
| Single spike > 3× average | Possible bad batch, monitor |
| NaN or Inf | 🔴 Training has diverged — stop run |
| Increasing trend | Learning rate may be too high |
Category: Multi-dimensional evaluation framework for interactive world models Status: 🟡 External — not implemented in-repo Source: Matrix-Game 1.0 benchmark, used in Matrix-Game 2.0 paper
A comprehensive benchmark examining four critical capabilities:
| Dimension | What It Evaluates | Example Signals |
|---|---|---|
| Visual quality | Frame-level realism, absence of artifacts | Color fidelity, sharpness, coherence |
| Temporal quality | Smoothness across frames, motion consistency | Jitter, flickering, temporal aliasing |
| Action controllability | Response to input actions (keyboard/mouse) | Action delay, correctness, smoothness |
| Physical rule understanding | Adherence to physics (gravity, collision) | Object persistence, plausible motion |
- Evaluation uses 597-frame composite action sequences over 32 Minecraft scenes and 16 wild scenes.
- Action controllability assessment is Minecraft-specific — cannot be directly applied to wild/general scenes.
- The paper notes that models that "collapse" to static frames can paradoxically score higher on consistency metrics — beware of this confound.
- Matrix-Game 2.0 is built on SkyReels-V2/Wan2.1 architecture — same model family as FastVideo.
- Their distillation uses DMD-based Self-Forcing — same technique as our
self_forcing_distillation_pipeline.py. - GameWorld Score dimensions are a useful framework for thinking about world model quality even outside gaming contexts.
Category: Gold-standard quality assessment Status: 🔴 Manual process — no automated implementation Priority: Highest — this is the most important evaluation signal Trust: Highest — but expensive
Human evaluators compare generated videos and rate them on dimensions like:
- Overall quality and realism
- Temporal coherence and smoothness
- Prompt adherence / action correctness
- Absence of artifacts
All automated metrics are proxies for human judgment. They can be gamed or may miss artifacts that humans easily notice. Human preference is the ultimate ground truth for video generation quality.
| Approach | Cost | Scale | When to Use |
|---|---|---|---|
| Internal team review | Low | ~10–50 videos | Every major checkpoint |
| Crowdsource (MTurk, Scale) | Medium | 100+ videos | Pre-release validation |
| A/B preference test | Medium | Pairs | Comparing two model versions |
- Sample 10–20 videos from the model at a checkpoint.
- Include diverse prompts (easy + hard, short + long).
- Have 2–3 evaluators score each video 1–5 on: quality, coherence, fidelity.
- Record scores in the experiment journal.
| Metric | Reason |
|---|---|
| Not used by the team. Measures text-image alignment using CLIP embeddings, but not well-suited for video temporal quality. | |
| Inception Score (IS) | Less informative than FVD for video; primarily an image metric. |
| PSNR | Pixel-level metric; less perceptually meaningful than SSIM/LPIPS. |
Follow the SOP: .agents/workflows/evaluation-development.md
- Prototype in
.agents/exploration/ - Validate on known-good and known-bad samples
- Add to this registry
- Update the
evaluate-video-qualityskill