
feat: support dynamic mode decomposition calibrator#1053

Merged
DefTruth merged 3 commits into
vipshop:mainfrom
Archerkattri:dmd-calibrator
Jun 14, 2026
Conversation

@Archerkattri

Contributor

Add a Dynamic Mode Decomposition (Prony) exponential-basis calibrator (calibrator_type="dmd")

Motivation

cache-dit's calibrators currently forecast cached hidden states / residuals with the
TaylorSeer polynomial expansion. This PR adds a second, drop-in calibrator backend
with an exponential forecast basis: Dynamic Mode Decomposition (Schmid 2010), the
SVD-regularised multivariate generalisation of Prony's method (1795). (To avoid the
common collision: this is not Distribution Matching Distillation.)

Honest, family-conditional pitch. We benchmarked both bases across two diffusion
families, and no single basis wins:

  • On flow-matching 3D generators the exponential basis wins clearly and the lead
    grows with the cache interval (numbers below). This is the regime this calibrator is
    for.
  • On DiT-class denoising (DiT-XL/2 ImageNet-256, 250-step DDPM) the ranking inverts:
    the sign-correct TaylorSeer polynomial is near-lossless (paired-noise FID drift 2.27 vs
    the uncached baseline at 3.81x), while the exponential basis drifts 1.7-1.9x more than
    even a near-reuse Hermite control at every interval tested. We therefore do not
    claim DMD as a better default; it is an additional basis for the workloads where it
    wins, default behavior unchanged.

The mechanism behind the 3D win: across denoising steps each cached feature stream
evolves under a slowly varying, near-linear operator; the exact solution class of a
linear feature-ODE is a sum of damped/oscillatory exponentials, and the exponential basis
is exact on that class where any polynomial diverges under extrapolation. Whether a given
model family's stream is in that class at the served horizons is empirical, hence the
per-family numbers below.
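
A scalar toy case makes the mechanism concrete. The sketch below is illustrative only and is not part of the PR: the function names and the trajectory y_t = 0.8**t are assumptions. It compares an order-1 finite-difference (Taylor-style) extrapolation with a one-pole exponential fit on samples of a damped exponential.

```python
def taylor1_forecast(history, k):
    """Order-1 finite-difference extrapolation k steps past the last sample."""
    slope = history[-1] - history[-2]
    return history[-1] + k * slope


def exponential_forecast(history, k):
    """One-pole exponential fit: estimate the per-step ratio, then advance by ratio**k."""
    lam = history[-1] / history[-2]
    return history[-1] * lam ** k


# Hypothetical damped trajectory y_t = 0.8**t, sampled at t = 0..3.
history = [0.8 ** t for t in range(4)]
k = 6
truth = 0.8 ** (3 + k)

err_taylor = abs(taylor1_forecast(history, k) - truth)
err_exp = abs(exponential_forecast(history, k) - truth)
print(f"taylor error: {err_taylor:.3f}, exponential error: {err_exp:.2e}")
```

The linear extrapolation overshoots past zero at this horizon, while the exponential fit reproduces the trajectory up to floating-point rounding. Real feature streams are only approximately in this class, which is why the per-family benchmarks matter.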

It plugs into the existing CalibratorConfig pattern, exactly like
TaylorSeerCalibratorConfig:

import cache_dit
from cache_dit import DMDCalibratorConfig

cache_dit.enable_cache(
    pipe,
    calibrator_config=DMDCalibratorConfig(dmd_history=6),
)

The reference implementation
(hicache-plus-plus) also ships a
training-free holdout selector (backend="auto") that backcasts a held-out snapshot with
both bases per compute window and serves the winner. We benchmarked it on this exact
split and report the honest verdict: it solves intra-run regime switches, but it does
not recover the family-level winner on DiT (both holdout modes served the exponential
arm there, FID drift 18.11 vs the corrected polynomial's 3.54), so the recommended way
to consume this calibrator is a per-family default (DMD for flow-matching
generators; TaylorSeer for DiT-class denoising), not a selector. This PR keeps the
surface minimal: one new basis.

What the calibrator does (math summary)

At each full-compute step the calibrator records the computed tensor as a snapshot
(per named stream, like the TaylorSeer states). At an approximation step it:

  1. takes the longest uniformly spaced suffix of the snapshot history (the identified
    propagator advances exactly one snapshot-spacing per application, and DBCache's dynamic
    decisions can make the compute cadence non-uniform; mixed spacings would corrupt the
    fit);
  2. identifies the linear propagator A with Y_{t+1} ~ A Y_t via one economy SVD of the
    [d, n] snapshot matrix (n = history <= 6, so this is cheap relative to a forward
    pass) with spectrum-based rank truncation (this is what rejects noise);
  3. eigendecomposes once per compute window (cached; refit only when a new snapshot
    arrives) and forecasts the (fractional) horizon k by eigenvalue powers:
    Y_{t+k} ~ Phi (lambda^k * b), b = pinv(Phi) Y_t.
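
The three steps above can be sketched in a few lines. This is a minimal NumPy illustration under stated assumptions, not the code in dmd.py (which is torch-based): the helper names uniform_suffix, dmd_fit and dmd_forecast, the rank_tol threshold, and the synthetic trajectory are all hypothetical.

```python
import numpy as np


def uniform_suffix(steps):
    """Return the longest suffix of `steps` whose consecutive gaps are all equal."""
    if len(steps) < 2:
        return list(steps)
    gap = steps[-1] - steps[-2]
    start = len(steps) - 2
    while start > 0 and steps[start] - steps[start - 1] == gap:
        start -= 1
    return list(steps[start:])


def dmd_fit(snapshots, rank_tol=1e-8):
    """Fit Y_{t+1} ~ A Y_t from a [d, n] snapshot matrix; return (eigenvalues, modes)."""
    X, Y = snapshots[:, :-1], snapshots[:, 1:]
    U, s, Vh = np.linalg.svd(X, full_matrices=False)  # economy SVD
    r = int(np.sum(s > rank_tol * s[0]))              # spectrum-based rank truncation
    U, s, Vh = U[:, :r], s[:r], Vh[:r]
    A_tilde = U.conj().T @ Y @ Vh.conj().T / s        # reduced propagator, [r, r]
    lam, W = np.linalg.eig(A_tilde)
    return lam, U @ W                                 # eigenvalues and modes Phi, [d, r]


def dmd_forecast(lam, Phi, y_last, k):
    """Y_{t+k} ~ Phi (lambda^k * b) with b = pinv(Phi) Y_t; k may be fractional."""
    b = np.linalg.pinv(Phi) @ y_last
    return (Phi @ (lam.astype(complex) ** k * b)).real


# Synthetic check: a damped rotation plus a decaying mode, embedded in d = 5.
theta, rho = 0.3, 0.95
A_true = np.array([
    [rho * np.cos(theta), -rho * np.sin(theta), 0.0],
    [rho * np.sin(theta), rho * np.cos(theta), 0.0],
    [0.0, 0.0, 0.8],
])
M = np.random.default_rng(0).standard_normal((5, 3))
states = [np.array([1.0, 0.5, -0.3])]
for _ in range(8):
    states.append(A_true @ states[-1])
traj = M @ np.array(states).T                 # [5, 9]

snapshots = traj[:, :6]                       # history of 6 snapshots
lam, Phi = dmd_fit(snapshots)
pred = dmd_forecast(lam, Phi, snapshots[:, -1], k=3)
rel_err = np.linalg.norm(pred - traj[:, 8]) / np.linalg.norm(traj[:, 8])
print(uniform_suffix([0, 2, 5, 8, 11]), len(lam), rel_err)
```

Because the fit is horizon-free, only the lam ** k factor depends on k, so one fit can serve every skipped step in a window.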

Below the 4-snapshot identifiability floor (a real-valued trajectory spends two real
degrees of freedom per complex pole, so one oscillatory mode already needs three snapshot
pairs), or whenever the fit is degenerate/non-finite, it transparently falls back to the
TaylorSeer expansion it also maintains; warm-up behaves exactly like the existing
calibrator.
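
The fallback rule can be pictured as a small guard around the forecast. The sketch below is a hedged illustration, not the PR's code: forecast_with_fallback, dmd_fn and taylor_fn are hypothetical names, and the real calibrator may detect degeneracy differently.

```python
import numpy as np

MIN_SNAPSHOTS = 4  # identifiability floor quoted in the description


def forecast_with_fallback(history, k, dmd_fn, taylor_fn):
    """Serve dmd_fn when the history is long enough and its output is finite,
    otherwise serve taylor_fn. Returns (forecast, name_of_basis_used)."""
    if len(history) >= MIN_SNAPSHOTS:
        try:
            out = dmd_fn(history, k)
            if np.all(np.isfinite(out)):
                return out, "dmd"
        except np.linalg.LinAlgError:
            pass  # degenerate fit: fall through to the polynomial path
    return taylor_fn(history, k), "taylor"


# Stub forecasters, for illustration only.
good_dmd = lambda history, k: np.ones(2)
bad_dmd = lambda history, k: np.array([np.inf, 0.0])
taylor = lambda history, k: np.zeros(2)

print(forecast_with_fallback([0] * 3, 1, good_dmd, taylor)[1])
print(forecast_with_fallback([0] * 4, 1, good_dmd, taylor)[1])
print(forecast_with_fallback([0] * 4, 1, bad_dmd, taylor)[1])
```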

Changes

  • caching/cache_contexts/calibrators/dmd.py: new DMDCalibrator + DMDState,
    mirroring the TaylorSeerCalibrator / TaylorSeerState API (mark_step_begin,
    update, approximate, step, reset_cache; per-stream states keyed by name).
  • caching/cache_contexts/calibrators/__init__.py: new DMDCalibratorConfig
    dataclass (dmd_history, dmd_rank, dmd_ridge), registered in the Calibrator
    factory and _supported_calibrators.
  • Export chain: DMDCalibratorConfig re-exported from cache_contexts, caching, and
    the top-level cache_dit namespace, alongside TaylorSeerCalibratorConfig.

No new dependencies (torch-only), no behavior change unless calibrator_type="dmd" is
selected.

Validation so far

  • Unit-level: on synthetic trajectories from the exponential solution class, the
    calibrator's post-warm-up forecast error is ~5e-8 relative L2 where the order-1 Taylor
    expansion sits at ~0.4-1.9 (same snapshots, same schedule).

  • Method-level (reference implementation, hicache-plus-plus), flow-matching 3D
    generators: on Hunyuan3D-2.1 (Toys4K, F-score@0.05 vs uncached baseline 0.911) the
    deployed polynomial arm decays to 0.88 / 0.74 / 0.38 at cache intervals 3 / 5 / 6, while
    the exponential basis holds 0.85 / 0.86 / 0.62. On Hunyuan3D-2-mini it is exactly
    lossless at interval 5, and on SAM3D it is geometry-lossless (F1 = 1.000) through
    interval 6 at 1.56x.

  • DiT-class denoising, reported for honesty (the regime where you should NOT pick this
    calibrator). DiT-XL/2 ImageNet-256, 250-step DDPM, cfg 1.5, paired-noise FID-10k drift
    vs the uncached baseline (lossless cache reads ~0; full ledger and protocol:
    hicache-plus-plus/benchmarks/dit_imagenet/RESULTS_DIT.md):

    basis                        i4             i6             i8
    TaylorSeer (corrected, +k)   2.27 (3.81x)   -              -
    Hermite (corrected, +k)      3.54 (3.79x)   6.46 (5.46x)   10.74 (7.21x)
    exponential (DMD)            18.02          54.24          100.65

    Holdout selection does not rescue DiT either: in our pre-registered A/B both holdout
    modes of the reference selector served the exponential arm (drift 18.11), because the
    richer exponential fit backcasts the snapshot history better even where it
    extrapolates forward worse. Hence the per-family default recommendation above.

  • Remaining before marking ready for review: a FLUX.1-dev A/B with this exact
    calibrator.

Scoping summary for reviewers: this adds an opt-in basis that wins on flow-matching
generators and is reported, with numbers, as losing on DiT-class denoising. The
per-interval tables are included so the trade-off is judged directly, not from a single
operating point.

@DefTruth

Member

@Archerkattri Hi, thanks for your contribution! Can you show some visual comparisons with and without the DMD calibrator?

…verflow

- Cache the horizon-free DMD eigendecomposition per snapshot window
  (DMDState._fit / _fit_key, invalidated when a new snapshot arrives). Skip
  steps now reuse one SVD/eig instead of recomputing it every step, which is
  what restores the intended cache speedup at large fresh intervals.
- Fit DMD independently per batch item (axis 0). Flattening folded the batch
  into one state, so a prompt's forecast depended on the other prompts in the
  batch; per-item fitting keeps them independent like the Taylor path.
- Move the finite check after the output-dtype cast: a finite float64 forecast
  can still overflow to inf in fp16, so the cast result is what gets guarded.
- yapf / docformatter clean (fixes the failing pre-commit CI check).
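
The per-window caching in the first bullet can be sketched as follows. This is an assumption-laden illustration rather than the DMDState code: the FitCache class, its version counter and num_fits are hypothetical; only the _fit / _fit_key naming is taken from the commit message.

```python
class FitCache:
    """Refit only when a new snapshot has arrived since the last fit."""

    def __init__(self, fit_fn):
        self.fit_fn = fit_fn
        self.snapshots = []
        self.version = 0       # bumped on every recorded snapshot
        self._fit = None
        self._fit_key = None   # snapshot version the cached fit belongs to
        self.num_fits = 0

    def update(self, snapshot):
        """Record a snapshot at a full-compute step; this invalidates the cached fit."""
        self.snapshots.append(snapshot)
        self.version += 1

    def fit(self):
        """Return the cached fit, recomputing only if the window changed."""
        if self._fit_key != self.version:
            self._fit = self.fit_fn(self.snapshots)
            self._fit_key = self.version
            self.num_fits += 1
        return self._fit


cache = FitCache(fit_fn=lambda snaps: sum(snaps) / len(snaps))  # stand-in for SVD/eig
for value in (1.0, 2.0, 3.0, 4.0):
    cache.update(value)
for _ in range(5):            # five skipped steps inside one window
    cache.fit()
fits_in_window = cache.num_fits
cache.update(5.0)             # a new snapshot opens a new window
cache.fit()
print(fits_in_window, cache.num_fits)
```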
@Archerkattri

Contributor Author

@DefTruth Thanks! e1067af has the visual cases you asked for plus the review fixes.

With / without the DMD calibrator (FLUX.1-dev, 50 steps, seed 42):

[Image: with and without DMD]

DMD vs the existing TaylorSeer calibrator (same DBCache, matched ~3.2x), which is the real case for the new basis: the exponential forecast holds where the polynomial breaks up.

[Image: DMD vs TaylorSeer]

calibrator, ~3.2x   LPIPS (vs uncached)   PSNR   CLIP
TaylorSeer          0.78                  11.8   0.27
DMD                 0.38                  19.8   0.32

12 DrawBench prompts; LPIPS/PSNR are vs each method's own uncached image, CLIP is prompt alignment.

Review fixes (all three bot comments, in e1067af):

  • Cache the DMD fit across skip steps: the SVD/eig is fitted once per snapshot window in DMDState (_fit / _fit_key) and reused on every skipped step; only the lambda**k horizon re-advances. Verified one fit per window across N skips.
  • fp16 overflow after the cast: the finite check is now post-cast (a finite float64 forecast can still overflow to inf in fp16).
  • Batch independence: DMD fits per batch item now, so one prompt's forecast no longer depends on the others in the batch.
  • pre-commit (yapf / docformatter) is green.
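
The fp16 point is easy to reproduce in isolation. The sketch below uses NumPy as a stand-in for torch, and cast_and_guard is a hypothetical helper, not a function from the PR. It shows a forecast that is finite in float64 but not after the cast, which is why the check belongs after it.

```python
import numpy as np


def cast_and_guard(forecast, out_dtype, fallback):
    """Cast to the output dtype first, then check finiteness on the cast result."""
    with np.errstate(over="ignore"):   # the overflow is expected in this demo
        out = forecast.astype(out_dtype)
    if not np.all(np.isfinite(out)):
        return fallback
    return out


forecast64 = np.array([1.0, 1.0e5], dtype=np.float64)  # 1e5 exceeds the fp16 maximum of 65504
fallback = np.zeros(2, dtype=np.float16)

pre_cast_finite = bool(np.all(np.isfinite(forecast64)))
result = cast_and_guard(forecast64, np.float16, fallback)
print(pre_cast_finite, result is fallback)
```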

For full disclosure, since the exponential basis invites it: I also benchmarked against Spectrum (CVPR'26, a global error-bounded Chebyshev fit) on FLUX, and it wins there (3.46x, LPIPS 0.072). Its global fit beats local forecasting of any basis on this image model, and DMD's reported wins are on flow-matching 3D generators. So the honest pitch is that, on this FLUX comparison, DMD is a better drop-in basis than the TaylorSeer calibrator already in cache-dit (the DiT-class numbers in the PR description still favor TaylorSeer), not that it is SOTA on FLUX. Full numbers + scripts: RESULTS.md. Happy to mirror the three hooks to the other model ports.

@DefTruth DefTruth changed the title Add a Dynamic Mode Decomposition (Prony) exponential-basis calibrator (calibrator_type="dmd") feat: support dmd (dynamic mode decomposition) calibrator Jun 14, 2026
@DefTruth DefTruth changed the title feat: support dmd (dynamic mode decomposition) calibrator feat: support dynamic mode decomposition calibrator Jun 14, 2026
@DefTruth

Member

@Archerkattri This change LGTM. Please also add the 'dmd' calibrator to the example CLI and make sure it works as expected when enabled via --dmd.

def maybe_apply_optimization(

For example:

python -m cache_dit.generate flux # no cache
python -m cache_dit.generate flux --cache # DBCache
python -m cache_dit.generate flux --cache --taylorseer # DBCache + TaylorSeer
python -m cache_dit.generate flux --cache --dmd # DBCache + DMD

@DefTruth DefTruth mentioned this pull request Jun 14, 2026
11 tasks
Per review: enable the DMD calibrator from `python -m cache_dit.generate`
exactly like --taylorseer. --dmd selects DMDCalibratorConfig (history via
--dmd-history, default 6); --taylorseer is unchanged.

Verified end-to-end:
  python -m cache_dit.generate flux --cache --dmd --cpu-offload
generates with the DMD calibrator active (optimization tag ...DMDH6_S12, image saved).
@Archerkattri

Contributor Author

Done in a0026dc. Added --dmd (and --dmd-history, default 6) to python -m cache_dit.generate, wired exactly like --taylorseer (it selects DMDCalibratorConfig; --taylorseer is unchanged, and --dmd takes precedence if both are passed).

Verified end-to-end on FLUX.1-dev, matching your example:

python -m cache_dit.generate flux                       # no cache
python -m cache_dit.generate flux --cache               # DBCache
python -m cache_dit.generate flux --cache --taylorseer  # DBCache + TaylorSeer
python -m cache_dit.generate flux --cache --dmd         # DBCache + DMD

The --dmd run generates correctly with the DMD calibrator active (optimization tag ...DMDH6_S12, image saved). pre-commit is green.
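
For readers who want to picture the flag precedence described above, here is a minimal argparse sketch. It is not the code in cache_dit.generate: build_parser and select_calibrator are hypothetical, and only the flag names and the default history of 6 come from this thread.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags named in this thread."""
    parser = argparse.ArgumentParser(prog="generate-sketch")
    parser.add_argument("--cache", action="store_true")
    parser.add_argument("--taylorseer", action="store_true")
    parser.add_argument("--dmd", action="store_true")
    parser.add_argument("--dmd-history", type=int, default=6)
    return parser


def select_calibrator(args):
    """Return (name, options) for the chosen calibrator; --dmd wins if both are set."""
    if args.dmd:
        return "dmd", {"dmd_history": args.dmd_history}
    if args.taylorseer:
        return "taylorseer", {}
    return None, {}


parser = build_parser()
both = select_calibrator(parser.parse_args(["--cache", "--taylorseer", "--dmd"]))
only_taylor = select_calibrator(parser.parse_args(["--cache", "--taylorseer"]))
print(both, only_taylor)
```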

@DefTruth DefTruth left a comment

Member


LGTM~ Thanks for your contribution!

@DefTruth DefTruth merged commit 6cd559c into vipshop:main Jun 14, 2026
4 checks passed
@Archerkattri Archerkattri deleted the dmd-calibrator branch June 14, 2026 15:43