feat: support dynamic mode decomposition calibrator#1053
Conversation
… (`calibrator_type="dmd"`)
|
@Archerkattri Hi, thanks for your contribution! Can you show some visualize cases w/ or w/o dmd calibrator? |
…verflow - Cache the horizon-free DMD eigendecomposition per snapshot window (DMDState._fit / _fit_key, invalidated when a new snapshot arrives). Skip steps now reuse one SVD/eig instead of recomputing it every step, which is what restores the intended cache speedup at large fresh intervals. - Fit DMD independently per batch item (axis 0). Flattening folded the batch into one state, so a prompt's forecast depended on the other prompts in the batch; per-item fitting keeps them independent like the Taylor path. - Move the finite check after the output-dtype cast: a finite float64 forecast can still overflow to inf in fp16, so the cast result is what gets guarded. - yapf / docformatter clean (fixes the failing pre-commit CI check).
|
@DefTruth Thanks! With / without the DMD calibrator (FLUX.1-dev, 50 steps, seed 42): DMD vs the existing TaylorSeer calibrator (same DBCache, matched ~3.2x), which is the real case for the new basis: the exponential forecast holds where the polynomial breaks up.
12 DrawBench prompts; LPIPS/PSNR are vs each method's own uncached image, CLIP is prompt alignment. Review fixes (all three bot comments, in
For full disclosure, since the exponential basis invites it: I also benchmarked against Spectrum (CVPR'26, a global error-bounded Chebyshev fit) on FLUX, and it wins there (3.46x, LPIPS 0.072). Its global fit beats local forecasting of any basis on this image model, and DMD's reported wins are on flow-matching 3D generators. So the honest pitch is that DMD is a strictly better drop-in basis than the TaylorSeer calibrator already in cache-dit, not that it is SOTA on FLUX. Full numbers + scripts: RESULTS.md. Happy to mirror the three hooks to the other model ports. |
calibrator_type="dmd")|
@Archerkattri This changes is LGTM, please also add 'dmd' calibrator into example CLI and make sure it can work as expected while enable it by cache-dit/src/cache_dit/_utils/utils.py Line 2198 in 33eacf7 For example: python -m cache_dit.generate flux # no cache
python -m cache_dit.generate flux --cache # DBCache
python -m cache_dit.generate flux --cache --taylorseer # DBCache + Taylorseer
python -m cache_dit.generate flux --cache --dmd # DBCache + DMD |
Per review: enable the DMD calibrator from `python -m cache_dit.generate` exactly like --taylorseer. --dmd selects DMDCalibratorConfig (history via --dmd-history, default 6); --taylorseer is unchanged. Verified end-to-end: python -m cache_dit.generate flux --cache --dmd --cpu-offload generates with the DMD calibrator active (optimization tag ...DMDH6_S12, image saved).
|
Done in Verified end-to-end on FLUX.1-dev, matching your example: python -m cache_dit.generate flux # no cache
python -m cache_dit.generate flux --cache # DBCache
python -m cache_dit.generate flux --cache --taylorseer # DBCache + TaylorSeer
python -m cache_dit.generate flux --cache --dmd # DBCache + DMDThe |
DefTruth
left a comment
There was a problem hiding this comment.
LGTM~ Thanks for your contribution!


Add a Dynamic Mode Decomposition (Prony) exponential-basis calibrator (
calibrator_type="dmd")Motivation
cache-dit's calibrators currently forecast cached hidden states / residuals with the
TaylorSeer polynomial expansion. This PR adds a second, drop-in calibrator backend
with an exponential forecast basis: Dynamic Mode Decomposition (Schmid 2010), the
SVD-regularised multivariate generalisation of Prony's method (1795). (To avoid the
common collision: this is not Distribution Matching Distillation.)
Honest, family-conditional pitch. We benchmarked both bases across two diffusion
families, and no single basis wins:
grows with the cache interval (numbers below). This is the regime this calibrator is
for.
the sign-correct TaylorSeer polynomial is near-lossless (paired-noise FID drift 2.27 vs
the uncached baseline at 3.81x), while the exponential basis drifts 1.7-1.9x more than
even a near-reuse Hermite control at every interval tested. We therefore do not
claim DMD as a better default; it is an additional basis for the workloads where it
wins, default behavior unchanged.
The mechanism behind the 3D win: across denoising steps each cached feature stream
evolves under a slowly varying, near-linear operator; the exact solution class of a
linear feature-ODE is a sum of damped/oscillatory exponentials, and the exponential basis
is exact on that class where any polynomial diverges under extrapolation. Whether a given
model family's stream is in that class at the served horizons is empirical, hence the
per-family numbers below.
It plugs into the existing
CalibratorConfigpattern, exactly likeTaylorSeerCalibratorConfig:The reference implementation
(hicache-plus-plus) also ships a
training-free holdout selector (
backend="auto") that backcasts a held-out snapshot withboth bases per compute window and serves the winner. We benchmarked it on this exact
split and report the honest verdict: it solves intra-run regime switches, but it does
not recover the family-level winner on DiT (both holdout modes served the exponential
arm there, FID drift 18.11 vs the corrected polynomial's 3.54), so the recommended way
to consume this calibrator is a per-family default (DMD for flow-matching
generators; TaylorSeer for DiT-class denoising), not a selector. This PR keeps the
surface minimal: one new basis.
What the calibrator does (math summary)
At each full-compute step the calibrator records the computed tensor as a snapshot
(per named stream, like the TaylorSeer states). At an approximation step it:
propagator advances exactly one snapshot-spacing per application, and DBCache's dynamic
decisions can make the compute cadence non-uniform; mixed spacings would corrupt the
fit);
AwithY_{t+1} ~ A Y_tvia one economy SVD of the[d, n]snapshot matrix (n= history <= 6, so this is cheap relative to a forwardpass) with spectrum-based rank truncation (this is what rejects noise);
arrives) and forecasts the (fractional) horizon
kby eigenvalue powers:Y_{t+k} ~ Phi (lambda^k * b),b = pinv(Phi) Y_t.Below the 4-snapshot identifiability floor (a real-valued trajectory spends two real
degrees of freedom per complex pole, so one oscillatory mode already needs three snapshot
pairs), or whenever the fit is degenerate/non-finite, it transparently falls back to the
TaylorSeer expansion it also maintains; warm-up behaves exactly like the existing
calibrator.
Changes
caching/cache_contexts/calibrators/dmd.py: newDMDCalibrator+DMDState,mirroring the
TaylorSeerCalibrator/TaylorSeerStateAPI (mark_step_begin,update,approximate,step,reset_cache; per-stream states keyed by name).caching/cache_contexts/calibrators/__init__.py: newDMDCalibratorConfigdataclass (
dmd_history,dmd_rank,dmd_ridge), registered in theCalibratorfactory and
_supported_calibrators.DMDCalibratorConfigre-exported fromcache_contexts,caching, andthe top-level
cache_ditnamespace, alongsideTaylorSeerCalibratorConfig.No new dependencies (torch-only), no behavior change unless
calibrator_type="dmd"isselected.
Validation so far
Unit-level: on synthetic trajectories from the exponential solution class, the
calibrator's post-warm-up forecast error is ~5e-8 relative L2 where the order-1 Taylor
expansion sits at ~0.4-1.9 (same snapshots, same schedule).
Method-level (reference implementation,
hicache-plus-plus), flow-matching
3D generators: on Hunyuan3D-2.1 (Toys4K, F-score@0.05 vs uncached baseline 0.911) the
deployed polynomial arm decays 0.88 / 0.74 / 0.38 at cache interval 3 / 5 / 6 while the
exponential basis holds 0.85 / 0.86 / 0.62; exactly lossless at interval 5 on
Hunyuan3D-2-mini; on SAM3D geometry-lossless (F1 = 1.000) through interval 6 at 1.56x.
DiT-class denoising, reported for honesty (the regime where you should NOT pick this
calibrator). DiT-XL/2 ImageNet-256, 250-step DDPM, cfg 1.5, paired-noise FID-10k drift
vs the uncached baseline (lossless cache reads ~0; full ledger and protocol:
hicache-plus-plus/benchmarks/dit_imagenet/RESULTS_DIT.md):Holdout selection does not rescue DiT either: in our pre-registered A/B both holdout
modes of the reference selector served the exponential arm (drift 18.11), because the
richer exponential fit backcasts the snapshot history better even where it
extrapolates forward worse. Hence the per-family default recommendation above.
Remaining before marking ready for review: a FLUX.1-dev A/B with this exact
calibrator.
Scoping summary for reviewers: this adds an opt-in basis that wins on flow-matching
generators and is reported, with numbers, as losing on DiT-class denoising. The
per-interval tables are included so the trade-off is judged directly, not from a single
operating point.