perf: fuse boundary predict_cov and parallelize backward smoother #25

Merged: YuminosukeSato merged 7 commits into main from perf/predict-cov-fusion-rayon-backward on Mar 24, 2026
@YuminosukeSato
Owner

Summary

Optimize the Kalman filter for large seasonal models (s=168, T=1200). Three key changes:

  • Fuse predict_state_covariance_flat boundary case from 4-pass (TP, TP*T', symmetrize, add Q) to 2-pass closed-form using the sparse structure of the seasonal transition matrix T. Reduces forward-filter memory traffic by ~58%.
  • Split DK backward smoother into Phase 1 (sequential r recursion) and Phase 2 (Rayon parallel P*r correction via par_chunks_mut). Phase 2 is embarrassingly parallel since each smoothed state depends only on a_pred[i], P_pred[i], and r_store[i].
  • Flatten alpha_plus and smoother output from Vec<Vec<f64>> to contiguous Vec<f64> buffers for better cache locality.

Benchmark

| Config | Before (v1.3.1) | After |
|---|---|---|
| s=168, T=1200, niter=1000, nwarmup=500 | ~92s | ~85s (8% improvement) |
| s=168, T=1200, niter=1000, nwarmup=0 | ~61s | ~58s (5% improvement) |

Note: The forward filter is memory-bound at this scale (p_pred_flat = 900 × 168² × 8B ≈ 203MB, exceeding L3 cache). The FLOPs reduction from 4→2 pass fusion has proportionally less effect under memory bandwidth saturation. Further optimization would require a p_pred recomputation strategy or blocked access pattern.
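
The footprint quoted in the note can be checked with a short calculation. This is a minimal sketch: `cov_buffer_bytes` is a hypothetical helper, not a function from the crate, and the 900-step count is taken from the note as written.

```rust
/// Bytes needed to store `steps` dense s×s covariance matrices of f64.
/// Hypothetical helper for illustration only.
fn cov_buffer_bytes(steps: usize, s: usize) -> usize {
    steps * s * s * std::mem::size_of::<f64>()
}
```

`cov_buffer_bytes(900, 168)` evaluates to 203,212,800 bytes, which matches the ≈ 203MB figure above.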

Changes

src/kalman.rs

  • predict_state_covariance_flat: Replace 4-pass boundary logic with closed-form 2-pass using row_sums precomputation and lower-triangle + mirror write
  • Add debug_assert for input P symmetry verification
  • Backward pass: 2-phase split (Phase 1: sequential r_store, Phase 2: par_chunks_mut parallel)
  • seasonal_kalman_smoother: Return flat Vec<f64> (t×s) instead of Vec<Vec<f64>>
  • alpha_plus: Flatten to contiguous buffer with split_at_mut for borrow safety
  • Add use rayon::prelude::*;

Tests (12 new)

  • 9 PR-D tests: fused predict_cov vs naive reference (s=2,4,12,168 boundary, non-boundary, symmetry, diagonal positivity, identity P, zero P)
  • 3 PR-E tests: Rayon backward vs sequential (DK vs RTS cross-validation at s=4/s=168, finite check)

Test plan

  • cargo test -- -q — 91 Rust tests pass (79 existing + 12 new)
  • .venv/bin/pytest tests/ -v — 236 Python integration tests pass
  • Benchmark: python z-ai/bench_large_seasonal.py
  • Verify on CI

Release

Version bump: 1.3.1 → 1.4.0 (minor: internal API change to seasonal_kalman_smoother return type + performance improvements)

Add flat_from_nested/nested_from_flat helpers and 9 tests comparing
predict_state_covariance_flat against the naive reference for:
- s=2,4,12,168 boundary, s=12 non-boundary
- symmetry (s=52), diagonal positivity (s=168)
- identity P (s=7), zero P (s=4)
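
The two helper names come from this commit, but their signatures are not shown here. A minimal sketch of what such row-major conversion helpers could look like, assuming a slice-of-rows input and an explicit column count:

```rust
/// Flatten a nested matrix into one contiguous row-major buffer.
/// Signature is an assumption; the crate's helper may differ.
fn flat_from_nested(nested: &[Vec<f64>]) -> Vec<f64> {
    nested.iter().flat_map(|row| row.iter().copied()).collect()
}

/// Rebuild a nested matrix with `cols` columns from a row-major buffer.
/// Assumes `cols > 0` and that `flat.len()` is a multiple of `cols`.
fn nested_from_flat(flat: &[f64], cols: usize) -> Vec<Vec<f64>> {
    flat.chunks(cols).map(|row| row.to_vec()).collect()
}
```

A round trip through both helpers should return the original nested matrix, which makes them convenient for comparing the flat implementation against a nested reference.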

Replace the 4-pass boundary predict_state_covariance_flat with a
closed-form 2-pass implementation that directly computes (T*P*T'+Q)
using the sparse structure of the seasonal transition matrix T.

Pass 0: precompute row_sums[l] = Σ P[l][k] for k=1..s-1
Pass 1: write lower triangle + mirror + Q diagonal

Benefits:
- Memory traffic: ~6s² → ~2.5s² operations (-58%)
- Eliminates symmetrize pass (mirror writes guarantee bit-exact symmetry)
- Adds debug_assert for input P symmetry verification
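
The two passes above can be sketched as follows. This is an illustration under stated assumptions, not the code in src/kalman.rs: it assumes T has -1 in the first s-1 columns of its first row, 0 in the last column, and ones on the subdiagonal, that Q is diagonal, that P is symmetric, and that s >= 2. The function names and signatures are hypothetical.

```rust
/// Naive reference: dense T * P * T' + diag(q) on row-major s×s buffers.
fn predict_cov_naive(p: &[f64], q_diag: &[f64], s: usize) -> Vec<f64> {
    // Build the assumed seasonal transition matrix T.
    let mut t = vec![0.0; s * s];
    for k in 0..s - 1 {
        t[k] = -1.0; // first row: -1 in columns 0..s-2
    }
    for i in 1..s {
        t[i * s + (i - 1)] = 1.0; // subdiagonal shift
    }
    // tp = T * P
    let mut tp = vec![0.0; s * s];
    for i in 0..s {
        for j in 0..s {
            tp[i * s + j] = (0..s).map(|k| t[i * s + k] * p[k * s + j]).sum::<f64>();
        }
    }
    // out = tp * T' + diag(q)
    let mut out = vec![0.0; s * s];
    for i in 0..s {
        for j in 0..s {
            out[i * s + j] = (0..s).map(|k| tp[i * s + k] * t[j * s + k]).sum::<f64>();
        }
        out[i * s + i] += q_diag[i];
    }
    out
}

/// Fused closed form: row sums, then lower triangle + mirror + Q diagonal.
fn predict_cov_fused(p: &[f64], q_diag: &[f64], s: usize) -> Vec<f64> {
    // Pass 0: row_sums[l] = sum of the first s-1 entries of row l of P.
    let row_sums: Vec<f64> = (0..s - 1)
        .map(|l| p[l * s..l * s + s - 1].iter().sum::<f64>())
        .collect();
    let mut out = vec![0.0; s * s];
    // Pass 1: the (0,0) entry is the sum of all row sums, plus Q.
    out[0] = row_sums.iter().sum::<f64>() + q_diag[0];
    for i in 1..s {
        // First column and its mirror in the first row.
        let v = -row_sums[i - 1];
        out[i * s] = v;
        out[i] = v;
        // Remaining block is P shifted down and right by one.
        for j in 1..=i {
            let v = p[(i - 1) * s + (j - 1)];
            out[i * s + j] = v;
            out[j * s + i] = v; // mirror write keeps the result exactly symmetric
        }
        out[i * s + i] += q_diag[i];
    }
    out
}
```

Because every off-diagonal value is written to both (i, j) and (j, i) from the same variable, no separate symmetrize pass is needed in this sketch.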

Add DK vs RTS comparison tests for s4/t100 and s168/t1200, plus
finite-value assertion for s168/t1200 smoothed output. These serve
as regression guards before parallelizing the backward pass Phase 2.

Split the DK backward smoother into two phases:

Phase 1 (sequential): r_t recursion with dependency chain, storing
all r vectors in r_store[t*s] flat buffer (~1.6 MB for s=168, T=1200).

Phase 2 (parallel): smooth[i] = a_pred[i] + P_pred[i] * r_store[i]
using Rayon into_par_iter, as each time step is independent.

Inner summation (0..s).map().sum() remains sequential within each
parallel task, guaranteeing bit-deterministic floating-point results.
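
Phase 2 as described above can be sketched like this. To keep the example free of external crates it substitutes `std::thread::scope` for Rayon, so it is not the PR's implementation; the function name `smooth_phase2`, its signature, and the flat buffer layouts (`a_pred` and `r_store` as t×s, `p_pred` as t×s×s, all row-major) are assumptions.

```rust
use std::thread;

/// smooth[i] = a_pred[i] + P_pred[i] * r_store[i] for every time step i,
/// with time steps split into contiguous blocks, one block per thread.
fn smooth_phase2(
    a_pred: &[f64],
    p_pred: &[f64],
    r_store: &[f64],
    s: usize,
    n_threads: usize,
) -> Vec<f64> {
    assert!(s > 0 && n_threads > 0);
    let t_len = a_pred.len() / s;
    let mut smooth = vec![0.0; t_len * s];
    // Ceiling division so every time step lands in some block.
    let steps_per_block = ((t_len + n_threads - 1) / n_threads).max(1);

    thread::scope(|scope| {
        for (block, out_block) in smooth.chunks_mut(steps_per_block * s).enumerate() {
            scope.spawn(move || {
                let first = block * steps_per_block;
                for (k, out) in out_block.chunks_mut(s).enumerate() {
                    let i = first + k;
                    let a = &a_pred[i * s..(i + 1) * s];
                    let p = &p_pred[i * s * s..(i + 1) * s * s];
                    let r = &r_store[i * s..(i + 1) * s];
                    for row in 0..s {
                        // The inner sum stays sequential, so the result does
                        // not depend on how time steps are split across threads.
                        let pr: f64 = (0..s).map(|col| p[row * s + col] * r[col]).sum();
                        out[row] = a[row] + pr;
                    }
                }
            });
        }
    });
    smooth
}
```

Each thread writes to a disjoint mutable chunk of the output and only reads the shared inputs, which is the same ownership pattern that a chunked parallel iterator over the output buffer relies on.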

- alpha_plus: Vec<Vec<f64>> → Vec<f64> (flat t*s buffer)
  Uses split_at_mut for borrow-safe copy_from_slice
- seasonal_kalman_smoother: returns Vec<f64> instead of Vec<Vec<f64>>
- Backward Phase 2: par_chunks_mut for direct flat output
- Add seasonal_kalman_smoother_nested test wrapper for compatibility

Eliminates per-timestep heap allocations in the hot loop.
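
The split_at_mut pattern mentioned above can be illustrated with a small sketch. The commit does not say which slices are copied, so this assumes one plausible case: copying time step t into time step t+1 inside the same flat t×s buffer. `copy_step_forward` is a hypothetical name.

```rust
/// Copy the s values of time step `t` into time step `t + 1` within one
/// flat buffer. Splitting first yields two non-overlapping slices, so the
/// borrow checker accepts reading one while writing the other.
fn copy_step_forward(flat: &mut [f64], s: usize, t: usize) {
    assert!((t + 2) * s <= flat.len());
    let (head, tail) = flat.split_at_mut((t + 1) * s);
    // head ends with step t; tail begins with step t + 1.
    tail[..s].copy_from_slice(&head[t * s..]);
}
```

With nested vectors the same copy would need a clone or index juggling; on a flat buffer the split makes the disjointness explicit and avoids any temporary allocation.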

Prepare for the Show HN submission. Add PyPI/Python/License badges at the top of the README,
and add a plot image and a real summary() output example at the end of the Quick Start section.
@YuminosukeSato YuminosukeSato merged commit 3be3f19 into main Mar 24, 2026
14 checks passed
@YuminosukeSato YuminosukeSato deleted the perf/predict-cov-fusion-rayon-backward branch March 24, 2026 02:13