perf: fuse boundary predict_cov and parallelize backward smoother #25
Merged
Add flat_from_nested/nested_from_flat helpers and 9 tests comparing predict_state_covariance_flat against the naive reference for:
- s=2,4,12,168 boundary, s=12 non-boundary
- symmetry (s=52), diagonal positivity (s=168)
- identity P (s=7), zero P (s=4)
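As a rough illustration of what helpers like these could look like, here is a minimal sketch. The signatures are assumptions (the commit names the helpers but does not show them), and row-major layout is assumed for the flat buffer.

```rust
/// Flatten a nested row-major matrix into one contiguous buffer.
/// Hypothetical signature; the real helper may differ.
fn flat_from_nested(nested: &[Vec<f64>]) -> Vec<f64> {
    nested.iter().flat_map(|row| row.iter().copied()).collect()
}

/// Rebuild rows of length `n` from a flat row-major buffer.
/// Hypothetical signature; the real helper may differ.
fn nested_from_flat(flat: &[f64], n: usize) -> Vec<Vec<f64>> {
    assert!(n > 0 && flat.len() % n == 0, "length must be a multiple of n");
    flat.chunks(n).map(|row| row.to_vec()).collect()
}
```

A round trip through both helpers should return the original matrix, which is what makes them convenient for comparing a flat implementation against a nested reference in tests.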
Replace the 4-pass boundary predict_state_covariance_flat with a closed-form 2-pass implementation that directly computes (T*P*T'+Q) using the sparse structure of the seasonal transition matrix T.

Pass 0: precompute row_sums[l] = Σ P[l][k] for k=1..s-1
Pass 1: write lower triangle + mirror + Q diagonal

Benefits:
- Memory traffic: ~6s² → ~2.5s² operations (-58%)
- Eliminates symmetrize pass (mirror writes guarantee bit-exact symmetry)
- Adds debug_assert for input P symmetry verification
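The two passes can be sketched as follows. This is not the code from src/kalman.rs: it assumes an n×n state with one level term at index 0 and seasonal dummies at indices 1..n, so that row 0 of T keeps the level, row 1 is minus the sum of the seasonal states, and rows 2.. shift the previous state down. The function names and the diagonal-only Q are likewise assumptions. A dense reference is included for comparison.

```rust
/// Dense reference: builds the assumed T explicitly, then T*P*T' + diag(q).
fn predict_cov_naive(p: &[f64], q_diag: &[f64], n: usize) -> Vec<f64> {
    let mut t = vec![0.0; n * n];
    t[0] = 1.0; // level carries over
    for k in 1..n { t[n + k] = -1.0; } // new seasonal = -(sum of seasonal states)
    for i in 2..n { t[i * n + i - 1] = 1.0; } // shift older seasonal states
    let mut tp = vec![0.0; n * n];
    let mut out = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..n {
            tp[i * n + j] = (0..n).map(|k| t[i * n + k] * p[k * n + j]).sum::<f64>();
        }
    }
    for i in 0..n {
        for j in 0..n {
            out[i * n + j] = (0..n).map(|k| tp[i * n + k] * t[j * n + k]).sum::<f64>();
        }
        out[i * n + i] += q_diag[i];
    }
    out
}

/// Closed-form sketch: pass 0 computes row sums, pass 1 writes the lower
/// triangle, mirrors each entry, and adds Q on the diagonal.
fn predict_cov_closed_form(p: &[f64], q_diag: &[f64], n: usize) -> Vec<f64> {
    debug_assert!((0..n).all(|i| (0..i).all(|j| p[i * n + j] == p[j * n + i])));
    // Pass 0: row_sums[l] = sum of P[l][k] for k = 1..n-1.
    let row_sums: Vec<f64> = (0..n)
        .map(|l| (1..n).map(|k| p[l * n + k]).sum::<f64>())
        .collect();
    let total: f64 = (1..n).map(|l| row_sums[l]).sum();
    // Source index for rows other than row 1: 0 -> 0, i -> i - 1.
    let src = |i: usize| i.saturating_sub(1);
    let mut out = vec![0.0; n * n];
    // Pass 1: lower triangle plus mirror write.
    for i in 0..n {
        for j in 0..=i {
            let v = match (i == 1, j == 1) {
                (true, true) => total,
                (true, false) => -row_sums[src(j)],
                (false, true) => -row_sums[src(i)],
                (false, false) => p[src(i) * n + src(j)],
            };
            out[i * n + j] = v;
            out[j * n + i] = v; // mirror: symmetry holds exactly by construction
        }
        out[i * n + i] += q_diag[i];
    }
    out
}
```

Because every off-diagonal value is written to both `(i, j)` and `(j, i)` from the same `f64`, no separate symmetrize pass is needed in this sketch. The row-sum shortcut relies on the input P being symmetric, which is why the input check matters.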
Add DK vs RTS comparison tests for s4/t100 and s168/t1200, plus finite-value assertion for s168/t1200 smoothed output. These serve as regression guards before parallelizing the backward pass Phase 2.
Split the DK backward smoother into two phases:

Phase 1 (sequential): r_t recursion with dependency chain, storing all r vectors in r_store[t*s] flat buffer (~1.6 MB for s=168, T=1200).
Phase 2 (parallel): smooth[i] = a_pred[i] + P_pred[i] * r_store[i] using Rayon into_par_iter, as each time step is independent.

Inner summation (0..s).map().sum() remains sequential within each parallel task, guaranteeing bit-deterministic floating-point results.
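To illustrate why Phase 2 parallelizes cleanly, here is a minimal sketch that uses `std::thread::scope` as a standard-library stand-in for Rayon so the example has no external dependencies. The function names, buffer layouts (row-major, one s×s block per time step), and the `n_threads` parameter are assumptions, not the PR's actual code.

```rust
use std::thread;

/// One time step: out = a_pred[i] + P_pred[i] * r_store[i].
/// The inner dot product is summed sequentially in a fixed order.
fn smooth_step(out: &mut [f64], i: usize, a_pred: &[f64], p_pred: &[f64], r_store: &[f64], s: usize) {
    let a = &a_pred[i * s..(i + 1) * s];
    let p = &p_pred[i * s * s..(i + 1) * s * s];
    let r = &r_store[i * s..(i + 1) * s];
    for (j, o) in out.iter_mut().enumerate() {
        let dot: f64 = (0..s).map(|k| p[j * s + k] * r[k]).sum();
        *o = a[j] + dot;
    }
}

/// Sequential baseline over a flat t*s output buffer.
fn smooth_phase2_seq(a_pred: &[f64], p_pred: &[f64], r_store: &[f64], s: usize) -> Vec<f64> {
    let mut out = vec![0.0; a_pred.len()];
    for (i, row) in out.chunks_mut(s).enumerate() {
        smooth_step(row, i, a_pred, p_pred, r_store, s);
    }
    out
}

/// Parallel version: disjoint blocks of time steps go to scoped threads.
fn smooth_phase2_par(a_pred: &[f64], p_pred: &[f64], r_store: &[f64], s: usize, n_threads: usize) -> Vec<f64> {
    let t = a_pred.len() / s;
    let n_threads = n_threads.max(1);
    let steps_per_block = ((t + n_threads - 1) / n_threads).max(1);
    let mut out = vec![0.0; t * s];
    thread::scope(|scope| {
        for (b, block) in out.chunks_mut(steps_per_block * s).enumerate() {
            scope.spawn(move || {
                for (off, row) in block.chunks_mut(s).enumerate() {
                    smooth_step(row, b * steps_per_block + off, a_pred, p_pred, r_store, s);
                }
            });
        }
    });
    out
}
```

Each output row is written by exactly one thread and computed with the same summation order as the sequential loop, so the two versions should agree bit for bit regardless of thread count.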
- alpha_plus: Vec<Vec<f64>> → Vec<f64> (flat t*s buffer). Uses split_at_mut for borrow-safe copy_from_slice
- seasonal_kalman_smoother: returns Vec<f64> instead of Vec<Vec<f64>>
- Backward Phase 2: par_chunks_mut for direct flat output
- Add seasonal_kalman_smoother_nested test wrapper for compatibility

Eliminates per-timestep heap allocations in the hot loop.
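The `split_at_mut` pattern mentioned above can be sketched like this. The helper name `carry_forward` and its role (copying the previous time step's row into the current one) are assumptions made for illustration; only the use of `split_at_mut` with `copy_from_slice` on a flat t*s buffer comes from the commit message.

```rust
/// Copy row t-1 into row t of a flat row-major buffer with rows of length s.
/// Splitting at the start of row t yields two non-overlapping slices, so the
/// borrow checker accepts reading one while writing the other.
fn carry_forward(buf: &mut [f64], s: usize, t: usize) {
    assert!(t >= 1 && (t + 1) * s <= buf.len());
    let (head, tail) = buf.split_at_mut(t * s);
    let prev = &head[(t - 1) * s..]; // row t-1 (read-only)
    tail[..s].copy_from_slice(prev); // row t (written in place)
}
```

No temporary `Vec` is created here, which is consistent with the goal of avoiding per-timestep heap allocations.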
Prepare for the Show HN submission. Add PyPI/Python/License badges at the top of the README, and add a plot image and a real summary() output example at the end of the Quick Start section.
Summary
Optimize the Kalman filter for large seasonal models (s=168, T=1200). Key changes:
- Fuse the `predict_state_covariance_flat` boundary case from 4-pass (`T*P`, `T*P*T'`, symmetrize, add Q) to a 2-pass closed form using the sparse structure of the seasonal transition matrix T. Reduces forward-filter memory traffic by ~58%.
- Parallelize Phase 2 of the backward smoother with Rayon (`par_chunks_mut`). Phase 2 is embarrassingly parallel since each smoothed state depends only on `a_pred[i]`, `P_pred[i]`, and `r_store[i]`.
- Flatten `alpha_plus` and smoother output from `Vec<Vec<f64>>` to contiguous `Vec<f64>` buffers for better cache locality.

Benchmark
Note: The forward filter is memory-bound at this scale (p_pred_flat = 900 × 168² × 8B ≈ 203MB, exceeding L3 cache). The FLOPs reduction from 4→2 pass fusion has proportionally less effect under memory bandwidth saturation. Further optimization would require a p_pred recomputation strategy or blocked access pattern.
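The buffer sizes quoted in this PR can be checked with a few lines of arithmetic. The helper name below is hypothetical; the step counts and dimensions are the ones stated in the note and in the commit messages.

```rust
/// Bytes needed for `steps` time steps of `elems_per_step` f64 values each.
fn buffer_bytes(steps: usize, elems_per_step: usize) -> usize {
    steps * elems_per_step * std::mem::size_of::<f64>()
}

/// p_pred_flat: 900 steps of a 168 x 168 covariance matrix.
fn p_pred_flat_bytes() -> usize {
    buffer_bytes(900, 168 * 168) // 203_212_800 bytes, about 203 MB
}

/// r_store: 1200 steps of a length-168 vector.
fn r_store_bytes() -> usize {
    buffer_bytes(1200, 168) // 1_612_800 bytes, about 1.6 MB
}
```

The covariance buffer is about two orders of magnitude larger than `r_store`, which fits the observation that the forward filter, not the smoother's stored vectors, is what exceeds cache.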
Changes
src/kalman.rs
- `predict_state_covariance_flat`: Replace 4-pass boundary logic with closed-form 2-pass using row_sums precomputation and lower-triangle + mirror write
- Add `debug_assert` for input P symmetry verification
- Backward smoother: split into Phase 1 (sequential) and Phase 2 (`par_chunks_mut` parallel)
- `seasonal_kalman_smoother`: Return flat `Vec<f64>` (t×s) instead of `Vec<Vec<f64>>`
- `alpha_plus`: Flatten to contiguous buffer with `split_at_mut` for borrow safety
- Add `use rayon::prelude::*;`

Tests (12 new)
Test plan
- `cargo test -- -q`: 91 Rust tests pass (79 existing + 12 new)
- `.venv/bin/pytest tests/ -v`: 236 Python integration tests pass
- `python z-ai/bench_large_seasonal.py`

Release
Version bump: 1.3.1 → 1.4.0 (minor: internal API change to `seasonal_kalman_smoother` return type + performance improvements)