Feature/demean accelerated #995
Conversation
@schroedk just rebased / merged changes from master in here (new features, plus I moved to using a pixi toml; also, the dev env no longer installs on Windows due to compatibility challenges). As before, you can build the Rust bindings and run the benchmarks via the same commands.
(force-pushed dbf6e71 to 1e633c1)
@s3alfisc do you have the code to run this benchmark?
Yes, it's here: https://github.com/s3alfisc/fixest_benchmarks. I hope I documented the setup well; I think clone + just + the task runner should get you started. The OLS benchmarks for the hard problem are the relevant ones. Note: @grantmcdermott mentioned the other day that there might be a minor issue with the benchmarks (though I don't know what exactly), so best to take them with a grain of salt (I couldn't spot it myself; everything looked OK to me).
Wait, it looks like I didn't push my local changes including the just setup. One sec.
It's in the justfile branch on the remote 😅 https://github.com/s3alfisc/fixest_benchmarks. Requirements: global R and Julia installations, plus just. Then run `just setup` to install all package deps as well as Python (in a local env), and `just bench-ols` for the benchmarks.
Replace explicit SIMD intrinsics from the `wide` crate with unrolled scalar loops that the compiler can auto-vectorize. This simplifies the code, removes a dependency, and makes the code more portable across platforms while still achieving good performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
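As a rough illustration of the pattern this commit describes (the function name and kernel are hypothetical, not the PR's actual code): splitting a reduction across several independent accumulators removes the loop-carried dependency, which is what lets LLVM map the scalar loop onto SIMD lanes without any `wide` intrinsics.

```rust
/// Weighted sum written as an unrolled scalar loop. Four independent
/// accumulators break the dependency chain so the compiler can
/// auto-vectorize; no explicit SIMD intrinsics are needed.
fn weighted_sum(x: &[f64], w: &[f64]) -> f64 {
    assert_eq!(x.len(), w.len());
    let mut acc = [0.0f64; 4];
    let chunks = x.len() / 4;
    for i in 0..chunks {
        let b = i * 4;
        // Each accumulator is independent of the other three.
        acc[0] += x[b] * w[b];
        acc[1] += x[b + 1] * w[b + 1];
        acc[2] += x[b + 2] * w[b + 2];
        acc[3] += x[b + 3] * w[b + 3];
    }
    let mut total = acc[0] + acc[1] + acc[2] + acc[3];
    // Scalar tail for lengths not divisible by 4.
    for i in chunks * 4..x.len() {
        total += x[i] * w[i];
    }
    total
}
```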
- Add benchmarks/demean_benchmark.py for comparing demeaning backends (rust-accelerated, rust-simple, numba, cupy, fixest via rpy2)
- Add benchmarks/bench_demean_r.R for native R fixest benchmarking
- Remove coefficient clamping in Irons-Tuck acceleration to match fixest

Performance results (100K difficult 3FE):
- Rust accelerated: 464ms (1.3x faster than fixest via rpy2)
- Native R fixest: 127ms (3.7x faster than Rust)
- Numba: 3775ms (8x slower than Rust accelerated)
Remove coefficient_based module and Grand acceleration complexity. Use IronsTuckAcceleration with MultiFactorProjector for 3+ FE, matching the simpler and well-tested 2-FE approach.

Performance on 100K difficult 3FE:
- rust-accelerated: 1008ms (~960 iterations)
- rust-simple: 3805ms (no acceleration)
- numba: 4221ms
- fixest (rpy2): 595ms

Rust is 3.8x faster than the simple approach but still 1.7x slower than fixest on hard convergence cases. Easy cases are very fast (11ms).
Implement a fresh coefficient-space iteration algorithm that closely follows fixest's C++ implementation:
- Add coef_space.rs with FEInfo struct and coefficient-space iteration
- Implement Irons-Tuck acceleration applied every iteration
- Implement Grand acceleration applied every 4 iterations
- Use nb_coef_no_Q optimization (accelerate only the first Q-1 FEs)
- Implement a multi-phase strategy for 3+ FEs:
  1. Warmup with all FEs (15 iterations)
  2. 2-FE sub-convergence on the first 2 FEs
  3. Re-acceleration with all FEs
- Add unsafe bounds-check elimination for hot loops
- Add #[inline(always)] on performance-critical functions

Performance vs fixest (R native):
- 2-FE cases: 2-10x faster
- 3-FE simple: ~1x (matches fixest)
- 3-FE difficult: 1.74x slower (down from 27x with the simple impl)
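The Irons-Tuck step itself is small: given a fixed-point map G and iterates x, gx = G(x), ggx = G(gx), it extrapolates along the last step direction. The sketch below is the textbook formula rather than the PR's actual code:

```rust
/// One Irons-Tuck extrapolation step for a fixed-point iteration.
/// With dgx = ggx - gx and d2x = ggx - 2*gx + x, the accelerated
/// iterate is ggx - (dgx . d2x / d2x . d2x) * dgx.
fn irons_tuck_step(x: &[f64], gx: &[f64], ggx: &[f64]) -> Vec<f64> {
    let mut vprod = 0.0;
    let mut ssq = 0.0;
    for i in 0..x.len() {
        let dgx = ggx[i] - gx[i];
        let d2x = dgx - (gx[i] - x[i]); // second difference
        vprod += dgx * d2x;
        ssq += d2x * d2x;
    }
    if ssq == 0.0 {
        return ggx.to_vec(); // already converged; nothing to extrapolate
    }
    let coef = vprod / ssq;
    (0..x.len()).map(|i| ggx[i] - coef * (ggx[i] - gx[i])).collect()
}
```

For an affine map G(x) = a*x + b this lands on the fixed point b/(1-a) in a single step, which is why it collapses the slow geometric convergence of alternating projections.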
Add bench_native_comparison.py that:
- Runs fixest directly via R subprocess (no rpy2 overhead)
- Compares pyfixest Rust accelerated vs simple implementations
- Tests multiple configurations (2-FE, 3-FE, simple/difficult DGP)
- Reports median times and ratios vs native fixest

Also add benchmarks/results/ to .gitignore for generated output.
Delete 6 files (1400 lines) that were superseded by coef_space.rs:
- acceleration.rs, buffers.rs, simd_ops.rs
- single_fe.rs, two_fe.rs, general.rs

These formed a dead-code cluster only referencing each other after the coefficient-space rewrite.
The multi-phase strategy for 3+ FE demeaning was producing incorrect results because of a mismatch in how the output array was interpreted:
- fixest stores the sum of FE coefficients in the output and accumulates across phases
- Our code stored the residual (input - coefs), causing in_out to be computed incorrectly for Phase 2 and Phase 3

The fix introduces a separate `mu` vector to track the sum of FE contributions (fixest's convention), then converts to the residual at the end. Each phase now correctly computes in_out = agg(input - mu) and adds its coefficients to mu.

This fixes the correctness issue where 3-FE demeaning was converging to a suboptimal solution (34% higher SSR than the simple algorithm).
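The convention can be sketched in a few lines (helper names are hypothetical, not the crate's API): `mu` accumulates what earlier phases have explained, each phase iterates on `input - mu`, and only the final output is a residual.

```rust
/// The effective input a phase works on: the original input minus
/// everything already explained by earlier phases (stored in mu).
fn effective_input(input: &[f64], mu: &[f64]) -> Vec<f64> {
    input.iter().zip(mu).map(|(x, m)| x - m).collect()
}

/// A phase adds the fixed-effect contributions it recovered into mu.
/// At the very end, effective_input(input, mu) is the residual.
fn accumulate(mu: &mut [f64], phase_contrib: &[f64]) {
    for (m, c) in mu.iter_mut().zip(phase_contrib) {
        *m += c;
    }
}
```

The bug described above amounts to conflating the two representations: storing `input - mu` in the output mid-run means later phases subtract contributions twice.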
- Flatten fe_ids from Vec<Vec<usize>> to Vec<usize> for better cache locality (eliminates pointer indirection)
- Flatten sum_weights from Vec<Vec<f64>> to Vec<f64> similarly
- Move FEInfo construction outside the parallel loop to share it across columns
- Add fe_ids_slice() and sum_weights_slice() helper methods for access

These changes improve performance by ~10% on the difficult 3-FE benchmark case through better memory access patterns.
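The flattened layout might look like the following sketch (struct and method names mirror the commit's description but the exact shape is an assumption): all factors' group ids sit in one contiguous allocation, and a slice helper recovers the per-factor view.

```rust
/// Flattened per-factor group ids: factor q's ids for all n_obs
/// observations occupy one contiguous run, so iterating a factor is a
/// single linear scan with no pointer chasing through nested Vecs.
struct FeInfo {
    n_obs: usize,
    fe_ids: Vec<usize>, // length = n_obs * n_factors
}

impl FeInfo {
    /// View of factor q's group ids (replaces fe_ids[q] indexing
    /// into a Vec<Vec<usize>>).
    fn fe_ids_slice(&self, q: usize) -> &[usize] {
        &self.fe_ids[q * self.n_obs..(q + 1) * self.n_obs]
    }
}
```

A side benefit is that one shared `FeInfo` can be borrowed immutably by every parallel column worker, matching the "construct outside the parallel loop" point above.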
- Add compute_beta_from_alpha function for efficient beta computation
- Add an SSR stopping criterion every 40 iterations in run_2fe_acceleration
- Use effective_input (input - mu) for correct SSR computation in the 3+ FE case

This matches fixest's early-stopping behavior for cases where the residual stops improving even though the coefficients are still changing.
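The stopping rule can be sketched as follows (a minimal illustration; the exact tolerance scaling fixest uses is an assumption here): checkpoint the sum of squared residuals periodically and stop once it essentially stops shrinking.

```rust
/// Sum of squared residuals of the current demeaned vector.
fn ssr(resid: &[f64]) -> f64 {
    resid.iter().map(|r| r * r).sum()
}

/// Early-stopping check run at each SSR checkpoint (e.g. every 40
/// iterations): stop when the improvement since the last checkpoint is
/// negligible relative to the current SSR, even if the fixed-effect
/// coefficients themselves are still drifting.
fn should_stop(prev_ssr: f64, cur_ssr: f64, tol: f64) -> bool {
    (prev_ssr - cur_ssr).abs() <= tol * (1.0 + cur_ssr)
}
```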
- Add a specialized project_qfe_3fe_unweighted function for the common case
- Use raw pointers instead of slice operations to eliminate bounds checking
- Unroll loops to process 4 observations at a time
- Eliminate redundant fill(0) operations by using direct assignment
- Add debug instrumentation behind the PYFIXEST_DEBUG_ITER env var

Performance improvement on the difficult 3-FE case:
- Before: 286ms (2.28x slower than fixest)
- After: 225ms (1.77x slower than fixest)
- 21% improvement in the hardest benchmark case

For all other benchmark cases, pyfixest remains 2-10x faster than fixest.
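A sketch of the hot-loop pattern described here (names hypothetical): a gather-style scatter of group coefficients into the output, with the length asserted once up front and `get_unchecked` inside the 4x-unrolled body to avoid per-access bounds checks.

```rust
/// Copy each observation's group coefficient into `out`, 4 at a time.
/// The caller must guarantee every id in `ids` is a valid index into
/// `coefs`; the length equality is asserted once instead of per access.
fn gather_coefs(out: &mut [f64], coefs: &[f64], ids: &[usize]) {
    assert_eq!(out.len(), ids.len());
    let n = out.len();
    let chunks = n / 4;
    for c in 0..chunks {
        let b = c * 4;
        // SAFETY: b + 3 < n by construction of `chunks`, and ids are
        // valid indices into `coefs` by the caller's contract.
        unsafe {
            *out.get_unchecked_mut(b) = *coefs.get_unchecked(*ids.get_unchecked(b));
            *out.get_unchecked_mut(b + 1) = *coefs.get_unchecked(*ids.get_unchecked(b + 1));
            *out.get_unchecked_mut(b + 2) = *coefs.get_unchecked(*ids.get_unchecked(b + 2));
            *out.get_unchecked_mut(b + 3) = *coefs.get_unchecked(*ids.get_unchecked(b + 3));
        }
    }
    // Checked scalar tail for the remaining 0-3 observations.
    for i in chunks * 4..n {
        out[i] = coefs[ids[i]];
    }
}
```

Writing directly into `out` like this is also what makes the redundant `fill(0)` passes unnecessary.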
The low-level demean() function had a tighter default tolerance (1e-8) than feols() and fixest (1e-6). This caused unfair benchmark comparisons showing pyfixest as 1.77x slower than fixest on hard cases.

With matching tolerance, pyfixest is:
- 2-8x faster than fixest on 7/8 benchmark cases
- Only 10% slower on the hardest case (100K difficult 3-FE)

Changes:
- Update the demean.py default tol from 1e-8 to 1e-6
- Update the FixestConfig default in Rust to match
- Add ARM64 NEON compiler flags in .cargo/config.toml
- Update the benchmark to use the correct tolerance
Major performance improvements for feols():
- Removed a gc.collect() call that added ~50ms overhead per model fit
- Updated the benchmark to compare feols() vs feols() (not demean vs feols)
- The benchmark now uses the Rust backend for a fair comparison

Results with the Rust backend:
- Most cases: pyfixest within 1.03-1.6x of fixest
- Hardest case (100K difficult 3FE): pyfixest is 1.8x FASTER than fixest
Re: the failing tests at the moment - they seem to come from the prediction method, which is notoriously fickle and should be fixable by loosening the tolerance or just checking a different subset of the prediction array:

x1 = array([23.03009222, 7.43960451, 14.04928582, 17.83110832, 17.46495242])
x2 = array([23.03009052, 7.43960323, 14.04928428, 17.83110934, 17.4649537 ])



This PR implements a simple version of the Irons-Tuck acceleration.
Closes #357