Feature/demean accelerated #995
Conversation
@schroedk just rebased / merged changes from master in here (new features, plus I moved to using a pixi toml; also, the dev env no longer installs on Windows due to compatibility challenges). As before, you can build the Rust bindings and run the benchmarks via the same commands.
(force-pushed dbf6e71 to 1e633c1)
@s3alfisc do you have the code to run this benchmark?
Yes, it's here: https://github.com/s3alfisc/fixest_benchmarks. I hope I documented the setup well; I think clone + just + the task runner should get you started. The OLS benchmarks for the hard problem are the relevant ones. Note: @grantmcdermott mentioned the other day that there might be a minor issue with the benchmarks (though I don't know what exactly), so best to take them with a grain of salt (I couldn't spot it myself; everything looked OK to me).
Wait, it looks like I didn't push my local changes including the just setup. One sec.
It's in the justfile branch on the remote 😅 https://github.com/s3alfisc/fixest_benchmarks. Requirements: global R and Julia installations, plus just. Then run `just setup` to install all package deps as well as Python (in a local env), and `just bench-ols` for the benchmarks.
Replace explicit SIMD intrinsics from the `wide` crate with unrolled scalar loops that the compiler can auto-vectorize. This simplifies the code, removes a dependency, and makes the code more portable across platforms while still achieving good performance.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <[email protected]>
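As a rough illustration of the pattern this commit describes (the function name and kernel are hypothetical, not the PR's actual code): splitting a reduction across several independent accumulators removes the loop-carried dependency, which is what lets LLVM map the scalar loop onto SIMD lanes without any `wide` intrinsics.

```rust
/// Weighted sum written as an unrolled scalar loop. Four independent
/// accumulators break the dependency chain so the compiler can
/// auto-vectorize; no explicit SIMD intrinsics are needed.
fn weighted_sum(x: &[f64], w: &[f64]) -> f64 {
    assert_eq!(x.len(), w.len());
    let mut acc = [0.0f64; 4];
    let chunks = x.len() / 4;
    for i in 0..chunks {
        let b = i * 4;
        // Each accumulator is independent of the other three.
        acc[0] += x[b] * w[b];
        acc[1] += x[b + 1] * w[b + 1];
        acc[2] += x[b + 2] * w[b + 2];
        acc[3] += x[b + 3] * w[b + 3];
    }
    let mut total = acc[0] + acc[1] + acc[2] + acc[3];
    // Scalar tail for lengths not divisible by 4.
    for i in chunks * 4..x.len() {
        total += x[i] * w[i];
    }
    total
}
```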
- Add benchmarks/demean_benchmark.py for comparing demeaning backends (rust-accelerated, rust-simple, numba, cupy, fixest via rpy2)
- Add benchmarks/bench_demean_r.R for native R fixest benchmarking
- Remove coefficient clamping in Irons-Tuck acceleration to match fixest

Performance results (100K difficult 3FE):
- Rust accelerated: 464ms (1.3x faster than fixest via rpy2)
- Native R fixest: 127ms (3.7x faster than Rust)
- Numba: 3775ms (8x slower than Rust accelerated)
Remove coefficient_based module and Grand acceleration complexity. Use IronsTuckAcceleration with MultiFactorProjector for 3+ FE, matching the simpler and well-tested 2-FE approach.

Performance on 100K difficult 3FE:
- rust-accelerated: 1008ms (~960 iterations)
- rust-simple: 3805ms (no acceleration)
- numba: 4221ms
- fixest (rpy2): 595ms

Rust is 3.8x faster than the simple approach but still 1.7x slower than fixest on hard convergence cases. Easy cases are very fast (11ms).
Implement a fresh coefficient-space iteration algorithm that closely follows fixest's C++ implementation:
- Add coef_space.rs with FEInfo struct and coefficient-space iteration
- Implement Irons-Tuck acceleration applied every iteration
- Implement Grand acceleration applied every 4 iterations
- Use nb_coef_no_Q optimization (accelerate only the first Q-1 FEs)
- Implement a multi-phase strategy for 3+ FEs:
  1. Warmup with all FEs (15 iterations)
  2. 2-FE sub-convergence on the first 2 FEs
  3. Re-acceleration with all FEs
- Add unsafe bounds-check elimination for hot loops
- Add #[inline(always)] on performance-critical functions

Performance vs fixest (R native):
- 2-FE cases: 2-10x faster
- 3-FE simple: ~1x (matches fixest)
- 3-FE difficult: 1.74x slower (down from 27x with the simple impl)
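The Irons-Tuck step itself is small: given a fixed-point map G and iterates x, gx = G(x), ggx = G(gx), it extrapolates along the last step direction. The sketch below is the textbook formula rather than the PR's actual code:

```rust
/// One Irons-Tuck extrapolation step for a fixed-point iteration.
/// With dgx = ggx - gx and d2x = ggx - 2*gx + x, the accelerated
/// iterate is ggx - (dgx . d2x / d2x . d2x) * dgx.
fn irons_tuck_step(x: &[f64], gx: &[f64], ggx: &[f64]) -> Vec<f64> {
    let mut vprod = 0.0;
    let mut ssq = 0.0;
    for i in 0..x.len() {
        let dgx = ggx[i] - gx[i];
        let d2x = dgx - (gx[i] - x[i]); // second difference
        vprod += dgx * d2x;
        ssq += d2x * d2x;
    }
    if ssq == 0.0 {
        return ggx.to_vec(); // already converged; nothing to extrapolate
    }
    let coef = vprod / ssq;
    (0..x.len()).map(|i| ggx[i] - coef * (ggx[i] - gx[i])).collect()
}
```

For an affine map G(x) = a*x + b this lands on the fixed point b/(1-a) in a single step, which is why it collapses the slow geometric convergence of alternating projections.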
Add bench_native_comparison.py that:
- Runs fixest directly via R subprocess (no rpy2 overhead)
- Compares pyfixest Rust accelerated vs simple implementations
- Tests multiple configurations (2-FE, 3-FE, simple/difficult DGP)
- Reports median times and ratios vs native fixest

Also add benchmarks/results/ to .gitignore for generated output.
Delete 6 files (1400 lines) that were superseded by coef_space.rs:
- acceleration.rs, buffers.rs, simd_ops.rs
- single_fe.rs, two_fe.rs, general.rs

These formed a dead-code cluster only referencing each other after the coefficient-space rewrite.
The multi-phase strategy for 3+ FE demeaning was producing incorrect results because of a mismatch in how the output array was interpreted:
- fixest stores the sum of FE coefficients in the output and accumulates across phases
- Our code stored the residual (input - coefs), causing in_out to be computed incorrectly for Phase 2 and Phase 3

The fix introduces a separate `mu` vector to track the sum of FE contributions (fixest's convention), then converts to the residual at the end. Each phase now correctly computes in_out = agg(input - mu) and adds its coefficients to mu.

This fixes the correctness issue where 3-FE demeaning was converging to a suboptimal solution (34% higher SSR than the simple algorithm).
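The convention can be sketched in a few lines (helper names are hypothetical, not the crate's API): `mu` accumulates what earlier phases have explained, each phase iterates on `input - mu`, and only the final output is a residual.

```rust
/// The effective input a phase works on: the original input minus
/// everything already explained by earlier phases (stored in mu).
fn effective_input(input: &[f64], mu: &[f64]) -> Vec<f64> {
    input.iter().zip(mu).map(|(x, m)| x - m).collect()
}

/// A phase adds the fixed-effect contributions it recovered into mu.
/// At the very end, effective_input(input, mu) is the residual.
fn accumulate(mu: &mut [f64], phase_contrib: &[f64]) {
    for (m, c) in mu.iter_mut().zip(phase_contrib) {
        *m += c;
    }
}
```

The bug described above amounts to conflating the two representations: storing `input - mu` in the output mid-run means later phases subtract contributions twice.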
- Flatten fe_ids from Vec<Vec<usize>> to Vec<usize> for better cache locality (eliminates pointer indirection)
- Flatten sum_weights from Vec<Vec<f64>> to Vec<f64> similarly
- Move FEInfo construction outside the parallel loop to share it across columns
- Add fe_ids_slice() and sum_weights_slice() helper methods for access

These changes improve performance by ~10% on the difficult 3-FE benchmark case through better memory access patterns.
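The flattened layout might look like the following sketch (struct and method names mirror the commit's description but the exact shape is an assumption): all factors' group ids sit in one contiguous allocation, and a slice helper recovers the per-factor view.

```rust
/// Flattened per-factor group ids: factor q's ids for all n_obs
/// observations occupy one contiguous run, so iterating a factor is a
/// single linear scan with no pointer chasing through nested Vecs.
struct FeInfo {
    n_obs: usize,
    fe_ids: Vec<usize>, // length = n_obs * n_factors
}

impl FeInfo {
    /// View of factor q's group ids (replaces fe_ids[q] indexing
    /// into a Vec<Vec<usize>>).
    fn fe_ids_slice(&self, q: usize) -> &[usize] {
        &self.fe_ids[q * self.n_obs..(q + 1) * self.n_obs]
    }
}
```

A side benefit is that one shared `FeInfo` can be borrowed immutably by every parallel column worker, matching the "construct outside the parallel loop" point above.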
- Add compute_beta_from_alpha function for efficient beta computation
- Add an SSR stopping criterion every 40 iterations in run_2fe_acceleration
- Use effective_input (input - mu) for correct SSR computation in the 3+ FE case

This matches fixest's early-stopping behavior for cases where the residual stops improving even though the coefficients are still changing.
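The stopping rule can be sketched as follows (a minimal illustration; the exact tolerance scaling fixest uses is an assumption here): checkpoint the sum of squared residuals periodically and stop once it essentially stops shrinking.

```rust
/// Sum of squared residuals of the current demeaned vector.
fn ssr(resid: &[f64]) -> f64 {
    resid.iter().map(|r| r * r).sum()
}

/// Early-stopping check run at each SSR checkpoint (e.g. every 40
/// iterations): stop when the improvement since the last checkpoint is
/// negligible relative to the current SSR, even if the fixed-effect
/// coefficients themselves are still drifting.
fn should_stop(prev_ssr: f64, cur_ssr: f64, tol: f64) -> bool {
    (prev_ssr - cur_ssr).abs() <= tol * (1.0 + cur_ssr)
}
```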
- Add a specialized project_qfe_3fe_unweighted function for the common case
- Use raw pointers instead of slice operations to eliminate bounds checking
- Unroll loops to process 4 observations at a time
- Eliminate redundant fill(0) operations by using direct assignment
- Add debug instrumentation behind the PYFIXEST_DEBUG_ITER env var

Performance improvement on the difficult 3-FE case:
- Before: 286ms (2.28x slower than fixest)
- After: 225ms (1.77x slower than fixest)
- 21% improvement in the hardest benchmark case

For all other benchmark cases, pyfixest remains 2-10x faster than fixest.
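A sketch of the hot-loop pattern described here (names hypothetical): a gather-style scatter of group coefficients into the output, with the length asserted once up front and `get_unchecked` inside the 4x-unrolled body to avoid per-access bounds checks.

```rust
/// Copy each observation's group coefficient into `out`, 4 at a time.
/// The caller must guarantee every id in `ids` is a valid index into
/// `coefs`; the length equality is asserted once instead of per access.
fn gather_coefs(out: &mut [f64], coefs: &[f64], ids: &[usize]) {
    assert_eq!(out.len(), ids.len());
    let n = out.len();
    let chunks = n / 4;
    for c in 0..chunks {
        let b = c * 4;
        // SAFETY: b + 3 < n by construction of `chunks`, and ids are
        // valid indices into `coefs` by the caller's contract.
        unsafe {
            *out.get_unchecked_mut(b) = *coefs.get_unchecked(*ids.get_unchecked(b));
            *out.get_unchecked_mut(b + 1) = *coefs.get_unchecked(*ids.get_unchecked(b + 1));
            *out.get_unchecked_mut(b + 2) = *coefs.get_unchecked(*ids.get_unchecked(b + 2));
            *out.get_unchecked_mut(b + 3) = *coefs.get_unchecked(*ids.get_unchecked(b + 3));
        }
    }
    // Checked scalar tail for the remaining 0-3 observations.
    for i in chunks * 4..n {
        out[i] = coefs[ids[i]];
    }
}
```

Writing directly into `out` like this is also what makes the redundant `fill(0)` passes unnecessary.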
The low-level demean() function had a tighter default tolerance (1e-8) than feols() and fixest (1e-6). This caused unfair benchmark comparisons showing pyfixest as 1.77x slower than fixest on hard cases.

With matching tolerance, pyfixest is:
- 2-8x faster than fixest on 7/8 benchmark cases
- Only 10% slower on the hardest case (100K difficult 3-FE)

Changes:
- Update the demean.py default tol from 1e-8 to 1e-6
- Update the FixestConfig default in Rust to match
- Add ARM64 NEON compiler flags in .cargo/config.toml
- Update the benchmark to use the correct tolerance
Major performance improvements for feols():
- Removed a gc.collect() call that added ~50ms overhead per model fit
- Updated the benchmark to compare feols() vs feols() (not demean vs feols)
- The benchmark now uses the Rust backend for a fair comparison

Results with the Rust backend:
- Most cases: pyfixest within 1.03-1.6x of fixest
- Hardest case (100K difficult 3FE): pyfixest is 1.8x FASTER than fixest
Re: the failing tests at the moment - they seem to come from the prediction method, which is notoriously fickle and should be fixable by loosening the tolerance or just checking a different subset of the prediction array:

x1 = array([23.03009222, 7.43960451, 14.04928582, 17.83110832, 17.46495242])
x2 = array([23.03009052, 7.43960323, 14.04928428, 17.83110934, 17.4649537 ])



This PR implements a simple version of the Irons-Tuck acceleration.
Closes #357