Refactor formula parsing #1118

Closed
leostimpfle wants to merge 78 commits into master from formula

@leostimpfle
Collaborator

This is a proof of concept for a refactor of PyFixest's formula parsing. The PR introduces a new module, parse, that rebuilds formula parsing from the ground up.

The core logic is implemented in pyfixest.estimation.formula.parse.parse, which takes a formula string and returns a collection of parsed formulas, each represented by pyfixest.estimation.formula.parse.Formula.

All references to the old FormulaParser are bypassed, mostly by aliasing the new class to the old FixestFormula name with imports of the form from pyfixest.estimation.formula.parse import Formula as FixestFormula.
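
To make the "formula string in, collection of parsed formulas out" shape concrete, here is a minimal toy sketch. It is not the code in this PR: ToyFormula and toy_parse are hypothetical names, and the sketch assumes a fixest-style string with an optional fixed-effects part after | and several dependent variables joined by +. It ignores instruments, stepwise syntax, and interactions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToyFormula:
    """Hypothetical stand-in for a parsed formula (not PyFixest's Formula)."""

    second_stage: str
    fixed_effects: Optional[str] = None


def toy_parse(formula: str) -> List[ToyFormula]:
    """Return one ToyFormula per dependent variable in the formula string."""
    # Split off the optional fixed-effects part after the first "|".
    main, _, fixed_effects = (part.strip() for part in formula.partition("|"))
    # Split the remaining part into left- and right-hand side.
    lhs, sep, rhs = (part.strip() for part in main.partition("~"))
    if not sep:
        raise ValueError(f"Formula has no '~': {formula!r}")
    # Several dependent variables expand into several formulas.
    dependents = [dep.strip() for dep in lhs.split("+")]
    return [ToyFormula(f"{dep} ~ {rhs}", fixed_effects or None) for dep in dependents]


for parsed in toy_parse("Y1 + Y2 ~ X1 | f1 + f2"):
    print(parsed)
```

Returning a list even for a single formula keeps the calling code uniform, which is presumably why the real parse also returns a collection.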

@codecov

codecov bot commented Dec 28, 2025

Codecov Report

❌ Patch coverage is 87.03704% with 63 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pyfixest/estimation/formula/factor_interaction.py | 64.78% | 25 Missing ⚠️ |
| pyfixest/did/saturated_twfe.py | 0.00% | 11 Missing ⚠️ |
| pyfixest/estimation/formula/utils.py | 85.13% | 11 Missing ⚠️ |
| pyfixest/estimation/model_matrix_fixest_.py | 12.50% | 7 Missing ⚠️ |
| pyfixest/estimation/formula/model_matrix.py | 95.31% | 6 Missing ⚠️ |
| pyfixest/estimation/formula/parse.py | 98.43% | 2 Missing ⚠️ |
| pyfixest/estimation/feols_.py | 95.45% | 1 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (9ba671b) and HEAD (f25b4e8).

HEAD has 1 upload less than BASE

| Flag | BASE (9ba671b) | HEAD (f25b4e8) |
|---|---|---|
| tests-extended | 1 | 0 |

| Flag | Coverage Δ |
|---|---|
| core-tests | 71.97% <87.03%> (-2.98%) ⬇️ |
| tests-extended | ? |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| pyfixest/did/did2s.py | 89.32% <100.00%> (-0.31%) ⬇️ |
| pyfixest/errors/__init__.py | 100.00% <100.00%> (ø) |
| pyfixest/estimation/FixestMulti_.py | 77.47% <100.00%> (-0.25%) ⬇️ |
| pyfixest/estimation/FormulaParser.py | 49.50% <100.00%> (-46.97%) ⬇️ |
| pyfixest/estimation/demean_.py | 54.91% <100.00%> (ø) |
| pyfixest/estimation/fegaussian_.py | 87.09% <100.00%> (ø) |
| pyfixest/estimation/feglm_.py | 73.35% <100.00%> (ø) |
| pyfixest/estimation/feiv_.py | 87.27% <100.00%> (ø) |
| pyfixest/estimation/felogit_.py | 88.57% <100.00%> (ø) |
| pyfixest/estimation/feols_compressed_.py | 80.23% <100.00%> (ø) |
... and 14 more

... and 6 files with indirect coverage changes

@s3alfisc
Member

After a first look at the code base: much better and much cleaner than before. No fundamental suggestions for improvement from my side. Thank you!

@leostimpfle
Collaborator Author

Note to self: Look into allowing formulaic's multi-stage formula notation in the parser. For more background, see matthewwardrop/formulaic#108 and matthewwardrop/formulaic#24.

@s3alfisc
Member

s3alfisc commented Jan 1, 2026

@leostimpfle fixed a bug in the bin() function + adjusted the tests. You can run them via pixi r -e dev pytest tests/test_i.py. They currently give two errors:

      40 passed
       2 failed
         - tests/test_i.py:286 test_factor_x_factor[Y ~ i(f_str, i.g)-Y ~ i(f_str, g)]
         - tests/test_i.py:304 test_factor_x_factor_with_fe[Y ~ i(f_str, i.g) | fe1-Y ~ i(f_str, g) | fe1]
E           AssertionError: Name mismatch:
E               py=['f_str::apple:g::X', 'f_str::apple:g::Y', 'f_str::apple:g::Z', 'f_str::banana:g::X', 'f_str::banana:g::Y', 'f_str::banana:g::Z', 'f_str::cherry:g::X', 'f_str::cherry:g::Y']
E               r=['f_str::apple:g::Y', 'f_str::apple:g::Z', 'f_str::banana:g::X', 'f_str::banana:g::Y', 'f_str::banana:g::Z', 'f_str::cherry:g::X', 'f_str::cherry:g::Y', 'f_str::cherry:g::Z']
E           assert ['f_str::appl...na:g::Z', ...] == ['f_str::appl...ry:g::X', ...]
E             
E             At index 0 diff: 'f_str::apple:g::X' != 'f_str::apple:g::Y'
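
One way to read this mismatch: both lists have 8 of the 9 possible f_str × g names, but each omits a different one. A minimal sketch that works this out from the two lists above (the level names are taken from the test output; nothing here calls PyFixest or R):

```python
from itertools import product

py = [
    "f_str::apple:g::X", "f_str::apple:g::Y", "f_str::apple:g::Z",
    "f_str::banana:g::X", "f_str::banana:g::Y", "f_str::banana:g::Z",
    "f_str::cherry:g::X", "f_str::cherry:g::Y",
]
r = [
    "f_str::apple:g::Y", "f_str::apple:g::Z",
    "f_str::banana:g::X", "f_str::banana:g::Y", "f_str::banana:g::Z",
    "f_str::cherry:g::X", "f_str::cherry:g::Y", "f_str::cherry:g::Z",
]

# All 3 x 3 combinations of the factor levels seen in the output.
full = [
    f"f_str::{f}:g::{g}"
    for f, g in product(["apple", "banana", "cherry"], ["X", "Y", "Z"])
]

# The name each side leaves out.
missing_py = sorted(set(full) - set(py))
missing_r = sorted(set(full) - set(r))
print("omitted on the Python side:", missing_py)
print("omitted on the R side:", missing_r)
```

The Python side omits the last combination (cherry, Z), while the R side omits the first (apple, X), so the two appear to differ in which cell is left out rather than in the naming scheme itself.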

@leostimpfle
Collaborator Author

leostimpfle commented Jan 5, 2026

  • I am still not a big fan of Formula.first_stage and Formula.second_stage not containing fixed effects - potentially misleading to users despite documentation? Maybe we should add Formula.first_stage_no_fixed_effects etc. as extra attributes to make more explicit what type of formula users are dealing with?

Agreed that this is somewhat unintuitive. An alternative to changing the attribute names could be to include the encoded fixed effects directly in the formula. For example, instead of formula_kwargs = {'second_stage': 'Y ~ X1', 'fixed_effects' : 'f1 + f2'}, we could use formula_kwargs = {'second_stage': 'Y ~ X1 + __fixed_effect__(f1) + __fixed_effect__(f2)'} (where the sentinel __fixed_effect__ indicates the integer encoding of fixed effects). The main point is that the latter formula is what we already pass implicitly to formulaic, so in this approach we should call the attribute second_stage_formulaic.
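
As a rough illustration of the second representation, the following sketch folds the fixed_effects entry into the second-stage string using the __fixed_effect__ sentinel. The helper name to_formulaic_second_stage is hypothetical and not part of this PR; the sketch assumes fixed effects are separated by + and does not handle interacted fixed effects.

```python
from typing import Dict


def to_formulaic_second_stage(formula_kwargs: Dict[str, str]) -> str:
    """Append one __fixed_effect__(...) term per fixed effect (hypothetical helper)."""
    second_stage = formula_kwargs["second_stage"]
    fixed_effects = formula_kwargs.get("fixed_effects")
    if not fixed_effects:
        # No fixed effects: the second stage is already the full formula.
        return second_stage
    terms = [f"__fixed_effect__({fe.strip()})" for fe in fixed_effects.split("+")]
    return " + ".join([second_stage, *terms])


formula_kwargs = {"second_stage": "Y ~ X1", "fixed_effects": "f1 + f2"}
print(to_formulaic_second_stage(formula_kwargs))
```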

  • Can you specify the reason for the check FORMULAIC_FEATURE_FLAG is DefaultFormulaParser.FeatureFlags.ALL in several spots in the code base? Why is it needed? Are there potential downsides?

This is a hangover from my early attempts to use formulaic's multistage syntax (see #1125). DefaultFormulaParser.FeatureFlags.ALL indicates that the multistage syntax is enabled, but FORMULAIC_FEATURE_FLAG is set to DefaultFormulaParser.FeatureFlags.DEFAULT (i.e., multistage syntax is disabled), so the check never passes. For clarity, I have removed references to FORMULAIC_FEATURE_FLAG in the parser for now.
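
To show why such a check is dead code while the flag stays at DEFAULT, here is a self-contained stand-in built on the standard library's enum.Flag. It does not import formulaic: the FeatureFlags class below is a mock, and the member names other than DEFAULT and ALL are assumptions made for illustration.

```python
from enum import Flag, auto


class FeatureFlags(Flag):
    """Mock of a parser feature-flag enum (assumed layout, not formulaic's code)."""

    TWOSIDED = auto()
    MULTIPART = auto()
    MULTISTAGE = auto()
    DEFAULT = TWOSIDED | MULTIPART  # multistage syntax disabled
    ALL = TWOSIDED | MULTIPART | MULTISTAGE  # multistage syntax enabled


FORMULAIC_FEATURE_FLAG = FeatureFlags.DEFAULT


def multistage_enabled(flag: FeatureFlags) -> bool:
    """Check the single capability instead of comparing against ALL."""
    return bool(flag & FeatureFlags.MULTISTAGE)


# With the flag fixed at DEFAULT, the identity check is always False.
print(FORMULAIC_FEATURE_FLAG is FeatureFlags.ALL)
print(multistage_enabled(FORMULAIC_FEATURE_FLAG))
```

If the multistage syntax is revisited later, testing for the specific capability would keep working even if further flags were added to ALL.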

  • Is the sort argument in parse still needed?

Not needed, and I have removed it.

I committed a few changes, I hope all of these make sense to you @leostimpfle and are more or less self-explanatory by the commit message?

Yes, all good. Thanks @s3alfisc!

leostimpfle and others added 3 commits February 2, 2026 09:42
* Simplify formula parsing

* Fix pre-commit [skip ci]

* Enable multiple dependent variables #1116

* Add endogenous variables as covariates

* Add drop_intercept

* Update test_formula_parse

* Fix parsing

* Update first_stage

* Add default value to Formula

* Disable variable-based checks

* Fix did2s

* Add multiverse stepwise syntax, closes #1136

* Delegate variable-level checks to formulaic's parser

leostimpfle linked an issue Feb 2, 2026 that may be closed by this pull request

@s3alfisc
Member

OK, now refactored into 4 PRs: #1186, #1187, #1188, #1189.

@leostimpfle
Collaborator Author

Closed in favour of #1186