Commit 01546b5

committed
merge master
2 parents ed84c68 + ab6dfda commit 01546b5

41 files changed, +2569 -318 lines changed

.github/workflows/build-and-relase.yaml

Lines changed: 6 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -1,4 +1,4 @@
1-
# This file is autogenerated by maturin v1.8.6
1+
# This file is autogenerated by maturin v1.9.6
22
# To update, run
33
#
44
# maturin generate-ci github --platform all
@@ -38,7 +38,7 @@ jobs:
3838
- uses: actions/checkout@v4
3939
- uses: actions/setup-python@v5
4040
with:
41-
python-version: '3.13'
41+
python-version: 3.x
4242
- name: Build wheels
4343
uses: PyO3/maturin-action@v1
4444
with:
@@ -69,7 +69,7 @@ jobs:
6969
- uses: actions/checkout@v4
7070
- uses: actions/setup-python@v5
7171
with:
72-
python-version: '3.13'
72+
python-version: 3.x
7373
- name: Build wheels
7474
uses: PyO3/maturin-action@v1
7575
with:
@@ -96,7 +96,7 @@ jobs:
9696
- uses: actions/checkout@v4
9797
- uses: actions/setup-python@v5
9898
with:
99-
python-version: '3.13'
99+
python-version: 3.x
100100
architecture: ${{ matrix.platform.target }}
101101
- name: Build wheels
102102
uses: PyO3/maturin-action@v1
@@ -115,15 +115,15 @@ jobs:
115115
strategy:
116116
matrix:
117117
platform:
118-
- runner: macos-13
118+
- runner: macos-14
119119
target: x86_64
120120
- runner: macos-14
121121
target: aarch64
122122
steps:
123123
- uses: actions/checkout@v4
124124
- uses: actions/setup-python@v5
125125
with:
126-
python-version: '3.13'
126+
python-version: 3.x
127127
- name: Build wheels
128128
uses: PyO3/maturin-action@v1
129129
with:

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -42,3 +42,5 @@ coverage.xml
4242
# pixi environments
4343
.pixi/*
4444
!.pixi/config.toml
45+
SKILL.md
46+
CLAUDE.md

docs/_quarto.yml

Lines changed: 7 additions & 0 deletions
@@ -121,6 +121,13 @@ quartodoc:
121121
- report.coefplot
122122
- report.iplot
123123
- did.visualize.panelview
124+
- title: Formula Parsing & Model Matrix
125+
desc: |
126+
Internal APIs for formula parsing and model matrix construction
127+
contents:
128+
- estimation.formula.parse.Formula
129+
- estimation.formula.model_matrix.ModelMatrix
130+
- estimation.formula.factor_interaction.factor_interaction
124131
- title: Misc / Utilities
125132
desc: |
126133
PyFixest internals and utilities

docs/_sidebar.yml

Lines changed: 6 additions & 0 deletions
@@ -34,6 +34,12 @@ website:
3434
- reference/report.iplot.qmd
3535
- reference/did.visualize.panelview.qmd
3636
section: Summarize and Visualize
37+
- contents:
38+
- reference/estimation.formula.parse.Formula.qmd
39+
- reference/estimation.formula.parse.parse.qmd
40+
- reference/estimation.formula.model_matrix.ModelMatrix.qmd
41+
- reference/estimation.formula.factor_interaction.factor_interaction.qmd
42+
section: Formula Parsing & Model Matrix
3743
- contents:
3844
- reference/estimation.demean.qmd
3945
- reference/estimation.detect_singletons.qmd

docs/changelog.qmd

Lines changed: 99 additions & 1 deletion
@@ -15,7 +15,105 @@ fit2 = pf.feols("Y ~ X1 + X2", data = df)
1515
fit3 = pf.feols("Y ~ X1 + X2 | f1", data = df)
1616
```
1717

18-
## PyFixest 0.41.0 (In Development)
18+
## PyFixest 0.50.0 (In Development)
19+
20+
::: {.callout-tip}
21+
You can install the latest pre-release to try out the new features:
22+
23+
```{.bash .code-copy}
24+
pip install pyfixest==0.50.0a1
25+
pip install --pre pyfixest
26+
```
27+
:::
28+
29+
### Reworked Formula Parsing and New `i()` Operator
30+
31+
The formula parsing module has been significantly reworked. The new implementation introduces a cleaner `Formula` class,
32+
a rewritten `i()` operator that closely follows R's `fixest` syntax, and new multiple estimation operators.
33+
34+
#### New `i()` operator
35+
36+
The `i()` operator now follows the `fixest` naming convention, using `::` to separate variable names from levels. It also
37+
provides two new arguments, `ref2` and `bin2`, which set
38+
reference levels for interacted variables and bin categorical levels.
39+
40+
**Simple categorical encoding:**
41+
42+
```{python}
43+
# Coefficient per level, first level dropped when intercept is present
44+
fit = pf.feols("Y ~ i(f1, ref=1)", data=df)
45+
fit.coef().head()
46+
```
47+
48+
**Factor x Continuous interaction:**
49+
50+
```{python}
51+
# Each level of f1 gets its own slope on X1
52+
fit = pf.feols("Y ~ i(f1, X1, ref=1)", data=df)
53+
fit.coef().head()
54+
```
55+
56+
**Factor x Factor interaction** with `ref2`:
57+
58+
The `ref2` argument controls the reference level of the second variable in a factor-by-factor interaction.
59+
60+
```{python}
61+
import numpy as np
62+
df["group"] = np.where(df["f1"] < 15, "A", "B")
63+
64+
# Full interaction: f1 levels x group levels
65+
# ref drops from f1, ref2 drops from group
66+
fit = pf.feols("Y ~ i(f1, group, ref=1, ref2='A')", data=df)
67+
fit.coef().head()
68+
```
69+
70+
**Binning** with `bin` and `bin2`:
71+
72+
The `bin` parameter merges categorical levels before encoding. This is useful for collapsing sparse categories.
73+
Values not in the mapping are kept unchanged, matching R `fixest` behavior.
74+
75+
```{python}
76+
df["size"] = np.where(df["f1"] < 10, "small", np.where(df["f1"] < 20, "medium", "large"))
77+
78+
# Merge 'small' and 'medium' into 'not_large', then use as reference
79+
fit = pf.feols("Y ~ i(size, bin={'not_large': ['small','medium']}, ref='not_large')", data=df)
80+
fit.coef()
81+
```
82+
83+
`bin2` applies binning to the second variable in a factor-by-factor interaction.
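
The merging rule described above (listed levels are collapsed into a new label, unlisted levels pass through unchanged) can be illustrated with a small standalone sketch. This is not PyFixest's implementation; `apply_bin` is a hypothetical helper, and the shape of `bin_spec` simply mirrors the `bin={...}` dictionary in the example formula.

```python
def apply_bin(values, bin_spec):
    """Collapse levels according to bin_spec; unlisted levels are kept as-is.

    bin_spec maps a new label to the list of old levels it replaces,
    mirroring the dict passed to `bin=` in the formula above (assumed shape).
    """
    # Invert {"new": ["old1", "old2"]} into {"old1": "new", "old2": "new"}.
    level_to_bin = {
        old: new for new, old_levels in bin_spec.items() for old in old_levels
    }
    # Levels missing from the mapping fall back to themselves.
    return [level_to_bin.get(value, value) for value in values]


sizes = ["small", "medium", "large", "small"]
binned = apply_bin(sizes, {"not_large": ["small", "medium"]})
print(binned)
```

After binning, `'not_large'` behaves like any other level, which is why it can be passed to `ref`.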
84+
85+
#### `mvsw()` and multiple estimation
86+
87+
88+
`mvsw()` (**multiverse stepwise**) generates all $2^k$ combinations of the provided variables, including the intercept-only model.
89+
90+
```{python}
91+
# mvsw: all combinations of X1 and X2
92+
fits = pf.feols("Y ~ mvsw(X1, X2)", data=df)
93+
pf.etable(fits)
94+
```
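
To make the $2^k$ count concrete, the following sketch enumerates every subset of a variable list as a right-hand side. It only illustrates the combinatorics; `mvsw_formulas` is a hypothetical helper, and the ordering and string formatting PyFixest uses internally may differ.

```python
from itertools import combinations


def mvsw_formulas(depvar, variables):
    """Enumerate one formula per subset of `variables` (2**k in total)."""
    formulas = []
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            # The empty subset corresponds to the intercept-only model.
            rhs = " + ".join(subset) if subset else "1"
            formulas.append(f"{depvar} ~ {rhs}")
    return formulas


formulas = mvsw_formulas("Y", ["X1", "X2"])
for fml in formulas:
    print(fml)
```

With two variables this yields four models: intercept only, `X1`, `X2`, and `X1 + X2`.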
95+
96+
Multiple estimation operators can be combined:
97+
98+
```{python}
99+
# sw() swaps in X1 and X2 one at a time; csw() adds f1, then f1 + f2, cumulatively
100+
fits = pf.feols("Y ~ sw(X1, X2) + csw(f1,f2)", data=df)
101+
pf.etable(fits)
102+
```
103+
104+
Last, you can run operations within the operators:
105+
106+
```{python}
107+
# sw() with a compound term: first X1, then f1 + X2
108+
fits = pf.feols("Y ~ sw(X1, f1 + X2)", data=df)
109+
pf.etable(fits)
110+
```
111+
112+
#### Deprecations
113+
114+
- `FixestFormulaParser` is deprecated in favor of `Formula.parse()`. A `FutureWarning` is emitted when the old class is used.
115+
- `model_matrix_fixest()` is deprecated in favor of `create_model_matrix()`.
116+
19117

20118
### Migration to maketables
21119

docs/llms.txt

Lines changed: 3 additions & 2 deletions
@@ -71,7 +71,7 @@ Pass to the `vcov` argument:
7171

7272
## Model Methods
7373

74-
After estimation, model objects support:
74+
After estimation, model objects support (full API: https://py-econometrics.github.io/pyfixest/reference/estimation.feols_.Feols.md):
7575

7676
- `.summary()` — Print regression summary
7777
- `.tidy()` — Tidy DataFrame of coefficients, SEs, t-stats, p-values, CIs
@@ -84,7 +84,8 @@ After estimation, model objects support:
8484
- `.vcov()` — Variance-covariance matrix
8585
- `.wildboottest(param, reps, seed)` — Wild cluster bootstrap inference
8686
- `.ccv(treatment, pk, qk, ...)` — Causal cluster variance estimator
87-
- `.rio(param, reps, ...)` — Randomization inference
87+
- `.ritest(param, reps, ...)` — Randomization inference
88+
- `.decompose(param, x1_vars, type, ...)` — Gelbach (2016) decomposition for mediation analysis (explains coefficient change from short to long model)
8889

8990
## Quick Example
9091

docs/quickstart.qmd

Lines changed: 1 addition & 1 deletion
@@ -507,7 +507,7 @@ multi_fit.etable()
507507
You can access an individual model by its name - i.e. a formula - via the `all_fitted_models` attribute.
508508

509509
```{python}
510-
multi_fit.all_fitted_models["Y~X1"].tidy()
510+
multi_fit.all_fitted_models["Y ~ X1"].tidy()
511511
```
512512

513513
or equivalently via the `fetch_model` method:

pixi.toml

Lines changed: 1 addition & 1 deletion
@@ -7,7 +7,7 @@ description = "Fast high dimensional fixed effect estimation following syntax of
77
channels = ["conda-forge"]
88
name = "pyfixest"
99
platforms = ["linux-64", "win-64", "osx-arm64", "osx-64"]
10-
version = "0.40.0"
10+
version = "0.50.0a1"
1111

1212
[tasks]
1313

pyfixest/did/did2s.py

Lines changed: 30 additions & 22 deletions
@@ -8,8 +8,8 @@
88
from pyfixest.did.did import DID
99
from pyfixest.estimation import feols
1010
from pyfixest.estimation.feols_ import Feols
11-
from pyfixest.estimation.FormulaParser import FixestFormulaParser
12-
from pyfixest.estimation.model_matrix_fixest_ import model_matrix_fixest
11+
from pyfixest.estimation.formula import model_matrix
12+
from pyfixest.estimation.formula.parse import Formula
1313

1414

1515
class DID2S(DID):
@@ -304,37 +304,48 @@ def _did2s_vcov(
304304

305305
# some formula parsing to get the correct formula for the first and second stage model matrix
306306
first_stage_x, first_stage_fe = first_stage.split("|")
307-
first_stage_fe_list = [f"C({i})" for i in first_stage_fe.split("+")]
307+
first_stage_fe_list = [f"C({i.strip()})" for i in first_stage_fe.split("+")]
308308
first_stage_fe_fml = "+".join(first_stage_fe_list)
309-
first_stage = f"{first_stage_x}+{first_stage_fe_fml}"
310-
311-
second_stage = f"{second_stage}"
309+
first_stage_fml = f"{first_stage_x}+{first_stage_fe_fml}"
312310

313311
# note for future Alex: the intercept needs to be dropped explicitly! It is not
314312
# dropped automatically because fixed effects are converted to dummies, hence has_fixed checks are False
315313

316-
FML1 = FixestFormulaParser(f"{yname} {first_stage}")
317-
FML2 = FixestFormulaParser(f"{yname} {second_stage}")
318-
FixestFormulaDict1 = FML1.FixestFormulaDict
319-
FixestFormulaDict2 = FML2.FixestFormulaDict
314+
# Create Formula objects for the new model_matrix system.
315+
# First stage: use `- 1` so that C() dummy encoding keeps all levels,
316+
# matching the feols demeaning approach (which implicitly includes all
317+
# fixed-effect levels). Removing `- 1` would cause formulaic to drop
318+
# reference levels, changing the GMM vcov standard errors.
319+
FML1 = Formula(
320+
_second_stage=f"{yname} ~ {first_stage_fml.replace('~', '').strip()} - 1",
321+
)
322+
# Second stage: do NOT use `- 1`. Formulaic needs the intercept present
323+
# for full-rank encoding (dropping a reference level for factors like
324+
# i(treat)). The intercept column is then removed by drop_intercept=True
325+
# below, matching what feols does in _did2s_estimate.
326+
FML2 = Formula(
327+
_second_stage=f"{yname} ~ {second_stage.replace('~', '').strip()}",
328+
)
320329

321-
mm_dict_first_stage = model_matrix_fixest(
322-
FixestFormula=next(iter(FixestFormulaDict1.values()))[0],
330+
mm_first_stage = model_matrix.create_model_matrix(
331+
formula=FML1,
323332
data=data,
324333
weights=None,
325334
drop_singletons=False,
326-
drop_intercept=False,
335+
ensure_full_rank=True,
336+
drop_intercept=True,
327337
)
328-
X1 = cast(pd.DataFrame, mm_dict_first_stage.get("X"))
338+
X1 = mm_first_stage.independent
329339

330-
mm_second_stage = model_matrix_fixest(
331-
FixestFormula=next(iter(FixestFormulaDict2.values()))[0],
340+
mm_second_stage = model_matrix.create_model_matrix(
341+
formula=FML2,
332342
data=data,
333343
weights=None,
334344
drop_singletons=False,
345+
ensure_full_rank=True,
335346
drop_intercept=True,
336-
) # reference values not dropped, multicollinearity error
337-
X2 = cast(pd.DataFrame, mm_second_stage.get("X"))
347+
)
348+
X2 = mm_second_stage.independent
338349

339350
X1 = csr_matrix(X1.to_numpy() * weights_array[:, None])
340351
X2 = csr_matrix(X2.to_numpy() * weights_array[:, None])
@@ -359,10 +370,7 @@ def _did2s_vcov(
359370
X10 = X10.tocsr()
360371
X2 = X2.tocsr() # type: ignore
361372

362-
for (
363-
_,
364-
g,
365-
) in enumerate(clustid):
373+
for _, g in enumerate(clustid):
366374
idx_g: np.ndarray = cluster_col.values == g
367375
X10g = X10[idx_g, :]
368376
X2g = X2[idx_g, :]
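
The string manipulation in `_did2s_vcov` above (split the first stage at `|`, wrap each fixed effect in `C()`, append `- 1`) can be traced in isolation with a short sketch. `build_first_stage_formula` is a hypothetical helper written for illustration, and the input `"~ 0 | unit + year"` is an assumed example; only plain string operations are involved.

```python
def build_first_stage_formula(yname: str, first_stage: str) -> str:
    """Rewrite 'covariates | fe1 + fe2' as a dummy-encoded formula string."""
    covariates, fixed_effects = first_stage.split("|")
    # strip() avoids terms like 'C( unit )' when the input has spaces around '+'.
    fe_terms = [f"C({fe.strip()})" for fe in fixed_effects.split("+")]
    rhs = f"{covariates}+{'+'.join(fe_terms)}"
    # Drop the leading '~' of the first stage and append '- 1' so that
    # the dummy encoding keeps all levels, as described in the diff comments.
    return f"{yname} ~ {rhs.replace('~', '').strip()} - 1"


formula = build_first_stage_formula("Y", "~ 0 | unit + year")
print(formula)
```

The `.strip()` call is the behavioral change in this hunk: without it, whitespace in the user-supplied first-stage formula ends up inside the `C()` call.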

pyfixest/did/saturated_twfe.py

Lines changed: 16 additions & 26 deletions
@@ -203,15 +203,14 @@ def aggregate(
203203
treated_periods = list(period_set)
204204

205205
df_agg = pd.DataFrame(
206-
index=treated_periods,
206+
index=pd.Index(treated_periods, name="period"),
207207
columns=["Estimate", "Std. Error", "t value", "Pr(>|t|)", "2.5%", "97.5%"],
208208
)
209-
df_agg.index.name = "period"
210209

211210
for period in treated_periods:
212211
R = np.zeros(len(coefs))
213212
for cohort in cohort_list:
214-
cohort_pattern = rf"\[{re.escape(str(period))}\]:.*{re.escape(cohort)}$"
213+
cohort_pattern = rf"^(?:.+)::{period}:(?:.+)::{cohort}$"
215214
match_idx = [
216215
i
217216
for i, name in enumerate(coefnames)
@@ -319,28 +318,20 @@ def _saturated_event_study(
319318
unit_id: str,
320319
cluster: Optional[str] = None,
321320
):
322-
cohort_dummies = pd.get_dummies(
323-
df.first_treated_period, drop_first=True, prefix="cohort_dummy"
321+
ff = f"{outcome} ~ i(rel_time, first_treated_period, ref = -1, ref2=0) | {unit_id} + {time_id}"
322+
m = feols(fml=ff, data=df, vcov={"CRV1": cluster}) # type: ignore
323+
res = m.tidy().reset_index()
324+
res = res.join(
325+
res["Coefficient"].str.extract(
326+
r".+::(?P<time>.+):.+::(?P<cohort>.+)", expand=True
327+
)
324328
)
325-
df_int = pd.concat([df, cohort_dummies], axis=1)
326-
327-
ff = f"""
328-
{outcome} ~
329-
{"+".join([f"i(rel_time, {x}, ref = -1.0)" for x in cohort_dummies.columns.tolist()])}
330-
| {unit_id} + {time_id}
331-
"""
332-
m = feols(fml=ff, data=df_int, vcov={"CRV1": cluster}) # type: ignore
333-
res = m.tidy()
329+
res["time"] = res["time"].astype(float)
334330
# create a dict with cohort specific effect curves
335331
res_cohort_eventtime_dict: dict[str, dict[str, pd.DataFrame | np.ndarray]] = {}
336-
for cohort in cohort_dummies.columns:
337-
res_cohort = res.filter(like=cohort, axis=0)
338-
event_time = (
339-
res_cohort.index.str.extract(r"\[(?:T\.)?(-?\d+(?:\.\d+)?)\]")
340-
.astype(float)
341-
.values.flatten()
342-
)
343-
res_cohort_eventtime_dict[cohort] = {"est": res_cohort, "time": event_time}
332+
for cohort, res_cohort in res.groupby("cohort"):
333+
event_time = res_cohort["time"].to_numpy()
334+
res_cohort_eventtime_dict[str(cohort)] = {"est": res_cohort, "time": event_time}
344335

345336
return m, res_cohort_eventtime_dict
346337

@@ -366,11 +357,10 @@ def _test_treatment_heterogeneity(
366357
"""
367358
mmres = model.tidy().reset_index()
368359
P = mmres.shape[0]
369-
mmres[["time", "cohort"]] = mmres.Coefficient.str.split(":", expand=True)
370-
mmres["time"] = mmres.time.str.extract(r"\[(?:T\.)?(-?\d+(?:\.\d+)?)\]").astype(
371-
float
360+
mmres[["time", "cohort"]] = mmres["Coefficient"].str.extract(
361+
r".+::(?P<time>.+):.+::(?P<cohort>.+)", expand=True
372362
)
373-
mmres["cohort"] = mmres.cohort.str.extract(r"(\d+)")
363+
mmres["time"] = mmres["time"].astype(float)
374364
# indices of coefficients that are deviations from common event study coefs
375365
event_study_coefs = mmres.loc[~(mmres.cohort.isna()) & (mmres.time > 0)].index
376366
# Method 2 (K x P) - more efficient
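
The extraction pattern used twice in this file can be checked on its own with Python's `re` module, standing in here for pandas' `str.extract`. The coefficient names below are assumed examples of the `variable::level:variable::level` convention, not output copied from PyFixest, and `split_coefname` is a hypothetical helper.

```python
import re

# Same pattern as in the diff: event time and cohort as named groups.
COEF_PATTERN = re.compile(r".+::(?P<time>.+):.+::(?P<cohort>.+)")


def split_coefname(name):
    """Return (event_time, cohort) for an interaction name, else None."""
    match = COEF_PATTERN.search(name)
    if match is None:
        # Names without two '::' separators (e.g. plain covariates) do not match.
        return None
    return float(match.group("time")), match.group("cohort")


parsed = split_coefname("rel_time::-2.0:first_treated_period::2005")
unmatched = split_coefname("X1")
print(parsed, unmatched)
```

Non-matching names return `None` here; in the pandas version they would produce missing values, which is what the `mmres.cohort.isna()` filter in the diff relies on.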
