py-econometrics
diff --git a/.github/workflows/build-and-relase.yaml b/.github/workflows/build-and-relase.yaml
Lines changed: 6 additions & 6 deletions
diff --git a/.gitignore b/.gitignore
Lines changed: 2 additions & 0 deletions
diff --git a/docs/_quarto.yml b/docs/_quarto.yml
Lines changed: 7 additions & 0 deletions
diff --git a/docs/_sidebar.yml b/docs/_sidebar.yml
Lines changed: 6 additions & 0 deletions
diff --git a/docs/changelog.qmd b/docs/changelog.qmd
Lines changed: 99 additions & 1 deletion
diff --git a/docs/llms.txt b/docs/llms.txt
Lines changed: 3 additions & 2 deletions
diff --git a/docs/quickstart.qmd b/docs/quickstart.qmd
Lines changed: 1 addition & 1 deletion
diff --git a/pixi.toml b/pixi.toml
Lines changed: 1 addition & 1 deletion
diff --git a/pyfixest/did/did2s.py b/pyfixest/did/did2s.py
Lines changed: 30 additions & 22 deletions
diff --git a/pyfixest/did/saturated_twfe.py b/pyfixest/did/saturated_twfe.py
Lines changed: 16 additions & 26 deletions
@@ -1,4 +1,4 @@
-# This file is autogenerated by maturin v1.8.6
+# This file is autogenerated by maturin v1.9.6
 # To update, run
 #
 #    maturin generate-ci github --platform all
@@ -38,7 +38,7 @@ jobs:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.13'
+          python-version: 3.x
       - name: Build wheels
         uses: PyO3/maturin-action@v1
         with:
@@ -69,7 +69,7 @@ jobs:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.13'
+          python-version: 3.x
       - name: Build wheels
         uses: PyO3/maturin-action@v1
         with:
@@ -96,7 +96,7 @@ jobs:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.13'
+          python-version: 3.x
           architecture: ${{ matrix.platform.target }}
       - name: Build wheels
         uses: PyO3/maturin-action@v1
@@ -115,15 +115,15 @@ jobs:
     strategy:
       matrix:
         platform:
-          - runner: macos-13
+          - runner: macos-14
             target: x86_64
           - runner: macos-14
             target: aarch64
     steps:
       - uses: actions/checkout@v4
       - uses: actions/setup-python@v5
         with:
-          python-version: '3.13'
+          python-version: 3.x
       - name: Build wheels
         uses: PyO3/maturin-action@v1
         with:
 
@@ -42,3 +42,5 @@ coverage.xml
 # pixi environments
 .pixi/*
 !.pixi/config.toml
+SKILL.md
+CLAUDE.md
@@ -121,6 +121,13 @@ quartodoc:
         - report.coefplot
         - report.iplot
         - did.visualize.panelview
+    - title: Formula Parsing & Model Matrix
+      desc: |
+        Internal APIs for formula parsing and model matrix construction
+      contents:
+        - estimation.formula.parse.Formula
+        - estimation.formula.model_matrix.ModelMatrix
+        - estimation.formula.factor_interaction.factor_interaction
     - title: Misc / Utilities
       desc: |
         PyFixest internals and utilities
 
@@ -34,6 +34,12 @@ website:
       - reference/report.iplot.qmd
       - reference/did.visualize.panelview.qmd
       section: Summarize and Visualize
+    - contents:
+      - reference/estimation.formula.parse.Formula.qmd
+      - reference/estimation.formula.parse.parse.qmd
+      - reference/estimation.formula.model_matrix.ModelMatrix.qmd
+      - reference/estimation.formula.factor_interaction.factor_interaction.qmd
+      section: Formula Parsing & Model Matrix
     - contents:
       - reference/estimation.demean.qmd
       - reference/estimation.detect_singletons.qmd
 
@@ -15,7 +15,105 @@ fit2 = pf.feols("Y ~ X1 + X2", data = df)
 fit3 = pf.feols("Y ~ X1 + X2 | f1", data = df)
 ```
 
-## PyFixest 0.41.0 (In Development)
+## PyFixest 0.50.0 (In Development)
+
+::: {.callout-tip}
+You can install the latest pre-release to try out the new features:
+
+```{.bash .code-copy}
+pip install pyfixest==0.50.0a1
+pip install --pre pyfixest
+```
+:::
+
+### Reworked Formula Parsing and New `i()` Operator
+
+The formula parsing module has been significantly reworked. The new implementation introduces a cleaner `Formula` class,
+a rewritten `i()` operator that closely follows R's `fixest` syntax, and new multiple estimation operators.
+
+#### New `i()` operator
+
+The `i()` operator now follows the `fixest` naming convention, using `::` to separate variable names from levels. It also
+provides two new arguments, `ref2` and `bin2`: `ref2` sets the reference level of the
+second (interacted) variable, and `bin2` bins the levels of that second variable.
+
+**Simple categorical encoding:**
+
+```{python}
+# Coefficient per level, first level dropped when intercept is present
+fit = pf.feols("Y ~ i(f1, ref=1)", data=df)
+fit.coef().head()
+```
+
+**Factor x Continuous interaction:**
+
+```{python}
+# Each level of f1 gets its own slope on X1
+fit = pf.feols("Y ~ i(f1, X1, ref=1)", data=df)
+fit.coef().head()
+```
+
+**Factor x Factor interaction** with `ref2`:
+
+The `ref2` argument controls the reference level of the second variable in a factor-by-factor interaction.
+
+```{python}
+import numpy as np
+df["group"] = np.where(df["f1"] < 15, "A", "B")
+
+# Full interaction: f1 levels x group levels
+# ref sets the reference level dropped from f1, ref2 the one dropped from group
+fit = pf.feols("Y ~ i(f1, group, ref=1, ref2='A')", data=df)
+fit.coef().head()
+```
+
+**Binning** with `bin` and `bin2`:
+
+The `bin` parameter merges categorical levels before encoding. This is useful for collapsing sparse categories.
+Values not in the mapping are kept unchanged, matching R `fixest` behavior.
+
+```{python}
+df["size"] = np.where(df["f1"] < 10, "small", np.where(df["f1"] < 20, "medium", "large"))
+
+# Merge 'small' and 'medium' into 'not_large', then use as reference
+fit = pf.feols("Y ~ i(size, bin={'not_large': ['small','medium']}, ref='not_large')", data=df)
+fit.coef()
+```
+
+`bin2` applies binning to the second variable in a factor-by-factor interaction.
+
+#### `mvsw()` and multiple estimation
+
+
+`mvsw()` (**multiverse stepwise**) generates all $2^k$ combinations of the $k$ provided variables, including the intercept-only model.
+
+```{python}
+# mvsw: all combinations of X1 and X2
+fits = pf.feols("Y ~ mvsw(X1, X2)", data=df)
+pf.etable(fits)
+```
+
+Multiple estimation operators can be combined:
+
+```{python}
+# combine sw() and csw() in a single formula
+fits = pf.feols("Y ~ sw(X1, X2) + csw(f1,f2)", data=df)
+pf.etable(fits)
+```
+
+Finally, the operators accept compound terms such as `f1 + X2` as arguments:
+
+```{python}
+# sw() arguments can be compound terms such as f1 + X2
+fits = pf.feols("Y ~ sw(X1, f1 + X2)", data=df)
+pf.etable(fits)
+```
+
+#### Deprecations
+
+- `FixestFormulaParser` is deprecated in favor of `Formula.parse()`. A `FutureWarning` is emitted when the old class is used.
+- `model_matrix_fixest()` is deprecated in favor of `create_model_matrix()`.
+
 
 ### Migration to maketables
 
 
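Note on the `mvsw()` entry in the changelog hunk above: the $2^k$ count can be checked with a small standalone sketch. This is not pyfixest's implementation; the helper name `mvsw_formulas`, the `1` used for the intercept-only right-hand side, and the ordering of the results are assumptions made for illustration.

```python
from itertools import combinations


def mvsw_formulas(depvar: str, variables: list[str]) -> list[str]:
    """Enumerate one formula per subset of `variables` (hypothetical helper)."""
    formulas = []
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            # The empty subset stands for the intercept-only model.
            rhs = " + ".join(subset) if subset else "1"
            formulas.append(f"{depvar} ~ {rhs}")
    return formulas


formulas = mvsw_formulas("Y", ["X1", "X2"])
for fml in formulas:
    print(fml)
```

With two variables this yields four formulas, and with three it yields eight, matching the $2^k$ count stated in the changelog.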
@@ -71,7 +71,7 @@ Pass to the `vcov` argument:
 
 ## Model Methods
 
-After estimation, model objects support:
+After estimation, model objects support (full API: https://py-econometrics.github.io/pyfixest/reference/estimation.feols_.Feols.md):
 
 - `.summary()` — Print regression summary
 - `.tidy()` — Tidy DataFrame of coefficients, SEs, t-stats, p-values, CIs
@@ -84,7 +84,8 @@ After estimation, model objects support:
 - `.vcov()` — Variance-covariance matrix
 - `.wildboottest(param, reps, seed)` — Wild cluster bootstrap inference
 - `.ccv(treatment, pk, qk, ...)` — Causal cluster variance estimator
-- `.rio(param, reps, ...)` — Randomization inference
+- `.ritest(param, reps, ...)` — Randomization inference
+- `.decompose(param, x1_vars, type, ...)` — Gelbach (2016) decomposition for mediation analysis (explains coefficient change from short to long model)
 
 ## Quick Example
 
 
@@ -507,7 +507,7 @@ multi_fit.etable()
 You can access an individual model by its name - i.e. a formula - via the `all_fitted_models` attribute.
 
 ```{python}
-multi_fit.all_fitted_models["Y~X1"].tidy()
+multi_fit.all_fitted_models["Y ~ X1"].tidy()
 ```
 
 or equivalently via the `fetch_model` method:
 
@@ -7,7 +7,7 @@ description = "Fast high dimensional fixed effect estimation following syntax of
 channels = ["conda-forge"]
 name = "pyfixest"
 platforms = ["linux-64", "win-64", "osx-arm64", "osx-64"]
-version = "0.40.0"
+version = "0.50.0a1"
 
 [tasks]
 
 
@@ -8,8 +8,8 @@
 from pyfixest.did.did import DID
 from pyfixest.estimation import feols
 from pyfixest.estimation.feols_ import Feols
-from pyfixest.estimation.FormulaParser import FixestFormulaParser
-from pyfixest.estimation.model_matrix_fixest_ import model_matrix_fixest
+from pyfixest.estimation.formula import model_matrix
+from pyfixest.estimation.formula.parse import Formula
 
 
 class DID2S(DID):
@@ -304,37 +304,48 @@ def _did2s_vcov(
 
     # some formula parsing to get the correct formula for the first and second stage model matrix
     first_stage_x, first_stage_fe = first_stage.split("|")
-    first_stage_fe_list = [f"C({i})" for i in first_stage_fe.split("+")]
+    first_stage_fe_list = [f"C({i.strip()})" for i in first_stage_fe.split("+")]
     first_stage_fe_fml = "+".join(first_stage_fe_list)
-    first_stage = f"{first_stage_x}+{first_stage_fe_fml}"
-
-    second_stage = f"{second_stage}"
+    first_stage_fml = f"{first_stage_x}+{first_stage_fe_fml}"
 
     # note for future Alex: intercept needs to be dropped! it is not as fixed
     # effects are converted to dummies, hence has_fixed checks are False
 
-    FML1 = FixestFormulaParser(f"{yname} {first_stage}")
-    FML2 = FixestFormulaParser(f"{yname} {second_stage}")
-    FixestFormulaDict1 = FML1.FixestFormulaDict
-    FixestFormulaDict2 = FML2.FixestFormulaDict
+    # Create Formula objects for the new model_matrix system.
+    # First stage: use `- 1` so that C() dummy encoding keeps all levels,
+    # matching the feols demeaning approach (which implicitly includes all
+    # fixed-effect levels). Removing `- 1` would cause formulaic to drop
+    # reference levels, changing the GMM vcov standard errors.
+    FML1 = Formula(
+        _second_stage=f"{yname} ~ {first_stage_fml.replace('~', '').strip()} - 1",
+    )
+    # Second stage: do NOT use `- 1`. Formulaic needs the intercept present
+    # for full-rank encoding (dropping a reference level for factors like
+    # i(treat)). The intercept column is then removed by drop_intercept=True
+    # below, matching what feols does in _did2s_estimate.
+    FML2 = Formula(
+        _second_stage=f"{yname} ~ {second_stage.replace('~', '').strip()}",
+    )
 
-    mm_dict_first_stage = model_matrix_fixest(
-        FixestFormula=next(iter(FixestFormulaDict1.values()))[0],
+    mm_first_stage = model_matrix.create_model_matrix(
+        formula=FML1,
         data=data,
         weights=None,
         drop_singletons=False,
-        drop_intercept=False,
+        ensure_full_rank=True,
+        drop_intercept=True,
     )
-    X1 = cast(pd.DataFrame, mm_dict_first_stage.get("X"))
+    X1 = mm_first_stage.independent
 
-    mm_second_stage = model_matrix_fixest(
-        FixestFormula=next(iter(FixestFormulaDict2.values()))[0],
+    mm_second_stage = model_matrix.create_model_matrix(
+        formula=FML2,
         data=data,
         weights=None,
         drop_singletons=False,
+        ensure_full_rank=True,
         drop_intercept=True,
-    )  # reference values not dropped, multicollinearity error
-    X2 = cast(pd.DataFrame, mm_second_stage.get("X"))
+    )
+    X2 = mm_second_stage.independent
 
     X1 = csr_matrix(X1.to_numpy() * weights_array[:, None])
     X2 = csr_matrix(X2.to_numpy() * weights_array[:, None])
@@ -359,10 +370,7 @@ def _did2s_vcov(
     X10 = X10.tocsr()
     X2 = X2.tocsr()  # type: ignore
 
-    for (
-        _,
-        g,
-    ) in enumerate(clustid):
+    for _, g in enumerate(clustid):
         idx_g: np.ndarray = cluster_col.values == g
         X10g = X10[idx_g, :]
         X2g = X2[idx_g, :]
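Note on the first-stage rewrite in `_did2s_vcov` above: the string handling (split on `|`, wrap each fixed effect in `C()`, append `- 1`) can be sketched in isolation. The helper name `fixed_effects_to_dummies` and the input `"~ 0 | unit + year"` are illustrative assumptions; the sketch only mirrors the string operations visible in the diff and does not call pyfixest.

```python
def fixed_effects_to_dummies(first_stage: str) -> str:
    """Rewrite '<covariates> | fe1 + fe2' as '<covariates>+C(fe1)+C(fe2)'."""
    covariates, fixed_effects = first_stage.split("|")
    # .strip() avoids terms like 'C( unit )' when the input has spaces around '+'.
    dummies = [f"C({fe.strip()})" for fe in fixed_effects.split("+")]
    return f"{covariates}+{'+'.join(dummies)}"


first_stage = "~ 0 | unit + year"  # illustrative input
rewritten = fixed_effects_to_dummies(first_stage)

# Same post-processing as the FML1 construction in the diff:
# drop the leading '~', trim whitespace, and append '- 1'.
yname = "Y"
full_formula = f"{yname} ~ {rewritten.replace('~', '').strip()} - 1"
print(rewritten)
print(full_formula)
```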
 
@@ -203,15 +203,14 @@ def aggregate(
         treated_periods = list(period_set)
 
         df_agg = pd.DataFrame(
-            index=treated_periods,
+            index=pd.Index(treated_periods, name="period"),
             columns=["Estimate", "Std. Error", "t value", "Pr(>|t|)", "2.5%", "97.5%"],
         )
-        df_agg.index.name = "period"
 
         for period in treated_periods:
             R = np.zeros(len(coefs))
             for cohort in cohort_list:
-                cohort_pattern = rf"\[{re.escape(str(period))}\]:.*{re.escape(cohort)}$"
+                cohort_pattern = rf"^(?:.+)::{period}:(?:.+)::{cohort}$"
                 match_idx = [
                     i
                     for i, name in enumerate(coefnames)
@@ -319,28 +318,20 @@ def _saturated_event_study(
     unit_id: str,
     cluster: Optional[str] = None,
 ):
-    cohort_dummies = pd.get_dummies(
-        df.first_treated_period, drop_first=True, prefix="cohort_dummy"
+    ff = f"{outcome} ~ i(rel_time, first_treated_period, ref = -1, ref2=0) | {unit_id} + {time_id}"
+    m = feols(fml=ff, data=df, vcov={"CRV1": cluster})  # type: ignore
+    res = m.tidy().reset_index()
+    res = res.join(
+        res["Coefficient"].str.extract(
+            r".+::(?P<time>.+):.+::(?P<cohort>.+)", expand=True
+        )
     )
-    df_int = pd.concat([df, cohort_dummies], axis=1)
-
-    ff = f"""
-                {outcome} ~
-                {"+".join([f"i(rel_time, {x}, ref = -1.0)" for x in cohort_dummies.columns.tolist()])}
-                | {unit_id} + {time_id}
-                """
-    m = feols(fml=ff, data=df_int, vcov={"CRV1": cluster})  # type: ignore
-    res = m.tidy()
+    res["time"] = res["time"].astype(float)
     # create a dict with cohort specific effect curves
     res_cohort_eventtime_dict: dict[str, dict[str, pd.DataFrame | np.ndarray]] = {}
-    for cohort in cohort_dummies.columns:
-        res_cohort = res.filter(like=cohort, axis=0)
-        event_time = (
-            res_cohort.index.str.extract(r"\[(?:T\.)?(-?\d+(?:\.\d+)?)\]")
-            .astype(float)
-            .values.flatten()
-        )
-        res_cohort_eventtime_dict[cohort] = {"est": res_cohort, "time": event_time}
+    for cohort, res_cohort in res.groupby("cohort"):
+        event_time = res_cohort["time"].to_numpy()
+        res_cohort_eventtime_dict[str(cohort)] = {"est": res_cohort, "time": event_time}
 
     return m, res_cohort_eventtime_dict
 
@@ -366,11 +357,10 @@ def _test_treatment_heterogeneity(
     """
     mmres = model.tidy().reset_index()
     P = mmres.shape[0]
-    mmres[["time", "cohort"]] = mmres.Coefficient.str.split(":", expand=True)
-    mmres["time"] = mmres.time.str.extract(r"\[(?:T\.)?(-?\d+(?:\.\d+)?)\]").astype(
-        float
+    mmres[["time", "cohort"]] = mmres["Coefficient"].str.extract(
+        r".+::(?P<time>.+):.+::(?P<cohort>.+)", expand=True
     )
-    mmres["cohort"] = mmres.cohort.str.extract(r"(\d+)")
+    mmres["time"] = mmres["time"].astype(float)
     # indices of coefficients that are deviations from common event study coefs
     event_study_coefs = mmres.loc[~(mmres.cohort.isna()) & (mmres.time > 0)].index
     # Method 2 (K x P) - more efficient
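Note on the new coefficient-name parsing: the pattern `.+::(?P<time>.+):.+::(?P<cohort>.+)` used in `_saturated_event_study` and `_test_treatment_heterogeneity` can be exercised with the standard library alone. The coefficient names below are hypothetical examples of the `::` naming convention described in the changelog, not output copied from pyfixest, and `parse_coef_name` is a helper introduced only for this sketch.

```python
import re

# Pattern copied from the diff; pandas' str.extract applies it with search semantics.
COEF_PATTERN = re.compile(r".+::(?P<time>.+):.+::(?P<cohort>.+)")

# Hypothetical coefficient names following the `variable::level` convention.
coef_names = [
    "rel_time::-2:first_treated_period::2005",
    "rel_time::0:first_treated_period::2005",
    "rel_time::1.0:first_treated_period::2010",
]


def parse_coef_name(name: str):
    """Return (event_time, cohort), or None if the name does not match."""
    match = COEF_PATTERN.search(name)
    if match is None:
        return None
    return float(match.group("time")), match.group("cohort")


parsed = [parse_coef_name(name) for name in coef_names]
for name, result in zip(coef_names, parsed):
    print(name, "->", result)
```

Names without the double `::` structure, such as a plain covariate, do not match. In the pandas version they produce missing values in the extracted columns, which the code above filters with `~(mmres.cohort.isna())`.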