Skip to content

ENH: Implement 2+ fixed-effect with multi-way clustering OLS #551

@EBoisseauSierra

Description

@EBoisseauSierra

Context

I would effectively like to replicate Stata's reghdfe command, which is based on Correia (2016) estimator.

This estimator allows (fast) converging with greater-than-2 levels of fixed-effect, as well as multi-way clustering.

Expected solution

I am unsure how this would look like — either a new IV estimation module, or somehow piggybacking on AbsorbingLS…

Alternatives

I can achieve most of it via a naive statsmodels implementation, as shown in the dummy example below:

  • Stata 17
webuse nlswork
reghdfe ln_wage hours tenure race, absorb(age union) vce(cluster ind_code)
stdout
(dropped 1 singleton observations)
(MWFE estimator converged in 3 iterations)

HDFE Linear regression                            Number of obs   =     18,890
Absorbing 2 HDFE groups                           F(   3,     11) =      46.81
Statistics robust to heteroskedasticity           Prob > F        =     0.0000
                                                  R-squared       =     0.1766
                                                  Adj R-squared   =     0.1752
                                                  Within R-sq.    =     0.1083
Number of clusters (ind_code) =         12        Root MSE        =     0.4240

                              (Std. err. adjusted for 12 clusters in ind_code)
------------------------------------------------------------------------------
             |               Robust
     ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
       hours |   .0028301   .0015474     1.83   0.095    -.0005757    .0062359
      tenure |   .0327574   .0034951     9.37   0.000     .0250647      .04045
        race |  -.1454669   .0250923    -5.80   0.000    -.2006945   -.0902392
       _cons |   1.712627   .1019591    16.80   0.000     1.488217    1.937038
------------------------------------------------------------------------------

Absorbed degrees of freedom:
-----------------------------------------------------+
 Absorbed FE | Categories  - Redundant  = Num. Coefs |
-------------+---------------------------------------|
         age |        30           0          30     |
       union |         2           1           1     |
-----------------------------------------------------+
  • Python (3.11.4, numpy 1.25.2, statsmodels 0.14.0):
import numpy as np
import statsmodels.datasets as sd
import statsmodels.formula.api as sm

df = sd.webuse("nlswork")
dataset = df.dropna(
    subset=[
        "ln_wage",
        "hours",
        "tenure",
        "race",
        "age",
        "union",
        "ind_code",
    ],
    how="any",
)

model = sm.ols(
    formula="ln_wage ~ hours + tenure + race + C(age) + C(union)",
    data=dataset,
)
results = model.fit(
    cov_type="cluster",
    cov_kwds={"groups": np.array(dataset["ind_code"])},
    use_t=True,
)
results.summary().as_text()
stdout
OLS Regression Results                            
==============================================================================
Dep. Variable:                ln_wage   R-squared:                       0.177
Model:                            OLS   Adj. R-squared:                  0.175
Method:                 Least Squares   F-statistic:                 2.026e+11
Date:                Fri, 08 Sep 2023   Prob (F-statistic):           5.39e-61
Time:                        11:06:12   Log-Likelihood:                -10579.
No. Observations:               18891   AIC:                         2.123e+04
Df Residuals:                   18856   BIC:                         2.150e+04
Df Model:                          34                                         
Covariance Type:              cluster                                         
===================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept           1.1509      0.035     32.655      0.000       1.073       1.228
C(age)[T.17.0]      0.2170      0.168      1.295      0.222      -0.152       0.586
C(age)[T.18.0]      0.2016      0.087      2.312      0.041       0.010       0.393
C(age)[T.19.0]      0.2886      0.081      3.581      0.004       0.111       0.466
C(age)[T.20.0]      0.3644      0.073      4.965      0.000       0.203       0.526
C(age)[T.21.0]      0.4269      0.080      5.361      0.000       0.252       0.602
C(age)[T.22.0]      0.4735      0.089      5.312      0.000       0.277       0.670
C(age)[T.23.0]      0.4947      0.087      5.685      0.000       0.303       0.686
C(age)[T.24.0]      0.4986      0.087      5.709      0.000       0.306       0.691
C(age)[T.25.0]      0.4958      0.088      5.662      0.000       0.303       0.688
C(age)[T.26.0]      0.5486      0.090      6.092      0.000       0.350       0.747
C(age)[T.27.0]      0.5226      0.076      6.872      0.000       0.355       0.690
C(age)[T.28.0]      0.5165      0.080      6.463      0.000       0.341       0.692
C(age)[T.29.0]      0.5147      0.087      5.946      0.000       0.324       0.705
C(age)[T.30.0]      0.5126      0.084      6.131      0.000       0.329       0.697
C(age)[T.31.0]      0.5692      0.075      7.635      0.000       0.405       0.733
C(age)[T.32.0]      0.5452      0.083      6.535      0.000       0.362       0.729
C(age)[T.33.0]      0.5457      0.075      7.245      0.000       0.380       0.712
C(age)[T.34.0]      0.5255      0.073      7.227      0.000       0.365       0.685
C(age)[T.35.0]      0.5492      0.074      7.463      0.000       0.387       0.711
C(age)[T.36.0]      0.5488      0.081      6.775      0.000       0.371       0.727
C(age)[T.37.0]      0.5506      0.077      7.184      0.000       0.382       0.719
C(age)[T.38.0]      0.5283      0.079      6.705      0.000       0.355       0.702
C(age)[T.39.0]      0.5477      0.083      6.625      0.000       0.366       0.730
C(age)[T.40.0]      0.5431      0.088      6.147      0.000       0.349       0.738
C(age)[T.41.0]      0.5340      0.063      8.520      0.000       0.396       0.672
C(age)[T.42.0]      0.5426      0.094      5.800      0.000       0.337       0.748
C(age)[T.43.0]      0.4841      0.092      5.277      0.000       0.282       0.686
C(age)[T.44.0]      0.4711      0.090      5.243      0.000       0.273       0.669
C(age)[T.45.0]      0.6426      0.098      6.553      0.000       0.427       0.858
C(age)[T.46.0]      0.9095      0.174      5.237      0.000       0.527       1.292
C(union)[T.1.0]     0.1809      0.038      4.810      0.001       0.098       0.264
hours               0.0028      0.002      1.829      0.095      -0.001       0.006
tenure              0.0328      0.003      9.372      0.000       0.025       0.040
race               -0.1455      0.025     -5.797      0.000      -0.201      -0.090
==============================================================================
Omnibus:                     1064.590   Durbin-Watson:                   0.868
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             2692.841
Skew:                           0.329   Prob(JB):                         0.00
Kurtosis:                       4.729   Cond. No.                     2.93e+04
==============================================================================
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
[2] The condition number is large, 2.93e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

However, this approach falls short for 2 reasons:

  1. It is much slower with large datasets than reghdfe (cf. below),
  2. It fails to converge in Python when (one of) the fixed-effect dimension have a too-high cardinality.

(Note: One may argue that the latter point hints at a faulty analysis design… but I'd prefer to not discuss the “whether it is correct to do so”.)

A model showing contrast in computation time is ln_wage ~ hours + tenure + race + C(age) + C(idcode), clustering SE by ind_code:

  • Stata 17: 0.35 seconds
  • Python 3.11: 115.79 s
Reproducible code
  • Python
import numpy as np
import statsmodels.datasets as sd
import statsmodels.formula.api as sm
import time

df = sd.webuse("nlswork")
dataset = df.dropna(
    subset=[
        "ln_wage",
        "hours",
        "tenure",
        "race",
        "age",
        "idcode",
        "ind_code",
    ],
    how="any",
)

start_time = time.perf_counter()
model = sm.ols(
    formula="ln_wage ~ hours + tenure + race + C(age) + C(idcode)",
    data=dataset,
)
results = model.fit(
    cov_type="cluster",
    cov_kwds={"groups": np.array(dataset["ind_code"])},
    use_t=True,
)
end_time = time.perf_counter()

print(f"Model fitting duration: {end_time - start_time:.2f} seconds.")
results.summary().as_text()
  • Stata
webuse nlswork
set rmsg on
reghdfe ln_wage hours tenure race, absorb(age idcode) vce(cluster ind_code)

A model showing Python unable to converge is ln_wage ~ hours + tenure + race + C(age) + C(idcode) + C(grade):

  • Stata: 0.51s
  • Python: stopped after 20 mins.

Other

I'd be happy to raise or contribute to a PR for it, although I might fall short on the theory of the stats.

Linked to #259

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions