Description
Context
I would effectively like to replicate Stata's reghdfe command, which is based on the Correia (2016) estimator.
This estimator allows fast convergence with more than two levels of fixed effects, as well as multi-way clustering.
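As I understand it, the core of that estimator is to absorb the fixed effects by alternating projections: iteratively demeaning the dependent variable and regressors within each fixed-effect group until the transformation converges, then running OLS on the residualized data, rather than materializing dummy variables. A rough hand-rolled sketch of the idea on the nlswork example used below (the helper name, tolerance and stopping rule are mine, not reghdfe's actual implementation, and the degrees-of-freedom correction for the absorbed effects is ignored here):
import numpy as np
import statsmodels.api as sm
import statsmodels.datasets as sd

def demean_by_groups(df, cols, fe_cols, tol=1e-8, max_iter=100):
    # Alternating projections: repeatedly sweep out group means for each
    # fixed-effect dimension until nothing changes any more.
    work = df[cols].astype(float).copy()
    for _ in range(max_iter):
        previous = work.to_numpy().copy()
        for fe in fe_cols:
            work = work - work.groupby(df[fe]).transform("mean")
        if np.max(np.abs(work.to_numpy() - previous)) < tol:
            break
    return work

# Absorb age and union, then run OLS on the demeaned data with
# cluster-robust standard errors by ind_code (no absorbed-DoF correction).
data = sd.webuse("nlswork").dropna(
    subset=["ln_wage", "hours", "tenure", "race", "age", "union", "ind_code"]
)
within = demean_by_groups(data, ["ln_wage", "hours", "tenure", "race"], ["age", "union"])
res = sm.OLS(
    within["ln_wage"], sm.add_constant(within[["hours", "tenure", "race"]])
).fit(cov_type="cluster", cov_kwds={"groups": np.array(data["ind_code"])}, use_t=True)
print(res.summary())
This is essentially what the dummy-variable statsmodels fit further below computes, but without ever building the C(age)/C(union) design matrix, which is what lets reghdfe scale to high-cardinality fixed effects.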
Expected solution
I am unsure what this would look like — either a new IV estimation module, or somehow piggybacking on AbsorbingLS…
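For what it's worth, the single-cluster version of the example below already seems expressible with AbsorbingLS. A minimal sketch of what I have in mind, assuming I am reading the absorb= argument and the clustered-covariance options (cov_type="clustered", clusters=) correctly, and leaving aside how far multi-way clustering currently goes:
import pandas as pd
import statsmodels.datasets as sd
from linearmodels.iv.absorbing import AbsorbingLS

data = sd.webuse("nlswork").dropna(
    subset=["ln_wage", "hours", "tenure", "race", "age", "union", "ind_code"]
)
# Absorbed effects are passed as categorical columns instead of explicit dummies
absorb = pd.DataFrame(
    {
        "age": data["age"].astype("category"),
        "union": data["union"].astype("category"),
    }
)
mod = AbsorbingLS(data["ln_wage"], data[["hours", "tenure", "race"]], absorb=absorb)
# One-way clustering by industry; the feature request is essentially
# reghdfe-style multi-way clustering (and absorbed-DoF reporting) on top of this
res = mod.fit(cov_type="clustered", clusters=data[["ind_code"]])
print(res.summary)
If that reading is right, the gap is mostly the multi-way clustered covariance and the reghdfe-style reporting, rather than the absorption step itself.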
Alternatives
I can achieve most of it via a naive statsmodels implementation, as shown in the dummy example below:
- Stata 17
webuse nlswork
reghdfe ln_wage hours tenure race, absorb(age union) vce(cluster ind_code)
stdout:
(dropped 1 singleton observations)
(MWFE estimator converged in 3 iterations)
HDFE Linear regression Number of obs = 18,890
Absorbing 2 HDFE groups F( 3, 11) = 46.81
Statistics robust to heteroskedasticity Prob > F = 0.0000
R-squared = 0.1766
Adj R-squared = 0.1752
Within R-sq. = 0.1083
Number of clusters (ind_code) = 12 Root MSE = 0.4240
(Std. err. adjusted for 12 clusters in ind_code)
------------------------------------------------------------------------------
| Robust
ln_wage | Coefficient std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
hours | .0028301 .0015474 1.83 0.095 -.0005757 .0062359
tenure | .0327574 .0034951 9.37 0.000 .0250647 .04045
race | -.1454669 .0250923 -5.80 0.000 -.2006945 -.0902392
_cons | 1.712627 .1019591 16.80 0.000 1.488217 1.937038
------------------------------------------------------------------------------
Absorbed degrees of freedom:
-----------------------------------------------------+
Absorbed FE | Categories - Redundant = Num. Coefs |
-------------+---------------------------------------|
age | 30 0 30 |
union | 2 1 1 |
-----------------------------------------------------+
- Python (3.11.4, numpy 1.25.2, statsmodels 0.14.0):
import numpy as np
import statsmodels.datasets as sd
import statsmodels.formula.api as sm

# Load the nlswork dataset and drop rows with missing model variables
df = sd.webuse("nlswork")
dataset = df.dropna(
    subset=[
        "ln_wage",
        "hours",
        "tenure",
        "race",
        "age",
        "union",
        "ind_code",
    ],
    how="any",
)
# Fixed effects entered as explicit dummy variables via C()
model = sm.ols(
    formula="ln_wage ~ hours + tenure + race + C(age) + C(union)",
    data=dataset,
)
# Cluster-robust standard errors by industry code
results = model.fit(
    cov_type="cluster",
    cov_kwds={"groups": np.array(dataset["ind_code"])},
    use_t=True,
)
results.summary().as_text()
stdout:
OLS Regression Results
==============================================================================
Dep. Variable: ln_wage R-squared: 0.177
Model: OLS Adj. R-squared: 0.175
Method: Least Squares F-statistic: 2.026e+11
Date: Fri, 08 Sep 2023 Prob (F-statistic): 5.39e-61
Time: 11:06:12 Log-Likelihood: -10579.
No. Observations: 18891 AIC: 2.123e+04
Df Residuals: 18856 BIC: 2.150e+04
Df Model: 34
Covariance Type: cluster
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 1.1509 0.035 32.655 0.000 1.073 1.228
C(age)[T.17.0] 0.2170 0.168 1.295 0.222 -0.152 0.586
C(age)[T.18.0] 0.2016 0.087 2.312 0.041 0.010 0.393
C(age)[T.19.0] 0.2886 0.081 3.581 0.004 0.111 0.466
C(age)[T.20.0] 0.3644 0.073 4.965 0.000 0.203 0.526
C(age)[T.21.0] 0.4269 0.080 5.361 0.000 0.252 0.602
C(age)[T.22.0] 0.4735 0.089 5.312 0.000 0.277 0.670
C(age)[T.23.0] 0.4947 0.087 5.685 0.000 0.303 0.686
C(age)[T.24.0] 0.4986 0.087 5.709 0.000 0.306 0.691
C(age)[T.25.0] 0.4958 0.088 5.662 0.000 0.303 0.688
C(age)[T.26.0] 0.5486 0.090 6.092 0.000 0.350 0.747
C(age)[T.27.0] 0.5226 0.076 6.872 0.000 0.355 0.690
C(age)[T.28.0] 0.5165 0.080 6.463 0.000 0.341 0.692
C(age)[T.29.0] 0.5147 0.087 5.946 0.000 0.324 0.705
C(age)[T.30.0] 0.5126 0.084 6.131 0.000 0.329 0.697
C(age)[T.31.0] 0.5692 0.075 7.635 0.000 0.405 0.733
C(age)[T.32.0] 0.5452 0.083 6.535 0.000 0.362 0.729
C(age)[T.33.0] 0.5457 0.075 7.245 0.000 0.380 0.712
C(age)[T.34.0] 0.5255 0.073 7.227 0.000 0.365 0.685
C(age)[T.35.0] 0.5492 0.074 7.463 0.000 0.387 0.711
C(age)[T.36.0] 0.5488 0.081 6.775 0.000 0.371 0.727
C(age)[T.37.0] 0.5506 0.077 7.184 0.000 0.382 0.719
C(age)[T.38.0] 0.5283 0.079 6.705 0.000 0.355 0.702
C(age)[T.39.0] 0.5477 0.083 6.625 0.000 0.366 0.730
C(age)[T.40.0] 0.5431 0.088 6.147 0.000 0.349 0.738
C(age)[T.41.0] 0.5340 0.063 8.520 0.000 0.396 0.672
C(age)[T.42.0] 0.5426 0.094 5.800 0.000 0.337 0.748
C(age)[T.43.0] 0.4841 0.092 5.277 0.000 0.282 0.686
C(age)[T.44.0] 0.4711 0.090 5.243 0.000 0.273 0.669
C(age)[T.45.0] 0.6426 0.098 6.553 0.000 0.427 0.858
C(age)[T.46.0] 0.9095 0.174 5.237 0.000 0.527 1.292
C(union)[T.1.0] 0.1809 0.038 4.810 0.001 0.098 0.264
hours 0.0028 0.002 1.829 0.095 -0.001 0.006
tenure 0.0328 0.003 9.372 0.000 0.025 0.040
race -0.1455 0.025 -5.797 0.000 -0.201 -0.090
==============================================================================
Omnibus: 1064.590 Durbin-Watson: 0.868
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2692.841
Skew: 0.329 Prob(JB): 0.00
Kurtosis: 4.729 Cond. No. 2.93e+04
==============================================================================
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
[2] The condition number is large, 2.93e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
However, this approach falls short for two reasons:
- It is much slower on large datasets than reghdfe (cf. below),
- It fails to converge in Python when one of the fixed-effect dimensions has too high a cardinality.
(Note: one may argue that the latter point hints at a faulty analysis design… but I'd prefer not to debate whether it is correct to do so.)
A model that shows the contrast in computation time is ln_wage ~ hours + tenure + race + C(age) + C(idcode), clustering SEs by ind_code:
- Stata 17: 0.35 seconds
- Python 3.11: 115.79 seconds
Reproducible code
- Python
import numpy as np
import statsmodels.datasets as sd
import statsmodels.formula.api as sm
import time

# Load the nlswork dataset and drop rows with missing model variables
df = sd.webuse("nlswork")
dataset = df.dropna(
    subset=[
        "ln_wage",
        "hours",
        "tenure",
        "race",
        "age",
        "idcode",
        "ind_code",
    ],
    how="any",
)
# Time the model construction and fit; C(idcode) creates thousands of dummies
start_time = time.perf_counter()
model = sm.ols(
    formula="ln_wage ~ hours + tenure + race + C(age) + C(idcode)",
    data=dataset,
)
results = model.fit(
    cov_type="cluster",
    cov_kwds={"groups": np.array(dataset["ind_code"])},
    use_t=True,
)
end_time = time.perf_counter()
print(f"Model fitting duration: {end_time - start_time:.2f} seconds.")
results.summary().as_text()
- Stata
webuse nlswork
set rmsg on
reghdfe ln_wage hours tenure race, absorb(age idcode) vce(cluster ind_code)
A model for which Python fails to converge is ln_wage ~ hours + tenure + race + C(age) + C(idcode) + C(grade):
- Stata: 0.51 seconds
- Python: stopped after 20 minutes.
Other
I'd be happy to open or contribute to a PR for this, although I might fall short on the statistical theory.
Linked to #259