Multiple Imputation and causal forests

Hi all,

Thank you for creating and maintaining such a wonderful package!

I would like to use the grf package to study treatment effect heterogeneity in an observational dataset on telework uptake and mental health. The intended readers of the publication are non-statisticians, so it would also be a good opportunity to bring the package to applied scientists. We have ~5,000 observations and ~50 pre-treatment covariates (many of them categorical variables expanded to dummies). Missingness is ~10%, so we generated >20 multiply imputed datasets (via mice).

I’d appreciate guidance on best practice for combining results across imputations:

**ATE/CATE/RATE across imputations**

1. Is the recommended approach to fit a separate causal_forest() on each imputed dataset, compute the ATE (average_treatment_effect), CATE predictions (predict(..., estimate.variance=TRUE)), and heterogeneity summaries (e.g., RATE / TOC), and then pool the resulting estimates across imputations (e.g. Rubin-style)?

- Or is there a preferred alternative (e.g., stacking imputations with weights, etc.) in the context of grf through merge_forests?
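To make "Rubin-style" concrete, here is a minimal sketch of the pooling arithmetic for a single scalar quantity such as the ATE. It is written in Python only to show the calculation and is independent of grf; the function name `pool_rubin` and the example numbers are made up for illustration. Whether this pooling is appropriate for forest-based CATE or RATE estimates is exactly what I am unsure about.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Pool one scalar estimate across m imputations with Rubin's rules.

    estimates[i] and std_errors[i] come from the model fitted on the
    i-th imputed dataset (hypothetical inputs, not produced here).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists with at least two imputations")
    pooled = mean(estimates)                        # pooled point estimate
    within = mean([se ** 2 for se in std_errors])   # average within-imputation variance
    between = variance(estimates)                   # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between          # total variance
    return pooled, math.sqrt(total)


# Illustrative numbers only, not results from my data.
ate_hat = [0.10, 0.12, 0.08, 0.11]
ate_se = [0.03, 0.03, 0.04, 0.03]
pooled_ate, pooled_se = pool_rubin(ate_hat, ate_se)
print(pooled_ate, pooled_se)
```

The same function could in principle be applied pointwise to each unit's CATE prediction and its variance estimate, but that is an assumption on my side rather than something I have seen recommended.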

**Randomness / sample splitting consistency across imputations**

2. Since causal_forest() involves randomness (subsampling, honesty, etc.), would you recommend fixing the same train/test split across imputations when generating CATE predictions for evaluation/plots? Or should the split be drawn separately for each imputation?
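By "fixing the same split" I mean something like the sketch below: the row indices are drawn once with a fixed seed and then reused for every imputed dataset. This assumes that all imputed datasets keep the same rows in the same order, which I believe holds for completed datasets from mice. It is plain Python for illustration; `make_split` and the toy `imputed` list are hypothetical stand-ins, not part of grf or mice.

```python
import random


def make_split(n_rows, test_fraction=0.3, seed=42):
    """Draw train/test row indices once so they can be reused across imputations."""
    rng = random.Random(seed)                      # local generator, reproducible
    n_test = int(round(n_rows * test_fraction))
    test_idx = sorted(rng.sample(range(n_rows), n_test))
    test_set = set(test_idx)
    train_idx = [i for i in range(n_rows) if i not in test_set]
    return train_idx, test_idx


# Toy stand-ins for m imputed datasets that share the same row order.
n_rows, m = 10, 3
imputed = [[f"imp{k}_row{i}" for i in range(n_rows)] for k in range(m)]

# One split, applied identically to every imputed dataset.
train_idx, test_idx = make_split(n_rows, test_fraction=0.3, seed=1)
splits = [([d[i] for i in train_idx], [d[i] for i in test_idx]) for d in imputed]
```

The alternative would be to call `make_split` with a different seed inside the loop, so that each imputation gets its own split.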

**Variable importance across imputations**

3. I noted that variable_importance() can differ across imputations (although certain variables appear consistently across the causal forests). Do you have any recommendations for summarizing this across imputed datasets? For example, should we normalize importances within each forest and report mean/median + stability metrics (top-k frequency) across imputations?

4. Lastly, in some of the example code (e.g. the ijmpr code), I see that you use a train/test split. If I understand correctly, this is to evaluate the fit of the causal forest? Is there some intuition for when this is necessary, given that honesty already involves a split? The examples take different approaches, but I fail to see the reasoning behind them.
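For question 3, the summary I have in mind looks roughly like the sketch below: normalize the importances within each forest, then report the mean, the median, and how often each variable lands in the top k across imputations. It is plain Python operating on importance vectors that would be exported from each forest; `summarize_importance` and the example numbers are hypothetical. If the importances already sum to one within a forest, the normalization step changes nothing.

```python
from statistics import mean, median


def summarize_importance(importances, names, k=5):
    """Summarize variable importance across imputations.

    importances: one list per imputation, one value per variable.
    Returns, per variable: mean and median normalized importance and
    the share of imputations in which it ranks among the top k.
    """
    p = len(names)
    normalized = []
    for row in importances:
        total = sum(row)
        normalized.append([v / total for v in row])   # sums to 1 within each forest

    top_counts = [0] * p
    for row in normalized:
        top = sorted(range(p), key=lambda j: row[j], reverse=True)[:k]
        for j in top:
            top_counts[j] += 1

    m = len(normalized)
    return {
        names[j]: {
            "mean": mean([row[j] for row in normalized]),
            "median": median([row[j] for row in normalized]),
            "top_k_freq": top_counts[j] / m,
        }
        for j in range(p)
    }


# Illustrative values only: three imputations, three variables.
summary = summarize_importance(
    [[5, 3, 2], [6, 1, 3], [4, 4, 2]], ["age", "income", "commute"], k=2
)
for name, stats in summary.items():
    print(name, stats)
```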

I have already implemented the “one forest per imputation + pooling” workflow for ATE/CATE/RATE, but I’d be very grateful for any suggestions!

Kind regards,
Eduardo

Multiple Imputation and causal forests #1523