
Compositional Models for Estimating Causal Effects

This repository provides benchmarks and code to reproduce the experiments of our CLeaR 2025 paper, Compositional Models for Estimating Causal Effects, to appear at the Conference on Causal Learning and Reasoning (CLeaR), 2025.

Summary: We introduce a compositional framework for estimating conditional average treatment effects (CATE) in compositional systems with structured units. We also introduce three realistic evaluation environments for compositional approaches to causal effect estimation: (1) query execution in relational databases, (2) matrix processing on different types of computer hardware, and (3) manufacturing assembly lines based on a realistic simulator. We provide data and code to generate data for the three benchmarks as well as the synthetic data used in the paper. We find that the compositional approach provides accurate causal effect estimates for structured units, increased sample efficiency, improved overlap between treatment and control groups, and compositional generalization to units with unseen combinations of components.
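
For reference, the unit-level estimand is the standard CATE; the compositional approach estimates it by modeling component-level potential outcomes and aggregating them according to the unit's composition structure (a schematic summary, not the paper's full formal setup):

$$\tau(x) = \mathbb{E}\left[\,Y(1) - Y(0) \mid X = x\,\right]$$

For a unit composed of components $k = 1, \dots, K$, component-level models $\hat{y}_k(x_k, t)$ are fit from component-level data and their predictions are combined, following the unit's composition structure, to form the unit-level estimate $\hat{\tau}(x)$.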

Data generation and benchmark creation

Synthetic data

We generate synthetic compositional data with various characteristics: composition structures (sequential and parallel), data distributions (uniform and normal), functional forms of the response functions (linear, non-linear, polynomial), and systematic data generation with increasing tree depths vs. sequential tree generation with exactly the same composition structure across units. For more details, see the synthetic_data/data_generator/synthetic_data_sampler.py file.

Usage: To generate synthetic data, use the code below (run from the synthetic_data/ root directory).

from data_generator.synthetic_data_sampler import SyntheticDataSampler

num_modules = 10
module_function_types = ["polyval"] * num_modules

# Simulate data for both treatments (experimental data).
sampler = SyntheticDataSampler(num_modules=num_modules,
                               num_feature_dimensions=1,
                               composition_type="sequential",
                               fixed_structure=False,
                               max_depth=num_modules,
                               num_samples=1000,
                               seed=42,
                               data_dist="uniform",
                               module_function_types=module_function_types,
                               resample=False)

# Create observational data by introducing confounding bias.
sampler.create_observational_data(biasing_covariate="feature_sum", bias_strength=1)

# Split units into train/test systematically (IID: random split; OOD: split on
# varying tree depths) and indicate whether models are evaluated on the maximum
# tree depth (for the OOD split).
train_modules = 5  # number of modules seen during training (example value)
sampler.create_iid_ood_split(split_type="ood",
                             num_train_modules=train_modules,
                             test_on_last_depth=True)
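
For intuition, biasing on feature_sum makes treatment assignment depend on unit covariates. Below is a minimal sketch of this kind of biasing, assuming a simple logistic propensity; the actual sampler implementation may differ.

import numpy as np

def biased_treatment_assignment(feature_sum, bias_strength, rng):
    """Assign treatment with probability increasing in feature_sum.

    bias_strength = 0 recovers a randomized experiment; larger values
    induce stronger confounding between covariates and treatment.
    """
    z = (feature_sum - feature_sum.mean()) / feature_sum.std()
    propensity = 1.0 / (1.0 + np.exp(-bias_strength * z))
    return rng.binomial(1, propensity)

rng = np.random.default_rng(42)
feature_sum = rng.uniform(0, 1, size=1000)
treatment = biased_treatment_assignment(feature_sum, bias_strength=1.0, rng=rng)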

Manufacturing assembly data generation

  1. cd manufacturing_assembly.
  2. Run the factoryScenarioGenerator.ipynb notebook to generate various manufacturing assembly line layouts with hierarchical structures, along with the initial factory conditions, which determine how much raw material is available and specify the product demand. Running this notebook generates the factory_scenario and initial_conditions folders.
  3. The number of workers and their skill distribution are specified in the workers/workers_00.json and workers/workers_01.json files. These files specify the binary treatment for the hierarchical assembly layouts (units).
  4. Run the simulate_factories-time-dynamics.ipynb notebook, which uses simpy (a discrete-event simulation library) to generate potential outcomes for the different treatments. It takes a factory scenario (a hierarchical structure representing instance-specific composition) as input and runs it with a set of factory workers with multiple skill levels to calculate the total rework, scrap, and products produced by each station and by the whole factory (component-level and unit-level potential outcomes). A minimal simpy sketch of this pattern follows below.
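
For readers unfamiliar with simpy, the following is a minimal, self-contained sketch of the simulation pattern used here: a single station processes parts, and worker skill determines the scrap/rework rates. The station parameters and rates are illustrative, not the notebook's actual values.

import random
import simpy

def station(env, skill, counts, cycle_time=1.0):
    """One assembly station: processes a part each cycle; lower-skill
    workers scrap or rework parts more often (illustrative rates)."""
    while True:
        yield env.timeout(cycle_time)
        r = random.random()
        if r < 0.10 * (1 - skill):
            counts["scrap"] += 1
        elif r < 0.30 * (1 - skill):
            counts["rework"] += 1
            yield env.timeout(0.5 * cycle_time)  # extra time spent reworking
        else:
            counts["produced"] += 1

random.seed(0)
for skill in (0.4, 0.9):  # two worker-skill levels ~ binary treatment
    counts = {"produced": 0, "rework": 0, "scrap": 0}
    env = simpy.Environment()
    env.process(station(env, skill, counts))
    env.run(until=100)
    print(skill, counts)  # station-level potential outcomes per treatment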

Matrix operations processing

  • Run matrix_operations/generate_matrix_expressions_data.py to generate data for a set of matrix expressions (units) on given computer hardware (treatment), and obtain the run-times (potential outcomes) for each operation (component) as well as for the overall expression; a sketch of the timing procedure appears below.
  • Expression data generated on two different types of computer hardware is provided in JSON and CSV formats here: Google Drive Link.
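
As a rough illustration of how per-operation run-times can be measured (a hedged sketch; the actual data-generation script may differ):

import time
import numpy as np

def time_op(fn, *args, repeats=5):
    """Return the best-of-N wall-clock time for one matrix operation."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

rng = np.random.default_rng(0)
A, B = rng.standard_normal((500, 500)), rng.standard_normal((500, 500))

# Component-level run-times for an expression such as (A @ B) + A.T.
components = {
    "matmul": time_op(np.matmul, A, B),
    "transpose": time_op(np.transpose, A),
    "add": time_op(np.add, A, B),
}
total = sum(components.values())  # additive unit-level run-time
print(components, total)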

Query Execution Domain

  • Run query_execution/data_gen/end_to_end_data_gen.sh to generate data from scratch. Note: you will need to set up the Stack Overflow database and pull user-generated queries to generate data from scratch.
  • Query execution data generated for around 10k query execution plans (units) under various database configuration parameters (interventions) is provided in JSON and CSV formats here: Google Drive Link.
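
To make the unit/component structure concrete: a query plan (unit) decomposes into operators (components), each with its own covariates and run-time. The record below is a hypothetical illustration of this shape, not the repository's exact schema.

plan = {
    "plan_id": "q_0001",          # unit identifier (hypothetical field names)
    "treatment": "config_B",      # database configuration (intervention)
    "total_runtime_ms": 182.0,    # unit-level outcome
    "operators": [                # components of the unit
        {"op": "Seq Scan",  "rows": 50000, "runtime_ms": 120.0},
        {"op": "Hash Join", "rows": 8000,  "runtime_ms": 48.0},
        {"op": "Aggregate", "rows": 1,     "runtime_ms": 14.0},
    ],
}

# Under the additive parallel model, the unit-level outcome is the sum
# of its component-level outcomes:
assert abs(sum(o["runtime_ms"] for o in plan["operators"])
           - plan["total_runtime_ms"]) < 1e-9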

Experiment results

To reproduce the experimental results, we currently maintain a separate codebase for each domain; run the code in the corresponding folder.

Synthetic data

  • cd synthetic_data.

  • Run ./base_experiments.sh in the synthetic_data/ folder to generate results for the compositional generalization experiments with sequential and parallel composition structures. This generates a results/ folder in synthetic_data/ containing JSON files with $R^2$ and PEHE metrics for the CATE estimation task (PEHE is defined in the sketch after this list).

  • Use notebooks/plot_results.ipynb to reproduce Figure 3.
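
For reference, PEHE (precision in estimation of heterogeneous effects) is the root-mean-squared error between estimated and true unit-level effects; a minimal sketch:

import numpy as np

def pehe(tau_true, tau_hat):
    """Root-mean-squared error between true and estimated CATEs."""
    tau_true, tau_hat = np.asarray(tau_true), np.asarray(tau_hat)
    return np.sqrt(np.mean((tau_hat - tau_true) ** 2))

print(pehe([1.0, 2.0, 0.5], [0.8, 2.2, 0.4]))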

Manufacturing domain

Unitary model training and evaluation

  • Run manufacturing_assembly/highLevelModelTraining.ipynb for CATE estimation using unitary models.

Compositional model training and evaluation

  • First, run manufacturing_assembly/LowLevelModels.ipynb to train component-level models for potential-outcome estimation.

  • Then, run manufacturing_assembly/LowLevelModels-aggregation.ipynb to aggregate the component-level estimates into unit-level CATE estimates using the compositional approach, as sketched below.
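
A hedged sketch of the aggregation step, assuming trained component-level models that predict each station's potential outcomes from its covariates (the names and the predict interface here are illustrative, not the notebook's actual API):

def unit_cate(stations, models):
    """Aggregate component-level effect estimates into a unit-level CATE.

    stations: list of (station_type, covariates) pairs describing one
              factory layout (the unit's composition structure).
    models:   dict mapping station_type -> model with a
              predict(covariates, treatment) method (illustrative API).
    """
    effect = 0.0
    for station_type, x in stations:
        model = models[station_type]
        # Component-level effect: difference of predicted potential outcomes.
        effect += model.predict(x, treatment=1) - model.predict(x, treatment=0)
    return effect  # unit-level CATE under an additive aggregation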

Matrix operations processing

Note: In the matrix operations data set, each matrix operation is evaluated independently of the other operations, and the overall run-time of an expression is the sum of the run-times of its individual operations; thus, this data set satisfies the additive parallel composition assumption. Hence, we use the additive parallel compositional model for this data set, as explained below.
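
In symbols, for a unit with components $i = 1, \dots, K$ and component-level potential outcomes $Y_i(t)$, additive parallel composition means the unit-level outcome and effect decompose as (a restatement of the note above):

$$Y(t) = \sum_{i=1}^{K} Y_i(t) \quad \Rightarrow \quad \tau = \mathbb{E}\left[Y(1) - Y(0)\right] = \sum_{i=1}^{K} \mathbb{E}\left[Y_i(1) - Y_i(0)\right].$$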

  • First, make sure that matrix_operations/data/csvs contains the CSV files for the components and units (maths_evaluation_datahigh_levelfeatures.csv), consisting of covariates, treatment, and outcomes (run-times) for both treatments. The prepared data can be downloaded from the Google Drive Link.

  • Run matrix_operations/run_math_evaluation_baselines.py to run the standard CATE baselines (the unitary approach) on experimental (bias_strength = 0) and observational (bias_strength = 1 to 20) data.

  • Run matrix_operations/run_parallel_additive_model_maths_baseline.py to run the additive parallel compositional model; a sketch of the modeling pattern appears after this list.
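
A hedged sketch of the additive parallel modeling pattern, using a per-component T-learner with summed predictions on toy data; the repository's script may use different learners and features.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Toy component-level data: covariate x, treatment t, run-time y.
n = 2000
x = rng.uniform(0, 1, size=(n, 1))
t = rng.integers(0, 2, size=n)
y = 1.0 + 2.0 * x[:, 0] + t * (0.5 + x[:, 0]) + rng.normal(0, 0.1, n)

# T-learner for one component type: one regressor per treatment arm.
m0 = RandomForestRegressor(random_state=0).fit(x[t == 0], y[t == 0])
m1 = RandomForestRegressor(random_state=0).fit(x[t == 1], y[t == 1])

def component_cate(x_new):
    return m1.predict(x_new) - m0.predict(x_new)

# Unit-level CATE = sum of component-level CATEs (additive parallel model).
# Pretend one unit (expression) has three operations with these covariates:
unit_ops = np.array([[0.2], [0.5], [0.9]])
unit_effect = component_cate(unit_ops).sum()
print(unit_effect)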

Query Execution

  • Run query_execution/modeling/causal_effect_estimation.py to run the additive parallel compositional model on the query execution data set.

If you find our work helpful, please consider citing:

@article{pruthi2024compositional,
  title={Compositional Models for Estimating Causal Effects},
  author={Pruthi, Purva and Jensen, David},
  journal={arXiv preprint arXiv:2406.17714},
  year={2024}
}
