Repository providing the benchmarks and code to reproduce the experiments of the CLeaR'25 paper *Compositional Models for Estimating Causal Effects*, to appear in the Causal Learning and Reasoning Conference, 2025.
Summary: We introduce a novel compositional framework to estimate conditional average treatment effects (CATE) for compositional systems with structured units. We introduce three novel, realistic evaluation environments for compositional approaches to causal effect estimation: (1) query execution in relational databases, (2) matrix processing on different types of computer hardware, and (3) manufacturing assembly line data based on a realistic simulator. We provide data and code to generate data from these three benchmarks, as well as the synthetic data used in the paper. We find that the compositional approach provides accurate causal effect estimation for structured units, increased sample efficiency, improved overlap between treatment and control groups, and compositional generalization to units with unseen combinations of components.
We generate synthetic compositional data with various characteristics: composition structure (sequential and parallel), data distribution (uniform and normal), functional form of the response functions (linear, non-linear, polynomial), and systematic data generation with increasing tree depths vs. sequential tree generation with exactly the same composition structure across units. For more details, see `synthetic_data/data_generator/synthetic_data_sampler.py`.
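To make the two composition structures concrete, here is a minimal sketch of sequential vs. parallel composition of module response functions. The module definitions, coefficients, and aggregation by summation are illustrative assumptions, not the sampler's actual internals:

```python
import numpy as np

def module(coeffs):
    """A component whose response is a polynomial of its input
    (hypothetical stand-in for a module's response function)."""
    return lambda x: np.polyval(coeffs, x)

# Three toy modules with fixed polynomial coefficients.
m1, m2, m3 = module([1, 0]), module([2, 1]), module([0.5, 0, 1])

def sequential(x):
    # Sequential composition: each module consumes the previous module's output.
    return m3(m2(m1(x)))

def parallel(x):
    # Parallel composition: modules act on the same input and outcomes are aggregated.
    return m1(x) + m2(x) + m3(x)

print(sequential(2.0), parallel(2.0))  # → 13.5 10.0
```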
Usage:

To generate synthetic data, use the code below (with root directory `synthetic_data/`):

```python
from data_generator.synthetic_data_sampler import SyntheticDataSampler

num_modules = 10
module_function_types = ["polyval"] * num_modules

# Simulate data for both treatments (experimental data).
sampler = SyntheticDataSampler(
    num_modules=num_modules,
    num_feature_dimensions=1,
    composition_type="sequential",
    fixed_structure=False,
    max_depth=num_modules,
    num_samples=1000,
    seed=42,
    data_dist="uniform",
    module_function_types=module_function_types,
    resample=False,
)

# Create observational data by introducing observational bias.
sampler.create_observational_data(biasing_covariate="feature_sum", bias_strength=1)

# Split units into train/test (IID: random split; OOD: split on varying tree depths)
# and indicate whether models are evaluated on the maximum tree depth (for the OOD split).
num_train_modules = 5  # example value
sampler.create_iid_ood_split(split_type="ood",
                             num_train_modules=num_train_modules,
                             test_on_last_depth=True)
```
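The `create_observational_data` call biases treatment assignment using a covariate. The following dependency-light sketch shows how such confounding can be introduced; the logistic propensity form and variable names are our assumptions, not necessarily what the sampler implements:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy units: three features per unit; feature_sum acts as the biasing covariate.
features = rng.uniform(0, 1, size=(1000, 3))
feature_sum = features.sum(axis=1)

def biased_treatment(covariate, bias_strength):
    """Assign treatment with probability increasing in the covariate.
    With bias_strength = 0 this reduces to a randomized experiment."""
    z = bias_strength * (covariate - covariate.mean())
    propensity = 1.0 / (1.0 + np.exp(-z))  # logistic propensity (our assumption)
    return (rng.uniform(size=covariate.shape) < propensity).astype(int)

t_rct = biased_treatment(feature_sum, bias_strength=0)  # ~50/50 assignment
t_obs = biased_treatment(feature_sum, bias_strength=5)  # confounded assignment

# Under bias, treated units have a larger feature_sum on average.
print(feature_sum[t_obs == 1].mean() > feature_sum[t_obs == 0].mean())
```

Larger `bias_strength` values widen the covariate gap between treatment groups, reducing overlap.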
- `cd manufacturing_assembly`
- Run the `factoryScenarioGenerator.ipynb` notebook to generate various manufacturing assembly line layouts with hierarchical structures and initial factory conditions, which determine how much raw material is available and specify the product demand.
- Running this notebook will generate the `factory_scenario` and `initial_conditions` folders.
- The number of workers and their skill distribution is specified in the `workers/workers_00.json` and `workers/workers_01.json` files. These specify the binary treatment for the hierarchical assembly layouts (units).
- Run the `simulate_factories-time-dynamics.ipynb` notebook, which uses `simpy` (a discrete-event simulator) to generate potential outcomes for the different treatments. It takes a factory scenario (a hierarchical structure representing an instance-specific composition) as input and runs it with a set of factory workers of multiple skill levels to calculate the total rework, scrap, and products produced by each station and by the whole factory (component-level and unit-level potential outcomes).
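The notebook relies on `simpy`; the dependency-free sketch below mimics the same idea: each station processes items, some are scrapped or reworked depending on worker skill, and station-level counts aggregate to factory-level outcomes. All function names, probabilities, and skill values are illustrative, not the notebook's actual logic:

```python
import random

def run_station(n_items, skill, rng):
    """Process items at one station; lower skill raises scrap/rework rates (illustrative)."""
    produced = scrap = rework = 0
    for _ in range(n_items):
        r = rng.random()
        if r < 0.1 * (1 - skill):       # scrap probability falls with skill
            scrap += 1
        elif r < 0.3 * (1 - skill):     # rework probability falls with skill
            rework += 1
            produced += 1               # reworked items are still produced
        else:
            produced += 1
    return {"produced": produced, "scrap": scrap, "rework": rework}

def run_factory(station_skills, n_items=100, seed=0):
    """Component-level outcomes per station, aggregated to a unit-level outcome."""
    rng = random.Random(seed)
    stations = [run_station(n_items, s, rng) for s in station_skills]
    total = {k: sum(st[k] for st in stations) for k in ("produced", "scrap", "rework")}
    return stations, total

# Binary treatment = two worker-skill profiles applied to the same factory layout.
_, y0 = run_factory([0.5, 0.5, 0.5], seed=0)  # control: lower-skill workers
_, y1 = run_factory([0.9, 0.9, 0.9], seed=0)  # treatment: higher-skill workers
print(y1["produced"] - y0["produced"])         # unit-level treatment effect on output
```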
- Run `matrix_operations/generate_matrix_expressions_data.py` to generate data for a set of matrix expressions (units) on a given computer hardware (treatment), and obtain the run-times (potential outcomes) for each operation (component) as well as for the overall expression.
- Expression data generated on two different computer hardware configurations is provided in JSON and CSV formats here: Google Drive Link.
- Run `query_execution/data_gen/end_to_end_data_gen.sh` to generate data from scratch. Note: you would need to set up the Stack Overflow database and pull user-generated queries to generate the data from scratch.
- Query execution data generated for around 10k query execution plans (units) with various database configuration parameters (interventions) is provided in JSON and CSV formats here: Google Drive Link.
To reproduce the experiment results, we currently maintain a separate codebase for each domain. Run the code in the respective folder to reproduce the results.
- `cd synthetic_data/`
- Run `./base_experiments.sh` in the `synthetic_data/` folder to generate results for the compositional generalization experiment for sequential and parallel compositional structures. This will generate a `results/` folder in `synthetic_data/` with JSON files containing the $R^2$ and PEHE metrics for the CATE estimation task.
- Use `notebooks/plot_results.ipynb` to reproduce the results of Figure 3.
- Run `manufacturing_assembly/highLevelModelTraining.ipynb` for CATE estimation using unitary models.
- First, run `manufacturing_assembly/LowLevelModels.ipynb` to train component-level models for potential outcome estimation.
- Then, run `manufacturing_assembly/LowLevelModels-aggregation.ipynb` to aggregate the component-level estimates and obtain unit-level CATE estimates using the compositional approach.
Note: When generating the matrix operations data set, each matrix operation is evaluated independently of the other operations, and the overall run-time of an expression is the sum of the run-times of the individual operations; thus, this data set satisfies the additive parallel compositional assumption. Hence, we use the additive parallel compositional model for this data set, as explained below.
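Under the additive assumption, the unit-level CATE is simply the sum of component-level CATEs. A minimal sketch with plug-in mean-difference estimators; the data values and dictionary layout are made up for illustration and are not the repository's estimator:

```python
# Per-component run-times (seconds) under control (y0) and treatment (y1) hardware,
# for a unit composed of three operations; values are illustrative.
unit = {
    "matmul":    {"y0": [1.20, 1.10, 1.30], "y1": [0.80, 0.70, 0.90]},
    "transpose": {"y0": [0.10, 0.12, 0.08], "y1": [0.05, 0.06, 0.04]},
    "add":       {"y0": [0.30, 0.28, 0.32], "y1": [0.20, 0.18, 0.22]},
}

def mean(xs):
    return sum(xs) / len(xs)

# Component-level CATE: difference in mean potential outcomes per component.
component_cate = {op: mean(d["y1"]) - mean(d["y0"]) for op, d in unit.items()}

# Additive parallel composition: unit-level CATE is the sum of component CATEs.
unit_cate = sum(component_cate.values())
print(component_cate, round(unit_cate, 3))  # unit_cate ≈ -0.55
```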
- First, make sure that `matrix_operations/data/csvs` contains the CSV files for the components and units (the `maths_evaluation_data` and `high_level_features` CSV files), consisting of covariates, treatment, and outcomes (run-time) for both treatments. Download the prepared data from Google Drive Link.
- Run `matrix_operations/run_math_evaluation_baselines.py` to run the standard CATE baselines (unitary approach) on experimental (`bias_strength = 0`) and observational (`bias_strength = 1-20`) data.
- Run `matrix_operations/run_parallel_additive_model_maths_baseline.py` to run the additive parallel compositional model.
- Run `query_execution/modeling/causal_effect_estimation.py` to run the additive parallel compositional model on the query execution data set.
If you find our work helpful, please consider citing:

```bibtex
@article{pruthi2024compositional,
  title={Compositional Models for Estimating Causal Effects},
  author={Pruthi, Purva and Jensen, David},
  journal={arXiv preprint arXiv:2406.17714},
  year={2024}
}
```