inga is a toolkit for generating and inspecting synthetic tabular datasets. It constructs arbitrarily complex Structural Causal Models (SCMs), draws samples from them, and computes causal effects and causal biases conditioned on observed variables and outcomes. All computed quantities are stored and made available for causally-informed pre-training of tabular models.
The current scope of this repository is restricted to SCMs with continuous variables. Let
Here,
In particular, let
Here,
inga approximates this posterior using a robust Laplace approximation, enabling scalable computation in high-dimensional settings and across batches of observations.
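To illustrate the idea behind a Laplace approximation, the sketch below fits a Gaussian to a one-dimensional posterior by locating the mode with Newton's method and using the negative inverse curvature at the mode as the variance. This is a plain textbook version, not inga's robust variant, and it does not call inga. The toy model (Z ~ N(0, 1), X | Z ~ N(sin Z, 1), observed X = 1) and all function names are assumptions chosen for illustration.

```python
import math


def log_posterior_grad(z: float, x: float) -> float:
    # Derivative of log p(z | x) = -z^2/2 - (x - sin z)^2/2 + const.
    return -z + (x - math.sin(z)) * math.cos(z)


def log_posterior_hess(z: float, x: float) -> float:
    # Second derivative of the same unnormalized log posterior.
    return -1.0 - math.cos(z) ** 2 - (x - math.sin(z)) * math.sin(z)


def laplace_approximation(x: float, z0: float = 0.0, num_steps: int = 50):
    """Return (mode, variance) of a Gaussian approximation to p(z | x)."""
    z = z0
    for _ in range(num_steps):
        hess = log_posterior_hess(z, x)
        if hess >= 0.0:
            # Plain Newton steps need negative curvature; a robust
            # variant would handle this case instead of failing.
            raise ValueError("log posterior is not locally concave")
        step = log_posterior_grad(z, x) / hess
        z -= step
        if abs(step) < 1e-12:
            break
    return z, -1.0 / log_posterior_hess(z, x)


mode, variance = laplace_approximation(x=1.0)
print(f"mode={mode:.4f}, variance={variance:.4f}")
```

Starting Newton's method from the prior mean works here because the log posterior is concave near zero; in general it is not concave everywhere, which is one reason a more careful scheme is needed in practice.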
One can show that the association between treatment
Causal effect and causal bias provide a granular characterization of how information propagates from observed variables to the outcome within the DAG.
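As a rough illustration of how an observed association can mix a causal contribution with a confounding one, the sketch below simulates a linear Gaussian SCM with Z -> X, Z -> Y and X -> Y, and compares the regression slope of Y on X with the textbook decomposition into a direct effect plus a confounding term. It uses only the standard library, not inga, and the coefficients `a`, `b`, `c` and the function names are hypothetical. The definitions of causal effect and causal bias used by inga may differ from this linear special case.

```python
import random


def sample_linear_scm(num_samples, a, b, c, seed=0):
    """Draw (X, Y) pairs from Z ~ N(0,1), X = a*Z + e_x, Y = b*X + c*Z + e_y."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        x = a * z + rng.gauss(0.0, 1.0)
        y = b * x + c * z + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
    return xs, ys


def regression_slope(xs, ys):
    """Least-squares slope of ys on xs, i.e. Cov(X, Y) / Var(X)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var = sum((x - mean_x) ** 2 for x in xs) / n
    return cov / var


a, b, c = 1.0, 0.5, 1.0
xs, ys = sample_linear_scm(100_000, a, b, c, seed=0)

association = regression_slope(xs, ys)
causal_effect = b                          # effect of intervening on X
confounding_term = c * a / (a**2 + 1.0)    # path X <- Z -> Y, via E[Z | X]

print(f"association      = {association:.3f}")
print(f"effect + bias    = {causal_effect + confounding_term:.3f}")
```

In this toy model the confounding term is as large as the causal effect itself, so a model that only fits the association would overstate the effect of X on Y by a factor of two.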
Standard point-estimation models aim to approximate the conditional expectation
Consider an encoder model
inga enables causally consistent pre-training by generating synthetic datasets that include the full set of causal effects
The small benchmark causal_consistency_benchmark.py demonstrates this intuition. A simple MLP encoder is attached to three linear heads, which predict outcomes, causal effects, and causal biases, respectively. The model is trained and tested individually on splits of 30 randomly generated synthetic datasets.
+--------------------+----------------+-------------------+-------------------------+
| method_type        | prediction_mae | causal_effect_mae | prediction_win_fraction |
+--------------------+----------------+-------------------+-------------------------+
| standard           | 0.7909 [0.31]  | 0.3353 [0.45]     | 0.0667                  |
| l2                 | 0.7868 [0.31]  | 0.3141 [0.46]     | 0.0667                  |
| causal_consistency | 0.7694 [0.31]  | 0.0461 [0.21]     | 0.8667                  |
+--------------------+----------------+-------------------+-------------------------+
The table shows that the model trained using causal consistency not only provides much more reliable causal effect estimates, but also decreases the generalization error on ~87% of the datasets. Results can be replicated by running uv run python examples/causal_consistency_benchmark.py.
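One plausible reading of prediction_win_fraction is the share of datasets on which a method attains the lowest prediction MAE. The sketch below computes it that way on made-up numbers; the function name, the toy MAE values, and the tie-breaking rule are assumptions, not taken from the benchmark script. Under this reading, the reported fractions are consistent with 2, 2, and 26 wins out of 30 datasets.

```python
def win_fractions(mae_by_method: dict[str, list[float]]) -> dict[str, float]:
    """Fraction of datasets on which each method has the lowest MAE.

    Ties go to the method listed first (an arbitrary choice for this sketch).
    """
    methods = list(mae_by_method)
    num_datasets = len(mae_by_method[methods[0]])
    wins = {method: 0 for method in methods}
    for i in range(num_datasets):
        best = min(methods, key=lambda method: mae_by_method[method][i])
        wins[best] += 1
    return {method: wins[method] / num_datasets for method in methods}


# Toy, made-up per-dataset MAE values for three datasets (not benchmark results).
toy_mae = {
    "standard": [0.80, 0.70, 0.90],
    "l2": [0.79, 0.71, 0.91],
    "causal_consistency": [0.75, 0.72, 0.85],
}
fractions = win_fractions(toy_mae)
print(fractions)

# Arithmetic check against the table: 26 of 30 and 2 of 30 datasets.
print(round(26 / 30, 4), round(2 / 30, 4))
```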
Get inga from PyPI:
pip install inga

Alternatively, clone the repository:
git clone https://github.com/gianlucadetommaso/inga.git
cd inga

Sync dependencies:
uv sync

Run scripts, for example:
uv run python examples/explore.py

You can create and draw the DAG of an SCM as follows:
from inga.scm import SCM, Variable

scm = SCM(
    variables=[
        Variable(name="Z"),
        Variable(name="X", parent_names=["Z"]),
        Variable(name="Y", parent_names=["Z", "X"]),
    ]
)
scm.draw(output_path="YOUR_DAG.png")

The class Variable defines a variable
import torch
from torch import Tensor
from inga.scm import GaussianVariable


class MyVariable(GaussianVariable):
    def f_mean(self, parents: dict[str, Tensor]) -> Tensor:
        # Mean function: sum of sin(parent) over all parents.
        f_mean: Tensor | float = 0.0
        for parent in parents.values():
            f_mean = f_mean + torch.sin(parent)
        return f_mean

An example of a built-in GaussianVariable with a predefined mean function is LinearVariable. Now, let's update the SCM using our newly defined variable class!
from inga.scm import SCM

scm = SCM(
    variables=[
        MyVariable(name="Z", sigma=1.0),
        MyVariable(name="X", sigma=1.0, parent_names=["Z"]),
        MyVariable(name="Y", sigma=1.0, parent_names=["Z", "X"]),
    ]
)

We are ready to compute the causal effect and the causal bias. We need to define the treatment variable, the outcome variable, and the observed variables. Note: the treatment should always be observed, while the outcome should never be. Here is an example:
from torch import Tensor

treatment_name, outcome_name = "X", "Y"
observed = {"X": Tensor([1.])}

# Fit the posterior given the observed values before querying the SCM.
scm.posterior.fit(observed)

causal_effect = scm.causal_effect(
    observed=observed,
    treatment_name=treatment_name,
    outcome_name=outcome_name,
)
causal_bias = scm.causal_bias(
    observed=observed,
    treatment_name=treatment_name,
    outcome_name=outcome_name,
)

You can investigate the dataset interactively by exporting the SCM to HTML:
scm.export_html(
    output_path="YOUR_SCM.html",
    observed_ranges={"X": (-2.0, 2.0)},
)

For an example, run uv run python examples/explore.py or check out datacard.html.
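To make concrete what drawing samples from the SCM above involves, here is a minimal standard-library sketch of ancestral sampling: variables are visited in topological order, and each is drawn as a Gaussian centered at the sum of sin(parent) over its parents. It assumes that a GaussianVariable adds zero-mean noise with standard deviation sigma to f_mean, which is not confirmed by the text above. The function `sample_scm` and the `parents_of` mapping are hypothetical and are not part of inga.

```python
import math
import random
from graphlib import TopologicalSorter


def sample_scm(parents_of, sigma, num_samples, seed=0):
    """Ancestral sampling: draw each variable after all of its parents."""
    rng = random.Random(seed)
    # static_order() yields every node after its predecessors (its parents).
    order = list(TopologicalSorter(parents_of).static_order())
    samples = []
    for _ in range(num_samples):
        values = {}
        for name in order:
            # Same mean function as MyVariable: sum of sin over the parents.
            mean = sum(math.sin(values[parent]) for parent in parents_of[name])
            values[name] = mean + sigma * rng.gauss(0.0, 1.0)
        samples.append(values)
    return samples


# Same graph as the SCM above: Z -> X, Z -> Y, X -> Y.
parents_of = {"Z": [], "X": ["Z"], "Y": ["Z", "X"]}
samples = sample_scm(parents_of, sigma=1.0, num_samples=5, seed=123)
for row in samples:
    print({name: round(value, 3) for name, value in row.items()})
```

Fixing the seed makes the draws reproducible, which mirrors the role of the seed arguments in the dataset generation calls.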
Now that we have constructed our SCM, let's generate, save, and load an SCM dataset.
from inga.scm import CausalQueryConfig, load_scm_dataset

dataset = scm.generate_dataset(
    num_samples=128,
    seed=123,
    queries=[
        CausalQueryConfig(
            treatment_name="X",
            outcome_name="Y",
            observed_names=["X"],
        ),
    ],
)

dataset_path = "YOUR_DATASET.json"
dataset.save(dataset_path)
loaded_dataset = load_scm_dataset(dataset_path)

Alternatively, you can generate datasets at random from a config as follows:
from inga.scm.dataset import SCMDatasetConfig, generate_scm_dataset
from inga.scm.random import RandomSCMConfig

config = SCMDatasetConfig(
    scm_config=RandomSCMConfig(num_variables=4, parent_prob=0.5, seed=7),
    num_samples=128,
    num_queries=2,
    min_observed=1,
    seed=42,
)
dataset = generate_scm_dataset(config)

If you use inga in academic work, you can cite it with the following BibTeX entry:
@software{detommaso_inga,
  author = {Detommaso, Gianluca},
  title = {Inga: Causal Synthetic Tabular Data Toolkit},
  url = {https://github.com/gianlucadetommaso/inga},
  year = {2026},
  note = {GitHub repository}
}

Licensed under the MIT License.
