DATALUS: Diffusion-Augmented Tabular Architecture for Local Utility and Security

🇧🇷 Attention, Judging Committee of the 32nd Prêmio Jovem Cientista: the official documentation, prepared with the scientific rigor required by the call for entries and detailing the impact on the LGPD and on public policy, is in the file README_pt-BR.md.

DATALUS is a production-oriented Generative AI framework for synthetic tabular data. It is designed for high-dimensional, heterogeneous, privacy-sensitive government datasets, with a specific proof-of-concept path for Brazilian public-sector health data. The system learns a joint distribution over tabular records, samples new microdata from that distribution, and subjects generated artifacts to reproducible privacy and utility audits before release.

DATALUS is not an anonymization script. It is a generative ecosystem for ab-initio synthesis, data augmentation, minority-class balancing, tabular inpainting, counterfactual modification, audit automation, ONNX export, INT8 edge inference, FastAPI artifact serving, Streamlit operation, and browser-local execution through ONNX Runtime Web.

Research and Public Data Context
System Requirements
Architecture
Generative Capabilities
Mathematical Foundation
Autonomous Audit Orchestrator
Installation and Developer Setup
Complete CLI Cheatsheet
Training Lifecycle and Colab Constraints
Inference and Model Architecture Details
FastAPI Artifact Service
Docker Deployment
CI/CD and Automated Testing
Data Governance, Ethics, and LGPD Alignment
Troubleshooting and FAQ
References
License
Citation

Research and Public Data Context

Brazil's open-data policy is coordinated through the Infraestrutura Nacional de Dados Abertos (INDA), and dados.gov.br is the central catalog for public datasets. Government sources publish data across CSV, JSON, XML, ODS, RDF, APIs, Parquet-like analytic exports, and sector-specific repositories. The practical result is schema heterogeneity: inconsistent delimiters, lossy encodings, sparse columns, high-cardinality codes, rare municipalities, rare diseases, changing field names, and mixed numeric/string representations.

DATALUS implements ingestion and encoding policies for that reality:

Lazy Polars scans avoid full in-memory Pandas loading.
CSV scans detect common delimiters and use lossy UTF-8 decoding for legacy encodings.
Identifier-like fields are removed before modeling.
Sparse and free-text columns are rejected by explicit policy.
Observed rare categories are preserved as first-class tokens; only unseen inference-time values map to __UNKNOWN__.
Category frequency metadata is serialized so downstream audits can detect long-tail collapse.

Primary public-data references:

Brazilian open-data policy and dados.gov.br catalog: Governo Digital Dados Abertos, Portal Brasileiro de Dados Abertos, API Portal de Dados Abertos.
Tabular diffusion: Kotelnikov et al., TabDDPM.
RePaint inpainting: Lugmayr et al.
Classifier-Free Guidance: Ho and Salimans.
DDIM sampling: Song et al.
Membership inference attacks: Shokri et al.

System Requirements

Layer	Minimum	Recommended	Notes
Python	3.11	3.11 or newer	The package metadata declares `requires-python >=3.11`.
Training GPU	CPU works for tests	NVIDIA T4 15 GB VRAM or better	Colab T4 is the target constrained GPU profile.
Training RAM	8 GB	16 GB or more	Lazy ingestion helps, but encoding and audit projection need memory.
Browser inference	Modern Chromium, Firefox, or Edge	Browser with WebAssembly and Cache API	The React component uses `onnxruntime-web` WASM locally.
Node.js	20 in CI	20 LTS	Frontend CI uses `actions/setup-node@v4` with Node 20.
Docker	Compose v2	Docker Engine with Compose plugin	Compose starts the API and Streamlit containers.

Optional dependency groups are declared in pyproject.toml:

Extra	Purpose
`training`	PyTorch, ONNX, ONNX Runtime, ONNX Script.
`test`	Pytest and HTTPX for API tests.
`frontend`	Streamlit runtime.
`audit`	LightGBM and CatBoost for heavier audit experiments.
`dev`	Full local development stack.

Architecture

The codebase uses a strict src/ layout and Clean Architecture boundaries:

src/datalus/
  domain/            Framework-free schemas and diffusion schedule math
  infrastructure/    Polars, PyTorch, ONNX, checkpointing, encoding adapters
  application/       Training, inference, audit, and export use cases
  interfaces/        Typer CLI and FastAPI delivery adapters
frontend/
  streamlit/         Python Streamlit shell
  component/         React TypeScript ONNX Runtime Web component
tests/               Unit and integration tests
docker/              API and Streamlit Dockerfiles
.github/workflows/   CI jobs for Python, frontend, and Docker builds

flowchart TD
    A["Government Tabular Sources<br/>CSV, TSV, CSV.GZ, Parquet, ORC, APIs"]
    B["Layer 1: Zero-Shot Ingestion<br/>Polars LazyFrame, delimiter detection, identifier removal"]
    C["Layer 2: Heterogeneous Encoding<br/>Quantile numeric transforms, categorical vocabularies, rare-category preservation"]
    D["Layer 3: Hybrid Diffusion Engine<br/>Residual MLP denoiser, cosine schedule, DDIM sampler"]
    E["Layer 4: Generative Operations<br/>Ab-initio generation, augmentation, balancing, inpainting, counterfactuals"]
    F["Layer 5: Autonomous Audit Orchestrator<br/>DCR, Shadow-MIA, TSTR/TRTR, MLE-ratio"]
    G["Layer 6: Edge Deployment<br/>ONNX, INT8 PTQ, FastAPI, Streamlit, React WASM"]

    A --> B --> C --> D --> E --> F --> G

Clean Architecture Responsibilities

Layer	Source path	Responsibility
Domain	`src/datalus/domain`	Pydantic contracts, diffusion schedule math, RePaint config, privacy thresholds.
Infrastructure	`src/datalus/infrastructure`	Polars scanning, reversible encoders, PyTorch networks, diffusion tensors, checkpointing, ONNX export.
Application	`src/datalus/application`	Training, sampling, augmentation, balancing, inpainting, counterfactuals, auditing, artifact export.
Interfaces	`src/datalus/interfaces`	Typer CLI and FastAPI app.
Frontend	`frontend`	Streamlit shell and React ONNX Runtime Web component.

Generative Capabilities

DATALUS exposes distinct workflows because synthetic data systems have different operational goals:

Capability	Purpose	CLI
Ab-initio generation	Create a new synthetic dataset from learned distributions.	`datalus sample`
Data augmentation	Append synthetic rows to a small dataset.	`datalus augment`
Minority balancing	Generate records until target-class counts approach a requested distribution.	`datalus balance`
Tabular inpainting	Fill missing values while preserving observed fields at every reverse step.	`datalus inpaint`
Counterfactual modification	Apply column interventions and regenerate compatible records.	`datalus counterfactual`
Audit	Evaluate empirical privacy and predictive utility before release.	`datalus audit`
Edge export	Export EMA weights to ONNX and optional INT8.	`datalus export-onnx`
Artifact serving	Serve registry artifacts for browser-local inference.	`datalus serve`

The current denoiser exposes CFG-compatible inference logic. In the default training path, models are instantiated without a context vector, so cfg_scale=1.0 is the unconditional path and changing cfg_scale has no effect unless a context-enabled denoiser is introduced. The ONNX export path still records an INT8 CFG amplification parity guard at cfg_scale=3.0 because quantization drift can become operationally relevant when guidance is enabled.

Mathematical Foundation

Implemented Cosine Schedule

The domain layer implements the Nichol-Dhariwal cosine schedule with numerical clipping. For training horizon $T$ and offset $s=0.008$:

$$ f(t)=\cos^2\left(\frac{t/T+s}{1+s}\frac{\pi}{2}\right) $$

The normalized cumulative product is:

$$ \bar{\alpha}_t=\frac{f(t)}{f(0)} $$

The beta schedule is:

$$ \beta_t=\text{clip}\left(1-\frac{\bar{\alpha}_{t+1}}{\bar{\alpha}_t},10^{-5},0.999\right) $$

VarianceSchedule converts these values into tensors for $\beta_t$, $\alpha_t=1-\beta_t$, $\bar{\alpha}_t$, $\sqrt{\bar{\alpha}_t}$, and $\sqrt{1-\bar{\alpha}_t}$. A linear beta schedule also exists for ablation and tests.
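As a minimal sketch of the three formulas above, the schedule can be computed with the standard library alone. The function names `cosine_alpha_bar` and `cosine_betas` are illustrative and are not taken from the DATALUS domain layer; the 0-based indexing of the returned beta list is an assumption.

```python
import math


def cosine_alpha_bar(num_timesteps: int, s: float = 0.008) -> list[float]:
    """Return alpha_bar_t = f(t) / f(0) for t = 0..T (T + 1 values)."""

    def f(t: int) -> float:
        return math.cos(((t / num_timesteps + s) / (1 + s)) * math.pi / 2) ** 2

    f0 = f(0)
    return [f(t) / f0 for t in range(num_timesteps + 1)]


def cosine_betas(num_timesteps: int, s: float = 0.008,
                 lo: float = 1e-5, hi: float = 0.999) -> list[float]:
    """Return clip(1 - alpha_bar_{t+1} / alpha_bar_t, lo, hi) for t = 0..T-1."""
    abar = cosine_alpha_bar(num_timesteps, s)
    return [min(max(1 - abar[t + 1] / abar[t], lo), hi)
            for t in range(num_timesteps)]


betas = cosine_betas(1000)
```

The upper clip matters at the end of the horizon: $f(T)$ is numerically zero, so the unclipped last beta would be 1 and $\alpha_T$ would vanish.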

Forward Markov Chain

For a latent tabular vector $\mathbf{x}_0\in\mathbb{R}^d$, the DDPM forward process corrupts the sample through a Markov chain:

$$ q(\mathbf{x}_t\mid\mathbf{x}_{t-1})=\mathcal{N}\left(\mathbf{x}_t;\sqrt{1-\beta_t}\mathbf{x}_{t-1},\beta_t\mathbf{I}\right) $$

With $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{s=1}^{t}\alpha_s$, the closed-form marginal is:

$$ q(\mathbf{x}_t\mid\mathbf{x}_0)=\mathcal{N}\left(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0,(1-\bar{\alpha}_t)\mathbf{I}\right) $$

The implemented q_sample tensor operation is:

$$ \mathbf{x}_t=\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon},\quad\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I}) $$
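The closed-form sample above is a single scaled sum per coordinate. The following sketch uses plain Python lists in place of tensors; the name `q_sample` mirrors the text, but the signature is an assumption made for illustration.

```python
import math
import random


def q_sample(x0: list[float], alpha_bar_t: float,
             rng: random.Random) -> tuple[list[float], list[float]]:
    """Draw x_t ~ q(x_t | x_0); also return eps, the regression target."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    x_t = [signal * v + noise * e for v, e in zip(x0, eps)]
    return x_t, eps


rng = random.Random(42)
x0 = [0.5, -0.25, 1.0]
x_clean, _ = q_sample(x0, 1.0, rng)     # alpha_bar = 1: no corruption
x_noise, eps = q_sample(x0, 0.0, rng)   # alpha_bar = 0: pure noise
```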

Reverse Process and Implemented Objective

The denoiser $\boldsymbol{\epsilon}_{\theta}$ predicts the injected noise. The probabilistic DDPM reverse process is:

$$ p_{\theta}(\mathbf{x}_{t-1}\mid\mathbf{x}_t)=\mathcal{N}\left(\mathbf{x}_{t-1};\boldsymbol{\mu}_{\theta}(\mathbf{x}_t,t),\boldsymbol{\Sigma}_{\theta}(\mathbf{x}_t,t)\right) $$

DATALUS implements the simplified epsilon-prediction objective used by TabularDiffusion.compute_loss:

$$ \mathcal{L}_{\mathrm{MSE}}=\mathbb{E}_{t,\mathbf{x}_0,\boldsymbol{\epsilon}}\left[\left\lVert\boldsymbol{\epsilon}-\boldsymbol{\epsilon}_{\theta}\left(\sqrt{\bar{\alpha}_t}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon},t\right)\right\rVert_2^2\right] $$

For categorical-logit extensions, the intended TabDDPM composite objective is:

$$ \mathcal{L}_{\mathrm{total}}=\lambda_{\mathrm{num}}\mathcal{L}_{\mathrm{MSE}}^{\mathrm{num}}+\lambda_{\mathrm{cat}}\mathcal{L}_{\mathrm{CE}}^{\mathrm{cat}} $$

The current implementation projects categorical values into continuous learned embedding slices and trains the diffusion model with the MSE objective over the full latent vector. It does not currently train a separate categorical cross-entropy head.

DDIM Sampling

make_ddim_timesteps returns a descending deterministic subsequence from the training horizon. In each reverse step, ddim_step first computes:

$$ \hat{\mathbf{x}}_0=\frac{\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)}{\sqrt{\bar{\alpha}_t}} $$

When $\eta>0$, the implemented stochastic variance term is:

$$ \sigma_t=\eta\sqrt{\left(\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\right)\left(1-\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}\right)} $$

The update is:

$$ \mathbf{x}_{t-1}=\sqrt{\bar{\alpha}_{t-1}}\hat{\mathbf{x}}_0+\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,t)+\sigma_t\boldsymbol{\epsilon} $$

In the default CLI path, $\eta=0$, so $\sigma_t=0$ and sampling is deterministic for a fixed seed. At the final step the implementation uses prev_t=-1 and, by convention, sets $\bar{\alpha}_{-1}=1$, so the last update returns $\hat{\mathbf{x}}_0$.
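The three DDIM equations can be sketched for one latent vector as follows. This is an illustration over plain lists, not the project's `ddim_step` signature, which is not shown in this document; the fallback random generator is an assumption.

```python
import math
import random


def ddim_step(x_t, eps_pred, abar_t, abar_prev, eta=0.0, rng=None):
    """One reverse DDIM update from alpha_bar_t to alpha_bar_prev."""
    rng = rng or random.Random(0)
    # Predicted clean sample x0_hat.
    x0_hat = [(x - math.sqrt(1 - abar_t) * e) / math.sqrt(abar_t)
              for x, e in zip(x_t, eps_pred)]
    # Stochastic term; zero when eta == 0.
    sigma = 0.0
    if eta > 0:
        sigma = eta * math.sqrt(
            (1 - abar_prev) / (1 - abar_t) * (1 - abar_t / abar_prev))
    direction = math.sqrt(max(1 - abar_prev - sigma ** 2, 0.0))
    noise = [rng.gauss(0.0, 1.0) if sigma > 0 else 0.0 for _ in x_t]
    return [math.sqrt(abar_prev) * x0 + direction * e + sigma * n
            for x0, e, n in zip(x0_hat, eps_pred, noise)]
```

Passing `abar_prev=1.0` reproduces the final-step convention: both the direction and noise terms vanish and the update returns the predicted clean sample.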

Classifier-Free Guidance

When context is provided, DATALUS combines unconditional and conditional noise predictions:

$$ \tilde{\boldsymbol{\epsilon}}_{\theta}(\mathbf{x}_t,\mathbf{c},t)=\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,\varnothing,t)+w\left[\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,\mathbf{c},t)-\boldsymbol{\epsilon}_{\theta}(\mathbf{x}_t,\varnothing,t)\right] $$

The implementation returns the direct denoiser prediction when context is None or cfg_scale == 1.0. Group guidance is implemented as a mask-and-scale extension over context dimensions.
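A minimal sketch of the guidance combination, assuming both noise predictions are already available as lists. `guided_eps` is a hypothetical helper name, and group guidance is not covered here.

```python
def guided_eps(eps_uncond, eps_cond, cfg_scale):
    """Combine unconditional and conditional noise predictions with weight w."""
    if eps_cond is None:
        # No context: only the unconditional prediction exists.
        return eps_uncond
    if cfg_scale == 1.0:
        # w = 1 reduces the formula to the conditional prediction.
        return eps_cond
    return [u + cfg_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

With `cfg_scale > 1` the difference between the two predictions is amplified, which is also why small INT8 quantization errors in either prediction grow with the guidance scale.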

RePaint Tabular Inpainting

For a known-value mask $\mathbf{m}$, observed coordinates are re-noised during the reverse process:

$$ \mathbf{x}_{t}^{\mathrm{known}}=\sqrt{\bar{\alpha}_t}\mathbf{x}_0^{\mathrm{known}}+\sqrt{1-\bar{\alpha}_t}\boldsymbol{\epsilon} $$

Unknown coordinates are produced by DDIM, and known coordinates are restored by latent-space fusion:

$$ \mathbf{x}_t=\mathbf{m}\odot\mathbf{x}_{t}^{\mathrm{known}}+(1-\mathbf{m})\odot\mathbf{x}_{t}^{\mathrm{generated}} $$

The implemented jump-back step reintroduces noise from from_t to to_t:

$$ \mathbf{x}_{\mathrm{jump}}=\sqrt{\frac{\bar{\alpha}_{\mathrm{to}}}{\bar{\alpha}_{\mathrm{from}}}}\mathbf{x}_{\mathrm{from}}+\sqrt{1-\frac{\bar{\alpha}_{\mathrm{to}}}{\bar{\alpha}_{\mathrm{from}}}}\boldsymbol{\epsilon} $$

make_repaint_schedule creates the reverse schedule plus forward jumps. The CLI exposes --jump-length and --jump-n-sample.
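The fusion and jump-back equations can be sketched per coordinate as below. The helper names `fuse_known` and `jump_back` are illustrative assumptions; the mask uses 1.0 for observed coordinates and 0.0 for coordinates to be generated.

```python
import math
import random


def fuse_known(x_generated, x0_known, mask, abar_t, rng):
    """Re-noise known coordinates to level t and merge them with generated ones."""
    fused = []
    for g, k, m in zip(x_generated, x0_known, mask):
        known_t = math.sqrt(abar_t) * k + math.sqrt(1 - abar_t) * rng.gauss(0.0, 1.0)
        fused.append(m * known_t + (1 - m) * g)
    return fused


def jump_back(x_from, abar_from, abar_to, rng):
    """Add noise to move from a cleaner level (from) to a noisier level (to)."""
    ratio = abar_to / abar_from
    return [math.sqrt(ratio) * x + math.sqrt(1 - ratio) * rng.gauss(0.0, 1.0)
            for x in x_from]
```

At `abar_t = 1.0` the fused vector carries the observed values unchanged, so the known fields of the decoded record are governed by the input rather than by the sampler.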

Autonomous Audit Orchestrator

flowchart TD
    A["Inputs<br/>Real train R_train, real holdout R_holdout, synthetic S, schema"]
    B["Projection<br/>Standardize numeric fields and one-hot categorical fields"]
    C["DCR<br/>Nearest-neighbor distance from each synthetic record to R_train"]
    D["Shadow-MIA<br/>Release mode: exhaustive shadows<br/>CI-lite mode: bounded k-fold shadows"]
    E["Utility<br/>TRTR baseline and TSTR model trained on synthetic data"]
    F["Report<br/>JSON metrics, thresholds, mode, pass/fail verdict"]
    G{"Publish?"}

    A --> B --> C --> F
    A --> D --> F
    A --> E --> F
    F --> G

Projection Space

The Autonomous Audit Orchestrator projects real and synthetic records into a common sklearn feature space. Numerical columns are standardized. Categorical and boolean columns are one-hot encoded with handle_unknown="ignore". Dropped columns, target columns, and columns absent from either frame are excluded from privacy projection.

Distance to Closest Record

For each synthetic record $\hat{\mathbf{x}}_i$, DATALUS computes the nearest-neighbor distance to the projected real training data:

$$ \mathrm{DCR}(\hat{\mathbf{x}}_i)=\min_{j\in\{1,\ldots,N\}}d(\hat{\mathbf{x}}_i,\mathbf{x}_j^{\mathrm{real}}) $$

The alert threshold is the configured percentile of each real record's second-nearest real neighbor distance:

$$ \tau_{\mathrm{DCR}}=\text{percentile}_{p}\left(\left\{d_2(\mathbf{x}_j^{\mathrm{real}},R_{\mathrm{train}})\right\}_{j=1}^{N}\right) $$

The memorization ratio is:

$$ \rho_{\mathrm{mem}}=\frac{1}{M}\sum_{i=1}^{M}\mathbf{1}\left[\mathrm{DCR}(\hat{\mathbf{x}}_i)<\tau_{\mathrm{DCR}}\right] $$

The default approval rule is $\rho_{\mathrm{mem}}<0.01$ with the DCR threshold percentile $p=1.0$.
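The DCR rule can be traced on a toy example. This sketch makes three assumptions that the text does not confirm: Euclidean distance, a linearly interpolated percentile, and "second-nearest neighbor" meaning the nearest other training record (each record being its own first neighbor). The implementation may differ on any of them.

```python
import math


def dcr(synthetic, real):
    """Distance from each synthetic record to its closest real record."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


def second_nn_distances(real):
    """Distance from each real record to the nearest other real record."""
    return [min(math.dist(r, o) for j, o in enumerate(real) if j != i)
            for i, r in enumerate(real)]


def percentile(values, p):
    """Linearly interpolated percentile, with p in [0, 100]."""
    xs = sorted(values)
    rank = (p / 100) * (len(xs) - 1)
    lo = math.floor(rank)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (rank - lo) * (xs[hi] - xs[lo])


def memorization_ratio(synthetic, real, p=1.0):
    tau = percentile(second_nn_distances(real), p)
    return sum(d < tau for d in dcr(synthetic, real)) / len(synthetic)


real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
synthetic = [(0.0, 0.0), (3.0, 3.0)]   # the first row copies a real record
ratio = memorization_ratio(synthetic, real)
```

Here one of two synthetic rows is an exact copy, so the ratio is 0.5 and the default rule $\rho_{\mathrm{mem}}<0.01$ would reject the release.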

Shadow-Model Membership Inference

Following the Shokri threat model, the attacker learns features that distinguish records used to train a generator from non-member records. DATALUS computes nearest-neighbor attack features for candidate records against generated records:

$$ \phi(\mathbf{x})=\left[d_{\min}(\mathbf{x},S),\overline{d}_k(\mathbf{x},S),\text{std}_k(\mathbf{x},S),\frac{d_{\min}(\mathbf{x},S)}{\max(\overline{d}_k(\mathbf{x},S),10^{-8})}\right] $$

A RandomForest attack model estimates membership scores. The central metric is attack ROC-AUC:

$$ \mathrm{AUC}_{\mathrm{MIA}}=\Pr\left(s_{\mathrm{member}}>s_{\mathrm{nonmember}}\right) $$

release mode uses the uncapped ShadowMIAConfig. ci_lite applies deterministic caps: at most two shadow models, synthetic multiplier at most 0.5, at most three neighbors, default maximum of 512 rows, at most 50 attack estimators, maximum depth 6, and minimum leaf size at least 2. ci_lite is a regression check for CI/CD, not a release audit.
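The attack feature vector and the AUC definition can be sketched as follows. Euclidean distance and the population standard deviation are assumptions, and `attack_features` and `pairwise_auc` are illustrative names rather than DATALUS functions.

```python
import math
import statistics


def attack_features(x, synthetic, k=3):
    """phi(x): min distance, mean and std of the k nearest distances, and their ratio."""
    nearest = sorted(math.dist(x, s) for s in synthetic)[:k]
    d_min = nearest[0]
    mean_k = sum(nearest) / len(nearest)
    std_k = statistics.pstdev(nearest)
    return [d_min, mean_k, std_k, d_min / max(mean_k, 1e-8)]


def pairwise_auc(member_scores, nonmember_scores):
    """Pr(s_member > s_nonmember), counting ties as one half."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            wins += 1.0 if m > n else 0.5 if m == n else 0.0
    return wins / (len(member_scores) * len(nonmember_scores))


features = attack_features((0.0, 0.0), [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
auc = pairwise_auc([0.9, 0.8], [0.1, 0.85])
```

An AUC near 0.5 means the attacker cannot separate members from non-members better than chance; values well above 0.5 indicate leakage.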

MLE-Ratio Utility

Utility is measured by Train on Synthetic, Test on Real (TSTR) against a Train on Real, Test on Real (TRTR) baseline:

$$ \mathrm{MLE}_{\mathrm{ratio,AUC}}=\frac{\mathrm{AUC}_{\mathrm{TSTR}}}{\mathrm{AUC}_{\mathrm{TRTR}}} $$

The implementation also reports:

$$ \mathrm{MLE}_{\mathrm{ratio,F1}}=\frac{\mathrm{F1}_{\mathrm{TSTR}}}{\mathrm{F1}_{\mathrm{TRTR}}} $$

The default utility approval threshold is $\mathrm{MLE}_{\mathrm{ratio,AUC}}\geq0.90$.
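As a worked example of the approval rule, the sketch below applies the ratio to made-up AUC values. The numbers are illustrative only and are not results from any DATALUS run; the helper names are hypothetical.

```python
def mle_ratio(score_tstr: float, score_trtr: float) -> float:
    """Ratio of the synthetic-trained score to the real-trained baseline."""
    if score_trtr <= 0:
        raise ValueError("TRTR baseline must be positive")
    return score_tstr / score_trtr


def utility_passes(auc_tstr: float, auc_trtr: float, threshold: float = 0.90) -> bool:
    return mle_ratio(auc_tstr, auc_trtr) >= threshold


# Illustrative values: 0.81 / 0.88 is about 0.92, above the 0.90 threshold.
accepted = utility_passes(0.81, 0.88)
# 0.70 / 0.88 is about 0.80, below the threshold.
rejected = utility_passes(0.70, 0.88)
```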

Installation and Developer Setup

Python Environment

Python 3.11 or newer is required. Use a local virtual environment rather than installing into externally managed system Python:

python -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/python -m pip install -e '.[dev]'

For lighter roles:

.venv/bin/python -m pip install -e '.[training,test]'
.venv/bin/python -m pip install -e '.[frontend]'

requirements.txt is a compatibility shim generated from pyproject.toml; the authoritative dependency source is pyproject.toml.

Frontend Component

The Streamlit shell embeds the React component from frontend/component/dist when built. If the bundle is absent, the Python wrapper points to the Vite dev server at http://localhost:5173.

cd frontend/component
npm ci
npm run test
npm run build

For interactive component development:

cd frontend/component
npm run dev

Then launch Streamlit separately:

.venv/bin/datalus streamlit

Local Verification

.venv/bin/python -m pytest -q
cd frontend/component
npm run test
npm run build

Complete CLI Cheatsheet

The Command Line Interface (CLI) is implemented with Typer in src/datalus/interfaces/cli.py. It is deliberately thin: commands translate user input into application use cases and print file locations.

End-to-End Workflow

datalus ingest raw.csv artifacts/demo/processed.parquet --schema-path artifacts/demo/schema_config.json --target-column target
datalus train artifacts/demo/schema_config.json artifacts/demo/processed.parquet artifacts/demo --epochs 5 --batch-size 2048
datalus sample artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json artifacts/demo/synthetic.parquet --n-records 10000 --ddim-steps 50 --cfg-scale 1.0
datalus augment artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json small.parquet artifacts/demo/augmented.parquet --n-records 5000
datalus balance artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json train.parquet artifacts/demo/balanced.parquet target '{"0": 5000, "1": 5000}'
datalus inpaint artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json incomplete.parquet artifacts/demo/inpainted.parquet
datalus counterfactual artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json records.parquet artifacts/demo/counterfactual.parquet '{"municipality": "3550308"}'
datalus audit real_train.parquet artifacts/demo/synthetic.parquet artifacts/demo/schema_config.json artifacts/demo/audit_report.json --target-column target --mia-mode release
datalus export-onnx artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json artifacts/demo --quantize
datalus serve artifacts --host 0.0.0.0 --port 8000

Command Summary

Command	Purpose	Primary output
`ingest`	Infer schema and stream retained data to Parquet.	`schema_config.json`, processed Parquet.
`train`	Train the diffusion model with deterministic checkpoints.	`encoder_config.json`, `checkpoints/checkpoint_latest.pt`.
`sample`	Generate an ab-initio synthetic dataset.	Synthetic Parquet.
`augment`	Append synthetic records to an existing Parquet dataset.	Augmented Parquet.
`balance`	Generate records for requested class counts.	Balanced Parquet.
`inpaint`	Fill null values using RePaint-style masks.	Inpainted Parquet.
`counterfactual`	Apply do-style interventions and regenerate compatible fields.	Counterfactual Parquet.
`audit`	Run DCR and Shadow-MIA, and utility when target is valid.	Audit JSON report.
`export-onnx`	Export EMA denoiser to ONNX and optional INT8.	ONNX files and manifest JSON.
`serve`	Serve artifacts for browser-local inference.	FastAPI service.
`streamlit`	Launch the Portuguese Streamlit UI.	Streamlit service.

Command Arguments and Defaults

Command	Positional arguments	Options and defaults	Expected output
`ingest`	`input_path`, `output_path`	`--schema-path artifacts/schema_config.json`, `--target-column None`	Prints schema and processed Parquet paths. Writes Snappy Parquet and schema metadata.
`train`	`schema_path`, `data_path`, `output_dir`	`--epochs 1`, `--batch-size 2048`, `--max-steps None`, `--resume-from None`	Prints checkpoint path. Writes `encoder_config.json` and checkpoints under `output_dir/checkpoints`.
`sample`	`checkpoint_path`, `encoder_path`, `output_path`	`--n-records 100`, `--ddim-steps 50`, `--seed 42`, `--cfg-scale 1.0`	Writes synthetic Parquet with Snappy compression.
`augment`	`checkpoint_path`, `encoder_path`, `input_path`, `output_path`	`--n-records 100`, `--ddim-steps 50`, `--seed 42`, `--cfg-scale 1.0`	Writes original rows plus synthetic rows selected to original columns.
`balance`	`checkpoint_path`, `encoder_path`, `input_path`, `output_path`, `target_column`, `target_distribution_json`	`--ddim-steps 50`, `--seed 42`, `--cfg-scale 1.0`, `--max-attempts 10`, `--strict False`	Writes Parquet approaching requested class counts. Raises if `--strict` and attempts are exhausted.
`inpaint`	`checkpoint_path`, `encoder_path`, `input_path`, `output_path`	`--ddim-steps 50`, `--jump-length 10`, `--jump-n-sample 10`, `--seed 42`	Writes Parquet with null-driven latent fields imputed.
`counterfactual`	`checkpoint_path`, `encoder_path`, `input_path`, `output_path`, `intervention_json`	`--ddim-steps 50`, `--seed 42`	Writes Parquet under fixed intervention columns.
`audit`	`real_train_path`, `synthetic_path`, `schema_path`, `report_path`	`--target-column None`, `--real-holdout-path None`, `--mia-mode release`, `--max-audit-rows None`	Writes privacy JSON and utility JSON when target exists in both datasets.
`export-onnx`	`checkpoint_path`, `encoder_path`, `output_dir`	`--quantize True`	Writes `model_fp32.onnx`, optional `model_int8.onnx`, `encoder_config.json`, `projector_config.json`, `manifest.json`.
`serve`	`registry_path` default `artifacts`	`--host 0.0.0.0`, `--port 8000`	Starts Uvicorn factory app with `DATALUS_REGISTRY_PATH`.
`streamlit`	None	None	Runs `streamlit run frontend/streamlit/app.py`.

Operational Notes by Command

ingest supports .csv, .tsv, .csv.gz, .tsv.gz, .parquet, and .orc. ORC requires pyarrow.
ingest uses delimiter sniffing for CSV, semicolon fallback for ambiguous Brazilian spreadsheet exports, utf8-lossy decoding, infer_schema_length=10000, ignore_errors=True, and truncate_ragged_lines=True.
ingest drops sparse columns with a null ratio above 0.95, columns with identifier-like names (CPF, CNPJ, CNS, email, and phone-like fields), free-text columns, and columns with unsupported dtypes.
The underlying ZeroShotPreprocessor defaults are high_cardinality_threshold=50, sample_size=100000, rare_category_threshold=5, and null tokens "", NA, N/A, null, NULL, and None. The target column is protected from the sparse and identifier drop rules.
train currently exposes a subset of TrainingConfig on the CLI. Learning rate, weight decay, hidden dimensions, AMP, EMA, warmup, and maximum encoder fit rows are configured in TrainingConfig for programmatic use.
balance treats JSON class labels as strings during matching, so numeric class labels should be represented as JSON object keys such as {"0": 5000}.
counterfactual interventions must reference retained columns from the fitted encoder. Unknown categorical values map to __UNKNOWN__.
audit reads Parquet eagerly. For large official release audits, run on a machine sized for the projected one-hot matrix.
serve disables server-side PyTorch generation by default. It is intended to serve public artifacts to the browser-local ONNX runtime.
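The string-key note for balance follows from how JSON works: object keys are always strings after parsing, so integer class labels have to be compared in string form. A minimal sketch, using a hypothetical helper `missing_per_class` that is not part of the DATALUS CLI:

```python
import json
from collections import Counter


def missing_per_class(labels, target_distribution_json):
    """Rows still needed per class to reach the requested counts."""
    target = json.loads(target_distribution_json)      # keys are always str
    observed = Counter(str(label) for label in labels)  # normalize labels to str
    return {cls: max(int(count) - observed.get(cls, 0), 0)
            for cls, count in target.items()}


needed = missing_per_class([0, 0, 0, 1], '{"0": 5, "1": 5}')
```

Without the `str(label)` normalization, the integer label `0` would never match the parsed key `"0"` and every class would appear empty.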

Training Lifecycle and Colab Constraints

Lifecycle

ingest creates a retained Parquet dataset and schema_config.json.
DatalusTrainer loads schema metadata and builds deterministic Parquet batch offsets.
TabularEncoder fits numeric quantile transforms and categorical vocabularies on up to max_encoder_fit_rows=100000.
FeatureProjector concatenates numeric latent slices and categorical embedding slices.
TabularDenoiserMLP predicts diffusion noise over the full latent vector.
TabularDiffusion.compute_loss samples random timesteps and optimizes the MSE epsilon objective.
AdamW updates diffusion and projector parameters.
Linear warmup is followed by cosine annealing to eta_min=1e-6.
AMP GradScaler is used on CUDA when amp=True.
Gradients are clipped to max_grad_norm=1.0.
EMA tracks diffusion parameters with decay 0.9999.
Checkpoints persist training state for deterministic resume.
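The learning-rate and EMA steps of the lifecycle can be sketched as two small functions. The exact warmup shape and the point at which annealing ends are assumptions; only the defaults (`2e-4`, 500 warmup steps, `eta_min=1e-6`, decay `0.9999`) come from this document.

```python
import math


def lr_at_step(step: int, total_steps: int, base_lr: float = 2e-4,
               warmup_steps: int = 500, eta_min: float = 1e-6) -> float:
    """Linear warmup to base_lr, then cosine annealing down to eta_min."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    progress = min(progress, 1.0)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * progress))


def ema_update(shadow: float, param: float, decay: float = 0.9999) -> float:
    """Exponential moving average of one parameter value."""
    return decay * shadow + (1 - decay) * param
```

With decay 0.9999 the shadow weights average over roughly the last 10,000 steps, so very short runs leave them close to their initial values.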

Default Training Configuration

Parameter	Default
`batch_size`	`2048`
`epochs`	`1`
`learning_rate`	`2e-4`
`weight_decay`	`1e-4`
`checkpoint_every_steps`	`500`
`seed`	`42`
`num_timesteps`	`1000`
`hidden_dims`	`(512, 1024, 1024, 512)`
`amp`	`True`
`condition_dropout`	`0.1`
`ema_decay`	`0.9999`
`warmup_steps`	`500`
`max_grad_norm`	`1.0`
`max_encoder_fit_rows`	`100000`

Deterministic Checkpointing

Checkpoints are written atomically where possible. Each checkpoint includes:

Diffusion model state.
Feature projector state.
Optimizer state.
Scheduler state.
AMP scaler state.
EMA shadow weights.
Epoch, batch index, global step, latest loss, and loss history.
Training config and SHA-256 config hash.
Python, NumPy, Torch, and CUDA RNG states when CUDA is available.
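Two of the items above, the config hash and the atomic write, can be sketched with the standard library. The canonical JSON form and the temp-file-then-rename strategy are assumptions about how such a checkpoint writer is typically built, not a description of the DATALUS code.

```python
import hashlib
import json
import os
import tempfile


def config_hash(config: dict) -> str:
    """SHA-256 of a canonical JSON form, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def atomic_write_bytes(path: str, payload: bytes) -> None:
    """Write to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A reader that opens `checkpoint_latest.pt` therefore sees either the previous complete file or the new complete file, never a partial write from an interrupted session.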

Resume with:

datalus train artifacts/demo/schema_config.json artifacts/demo/processed.parquet artifacts/demo --epochs 20 --batch-size 1024 --resume-from artifacts/demo/checkpoints/checkpoint_latest.pt

Google Colab T4 Guidance

Colab T4 sessions are useful for proof-of-concept training but should be treated as preemptible. Store artifacts on Google Drive:

from google.colab import drive
drive.mount('/content/drive')

Recommended Drive layout:

/content/drive/MyDrive/datalus/
  raw/
  processed/
  artifacts/
    datasus_sih/
      schema_config.json
      encoder_config.json
      checkpoints/

Install and smoke test:

python -m pip install --upgrade pip
python -m pip install -e '.[training,test]'
datalus train /content/drive/MyDrive/datalus/artifacts/datasus_sih/schema_config.json /content/drive/MyDrive/datalus/processed/train.parquet /content/drive/MyDrive/datalus/artifacts/datasus_sih --epochs 1 --batch-size 2048 --max-steps 20

Batch-size tuning for T4:

Symptom	Action
CUDA OOM before first checkpoint	Retry with `--batch-size 1024`.
OOM after several steps	Resume from `checkpoint_latest.pt` and reduce to `512`.
OOM with very wide categorical embedding space	Reduce to `256`, reduce retained high-cardinality columns upstream, or train on a larger GPU.
Session interruption	Resume from the Drive checkpoint path.

Learning-rate defaults are conservative for T4. If using much smaller batches, keep 2e-4 for initial experiments and compare loss curves before changing optimizer settings.

Inference and Model Architecture Details

Numerical Quantile Transformations

Each numerical column fits up to 1,000 empirical quantiles. Transform maps finite values into the unit quantile domain and then into [-1, 1]. Non-finite values use the training median fill value. Inverse transform clips generated values back to [0, 1] in quantile space and interpolates over the stored quantile table.
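A minimal sketch of the forward and inverse quantile mapping, assuming linear interpolation between stored quantiles. The function names are illustrative, and the median fill for non-finite values is omitted for brevity.

```python
from bisect import bisect_right


def fit_quantiles(values, n_quantiles=1000):
    """Store up to n_quantiles evenly spaced empirical quantiles."""
    xs = sorted(values)
    n = min(n_quantiles, len(xs))
    if n == 1:
        return [xs[0]]
    return [xs[round(i * (len(xs) - 1) / (n - 1))] for i in range(n)]


def to_latent(value, table):
    """Map a value to its quantile position u in [0, 1], then to [-1, 1]."""
    if value <= table[0]:
        u = 0.0
    elif value >= table[-1]:
        u = 1.0
    else:
        hi = bisect_right(table, value)   # table[hi - 1] <= value < table[hi]
        lo = hi - 1
        frac = (value - table[lo]) / (table[hi] - table[lo])
        u = (lo + frac) / (len(table) - 1)
    return 2.0 * u - 1.0


def from_latent(z, table):
    """Clip to [0, 1] in quantile space and interpolate over the table."""
    u = min(max((z + 1.0) / 2.0, 0.0), 1.0)
    pos = u * (len(table) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(table) - 1)
    return table[lo] + (pos - lo) * (table[hi] - table[lo])


table = fit_quantiles([1.0, 2.0, 3.0, 4.0, 5.0])
```

The clip in `from_latent` means generated values can never fall outside the observed training range of the column.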

Categorical Vocabulary and Embeddings

Each categorical column stores:

__UNKNOWN__ at index 0.
__NULL__ at index 1.
All observed categories sorted after the sentinels.
Frequency metadata and rare-category counts.

Observed rare categories remain first-class tokens. During inference, never-seen categories map to __UNKNOWN__. The projector embeds each categorical column with the default dimension ceil(log2(cardinality)), bounded below by 2.
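The vocabulary layout and default embedding width can be sketched as below. Whether the cardinality passed to the width formula counts the two sentinels is an assumption, and the helper names are illustrative.

```python
import math
from collections import Counter

UNKNOWN, NULL = "__UNKNOWN__", "__NULL__"


def build_vocabulary(values):
    """Sentinels first, then every observed category in sorted order."""
    counts = Counter(v for v in values if v is not None)
    tokens = [UNKNOWN, NULL] + sorted(counts)
    return {token: index for index, token in enumerate(tokens)}, dict(counts)


def encode(value, vocab):
    """Nulls map to __NULL__; categories never seen in training map to __UNKNOWN__."""
    if value is None:
        return vocab[NULL]
    return vocab.get(value, vocab[UNKNOWN])


def embedding_dim(cardinality: int) -> int:
    return max(2, math.ceil(math.log2(cardinality)))


vocab, frequencies = build_vocabulary(["SP", "RJ", "SP", None, "AC"])
```

A category observed only once, such as `"AC"` here, still receives its own index; only a value absent from training, such as `"MG"`, falls back to index 0.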

Residual MLP Denoiser

The implemented TabularDenoiserMLP topology is:

Sinusoidal timestep embedding with default dimension 128.
MLP time projection to dim_t * 4.
Optional context projection when context_dim is configured.
Input projection from latent dimension to the first hidden dimension.
Residual MLP blocks with Linear, LayerNorm, timestep injection, SiLU, Dropout, Linear, LayerNorm, and residual projection when dimensions differ.
Final LayerNorm, SiLU, and Linear projection back to the latent dimension.
Zero initialization of final Linear weights and bias.
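The first item in the topology, the sinusoidal timestep embedding, can be sketched in a few lines. The sine-then-cosine ordering and the `max_period` of 10000 are common conventions assumed here, not details confirmed by this document.

```python
import math


def timestep_embedding(t: int, dim: int = 128,
                       max_period: float = 10000.0) -> list[float]:
    """Return dim values: sin(t * f_i) for each frequency, then cos(t * f_i)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


embedding = timestep_embedding(0)
```

Zero initialization of the final Linear layer means an untrained denoiser predicts zero noise for every timestep regardless of this embedding, which gives optimization a stable starting point.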

Python Inference Lifecycle

load_model_bundle reconstructs the encoder, projector, denoiser, and diffusion wrapper from a checkpoint. EMA weights are loaded only when use_ema=True, which is the setting used by ONNX export. Sampling creates Gaussian latent noise, runs DDIM, splits the latent tensor into numerical and categorical slices, decodes numerical values by inverse quantile interpolation, and decodes categories by nearest learned embedding.

Browser Inference Lifecycle

The React component operates without server-side PyTorch:

Streamlit passes schema, encoder, projector, manifest, seed, row count, DDIM steps, and precision choice.
The component downloads model_int8.onnx or model_fp32.onnx from the FastAPI artifact endpoint.
The browser Cache API stores the ONNX bytes under datalus-onnx-artifacts.
ONNX Runtime Web creates a WASM session with graph optimization enabled.
TypeScript initializes deterministic Gaussian noise with a seeded linear congruential generator and Box-Muller transform.
DDIM runs in the browser by repeatedly invoking the ONNX denoiser.
Numerical and categorical decoding uses encoder_config.json and projector_config.json.
Generated records are returned to Streamlit through streamlit-component-lib.
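The seeded noise step can be illustrated in Python, although the component itself is written in TypeScript. The LCG constants below are the widely used Numerical Recipes values and are an assumption; the component's actual constants are not given in this document.

```python
import math


class SeededGaussian:
    """Deterministic standard-normal generator: 32-bit LCG plus Box-Muller."""

    def __init__(self, seed: int) -> None:
        self.state = seed % 2 ** 32

    def uniform(self) -> float:
        # LCG step; constants are assumed, not taken from the component.
        self.state = (1664525 * self.state + 1013904223) % 2 ** 32
        # Shift into the open interval (0, 1) so log() never sees zero.
        return (self.state + 1) / (2 ** 32 + 1)

    def gauss(self) -> float:
        u1, u2 = self.uniform(), self.uniform()
        return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


first = [SeededGaussian(42).gauss() for _ in range(3)]
```

Each call to `SeededGaussian(42)` restarts the sequence, so the three values in `first` are identical; reusing one generator instance yields distinct draws.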

Browser-side options currently include precision, which selects the artifact filename (model_int8.onnx or model_fp32.onnx). The guidanceScale argument is passed through the component interface for forward compatibility, but the exported ONNX wrapper currently has no context input.

ONNX and INT8 Artifacts

export-onnx writes:

artifacts/<domain>/
  model_fp32.onnx
  model_int8.onnx
  encoder_config.json
  projector_config.json
  manifest.json

The ONNX graph uses inputs x_t and timestep, output predicted_noise, opset 17, and dynamic batch axes. INT8 uses ONNX Runtime dynamic quantization with QInt8 weights. The manifest includes FP32 parity and INT8 CFG amplification parity:

{
  "cfg_scale": 3.0,
  "amplified_max_abs_diff": 0.012,
  "categorical_agreement": null,
  "passed": true,
  "atol": 0.2
}

Treat passed: false as a release blocker for INT8 artifacts. Use FP32 or retrain/re-export before publishing edge artifacts.
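A release script could apply this rule mechanically. The sketch below operates on the parity object shown above; where that object sits inside manifest.json is not specified here, so the helper takes it directly, and `choose_precision` is a hypothetical name.

```python
import json


def choose_precision(parity: dict) -> str:
    """Publish INT8 only when the parity guard explicitly passed."""
    return "int8" if parity.get("passed") is True else "fp32"


parity = json.loads("""
{
  "cfg_scale": 3.0,
  "amplified_max_abs_diff": 0.012,
  "categorical_agreement": null,
  "passed": true,
  "atol": 0.2
}
""")
precision = choose_precision(parity)
```

A missing or null `passed` field falls back to FP32, which keeps the default on the safe side when a manifest is incomplete.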

FastAPI Artifact Service

Artifact serving is enabled by default. Server-side PyTorch generation is disabled unless create_app(..., enable_server_generation=True) is used programmatically.

datalus serve artifacts --host 0.0.0.0 --port 8000

Core endpoints:

Endpoint	Method	Purpose
`/health`	`GET`	Service status, uptime, registry path.
`/artifacts`	`GET`	Lists artifact domains under the registry.
`/artifacts/{domain}/manifest`	`GET`	Returns `manifest.json`.
`/artifacts/{domain}/schema`	`GET`	Returns `schema_config.json`.
`/artifacts/{domain}/{file_name}`	`GET`	Serves approved public artifact files.
`/audit/latest`	`GET`	Returns the newest `audit_report.json`.
`/generate`	`POST`	Server-side generation when explicitly enabled.
`/augment`	`POST`	Server-side augmentation when explicitly enabled.
`/balance`	`POST`	Server-side balancing when explicitly enabled.
`/inpaint`	`POST`	Server-side inpainting when explicitly enabled.
`/counterfactual`	`POST`	Server-side counterfactual generation when explicitly enabled.

Allowed public artifact files are model_fp32.onnx, model_fp16.onnx, model_int8.onnx, schema_config.json, encoder_config.json, projector_config.json, model_config.json, audit_report.json, and manifest.json. Domain path traversal is rejected. CORS currently allows all origins with GET and POST, which is convenient for local artifact demos and must be tightened by the production reverse proxy or deployment boundary.
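The allow-list and traversal rule can be sketched as a pure function. The file names come from the list above; the specific checks for separators and dot segments are an assumption about how such a guard is usually written, and `is_servable` is not a DATALUS function.

```python
ALLOWED_FILES = frozenset({
    "model_fp32.onnx", "model_fp16.onnx", "model_int8.onnx",
    "schema_config.json", "encoder_config.json", "projector_config.json",
    "model_config.json", "audit_report.json", "manifest.json",
})


def is_servable(domain: str, file_name: str) -> bool:
    """Reject path-like components, then require an allow-listed file name."""
    for part in (domain, file_name):
        if not part or part in {".", ".."} or "/" in part or "\\" in part:
            return False
    return file_name in ALLOWED_FILES
```

Checking the file name against a fixed set, rather than against what exists on disk, keeps checkpoints and raw data unreachable even if they are placed in the registry directory by mistake.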

Example request contract:

{
  "domain": "datasus_sih",
  "n_records": 1000,
  "ddim_steps": 50,
  "seed": 42,
  "cfg_scale": 1.0
}

If /generate returns 403, the service is operating in the intended artifact-serving mode. Use browser ONNX inference or construct the app with server-side generation enabled in a trusted internal environment.

Docker Deployment

The repository includes a two-service Compose deployment:

api: FastAPI artifact service built from docker/Dockerfile.api.
streamlit: Streamlit UI built from docker/Dockerfile.streamlit, including Node/NPM component build.

Expected local artifact layout:

artifacts/
  datasus_sih/
    manifest.json
    schema_config.json
    encoder_config.json
    projector_config.json
    model_fp32.onnx
    model_int8.onnx
    audit_report.json

Start both services:

docker compose up --build

Port mappings and volumes:

| Service | Container command | Port mapping | Artifact volume |
| --- | --- | --- | --- |
| `api` | `uvicorn datalus.interfaces.api:app --host 0.0.0.0 --port 8000` | `8000:8000` | `./artifacts:/app/artifacts:ro` |
| `streamlit` | `streamlit run frontend/streamlit/app.py --server.address=0.0.0.0 --server.port=8501` | `8501:8501` | `./artifacts:/app/artifacts:ro` |

Environment variables:

| Variable | Service | Value in Compose |
| --- | --- | --- |
| `DATALUS_REGISTRY_PATH` | `api`, `streamlit` | `/app/artifacts` |
| `DATALUS_ARTIFACT_BASE_URL` | `streamlit` | `http://localhost:8000/artifacts` |

Production security note: although the API container listens on 0.0.0.0:8000 and Streamlit listens on 0.0.0.0:8501, real public-sector production deployments must place both containers behind a reverse proxy such as NGINX or Traefik with HTTPS/TLS termination, access controls, request logging, and network segmentation. Synthetic sensitive-data artifacts still require controlled distribution and transport encryption.

CI/CD and Automated Testing

GitHub Actions defines three jobs:

| Job | Runtime | Commands |
| --- | --- | --- |
| `python-tests` | Ubuntu, Python 3.11, CPU Torch | `pip install ".[training,test]"`, `pytest`. |
| `frontend-build` | Ubuntu, Node 20 | `npm install`, `npm run test`, `npm run build`. |
| `docker-build` | Ubuntu Docker | Build API and Streamlit images. |

Dependabot is configured for weekly devcontainer updates. The devcontainer uses Debian with Python, Node, and Docker-outside-of-Docker features.

Current tests cover:

Diffusion schedules and RePaint shape/mask invariants.
Deterministic RNG state roundtrip.
Lazy preprocessing, identifier dropping, rare-category preservation, and reversible encoding.
DCR and Shadow-MIA report structure.
ci_lite deterministic runtime caps.
FastAPI artifact serving and path traversal rejection.
ONNX export, INT8 quantization, and CFG amplification parity guard.

Data Governance, Ethics, and LGPD Alignment

DATALUS reduces disclosure risk through generative synthesis and empirical audit, but it is not a legal declaration that data is anonymized under every context. Release decisions must remain accountable to institutional governance, LGPD interpretation, and domain-specific risk review.

Operational governance requirements:

Do not commit raw datasets, processed Parquet, checkpoints, ONNX files, or generated artifacts. .gitignore excludes common artifact paths and file extensions.
Remove direct identifiers before training and review quasi-identifiers in the schema report.
Preserve rare categories intentionally, then evaluate whether rare generated combinations create re-identification risk.
Publish synthetic data only with an audit report, schema metadata, source-data provenance, generation configuration, and limitations.
Treat OAA release mode as the release evidence path. Treat ci_lite only as regression protection.
Keep access logs and artifact versions for every public-sector release.
Apply HTTPS/TLS and access control to artifact services even when artifacts are synthetic.

Troubleshooting and FAQ

Polars ingestion runs out of memory

Prefer Parquet or ORC when available. For CSV, DATALUS already scans lazily and sinks to Parquet, but very wide schemas or expensive schema inference can still exhaust memory. Remove free-text and known identifier columns upstream, split very large CSVs by year or region, and use the Python ZeroShotPreprocessor(sample_size=...) API with a smaller deterministic sample if schema inference itself needs too much memory.

Training fails with CUDA OOM on Colab T4

Retry with lower batch sizes in this order: 1024, 512, 256. Resume from checkpoint_latest.pt rather than restarting. Avoid increasing hidden dimensions on T4. Very high-cardinality categorical columns increase embedding width; consider governance-driven column reduction before training.
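
The retry order above can be sketched as a small wrapper. `train_with_fallback` and `train_fn` are hypothetical names, not part of the DATALUS API. The sketch assumes the training callable raises a `RuntimeError` whose message contains "out of memory", as PyTorch CUDA OOM errors do, and that the callable itself resumes from checkpoint_latest.pt when one exists.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def train_with_fallback(
    train_fn: Callable[[int], T],
    batch_sizes: Iterable[int] = (1024, 512, 256),
) -> T:
    """Call train_fn with each batch size in order until one fits in memory."""
    last_error = None
    for batch_size in batch_sizes:
        try:
            return train_fn(batch_size)
        except RuntimeError as exc:
            # Only swallow out-of-memory failures; re-raise anything else.
            if "out of memory" not in str(exc).lower():
                raise
            last_error = exc
    raise RuntimeError("training ran out of memory at every batch size") from last_error
```

Errors unrelated to memory propagate immediately, so a schema or data bug is not hidden behind repeated retries.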

ONNX INT8 CFG parity fails

Open manifest.json and inspect int8_cfg_parity. If amplified_max_abs_diff exceeds 0.2, do not publish the INT8 artifact. Use model_fp32.onnx, retrain and re-export, or disable quantization with --no-quantize until parity is acceptable.
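
The parity check can be automated as a release gate. The sketch below assumes that `amplified_max_abs_diff` is stored as a field inside the `int8_cfg_parity` object of manifest.json; that nesting is an assumption, so adjust the lookup if the manifest is laid out differently. The function names are hypothetical.

```python
import json
from pathlib import Path

# Threshold quoted above: do not publish INT8 when the difference exceeds it.
PARITY_THRESHOLD = 0.2


def load_manifest(path: str) -> dict:
    """Read manifest.json from disk."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def int8_publishable(manifest: dict, threshold: float = PARITY_THRESHOLD) -> bool:
    """Return True only if a parity value is recorded and within the threshold."""
    parity = manifest.get("int8_cfg_parity") or {}
    diff = parity.get("amplified_max_abs_diff")
    if diff is None:
        # Missing evidence is treated as a failure, not as a pass.
        return False
    return float(diff) <= threshold
```

Treating a missing value as a failure keeps an artifact exported with --no-quantize, or with an incomplete manifest, from being published as INT8 by accident.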

OAA `ci_lite` fails in GitHub Actions

Check that real and synthetic Parquet files share retained schema columns, have at least eight usable rows, and contain enough class variation for the requested utility target. Add --max-audit-rows 512 for bounded regression runs. A ci_lite failure indicates a regression or test fixture issue; it does not replace a full release audit.

Counterfactual generation reports schema or column problems

Intervention keys must be retained columns from encoder_config.json. Dropped identifiers and unsupported columns cannot be intervened on. Numeric interventions should be parseable as numbers. Categorical interventions not observed during encoder fitting map to __UNKNOWN__, which can produce less meaningful counterfactuals.
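
These rules can be checked before a counterfactual request is sent. The sketch below is illustrative only: `check_interventions` is a hypothetical helper, and it takes the retained columns, numeric columns, and categorical vocabularies as plain arguments because the internal layout of encoder_config.json is not described here.

```python
from typing import Any, Iterable, Mapping


def check_interventions(
    interventions: Mapping[str, Any],
    retained_columns: Iterable[str],
    numeric_columns: Iterable[str],
    vocabularies: Mapping[str, Iterable[Any]],
) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    retained = set(retained_columns)
    numeric = set(numeric_columns)
    problems = []
    for column, value in interventions.items():
        if column not in retained:
            problems.append(f"{column}: not a retained column, cannot be intervened on")
        elif column in numeric:
            try:
                float(value)
            except (TypeError, ValueError):
                problems.append(f"{column}: {value!r} is not parseable as a number")
        elif value not in tuple(vocabularies.get(column, ())):
            problems.append(
                f"{column}: {value!r} was not seen during fitting and would map to __UNKNOWN__"
            )
    return problems
```

The unseen-category case is reported as a problem rather than raised as an error, since generation would still run but with a less meaningful counterfactual.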

`/generate` returns `403`

This is expected for the default API. Server-side generation is disabled to keep deployment free of server-side PyTorch dependencies. Use the browser ONNX component or explicitly construct create_app(enable_server_generation=True) for trusted internal deployments.

Streamlit cannot generate in the browser

Verify each of the following: the React component was built with npm run build; the API is reachable at DATALUS_ARTIFACT_BASE_URL; the selected domain contains manifest.json, encoder_config.json, projector_config.json, and the chosen ONNX file; and the Docker artifact mount points to ./artifacts.

Audit utility metrics are missing

datalus audit adds utility metrics only when --target-column is provided and the target exists in both real and synthetic frames. Privacy metrics still run without a target column.

References

License

DATALUS is released under the Apache License 2.0.

Citation

@software{Silva_DATALUS_Diffusion-Augmented_Tabular,
  author = {Silva, Emanuel Lázaro Custódio},
  license = {Apache-2.0},
  title = {{DATALUS: Diffusion-Augmented Tabular Architecture for Local Utility and Security}},
  url = {https://github.com/emanuellcs/datalus}
}
