🇧🇷 Attention, Judging Committee of the 32nd Prêmio Jovem Cientista: The official documentation, prepared with the scientific rigor required by the call for entries and detailing the impact on the LGPD and on public policy, is in the file README_pt-BR.md.
DATALUS is a production-oriented Generative AI framework for synthetic tabular data. It is designed for high-dimensional, heterogeneous, privacy-sensitive government datasets, with a specific proof-of-concept path for Brazilian public-sector health data. The system learns a joint distribution over tabular records, samples new microdata from that distribution, and subjects generated artifacts to reproducible privacy and utility audits before release.
DATALUS is not an anonymization script. It is a generative ecosystem for ab-initio synthesis, data augmentation, minority-class balancing, tabular inpainting, counterfactual modification, audit automation, ONNX export, INT8 edge inference, FastAPI artifact serving, Streamlit operation, and browser-local execution through ONNX Runtime Web.
- Research and Public Data Context
- System Requirements
- Architecture
- Generative Capabilities
- Mathematical Foundation
- Autonomous Audit Orchestrator
- Installation and Developer Setup
- Complete CLI Cheatsheet
- Training Lifecycle and Colab Constraints
- Inference and Model Architecture Details
- FastAPI Artifact Service
- Docker Deployment
- CI/CD and Automated Testing
- Data Governance, Ethics, and LGPD Alignment
- Troubleshooting and FAQ
- References
- License
- Citation
Brazil's open-data policy is coordinated through the Infraestrutura Nacional de Dados Abertos (INDA), and dados.gov.br is the central catalog for public datasets. Government sources publish data across CSV, JSON, XML, ODS, RDF, APIs, Parquet-like analytic exports, and sector-specific repositories. The practical result is schema heterogeneity: inconsistent delimiters, lossy encodings, sparse columns, high-cardinality codes, rare municipalities, rare diseases, changing field names, and mixed numeric/string representations.
DATALUS implements ingestion and encoding policies for that reality:
- Lazy Polars scans avoid full in-memory Pandas loading.
- CSV scans detect common delimiters and use lossy UTF-8 decoding for legacy encodings.
- Identifier-like fields are removed before modeling.
- Sparse and free-text columns are rejected by explicit policy.
- Observed rare categories are preserved as first-class tokens; only unseen inference-time values map to `__UNKNOWN__`.
- Category frequency metadata is serialized so downstream audits can detect long-tail collapse.
Primary public-data references:
- Brazilian open-data policy and dados.gov.br catalog: Governo Digital Dados Abertos, Portal Brasileiro de Dados Abertos, API Portal de Dados Abertos.
- Tabular diffusion: Kotelnikov et al., TabDDPM.
- RePaint inpainting: Lugmayr et al.
- Classifier-Free Guidance: Ho and Salimans.
- DDIM sampling: Song et al.
- Membership inference attacks: Shokri et al.
| Layer | Minimum | Recommended | Notes |
|---|---|---|---|
| Python | 3.11 | 3.11 or newer | The package metadata declares requires-python >=3.11. |
| Training GPU | CPU works for tests | NVIDIA T4 15 GB VRAM or better | Colab T4 is the target constrained GPU profile. |
| Training RAM | 8 GB | 16 GB or more | Lazy ingestion helps, but encoding and audit projection need memory. |
| Browser inference | Modern Chromium, Firefox, or Edge | Browser with WebAssembly and Cache API | The React component uses onnxruntime-web WASM locally. |
| Node.js | 20 in CI | 20 LTS | Frontend CI uses actions/setup-node@v4 with Node 20. |
| Docker | Compose v2 | Docker Engine with Compose plugin | Compose starts the API and Streamlit containers. |
Optional dependency groups are declared in pyproject.toml:
| Extra | Purpose |
|---|---|
| `training` | PyTorch, ONNX, ONNX Runtime, ONNX Script. |
| `test` | Pytest and HTTPX for API tests. |
| `frontend` | Streamlit runtime. |
| `audit` | LightGBM and CatBoost for heavier audit experiments. |
| `dev` | Full local development stack. |
The codebase uses a strict src/ layout and Clean Architecture boundaries:
src/datalus/
domain/ Framework-free schemas and diffusion schedule math
infrastructure/ Polars, PyTorch, ONNX, checkpointing, encoding adapters
application/ Training, inference, audit, and export use cases
interfaces/ Typer CLI and FastAPI delivery adapters
frontend/
streamlit/ Python Streamlit shell
component/ React TypeScript ONNX Runtime Web component
tests/ Unit and integration tests
docker/ API and Streamlit Dockerfiles
.github/workflows/ CI jobs for Python, frontend, and Docker builds
flowchart TD
A["Government Tabular Sources<br/>CSV, TSV, CSV.GZ, Parquet, ORC, APIs"]
B["Layer 1: Zero-Shot Ingestion<br/>Polars LazyFrame, delimiter detection, identifier removal"]
C["Layer 2: Heterogeneous Encoding<br/>Quantile numeric transforms, categorical vocabularies, rare-category preservation"]
D["Layer 3: Hybrid Diffusion Engine<br/>Residual MLP denoiser, cosine schedule, DDIM sampler"]
E["Layer 4: Generative Operations<br/>Ab-initio generation, augmentation, balancing, inpainting, counterfactuals"]
F["Layer 5: Autonomous Audit Orchestrator<br/>DCR, Shadow-MIA, TSTR/TRTR, MLE-ratio"]
G["Layer 6: Edge Deployment<br/>ONNX, INT8 PTQ, FastAPI, Streamlit, React WASM"]
A --> B --> C --> D --> E --> F --> G
| Layer | Source path | Responsibility |
|---|---|---|
| Domain | `src/datalus/domain` | Pydantic contracts, diffusion schedule math, RePaint config, privacy thresholds. |
| Infrastructure | `src/datalus/infrastructure` | Polars scanning, reversible encoders, PyTorch networks, diffusion tensors, checkpointing, ONNX export. |
| Application | `src/datalus/application` | Training, sampling, augmentation, balancing, inpainting, counterfactuals, auditing, artifact export. |
| Interfaces | `src/datalus/interfaces` | Typer CLI and FastAPI app. |
| Frontend | `frontend` | Streamlit shell and React ONNX Runtime Web component. |
DATALUS exposes distinct workflows because synthetic data systems have different operational goals:
| Capability | Purpose | CLI |
|---|---|---|
| Ab-initio generation | Create a new synthetic dataset from learned distributions. | datalus sample |
| Data augmentation | Append synthetic rows to a small dataset. | datalus augment |
| Minority balancing | Generate records until target-class counts approach a requested distribution. | datalus balance |
| Tabular inpainting | Fill missing values while preserving observed fields at every reverse step. | datalus inpaint |
| Counterfactual modification | Apply column interventions and regenerate compatible records. | datalus counterfactual |
| Audit | Evaluate empirical privacy and predictive utility before release. | datalus audit |
| Edge export | Export EMA weights to ONNX and optional INT8. | datalus export-onnx |
| Artifact serving | Serve registry artifacts for browser-local inference. | datalus serve |
The current denoiser exposes CFG-compatible inference logic. In the default training path, models are instantiated without a context vector, so cfg_scale=1.0 is the unconditional path and changing cfg_scale has no effect unless a context-enabled denoiser is introduced. The ONNX export path still records an INT8 CFG amplification parity guard at cfg_scale=3.0 because quantization drift can become operationally relevant when guidance is enabled.
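The domain layer implements the Nichol-Dhariwal cosine schedule with numerical clipping. For training horizon $T$ and a small offset $s$, define:

$$f(t) = \cos^2\!\left(\frac{t/T + s}{1 + s}\cdot\frac{\pi}{2}\right), \qquad t = 0, \dots, T.$$

The normalized cumulative product is:

$$\bar{\alpha}_t = \frac{f(t)}{f(0)}.$$

The beta schedule is:

$$\beta_t = \operatorname{clip}\!\left(1 - \frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}},\ \beta_{\min},\ \beta_{\max}\right),$$

where $\beta_{\min}$ and $\beta_{\max}$ are the clipping bounds.

VarianceSchedule converts these values into tensors for the forward and reverse diffusion operations.

For a latent tabular vector $x_0$, the forward process is:

$$q(x_t \mid x_0) = \mathcal{N}\!\left(\sqrt{\bar{\alpha}_t}\,x_0,\ (1 - \bar{\alpha}_t)\,I\right).$$

With $\epsilon \sim \mathcal{N}(0, I)$, the implemented q_sample tensor operation is:

$$x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon.$$

The denoiser $\epsilon_\theta(x_t, t)$ predicts the noise $\epsilon$ from the noised latent and the timestep.

DATALUS implements the simplified epsilon-prediction objective used by TabularDiffusion.compute_loss:

$$\mathcal{L}_{\text{simple}} = \mathbb{E}_{x_0,\,t,\,\epsilon}\left[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2\right].$$

For categorical-logit extensions, the intended TabDDPM composite objective is:

$$\mathcal{L}_{\text{TabDDPM}} = \mathcal{L}_{\text{simple}} + \frac{1}{C}\sum_{i=1}^{C}\mathcal{L}_{\text{cat}}^{(i)},$$

where $C$ is the number of categorical columns and $\mathcal{L}_{\text{cat}}^{(i)}$ is the categorical diffusion loss for column $i$.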
The current implementation projects categorical values into continuous learned embedding slices and trains the diffusion model with the MSE objective over the full latent vector. It does not currently train a separate categorical cross-entropy head.
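The schedule, forward noising, and MSE objective described above can be sketched in plain Python. This is a minimal illustration, not the project's VarianceSchedule or compute_loss code: the offset `s=0.008` and the upper clip `beta_max=0.999` are the values from the Nichol-Dhariwal paper and are assumptions here, and all function names are hypothetical.

```python
import math
import random


def cosine_alpha_bar(num_timesteps, s=0.008):
    """Cumulative products alpha_bar[0..T] of the cosine schedule (s is assumed)."""
    def f(t):
        return math.cos(((t / num_timesteps + s) / (1 + s)) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(num_timesteps + 1)]


def betas_from_alpha_bar(alpha_bar, beta_max=0.999):
    """Per-step betas, clipped for numerical stability (beta_max is assumed)."""
    return [
        min(max(1 - alpha_bar[t] / alpha_bar[t - 1], 0.0), beta_max)
        for t in range(1, len(alpha_bar))
    ]


def q_sample(x0, alpha_bar_t, eps):
    """Forward-noise a clean latent vector to the level given by alpha_bar_t."""
    a, b = math.sqrt(alpha_bar_t), math.sqrt(1 - alpha_bar_t)
    return [a * x + b * e for x, e in zip(x0, eps)]


def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(42)
alpha_bar = cosine_alpha_bar(1000)
betas = betas_from_alpha_bar(alpha_bar)
x0 = [0.3, -0.7, 0.1]
eps = [rng.gauss(0, 1) for _ in x0]
x_t = q_sample(x0, alpha_bar[500], eps)
# A perfect denoiser would return eps exactly, giving zero loss.
loss_perfect = mse(eps, eps)
```

In a real training loop the timestep is drawn at random per record and the prediction comes from the denoiser network rather than from the known noise.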
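make_ddim_timesteps returns a descending deterministic subsequence from the training horizon. In each reverse step, ddim_step first computes the clean-latent estimate:

$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}}.$$

The update is:

$$x_{t_{\text{prev}}} = \sqrt{\bar{\alpha}_{t_{\text{prev}}}}\,\hat{x}_0 + \sqrt{1 - \bar{\alpha}_{t_{\text{prev}}} - \sigma_t^2}\,\epsilon_\theta(x_t, t) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I).$$

When $\sigma_t = 0$, no fresh noise is injected and the step is deterministic. At the final step, prev_t=-1 is conventionally treated as $\bar{\alpha}_{t_{\text{prev}}} = 1$, so the update returns $\hat{x}_0$.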
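When context $c$ is provided, DATALUS combines unconditional and conditional noise predictions with guidance scale $w$ (cfg_scale):

$$\tilde{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + w\,\bigl(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\bigr).$$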
The implementation returns the direct denoiser prediction when context is None or cfg_scale == 1.0. Group guidance is implemented as a mask-and-scale extension over context dimensions.
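A deterministic DDIM update and the guidance blend can be sketched together. This is a sketch only, assuming the deterministic (zero-variance) DDIM form and the usual classifier-free guidance combination; the function names and signatures are hypothetical and do not mirror the project's ddim_step.

```python
import math


def cfg_combine(eps_uncond, eps_cond, cfg_scale):
    """Blend unconditional and conditional noise predictions."""
    return [u + cfg_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (sigma = 0). Returns (x_prev, x0_hat)."""
    sqrt_ab = math.sqrt(alpha_bar_t)
    sqrt_one_minus = math.sqrt(1 - alpha_bar_t)
    # Estimate the clean latent from the current noisy latent and predicted noise.
    x0_hat = [(x - sqrt_one_minus * e) / sqrt_ab for x, e in zip(x_t, eps)]
    a, b = math.sqrt(alpha_bar_prev), math.sqrt(1 - alpha_bar_prev)
    # Move to the previous (less noisy) level along the predicted direction.
    x_prev = [a * x0 + b * e for x0, e in zip(x0_hat, eps)]
    return x_prev, x0_hat


x0 = [0.5, -0.25]
eps = [1.0, -2.0]
alpha_bar_t = 0.36
x_t = [
    math.sqrt(alpha_bar_t) * x + math.sqrt(1 - alpha_bar_t) * e
    for x, e in zip(x0, eps)
]
# With the true noise and alpha_bar_prev = 1, the step recovers x0.
x_prev, x0_hat = ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev=1.0)
```

With `cfg_scale=1.0` the blend reduces to the conditional prediction, which matches the shortcut described above.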
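For a known-value mask $m \in \{0, 1\}^d$, where 1 marks an observed coordinate, the observed latent $x_0$ is noised to the current level:

$$x_t^{\text{known}} = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon.$$

Unknown coordinates are produced by DDIM, and known coordinates are restored by latent-space fusion:

$$x_t = m \odot x_t^{\text{known}} + (1 - m) \odot x_t^{\text{unknown}}.$$

The implemented jump-back step reintroduces noise from from_t to to_t:

$$x_{\text{to}} = \sqrt{\frac{\bar{\alpha}_{\text{to}}}{\bar{\alpha}_{\text{from}}}}\,x_{\text{from}} + \sqrt{1 - \frac{\bar{\alpha}_{\text{to}}}{\bar{\alpha}_{\text{from}}}}\,\epsilon.$$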
make_repaint_schedule creates the reverse schedule plus forward jumps. The CLI exposes --jump-length and --jump-n-sample.
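The two RePaint building blocks, mask fusion and the forward jump, can be sketched as follows. This is an illustration under the standard RePaint formulation; the function names are hypothetical and the schedule construction in make_repaint_schedule is not reproduced here.

```python
import math


def fuse_known(x_unknown, x_known, mask):
    """Keep known coordinates (mask == 1) and the DDIM output elsewhere."""
    return [m * k + (1 - m) * u for u, k, m in zip(x_unknown, x_known, mask)]


def jump_back(x_from, alpha_bar_from, alpha_bar_to, eps):
    """Re-noise from a cleaner step (from_t) to a noisier step (to_t)."""
    ratio = alpha_bar_to / alpha_bar_from
    a, b = math.sqrt(ratio), math.sqrt(1 - ratio)
    return [a * x + b * e for x, e in zip(x_from, eps)]


# Coordinates 0 and 2 are observed; coordinate 1 is filled by the sampler.
fused = fuse_known([9.0, 9.0, 9.0], [1.0, 2.0, 3.0], [1, 0, 1])
# Jumping from alpha_bar 0.8 to 0.2 scales the signal by sqrt(0.25) = 0.5.
renoised = jump_back([2.0], 0.8, 0.2, [0.0])
```

Repeating the reverse steps after each jump gives the sampler several chances to harmonize generated coordinates with the observed ones, which is what `--jump-length` and `--jump-n-sample` control.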
flowchart TD
A["Inputs<br/>Real train R_train, real holdout R_holdout, synthetic S, schema"]
B["Projection<br/>Standardize numeric fields and one-hot categorical fields"]
C["DCR<br/>Nearest-neighbor distance from each synthetic record to R_train"]
D["Shadow-MIA<br/>Release mode: exhaustive shadows<br/>CI-lite mode: bounded k-fold shadows"]
E["Utility<br/>TRTR baseline and TSTR model trained on synthetic data"]
F["Report<br/>JSON metrics, thresholds, mode, pass/fail verdict"]
G{"Publish?"}
A --> B --> C --> F
A --> D --> F
A --> E --> F
F --> G
The OAA projects real and synthetic records into a common sklearn feature space. Numerical columns are standardized. Categorical and boolean columns are one-hot encoded with handle_unknown="ignore". Dropped columns, target columns, and columns absent from either frame are excluded from privacy projection.
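For each synthetic record $s_i \in S$, the distance to the closest record (DCR) in the projected space is:

$$\mathrm{DCR}(s_i) = \min_{r \in R_{\text{train}}} d(s_i, r).$$

The alert threshold is the configured percentile of each real record's second-nearest real neighbor distance:

$$\tau = \operatorname{Percentile}_p\bigl(\{\, d_{2}(r) : r \in R_{\text{train}} \,\}\bigr),$$

where $d_2(r)$ is the distance from $r$ to its second-nearest neighbor in $R_{\text{train}}$ (the nearest neighbor is $r$ itself).

The memorization ratio is:

$$\rho = \frac{1}{|S|}\sum_{i=1}^{|S|} \mathbf{1}\bigl[\mathrm{DCR}(s_i) < \tau\bigr].$$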
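A brute-force version of the DCR check makes the three quantities concrete. This sketch assumes Euclidean distance on already-projected rows, a nearest-rank percentile, and an illustrative default percentile of 5; none of these are confirmed settings of the audit orchestrator, and `dcr_report` is a hypothetical name.

```python
import math


def nearest_distances(query, reference, k=1):
    """Sorted Euclidean distances from `query` to its k nearest rows."""
    return sorted(math.dist(query, r) for r in reference)[:k]


def dcr_report(real_train, synthetic, percentile=5.0):
    # Distance from each synthetic row to its closest real training row.
    dcr = [nearest_distances(s, real_train)[0] for s in synthetic]
    # Real-to-real baseline: k=2 because the nearest neighbor of a real row
    # inside its own set is the row itself at distance zero.
    second_nn = sorted(
        nearest_distances(r, real_train, k=2)[1] for r in real_train
    )
    # Nearest-rank percentile of the baseline distances (assumed convention).
    rank = max(1, math.ceil(percentile / 100 * len(second_nn)))
    threshold = second_nn[rank - 1]
    ratio = sum(d < threshold for d in dcr) / len(dcr)
    return {"threshold": threshold, "memorization_ratio": ratio, "dcr": dcr}


real_train = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
synthetic = [[0.0, 0.0], [3.0, 3.0]]  # first row copies a training record
report = dcr_report(real_train, synthetic, percentile=50.0)
```

A production audit would replace the quadratic scan with an indexed nearest-neighbor search; the definitions stay the same.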
The default approval rule is
Following the Shokri threat model, the attacker learns features that distinguish records used to train a generator from non-member records. DATALUS computes nearest-neighbor attack features for candidate records against generated records:
A RandomForest attack model estimates membership scores. The central metric is attack ROC-AUC:
release mode uses the uncapped ShadowMIAConfig. ci_lite applies deterministic caps: at most two shadow models, synthetic multiplier at most 0.5, at most three neighbors, default maximum of 512 rows, at most 50 attack estimators, maximum depth 6, and minimum leaf size at least 2. ci_lite is a regression check for CI/CD, not a release audit.
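The ci_lite caps amount to clamping a configuration object. The sketch below shows that idea with a stand-in dataclass; the field names and the uncapped starting values are hypothetical and are not taken from the real ShadowMIAConfig. Only the cap values come from the list above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ShadowMIAConfigSketch:
    # Hypothetical field names and defaults; the real ShadowMIAConfig may differ.
    n_shadow_models: int = 8
    synthetic_multiplier: float = 1.0
    n_neighbors: int = 5
    max_rows: Optional[int] = None
    n_estimators: int = 200
    max_depth: Optional[int] = None
    min_samples_leaf: int = 1


def apply_ci_lite_caps(cfg):
    """Clamp every knob to the ci_lite limits described in the text."""
    return replace(
        cfg,
        n_shadow_models=min(cfg.n_shadow_models, 2),
        synthetic_multiplier=min(cfg.synthetic_multiplier, 0.5),
        n_neighbors=min(cfg.n_neighbors, 3),
        max_rows=512 if cfg.max_rows is None else cfg.max_rows,
        n_estimators=min(cfg.n_estimators, 50),
        max_depth=6 if cfg.max_depth is None else min(cfg.max_depth, 6),
        min_samples_leaf=max(cfg.min_samples_leaf, 2),
    )


lite = apply_ci_lite_caps(ShadowMIAConfigSketch())
```

Because the caps are deterministic functions of the input config, two CI runs with the same seed and data see the same bounded workload.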
Utility is measured by Train on Synthetic, Test on Real (TSTR) against a Train on Real, Test on Real (TRTR) baseline:
The implementation also reports:
The default utility approval threshold is
Python 3.11 or newer is required. Use a local virtual environment rather than installing into externally managed system Python:
python -m venv .venv
.venv/bin/python -m pip install --upgrade pip
.venv/bin/python -m pip install -e '.[dev]'

For lighter roles:

.venv/bin/python -m pip install -e '.[training,test]'
.venv/bin/python -m pip install -e '.[frontend]'

requirements.txt is a compatibility shim generated from pyproject.toml; the authoritative dependency source is pyproject.toml.
The Streamlit shell embeds the React component from frontend/component/dist when built. If the bundle is absent, the Python wrapper points to the Vite dev server at http://localhost:5173.
cd frontend/component
npm ci
npm run test
npm run build

For interactive component development:

cd frontend/component
npm run dev

Then launch Streamlit separately:

.venv/bin/datalus streamlit

.venv/bin/python -m pytest -q
cd frontend/component
npm run test
npm run build

The Command Line Interface (CLI) is implemented with Typer in src/datalus/interfaces/cli.py. It is deliberately thin: commands translate user input into application use cases and print file locations.
datalus ingest raw.csv artifacts/demo/processed.parquet --schema-path artifacts/demo/schema_config.json --target-column target
datalus train artifacts/demo/schema_config.json artifacts/demo/processed.parquet artifacts/demo --epochs 5 --batch-size 2048
datalus sample artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json artifacts/demo/synthetic.parquet --n-records 10000 --ddim-steps 50 --cfg-scale 1.0
datalus augment artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json small.parquet artifacts/demo/augmented.parquet --n-records 5000
datalus balance artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json train.parquet artifacts/demo/balanced.parquet target '{"0": 5000, "1": 5000}'
datalus inpaint artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json incomplete.parquet artifacts/demo/inpainted.parquet
datalus counterfactual artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json records.parquet artifacts/demo/counterfactual.parquet '{"municipality": "3550308"}'
datalus audit real_train.parquet artifacts/demo/synthetic.parquet artifacts/demo/schema_config.json artifacts/demo/audit_report.json --target-column target --mia-mode release
datalus export-onnx artifacts/demo/checkpoints/checkpoint_latest.pt artifacts/demo/encoder_config.json artifacts/demo --quantize
datalus serve artifacts --host 0.0.0.0 --port 8000

| Command | Purpose | Primary output |
|---|---|---|
| `ingest` | Infer schema and stream retained data to Parquet. | `schema_config.json`, processed Parquet. |
| `train` | Train the diffusion model with deterministic checkpoints. | `encoder_config.json`, `checkpoints/checkpoint_latest.pt`. |
| `sample` | Generate an ab-initio synthetic dataset. | Synthetic Parquet. |
| `augment` | Append synthetic records to an existing Parquet dataset. | Augmented Parquet. |
| `balance` | Generate records for requested class counts. | Balanced Parquet. |
| `inpaint` | Fill null values using RePaint-style masks. | Inpainted Parquet. |
| `counterfactual` | Apply do-style interventions and regenerate compatible fields. | Counterfactual Parquet. |
| `audit` | Run DCR and Shadow-MIA, and utility when target is valid. | Audit JSON report. |
| `export-onnx` | Export EMA denoiser to ONNX and optional INT8. | ONNX files and manifest JSON. |
| `serve` | Serve artifacts for browser-local inference. | FastAPI service. |
| `streamlit` | Launch the Portuguese Streamlit UI. | Streamlit service. |
| Command | Positional arguments | Options and defaults | Expected output |
|---|---|---|---|
| `ingest` | `input_path`, `output_path` | `--schema-path artifacts/schema_config.json`, `--target-column None` | Prints schema and processed Parquet paths. Writes Snappy Parquet and schema metadata. |
| `train` | `schema_path`, `data_path`, `output_dir` | `--epochs 1`, `--batch-size 2048`, `--max-steps None`, `--resume-from None` | Prints checkpoint path. Writes `encoder_config.json` and checkpoints under `output_dir/checkpoints`. |
| `sample` | `checkpoint_path`, `encoder_path`, `output_path` | `--n-records 100`, `--ddim-steps 50`, `--seed 42`, `--cfg-scale 1.0` | Writes synthetic Parquet with Snappy compression. |
| `augment` | `checkpoint_path`, `encoder_path`, `input_path`, `output_path` | `--n-records 100`, `--ddim-steps 50`, `--seed 42`, `--cfg-scale 1.0` | Writes original rows plus synthetic rows selected to original columns. |
| `balance` | `checkpoint_path`, `encoder_path`, `input_path`, `output_path`, `target_column`, `target_distribution_json` | `--ddim-steps 50`, `--seed 42`, `--cfg-scale 1.0`, `--max-attempts 10`, `--strict False` | Writes Parquet approaching requested class counts. Raises if `--strict` is set and attempts are exhausted. |
| `inpaint` | `checkpoint_path`, `encoder_path`, `input_path`, `output_path` | `--ddim-steps 50`, `--jump-length 10`, `--jump-n-sample 10`, `--seed 42` | Writes Parquet with null-driven latent fields imputed. |
| `counterfactual` | `checkpoint_path`, `encoder_path`, `input_path`, `output_path`, `intervention_json` | `--ddim-steps 50`, `--seed 42` | Writes Parquet under fixed intervention columns. |
| `audit` | `real_train_path`, `synthetic_path`, `schema_path`, `report_path` | `--target-column None`, `--real-holdout-path None`, `--mia-mode release`, `--max-audit-rows None` | Writes privacy JSON and utility JSON when target exists in both datasets. |
| `export-onnx` | `checkpoint_path`, `encoder_path`, `output_dir` | `--quantize True` | Writes `model_fp32.onnx`, optional `model_int8.onnx`, `encoder_config.json`, `projector_config.json`, `manifest.json`. |
| `serve` | `registry_path` (default `artifacts`) | `--host 0.0.0.0`, `--port 8000` | Starts Uvicorn factory app with `DATALUS_REGISTRY_PATH`. |
| `streamlit` | None | None | Runs `streamlit run frontend/streamlit/app.py`. |
- `ingest` supports `.csv`, `.tsv`, `.csv.gz`, `.tsv.gz`, `.parquet`, and `.orc`. ORC requires `pyarrow`.
- `ingest` uses delimiter sniffing for CSV, semicolon fallback for ambiguous Brazilian spreadsheet exports, `utf8-lossy` decoding, `infer_schema_length=10000`, `ignore_errors=True`, and `truncate_ragged_lines=True`.
- `ingest` drops sparse columns with null ratio above `0.95`, identifier-like names such as CPF/CNPJ/CNS/email/phone-like fields, free-text columns, and unsupported dtypes.
- The underlying `ZeroShotPreprocessor` defaults are `high_cardinality_threshold=50`, `sample_size=100000`, `rare_category_threshold=5`, and null tokens `""`, `NA`, `N/A`, `null`, `NULL`, and `None`. The target column is protected from the sparse and identifier drop rules.
- `train` currently exposes a subset of `TrainingConfig` on the CLI. Learning rate, weight decay, hidden dimensions, AMP, EMA, warmup, and maximum encoder fit rows are configured in `TrainingConfig` for programmatic use.
- `balance` treats JSON class labels as strings during matching, so numeric class labels should be represented as JSON object keys such as `{"0": 5000}`.
- `counterfactual` interventions must reference retained columns from the fitted encoder. Unknown categorical values map to `__UNKNOWN__`.
- `audit` reads Parquet eagerly. For large official release audits, run on a machine sized for the projected one-hot matrix.
- `serve` disables server-side PyTorch generation by default. It is intended to serve public artifacts to the browser-local ONNX runtime.
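Delimiter sniffing with a semicolon fallback can be illustrated with the standard library. This is a sketch of the general idea using `csv.Sniffer`; it is an assumption that the project's ingestion behaves similarly, and `detect_delimiter` and its candidate set are hypothetical.

```python
import csv


def detect_delimiter(sample_text, candidates=",;\t|", fallback=";"):
    """Guess a CSV delimiter from a text sample, with a fallback."""
    try:
        dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
        return dialect.delimiter
    except csv.Error:
        # Ambiguous or single-column sample: fall back to the semicolon
        # convention common in Brazilian spreadsheet exports.
        return fallback


semicolon_sample = "a;b;c\n1;2;3\n4;5;6\n"
comma_sample = "a,b\n1,2\n3,4\n"
single_column_sample = "hello\nworld\n"

detected = [
    detect_delimiter(semicolon_sample),
    detect_delimiter(comma_sample),
    detect_delimiter(single_column_sample),
]
```

Sniffing only needs the first few kilobytes of a file, so it can run before a lazy scan is configured.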
- `ingest` creates a retained Parquet dataset and `schema_config.json`.
- `DatalusTrainer` loads schema metadata and builds deterministic Parquet batch offsets.
- `TabularEncoder` fits numeric quantile transforms and categorical vocabularies on up to `max_encoder_fit_rows=100000`.
- `FeatureProjector` concatenates numeric latent slices and categorical embedding slices.
- `TabularDenoiserMLP` predicts diffusion noise over the full latent vector.
- `TabularDiffusion.compute_loss` samples random timesteps and optimizes the MSE epsilon objective.
- AdamW updates diffusion and projector parameters.
- Linear warmup is followed by cosine annealing to `eta_min=1e-6`.
- AMP GradScaler is used on CUDA when `amp=True`.
- Gradients are clipped to `max_grad_norm=1.0`.
- EMA tracks diffusion parameters with decay `0.9999`.
- Checkpoints persist training state for deterministic resume.
| Parameter | Default |
|---|---|
| `batch_size` | `2048` |
| `epochs` | `1` |
| `learning_rate` | `2e-4` |
| `weight_decay` | `1e-4` |
| `checkpoint_every_steps` | `500` |
| `seed` | `42` |
| `num_timesteps` | `1000` |
| `hidden_dims` | `(512, 1024, 1024, 512)` |
| `amp` | `True` |
| `condition_dropout` | `0.1` |
| `ema_decay` | `0.9999` |
| `warmup_steps` | `500` |
| `max_grad_norm` | `1.0` |
| `max_encoder_fit_rows` | `100000` |
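The warmup, cosine annealing, and EMA defaults listed above interact as sketched below. The exact warmup formula and the way the trainer composes its schedulers are assumptions; only the numeric defaults come from the table, and both function names are hypothetical.

```python
import math


def learning_rate(step, total_steps, base_lr=2e-4, warmup_steps=500, eta_min=1e-6):
    """Linear warmup to base_lr, then cosine annealing down to eta_min."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * min(progress, 1.0)))
    return eta_min + (base_lr - eta_min) * cosine


def ema_update(shadow, params, decay=0.9999):
    """Exponential moving average of parameters after an optimizer step."""
    return [decay * s + (1 - decay) * p for s, p in zip(shadow, params)]


lr_start = learning_rate(0, 10_000)       # small value at the start of warmup
lr_peak = learning_rate(500, 10_000)      # base_lr once warmup ends
lr_end = learning_rate(10_000, 10_000)    # eta_min at the end of training
shadow = ema_update([1.0], [0.0], decay=0.5)
```

With a decay of 0.9999 the EMA weights average over roughly the last ten thousand steps, which is why very short smoke runs export weights close to initialization.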
Checkpoints are written atomically where possible. Each checkpoint includes:
- Diffusion model state.
- Feature projector state.
- Optimizer state.
- Scheduler state.
- AMP scaler state.
- EMA shadow weights.
- Epoch, batch index, global step, latest loss, and loss history.
- Training config and SHA-256 config hash.
- Python, NumPy, Torch, and CUDA RNG states when CUDA is available.
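Two of the checkpoint properties above, the SHA-256 config hash and atomic writes, can be sketched with the standard library. This is an illustration, assuming a JSON-serializable config and a raw-bytes payload; the real checkpoints are PyTorch files and the helper names here are hypothetical.

```python
import hashlib
import json
import os
import tempfile


def config_hash(config):
    """SHA-256 over a canonical (sorted-key) JSON encoding of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def atomic_write_bytes(path, payload):
    """Write to a temp file in the target directory, then rename into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace swaps the file in one step on the same volume, so a
        # reader never observes a half-written checkpoint.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


digest = config_hash({"batch_size": 2048, "seed": 42})
```

Comparing the stored hash with the hash of the current config on resume is one way to detect that training settings changed between sessions.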
Resume with:
datalus train artifacts/demo/schema_config.json artifacts/demo/processed.parquet artifacts/demo --epochs 20 --batch-size 1024 --resume-from artifacts/demo/checkpoints/checkpoint_latest.pt

Colab T4 sessions are useful for proof-of-concept training but should be treated as preemptible. Store artifacts on Google Drive:
from google.colab import drive
drive.mount('/content/drive')

Recommended Drive layout:
/content/drive/MyDrive/datalus/
raw/
processed/
artifacts/
datasus_sih/
schema_config.json
encoder_config.json
checkpoints/
Install and smoke test:
python -m pip install --upgrade pip
python -m pip install -e '.[training,test]'
datalus train /content/drive/MyDrive/datalus/artifacts/datasus_sih/schema_config.json /content/drive/MyDrive/datalus/processed/train.parquet /content/drive/MyDrive/datalus/artifacts/datasus_sih --epochs 1 --batch-size 2048 --max-steps 20

Batch-size tuning for T4:
| Symptom | Action |
|---|---|
| CUDA OOM before first checkpoint | Retry with --batch-size 1024. |
| OOM after several steps | Resume from checkpoint_latest.pt and reduce to 512. |
| OOM with very wide categorical embedding space | Reduce to 256, reduce retained high-cardinality columns upstream, or train on a larger GPU. |
| Session interruption | Resume from the Drive checkpoint path. |
Learning-rate defaults are conservative for T4. If using much smaller batches, keep 2e-4 for initial experiments and compare loss curves before changing optimizer settings.
Each numerical column fits up to 1,000 empirical quantiles. Transform maps finite values into the unit quantile domain and then into [-1, 1]. Non-finite values use the training median fill value. Inverse transform clips generated values back to [0, 1] in quantile space and interpolates over the stored quantile table.
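The numeric encoding round trip can be sketched in plain Python. This is a simplified illustration of a quantile transform with linear interpolation, assuming at least two finite training values; it is not the project's TabularEncoder and the function names are hypothetical.

```python
import math
from bisect import bisect_right


def fit_quantile_table(values, max_quantiles=1000):
    """Return (table, fill): sorted empirical quantiles and a median fill."""
    finite = sorted(v for v in values if math.isfinite(v))
    n = min(max_quantiles, len(finite))
    table = [finite[round(i * (len(finite) - 1) / (n - 1))] for i in range(n)]
    return table, finite[len(finite) // 2]


def to_latent(value, table, fill):
    """Map a raw value to [-1, 1] through its empirical quantile."""
    v = value if math.isfinite(value) else fill
    if v <= table[0]:
        q = 0.0
    elif v >= table[-1]:
        q = 1.0
    else:
        j = bisect_right(table, v)  # table[j - 1] <= v < table[j]
        frac = (v - table[j - 1]) / (table[j] - table[j - 1])
        q = (j - 1 + frac) / (len(table) - 1)
    return 2.0 * q - 1.0


def from_latent(z, table):
    """Clip to the unit quantile domain and interpolate over the table."""
    q = min(max((z + 1.0) / 2.0, 0.0), 1.0)
    pos = q * (len(table) - 1)
    j = min(int(pos), len(table) - 2)
    return table[j] + (pos - j) * (table[j + 1] - table[j])


table, fill = fit_quantile_table([0.0, 10.0, 20.0, 30.0, 40.0, float("nan")])
restored = from_latent(to_latent(25.0, table, fill), table)
```

Clipping in `from_latent` means a generated latent outside [-1, 1] decodes to the smallest or largest value seen during fitting rather than extrapolating.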
Each categorical column stores:
- `__UNKNOWN__` at index 0.
- `__NULL__` at index 1.
- All observed categories sorted after the sentinels.
- Frequency metadata and rare-category counts.
Observed rare categories remain first-class tokens. During inference, never-seen categories map to __UNKNOWN__. The projector embeds each categorical column with the default dimension ceil(log2(cardinality)), bounded below by 2.
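The vocabulary layout and the embedding-width rule can be sketched as follows. It is an assumption here that the cardinality passed to the width rule counts the two sentinels; the helper names are hypothetical.

```python
import math

UNKNOWN, NULL = "__UNKNOWN__", "__NULL__"


def build_vocabulary(values):
    """Sentinels first, then every observed category in sorted order."""
    observed = sorted({v for v in values if v is not None})
    return [UNKNOWN, NULL] + observed


def encode(value, token_to_index):
    """Nulls map to index 1; never-seen categories map to index 0."""
    if value is None:
        return token_to_index[NULL]
    return token_to_index.get(value, token_to_index[UNKNOWN])


def embedding_dim(cardinality):
    """ceil(log2(cardinality)), bounded below by 2."""
    return max(2, math.ceil(math.log2(cardinality)))


vocabulary = build_vocabulary(["SP", "RJ", None, "SP", "AC"])
token_to_index = {token: i for i, token in enumerate(vocabulary)}
codes = [encode(v, token_to_index) for v in ["RJ", None, "XX"]]
width = embedding_dim(len(vocabulary))
```

Because rare categories keep their own index, a municipality seen once in training can still be generated; only values absent from training collapse into the unknown token.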
The implemented TabularDenoiserMLP topology is:
- Sinusoidal timestep embedding with default dimension `128`.
- MLP time projection to `dim_t * 4`.
- Optional context projection when `context_dim` is configured.
- Input projection from latent dimension to the first hidden dimension.
- Residual MLP blocks with Linear, LayerNorm, timestep injection, SiLU, Dropout, Linear, LayerNorm, and residual projection when dimensions differ.
- Final LayerNorm, SiLU, and Linear projection back to the latent dimension.
- Zero initialization of final Linear weights and bias.
load_model_bundle reconstructs the encoder, projector, denoiser, and diffusion wrapper from a checkpoint. It loads EMA weights only when use_ema=True, which is used by ONNX export. Sampling creates Gaussian latent noise, runs DDIM, splits the latent tensor into numerical and categorical slices, decodes numerical values by inverse quantile interpolation, and decodes categories by nearest learned embedding.
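Decoding a category by nearest learned embedding reduces to an argmin over the embedding table. The sketch below assumes squared Euclidean distance and a toy two-dimensional table; the real projector weights and the function name are not taken from the codebase.

```python
def decode_category(latent_slice, embedding_table, vocabulary):
    """Return the token whose embedding is closest to the generated slice."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(latent_slice, row))

    best = min(
        range(len(embedding_table)),
        key=lambda i: squared_distance(embedding_table[i]),
    )
    return vocabulary[best]


vocabulary = ["__UNKNOWN__", "__NULL__", "AC", "RJ", "SP"]
embedding_table = [
    [0.0, 0.0],
    [1.0, 1.0],
    [-1.0, 0.5],
    [0.5, -1.0],
    [2.0, 2.0],
]
# A generated slice near [0.5, -1.0] decodes to "RJ".
token = decode_category([0.4, -0.8], embedding_table, vocabulary)
```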
The React component operates without server-side PyTorch:
- Streamlit passes schema, encoder, projector, manifest, seed, row count, DDIM steps, and precision choice.
- The component downloads `model_int8.onnx` or `model_fp32.onnx` from the FastAPI artifact endpoint.
- The browser Cache API stores the ONNX bytes under `datalus-onnx-artifacts`.
- ONNX Runtime Web creates a WASM session with graph optimization enabled.
- TypeScript initializes deterministic Gaussian noise with a seeded linear congruential generator and Box-Muller transform.
- DDIM runs in the browser by repeatedly invoking the ONNX denoiser.
- Numerical and categorical decoding uses `encoder_config.json` and `projector_config.json`.
- Generated records are returned to Streamlit through `streamlit-component-lib`.
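The seeded noise initialization can be sketched in Python for consistency with the other examples, although the component itself is written in TypeScript. The LCG constants below are the common Numerical Recipes values and are an assumption; the component's actual constants are not stated in this document.

```python
import math


def seeded_gaussians(seed, count):
    """Deterministic standard-normal samples from an LCG plus Box-Muller."""
    state = seed & 0xFFFFFFFF

    def next_uniform():
        nonlocal state
        # 32-bit linear congruential step (constants are assumed).
        state = (1664525 * state + 1013904223) & 0xFFFFFFFF
        return (state + 1) / 4294967297.0  # strictly inside (0, 1)

    samples = []
    while len(samples) < count:
        u1, u2 = next_uniform(), next_uniform()
        radius = math.sqrt(-2.0 * math.log(u1))
        # Box-Muller yields two independent normals per pair of uniforms.
        samples.append(radius * math.cos(2.0 * math.pi * u2))
        samples.append(radius * math.sin(2.0 * math.pi * u2))
    return samples[:count]


noise_a = seeded_gaussians(42, 5)
noise_b = seeded_gaussians(42, 5)
noise_c = seeded_gaussians(43, 5)
```

Seeding the noise this way is what lets the same seed, row count, and DDIM step count reproduce the same records across browser sessions.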
Current browser conditions include precision, which selects the artifact filename. The guidanceScale argument is passed through the component interface for forward compatibility, but the exported ONNX wrapper currently has no context input.
export-onnx writes:
artifacts/<domain>/
model_fp32.onnx
model_int8.onnx
encoder_config.json
projector_config.json
manifest.json
The ONNX graph uses inputs x_t and timestep, output predicted_noise, opset 17, and dynamic batch axes. INT8 uses ONNX Runtime dynamic quantization with QInt8 weights. The manifest includes FP32 parity and INT8 CFG amplification parity:
{
"cfg_scale": 3.0,
"amplified_max_abs_diff": 0.012,
"categorical_agreement": null,
"passed": true,
"atol": 0.2
}

Treat passed: false as a release blocker for INT8 artifacts. Use FP32 or retrain/re-export before publishing edge artifacts.
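A release script can gate INT8 publication on the parity record. This sketch assumes the record is stored in the manifest under the key `int8_cfg_parity` and carries the fields shown in the example; the function name is hypothetical.

```python
def int8_release_allowed(manifest):
    """Return True only when the INT8 parity guard passed within tolerance."""
    parity = manifest.get("int8_cfg_parity") or {}
    if parity.get("passed") is not True:
        return False
    # Re-check the numbers instead of trusting the flag alone.
    return parity["amplified_max_abs_diff"] <= parity["atol"]


manifest = {
    "int8_cfg_parity": {
        "cfg_scale": 3.0,
        "amplified_max_abs_diff": 0.012,
        "categorical_agreement": None,
        "passed": True,
        "atol": 0.2,
    }
}
allowed = int8_release_allowed(manifest)
```

A missing parity record is treated as a failure here, so an artifact exported without quantization checks cannot be published as INT8 by accident.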
Artifact serving is enabled by default. Server-side PyTorch generation is disabled unless create_app(..., enable_server_generation=True) is used programmatically.
datalus serve artifacts --host 0.0.0.0 --port 8000

Core endpoints:

| Endpoint | Method | Purpose |
|---|---|---|
| `/health` | GET | Service status, uptime, registry path. |
| `/artifacts` | GET | Lists artifact domains under the registry. |
| `/artifacts/{domain}/manifest` | GET | Returns `manifest.json`. |
| `/artifacts/{domain}/schema` | GET | Returns `schema_config.json`. |
| `/artifacts/{domain}/{file_name}` | GET | Serves approved public artifact files. |
| `/audit/latest` | GET | Returns the newest `audit_report.json`. |
| `/generate` | POST | Server-side generation when explicitly enabled. |
| `/augment` | POST | Server-side augmentation when explicitly enabled. |
| `/balance` | POST | Server-side balancing when explicitly enabled. |
| `/inpaint` | POST | Server-side inpainting when explicitly enabled. |
| `/counterfactual` | POST | Server-side counterfactual generation when explicitly enabled. |
Allowed public artifact files are model_fp32.onnx, model_fp16.onnx, model_int8.onnx, schema_config.json, encoder_config.json, projector_config.json, model_config.json, audit_report.json, and manifest.json. Domain path traversal is rejected. CORS currently allows all origins with GET and POST, which is convenient for local artifact demos and must be tightened by the production reverse proxy or deployment boundary.
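The allowlist and traversal rejection can be sketched with `pathlib`. This is an illustration of the policy, not the service's actual handler: `resolve_artifact` is a hypothetical helper, and the rule that a domain must be exactly one directory level under the registry is an assumption.

```python
from pathlib import Path

ALLOWED_FILES = frozenset({
    "model_fp32.onnx", "model_fp16.onnx", "model_int8.onnx",
    "schema_config.json", "encoder_config.json", "projector_config.json",
    "model_config.json", "audit_report.json", "manifest.json",
})


def resolve_artifact(registry, domain, file_name):
    """Return the artifact path, or raise ValueError for disallowed requests."""
    if file_name not in ALLOWED_FILES:
        raise ValueError(f"file not allowed: {file_name}")
    root = Path(registry).resolve()
    candidate = (root / domain / file_name).resolve()
    # The file must sit in a direct child directory of the registry root.
    # This rejects "..", nested paths, absolute paths, and empty domains.
    if candidate.parent.parent != root:
        raise ValueError(f"invalid domain: {domain}")
    return candidate


path = resolve_artifact("artifacts", "datasus_sih", "manifest.json")
```

Resolving both the root and the candidate before comparing them means symlinks and `..` segments are normalized before the check runs.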
Example request contract:
{
"domain": "datasus_sih",
"n_records": 1000,
"ddim_steps": 50,
"seed": 42,
"cfg_scale": 1.0
}

If /generate returns 403, the service is operating in the intended artifact-serving mode. Use browser ONNX inference or construct the app with server-side generation enabled in a trusted internal environment.
The repository includes a two-service Compose deployment:
- `api`: FastAPI artifact service built from `docker/Dockerfile.api`.
- `streamlit`: Streamlit UI built from `docker/Dockerfile.streamlit`, including Node/NPM component build.
Expected local artifact layout:
artifacts/
datasus_sih/
manifest.json
schema_config.json
encoder_config.json
projector_config.json
model_fp32.onnx
model_int8.onnx
audit_report.json
Start both services:
docker compose up --build

Port mappings and volumes:

| Service | Container command | Port mapping | Artifact volume |
|---|---|---|---|
| `api` | `uvicorn datalus.interfaces.api:app --host 0.0.0.0 --port 8000` | `8000:8000` | `./artifacts:/app/artifacts:ro` |
| `streamlit` | `streamlit run frontend/streamlit/app.py --server.address=0.0.0.0 --server.port=8501` | `8501:8501` | `./artifacts:/app/artifacts:ro` |
Environment variables:
| Variable | Service | Value in Compose |
|---|---|---|
| `DATALUS_REGISTRY_PATH` | `api`, `streamlit` | `/app/artifacts` |
| `DATALUS_ARTIFACT_BASE_URL` | `streamlit` | `http://localhost:8000/artifacts` |
Production security note: although the API container listens on 0.0.0.0:8000 and Streamlit listens on 0.0.0.0:8501, real public-sector production deployments must place both containers behind a reverse proxy such as NGINX or Traefik with HTTPS/TLS termination, access controls, request logging, and network segmentation. Synthetic sensitive-data artifacts still require controlled distribution and transport encryption.
GitHub Actions defines three jobs:
| Job | Runtime | Commands |
|---|---|---|
| `python-tests` | Ubuntu, Python 3.11, CPU Torch | `pip install ".[training,test]"`, `pytest`. |
| `frontend-build` | Ubuntu, Node 20 | `npm install`, `npm run test`, `npm run build`. |
| `docker-build` | Ubuntu Docker | Build API and Streamlit images. |
Dependabot is configured for weekly devcontainer updates. The devcontainer uses Debian with Python, Node, and Docker-outside-of-Docker features.
Current tests cover:
- Diffusion schedules and RePaint shape/mask invariants.
- Deterministic RNG state roundtrip.
- Lazy preprocessing, identifier dropping, rare-category preservation, and reversible encoding.
- DCR and Shadow-MIA report structure.
- `ci_lite` deterministic runtime caps.
- FastAPI artifact serving and path traversal rejection.
- ONNX export, INT8 quantization, and CFG amplification parity guard.
DATALUS reduces disclosure risk through generative synthesis and empirical audit, but it is not a legal declaration that data is anonymized under every context. Release decisions must remain accountable to institutional governance, LGPD interpretation, and domain-specific risk review.
Operational governance requirements:
- Do not commit raw datasets, processed Parquet, checkpoints, ONNX files, or generated artifacts. `.gitignore` excludes common artifact paths and file extensions.
- Remove direct identifiers before training and review quasi-identifiers in the schema report.
- Preserve rare categories intentionally, then evaluate whether rare generated combinations create re-identification risk.
- Publish synthetic data only with an audit report, schema metadata, source-data provenance, generation configuration, and limitations.
- Treat OAA `release` mode as the release evidence path. Treat `ci_lite` only as regression protection.
- Keep access logs and artifact versions for every public-sector release.
- Apply HTTPS/TLS and access control to artifact services even when artifacts are synthetic.
Prefer Parquet or ORC when available. For CSV, DATALUS already scans lazily and sinks to Parquet, but very wide schemas or expensive inference can still pressure memory. Remove free-text and known identifier columns upstream, split very large CSVs by year or region, and use the Python ZeroShotPreprocessor(sample_size=...) API with a smaller deterministic sample if schema inference itself is too large.
Retry with lower batch sizes in this order: 1024, 512, 256. Resume from checkpoint_latest.pt rather than restarting. Avoid increasing hidden dimensions on T4. Very high-cardinality categorical columns increase embedding width; consider governance-driven column reduction before training.
Open manifest.json and inspect int8_cfg_parity. If amplified_max_abs_diff exceeds 0.2, do not publish the INT8 artifact. Use model_fp32.onnx, retrain and re-export, or disable quantization with --no-quantize until parity is acceptable.
Check that real and synthetic Parquet files share retained schema columns, have at least eight usable rows, and contain enough class variation for the requested utility target. Add --max-audit-rows 512 for bounded regression runs. A ci_lite failure indicates a regression or test fixture issue; it does not replace a full release audit.
Intervention keys must be retained columns from encoder_config.json. Dropped identifiers and unsupported columns cannot be intervened on. Numeric interventions should be parseable as numbers. Categorical interventions not observed during encoder fitting map to __UNKNOWN__, which can produce less meaningful counterfactuals.
This is expected for the default API. Server-side generation is disabled to keep deployment free of server-side PyTorch dependencies. Use the browser ONNX component or explicitly construct the app with create_app(..., enable_server_generation=True) for trusted internal deployments.
Verify that the React component was built with npm run build, the API is reachable at DATALUS_ARTIFACT_BASE_URL, the selected domain contains manifest.json, encoder_config.json, projector_config.json, and the chosen ONNX file, and the Docker artifact mount points to ./artifacts.
datalus audit adds utility metrics only when --target-column is provided and the target exists in both real and synthetic frames. Privacy metrics still run without a target column.
- Kotelnikov et al. TabDDPM: Modelling Tabular Data with Diffusion Models.
- Lugmayr et al. RePaint: Inpainting using Denoising Diffusion Probabilistic Models.
- Ho and Salimans. Classifier-Free Diffusion Guidance.
- Song et al. Denoising Diffusion Implicit Models.
- Shokri et al. Membership Inference Attacks Against Machine Learning Models.
- Governo Digital. Dados Abertos, Portal Brasileiro de Dados Abertos, and API Portal de Dados Abertos.
DATALUS is released under the Apache License 2.0.
@software{Silva_DATALUS_Diffusion-Augmented_Tabular,
author = {Silva, Emanuel Lázaro Custódio},
license = {Apache-2.0},
title = {{DATALUS: Diffusion-Augmented Tabular Architecture for Local Utility and Security}},
url = {https://github.com/emanuellcs/datalus}
}