# Synthetic-Dataset-Gen

Production-grade, deterministic synthetic data generation for ML pipelines.

Repository: https://github.com/CodersAcademy006/Synthetic-Dataset-Gen
Maintainer: Srijan Upadhayay

A config-driven pipeline for generating, validating, and publishing synthetic datasets with strict immutability, deterministic outputs, and full auditability. Designed for ML teams that need reproducible training data without the compliance overhead of real data.

## Features

- **100% Config-Driven**: no hardcoded values; all behavior controlled via YAML
- **Deterministic Generation**: same config + version = identical bytes every time
- **Immutable Runs**: once finalized, runs cannot be modified or overwritten
- **Schema Enforcement**: strict type/constraint validation against declared schemas
- **Drift Detection**: automatic quality and distribution drift metrics vs. prior versions
- **Kaggle Publishing**: one-command upload to Kaggle with versioned metadata
- **Structured Logging**: optional JSON logs for pipeline observability
- **CI-Ready**: GitHub Actions workflow included
## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                          ORCHESTRATOR                           │
│                         scripts/run.py                          │
│   CLI → Config Loading → Version Resolution → Stage Execution   │
└─────────────────────────────────────────────────────────────────┘
                                 │
             ┌───────────────────┼───────────────────┐
             ▼                   ▼                   ▼
      ┌─────────────┐     ┌─────────────┐     ┌─────────────┐
      │   PROFILE   │     │  GENERATE   │     │  VALIDATE   │
      │ Prior data  │     │  Synthetic  │     │  Schema +   │
      │  analysis   │     │   output    │     │ constraints │
      └─────────────┘     └─────────────┘     └─────────────┘
             │                   │                   │
             └───────────────────┼───────────────────┘
                                 ▼
                       ┌─────────────────┐
                       │    EVALUATE     │
                       │ Quality + Drift │
                       └─────────────────┘
                                 │
                                 ▼
                       ┌─────────────────┐
                       │    ARTIFACTS    │
                       │  Finalization   │
                       └─────────────────┘
                                 │
                                 ▼
                       ┌─────────────────┐
                       │    REGISTRY     │
                       │ Version catalog │
                       └─────────────────┘
```
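
The orchestrator executes these stages in order. `scripts/run.py` itself is not reproduced in this README; the sketch below shows the stage sequence using the two documented entry points, with the undocumented stages indicated as comments.

```python
# Illustrative only: this is not the actual scripts/run.py. generate_dataset and
# validate_dataset are the documented entry points (see Usage below); the other
# stages live in engine/profile.py, evaluate.py, artifacts.py, and registry.py,
# whose entry points are not documented here.
from engine.generate import generate_dataset
from engine.validate import validate_dataset


def run_pipeline(dataset_dir: str, configs: dict, run_dir: str) -> None:
    # PROFILE: engine/profile.py analyzes the prior version, if one exists.
    # GENERATE: deterministic synthetic output.
    generate_dataset(dataset_dir, configs, run_dir)
    # VALIDATE: schema and constraint checks.
    validate_dataset(dataset_dir, configs, run_dir)
    # EVALUATE: engine/evaluate.py computes quality and drift metrics.
    # ARTIFACTS: engine/artifacts.py freezes the run directory (immutability).
    # REGISTRY: engine/registry.py records the version in registry/datasets.json.
```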

## Quick Start

```bash
git clone https://github.com/CodersAcademy006/Synthetic-Dataset-Gen.git
cd Synthetic-Dataset-Gen

# Install dependencies
pip install -r requirements.txt

# (Optional) Install dev dependencies for testing
pip install -r requirements-dev.txt
```

Run a dataset pipeline:

```bash
python scripts/run.py --dataset finance_transactions
```

This will:
- Load configs from `datasets/finance_transactions/`
- Generate deterministic synthetic data
- Validate against schema constraints
- Compute quality metrics
- Finalize artifacts in `runs/finance_transactions/<version>/`
- Update the registry

To reproduce a specific version byte-for-byte, pin the run ID:

```bash
# Fixed version for byte-identical reproduction
python scripts/run.py --dataset finance_transactions --run-id 2025-01-15T00-00-00Z
```

## Project Structure

```
synthetic-data-platform/
├── scripts/
│ └── run.py # CLI orchestrator
├── engine/
│ ├── profile.py # Prior version profiling
│ ├── generate.py # Deterministic data generation
│ ├── validate.py # Schema/constraint validation
│ ├── evaluate.py # Quality and drift metrics
│ ├── artifacts.py # Immutability enforcement
│ ├── version.py # Version identity resolution
│ ├── ingest.py # External dataset ingestion
│ ├── kaggle.py # Kaggle upload with retry
│ ├── registry.py # Registry update logic
│ └── logging_utils.py # JSON structured logging
├── datasets/
│ ├── finance_transactions/
│ │ ├── dataset.yaml # Row count, metadata
│ │ ├── schema.yaml # Column definitions
│ │ └── evolution.yaml # Drift/missingness config
│ ├── market_time_series/
│ └── saas_events/
├── registry/
│ └── datasets.json # Authoritative version catalog
├── runs/ # Generated at runtime
│ └── <dataset>/<version>/
├── tests/
│ ├── test_version.py
│ ├── test_generate.py
│ ├── test_validate.py
│ ├── test_profile.py
│ ├── test_evaluate.py
│ ├── test_artifacts.py
│ └── test_integration.py
├── notebooks/
│ └── run_dataset.ipynb # Interactive runner
├── .github/workflows/
│ └── ci.yml # GitHub Actions CI
├── requirements.txt # Pinned runtime deps
├── requirements-dev.txt # Dev/test deps
└── README.md
```

## Configuration

Each dataset is defined by three YAML files. Example `dataset.yaml`:

```yaml
name: finance_transactions
domain: finance
description: Synthetic transactional data for ML training
row_count: 10000
```

Example `schema.yaml`:

```yaml
columns:
  transaction_id:
    type: integer
    nullable: false
  amount:
    type: float
    nullable: false
    constraints:
      min: 0.01
      max: 10000.0
  is_fraud:
    type: boolean
    nullable: false
```

Supported types: `string`, `integer`, `float`, `boolean`, `datetime`.
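
The enforcement logic lives in `engine/validate.py` and is not reproduced here. As an illustrative sketch, checking a column against such a schema entry with pandas could look like this (`check_column` is a hypothetical helper, not part of the pipeline's API):

```python
import pandas as pd


def check_column(df: pd.DataFrame, name: str, spec: dict) -> list[str]:
    """Validate one column against its schema entry; return human-readable errors."""
    errors = []
    col = df[name]
    # Nullability check.
    if not spec.get("nullable", True) and col.isna().any():
        errors.append(f"{name}: nulls present but nullable is false")
    # Range constraints, applied to non-null values only.
    constraints = spec.get("constraints", {})
    if "min" in constraints and (col.dropna() < constraints["min"]).any():
        errors.append(f"{name}: value below min={constraints['min']}")
    if "max" in constraints and (col.dropna() > constraints["max"]).any():
        errors.append(f"{name}: value above max={constraints['max']}")
    return errors
```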

Example `evolution.yaml`:

```yaml
fraud_rate: 0.02            # 2% of rows marked as fraud
missingness:
  merchant_category: 0.05   # 5% nulls in this column
```
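
The generation logic that consumes these knobs lives in `engine/generate.py`. Purely as an illustration (not the actual implementation), applying them to a generated frame could look like:

```python
import numpy as np
import pandas as pd


def apply_evolution(df: pd.DataFrame, cfg: dict, rng: np.random.Generator) -> pd.DataFrame:
    """Hypothetical sketch: apply fraud_rate and missingness knobs to a frame."""
    df = df.copy()
    if "fraud_rate" in cfg:
        # Mark the configured fraction of rows as fraud.
        df["is_fraud"] = rng.random(len(df)) < cfg["fraud_rate"]
    for column, rate in cfg.get("missingness", {}).items():
        # Null out the configured fraction of each listed column.
        df.loc[rng.random(len(df)) < rate, column] = None
    return df
```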

## Run Artifacts

Each run produces these files in `runs/<dataset>/<version>/`:

| File | Description |
|---|---|
| `data.parquet` | Generated dataset (Parquet preferred) |
| `data.csv` | Fallback if Parquet unavailable |
| `configs_snapshot.json` | Frozen copy of input configs |
| `run_metadata.json` | Execution context and timestamps |
| `validation_report.json` | Schema validation results |
| `evaluation_report.json` | Quality and drift metrics |
| `prior_profile.json` | Prior version statistics (if exists) |
| `final_metadata.json` | Finalization manifest (immutability marker) |
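
Consuming a finalized run is plain file I/O. For example (using the sample run ID from this README):

```python
import json

import pandas as pd

run_dir = "runs/finance_transactions/2025-01-15T00-00-00Z"

# Load the generated data (fall back to data.csv if Parquet was unavailable).
df = pd.read_parquet(f"{run_dir}/data.parquet")

# The finalization manifest marks the run as immutable.
with open(f"{run_dir}/final_metadata.json") as fh:
    manifest = json.load(fh)
```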

## Usage

CLI:

```bash
python scripts/run.py --dataset <name> [--run-id <version>]
```

Python API:

```python
# Generate
from engine.generate import generate_dataset
generate_dataset(dataset_dir, configs, run_dir)
# Validate
from engine.validate import validate_dataset
validate_dataset(dataset_dir, configs, run_dir)
# Ingest external data
from engine.ingest import ingest_external_dataset
ingest_external_dataset("external.parquet", "runs/imports/v1")
# Publish to Kaggle
from engine.kaggle import upload_to_kaggle
upload_to_kaggle("runs/finance_transactions/v1", "user/dataset-name")| Variable | Default | Description |
|---|---|---|
| `SDP_LOGGING_ENABLED` | `true` | Enable JSON logging |
| `SDP_LOG_LEVEL` | `INFO` | Log level (DEBUG, INFO, WARNING, ERROR) |
| `KAGGLE_CONFIG_DIR` | `~/.kaggle` | Kaggle credentials location |
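
The structured logger lives in `engine/logging_utils.py` and is not shown here; a minimal sketch of how these variables could drive it:

```python
import json
import logging
import os


class JsonFormatter(logging.Formatter):
    """Hypothetical stand-in for engine/logging_utils.py: one JSON object per record."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload, sort_keys=True)


def configure_logging() -> None:
    # SDP_LOGGING_ENABLED=false silences pipeline logging entirely.
    if os.environ.get("SDP_LOGGING_ENABLED", "true").lower() != "true":
        logging.disable(logging.CRITICAL)
        return
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(
        level=os.environ.get("SDP_LOG_LEVEL", "INFO"),
        handlers=[handler],
    )
```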

## Testing

```bash
# Run all tests
pytest tests/

# Run with coverage
pytest tests/ --cov=engine --cov-report=term-missing

# Run a specific test file
pytest tests/test_generate.py -v
```

## Continuous Integration

The GitHub Actions workflow (`.github/workflows/ci.yml`) runs on every push/PR:
- Python 3.11 on Ubuntu
- Installs pinned dependencies
- Runs pytest with fail-fast
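
The workflow file is not reproduced in this README; a minimal workflow satisfying those three constraints could look like the following (the actual `ci.yml` may differ):

```yaml
# Hypothetical minimal workflow; the repository's ci.yml may differ.
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt -r requirements-dev.txt
      - run: pytest tests/ -x   # -x stops at the first failure (fail-fast)
```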

## Kaggle Publishing

```bash
# Ensure credentials exist
ls ~/.kaggle/kaggle.json

# Upload a finalized run
python -c "
from engine.kaggle import upload_to_kaggle
upload_to_kaggle(
    run_dir='runs/finance_transactions/2025-01-15T00-00-00Z',
    kaggle_slug='username/finance-synthetic',
    is_public=True
)
"
```

The upload includes only:

- `data.parquet` or `data.csv`
- `final_metadata.json`
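
Per the Limitations section below, `upload_to_kaggle` retries once with a fixed 2-second backoff. Conceptually, that policy amounts to the following sketch (not the actual `engine/kaggle.py`):

```python
import time


def with_single_retry(attempt, backoff_seconds: float = 2.0):
    """Sketch of the documented retry policy: one retry, fixed 2s backoff."""
    try:
        return attempt()
    except Exception:
        time.sleep(backoff_seconds)  # fixed backoff; no exponential growth
        return attempt()             # second and final attempt
```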

## Determinism Guarantees

| Aspect | Guarantee |
|---|---|
| Random seed | Derived from `SHA256(dataset_name:version)` |
| Column order | Lexicographically sorted |
| Row order | Preserved from generation/ingestion |
| Timestamps | UTC, ISO-8601 format |
| JSON output | Sorted keys, deterministic formatting |
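
A seed derived this way is stable across platforms and processes. The sketch below uses a hypothetical `derive_seed` helper; the engine's actual digest-to-seed folding may differ:

```python
import hashlib

import numpy as np


def derive_seed(dataset_name: str, version: str) -> int:
    """Fold SHA256(dataset_name:version) into a 64-bit RNG seed."""
    digest = hashlib.sha256(f"{dataset_name}:{version}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# Same identity -> same seed -> identical draws on every run.
rng = np.random.default_rng(derive_seed("finance_transactions", "2025-01-15T00-00-00Z"))
```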

## Limitations

- **CSV dtype inference**: pandas may infer different types across platforms; documented, not fixed by contract (see the mitigation sketch after this list)
- **Local filesystem only**: no native S3/GCS/Azure Blob support
- **Single-process**: no parallelization or distributed generation
- **Basic generation heuristics**: column values are inferred from names only; no statistical modeling
- **Kaggle single-retry**: one retry with 2s backoff; no exponential backoff
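
For the CSV dtype caveat, a consumer can pin dtypes explicitly when reading `data.csv`. The dtype map below mirrors the example `schema.yaml` and is an assumption for illustration, not something the pipeline emits:

```python
import pandas as pd

# Hypothetical explicit dtype map mirroring the example schema.yaml.
SCHEMA_DTYPES = {
    "transaction_id": "int64",
    "amount": "float64",
    "is_fraud": "boolean",  # pandas nullable boolean, robust to missing values
}

df = pd.read_csv(
    "runs/finance_transactions/2025-01-15T00-00-00Z/data.csv",
    dtype=SCHEMA_DTYPES,
)
```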

## Contributing

1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Run tests (`pytest tests/`)
4. Commit changes (`git commit -m 'Add amazing feature'`)
5. Push to branch (`git push origin feature/amazing-feature`)
6. Open a Pull Request

## License

MIT License; see `LICENSE` for details.

Built for ML teams who need reproducible, compliant synthetic data.

Prepared by Srijan Upadhayay