Urz1/synthetic-data-studio


Synth Studio 🧪

Privacy-first synthetic data generation for healthcare and fintech

License: MIT | Build | Python 3.9+ | Next.js 16 | Docs


⚡ Quick Install

# Clone
git clone https://github.com/Urz1/synthetic-data-studio.git && cd synthetic-data-studio

# Backend
cd backend && cp .env.example .env
pip install -r requirements.txt && alembic upgrade head
uvicorn app.main:app --reload

# Frontend (new terminal)
cd frontend && cp .env.local.example .env.local
pnpm install && pnpm dev

Frontend: http://localhost:3000 | API Docs: http://localhost:8000/docs

📖 Full setup guide: LOCAL_DEVELOPMENT.md


🎯 What It Does

Generate high-quality synthetic data with differential privacy guarantees. Built for regulated industries:

| Industry | Use Case |
|----------|----------|
| 🏥 Healthcare (HIPAA) | Synthetic EHR, FHIR, patient records |
| 🏦 Fintech (SOC-2/GDPR) | Transaction data, fraud testing |
| 🤖 ML Teams | Privacy-safe training datasets |
| 🏢 Enterprise | Cross-department data sharing |

✨ Key Features

Generation Methods

| Method | Description | Best For |
|--------|-------------|----------|
| Schema-Based | Define columns → generate data (no source dataset needed) | Testing, prototyping |
| Dataset-Based ML | Train on real data → generate synthetic | Production quality |
| LLM-Powered Seed | AI generates realistic seed data → statistical expansion | Domain-specific realism |

ML Generators

  • CTGAN - Conditional Tabular GAN (mixed numeric + categorical)
  • TVAE - Tabular Variational Autoencoder (high-cardinality categorical)
  • GaussianCopula - Statistical copulas (fast, correlation-preserving)

Privacy & Compliance

  • Differential Privacy - Configurable ε/δ with RDP accounting
  • PII/PHI Detection - Automatic sensitive column identification
  • Compliance Reports - HIPAA, GDPR, SOC-2 ready documentation
  • Audit Logs - Immutable activity tracking
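
To make the "RDP accounting" item above more concrete, here is a minimal sketch of how a Rényi DP budget can be composed and converted to an (ε, δ) guarantee. It is not the project's accountant: the function names are hypothetical, it assumes a plain Gaussian mechanism with sensitivity 1, and it ignores subsampling amplification, so it gives a looser bound than a DP-SGD accountant would.

```python
import math


def gaussian_rdp(noise_multiplier: float, alpha: float) -> float:
    """RDP of order alpha for one Gaussian release with sensitivity 1."""
    return alpha / (2.0 * noise_multiplier ** 2)


def rdp_to_epsilon(noise_multiplier: float, steps: int, delta: float,
                   alphas=None) -> float:
    """Compose `steps` releases and convert the total to (epsilon, delta)-DP.

    Uses the standard conversion
        eps = rdp(alpha) + log(1 / delta) / (alpha - 1)
    and keeps the smallest eps over a grid of orders alpha > 1.
    """
    if alphas is None:
        # Hypothetical grid of orders; finer grids give slightly tighter bounds.
        alphas = [1 + x / 10.0 for x in range(1, 100)] + list(range(11, 64))
    best = math.inf
    for alpha in alphas:
        total_rdp = steps * gaussian_rdp(noise_multiplier, alpha)  # RDP adds up
        eps = total_rdp + math.log(1.0 / delta) / (alpha - 1.0)
        best = min(best, eps)
    return best


if __name__ == "__main__":
    for steps in (1, 10, 100):
        print(steps, round(rdp_to_epsilon(4.0, steps, 1e-5), 3))
```

More steps spend more budget, and a larger noise multiplier spends less, which is the trade-off the ε/δ settings control.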

AI-Powered Features

  • Chat Assistant - Natural language data generation guidance
  • Enhanced PII Detection - LLM-powered sensitivity analysis
  • Compliance Writer - Auto-generate compliance documentation

Quality Evaluation

  • Statistical Similarity - Distribution matching, K-S tests
  • ML Utility - Train/test accuracy preservation
  • Privacy Risk - Membership inference, re-identification risk
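
As an illustration of the K-S test mentioned under Statistical Similarity, the sketch below computes the two-sample Kolmogorov-Smirnov statistic for one numeric column using only the standard library. The function name is hypothetical and this is not necessarily how the evaluation service computes it; a value near 0 means the real and synthetic distributions are close, a value near 1 means they barely overlap.

```python
from bisect import bisect_right


def ks_statistic(real, synthetic):
    """Two-sample K-S statistic: the largest gap between the empirical CDFs."""
    a, b = sorted(real), sorted(synthetic)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


if __name__ == "__main__":
    real_ages = [23, 35, 41, 52, 67]
    synthetic_ages = [25, 33, 44, 50, 70]
    print(ks_statistic(real_ages, synthetic_ages))
```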

📋 Prerequisites

| Requirement | Version |
|-------------|---------|
| Python | 3.9+ |
| Node.js | 18+ |
| PostgreSQL | 13+ |
| Redis | 7+ (local Docker by default; set REDIS_URL for managed) |

Environment Variables:

# Backend (.env)
DATABASE_URL=postgresql://user:pass@localhost/synthstudio
SECRET_KEY=your-jwt-secret
AWS_S3_BUCKET=your-bucket  # optional
REDIS_URL=redis://localhost:6379/0  # default local container; use rediss:// for hosted

# Frontend (.env.local)
NEXT_PUBLIC_API_URL=http://localhost:8000
BETTER_AUTH_SECRET=your-auth-secret
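
For readers wiring these values into their own scripts, here is a small sketch of loading the backend variables above with the standard library. The `BackendSettings` class and `load_settings` function are hypothetical, and treating DATABASE_URL and SECRET_KEY as required is an assumption based on only AWS_S3_BUCKET being marked optional; the backend's actual settings code may differ.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BackendSettings:
    database_url: str
    secret_key: str
    redis_url: str
    aws_s3_bucket: Optional[str] = None


def load_settings(env=None) -> BackendSettings:
    """Read backend settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Assumption: these two have no sensible default, so fail early.
    missing = [k for k in ("DATABASE_URL", "SECRET_KEY") if not env.get(k)]
    if missing:
        raise RuntimeError(
            "missing required environment variables: " + ", ".join(missing)
        )
    return BackendSettings(
        database_url=env["DATABASE_URL"],
        secret_key=env["SECRET_KEY"],
        # Default mirrors the local Redis container URL shown above.
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        aws_s3_bucket=env.get("AWS_S3_BUCKET") or None,
    )
```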

🔧 Usage

Generate from Schema (No Dataset Needed)

curl -X POST "http://localhost:8000/generators/schema" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "columns": {
      "name": {"type": "string", "faker": "name"},
      "age": {"type": "integer", "min": 18, "max": 80},
      "email": {"type": "string", "faker": "email"},
      "balance": {"type": "number", "min": 0, "max": 50000}
    }
  }'

Generate from Dataset (ML-Based)

# Upload dataset
curl -X POST "http://localhost:8000/datasets/upload" \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@data.csv"

# Generate synthetic data with DP
curl -X POST "http://localhost:8000/generators/dataset/{dataset_id}/generate" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "generator_type": "ctgan",
    "num_rows": 10000,
    "epochs": 300,
    "differential_privacy": {"enabled": true, "epsilon": 1.0, "delta": 1e-5}
  }'
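
Before sending a request like the one above, it can help to sanity-check the differential_privacy values on the client side. The sketch below is illustrative only: `check_dp_params` is a hypothetical helper, not part of the API, and the thresholds are common rules of thumb (δ well below 1/n, and a single-digit ε) rather than the server's own validation rules.

```python
def check_dp_params(epsilon: float, delta: float, num_records: int) -> list:
    """Raise on invalid values; return a list of warnings for risky ones."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if not 0 < delta < 1:
        raise ValueError("delta must be strictly between 0 and 1")
    if num_records <= 0:
        raise ValueError("num_records must be positive")

    warnings = []
    # Rule of thumb: delta should be smaller than 1 / (number of records).
    if delta >= 1.0 / num_records:
        warnings.append(
            f"delta={delta} is not below 1/n={1.0 / num_records:.2e}"
        )
    # Assumed threshold: very large epsilon offers little formal protection.
    if epsilon > 10:
        warnings.append(f"epsilon={epsilon} gives a weak privacy guarantee")
    return warnings


if __name__ == "__main__":
    print(check_dp_params(epsilon=1.0, delta=1e-5, num_records=10_000))
```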

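Python Example (requests)

import requests

BASE_URL = "http://localhost:8000"

# Log in; the session keeps any cookies the API sets.
# If the API returns a bearer token instead, send it in an
# Authorization header as in the curl examples above.
session = requests.Session()
login = session.post(f"{BASE_URL}/auth/login", json={
    "email": "user@example.com", "password": "secret"
})
login.raise_for_status()

# Schema-based generation (requests needs an absolute URL)
response = session.post(f"{BASE_URL}/generators/schema", params={"num_rows": 1000}, json={
    "columns": {
        "patient_id": {"type": "string", "pattern": "PAT-[0-9]{6}"},
        "diagnosis": {"type": "category", "values": ["A01", "B12", "C34"]},
        "visit_date": {"type": "date", "min": "2024-01-01", "max": "2024-12-31"}
    }
})
response.raise_for_status()
synth_data = response.json()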
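
To show what schema-based generation means in practice, here is a rough local sketch that expands a column spec of the shape used above into rows. It is not the server's generator: the function names are hypothetical, it only handles the integer, number, category, and date types, and string columns with faker or pattern options are deliberately left unsupported.

```python
import random
from datetime import date, timedelta


def generate_value(spec: dict, rng: random.Random):
    """Draw one value for a column spec like {"type": "integer", "min": 0, "max": 9}."""
    kind = spec["type"]
    if kind == "integer":
        return rng.randint(spec["min"], spec["max"])  # inclusive bounds
    if kind == "number":
        return round(rng.uniform(spec["min"], spec["max"]), 2)
    if kind == "category":
        return rng.choice(spec["values"])
    if kind == "date":
        start = date.fromisoformat(spec["min"])
        end = date.fromisoformat(spec["max"])
        offset = rng.randint(0, (end - start).days)
        return (start + timedelta(days=offset)).isoformat()
    raise ValueError(f"unsupported column type: {kind!r}")


def generate_rows(columns: dict, num_rows: int, seed: int = 0) -> list:
    """Generate num_rows dicts, one key per column, reproducibly for a given seed."""
    rng = random.Random(seed)
    return [
        {name: generate_value(spec, rng) for name, spec in columns.items()}
        for _ in range(num_rows)
    ]


if __name__ == "__main__":
    schema = {
        "age": {"type": "integer", "min": 18, "max": 80},
        "diagnosis": {"type": "category", "values": ["A01", "B12", "C34"]},
        "visit_date": {"type": "date", "min": "2024-01-01", "max": "2024-12-31"},
    }
    for row in generate_rows(schema, 3):
        print(row)
```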

🧪 Testing

# Backend tests with coverage
cd backend && pytest tests/ -v --cov=app

# Frontend tests
cd frontend && pnpm test

# E2E tests
cd frontend && pnpm test:e2e

πŸ“ Project Structure

synth-studio/
β”œβ”€β”€ backend/                  # FastAPI API server
β”‚   β”œβ”€β”€ app/
β”‚   β”‚   β”œβ”€β”€ auth/            # Authentication (JWT, OAuth, 2FA)
β”‚   β”‚   β”œβ”€β”€ datasets/        # Dataset upload, profiling
β”‚   β”‚   β”œβ”€β”€ generators/      # Schema + ML generation
β”‚   β”‚   β”œβ”€β”€ evaluations/     # Quality metrics
β”‚   β”‚   β”œβ”€β”€ services/
β”‚   β”‚   β”‚   β”œβ”€β”€ synthesis/   # CTGAN, TVAE, Copula
β”‚   β”‚   β”‚   β”œβ”€β”€ llm/         # AI chat, PII detection
β”‚   β”‚   β”‚   └── privacy/     # DP accounting
β”‚   β”‚   β”œβ”€β”€ compliance/      # HIPAA/GDPR reports
β”‚   β”‚   └── audit/           # Activity logging
β”‚   └── tests/
β”œβ”€β”€ frontend/                 # Next.js 16 web app
β”‚   β”œβ”€β”€ app/
β”‚   β”‚   β”œβ”€β”€ dashboard/       # Overview & metrics
β”‚   β”‚   β”œβ”€β”€ datasets/        # Upload & profile
β”‚   β”‚   β”œβ”€β”€ generators/      # Create & manage
β”‚   β”‚   β”œβ”€β”€ evaluations/     # Quality reports
β”‚   β”‚   β”œβ”€β”€ synthetic-datasets/  # Generated data
β”‚   β”‚   β”œβ”€β”€ compliance/      # Compliance center
β”‚   β”‚   └── assistant/       # AI chat
β”‚   └── components/
└── docs/                     # Docusaurus docs

📚 Documentation

| Resource | Description |
|----------|-------------|
| Docs Site | Full documentation |
| Getting Started | Installation & quickstart |
| User Guide | Feature walkthroughs |
| API Reference | OpenAPI/Swagger |
| Examples | Code samples & Postman |

🤝 Contributing

  1. Fork & clone
  2. Create feature branch (git checkout -b feature/amazing)
  3. Add tests & make changes
  4. Run tests (pytest / pnpm test)
  5. Submit PR

See CONTRIBUTING.md for guidelines.


🔒 Security

Report vulnerabilities privately: halisadam391@gmail.com or see SECURITY.md.


📄 License

MIT © 2025 Sadam Husen


📬 Contact

Sadam Husen (@Urz1) • halisadam391@gmail.com

LinkedIn • GitHub


πŸ—οΈ Architecture
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               Frontend (Next.js 16)                 β”‚
β”‚  Dashboard β€’ Datasets β€’ Generators β€’ Evaluations   β”‚
β”‚  Compliance β€’ Audit β€’ Billing β€’ AI Assistant       β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                         β”‚ REST API (JWT + OAuth)
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                Backend (FastAPI)                    β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚   Auth   β”‚ β”‚ Datasets β”‚ β”‚Generatorsβ”‚            β”‚
β”‚  β”‚JWT/OAuth β”‚ β”‚Profiling β”‚ β”‚Schema/ML β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”            β”‚
β”‚  β”‚   LLM    β”‚ β”‚Evaluationβ”‚ β”‚Complianceβ”‚            β”‚
β”‚  β”‚Chat/PII  β”‚ β”‚Quality   β”‚ β”‚Reports   β”‚            β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
        β–Ό               β–Ό                β–Ό
   PostgreSQL        Redis          AWS S3
   (metadata)     (queue/cache)    (files)
                        β”‚
                        β–Ό
              Celery Workers
        (generation, evaluation, exports)

Tech Stack:

  • Frontend: Next.js 16, React 19, TypeScript 5, Tailwind, shadcn/ui
  • Backend: FastAPI, SQLAlchemy 2, Celery, SDV
  • ML/Privacy: CTGAN, TVAE, Opacus (DP), RDP accounting
  • LLM: OpenAI/Anthropic (chat, PII detection, compliance)
  • Infra: Vercel, Railway/AWS, Neon/Supabase

📊 Complete Feature List

Data Generation

  • Schema-based generation (no training data required)
  • Dataset-based ML generation (CTGAN, TVAE, GaussianCopula)
  • LLM-powered seed data generation
  • Differential privacy with configurable ε/δ
  • DP parameter validation & recommendations
  • Model download & export

Data Management

  • CSV upload with auto-profiling
  • Schema detection & type inference
  • PII/PHI column detection
  • Distribution analysis & statistics
  • Correlation matrices
  • Missing value analysis
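
As a rough idea of what rule-based PII/PHI column detection can look like, the sketch below flags columns by name hints and by value patterns. Everything here is an assumption for illustration: the hint list, the two regular expressions, the 0.8 match ratio, and the `flag_sensitive_columns` name are hypothetical, and the project's detector (including the LLM-powered variant) may work quite differently.

```python
import re

# Hypothetical keyword hints for column names.
NAME_HINTS = ("name", "email", "phone", "ssn", "address", "dob", "birth", "patient")

# Hypothetical value patterns; real detectors use many more.
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


def flag_sensitive_columns(rows, min_match_ratio=0.8):
    """Return {column: reason} for columns that look sensitive."""
    if not rows:
        return {}
    flagged = {}
    for col in rows[0]:
        lowered = col.lower()
        hint = next((h for h in NAME_HINTS if h in lowered), None)
        if hint:
            flagged[col] = f"column name contains '{hint}'"
            continue
        # Fall back to inspecting the non-empty values.
        values = [str(r[col]) for r in rows if r.get(col) not in (None, "")]
        for label, pattern in VALUE_PATTERNS.items():
            if not values:
                break
            hits = sum(bool(pattern.match(v)) for v in values)
            if hits / len(values) >= min_match_ratio:
                flagged[col] = f"values look like {label}"
                break
    return flagged


if __name__ == "__main__":
    sample = [
        {"contact": "a@example.com", "age": 30, "full_name": "Ada"},
        {"contact": "b@example.org", "age": 41, "full_name": "Bo"},
    ]
    print(flag_sensitive_columns(sample))
```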

Quality & Privacy

  • Statistical similarity scoring
  • ML utility evaluation (classification/regression)
  • Privacy risk assessment
  • Membership inference testing
  • k-anonymity checks
  • Privacy budget tracking
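
For the k-anonymity check listed above, the core idea fits in a few lines: group rows by their quasi-identifier values and take the smallest group size. This is a generic sketch with a hypothetical function name, not the project's evaluation code, and choosing which columns count as quasi-identifiers is left to the caller.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return k: the size of the smallest group sharing the same quasi-identifier values."""
    if not rows:
        raise ValueError("rows must be non-empty")
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


if __name__ == "__main__":
    data = [
        {"age_band": "30-39", "zip3": "941", "diagnosis": "A01"},
        {"age_band": "30-39", "zip3": "941", "diagnosis": "B12"},
        {"age_band": "40-49", "zip3": "100", "diagnosis": "C34"},
    ]
    # The third row is unique on (age_band, zip3), so k is 1.
    print(k_anonymity(data, ["age_band", "zip3"]))
```

A result of 1 means at least one record is unique on those columns and is therefore easier to re-identify.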

AI Assistant

  • Natural language queries
  • Context-aware recommendations
  • Code generation for API usage
  • Error debugging
  • Compliance guidance

Enterprise

  • HIPAA/GDPR/SOC-2 compliance reports
  • Immutable audit logs
  • Usage & billing dashboards
  • Role-based access control
  • OAuth (Google, GitHub)
  • Two-factor authentication
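
One common way to make an audit log tamper-evident is to chain entries by hash, so that editing any past entry breaks every later link. The sketch below shows that idea with the standard library; it is an assumption about how "immutable" could be achieved, not a description of the project's audit module, and the function names and entry layout are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash for the first entry


def _digest(event: dict, prev_hash: str) -> str:
    payload = json.dumps({"event": event, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> dict:
    """Append an event, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    entry = {"event": event, "prev_hash": prev_hash, "hash": _digest(event, prev_hash)}
    log.append(entry)
    return entry


def verify_chain(log: list) -> bool:
    """Recompute every hash; return False if any entry was altered or reordered."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if _digest(entry["event"], prev_hash) != entry["hash"]:
            return False
        prev_hash = entry["hash"]
    return True


if __name__ == "__main__":
    log = []
    append_entry(log, {"user": "alice", "action": "dataset.upload"})
    append_entry(log, {"user": "alice", "action": "generator.run"})
    print(verify_chain(log))
```

The chain only detects tampering; preventing it still depends on where the log is stored and who can write to it.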
πŸ—ΊοΈ Roadmap
  • FHIR/HL7 medical data formats
  • Time-series synthetic data
  • Enterprise SSO (SAML 2.0)
  • Python & JavaScript SDKs
  • Self-hosted Docker templates
  • Real-time streaming generation

See CHANGELOG.md for version history.

About

Generate hyper-realistic, privacy-safe synthetic data and compliance packs for regulated startups, bootcamps, competitions, and learning.
