Nydra (Neural Hydra): End-to-end MLOps starter

Nydra is a production-grade template for ML APIs and pipelines. It couples FastAPI + MLflow + Metaflow with security, observability, and deployment patterns that work from laptop to enterprise.

Highlights

  • Inference/API: FastAPI with API keys (Argon2id), RBAC, optional OIDC/JWT, mTLS toggles, security headers/CSP, Redis-backed rate limiting (per-key/global/IP with circuit breaker), and OTEL tracing/logging.
  • Traffic control: Champion-only, canary/A/B, and shadow routing with hot-reload from MLflow Model Registry.
  • Model ops: Metaflow flows for training, evaluation/promotion guard (champion vs challenger), feature pipeline, drift monitoring, and key maintenance; promotion CLI and reports logged to MLflow.
  • Data contracts: Versioned Pandera schemas for training/inference with compatibility checks and metrics for alerts.
  • Observability: Prometheus metrics, Grafana dashboards + alert rules, OTLP hooks, structured logs with trace/span IDs; drift/eval metrics exposed for dashboards.
  • Security & secrets: Vault optional, env loading with fallbacks, TLS toggles for Postgres/Redis/MLflow, and guidance for secret sidecars; security headers and strict CORS presets for prod.
  • Deployment: Docker Compose dev/prod-ish stack, Helm/K8s manifests (ingress, HPA, PodSecurity, RBAC, NetworkPolicy), and Traefik guidance for shared self-hosted ingress.
  • Integrations: Extras for PyTorch, TensorFlow, Transformers, Ultralytics, HuggingFace hub download, DuckDB/Snowflake/BigQuery stubs.
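The traffic-control modes listed above can be pictured with a small routing function. This is a sketch under stated assumptions rather than the template's actual code: the function name `choose_aliases` is hypothetical, `PCT_CANARY` is assumed to be a percentage between 0 and 100, and the `champion`/`challenger` alias names are taken from the Flows section.

```python
import random
from typing import Optional, Tuple


def choose_aliases(
    strategy: str,
    pct_canary: float = 0.0,
    rng: Optional[random.Random] = None,
) -> Tuple[str, Optional[str]]:
    """Return (serving_alias, shadow_alias) for a single request.

    Hypothetical helper: the alias names follow the README, the routing
    rule itself is an assumption.
    """
    rng = rng or random.Random()
    if strategy == "champion_only":
        return "champion", None
    if strategy == "canary":
        # Route roughly pct_canary percent of requests to the challenger.
        serving = "challenger" if rng.random() * 100 < pct_canary else "champion"
        return serving, None
    if strategy == "shadow":
        # The champion answers; the challenger is scored on the side.
        return "champion", "challenger"
    raise ValueError(f"unknown TRAFFIC_STRATEGY: {strategy!r}")
```

Passing a seeded `random.Random` makes canary decisions reproducible in tests.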

Quickstart

  • Generate: cookiecutter https://github.com/matjsz/nydra.git
  • Configure: copy .env.example to .env; set MODEL_NAME, MLFLOW_TRACKING_URI, MLFLOW_BACKEND_URI (Postgres), MLFLOW_ARTIFACT_ROOT (S3/MinIO), APP_KEY, TRAFFIC_STRATEGY, MODEL_FLAVOR.
  • Run dev stack: docker compose --profile dev up --build -d (starts the API, MLflow, MinIO, Postgres for auth and the MLflow backend, Redis, Prometheus, and Grafana).
  • Train: uv run python -m src.flows.train run --model_name ${MODEL_NAME}.
  • Issue key: curl -H "X-App-Key: $APP_KEY" -H "Content-Type: application/json" -X POST http://localhost:8000/auth/key -d '{"name":"dev","role":"admin"}'.
  • Call: curl -H "X-API-Key: <key>" -H "Content-Type: application/json" -d '{"input":0.5}' http://localhost:8000/predict.
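The last two Quickstart steps can also be scripted. The sketch below mirrors the curl calls using only the standard library. The helper names are hypothetical, and the shape of the JSON returned by `/auth/key` and `/predict` is not documented here, so `send` simply returns whatever JSON the server sends back.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # dev stack address from the Quickstart


def build_issue_key_request(app_key: str, name: str, role: str) -> urllib.request.Request:
    """Build the POST /auth/key request (hypothetical helper)."""
    body = json.dumps({"name": name, "role": role}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/auth/key",
        data=body,
        method="POST",
        headers={"X-App-Key": app_key, "Content-Type": "application/json"},
    )


def build_predict_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build the POST /predict request (hypothetical helper)."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/predict",
        data=body,
        method="POST",
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
    )


def send(req: urllib.request.Request, timeout: float = 10.0):
    """Send a request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the dev stack running, `send(build_predict_request("<key>", {"input": 0.5}))` performs the same call as the curl example.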

Configuration essentials

  • Inference/model: MODEL_NAME, MODEL_FLAVOR, TRAFFIC_STRATEGY (champion_only|canary|shadow), PCT_CANARY, MLFLOW_TRACKING_URI, MLFLOW_BACKEND_URI, MLFLOW_ARTIFACT_ROOT.
  • Auth/keys: AUTH_DB_URL (Postgres), APP_KEY (issuance), ENABLE_KEY_MANAGER, KEY_DEFAULT_TTL_DAYS, KEY_ROTATION_GRACE_DAYS, KEY_NOTIFY_DAYS.
  • Rate limiting: RATE_LIMIT_BACKEND (local|redis), REDIS_URL, GLOBAL_RATE_LIMIT_PER_SEC, IP_RATE_LIMIT_PER_SEC, per-key limits set on issuance.
  • Security: ALLOWED_ORIGINS (required in prod), CONTENT_SECURITY_POLICY (if set), ENABLE_OIDC + issuer/audience/JWKS, DB_SSL_MODE/DB_SSL_ROOT_CERT, REDIS_USE_SSL/certs, mTLS toggles.
  • Observability: OTEL_EXPORTER_OTLP_ENDPOINT/headers, Prometheus scrape /metrics, alert rules in prom-rules.yml.
  • Secrets: VAULT_ENABLED=true with VAULT_ADDR, VAULT_TOKEN/VAULT_TOKEN_FILE, VAULT_NAMESPACE, VAULT_KV_MOUNT, VAULT_KV_PATH.
  • SMTP (optional for key expiry): SMTP_HOST, SMTP_PORT, SMTP_USER, SMTP_PASSWORD, SMTP_FROM.
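To illustrate how a few of these variables interact, here is a minimal sketch of a settings loader that validates them at startup. The `Settings` class, the `load_settings` function, and every default value are assumptions made for the example. Only the variable names and their allowed values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

VALID_STRATEGIES = {"champion_only", "canary", "shadow"}
VALID_BACKENDS = {"local", "redis"}


@dataclass(frozen=True)
class Settings:
    model_name: str
    traffic_strategy: str
    pct_canary: float
    rate_limit_backend: str
    redis_url: Optional[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read and validate a subset of the variables (defaults are assumed)."""
    env = os.environ if env is None else env

    model_name = env.get("MODEL_NAME", "")
    if not model_name:
        raise ValueError("MODEL_NAME is required")

    strategy = env.get("TRAFFIC_STRATEGY", "champion_only")
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"TRAFFIC_STRATEGY must be one of {sorted(VALID_STRATEGIES)}")

    # Assumed to be a percentage in [0, 100].
    pct_canary = float(env.get("PCT_CANARY", "0"))
    if not 0 <= pct_canary <= 100:
        raise ValueError("PCT_CANARY must be between 0 and 100")

    backend = env.get("RATE_LIMIT_BACKEND", "local")
    if backend not in VALID_BACKENDS:
        raise ValueError(f"RATE_LIMIT_BACKEND must be one of {sorted(VALID_BACKENDS)}")

    redis_url = env.get("REDIS_URL") or None
    if backend == "redis" and redis_url is None:
        raise ValueError("REDIS_URL is required when RATE_LIMIT_BACKEND=redis")

    return Settings(model_name, strategy, pct_canary, backend, redis_url)
```

Failing fast on a missing `REDIS_URL` keeps a misconfigured production deployment from silently running without a shared limiter.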

Auth & key lifecycle

  • src/api/security/key_store.py stores Argon2id-hashed keys with prefixes; roles: consumer|tester|admin; per-key rate limits; validity windows (valid_from, expires_at); owner/contact metadata.
  • Endpoints: /auth/key (issue via APP_KEY or admin key), /auth/keys (list, admin only), /auth/keys/{id}/rotate, /auth/keys/{id}/deactivate.
  • Key Manager (src/api/services/key_manager.py) runs in-process to detect expiring/expired keys, emit Prometheus gauges, and email owners; an optional Redis leader lock avoids duplicate notifications.
  • Scheduled maintenance flow: src/flows/key_maintenance.py for deactivation + notifications (Metaflow/cron style).
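The validity-window logic described above might look roughly like the following. This is an illustrative sketch, not the contents of key_store.py or key_manager.py: the `KeyRecord` fields are a reduced, hypothetical subset, hashing is omitted, and the status names are invented for the example. `notify_days` stands in for `KEY_NOTIFY_DAYS`.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class KeyRecord:
    """Hypothetical, reduced view of a stored key (no hash shown)."""
    prefix: str
    role: str  # consumer | tester | admin
    valid_from: datetime
    expires_at: Optional[datetime]
    active: bool = True


def key_status(key: KeyRecord, now: datetime, notify_days: int) -> str:
    """Classify a key relative to `now` (status names are assumptions)."""
    if not key.active:
        return "deactivated"
    if now < key.valid_from:
        return "not_yet_valid"
    if key.expires_at is None:
        return "active"
    if now >= key.expires_at:
        return "expired"
    if key.expires_at - now <= timedelta(days=notify_days):
        # Within the notification window: a key manager could email the owner.
        return "expiring"
    return "active"
```

A periodic job could call `key_status` for each record and export the counts per status as gauges.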

Flows (Metaflow)

  • Train (src/flows/train.py): logs metrics to MLflow, registers model, sets aliases (champion/challenger), logs baseline stats.
  • Eval (src/flows/eval.py): compares champion vs challenger on holdout/production slices, logs Markdown/HTML reports to MLflow, blocks promotion on regression; supports approval flags/Slack webhook.
  • Features (src/flows/features.py): builds feature parquet, tags schema/version, optional Redis/Feast-style online push.
  • Drift (src/flows/drift.py): compares current data to baseline stats and logs drift metrics.
  • Promotion guard script: scripts/promotion_guarded.py to require fresh eval before alias changes.
  • Key maintenance (src/flows/key_maintenance.py): scheduled expiry checks/notifications.
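The promotion guard idea (block promotion on regression and require a fresh eval) can be sketched in a few lines. The names `EvalReport` and `may_promote`, the 24-hour freshness window, and the "higher metric is better" convention are all assumptions for illustration. The real checks live in src/flows/eval.py and scripts/promotion_guarded.py.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Tuple


@dataclass
class EvalReport:
    """Hypothetical summary of one champion-vs-challenger evaluation."""
    champion_metric: float
    challenger_metric: float
    evaluated_at: datetime


def may_promote(
    report: EvalReport,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
    min_gain: float = 0.0,
) -> Tuple[bool, str]:
    """Return (allowed, reason). Assumes a higher metric is better."""
    if now - report.evaluated_at > max_age:
        return False, "eval report is stale"
    if report.challenger_metric < report.champion_metric + min_gain:
        return False, "challenger does not beat champion"
    return True, "ok"
```

Returning a reason string alongside the decision makes it easy to include the outcome in a report or notification.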

Observability

  • Prometheus metrics: request latency, inference counts/errors by alias + endpoint, shadow usage, rate-limit decisions, per-key usage, key-expiry gauges.
  • Grafana dashboards provisioned in grafana/provisioning/dashboards; alert rules for latency/5xx/drift/eval in prom-rules.yml.
  • OTEL: src/api/utils/otel.py instruments FastAPI and clients; logs include trace_id/span_id when tracing is enabled.
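As a rough picture of what "logs include trace_id/span_id when tracing is enabled" means, the sketch below adds the IDs to JSON log lines only when they are set. It deliberately uses standard-library `contextvars` as a stand-in for the OpenTelemetry context, so it does not reflect how src/api/utils/otel.py obtains the IDs. All names are hypothetical.

```python
import contextvars
import io
import json
import logging

# Stand-ins for the active trace context (an assumption for this sketch).
trace_id_var = contextvars.ContextVar("trace_id", default=None)
span_id_var = contextvars.ContextVar("span_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object, adding trace IDs if present."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        trace_id = trace_id_var.get()
        if trace_id is not None:
            payload["trace_id"] = trace_id
            payload["span_id"] = span_id_var.get()
        return json.dumps(payload)


def make_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream`."""
    logger = logging.getLogger("nydra.sketch")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

Carrying the IDs in every log line lets a dashboard jump from a log entry to the matching trace.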

Deployment

  • Docker Compose: dev/prod-ish stack in docker-compose.yml; see docs/self_hosted.md for self-hosted guidance, Traefik shared-ingress pattern, and Vault usage.
  • Helm/K8s: charts under helm/ and manifests under k8s/ include ingress, HPA, PodDisruptionBudget, PodSecurityContext, RBAC, NetworkPolicies, TLS toggles.
  • Traefik shared ingress: create traefik-public network once; run Traefik; attach each model stack with host-rule labels (see docs/self_hosted.md).

Data contracts & schemas

  • Pandera schemas for inference/training in src/api/utils/contracts.py; versions tracked via INFERENCE_SCHEMA_VERSION, TRAINING_SCHEMA_VERSION, and EXPECTED_TRAINING_SCHEMA_VERSION.
  • Compatibility checks enforced in training/inference; schema version metrics surface for alerting.
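A compatibility check between schema versions could be as simple as the sketch below. The rule used here (same major version, and a minor version at least as new as expected) is an assumption, as are the function names. The actual policy lives in src/api/utils/contracts.py.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int]:
    """Parse 'MAJOR.MINOR' (an optional leading 'v' is ignored)."""
    parts = version.strip().lstrip("v").split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def is_compatible(actual: str, expected: str) -> bool:
    """Assumed rule: same major version, minor not older than expected."""
    a_major, a_minor = parse_version(actual)
    e_major, e_minor = parse_version(expected)
    return a_major == e_major and a_minor >= e_minor
```

For example, `is_compatible(TRAINING_SCHEMA_VERSION, EXPECTED_TRAINING_SCHEMA_VERSION)` could gate model loading, with the result also exported as a metric.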

Security posture

  • Security headers + CSP, strict CORS in prod, mTLS toggles for internal services, Redis-backed limiter default in prod with circuit breaker, optional OIDC alongside API keys.
  • Secrets from Vault/env; guidance for rotation in docs/rotation.md; Alembic migration alembic/versions/0001_init_auth.py seeds auth schema; scripts/db_upgrade.py runs migrations.
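One way to read "Redis-backed limiter with circuit breaker" is a limiter that falls back to a local token bucket when the shared backend keeps failing. The sketch below shows that pattern under stated assumptions: `TokenBucket` and `GuardedLimiter` are hypothetical names, the Redis check is represented by an arbitrary callable, and the failure threshold and cooldown values are invented for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """In-process token bucket: `rate_per_sec` refill, `burst` capacity."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = burst
        self.clock = clock
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class GuardedLimiter:
    """Try a primary limiter; after repeated errors, use the fallback for a while."""

    def __init__(self, primary: Callable[[], bool], fallback: TokenBucket,
                 max_failures: int = 3, cooldown: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self.primary = primary  # e.g. a function that asks Redis
        self.fallback = fallback
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.open_until = 0.0

    def allow(self) -> bool:
        now = self.clock()
        if now < self.open_until:
            # Circuit is open: skip the primary backend entirely.
            return self.fallback.allow()
        try:
            allowed = self.primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.open_until = now + self.cooldown
                self.failures = 0
            return self.fallback.allow()
        self.failures = 0
        return allowed
```

Injecting the clock keeps both classes deterministic in tests, and the fallback bucket means a Redis outage degrades limiting rather than blocking all traffic.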

Docs & assets

  • First-day walkthrough: docs/first_day.md
  • Architecture: docs/architecture.md
  • Self-hosted guide + Traefik/Vault notes: docs/self_hosted.md
  • Rotation/secrets: docs/rotation.md
  • Flows overview: docs/flows.md
  • Postman collection: postman_collection.json
  • Alerts/dashboards: prom-rules.yml, grafana/provisioning/dashboards/json/*.json

CI/testing

  • CI workflow runs smoke/e2e (scripts/smoke.py, scripts/e2e.py); unit/contract tests in tests/.
  • Run locally: uv run pytest.

Built with sweat, scars and much love.
