Day-two operations for Future Horizons ASPT. Commands assume the repo root and a
configured .env (see .env.example).
For first-time deploys to Vercel + Railway + Neon (the demo / UAT
stack), see docs/DEPLOY.md. For the persona-driven
testing brief colleagues use after deploy, see
docs/USER_TESTING_BRIEF.md.
cp .env.example .env # ports already aligned to docker-compose
docker compose up -d # Postgres + Redis
npm run demo:bootstrap # install + prisma generate/push + full seed
npm run dev # concurrent client (5173) + server (3002)

Or use the one-shot helpers (Linux/macOS): `./start.sh` / `./stop.sh`.
For the colleague-review walkthrough see DEMO.md.
npm run build
node server/dist/index.js # server only; serve client dist/ via any static host

# Build (CI is the canonical builder; this command is for local verification)
docker build -t herm-platform:$(git rev-parse --short HEAD) .
# Run with the prod env-var matrix
docker run --rm -p 3002:3002 \
-e NODE_ENV=production \
-e DATABASE_URL=postgresql://... \
-e JWT_SECRET=... \
-e FRONTEND_URL=https://app.example.com \
-e SENTRY_DSN=... \
-e STRIPE_SECRET_KEY=... \
-e STRIPE_WEBHOOK_SECRET=... \
herm-platform:<tag>
# Apply migrations from inside the container (prefer this over running
# from CI/CD so secrets stay scoped to the deployment context). The
# `prisma` CLI is installed globally in the runner stage, pinned to
# the same major as @prisma/client, so this works air-gapped without
# an at-runtime npm registry round-trip.
docker run --rm \
-e DATABASE_URL=postgresql://... \
herm-platform:<tag> prisma migrate deploy --schema=prisma/schema.prisma

The image is multi-stage:

- `node:20-alpine` runtime (~150 MB final image).
- Non-root user `node`.
- Bundles compiled server (`server/dist`), the Prisma client (regenerated in the runner stage so the engine binary matches the slim production dep tree), `prisma/schema.prisma`, and `prisma/migrations/` for `db:migrate:deploy` from inside the container.
- Excludes the client SPA (served separately), test files, source TypeScript, and seed scripts (seed scripts need `tsx` which is a devDependency; run them from a separate one-off container if needed).
- Built-in `HEALTHCHECK` probes `/api/health` every 30 s.

| Var | Required | Notes |
|---|---|---|
| `NODE_ENV` | yes | `production` flips strict env-check + hides internal error details |
| `DATABASE_URL` | yes | Postgres connection string (consider `?connection_limit=10&options=-c statement_timeout=15000`) |
| `JWT_SECRET` | yes | ≥ 32 chars; 64+ recommended |
| `FRONTEND_URL` | yes (prod) | Browser-facing origin used for CORS and the post-SSO redirect (which carries the session JWT). `checkEnvironment()` refuses to boot in prod without it. |
| `SP_BASE_URL` | yes (prod) | API origin used for SAML ACS + OIDC callback URLs (Phase 10.10). Without it, IdPs would be told to redirect to localhost. `checkEnvironment()` refuses to boot in prod without it. |
| `SP_ENTITY_ID` | optional | SAML entity ID. Defaults to `<SP_BASE_URL>/api/sso/sp`; override only when an IdP admin (e.g. UKAMF) assigns one. |
| `SP_SIGNING_KEY` | optional (req'd for UKAMF) | PEM private key used to sign SAML AuthnRequests + SP metadata. Pair with `SP_SIGNING_CERT` (both set or both unset). Inline PEM (literal `\n` allowed) or `file:/abs/path.pem`. Generate a self-signed pair with `openssl req -x509 -newkey rsa:2048 -nodes -days 730 -keyout sp-signing.key -out sp-signing.crt -subj /CN=herm-sp`. |
| `SP_SIGNING_CERT` | optional (req'd for UKAMF) | PEM X.509 certificate matching `SP_SIGNING_KEY`. Same accepted forms. |
| `SSO_SECRET_KEY` | recommended (prod) | 32-byte master key (64 hex chars or base64) for envelope-encrypting `SsoIdentityProvider.samlCert` + `.oidcClientSecret`. Generate with `openssl rand -hex 32`. Without it, encrypted rows return opaque 404 and the SSO admin write path refuses to persist plaintext. Rotation: re-encrypt every row through the read→write path; a one-shot rotation tool is on the follow-up list. |
| `REDIS_URL` | optional | Enables shared lockout state + the SSO OIDC PKCE flow store + Redis readiness probe. Without it, lockout falls back to in-memory (per-pod). Required for multi-pod deployments. |
| `SENTRY_DSN` | optional | Error reporting; no-op when unset |
| `SENTRY_ENVIRONMENT` | optional | Defaults to `NODE_ENV` |
| `SENTRY_TRACES_SAMPLE_RATE` | optional | 0–1; default 0 (errors only) |
| `STRIPE_SECRET_KEY` | optional | If set, `STRIPE_WEBHOOK_SECRET` must also be set (env-check is fatal otherwise) |
| `STRIPE_WEBHOOK_SECRET` | optional | Required when `STRIPE_SECRET_KEY` is set |
| `ANTHROPIC_API_KEY` | optional | AI assistant |
| `SMTP_HOST` | optional | SMTP relay host. When set, `SMTP_FROM` (or `SMTP_USER`) must also be set or env-check fails; outbound email would otherwise silently no-op. |
| `SMTP_PORT` | optional | 1–65535; STARTTLS on 587 by default |
| `SMTP_SECURE` | optional | `"true"` for SMTPS (port 465); leave unset for STARTTLS |
| `SMTP_USER` / `SMTP_PASSWORD` | optional | Relay credentials |
| `SMTP_FROM` | optional (req'd with `SMTP_HOST`) | RFC-5322 mailbox, e.g. `"HERM <noreply@example.com>"` |
| `RATE_LIMIT_*` | optional | Per-tier ceilings (`ANONYMOUS`, `FREE`, `PROFESSIONAL`, `ENTERPRISE`, `API_KEY`); see `middleware/security.ts` for defaults |
| `DEV_UNLOCK_ALL_TIERS` | optional | Pre-billing escape hatch: every logged-in user gets `tier="enterprise"`. Env-check shouts loudly if set in prod. Useful for demos before subscriptions land. |
| `DEMO_PASSWORD` | optional | Overrides the seed-default demo user password (`demo12345`). Leave unset for documented demos; the Login page demo-credentials hint is hard-coded to the default. |
| `RETENTION_SCHEDULER_ENABLED` | optional (default `false`) | Set to `true` to start the in-process retention sweeper at server boot (Phase 11.9). When the sweeper runs, soft-deleted Users older than the grace window are hard-deleted. Leave off in dev / test envs to avoid surprise purges; turn on in prod. Out-of-process schedulers (Kubernetes CronJob, GitHub Actions) can drive sweeps via `npm run db:retention-sweep` instead. |
| `RETENTION_GRACE_DAYS` | optional (default 30) | Grace window between soft-delete (`User.deletedAt` stamped) and hard-delete by the scheduler. Shorten for stricter retention, lengthen for more recovery time. |
| `RETENTION_SWEEP_INTERVAL_MS` | optional (default 21_600_000 = 6 h) | How often the in-process scheduler runs. The window is days, so a 6-hour sweep is fine. |
| `RETENTION_BATCH_SIZE` | optional (default 100) | Per-sweep cap on rows hard-deleted, so a backlog cannot lock the DB. |
| `ENABLE_SOFT_DELETE_AUTH_CHECK` | optional (test-only) | When `NODE_ENV=test`, the soft-delete revocation check in `authenticateJWT` is skipped by default to avoid consuming queued `prisma.user.findUnique` mocks across the existing suite. Set to `true` in tests that specifically pin the revocation behaviour. Has no effect outside test mode. |
| `UKAMF_METADATA_URL` | optional | URL of a SAML 2.0 metadata aggregate (e.g. `https://metadata.ukfederation.org.uk/ukfederation-metadata.xml`). When set together with `UKAMF_ROTATION_ENABLED=true` (Phase 11.10), the in-process scheduler periodically polls the feed and rotates `SsoIdentityProvider.samlCert` rows whose `samlEntityId` appears in the feed with a different cert. Out-of-process equivalent: `npx tsx server/src/scripts/ukamf-rotate.ts [--dry-run]`. |
| `UKAMF_ROTATION_ENABLED` | optional (default `false`) | Explicit opt-in for the in-process UKAMF rotation scheduler. Without it (or without `UKAMF_METADATA_URL`), the scheduler is a no-op at boot. |
| `UKAMF_ROTATION_INTERVAL_MS` | optional (default 86_400_000 = 24 h) | Interval between sweeps when the scheduler is enabled. UKAMF cert rotations are rare; daily is plenty. |
| `UKAMF_FETCH_TIMEOUT_MS` | optional (default 30_000 = 30 s) | HTTP timeout for the metadata fetch. The UKAMF feed is several MB; allow a generous timeout. |

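
The cross-variable rules in the table above can be hard to keep in mind when assembling a deployment's secret set. The following is a minimal sketch of those rules as a pure function, useful as a pre-deploy sanity check. It is not the project's `checkEnvironment()`; the name `checkEnvSketch`, the `Env` type, and the error wording are assumptions made for illustration, and only rules stated in the table are encoded.

```typescript
// Hypothetical sketch of the pairing / required-in-prod rules from the table.
// Not the real checkEnvironment(); names and messages are illustrative only.
type Env = Record<string, string | undefined>;

function checkEnvSketch(env: Env): string[] {
  const errors: string[] = [];
  const isSet = (name: string): boolean => (env[name] ?? "").trim() !== "";
  const isProd = env["NODE_ENV"] === "production";

  // Variables marked "yes" in the Required column.
  for (const name of ["NODE_ENV", "DATABASE_URL", "JWT_SECRET"]) {
    if (!isSet(name)) errors.push(`${name} is required`);
  }
  // JWT_SECRET must be at least 32 characters.
  if (isSet("JWT_SECRET") && (env["JWT_SECRET"] ?? "").length < 32) {
    errors.push("JWT_SECRET must be at least 32 characters");
  }
  // Variables marked "yes (prod)".
  if (isProd) {
    for (const name of ["FRONTEND_URL", "SP_BASE_URL"]) {
      if (!isSet(name)) errors.push(`${name} is required in production`);
    }
  }
  // Paired variables: one implies (or requires) the other.
  if (isSet("STRIPE_SECRET_KEY") && !isSet("STRIPE_WEBHOOK_SECRET")) {
    errors.push("STRIPE_WEBHOOK_SECRET is required when STRIPE_SECRET_KEY is set");
  }
  if (isSet("SP_SIGNING_KEY") !== isSet("SP_SIGNING_CERT")) {
    errors.push("SP_SIGNING_KEY and SP_SIGNING_CERT must be set together");
  }
  if (isSet("SMTP_HOST") && !isSet("SMTP_FROM") && !isSet("SMTP_USER")) {
    errors.push("SMTP_HOST requires SMTP_FROM or SMTP_USER");
  }
  return errors;
}
```

Passing the environment in as an argument, rather than reading it from the process, keeps the sketch easy to run against a candidate secret set before rolling a deployment.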
SIGTERM/SIGINT → close HTTP listener → flush Sentry → prisma.$disconnect() → exit 0.
Force-exit at 10 s if shutdown stalls. Kubernetes / PM2 / systemd can send
SIGTERM directly. The Dockerfile uses exec-form CMD ["node", ...] so
SIGTERM reaches Node directly without a wrapping shell process.
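
The shutdown sequence above (ordered steps, with a forced exit if they stall) can be sketched as below. This is an illustration under assumptions, not the server's actual handler: `shutdownSketch` and the `Step` type are hypothetical, and the steps are injected stand-ins for closing the HTTP listener, flushing Sentry, and `prisma.$disconnect()`.

```typescript
// Hypothetical sketch: ordered shutdown steps raced against a force deadline.
type Step = { name: string; run: () => Promise<void> };

async function shutdownSketch(
  steps: Step[],
  forceAfterMs: number,
): Promise<{ completed: string[]; forced: boolean }> {
  const completed: string[] = [];

  // Deadline promise that resolves to "forced" once the timer fires.
  let fireDeadline: (value: "forced") => void = () => {};
  const deadline = new Promise<"forced">((resolve) => {
    fireDeadline = resolve;
  });
  const timer = setTimeout(() => fireDeadline("forced"), forceAfterMs);

  // Run the steps strictly in order, recording each one that finishes.
  const ordered = (async (): Promise<"done"> => {
    for (const step of steps) {
      await step.run();
      completed.push(step.name);
    }
    return "done";
  })();

  try {
    const outcome = await Promise.race([ordered, deadline]);
    return { completed, forced: outcome === "forced" };
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

A caller would pass `10_000` for `forceAfterMs` to mirror the 10 s limit described above, then exit the process based on the result.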
npm run db:push

`db:push` is the fast path used by local dev and the CI test job. It reconciles
the live DB to match prisma/schema.prisma by running diffed DDL directly,
without creating a migration record. Never use it against prod.
npm run db:migrate:deploy # applies every migration in prisma/migrations/ in order
npm run db:migrate:status # show which migrations have been applied

This is the only supported path to change a prod schema. Workflow:

First-time baseline (one-off, only for DBs bootstrapped via `db:push`)

Any DB that was created or kept in sync via `db:push` already has the current schema shape but no rows in `_prisma_migrations`. Running `migrate deploy` against it would try to re-create tables that already exist and fail. For those DBs, run the baseline once:

for m in $(ls prisma/migrations | grep -v migration_lock); do
  npx prisma migrate resolve --applied "$m" --schema=prisma/schema.prisma
done
npm run db:migrate:status # confirm "Database schema is up to date"

After this, future deploys use `db:migrate:deploy` normally. Fresh Postgres instances (CI, new prod) skip the baseline; `migrate deploy` applies every migration from scratch.

- Locally: edit `prisma/schema.prisma` then run `npx prisma migrate dev --name <short_change_name>`. This creates a new timestamped folder under `prisma/migrations/` with the SQL Prisma computed and applies it to your dev DB.
- Commit the new migration folder alongside the schema change. CI's `Validate Prisma schema` job will fail the PR if the schema file drifted from the migrations.
- On deploy: pipeline runs `npm run db:migrate:deploy` against prod before the new app version starts taking traffic. Migrations are forward-only; rollback uses a DB snapshot, not a reverse migration (see "Rolling back a deploy" below).

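
Whether a given database needs the one-off baseline before `migrate deploy` comes down to two observations about it. The sketch below encodes that decision; `chooseMigrationPath` and its inputs are hypothetical, and how you obtain them (for example, by inspecting the target DB by hand) is left open.

```typescript
// Hypothetical decision helper for the baseline rule described above.
type MigrationPath = "baseline-then-deploy" | "deploy";

function chooseMigrationPath(
  hasAppTables: boolean, // schema objects already exist in the target DB
  appliedMigrationRows: number, // row count in _prisma_migrations (0 if absent)
): MigrationPath {
  // Tables exist but Prisma has no history: the DB was built via db:push,
  // so mark existing migrations as applied before deploying.
  if (hasAppTables && appliedMigrationRows === 0) return "baseline-then-deploy";
  // Fresh DB (no tables) or an already-tracked DB: migrate deploy works as-is.
  return "deploy";
}
```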
npm run db:generate

npm run db:seed # full seed (HERM + vendors + demo)
npm run db:seed:demo # demo user only
npm run db:seed:jurisdictions # procurement jurisdictions only

npm run db:studio

Set `DATABASE_URL` to the production connection string; never run
db:push --force-reset against prod. For schema changes use the
migration workflow described in "Apply migrations (prod / staging)"
above — db:push is only for dev / CI test DBs that get torn down.
CI fails any PR whose client bundles grow past the configured ceilings in
client/.size-limit.json. Run locally:
npm run build --workspace=@herm-platform/client
npm run size:check --workspace=@herm-platform/client

Output is a per-asset table of current size vs ceiling (gzip-compressed). A ceiling miss means one of three things:

- You added a heavy dep. Look at the diff; consider dynamic-import or route-level code-splitting before bumping the ceiling.
- A transitive dep grew. `npm why <package>` plus `du -sh client/dist/assets` to see which chunk moved.
- Threshold is genuinely too tight. Bump in `.size-limit.json` and call it out in the PR description so the regression is reviewable, not silent.

Initial ceilings (committed at the Phase 12.5 baseline) are the current
size + ~3% headroom. The follow-up Phase 12.5b PR will introduce
route-level code-splitting on the four heaviest pages
(ProcurementProjects, ProcurementGuide, SectorAnalytics,
AdminSystems) to chase the kickoff doc's <500 KB initial-load target;
the ceilings will ratchet down with that work.
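
As a worked example of the "current size + ~3% headroom" rule, the sketch below derives a ceiling from a measured gzip size and checks a later build against it. Both helpers are hypothetical; the real check is driven by `client/.size-limit.json`, and the byte figures in the usage note are made up.

```typescript
// Hypothetical helpers illustrating the headroom arithmetic.
function ceilingBytes(currentGzipBytes: number, headroomPercent: number = 3): number {
  // Integer-percent arithmetic avoids floating-point drift; round up to whole bytes.
  return Math.ceil((currentGzipBytes * (100 + headroomPercent)) / 100);
}

function exceedsCeiling(currentGzipBytes: number, ceiling: number): boolean {
  return currentGzipBytes > ceiling;
}
```

For a chunk that measures 200000 bytes gzipped at baseline, `ceilingBytes(200000)` gives 206000; a later build at 207000 bytes would then fail the check.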
curl -i http://localhost:3002/api/health # liveness
curl -i http://localhost:3002/api/readiness # db ping (also at /api/ready)
npm run demo:validate # all of the above + demo login

Expect 200 for both. `readiness` flips to 503 on DB (or, when `REDIS_URL` is
set, Redis) loss.
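
The readiness rule above is small enough to state as a truth table in code. This is a sketch of the described behaviour, not the server's handler; `readinessStatus` and `ReadinessInputs` are hypothetical names.

```typescript
// Hypothetical sketch of the readiness rule: DB is always checked,
// Redis only counts when REDIS_URL is configured.
type ReadinessInputs = {
  dbOk: boolean;
  redisConfigured: boolean;
  redisOk: boolean;
};

function readinessStatus(inputs: ReadinessInputs): 200 | 503 {
  if (!inputs.dbOk) return 503;
  if (inputs.redisConfigured && !inputs.redisOk) return 503;
  return 200;
}
```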
Prometheus text-format metrics live at GET /metrics (mounted outside the
/api namespace so scrapers reach a stable, version-free path). Every metric
uses the herm_ prefix. Defaults emitted by prom-client.collectDefaultMetrics
cover Node.js runtime + process state (heap, GC, event loop lag, FD count, CPU).
curl -s http://localhost:3002/metrics | head -40

Application-level metrics emitted today:

| Metric | Type | Labels | Notes |
|---|---|---|---|
| `herm_http_request_duration_seconds` | Histogram | `method`, `route`, `status` | RED-method latency. Buckets cover 5ms–10s. |
| `herm_http_requests_total` | Counter | `method`, `route`, `status` | RED-method rate + errors. |
| `herm_http_requests_in_flight` | Gauge | `method` | Saturation indicator. |
| `herm_auth_login_total` | Counter | `outcome` | `success` / `bad_credentials` / `locked` / `mfa_required` / `mfa_failed`. |
| `herm_sso_login_total` | Counter | `protocol`, `outcome` | `protocol`: `saml` / `oidc`. `outcome`: `success` / `validation_failure` / etc. Never includes `institutionSlug` (per ADR-0001: it would let an external observer enumerate which tenants have SSO configured). |

Route labels collapse dynamic IDs via the matched Express route pattern
(/api/users/:id), so cardinality stays bounded. Unmatched paths get the
sentinel label __not_found rather than the raw URL.
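
The labelling rule above can be sketched as a small function. It assumes the metrics middleware has access to the router mount path and the matched route pattern (in Express these are exposed as `req.baseUrl` and `req.route?.path`); the helper name `routeLabel` is hypothetical and the real middleware may differ.

```typescript
// Hypothetical sketch: build a bounded-cardinality route label.
function routeLabel(baseUrl: string, matchedPattern: string | undefined): string {
  // No matched route means a 404 path; never record the raw URL.
  if (matchedPattern === undefined) return "__not_found";
  // Use the pattern (e.g. "/:id"), not the concrete path (e.g. "/42").
  return `${baseUrl}${matchedPattern}`;
}
```

With this rule, requests to `/api/users/1` and `/api/users/2` share the single label `/api/users/:id`.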
/metrics is public on the application port by default — protect it via
network isolation, not auth. The scrape pattern is:
- Run Prometheus inside the same VPC / k8s namespace; scrape over the internal port. The public load balancer never routes to `/metrics`.
- Or: front the app with an ingress that strips `/metrics` from the public surface.

Adding bearer-token auth to /metrics is on the deferred list. Until then,
do not expose this path to the open internet.
All production logs are JSON lines. Every line carries req.id so you can
correlate a single request across the HTTP log, business logic logs, and the
error handler.
# Find all logs for one request
jq 'select(.req.id == "REQ-ID")' server.log
# Failed logins in the last hour
jq 'select(.msg=="login failed" and (.time | fromdateiso8601) > (now-3600))' server.log
# AI cost audit: sum output tokens per user
jq -r 'select(.msg=="ai.chat completed") | "\(.userId)\t\(.outputTokens)"' server.log \
| awk '{ users[$1] += $2 } END { for (u in users) print u, users[u] }'

| Log message | Meaning | Fix |
|---|---|---|
| `readiness: database ...` | Postgres unreachable | Check DB host/creds; verify `DATABASE_URL` |
| `ai.chat failed` | Anthropic API error / timeout | Inspect `err.status`, retry, check quota |
| `unhandled error` | Uncaught exception reached `errorHandler` | Triage via `err.stack` + `req.id` |
| `AUTHENTICATION_ERROR` | 401 from `authenticateJWT` | Token missing/expired/invalid |
| `RATE_LIMIT_EXCEEDED` | Caller hit `express-rate-limit` | Back off; raise limit if false-positive |

- Generate a new value (`openssl rand -base64 48`).
- Update the secret in your secret store.
- Roll the deployment. All existing tokens are invalidated; clients will see a 401 and the axios interceptor will redirect them to `/login`.

- Create a new key in the Anthropic console.
- Update the secret.
- Roll the deployment. No user-visible impact if keys overlap during the rollout.
Same pattern. Webhooks: update the endpoint signing secret in both Stripe
and STRIPE_WEBHOOK_SECRET simultaneously.
The app itself performs no backups — rely on the managed Postgres provider (e.g. RDS, Supabase, Neon) automated backups. To restore:
# 1. Point DATABASE_URL at the restore target.
# 2. Apply migration history (forward-only; matches prod schema state):
npm run db:migrate:deploy
# 3. If needed, re-seed reference data (non-tenant):
npm run db:seed

Tenant data (users, baskets, projects) comes from the backup restore, not
the seed. If the snapshot was taken at a schema state OLDER than the
current migration history, restore the snapshot first then run
db:migrate:deploy to bring it forward.
- Identify the last known good release SHA / tag.
- In your deploy platform, roll the service back to that version.
- Watch `/api/readiness` until 200.
- Watch error-rate dashboards for 5 minutes.
- Open an incident note describing the trigger and rollback.
Schema-incompatible rollbacks: if the rolled-back version is incompatible with the current DB schema, you must first restore the DB to a snapshot taken before the bad migration. Treat this as an incident.
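
The "watch `/api/readiness` until 200" step in the rollback list can be automated with a bounded poll. The sketch below is one way to do it under assumptions: `waitForReady` is a hypothetical helper, `probe` stands in for an HTTP GET of `/api/readiness` that returns the status code, and `sleep` stands in for the delay between attempts. Both are injected so the loop can be exercised without a running server.

```typescript
// Hypothetical sketch: poll a readiness probe until it reports 200
// or the attempt budget runs out.
async function waitForReady(
  probe: () => Promise<number>,
  maxAttempts: number,
  sleep: () => Promise<void>,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      if ((await probe()) === 200) return true;
    } catch {
      // A refused connection during rollout counts as "not ready yet".
    }
    // Wait before the next attempt, but not after the last one.
    if (attempt < maxAttempts) await sleep();
  }
  return false;
}
```

A `false` result means the rolled-back version never became ready within the budget, which is a signal to escalate rather than keep waiting.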
- Rate limits: `server/src/middleware/security.ts` (global) and `server/src/api/chat/chat.router.ts` (chat-specific).
- AI timeout: `REQUEST_TIMEOUT_MS` in `server/src/services/ai-assistant.ts`.
- JSON body size: `express.json({ limit: '1mb' })` in `server/src/app.ts`.
- Token expiry: `generateToken` in `server/src/middleware/auth.ts`.