The 2026-05-21 client demo with Kathelijne surfaced two basic UX gaps:
- "How do you just basically respond to an email or whatsapp, that's basic functionality." The app only offered four pre-approved template replies on `/en/triage`.
- `/en/triage` (where the template replies lived) was missing from the desktop sidebar. Kathelijne couldn't find the page; only the mobile nav surfaced it.
- Free-text reply composer on the enquiry detail page.
- Operator types verbatim text → service decides channel (WhatsApp within the 24h customer-service window, otherwise email).
- Outbound reuses the existing `whatsappService.sendTextMessage` / `emailService.sendBrandedEmail` paths so message-log + DEMO_MODE behaviour is preserved. `AuditLog` row per operator action.
- Sidebar gets a `Triage` link between `Inbox` and `Enquiries`.
- `lib/utils/whatsapp-window.ts`: pure 24h-window helpers.
- `lib/services/reply.constants.ts`: `MAX_REPLY_BODY_LENGTH` in its own module so the client composer can import the constant without dragging the server-only reply service into the browser bundle.
- `lib/services/reply.service.ts`: `replyService.sendReply(input)` returns a discriminated `SendReplyResult`. Channel selection mirrors the stock-reply service: enquiry's channel first, then customer's preferred, then whichever contact is populated.
- `app/api/enquiries/[id]/reply/route.ts`: NURSE+ POST endpoint; maps service statuses to 200 / 400 / 404 / 409 (window-expired) / 422 / 500.
- `components/triage/FreeTextReplyComposer.tsx`: client component with textarea, character counter, channel indicator, live 24h window status, send button. Disables when window expired.
- `app/[locale]/enquiries/[id]/page.tsx`: embeds the composer below the message thread.
- `components/layout/Sidebar.tsx`: adds `Triage` between `Inbox` and `Enquiries`.
- `messages/{en,fr}.json`: `enquiries.reply.*` namespace.
- Tests: 24 new (window util ×7, service ×10, API ×7); all 1415 prior preserved → 1439 / 1439 green.
- `npm run lint`: green
- `npm run typecheck`: green
- `npx vitest run`: 1439 / 1439 green
- `SKIP_ENV_VALIDATION=true npm run build`: green
- Manual on Vercel preview deferred to PR review.
- 24h window anchor uses `Enquiry.receivedAt` as a proxy for "last inbound from this customer". For multi-message threads a later inbound resets it via the same webhook path. Anchoring on the most recent `EnquiryMessage` with `direction = INBOUND` is a future enhancement.
- The composer lives on the enquiry detail page, not inline in `/inbox` (would clutter the list view). The detail-page placement matches the existing triage action card.
- Free-text WhatsApp always uses `sendTextMessage`. Outside the 24h window Meta policy does not allow free-text delivery, so the service refuses and the operator is routed to the existing stock-reply / template flow on `/triage`.
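The 24h-window check anchored on `Enquiry.receivedAt` can be sketched as a pair of pure functions. This is a minimal illustration, not the contents of `lib/utils/whatsapp-window.ts`: the names `windowRemainingMs` and `isWindowOpen` are hypothetical, and treating the window as closed at exactly 24h is an assumption.

```typescript
// Hypothetical helper names; the real helpers live in lib/utils/whatsapp-window.ts.
const WINDOW_MS = 24 * 60 * 60 * 1000;

/** Milliseconds left in the customer-service window, clamped at 0. */
function windowRemainingMs(receivedAt: Date, now: Date): number {
  const expiresAt = receivedAt.getTime() + WINDOW_MS;
  return Math.max(0, expiresAt - now.getTime());
}

/** True while free-text WhatsApp is still allowed (assumed closed at exactly 24h). */
function isWindowOpen(receivedAt: Date, now: Date): boolean {
  return windowRemainingMs(receivedAt, now) > 0;
}
```

Passing `now` in as a parameter keeps the helpers pure, so the composer's live countdown and the service's refusal path can share the same logic and be tested without mocking the clock.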
The 2026-05-21 client demo exposed a silent data-loss class. The WhatsApp webhook returned 200 to Meta in ~50ms, then processed the message asynchronously. When the async chain hit a Neon cold-start, the failure was logged and the message was lost. Meta saw 200 so wouldn't retry, and there was no operator-visible signal anywhere. Phase 27 fixed the in-app simulator side of the same incident; Phase 28 closes the real-WhatsApp side.
- Webhook routes (`/api/webhooks/whatsapp`, `/api/webhooks/email`) enqueue async failures to the existing `FailedOperation` DLQ with scopes `whatsapp-inbound` / `email-inbound` and the originating message id as `operationKey`.
- `deadLetterService.replay(id)` re-runs the original intake for these inbound scopes. Outbound scopes (`whatsapp-send-text`, `email-send`) keep their manual mark-replayed path because the triggering workflow can't be re-driven from the DLQ.
- `POST /api/admin/observability/failed-operations/[id]/replay`: ADMIN-only, audit-logged in `SecurityAuditLog` (`OTHER` event, target `FailedOperation`), maps replay outcomes to 200 / 500 / 422 / 404 / 409.
- `/admin/observability` DLQ table gets a new "Replay" button for PENDING rows whose scope is replayable. Existing "Mark replayed" / "Abandon" buttons preserved for outbound scopes and for cases where the operator wants to skip the auto-replay path.
- i18n strings (EN: "Replay", FR: "Rejouer") added to `messages/{en,fr}.json` under `observability.dlq.replay`.
- `lib/services/dead-letter.service.ts`: new `replay()` method, `REPLAYABLE_SCOPES` export, `ReplayResult` type. Re-uses the existing intake services without circular import.
- `app/api/webhooks/whatsapp/route.ts`: wraps the async catch with a `deadLetterService.enqueue` call, extracts the first message id for `operationKey`.
- `app/api/webhooks/email/route.ts`: synchronous path; intake throws are caught, enqueued, and surfaced as 500 to n8n (so n8n's own retry policy still gets a chance, but the DLQ row provides the operator-visible recovery path if n8n gives up).
- `app/api/admin/observability/failed-operations/[id]/replay/route.ts`: new endpoint.
- `components/admin/ObservabilityDashboard.tsx`: adds `onReplay` handler and conditionally rendered button for replayable scopes.
- Tests: 17 new (`dead-letter.service.test.ts` +6 replay branches, `webhooks/dlq-wiring.test.ts` ×4, `admin-observability-replay.test.ts` ×6). Refactor preserves all 1398 existing tests → 1415 / 1415 green.
- Docs: this entry + `docs/KNOWN_ISSUES.md` Phase 28 entry.
- `npm run lint`: green
- `npm run typecheck`: green
- `npx vitest run`: 1415 / 1415 green
- `SKIP_ENV_VALIDATION=true npm run build`: green
- Manual on Vercel preview deferred to PR review: deliberately fail Neon (e.g. break `DATABASE_URL` temporarily, send a WhatsApp), confirm the row appears in `/admin/observability` with scope `whatsapp-inbound`, fix the credential, click Replay, confirm the message lands in `/inbox` and the row flips to REPLAYED.
- Replay is single-row. A "Replay all PENDING" bulk action is trivial to add later if the DLQ ever has more than a handful of rows at once.
- The stored payload is `JSON.stringify(redact(raw))`. `redact()` scrubs Auth / api-key / signature headers but lets the message body and phone number through (the same data that would have landed in `Enquiry.rawText` had intake succeeded). PII retention policy follows the existing rules for `FailedOperation`.
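The scope gate and header scrubbing described above can be sketched as follows. This is a minimal illustration under assumptions: `isReplayable` and `redactHeaders` are hypothetical names, and the header pattern is a guess at what "Auth / api-key / signature" covers, not the real `redact()` rule set.

```typescript
// The two inbound scopes named in this entry; outbound scopes stay manual.
const REPLAYABLE_SCOPES = new Set<string>(["whatsapp-inbound", "email-inbound"]);

/** Hypothetical gate for the "Replay" button and the replay endpoint. */
function isReplayable(scope: string): boolean {
  return REPLAYABLE_SCOPES.has(scope);
}

// Assumed pattern: any header whose name mentions authorization, api-key or signature.
const SENSITIVE_HEADER = /authorization|api-key|signature/i;

/** Returns a copy of the headers with sensitive values masked. */
function redactHeaders(headers: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(headers)) {
    out[name] = SENSITIVE_HEADER.test(name) ? "[REDACTED]" : headers[name];
  }
  return out;
}
```

The body and phone number are deliberately not touched in this sketch, matching the note that the DLQ row holds the same data intake would have stored.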
| Phase | Name | Branch | Status |
|---|---|---|---|
| 0 | Scaffold | `feature/phase0-scaffold` | ✅ Complete |
| 1 | Foundation | `feature/phase1-foundation` | ✅ Complete |
| 2 | Core Features | `feature/phase2-core-features` | ✅ Complete |
| 3 | Messaging Intake | `feature/phase3-messaging-intake` | ✅ Complete |
| 4 | Triage Operations | `feature/phase4-triage-ops` | ✅ Complete |
| 5 | Route Planning | `feature/phase5-route-planning` | ✅ Complete |
| 6 | Booking & Confirmations | `feature/phase6-booking-confirmations` | ✅ Complete |
| 7 | Hardening & Polish | `feature/phase7-hardening-polish` | ✅ Complete |
| 8 | UAT & Launch | `feature/phase8-uat-launch` | ✅ Complete |
| 9–13 | Auth, Clinical, Demo, AI Vision, Idempotency | various | ✅ Complete (see migration history + `docs/KNOWN_ISSUES.md`) |
| 14 | Security Hardening (PRs A–E) | `feature/phase14-*` | ✅ Complete |
| 15 | Production-readiness uplift | per-PR | ✅ Complete (2026-04-23; see `docs/PRODUCTION_READINESS.md`) |
| 16 | Overnight hardening (8 slices) | per-PR | ✅ Complete (2026-04-25 → 2026-04-27; see `docs/KNOWN_ISSUES.md` Phase 16 sections) |
| 17 | Google Maps cost-control + go-live gate | `claude/equismile-resume-build-tDChx` | ✅ Complete (2026-05-13; see `docs/MAPS_COST_CONTROL.md`) |
| 18 | Unified inbox + n8n Gmail wire-up + journey-planner reorder | `claude/equismile-phase18-unified-inbox-journey-ux` | ✅ Complete (2026-05-13) |
| 19 | Outlook setup + scope-clarification doc + handover runbook | `claude/equismile-phase19-handover-scope-outlook` | ✅ Complete (2026-05-13) |
| 20 | Template UX + customer-DB import + WhatsApp simulator + road-following routes | `claude/equismile-phase20-templates-import-simulator-routing` | ✅ Complete (2026-05-13; see `docs/IMPORT_GUIDE.md`) |
| 20.5 | Docs handover refresh + sidebar scroll/collapse + macOS scrollbar | `claude/equismile-docs-refresh-handover`, `claude/equismile-sidebar-scroll-fixup` | ✅ Complete (2026-05-14; see PRs #140 + #141) |
| 21 | Audit residue: Sentry error-sink option + Prisma pool-param boot warning | `claude/equismile-phase21-audit-residue` | ✅ Complete (2026-05-15) |
| 22 | Audit tail: WhatsApp token boot probe + pre-migrate snapshot + SW cache verification | `claude/equismile-phase22-audit-tail` | ✅ Complete (2026-05-16; closes the 2026-04-18 audit) |
| 23 | Go-live runbooks: WhatsApp Meta production approval + production data load | `claude/equismile-phase23-operator-runbooks` | ✅ Complete (2026-05-16) |
| 24 | Operator readiness: UAT refresh + DR drill + operator quick-start | `claude/equismile-phase24-operator-readiness` | ✅ Complete (2026-05-19) |
| 25 | Build hardening: SKIP_ENV_VALIDATION honoured at module-import time | `claude/equismile-phase25-build-fix` | ✅ Complete (2026-05-19) |
For three PRs in a row (#148, #149, prior local-build runs) the "five-check gate" verified four checks (lint, typecheck, prisma validate, tests) and noted the fifth (`SKIP_ENV_VALIDATION=true npm run build`) as a "pre-existing failure on origin/main." That gap was never properly closed; it got documented as acceptable rather than fixed.

The root cause was a real bug, not environmental noise: `lib/env.ts` invoked `validateEnv()` at module-import time, which threw on missing `DATABASE_URL`. The flag `SKIP_ENV_VALIDATION=true` only gated the standalone `scripts/check-env.ts` validator, NOT the module-level validation. So when `next build` collected page data for any route that imported `lib/env` (directly or transitively), the build aborted with "Failed to collect page data for /api/appointments/[id]/cancel", misleadingly attributed to that one route when every route was affected.
Code (one file, ~25 lines):
- `lib/env.ts`: `validateEnv()` now checks `SKIP_ENV_VALIDATION` at the top. When set, supplies a placeholder `DATABASE_URL=postgresql://skip:skip@localhost:5432/skip` (only when `DATABASE_URL` is unset) and lets Zod's `.optional().default(…)` fields fill the rest, returning a valid `Env` without throwing. A `console.warn` fires (suppressed in tests) so a production-runtime leak of the flag is loud rather than silent.
Tests:
- `__tests__/unit/lib/env-skip-validation.test.ts` (5 cases): throws when flag unset + `DATABASE_URL` missing (regression guard); does NOT throw when flag set + `DATABASE_URL` missing (the fix); placeholder applied when none provided; real `DATABASE_URL` preserved when provided alongside the flag; normal validation flow unchanged when flag unset and `DATABASE_URL` present.
- `npm run lint` ✅
- `npm run typecheck` ✅
- `npx prisma validate` ✅
- `npm run test`: 1389 pass (+5 new), 0 regressions ✅
- `SKIP_ENV_VALIDATION=true npm run build` ✅ (the fix target; five-check gate fully green for the first time)
- Production runtime semantics unchanged: when `SKIP_ENV_VALIDATION` is unset, the validator throws on missing required vars exactly as before. Real Vercel / Docker production builds where env vars are properly populated are unaffected.
- The 2026-05-08 memory entry that said "five-check gate" had a footnote about the build step needing the SKIP flag. That footnote is now obsolete; the gate runs cleanly with the flag set and real production builds (Vercel) don't set the flag.
- If `next build` ever starts failing again with "Environment variable validation failed", the diagnosis is not to add another escape hatch; it's to find the new env var that's been added to `envSchema` with `.min(1)` or a similar required constraint and either give it a `.default(…)` or extend the placeholder source object in `validateEnv()`.
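The skip-flag behaviour can be sketched without Zod as below. This is a simplified stand-in for `lib/env.ts`, not its actual code: the two-field `Env` shape, the `"true"` comparison, the injected `warn` callback, and treating an empty `DATABASE_URL` as unset are all assumptions made for illustration.

```typescript
type RawEnv = Record<string, string | undefined>;
type Env = { DATABASE_URL: string; NODE_ENV: string };

const PLACEHOLDER_DB_URL = "postgresql://skip:skip@localhost:5432/skip";

/** Simplified validator: the real one uses a Zod schema with defaults. */
function validateEnv(raw: RawEnv, warn: (msg: string) => void = () => {}): Env {
  let source: RawEnv = raw;
  if (raw.SKIP_ENV_VALIDATION === "true") {
    // Loud on purpose, so a leaked flag in a production runtime is noticed.
    warn("SKIP_ENV_VALIDATION is set; using placeholder values for missing env vars");
    // Placeholder only fills the gap; a real DATABASE_URL is preserved.
    source = { ...raw, DATABASE_URL: raw.DATABASE_URL || PLACEHOLDER_DB_URL };
  }
  const databaseUrl = source.DATABASE_URL;
  if (!databaseUrl) {
    throw new Error("Environment variable validation failed: DATABASE_URL is required");
  }
  return { DATABASE_URL: databaseUrl, NODE_ENV: source.NODE_ENV ?? "development" };
}
```

Extending "the placeholder source object" in this shape would mean adding further keys to the spread next to `DATABASE_URL`.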
Track A slice 2 of the post-audit go-live plan. Phase 23 shipped the two externally-blocked runbooks (Meta approval, production data load); Phase 24 covers the three internally-actionable readiness gaps that remained:
- UAT report is stale. `docs/UAT_v2_VALIDATION.md` (2026-05-07) validated 25 cases against commit `7cb7efb`. Phases 17–23 have shipped since, adding maps cost control, unified inbox + IMAP, CSV import, WhatsApp simulator, RouteMap DirectionsService, Sentry sink, WhatsApp token boot probe, pre-migrate snapshot, SW cache verification + VersionBanner. A future UAT pass needs to know what's still valid from v2 and what new test cases the intervening phases require.
- No operator-facing DR rehearsal book. `docs/BACKUP.md` and `docs/OPERATIONS.md` documented the restore procedure + the weekly automated restore-verify smoke test, but there was no "press here to practice" walkthrough for operators to rehearse DR scenarios on a dev environment before they need them in anger.
- No one-page operator onboarding. A new operator handed EquiSmile had to read 12+ docs to know what to do day 1 / week 1 / month 1. The doc-first principle in CLAUDE.md helps, but a single-page checklist that indexes the existing runbooks (without duplicating them) was the missing piece.
None of these are blocked on external input — they could be built in parallel with Kathelijne's Meta approval timer running from Phase 23.
A — docs/UAT_v3_REFRESH.md (~380 lines)
- Delta-from-v2 table for each shipped phase (17–23) — which v2 cases need re-testing, which defects are now closed.
- Resolution status update for v2's three defects: D-2 (zero invoices on prod; Phase-0 dep, status check needed), D-3 (missing recall workspace; resolved by Phase E `/recalls` shipped 2026-05-08), D-4 ("login broken"; likely DEMO_MODE env, status check needed).
- 39 refreshed test cases across 9 sections (25 v2 baseline + 14 new): Section G Maps cost (3), Section H Inbox/IMAP (2), Section I Admin tools (3), Section J Observability/PWA (5), plus one new UAT-PLN-04 for Phase 18 drag-reorder persistence.
- Execution checklist for a future live UAT pass (this doc is the plan, not the execution — the actual validation needs a live deploy URL).
B — docs/DR_DRILL.md (~330 lines)
- Three rehearsal scenarios with full step-by-step:
  - Drill A: "Bad migration deployed an hour ago" (uses Phase 22 pre-migrate snapshot). RTO 30 min, RPO 0 if schema rollback chosen.
  - Drill B: "Disk lost overnight" (uses Phase 16 nightly dump + off-box copy). RTO 2 h, RPO ≤ 24 h.
  - Drill C: "Weekly automated restore-verify failed" (uses Phase 16 `backup-restore-verify.sh`). The meta-recovery drill: ensures the recovery path itself still works.
- Each drill: scenario narrative, recovery targets, step-by-step rehearsal procedure, success criteria, common-failure table mapping rehearsal gotchas to production incident causes.
- Cross-references `docs/BACKUP.md` § 4 + § 7 and `docs/OPERATIONS.md` § 4 rather than duplicating the restore reference manual.
- Quarterly cadence recommendation + drill-run ticket template.
C — docs/OPERATOR_QUICKSTART.md (~140 lines)
- Day 1 checklist (8 steps): get the stack up, verify probes, sign in.
- Week 1 checklist (9 steps): load real data, start Meta approval timer, walk the simulator with Kathelijne.
- Month 1 checklist (10 steps): Meta cutover, first DR drill, spend baseline establishment.
- Stop conditions per phase — explicit "do not progress if X" guards.
- Standing-state reference table linking each operational topic to its canonical doc.
- Emergency-contacts sequence (5 scenarios → 5 doc references).
All three docs cross-reference the existing runbooks (SETUP, VERCEL, OPERATIONS, BACKUP, IMPORT_GUIDE, MAPS_COST_CONTROL, WHATSAPP_PRODUCTION_APPROVAL, PRODUCTION_DATA_LOAD, OUTLOOK_INBOUND, HANDOVER, SCOPE_CLARIFICATIONS) rather than duplicating them.
- `npm run lint` ✅
- `npm run typecheck` ✅
- `npx prisma validate` ✅
- `npm run test`: 0 regressions ✅
- `npm run build`: pre-existing failure under SKIP_ENV_VALIDATION (not regression)
- All cross-referenced file paths and section numbers verified against current main.
- No code changes; Phase 24 is doc-shaped by design (the underlying infrastructure was already in place from Phases 16–22).
Track A of the post-audit go-live plan splits into two slices. Phase 23 is the first slice — the two operator runbooks that front-load the externally-blocked work so Richard / Kathelijne can act on them in parallel while Phase 24 (UAT refresh + DR drill + operator guide) follows.
Two concrete gaps existed:
- No documented Meta approval pathway. `docs/OPERATIONS.md` § 1 covered token rotation post-approval, but there was no operator-facing runbook for the externally-blocked work: business verification, display name approval, template submission per locale, system-user token mint, webhook + verify-token install, cutover. The Meta review timer is the longest external lead time in the project (1–2 weeks typical); not having a runbook meant guess-and-check.
- No production data load runbook. `docs/IMPORT_GUIDE.md` covered CSV import mechanics but not the upstream prep: source-data inventory, dedup decisions, field-mapping calls specific to the Swiss practice context, pre-load data-quality checks, post-load verification queries, rollback paths. Kathelijne couldn't start prepping her CSVs without that guidance.
A — docs/WHATSAPP_PRODUCTION_APPROVAL.md (10 sections)
- Timeline expectation (2–3 weeks end-to-end, critical-path items identified).
- Prerequisites: dedicated phone number (with the consumer-account gotcha called out), Swiss business verification documents (Handelsregisterauszug, VAT/UID, signatory), Meta Business account.
- Business verification step-by-step with common rejection causes.
- WhatsApp Business Account + phone number setup + display name review.
- Template approval per template per locale: lists all nine templates from `lib/demo/template-registry.ts` × EN/FR = 18 submissions, with submission procedure + common rejection table.
- System-user permanent token mint (cross-references `docs/OPERATIONS.md` § 1.2 rather than duplicating).
- Webhook + verify-token install in the Meta App Dashboard.
- Phased cutover: sandbox-with-test-number → production, using the Phase 20 simulator's "Send to me (real)" path as the verification step before full production.
- Rollback plan: flip `DEMO_MODE=true` and restart.
- Ongoing-operations notes (token rotation, template version bumps, conversation pricing, Phase 22 boot probe).
- Failure-mode quick reference table.
B — docs/PRODUCTION_DATA_LOAD.md (9 sections)
- Order-matters reminder (customers → yards → horses).
- Source-data inventory (VetUp export / Outlook / appointment diary / WhatsApp history / handwritten notes).
- Practice-specific field-mapping decisions for each profile (customers / yards / horses) that the generic `IMPORT_GUIDE.md` doesn't cover: couple-vs-single legal-entity question, E.164 Swiss numbers, francophone-vs-anglophone preferred language, when to leave Lat/Lng blank vs populated, owner-vs-yard-manager distinction for horses.
- Data-quality pre-checks (one row per legal customer, E.164 phones, no clinical data in `Notes`).
- Load procedure with manual pre-migrate snapshot bracket, customer-ID-lookup loop, batch-geocoding post-load.
- Post-load verification SQL query (single-statement row-count rollup with `deletedAt` filtering).
- Rollback paths at three time horizons (minutes → re-import with update; hours → restore from the manual snapshot; later → nightly backup window via `docs/BACKUP.md` § 4).
- Common-gotchas table (multi-owner horses, yards-with-no-street-address, postcode typos surfaced via geocoding `partial_match`).
Both docs cross-reference existing operations docs (OPERATIONS, IMPORT_GUIDE, BACKUP, MAPS_COST_CONTROL) rather than duplicating their content.
- `npm run lint` ✅
- `npm run typecheck` ✅
- `npx prisma validate` ✅
- `npm run test`: 0 regressions ✅
- `npm run build` ✅
- Both runbooks cite real file paths, env vars, and Meta-side procedures, verified against `lib/demo/template-registry.ts`, `docs/OPERATIONS.md`, and `docs/IMPORT_GUIDE.md`.
- No code changes; Phase 23 is doc-shaped by design (the underlying infrastructure was already in place from Phases 17, 20, 22).
Phase 22 — Audit tail: WhatsApp token probe + pre-migrate snapshot + SW cache verification (2026-05-16)
Closes the MEDIUM/LOW residue from the 2026-04-18 production-readiness audit. With Phase 21 having already shipped the CRITICAL/HIGH items (Sentry option + Prisma pool warning), three concrete operational gaps remained:
- MED-05: A revoked `WHATSAPP_ACCESS_TOKEN` was only discovered when the first outbound confirmation failed, often hours after the revocation. No boot-time signal existed.
- LOW-01: The nightly `pg_dump` runs at 02:30 UTC. A destructive migration deployed at 14:00 left up to a 23-hour data-loss window if the schema corruption was not caught immediately.
- LOW-03: Serwist's hashed-asset cache invalidation works correctly on next-navigation, but a tab that was open before the deploy (Kathelijne's inbox sitting open all day) silently keeps the old HTML/JS until the operator manually reloads.
A — MED-05 WhatsApp token boot probe
- New `lib/services/whatsapp-token-probe.service.ts`. `probe()` makes a single `GET https://graph.facebook.com/v21.0/<phone_number_id>` with `Authorization: Bearer <token>` and a 5-second timeout.
  - HTTP 200 → log info, no further action.
  - HTTP 401 → write `AuditLog{action:'WHATSAPP_TOKEN_INVALID', entityType:'config', entityId:'whatsapp-access-token'}` and send a once-per-UTC-day alert email via `emailService.sendBrandedEmail` to `MAPS_ALERT_EMAIL`.
  - Any other status / network error → log warn, no audit, no alert (transient; never false-alarm).
- Hooked into `instrumentation.ts` as a fire-and-forget call after the error sinks register. Skipped entirely in demo mode and when credentials are absent.
- In-process dedup mirrors the Phase 17 `maybeFireSoftCapAlert` pattern (`Set<string>` keyed by UTC date; re-armed on restart).
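The three-way outcome and the once-per-UTC-day dedup can be sketched as two small functions. This is an illustration under assumptions: `classifyProbeStatus` and `shouldAlert` are hypothetical names, and a `null` status stands in for a network error or timeout.

```typescript
type ProbeOutcome = "ok" | "invalid" | "transient";

/** Maps the probe's HTTP status (null = network error/timeout) to an outcome. */
function classifyProbeStatus(status: number | null): ProbeOutcome {
  if (status === 200) return "ok";
  if (status === 401) return "invalid";
  // Everything else is treated as transient so a flaky network never raises a false alarm.
  return "transient";
}

// In-process memory of the UTC days already alerted on; emptied by a restart.
const alertedDays = new Set<string>();

/** True at most once per UTC calendar day for a given Set. */
function shouldAlert(now: Date, seen: Set<string> = alertedDays): boolean {
  const utcDay = now.toISOString().slice(0, 10); // e.g. "2026-05-16"
  if (seen.has(utcDay)) return false;
  seen.add(utcDay);
  return true;
}
```

Only the `"invalid"` outcome would go on to call `shouldAlert`; the audit row and email themselves are outside this sketch.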
B — LOW-01 pre-migrate snapshot automation
- New `docker/pre-migrate-snapshot.sh`: runs `pg_dump` once before the `migrator` service and writes a labelled `pre-migrate-<UTC-timestamp>.sql.gz` into the existing `backups_data` volume. Skips on first-ever boot (empty schema).
- New `pre-migrate-snapshot` compose service. Same safety guards as `docker/backup-entrypoint.sh` (libpq `.pgpass`, narrow env-var whitelists, no password literals in shell commands).
- `migrator` now `depends_on: pre-migrate-snapshot: service_completed_successfully` so migrations are blocked until the snapshot lands.
- Retention is governed by the nightly backup's existing `BACKUP_RETENTION_DAYS` sweep; there is no separate knob.
- Documented in `docs/BACKUP.md` § 7.
C — LOW-03 service-worker cache verification
- Verified Serwist's `precacheEntries: self.__SW_MANIFEST` + `skipWaiting: true` + `clientsClaim: true` strategy is invalidation-safe for navigation-triggered loads. No code change required for the canonical case.
- Shipped a defensive open-tab safety net regardless:
  - `scripts/write-version.ts` writes `public/version.json = { sha, builtAt }` at `prebuild` time (chained after `check-env`).
  - Checked-in placeholder `public/version.json` with `sha:'dev'` so the file always exists in dev / shallow-clone CI builds.
  - New client `components/system/VersionBanner.tsx` polls `/version.json` every 5 minutes (cache-busted), captures the bootstrap SHA on first poll, and surfaces a non-modal `<div role="status" aria-live="polite">` banner when the SHA changes. Skipped when bootstrap SHA is `'dev'`.
  - Mounted in `app/[locale]/layout.tsx` next to `OfflineBanner`.
- New i18n keys under `version.*` in EN + FR.
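The banner's decision logic (capture the bootstrap SHA on the first poll, then compare) can be sketched independently of React and `fetch`. `createVersionWatcher` is a hypothetical name, and this is not the code of `VersionBanner.tsx`; the polling interval and cache-busting are left out.

```typescript
type VersionInfo = { sha: string; builtAt?: string };

/** Returns a poll handler that reports whether the refresh banner should show. */
function createVersionWatcher(): (info: VersionInfo) => boolean {
  let bootstrapSha: string | null = null;
  return function onPoll(info: VersionInfo): boolean {
    if (bootstrapSha === null) {
      // First poll: remember what this tab was loaded with.
      bootstrapSha = info.sha;
      return false;
    }
    // Dev / placeholder builds never nag.
    if (bootstrapSha === "dev") return false;
    return info.sha !== bootstrapSha;
  };
}
```

In a component, each successful fetch of `/version.json` would feed its parsed body to the handler and show the banner when it returns `true`.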
| File | Cases |
|---|---|
| `__tests__/unit/services/whatsapp-token-probe.service.test.ts` | 7 |
| `__tests__/unit/components/VersionBanner.test.tsx` | 4 |
- `npm run lint` ✅
- `npm run typecheck` ✅
- `npx prisma validate` ✅
- `npm run test`: 0 regressions ✅
- `npm run build` ✅
- Boot probe fires when `WHATSAPP_ACCESS_TOKEN` + `WHATSAPP_PHONE_NUMBER_ID` are set in non-demo mode; skips silently otherwise.
- Pre-migrate snapshot lands in `/backups` before every `migrator` invocation; absent on first-ever boot.
- Bumping `public/version.json` causes a long-lived tab to surface the refresh banner on the next 5-minute poll.
- All five originally-flagged audit items (HIGH-02, HIGH-05, MED-05, LOW-01, LOW-03) are now ✅ in `docs/PRODUCTION_READINESS_AUDIT_RESPONSE.md`.
Closes the two remaining CRIT/HIGH items from the 2026-04-18 production-readiness audit (HIGH-02 + HIGH-05) that weren't already covered by Phases 14–20. See docs/PRODUCTION_READINESS_AUDIT_RESPONSE.md for the full triage; everything else in the audit is shipped.
- HIGH-02 (Sentry option). New `lib/observability/sentry-error-sink.ts` with a dynamic-import based factory: when `SENTRY_DSN` is set AND `@sentry/nextjs` is installed, registers a second error sink alongside the existing webhook sink (both fire in parallel). When the SDK isn't installed, logs a one-time warning to stderr and falls through. `@sentry/nextjs` stays an OPTIONAL operator install, so there is no new hard dependency.
- HIGH-05 (Pool-param boot warning). `lib/utils/env-check.ts` now warns when `DATABASE_URL` lacks `?connection_limit=10&pool_timeout=10` query params in non-demo mode. `/api/status` exposes `probes.database.poolConfigured` + `poolMissing[]` so the operator can see the gap on the observability page. The URL is never silently mutated; the operator decides whether to add the params.
- Docs. `.env.example` documents both new vars; `docs/OPERATIONS.md` §6 (new) explains the Sentry trade-off vs. the existing webhook sink.
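The pool-param check behind `poolConfigured` / `poolMissing[]` can be sketched as a pure function over the connection string. `missingPoolParams` is a hypothetical name, and the sketch only checks that the two params are present, not their values; the real `lib/utils/env-check.ts` may differ.

```typescript
const REQUIRED_POOL_PARAMS = ["connection_limit", "pool_timeout"];

/** Lists which pool-tuning query params are absent from a DATABASE_URL. */
function missingPoolParams(databaseUrl: string): string[] {
  const queryStart = databaseUrl.indexOf("?");
  const query = queryStart === -1 ? "" : databaseUrl.slice(queryStart + 1);
  const present = new Set<string>();
  for (const pair of query.split("&")) {
    const eq = pair.indexOf("=");
    const key = eq === -1 ? pair : pair.slice(0, eq);
    if (key.length > 0) present.add(key);
  }
  // The URL itself is never modified; the caller only reports the gap.
  return REQUIRED_POOL_PARAMS.filter((name) => !present.has(name));
}
```

A status probe could then derive `poolConfigured` as `missingPoolParams(url).length === 0` and expose the returned array as `poolMissing`.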
| File | Action |
|---|---|
| `lib/observability/sentry-error-sink.ts` | New |
| `instrumentation.ts` | Register both sinks in parallel |
| `lib/utils/env-check.ts` | Pool-param warning |
| `app/api/status/route.ts` | Surface `poolConfigured` + `poolMissing[]` |
| `.env.example` | Document `SENTRY_DSN` and the pool-tuning recipe |
| `__tests__/unit/observability/sentry-error-sink.test.ts` | New |
| `__tests__/unit/utils/env-check.test.ts` | +5 cases for pool-tuning warnings |
- `npm run lint` ✅
- `npm run typecheck` ✅
- `npm run test`: 1373 / 1373 pass, 0 regressions ✅
- `npm run build` ✅
- Boot warning fires when `DATABASE_URL` lacks pool params (verified via the new env-check tests).
- Sentry sink falls back gracefully when `@sentry/nextjs` is not installed (verified via the new sink test).
Phase 20 — Template UX + customer-DB import + WhatsApp simulator + road-following routes (2026-05-13)
User feedback after testing the live Vercel deployment surfaced four concrete asks bundled into a single overnight build:
- Templates editor too "raw": positional `{{1}}` / `{{2}}` placeholders confused non-technical operators.
- No customer-database upload path. Export existed; import didn't. Practice needed bulk-load for Customers / Yards / Horses.
- No WhatsApp simulator. Operators couldn't preview a template against a real customer without actually sending.
- Map polyline crossed Lake Geneva: straight geodesic lines between yards on opposite shores rendered as routes across water.
A — Template editor UX
- `components/admin/TemplatesAdmin.tsx` rewritten with click-to-insert placeholder pills, debounced auto-save (no Save button), live validation badges (ok / missing / unknown) and a "Preview as customer" panel that renders against real customer/appointment data.
- `lib/utils/template-placeholders.ts`: bidirectional `{{N}}` ↔ `[name]` serialiser with round-trip-locked unit tests.
- `lib/services/template-render.service.ts`: server-side renderer shared with the simulator; resolves customer/appointment/horse fields against the live DB.
- `app/api/admin/templates/preview/route.ts`: POST renders a draft body against any customer.
- New `DELETE /api/admin/templates/[key]` for the Reset-to-default button + `messageTemplateService.deleteOverride()`.
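The bidirectional `{{N}}` ↔ `[name]` mapping can be sketched with two regex replacements over an ordered name list. The function names `toNamed` / `toPositional` and the example placeholder names are hypothetical; the real serialiser in `lib/utils/template-placeholders.ts` may handle more edge cases.

```typescript
/** {{1}} → [customerName], using a 1-based index into `names`. Unknown indexes are left as-is. */
function toNamed(body: string, names: string[]): string {
  return body.replace(/\{\{(\d+)\}\}/g, (match: string, n: string) => {
    const name = names[Number(n) - 1];
    return name === undefined ? match : `[${name}]`;
  });
}

/** [customerName] → {{1}}. Bracketed words that are not known placeholders are left as-is. */
function toPositional(body: string, names: string[]): string {
  return body.replace(/\[([A-Za-z_][A-Za-z0-9_]*)\]/g, (match: string, name: string) => {
    const index = names.indexOf(name);
    return index === -1 ? match : `{{${index + 1}}}`;
  });
}
```

Leaving unknown tokens untouched is what lets the editor flag them as "unknown" rather than silently dropping them, and it keeps the round trip lossless for valid bodies.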
B — Customer / yard / horse CSV import
- `lib/services/csv-parse.service.ts`: RFC 4180 decoder.
- `lib/services/csv-import.service.ts`: three profiles (customers / yards / horses) with validation, conflict detection, dry-run + atomic-transaction commit, audit-logged via `IMPORT_RUN`.
- `app/api/admin/import/{preview,commit}/route.ts`: multipart upload endpoints, ADMIN-only, file SHA-256 recorded (no on-disk persist).
- `app/[locale]/admin/import/page.tsx` + `components/admin/ImportRunner.tsx`: drag-drop UI with profile + conflict-policy selectors, dry-run preview table, downloadable CSV templates per profile.
- New runbook `docs/IMPORT_GUIDE.md`.
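For readers unfamiliar with what an RFC 4180 decoder has to handle (quoted fields, doubled quotes, commas and line breaks inside quotes), here is a minimal sketch. `parseCsv` is a hypothetical name and this is not the code of `lib/services/csv-parse.service.ts`; it omits BOM stripping and header handling.

```typescript
/** Decodes CSV text into rows of string fields, following RFC 4180 quoting rules. */
function parseCsv(text: string): string[][] {
  const rows: string[][] = [];
  let row: string[] = [];
  let field = "";
  let inQuotes = false;
  let i = 0;
  while (i < text.length) {
    const ch = text.charAt(i);
    if (inQuotes) {
      if (ch === '"') {
        if (text.charAt(i + 1) === '"') {
          field += '"'; // doubled quote is an escaped literal quote
          i += 2;
          continue;
        }
        inQuotes = false; // closing quote
      } else {
        field += ch; // commas and newlines are literal inside quotes
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ",") {
      row.push(field);
      field = "";
    } else if (ch === "\n" || ch === "\r") {
      if (ch === "\r" && text.charAt(i + 1) === "\n") i++; // treat CRLF as one break
      row.push(field);
      field = "";
      rows.push(row);
      row = [];
    } else {
      field += ch;
    }
    i++;
  }
  // Flush a final record that has no trailing line break.
  if (field.length > 0 || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```

A character-level state machine is used here instead of `split(",")` because a customer name such as `"Dupont, Marie"` must stay a single field.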
C — WhatsApp Business simulator
- `app/[locale]/admin/simulator/page.tsx` + `components/admin/TemplateSimulator.tsx`.
- `app/api/admin/simulator/send/route.ts` has two modes: `simulate` (renders + audits, never touches Meta) and `real` (rate-limited 3/hour per admin, gated on `WHATSAPP_TEST_NUMBER` env var).
- New `WHATSAPP_TEST_NUMBER` env var documented in `.env.example`.
- Audit events: `TEMPLATE_SIMULATED`, `TEMPLATE_TEST_SENT`.
D — Real road-following routes on the map
- `components/maps/RouteMap.tsx`: replaces the geodesic-true `Polyline` with a `RouteDirections` component that calls Google's client-side `DirectionsService` per leg. SessionStorage cache keyed by `lat,lng→lat,lng`. Falls back to a fainter geodesic line on per-leg failure.
- New `NEXT_PUBLIC_MAP_ROUTING_MODE` env var (`directions` default, `straight` for demo deploys with synthetic coordinates).
- Note in `docs/MAPS_COST_CONTROL.md`: client-side DirectionsService has zero impact on the Phase 17 server-side spend cap.
Cross-cutting
- New i18n keys under `admin.templates.*`, `admin.import.*`, `admin.simulator.*`, `nav.import`, `nav.simulator` in EN + FR.
- Sidebar gains two ADMIN-only entries: Import + Simulator.
| File | Cases |
|---|---|
| `__tests__/unit/utils/template-placeholders.test.ts` | 11 |
| `__tests__/unit/services/csv-parse.test.ts` | 10 |
| `__tests__/unit/services/csv-import.test.ts` | 9 |
| `__tests__/unit/api/admin-simulator-send.test.ts` | 4 |
| (existing) `__tests__/unit/components/RouteMap.test.tsx` | updated polyline assertion to match the new `RouteDirections` component |
- `npm run lint` ✅
- `npm run typecheck` ✅
- `npm run test`: 1367 / 1367 ✅
- `npm run build` ✅
- New routes registered: `/[locale]/admin/import`, `/[locale]/admin/simulator`, `/api/admin/import/preview`, `/api/admin/import/commit`, `/api/admin/simulator/send`, `/api/admin/templates/preview`
Three deferred items from the 2026-05-13 gap analysis were doc-shaped (not code-shaped). Bundling them into a single doc-only slice closes the analysis without spinning up three near-empty PRs:
- Outlook inbound: the n8n IMAP workflow from Phase 18 is provider-agnostic; what was missing was operator documentation for pointing it at Outlook / Microsoft 365.
- Auto AM/PM slot suggestion: explicitly excluded from MVP per contract § 3.3. Path A from the slice-planning conversation: document the exclusion in writing rather than build it. Bundled with the broader answer to Patrick's six scope questions.
- `docs/HANDOVER.md` (H-06): source-code transfer runbook for moving the repo from the developer-owned `RJK134` account to a practice-owned account.
- `docs/OUTLOOK_INBOUND.md`: full setup runbook for IMAP + app password against Outlook / 365 using the existing Phase 18 workflow. Covers troubleshooting + an explicit "running Gmail AND Outlook simultaneously" pattern. OAuth2 / Microsoft Graph path documented as a future option, not built.
- `docs/SCOPE_CLARIFICATIONS.md`: point-by-point answer to Patrick's six pointed questions about scheduling intelligence, with a consolidated "Out-of-scope register" table. The MVP is positioned as an "intelligent workflow automation and scheduling assistant", not an autonomous scheduler. Auto AM/PM slot suggestion is documented as deliberately out-of-scope (Q3) with a sketched path to "yes" for a future phase.
- `docs/HANDOVER.md`: full source-code transfer runbook covering pre-transfer secret inventory (~40 env vars), external integration inventory (Meta, Vercel, n8n, Anthropic, Google), the transfer itself, post-transfer verification checklist, and a rollback plan (GitHub transfers are reversible within 48h).
- All three new docs land in `docs/`
- BUILD_PLAN.md updated with this entry ✅
- KNOWN_ISSUES.md updated with Phase 19 section ✅
- No code changes; no migrations; lint / typecheck / build unchanged
- The "Out of scope" register in `SCOPE_CLARIFICATIONS.md` becomes the canonical reference for "what does EquiSmile MVP do?"
Three open items from the 2026-05-13 gap analysis against Patrick's consultant feedback and the April-12 build update doc:
- Unified inbox: the build update promised one screen for WhatsApp + email; in practice only the triage queue existed.
- n8n Gmail intake: webhook handler complete, but `n8n/02-inbound-email.json` was noOp stubs. No mail actually flowed.
- Route-planner reorder: Patrick's "vet always confirms the final order" promise was only partially met. The vet could approve/reject but not resequence proposed stops, and there was no mobile-friendly affordance.
- n8n workflow (`n8n/02-inbound-email.json`) replaced with real `emailReadImap` → Code (parse to webhook contract) → HTTP Request → IF (success/failure logger) chain. Shipped inactive; operator activates after configuring the IMAP credential in n8n UI.
- Unified inbox at `/[locale]/inbox`:
  - Server page + new `components/inbox/InboxView.tsx` client component
  - Thread grouping by customer (anonymous senders grouped per `sourceFrom` so unknown numbers/emails aren't lumped together)
  - Channel filter (ALL / WhatsApp / Email), debounced search
  - Sidebar entry added; MobileNav promotes Inbox over triage queue
  - i18n keys under `inbox.*` + `nav.inbox`
- Journey-planner reorder:
  - `PATCH /api/route-planning/proposals/[id]/reorder-stops`: transactional resequence; rejects on APPROVED+ status; validates that every existing stop appears exactly once
  - `lib/repositories/route-run.repository.ts#reorderStops`: atomic transaction; nulls stale per-stop travel figures
  - `components/route-runs/RouteRunStopsList.tsx`: HTML5 drag-and-drop + up/down arrow buttons (accessible, touch-friendly, ARIA-labelled); presents identically inline and inside the `<BottomSheet>` drawer
  - Mobile-focused "Reorder stops" trigger button opens the bottom sheet for a larger touch target experience
  - i18n keys under `routeRuns.reorder.*`
- Tests (4 new files, 28 cases):
  - `__tests__/unit/api/route-planning-reorder-stops.test.ts`: 9 cases covering DRAFT/PROPOSED reorder, APPROVED/BOOKED lock, validation
  - `__tests__/unit/components/RouteRunStopsList.test.tsx`: 7 cases covering render, reorder controls, optimistic UI
  - `__tests__/unit/components/InboxView.test.tsx`: 5 cases covering thread grouping, channel filter wiring, empty + error states
  - `__tests__/unit/n8n/inbound-email-workflow.test.ts`: 7 cases locking the workflow against noOp regressions and verifying the Bearer-auth contract with the EquiSmile webhook

- `npm run lint` passes ✅
- `npm run typecheck` passes ✅
- `npm run test`: 1332 tests pass (139 files), 0 regressions ✅
- `npm run build` passes ✅
- New routes registered: `/[locale]/inbox`, `/api/route-planning/proposals/[id]/reorder-stops` ✅
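
The reorder endpoint is described as validating that every existing stop appears exactly once. A minimal sketch of that check follows, assuming stops are identified by string ids; `isValidReorder` is a hypothetical helper name, not the handler's actual code.

```typescript
// Hypothetical helper: true only when `requested` is a permutation of `existing`.
function isValidReorder(existing: readonly string[], requested: readonly string[]): boolean {
  if (existing.length !== requested.length) return false;
  const remaining = new Set(existing);
  // Guard against duplicate ids in the stored list itself.
  if (remaining.size !== existing.length) return false;
  for (const id of requested) {
    // delete() returns false for unknown ids and for ids already consumed (duplicates).
    if (!remaining.delete(id)) return false;
  }
  return remaining.size === 0;
}
```

A request that drops, duplicates, or invents a stop id fails the check, which lets the handler reject it before opening the resequencing transaction.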
EQUISMILE_LIVE_MAPS=true was a live-billing footgun: no daily spend
cap, no per-call telemetry, no operator dashboard. Enabling live Maps
on a runaway batch (or against a malicious test of the geocode
endpoint) could rack up unbounded cost before anyone noticed. The
existing safety net was a single environment variable.
- New `MapsApiCall` Prisma model + `MapsOperation` enum (additive migration `20260513000000_phase17_maps_api_call`)
- `lib/services/maps-cost-tracker.service.ts`: `checkBudget` / `recordCall` / `getDailySpendUsd` / `last7DaysSpend` / `recent`
- `MapsBudgetExceededError` thrown before the network call when the daily hard cap is breached
- Wrappers around three live call sites: `googleMapsClient.geocode`, `geocodingService.geocodeAddress`, `routeOptimizerService.optimizeRoute`. Demo-mode is unwrapped.
- Budget-driven gate in `batchGeocodeYards()` replaces the fixed 100ms inter-request delay (closes KI-001)
- `GET /api/admin/maps-usage` + `/[locale]/admin/maps-usage` page: today's spend, 7-day rollup, recent calls, soft/hard cap banners
- 5 new env vars in `lib/env.ts`: `MAPS_DAILY_SPEND_CAP_USD`, `MAPS_SOFT_CAP_PCT`, `MAPS_ALERT_EMAIL`, `MAPS_PRICE_GEOCODE_USD`, `MAPS_PRICE_OPTIMIZE_TOURS_USD`
- Soft-cap alert email via `emailService.sendBrandedEmail`, dedup'd per-UTC-day to prevent flooding
- New i18n keys under `admin.mapsUsage.*` in EN + FR
- 3 new test files (26 cases): unit + integration coverage
- New runbook `docs/MAPS_COST_CONTROL.md`

- `npm run lint` passes ✅
- `npm run typecheck` passes ✅
- `npx prisma validate` passes ✅
- `npm run test`: 1304 tests pass (135 files), 0 regressions ✅
- `npm run build` passes ✅
- Migration is additive only (no destructive ops) ✅
- KI-001 moved to resolved in `docs/KNOWN_ISSUES.md` ✅
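
As a rough illustration of a pre-call budget gate with a hard cap and a soft-cap warning, the sketch below assumes the soft cap is a percentage (0-100) of the daily hard cap and that the gate compares projected spend (today's spend plus the next call's price). `evaluateBudget`, `BudgetConfig`, and the error's fields are hypothetical; the shipped `checkBudget` may be shaped differently.

```typescript
// Stand-in for the error class named in the deliverables; its fields are assumed.
class MapsBudgetExceededError extends Error {
  spentUsd: number;
  capUsd: number;
  constructor(spentUsd: number, capUsd: number) {
    super(`Maps daily cap reached: ${spentUsd.toFixed(2)} of ${capUsd.toFixed(2)} USD`);
    this.name = "MapsBudgetExceededError";
    this.spentUsd = spentUsd;
    this.capUsd = capUsd;
  }
}

interface BudgetConfig {
  dailyCapUsd: number; // e.g. read from MAPS_DAILY_SPEND_CAP_USD
  softCapPct: number; // e.g. read from MAPS_SOFT_CAP_PCT, assumed to be 0-100
}

type BudgetStatus = "ok" | "soft-cap";

// Throws before any network call would be made when the hard cap would be exceeded.
function evaluateBudget(spentTodayUsd: number, nextCallUsd: number, cfg: BudgetConfig): BudgetStatus {
  const projected = spentTodayUsd + nextCallUsd;
  if (projected > cfg.dailyCapUsd) {
    throw new MapsBudgetExceededError(spentTodayUsd, cfg.dailyCapUsd);
  }
  const softThreshold = cfg.dailyCapUsd * (cfg.softCapPct / 100);
  return projected >= softThreshold ? "soft-cap" : "ok";
}
```

Returning a status for the soft cap (instead of throwing) keeps the call flowing while giving the caller a hook for the once-per-day alert email.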
- Tooling and configuration (package.json, tsconfig, Tailwind, ESLint, Prettier, Docker Compose)
- Documentation skeleton
- n8n workflow JSON skeletons (01–06)
- Prisma schema with complete data model
- Next.js App Router shell with bilingual i18n (EN/FR)
- Shared libraries and test scaffolding
- CLAUDE.md and .claude/ agent configuration
- GitHub Actions CI workflow
- `npm run lint` passes ✅
- `npm run typecheck` passes ✅
- `npm run test` passes ✅
- `npx prisma validate` passes ✅
- `npm run build` passes ✅
- PWA shell with Serwist
- Docker Compose verified (PostgreSQL + n8n healthy)
- Prisma migration init
- Idempotent seed data
- Environment variable validation
- Health check API endpoint
- CI pipeline passing
- Customer/yard/horse CRUD with bilingual UI
- Manual enquiry creation
- Triage classification interface
- Planning pool view with filters
- Repository/service layer pattern
- Meta WhatsApp Cloud API webhook handler
- Email/IMAP intake endpoint
- Message logging
- n8n-to-app REST contract
- Webhook signature verification
- Triage rules engine (EN/FR)
- Missing-information auto-detection
- Manual override and escalation with audit trail
- Triage task queue
- Status machine for valid transitions
- Google Geocoding integration
- Geographic clustering by postcode area
- Route scoring algorithm
- Google Route Optimisation API integration
- Route proposal generation, review, approval
- Route approval to appointment conversion
- WhatsApp/email confirmation dispatch (bilingual)
- 24h/2h reminder scheduling
- Cancel/reschedule handling
- Visit outcome recording with follow-up
- Retry logic with exponential backoff and jitter
- Structured JSON logging with data masking
- Error recovery UX (error boundaries, toast, offline banner)
- WCAG 2.1 AA accessibility
- PWA offline capabilities with request queue
- Performance (skeletons, pagination)
- Mobile polish (bottom sheet, safe-area insets)
- Pre-flight check script
- Release candidate tag (`rc/v1.0.0`)
- CHANGELOG.md and release notes
- Comprehensive UAT test scripts (TC-001 through TC-008)
- Environment validation script
- Production readiness checklist
- Deployment guide with rollback procedure
- Enhanced seed data for realistic UAT testing
- Multi-stage production Dockerfile
- CI/CD enhancements (Docker build, security audit)
- Final documentation update
Following the release of rc/v1.0.0, a retrospective verification pass was run against every phase's master prompt.
- Plan: PHASE_VERIFICATION_PLAN.md
- Findings: V1_AUDIT_FINDINGS.md
- AMBER items logged: KNOWN_ISSUES.md — 13 active AMBERs, 1 closed in-audit, 1 retracted, 1 resolved by PR #17
Summary: All 10 phases (0–9) verdict GREEN with AMBER log. Zero RED findings. Non-negotiable checks all pass (lint, typecheck, test, prisma validate, build). In-audit fix applied to __tests__/unit/infra/demo-startup.test.ts to guard Windows exec-bit assertions.
State drift: The audit was anchored at fbafbd9. During publication, PRs #13–#17 landed Phase 12 work on main (current HEAD 3e295ba). AMBER-03 (seed counts) was resolved by PR #17's seed split; remaining AMBERs re-verified against the diff and stand.
Outstanding triage decisions for v1.1 include brand-colour reconciliation (AMBER-02), Phase 6 data-model richness (AMBER-08 through AMBER-13), and idempotency store externalisation (AMBER-14). See the findings file for the per-deliverable evidence tables.
- Gate the internal operations UI behind GitHub sign-in using Auth.js v5 with the `@auth/prisma-adapter`.
- Restrict access to an env-driven allow-list (`ALLOWED_GITHUB_LOGINS`), matching either GitHub login or email (case-insensitive).
- Add standard Auth.js Prisma models (`User`, `Account`, `Session`, `VerificationToken`) with a `role` column and `githubLogin` stored on `User` for future RBAC and audit wiring.
- Chain Auth.js middleware with the existing `next-intl` middleware; keep `/api/webhooks/*` (n8n) and `/api/auth/*` public.
- Replace the hard-coded `performedBy = "admin"` default in `app/api/triage-ops/override/route.ts` with the signed-in user's GitHub login/email.

- `auth.ts`, `lib/auth/allowlist.ts` (`lib/auth/session.ts` superseded by `lib/auth/rbac.ts` in PR E)
- `app/api/auth/[...nextauth]/route.ts`
- `app/[locale]/login/page.tsx`, `components/auth/{SignInButton,UserMenu,AuthSessionProvider}.tsx`
- Prisma schema + migration for auth tables
- Updated `middleware.ts`, `lib/utils/env-check.ts`, `.env.example`
- Allow-list + middleware unit tests
- Docs: SETUP (GitHub OAuth App section), ARCHITECTURE (Authentication section), KNOWN_ISSUES (KI-006)

- `npm run lint && npm run typecheck && npm run test` pass.
- `npx prisma validate` passes and a `prisma migrate dev` run creates the four auth tables.
- Unauthenticated visits to any locale route redirect to `/{locale}/login`.
- Allow-listed GitHub account signs in successfully; non-allow-listed account is denied with the `notAuthorised` banner.
- `/api/webhooks/whatsapp` still accepts n8n calls with `N8N_API_KEY` alone (no session).
- Triage override creates audit rows with `performedBy` set to the signed-in user, not `"admin"`.
- Support a 2+ vet practice by introducing a Staff model separate from the Auth User, so domain assignments are decoupled from auth plumbing.
- Track appointment ownership (primary vet + joint assignments) and route-run leadership (lead + assistants) so "rounds with both vets" is explicit, not convention.
- Prisma: `Staff` model, `AppointmentAssignment` join, `RouteRunAssistant` join, `RouteRun.leadStaffId` FK. Additive migration only.
- Repository/service: `staff.repository.ts`, `staff.service.ts` (list/create/update/deactivate + assignToAppointment/assignToRouteRun + appointmentsForCalendar).
- API: `GET|POST /api/staff`, `GET|PATCH|DELETE /api/staff/[id]`, `POST|DELETE /api/staff/assign` (target=appointment|routeRun).
- UI: `/{locale}/staff` management page (list + create modal + toggle active).
- Validations: `lib/validations/staff.schema.ts` (zod).
- i18n: EN + FR strings for Staff page, roles, assignment labels.
- Seed: demo-staff-rachel (lead vet, maroon), demo-staff-second (visiting vet, blue), demo-staff-nurse (green).
- Tests: 10 service unit tests (create/duplicate email/assignment with primary flag/route-run lead+assistant/calendar filter).

- `npm run lint`, `typecheck`, `test`, `prisma validate` all pass.
- `POST /api/staff { name }` creates a vet; duplicate email returns 409.
- `POST /api/staff/assign { target: 'appointment', appointmentId, staffId, primary: true }` unflags other primaries for that appointment.
- `POST /api/staff/assign { target: 'routeRun', routeRunId, staffId, isLead: true }` writes `RouteRun.leadStaffId`.
- Provide a clean CSV export of the EquiSmile dataset that can be ingested by VetUp (or any patient-centric PMS). Column schema is kept in `VETUP_PATIENT_COLUMNS` so it's a one-file change when the client confirms VetUp's actual headers.

- `lib/services/csv.service.ts`: RFC 4180 encoder (CRLF, quote-escaping, null → empty, Date → ISO-8601).
- `lib/services/vetup-export.service.ts`: three profiles, `patient` (horse-centric with denormalised owner + yard), `customers`, `yards`.
- `GET /api/export/vetup?profile=patient|customers|yards`: streams CSV with `Content-Disposition: attachment`.
- Customers page gains three download buttons (VetUp, Customers, Yards).
- 13 unit tests (10 CSV encoder + 3 export service).

- `curl /api/export/vetup?profile=patient` returns a CSV with the VetUp-patient header and one row per horse.
- Fields with commas or double quotes are correctly RFC-4180 quoted/escaped.
- Null fields render as empty (no literal "null" string).
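
The encoder rules listed above (CRLF line endings, quote-escaping, null → empty, Date → ISO-8601) can be sketched in a few lines. This is an illustrative version only; `encodeCsvField`, `encodeCsv`, and `CsvValue` are assumed names, and ending the final record with CRLF is one of the two options RFC 4180 permits.

```typescript
type CsvValue = string | number | boolean | Date | null | undefined;

// Encode one field: null/undefined become empty, Dates become ISO-8601 strings.
function encodeCsvField(value: CsvValue): string {
  if (value === null || value === undefined) return "";
  const text = value instanceof Date ? value.toISOString() : String(value);
  // Quote when the field contains a comma, double quote, CR or LF; double embedded quotes.
  return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Encode a header plus rows, joining records with CRLF.
function encodeCsv(header: readonly string[], rows: ReadonlyArray<readonly CsvValue[]>): string {
  const records: ReadonlyArray<readonly CsvValue[]> = [header, ...rows];
  return records.map((record) => record.map((field) => encodeCsvField(field)).join(",")).join("\r\n") + "\r\n";
}
```

For example, `encodeCsvField('say "hi"')` yields `"say ""hi"""`, and a null cell contributes nothing between its delimiting commas.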
- Per-horse clinical history: PDF/image attachments, dental charts, tooth-level findings, prescriptions. Sets up the data model that the Phase 13 vision pipeline will populate.
- Prisma: `HorseAttachment`, `DentalChart`, `ClinicalFinding`, `Prescription` models + 4 new enums (AttachmentKind / FindingCategory / FindingSeverity / PrescriptionStatus). Additive migration.
- `lib/services/attachment.service.ts`: upload/list/read-bytes/delete; relative path kept in DB so storage backend (FS/S3) is swappable; 25 MB limit; allow-list of image+PDF mimes.
- `lib/services/clinical-record.service.ts`: CRUD for dental charts, findings, prescriptions; trims inputs, validates duration/withdrawal non-negative, mutates `status` + timestamp atoms on transitions.
- API: `GET|POST /api/horses/[id]/attachments`, `GET|DELETE /api/attachments/[id]`, `GET|POST /api/horses/[id]/clinical`, `PATCH /api/prescriptions/[id]`.
- `.env.example` adds `ATTACHMENT_STORAGE_DIR`; `.gitignore` excludes `data/attachments/`.

- `curl -F file=@chart.pdf /api/horses/<id>/attachments` → row inserted, bytes on disk under `$ATTACHMENT_STORAGE_DIR/<horseId>/…`.
- `GET /api/attachments/<id>` streams the original bytes inline.
- `POST /api/horses/<id>/clinical { kind:'prescription', medicineName, dosage }` returns 201 ACTIVE row; `PATCH /api/prescriptions/<id> { status:'CANCELLED', cancelledReason }` sets status + cancelledAt.
- 16 unit tests (8 attachment, 8 clinical-record).
- Analyse uploaded PDF dental charts and clinical images with Claude (Opus 4.7), producing structured findings + prescriptions that land directly in the Phase 12 clinical models. Acts as decision support — the vet reviews everything before acceptance.
- `lib/integrations/anthropic.client.ts`: singleton SDK client; throws if `ANTHROPIC_API_KEY` unset; model override via `EQUISMILE_VISION_MODEL`.
- `lib/services/vision-analysis.service.ts`: builds vision message (document block for PDFs, image block for JPEG/PNG/WebP/GIF), calls Claude with adaptive thinking + `output_config.format: json_schema` using a strict Zod schema (generalNotes, findings[], prescriptions[], confidence). System prompt is cache-control marked. Validates response locally before persisting.
- Post-processing writes one `DentalChart` (linked to the source attachment) with all findings + any explicitly-recorded prescriptions, attributed to the calling staff member.
- API: `POST /api/attachments/[id]/analyse` returns `{ dentalChartId, findingIds[], prescriptionIds[], result }`; returns 503 if `ANTHROPIC_API_KEY` is missing.
- `.env.example`: `ANTHROPIC_API_KEY` + optional `EQUISMILE_VISION_MODEL`.
- 14 unit tests (schema validation, extract/fallback/JSON error paths, service-level attachment lookup, persist=true vs false, PDF-vs-image block selection, staff attribution, cache_control placement).

- `POST /api/attachments/<id>/analyse` with an equine dental PDF: returns 201 with findings[] and prescriptions[] populated; new DentalChart row linked via attachmentId.
- Without `ANTHROPIC_API_KEY`: 503 "Vision analysis unavailable".
- Corrupt/off-topic PDF: model returns `confidence: "low"`, empty findings, explanatory generalNotes; no findings/prescriptions written beyond the chart row.
- System prompt cached: `usage.cache_read_input_tokens > 0` on the second analyse call in a 5-minute window.
- Replace the in-memory `processedKeys: Set<string>` in `lib/utils/retry.ts` with a Postgres-backed store so idempotency markers survive restarts and are shared across instances.

- Prisma: `IdempotencyKey { key @id, scope, createdAt, expiresAt? }` with indexes on `scope` and `expiresAt`. Additive migration.
- `lib/services/idempotency.service.ts`: `hasProcessed(key)`, `markProcessed(key, scope, ttlMs?)` (upsert-based, concurrency-safe), `pruneExpired(now)`.
- `lib/utils/retry.ts`: `hasBeenProcessed` / `markAsProcessed` / `clearProcessedKeys` are now async and delegate to the service. Default TTL 30 days.
- Call sites (`lib/services/whatsapp.service.ts`) updated with `await`.
- `docs/KNOWN_ISSUES.md` AMBER-14 marked resolved.
- 8 new idempotency-service tests; existing `retry.test.ts` idempotency suite converted to async + uses an in-memory mock of the service.

- Restart the app between two sends with the same idempotency key → second call still detects the dupe (was previously lost).
- `POST /api/health` shows the new table in `prisma migrate status`.
- Expired keys are pruned automatically on first `hasProcessed` read (self-healing).
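
To illustrate the TTL and self-healing read semantics only, here is a synchronous in-memory stand-in. The real service is described as async and Postgres-backed; `InMemoryIdempotencyStore`, its injected clock, and the choice to drop just the key being read (instead of pruning every expired row) are assumptions made for this sketch.

```typescript
interface Marker {
  scope: string;
  expiresAt: number | null; // epoch ms; null means the marker never expires
}

const DEFAULT_TTL_MS = 30 * 24 * 60 * 60 * 1000; // 30 days, matching the default noted above

// Illustrative stand-in for the Postgres-backed store.
class InMemoryIdempotencyStore {
  private markers = new Map<string, Marker>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Upsert semantics: marking the same key again refreshes its scope and expiry.
  markProcessed(key: string, scope: string, ttlMs: number | null = DEFAULT_TTL_MS): void {
    const expiresAt = ttlMs === null ? null : this.now() + ttlMs;
    this.markers.set(key, { scope, expiresAt });
  }

  // An expired marker is removed on read and reported as not processed.
  hasProcessed(key: string): boolean {
    const marker = this.markers.get(key);
    if (marker === undefined) return false;
    if (marker.expiresAt !== null && marker.expiresAt <= this.now()) {
      this.markers.delete(key);
      return false;
    }
    return true;
  }
}
```

Injecting the clock keeps expiry behaviour testable without timers, which is also how an in-memory mock of the service could be driven in unit tests.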
- Harden authentication and introduce defence-in-depth HTTP response headers.
- `lib/auth/redirect.ts`: `isSafeCallbackUrl` / `safeCallbackUrl`. Rejects absolute URLs, protocol-relative URLs (`//evil`), percent-encoded variants, `javascript:` / `data:` schemes, path traversal, CR/LF/NUL injection, and oversize values. Wired into `middleware.ts`, `auth.ts` `redirect` callback, and `app/[locale]/login/page.tsx`.
- `lib/auth/allowlist.ts`: upgraded to constant-time comparison via `crypto.timingSafeEqual` (no short-circuit walk; length-gated).
- `auth.ts`: explicit secure cookie config (`__Secure-` / `__Host-` prefixes, `SameSite=Lax`, `HttpOnly`, `Secure` in production), 30-day session with 24-hour rotation, `trustHost` only when `AUTH_URL` is set, `useSecureCookies` in prod, `redirect` callback that enforces same-origin.
- `lib/security/headers.ts` + middleware wiring adds:
  - `Content-Security-Policy` (pragmatic for HTML; strict `default-src 'none'; frame-ancestors 'none'` for API)
  - `Strict-Transport-Security` (production only)
  - `X-Content-Type-Options: nosniff`
  - `X-Frame-Options: DENY`
  - `Referrer-Policy: strict-origin-when-cross-origin`
  - `Permissions-Policy` (disables camera/mic/etc.)
  - `Cross-Origin-Opener-Policy: same-origin`
  - `Cross-Origin-Resource-Policy: same-origin`
- Tests: 12 redirect tests + 8 header tests + 5 new allowlist tests + 2 new middleware tests.

- `npm run lint`, `typecheck`, `test`, `prisma validate` all pass (674 tests across 73 files).
- Open-redirect vectors (`//evil`, `%2F%2Fevil`, `/javascript:...`, `/../admin`, CR/LF injection) are rejected by both the middleware callbackUrl attach step and the Auth.js `redirect` callback.
- Non-allow-listed sign-in attempts are logged without identifiers; production cookies carry `__Secure-` prefix.
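
A simplified sketch of a same-origin callback check covering the vectors listed above. It is not the shipped `lib/auth/redirect.ts`: the 2048-character limit, the single round of percent-decoding, and the exact rule set are assumptions chosen for illustration.

```typescript
const MAX_CALLBACK_LENGTH = 2048; // assumed limit for "oversize values"

function isSafeCallbackUrl(value: string): boolean {
  if (value.length === 0 || value.length > MAX_CALLBACK_LENGTH) return false;
  let decoded: string;
  try {
    decoded = decodeURIComponent(value);
  } catch {
    return false; // malformed percent-encoding
  }
  // Apply every rule to both the raw and the once-decoded form.
  for (const candidate of [value, decoded]) {
    if (!candidate.startsWith("/")) return false; // absolute URLs and bare schemes
    if (candidate.startsWith("//")) return false; // protocol-relative
    if (candidate.includes("\\")) return false; // backslash variants of the above
    if (/[\r\n\0]/.test(candidate)) return false; // CR/LF/NUL injection
    if (candidate.split(/[\/?#]/).includes("..")) return false; // path traversal
    if (/^\/*(javascript|data):/i.test(candidate)) return false; // scheme smuggling
  }
  return true;
}

// Fall back to a known-good path when the requested callback is rejected.
function safeCallbackUrl(value: string | null | undefined, fallback: string): string {
  return typeof value === "string" && isSafeCallbackUrl(value) ? value : fallback;
}
```

Checking the decoded form as well as the raw one is what catches inputs such as `/%2Fevil`, which only become protocol-relative after decoding.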
- Enforce least-privilege on sensitive API routes and record every security-relevant action in an append-only audit log.
- `lib/auth/rbac.ts`: `ROLES` enum (admin | vet | nurse | readonly) + `normaliseRole` + `hasRole` + `requireAuth` + `requireRole` + `withRole` + `AuthzError`. Unknown roles default to `readonly` (deny-by-default).
- Prisma: `SecurityAuditLog` + `SecurityAuditEvent` enum with 17 event types. Additive migration `20260420130000_phase14_security_audit`.
- `lib/services/security-audit.service.ts`: `record(event, actor, ...)` (best-effort, never blocks the primary request), `recent({limit, event})` for admin dashboards; detail is truncated to 500 chars; no secrets.
- Route lockdowns (with audit where appropriate):
  - `GET/POST /api/export/vetup` → ADMIN + `EXPORT_DATASET` audit
  - `POST /api/staff` → ADMIN + `STAFF_CREATED`
  - `PATCH /api/staff/[id]` → ADMIN + `ROLE_CHANGED|STAFF_UPDATED`
  - `DELETE /api/staff/[id]` → ADMIN + `STAFF_DEACTIVATED`
  - `GET /api/staff` + `GET /api/staff/[id]` → READONLY
  - `POST/DELETE /api/staff/assign` → VET
  - `GET /api/attachments/[id]` → NURSE + `ATTACHMENT_DOWNLOADED`
  - `DELETE /api/attachments/[id]` → VET + `ATTACHMENT_DELETED`
  - `POST /api/attachments/[id]/analyse` → VET + `VISION_ANALYSIS_INVOKED`
  - `GET /api/horses/[id]/attachments` → NURSE
  - `POST /api/horses/[id]/attachments` → VET (uploader attribution taken from session, not form)
  - `GET /api/horses/[id]/clinical` → NURSE
  - `POST /api/horses/[id]/clinical` (dentalChart/finding/prescription) → VET + `CLINICAL_RECORD_CREATED`
  - `PATCH /api/prescriptions/[id]` → VET + `PRESCRIPTION_STATUS_CHANGED`
  - `GET /api/status` → ADMIN
- `auth.ts` sign-in denial callback writes `SIGN_IN_DENIED` audit events with a coarse actor label (no denied-user identifiers stored).
- `/api/setup` lint warning cleaned up as a drive-by.
- Tests: 15 RBAC tests + 10 audit-service tests. Net 695 passing across 75 files.

- A `nurse` cannot `POST /api/horses/<id>/clinical` or `DELETE /api/attachments/<id>` (403).
- A `readonly` cannot `POST /api/staff` (403) but can `GET /api/customers`.
- A `vet` cannot `GET /api/export/vetup` (admin-only).
- Every admin export, attachment delete/download, clinical mutation, prescription status change, and vision-analysis invocation lands in `SecurityAuditLog`.
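
The route table and the checks above imply an ordered hierarchy (readonly < nurse < vet < admin) in which a higher role satisfies any lower requirement. A minimal sketch under that assumption follows; the function bodies are illustrative and may not match `lib/auth/rbac.ts`.

```typescript
// Ordered from least to most privileged; the ordering is inferred, not confirmed.
const ROLES = ["readonly", "nurse", "vet", "admin"] as const;
type Role = (typeof ROLES)[number];

// Anything that is not a known role collapses to "readonly" (deny-by-default).
function normaliseRole(raw: unknown): Role {
  const candidate = typeof raw === "string" ? raw.trim().toLowerCase() : "";
  const index = (ROLES as readonly string[]).indexOf(candidate);
  return index >= 0 ? ROLES[index] : "readonly";
}

// True when the actor's role is at or above the required level.
function hasRole(actual: unknown, required: Role): boolean {
  return ROLES.indexOf(normaliseRole(actual)) >= ROLES.indexOf(required);
}
```

With this shape, a `nurse` fails a `vet` requirement, while an unrecognised role string is treated as `readonly` and so can still reach read-only routes but nothing above them.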
- Harden public-path webhook auth, cap abuse-prone routes with a rate limiter, and introduce a log-redaction utility so secrets can't leak via structured logs.
- `lib/utils/signature.ts`: new `constantTimeStringEquals` helper; new `verifyWhatsAppVerifyToken` that uses it so the GET-challenge verify token can't be probed by timing.
- `app/api/webhooks/whatsapp/route.ts`: `GET` swaps `===` for constant-time compare; `POST` rate-limited per client IP (300/min) before parsing body.
- `app/api/webhooks/email/route.ts`: `POST` rate-limited per client IP (200/min) before signature check.
- `lib/utils/rate-limit.ts`: in-memory sliding-window limiter, `rateLimiter({windowMs, max, now, maxKeys})` + `rateLimitedResponse` helper + `clientKeyFromRequest`. Per-key LRU-bounded to 10,000 keys.
- Wired into: `POST /api/attachments/[id]/analyse` (20/hour per user, caps Claude Opus 4.7 spend) and `GET /api/export/vetup` (10/hour per admin, discourages automated exfil).
- `lib/utils/log-redact.ts`: `redact(value)` walks any object, replaces values of sensitive keys (authorization, api_key, cookie, password, signature, etc.) with `[redacted]`; also redacts `Bearer …` and `sk-…` string values regardless of key.
- Tests: 11 rate-limit + 10 log-redact + 6 new signature tests. Net 722 passing across 77 files.

- Spamming `POST /api/webhooks/whatsapp` 301 times in a minute from one IP returns 429 with `Retry-After`.
- `GET /api/export/vetup?profile=patient` 11 times from the same admin returns 429.
- `redact({authorization: 'Bearer sk-xxx'})` returns `{authorization: '[redacted]'}`.
- WhatsApp GET verification with a same-length wrong token no longer short-circuits compared to a matching token (no timing oracle).

- The rate limiter is in-memory per Node instance. Horizontal scaling needs a Redis (or Postgres, same pattern as `IdempotencyKey`) backend.
- The `log-redact` utility is available but not yet automatically wired into every `console.log`; adopt on a per-call basis as call sites are reviewed.
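
For readers unfamiliar with the pattern, a sliding-window limiter with an LRU bound can be sketched as below. The option names mirror the `rateLimiter({windowMs, max, now, maxKeys})` signature quoted above, but the body, the `RateLimitDecision` shape, and the eviction strategy are assumptions, not the contents of `lib/utils/rate-limit.ts`.

```typescript
interface RateLimiterOptions {
  windowMs: number;
  max: number;
  now?: () => number; // injectable clock for tests
  maxKeys?: number; // upper bound on tracked keys
}

interface RateLimitDecision {
  allowed: boolean;
  retryAfterMs: number; // 0 when allowed
}

function rateLimiter(opts: RateLimiterOptions): (key: string) => RateLimitDecision {
  const now = opts.now ?? (() => Date.now());
  const maxKeys = opts.maxKeys ?? 10_000;
  const hits = new Map<string, number[]>();

  return function check(key: string): RateLimitDecision {
    const t = now();
    // Keep only timestamps that are still inside the window.
    const recent = (hits.get(key) ?? []).filter((ts) => ts > t - opts.windowMs);
    // Re-insert the key so Map iteration order tracks recency (oldest first).
    hits.delete(key);
    if (recent.length >= opts.max) {
      hits.set(key, recent);
      const oldestHit = recent[0] ?? t;
      return { allowed: false, retryAfterMs: oldestHit + opts.windowMs - t };
    }
    recent.push(t);
    hits.set(key, recent);
    if (hits.size > maxKeys) {
      // Evict the least recently seen key to bound memory.
      const oldestKey = hits.keys().next().value;
      if (oldestKey !== undefined) hits.delete(oldestKey);
    }
    return { allowed: true, retryAfterMs: 0 };
  };
}
```

The `retryAfterMs` value is the kind of figure a 429 response could translate into a `Retry-After` header; state held this way is per process, which is the horizontal-scaling limitation noted above.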
- Resolve the functional gaps logged during the v1.0.0 retrospective audit. Split across three data-model additions, three audit-service wirings, a dead-letter queue, a visit-requests operator page, and docs reconciliation for the items that were naming/narrative gaps rather than code gaps.
- Prisma additive migration `20260420140000_phase14_amber_gap_closure`:
  - AMBER-06: `Yard` gets nullable `geocodeSource`, `geocodePrecision`, `formattedAddress`.
  - AMBER-10: `ConfirmationDispatch { appointmentId, channel, sentAt, success, externalMessageId?, errorMessage? }`.
  - AMBER-11: `AppointmentResponse { appointmentId, kind, channel, receivedAt, rawText?, enquiryMessageId? }` + `AppointmentResponseKind` enum.
  - AMBER-13: `AppointmentStatusHistory { appointmentId, fromStatus?, toStatus, changedBy, reason?, changedAt }`.
  - AMBER-15: `FailedOperation { scope, operationKey?, payload, lastError, attempts, status, createdAt, updatedAt }` + `FailedOperationStatus` enum.
- `lib/services/appointment-audit.service.ts`: `logConfirmationDispatch`, `logResponse`, `logStatusChange` (skips no-op transitions), plus readers. Best-effort writes.
- `lib/services/dead-letter.service.ts`: `enqueue` (runs `redact()` + caps sizes), `list({status,scope,limit})`, `markStatus`.
- Wirings:
  - `confirmationService.sendConfirmation` writes a `ConfirmationDispatch` row on every attempt (success or failure).
  - `bookingService.bookRoute`, `rescheduleService.cancelAppointment` / `markNoShow`, `visitOutcomeService.completeVisit` each write `AppointmentStatusHistory` rows in the same transaction as the status mutation.
  - `whatsappService.sendTextMessage` / `sendTemplateMessage` and `emailService.sendEmail` enqueue `FailedOperation` rows on permanent failure.
- `app/[locale]/visit-requests/page.tsx` (AMBER-04): list view with planning-status + urgency filters; sidebar entry + EN/FR i18n.
- Docs: `docs/ARCHITECTURE.md` new "Domain vocabulary reconciliation" section (AMBER-05, 07, 08, 12) with explicit mapping tables; `docs/KNOWN_ISSUES.md` updated, 10 AMBERs closed.
- `eslint.config.mjs`: `argsIgnorePattern: ^_` so `_text`-style deliberately-unused args stop tripping the linter.
- Tests: 6 appointment-audit + 7 dead-letter = 13 new tests. Net (pre-PR D baseline 722) → see running-totals below.
- AMBER-04 (code): `/visit-requests` route + UI
- AMBER-05 (docs): triage vocabulary reconciliation
- AMBER-06 (code): geocoding metadata
- AMBER-07 (docs): RouteRun naming rationale
- AMBER-08 (docs): AppointmentStatus rationale
- AMBER-10 (code): `ConfirmationDispatch`
- AMBER-11 (code): `AppointmentResponse`
- AMBER-12 (docs): `ReminderSchedule` rationale
- AMBER-13 (code): `AppointmentStatusHistory`
- AMBER-15 (code): `FailedOperation` DLQ

- `SELECT event, actor, targetType FROM "SecurityAuditLog"` after a full booking → cancellation cycle shows the expected trail of events AND `AppointmentStatusHistory` shows `null → PROPOSED → CANCELLED`.
- Forcing a WhatsApp send against an invalid phone number enqueues a `FailedOperation` row whose `payload` contains `[redacted]` for any Bearer/api_key value that may have been attempted.
- `/en/visit-requests` loads at 390px width; filters refine the returned list.
- Only documentation-only AMBERs remain open: AMBER-09 (`AppointmentHorse` link table), deferred per the audit note (adequate until per-appointment horse metadata is tracked).
- Verify the overnight hardening report claims against the repo, fix any mismatches with the smallest safe change, and lock the fix in with a regression test.
- Uploader attribution spoofing (high-severity): `app/api/horses/[id]/attachments` POST previously fell back to `uploadedById` read from the multipart form if present, and used `subject.id` (Auth.js `User.id`) as a second fallback. Two bugs:
  - Authenticated vet could spoof a colleague as the uploader by adding `uploadedById=<victim-staff-id>` to the form.
  - `HorseAttachment.uploadedById` FK references `Staff.id`, so the fallback would also fail the FK check (or silently mis-attribute) when `subject.id` is a bare User id.

  Fix: ignore the form field entirely; resolve `staffRepository.findByUserId(subject.id)` and store `staff?.id ?? null`. New regression suite `__tests__/unit/api/horses-attachments.test.ts` locks in four cases: session→staff happy path, spoofed form value dropped, no-linked-staff falls back to null, description passes through unchanged.
- Every `Appointment.status` mutation site (`bookingService`, `rescheduleService.cancel` / `markNoShow`, `visitOutcomeService`) now writes `AppointmentStatusHistory`; no stray mutation site exists.
- Every `requireRole` placement matches the overnight report (`/api/export/vetup` ADMIN, `/api/staff` mutations ADMIN, `/api/attachments/[id]` NURSE GET + VET DELETE, `/api/attachments/[id]/analyse` VET, `/api/horses/[id]/clinical` NURSE GET + VET POST, `/api/horses/[id]/attachments` NURSE GET + VET POST, `/api/prescriptions/[id]` VET, `/api/status` ADMIN).
- Rate limits wired at the four claimed routes (`webhooks/whatsapp`, `webhooks/email`, `export/vetup`, `attachments/[id]/analyse`).
- `deadLetterService.enqueue` called from three claimed sites (`whatsapp sendTextMessage`, `whatsapp sendTemplateMessage`, `email sendEmail`).
- `applySecurityHeaders` wraps every branch of `middleware.ts` (6 call sites).
- `verifyWhatsAppVerifyToken` is the only verify-token check in `app/api/webhooks/whatsapp/route.ts` (no residual `===`).

- `npm run lint`, `typecheck`, `test`, `prisma validate`, `build`: all green. Net 739 tests passing (+4 new).
Overnight hardening sweep focused on data-access RBAC, fail-closed webhook auth, and rate limiting. Priority: protect customer/clinical data and close the remaining unauthenticated-integration paths.
- Fail-closed n8n / webhook auth: `lib/utils/signature.ts#requireN8nApiKey` replaces ad-hoc `if (env.N8N_API_KEY)` checks. Returns HTTP 500 in production when the key is unset, instead of silently accepting anonymous traffic. Applied to:
  - `/api/webhooks/email`
  - `/api/n8n/triage-result`, `/api/n8n/geocode-result`, `/api/n8n/route-proposal`
  - `/api/n8n/trigger/send-email`, `/api/n8n/trigger/send-whatsapp`, `/api/n8n/trigger/request-info`
  - `/api/reminders/check`
- Middleware public-paths: `/api/n8n/*` and `/api/reminders/check` added so n8n server-to-server calls are not blocked by the session middleware while the fail-closed API-key gate runs in the handler.
- Per-route rate limits on every n8n-authenticated endpoint (60–300 req/min per IP) plus a 30 req/min per-IP limiter on `/api/auth/{callback,signin,verify-request,session}` in `middleware.ts` to slow magic-link / OAuth callback abuse.
- RBAC + audit: `requireRole` added to customer / horse / yard / enquiry / visit-request / appointment / dashboard / triage-ops / triage-tasks / route-planning endpoints. DELETEs on Customer / Yard / Horse now write `SecurityAuditLog` entries (`CUSTOMER_DELETED`, `YARD_DELETED`, `HORSE_DELETED`). Override endpoint now derives `performedBy` from the RBAC subject, closing a spoofable-actor gap.
- Geocoding provenance runtime coverage: both `geocodingService.geocodeYard` and `updateYardCoordinates` now write `geocodeSource` / `geocodePrecision` / `formattedAddress` (columns existed from PR D but weren't populated on the Google path).
- Tests: signature-gate tests (6 new cases), middleware public-path tests (3 new cases), customer delete RBAC + audit tests (2 new cases). All existing suites adapted.

- `npm run lint`, `typecheck`, `test`, `prisma validate`, `build`: all green. Net 749 tests passing (+10 net new).
- Manual: confirmed unauthenticated `GET /api/n8n/triage-result` now returns 500 in a non-demo env with `N8N_API_KEY` unset; returns 401 with it set and no Bearer header; returns 200 with correct Bearer.
- Manual: `DELETE /api/customers/:id` with a NURSE session now returns 403; with ADMIN returns 200 and writes a `CUSTOMER_DELETED` row.
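
The 500 / 401 / 200 behaviour confirmed manually above can be expressed as a small pure function. This is a sketch, not `requireN8nApiKey` itself: `checkN8nApiKey`, `GateResult`, the pass-through in demo mode, and the hand-rolled comparison are assumptions (real code would more likely lean on the repository's constant-time helper).

```typescript
type GateResult = { ok: true } | { ok: false; status: 401 | 500 };

// Illustrative comparison that always walks the longer input instead of exiting early.
function constantTimeEquals(a: string, b: string): boolean {
  let diff = a.length ^ b.length;
  const length = Math.max(a.length, b.length);
  for (let i = 0; i < length; i++) {
    // charCodeAt returns NaN past the end; "|| 0" maps that to 0.
    diff |= (a.charCodeAt(i) || 0) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

function checkN8nApiKey(
  authorizationHeader: string | null,
  configuredKey: string | undefined,
  demoMode: boolean,
): GateResult {
  if (!configuredKey) {
    // Fail closed: a missing key is a server misconfiguration, not an open door.
    // Letting demo mode through is an assumption of this sketch.
    return demoMode ? { ok: true } : { ok: false, status: 500 };
  }
  const expected = `Bearer ${configuredKey}`;
  if (authorizationHeader === null || !constantTimeEquals(authorizationHeader, expected)) {
    return { ok: false, status: 401 };
  }
  return { ok: true };
}
```

Keeping the decision separate from the HTTP response makes the three outcomes easy to cover in unit tests without constructing request objects.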
The full build of contract draft v3 § 4.2 plus the five new requirements surfaced in Kathelijne's 2026-05-26 planning call. Ten PRs (#161 → #172). Full per-feature description in docs/PHASE_2_DELIVERY_SUMMARY.md; this section is the build-plan-shaped summary.
| Slice | Feature | Round | PR |
|---|---|---|---|
| § 2.1 | Quick WhatsApp Answer Mode | 2 | #163 |
| § 2.2 | VetUp data-shape import | 6 | #167 |
| § 2.3 | Structured Fiche dentaire dental chart | 3 | #164 |
| § 2.4 | WhatsApp self-message → invoice line | 4 | #165 |
| § 2.5 | Voice-note intake + wake-word routing | 9 | #170 |
| § 2.6 | Vet-pairing routing | 5 | #166 |
| § 2.7 | Unified visit timing model | 1 | #161 |
| § 2.8 | Practice scheduling config (singleton + admin UI) | 1 + 10 | #161 / #172 |
| § 2.9 | Vaccination reminder cron | — | already shipped pre-Phase-2 |
| § 2.10 | Slot suggestion engine | 7 | #168 |
| § 2.11 | FAQ AI (curation + matcher + Quick Answer integration) | 8 + 10 | #169 / #172 |
| — | On-prem deployment artefact | 10 | #172 |
Schema additions (additive, all backward-compatible):
- New enums: `VisitServiceType`, `AnimalSpecies`, `AnimalSex`, `SelfInvoiceTaskStatus`, plus 5 new `FindingCategory` values
- New models: `PracticeSchedulingConfig` (singleton), `SelfInvoiceTask`, `FaqEntry`
- Customer + Horse gained 12 VetUp-parity nullable columns
- `DentalChart.checklist` JSONB for the 11-section structured form
- `VisitRequest.services` array + `suggestedSlots` JSONB + `suggestedSlotsAt`
- `RouteRun.parallelGroupId` + `RouteRunStop.isJoint` for vet pairing
- `EnquiryMessage.isVoiceNote / audioMediaId / audioTranscript / wakeIntent` for voice intake
New services:
- `lib/services/draft-generation.service.ts` (3-tone AI drafts)
- `lib/services/dental-chart-prefill.service.ts` (free-text → structured JSON)
- `lib/services/self-invoice-parser.service.ts` (parse vet WhatsApp self-messages)
- `lib/services/self-invoice.service.ts` (PENDING → INVOICED workflow)
- `lib/services/voice-transcription.service.ts` (STT interface + mock backend)
- `lib/services/wake-word.service.ts` (pure detector)
- `lib/services/vet-pairing.service.ts` (joint/solo classification + distribution)
- `lib/services/visit-timing.service.ts` (unified timing calculator)
- `lib/services/practice-config.service.ts` (singleton with 30s cache)
- `lib/services/slot-suggestion.service.ts` (haversine-based scoring)
- `lib/services/faq.service.ts` (CRUD + audit)
- `lib/services/faq-matcher.service.ts` (lexical + LLM two-pass)
- `lib/services/vetup-import.service.ts` (CSV parser + idempotent upsert)
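
The slot-suggestion service is listed as using haversine-based scoring. As background, the great-circle distance it refers to can be sketched as follows; `haversineKm`, `rankByDistance`, and the 6371 km mean Earth radius are choices made for this illustration, and the actual scoring function is not shown in this document.

```typescript
interface LatLng {
  lat: number; // degrees
  lng: number; // degrees
}

const EARTH_RADIUS_KM = 6371; // mean radius, a common approximation

function toRadians(degrees: number): number {
  return (degrees * Math.PI) / 180;
}

// Great-circle distance between two points using the haversine formula.
function haversineKm(a: LatLng, b: LatLng): number {
  const dLat = toRadians(b.lat - a.lat);
  const dLng = toRadians(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLng / 2) ** 2;
  // Clamp to 1 to guard against floating-point overshoot before asin.
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1, Math.sqrt(h)));
}

// Hypothetical ranking step: nearest candidates first, input left untouched.
function rankByDistance<T extends LatLng>(origin: LatLng, candidates: readonly T[]): T[] {
  return [...candidates].sort((x, y) => haversineKm(origin, x) - haversineKm(origin, y));
}
```

Straight-line distance ignores roads, so a scorer built on it is best read as a cheap proximity heuristic ahead of any live routing call.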
New UI surfaces:
- `/[locale]/enquiries/[id]/answer`: Quick Answer Mode
- `/[locale]/horses/[id]/dental-charts/new`: structured dental form
- `/[locale]/admin/self-invoice`: pending self-invoice tasks
- `/[locale]/admin/practice-config`: tunable scheduling config
- `/[locale]/admin/faqs`: FAQ curation
- Inline panels on `/visit-requests` (slot suggestions) + `/route-runs` (parallel-route badges)
- Voice-note + intent badges in the message thread
New infra:
- `docker-compose.onprem.yml` + supporting Caddy / backup / env files
- CLI: `scripts/import-vetup.ts`
LLM stack: single Claude Haiku integration via the pre-existing `@anthropic-ai/sdk`. The same client surface (`getAnthropicClient` + `DRAFT_MODEL`) is reused across four features (drafts, dental prefill, self-invoice parsing, FAQ matching). Every LLM service has a DEMO_MODE / no-key fallback path so the demo runs offline.
Voice transcription: deterministic mock keyed on a per-mediaId hash. The interface is production-ready: `callWhisper()` in `voice-transcription.service.ts` is the single function to wire up (one OpenAI / Gemini SDK call).
- Five-check gate (lint / typecheck / prisma validate / npm test / build) green on every push across all ten rounds
- Net 1,743 tests passing on main after Round 10 merge (up from ~1,440 pre-Phase-2)
- ~270 new test cases added across the rounds covering parsers, services, API routes, components
- No regressions on the existing Phase 1 surface — every UI page rendered + tested with the new fields nullable
- Both new admin pages + each new feature route registered in the production build manifest
- DEMO_MODE behaviour verified end-to-end on each round before push (mock STT, mock drafts, mock parser, mock matcher all return deterministic output)