diff --git a/docs/adrs/DRAFT-app-owned-scheduling.md b/docs/adrs/DRAFT-app-owned-scheduling.md new file mode 100644 index 00000000..4847bc6a --- /dev/null +++ b/docs/adrs/DRAFT-app-owned-scheduling.md @@ -0,0 +1,65 @@ +# DRAFT: App-owned scheduling — Postgres-backed schedules fired by api-server + +**Date:** 2026-04-28 +**Status:** Proposed +**Owner:** @tomkis + +## Context + +Schedules in Humr today are K8s-native: each user-created schedule is an `agent-schedule` ConfigMap with `humr.ai/type=agent-schedule`. The Go controller watches these ConfigMaps, runs a per-schedule goroutine that walks RRULE occurrences with quiet hours, wakes the target instance, and delivers the trigger via `kubectl exec` writing `/home/agent/.triggers/{ts}.json` ([ADR-008](008-trigger-files.md), [ADR-031](031-schedule-rrule-quiet-hours.md)). + +This shape was inherited from [ADR-006](006-configmaps-over-crds.md): "everything is a ConfigMap so Humr installs without cluster-admin." That rationale is load-bearing for `agent-instance` (the controller materializes pods from it), but it is **not** load-bearing for `agent-schedule` — schedules don't materialize K8s objects, they poke existing pods. We adopted the ConfigMap pattern uniformly for symmetry, and the cost has accumulated: + +- **RRULE math lives in two languages.** The api-server already needs a TS RRULE implementation to render "next fires" in the UI; the controller has its own Go implementation that fires for real. Two implementations, one source of truth — guaranteed drift surface, and the schedule-RRULE-quiet-hours semantics in [ADR-031](031-schedule-rrule-quiet-hours.md) are non-trivial enough that drift is likely. +- **Schedules can't JOIN with anything.** Schedule↔session linkage lives across two stores: `agent-schedule.status.yaml` references `sessionId`, while `sessions` is a Postgres table ([ADR-017](017-db-backed-sessions.md)). Owner filtering, allow-list awareness, and audit trails for schedules are all api-server concerns sitting on the wrong side of a K8s/Postgres boundary. +- **`kubectl exec` is a privileged trigger path.** The controller needs `pods/exec` RBAC purely for trigger delivery. Trigger files give us "durable at-least-once via PVC" almost for free, but at the cost of a pod-pierce capability that nothing else in the platform needs. +- **Status plumbing is awkward.** Schedule next-fire / last-error has to flow controller → `status.yaml` → api-server read → UI render. Edits go the other direction through `spec.yaml`. Both paths are eventual via the K8s watch. +- **Two-language ownership of one concept.** Schedules are user-created, owner-scoped, allow-list-aware domain objects. The api-server owns identity, owner labels, and tRPC validation; the controller fires. Splitting domain ownership across processes raises the cost of every schedule-shaped feature (e.g., "preview the next 10 fires," "skip the next fire," "bulk-edit quiet hours"). + +The recently-landed centralized pod-reachability primitive ([ADR-032](032-pod-reachability-primitive.md)) removes one of the historical reasons schedule firing had to live next to the StatefulSet reconciler: wake is now a callable primitive in both Go and TypeScript with identical semantics. Any process that can reach the K8s API can wake a pod safely. + +## Decision + +**Schedules become a Postgres-backed domain resource owned by the api-server. The api-server fires them. The controller stops watching `agent-schedule` ConfigMaps.** + +Concretely: + +- **Storage.** A new `schedules` Postgres table holds `{ id, owner, instance_id, rrule, quiet_hours, tzid, mode, payload, created_at, updated_at }`. The api-server is the sole writer, owner-filtered like every other api-server resource. `agent-schedule` ConfigMaps are removed from the resource model. +- **Firing.** A single api-server replica holds a Postgres advisory lock and runs the schedule loop: pull due schedules, walk RRULE/quiet-hours in TS (the existing UI preview library), enqueue fires into a `schedule_fires` outbox table. Other replicas idle on the lock; on leader loss, the next replica picks it up. No K8s leader election. +- **Trigger delivery.** Replaces `kubectl exec → trigger file` with `api-server → harness port HTTP POST` (the harness port already accepts trigger receipt — see [ADR-022](022-harness-api-server.md)). The agent-runtime processes the POST identically to a trigger-file pickup. Wake is via the existing reachability primitive in the api-server ([ADR-032](032-pod-reachability-primitive.md)). +- **Durability.** The `schedule_fires` outbox row is written before delivery and acked on harness 2xx. Unacked rows retry with backoff. This replaces the PVC-based at-least-once that the trigger file gave us "for free" with explicit at-least-once in Postgres — at-least-once semantics are preserved, the substrate moves from a per-pod filesystem to the platform DB. +- **Status.** Schedule status (`nextFire`, `lastFire`, `lastError`) becomes a column on the `schedules` table. The ConfigMap `status.yaml` round-trip is gone. +- **Controller.** Loses the schedule reconciler, the cron loop, the RRULE library, and the `pods/exec` reliance for trigger delivery. Keeps everything else: `agent`, `agent-instance`, `agent-fork` reconcilers; pod/StatefulSet/Service/NetworkPolicy/Secret materialization; idle checker; reachability primitive. +- **Migration.** A one-shot `agent-schedule` → Postgres backfill, then deletion of the ConfigMaps and the controller-side reconciler. Interactive triggers (the `pi-agent` flow that also drops trigger files) keep working — they're not on the schedule path and the trigger-file mechanism stays functional. + +This supersedes [ADR-008](008-trigger-files.md) for the *scheduled* trigger path. ADR-008's "trigger files in `/home/agent/.triggers/`" mechanism remains valid for any non-scheduled trigger source that wants the file-based contract; the supersession is scoped to "controller-owned cron + exec-based delivery." + +## Alternatives Considered + +**Keep schedules in the controller, add a shared RRULE package.** Pull the RRULE math into a `packages/schedule-core/` shared between Go and TS so the firer and the UI preview compute the same thing. Cheaper migration; doesn't address the JOIN problem, the `pods/exec` reliance, the cross-process status plumbing, or the two-language ownership split. A point fix, not a refactor. + +**Move schedules to the api-server but keep ConfigMaps as the substrate.** API-server reads/writes `agent-schedule` ConfigMaps directly, runs its own RRULE loop. Removes the controller dependency but keeps all the ConfigMap-as-DB problems (no JOIN, no schema, awkward edit semantics). No clear advantage over going to Postgres. + +**External job queue (BullMQ + Redis, Temporal, etc.).** Stand up a real workflow engine. Rejected for proportionality: schedules in Humr are a few dozen per install, low frequency, low fan-out. Postgres advisory locks + an outbox table is the smallest tool that matches the problem. Adding Redis or Temporal expands the install surface for no proportionate gain. + +**K8s CronJobs.** Out of scope — would put scheduling back into K8s under a different label, and CronJobs don't have the wake/pod-reachability semantics Humr needs. Rejected before this proposal even framed. + +## Consequences + +**Easier:** + +- One language owns RRULE/quiet-hours/timezone semantics (TS). +- Schedule features that need user identity or owner filtering (preview-next-fires-for-user, audit log, allow-list checks) become trivial — they're api-server-local. +- The controller shrinks. Its responsibilities collapse to "reconcile pods/secrets/networkpolicies from instance ConfigMaps" — closer to a textbook operator. +- `pods/exec` RBAC drops out of the controller entirely (the reachability primitive's wake path doesn't need it; only trigger delivery did). +- Schedule edits are immediate (Postgres transaction) instead of eventual (ConfigMap watch). +- Schedule status no longer needs the spec/status split; the same row holds intent and observed. + +**Harder:** + +- The PVC-based "trigger file as durable inbox" property has to be replaced with an outbox table that the api-server actually retries from. This is standard but is real code that has to be right — at-least-once with retry + dedup at the harness side. Worth dedicated test coverage before flipping. +- Leader election among api-server replicas now matters for schedule firing. Postgres advisory lock is the proposed mechanism; needs an explicit "leader-loss → drop in-flight fires for restart-recovery" story. +- Migration cost: backfilling existing `agent-schedule` ConfigMaps into the new table, with rollback if the rollout regresses. Bounded but non-trivial. +- The controller's NetworkPolicy invariant ("api-server pods can reach the harness port") becomes load-bearing for schedule firing, not just MCP/trigger receipt. It's already in place; just worth flagging that the NP is now the spine of trigger delivery. +- Architecture pages need updates: `agent-lifecycle.md` (Trigger fire section), `persistence.md` (substrate table — schedules move from ConfigMap to Postgres), `platform-topology.md` (controller responsibilities shrink, ConfigMap types table loses `agent-schedule`). +- ADR-006's "ConfigMaps over CRDs" rationale needs revisiting in spirit — not invalidated, but the symmetry argument ("everything is a ConfigMap") weakens once schedules step out. Worth a follow-up note clarifying the rationale applies to *resources that materialize K8s objects*, not domain state in general. diff --git a/docs/adrs/index.md b/docs/adrs/index.md index e7962bd2..28b9a774 100644 --- a/docs/adrs/index.md +++ b/docs/adrs/index.md @@ -45,3 +45,4 @@ This directory contains ADRs for the Humr project. | Draft | Title | Owner | |-------|-------|-------| | [DRAFT](DRAFT-multi-agent.md) | Multi-agent collaboration — isolated instances with shared artifacts | @tomkis | +| [DRAFT](DRAFT-app-owned-scheduling.md) | App-owned scheduling — Postgres-backed schedules fired by api-server (supersedes ADR-008) | @tomkis |