Context and conventions for AI assistants working on the Optio codebase.
Optio is an orchestration system for AI coding agents. Think of it as "CI/CD where the build step is an AI agent." There is one primary user-facing concept (Tasks), where a single attribute (whether a repo is set) flips the pipeline, plus shared primitives:
Tasks — a configured unit of agent work. A Task has a Who (agent type), What (prompt or template), When (trigger: manual / schedule / webhook / ticket), optional Where (repo + branch), and Why (description). Tasks come in three flavors:
- Repo Task — `Where` is set. The agent clones the repo into a worktree and opens a PR:
  - Spins up an isolated Kubernetes pod for the repository (pod-per-repo)
  - Creates a git worktree for the task (multiple run concurrently per repo)
  - Runs Claude Code, OpenAI Codex, GitHub Copilot, Google Gemini, or OpenCode with the prompt
  - Streams structured logs back to a web UI in real time
  - Agent stops after opening a PR (no CI blocking)
  - PR watcher tracks CI checks, review status, and merge state
  - Auto-triggers code review agent on CI pass or PR open (if enabled)
  - Auto-resumes agent when reviewer requests changes (if enabled)
  - Auto-completes on merge, auto-fails on close
  Supported git platforms: GitHub, GitLab (incl. self-hosted via `GITLAB_HOSTS`), and AWS CodeCommit. CodeCommit auths via AWS access keys (or IRSA / instance profile when running on EKS) and uses the AWS CLI credential helper for clones; PR ops go through `@aws-sdk/client-codecommit`. CodeCommit has no native CI or issues — `getCIChecks` returns `[]` (auto-merge still fires on `checksStatus="none"`), `listIssues` returns `[]`, and `reviewTrigger="on_pr"` is recommended over the default `on_ci_pass` for CodeCommit repos.
- Standalone Task — no `Where`. The agent runs in an isolated pod with no repo checkout, producing logs and side effects (e.g., queries Slack, posts to a database). Scheduled/webhook-driven runs of this flavor are the common case.
- Persistent Agent — long-lived, named, message-driven agent process that does not terminate after running. Halts after each turn and waits to be re-woken by a user message, an agent message, a webhook, a cron tick, or a ticket event. Addressable by other agents in the same workspace via the inter-agent HTTP API (`/api/internal/persistent-agents/*`). Three configurable pod lifecycle modes: `always-on`, `sticky` (default, with idle warm window), and `on-demand`. UI at `/agents`. Schema: `persistent_agents`, `persistent_agent_turns`, `persistent_agent_messages`, `persistent_agent_pods`. See `docs/persistent-agents.md` and the demo in `demos/the-forge/`.
Scheduled (Task Configs) — a saved Task blueprint that spawns fresh Tasks on a trigger firing. Stored in task_configs. Each firing calls instantiateTask() which goes through the full Repo Task pipeline. Manageable at /tasks/scheduled. Standalone equivalents are stored in workflows (see backend-naming note below).
Triggers — polymorphic table `workflow_triggers` keyed by `(target_type, target_id)`. `target_type` is `"job"` (Standalone Tasks) or `"task_config"` (Repo Tasks). Trigger types: manual, schedule (cron), webhook, ticket. The workflow-trigger-worker polls due schedule triggers and dispatches to the correct target service.
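The trigger dispatch described above can be sketched as follows. This is a hedged approximation: the real worker calls async service methods, and the handler signatures here are illustrative stand-ins, not the actual Optio APIs.

```typescript
// Hedged sketch of polymorphic trigger dispatch on target_type.
// Handler signatures are hypothetical; the real services are async.
type TargetType = "job" | "task_config";

interface TriggerRow {
  targetType: TargetType;
  targetId: string;
}

function dispatchTrigger(
  trigger: TriggerRow,
  handlers: {
    createWorkflowRun: (workflowId: string) => void; // Standalone Task run
    instantiateTask: (configId: string) => void; // Repo Task from blueprint
  },
): void {
  // "job" targets a workflow (Standalone Task); "task_config" targets a Repo Task blueprint.
  if (trigger.targetType === "job") handlers.createWorkflowRun(trigger.targetId);
  else handlers.instantiateTask(trigger.targetId);
}
```

The point of the sketch is the single branch point: one trigger table serves both target services, so new trigger types (cron, webhook, ticket) need no per-target plumbing.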
Templates — reusable prompt templates in `prompt_templates` with a `kind` discriminator (prompt / review / job / task). Supports `{{param}}` substitution and `{{#if param}}...{{/if}}` blocks. Rendered lazily on trigger firing so params from the trigger payload substitute into the prompt.
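The substitution scheme can be sketched in a few lines. This is a hedged approximation of `renderTemplateString`, not the actual implementation in `prompt-template-service.ts` (which may handle nesting and escaping differently):

```typescript
// Sketch of {{param}} + {{#if param}}...{{/if}} rendering as described above.
// Assumes non-nested #if blocks; the real renderer may support more.
function renderTemplateString(template: string, params: Record<string, string>): string {
  // Resolve {{#if param}}...{{/if}} blocks first: keep the body when the param is truthy.
  const withBlocks = template.replace(
    /\{\{#if (\w+)\}\}([\s\S]*?)\{\{\/if\}\}/g,
    (_m: string, name: string, body: string) => (params[name] ? body : ""),
  );
  // Then substitute plain {{param}} placeholders, leaving unknown names empty.
  return withBlocks.replace(/\{\{(\w+)\}\}/g, (_m: string, name: string) => params[name] ?? "");
}
```

Resolving conditional blocks before plain placeholders matters: a `{{param}}` inside a false `{{#if}}` block must disappear with the block rather than render as an empty string in place.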
Connections — external service integrations injected into agent pods at runtime via MCP (Model Context Protocol). Built-in providers: Notion, GitHub, Slack, Linear, PostgreSQL, Sentry, Filesystem. Also supports custom MCP servers and HTTP APIs. Fine-grained access control (per-repo, per-agent-type, permission levels).
Backend-naming note. For historical reasons the tables are tasks (Repo Tasks' one-time runs), task_configs (Repo Task blueprints), and workflows / workflow_runs / workflow_triggers (Standalone Tasks and their shared trigger surface). The v0.4 UI settled on these user-facing names:
- Tasks — Repo Tasks (formerly "Repo Tasks" in copy; now just "Tasks")
- Jobs — Standalone Tasks (matches the `/api/jobs` URL and `/jobs/*` web routes)
- Reviews — code-review subtasks + external PR reviews (formerly "PR Reviews"; promoted out of `/tasks` into its own top-level slot)
- Issues — GitHub Issues queue (promoted to its own top-level nav item)
- Agents — Persistent Agents (the third tier; long-lived, message-driven)
- Prompts — reusable prompt templates (was "Templates" in the Library)
The sidebar groups these as Run (Tasks · Jobs · Reviews · Issues · Scheduled) and Live (Agents · Sessions). The /tasks hub-with-tabs from earlier versions is gone — each section is its own page now. Legacy /tasks?tab=... URLs redirect to the dedicated routes.
For the long-form explanation of how the Task flavors map to the three internal types, the polymorphic HTTP layer, and how the UI presents them, see docs/tasks.md.
Unified /api/tasks HTTP layer. All three kinds (repo-task, repo-blueprint, standalone) are reachable through one polymorphic HTTP resource:
- `GET /api/tasks?type=repo-task|repo-blueprint|standalone|all` — unified list
- `POST /api/tasks` — body takes `{ type, ... }`; dispatches to taskService, taskConfigService, or workflowService based on type
- `GET /api/tasks/:id` — resolves the id across all three tables; returns the native row tagged with a `type` discriminator
- `GET/POST /api/tasks/:id/runs[/:runId]` — polymorphic runs (spawned `tasks` for blueprints, `workflow_runs` for standalone, 405 for ad-hoc)
- `GET/POST/PATCH/DELETE /api/tasks/:id/triggers[/:triggerId]` — polymorphic triggers (405 for ad-hoc repo-task)
- Resolver: `unified-task-service.resolveAnyTaskById()` checks tasks → task_configs → workflows; UUIDs are globally unique so no collision
Legacy /api/jobs/* and /api/task-configs/* endpoints still work as thin aliases for back-compat.
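The resolver order above can be sketched like this. It is a minimal synchronous approximation: the real `unified-task-service` runs async Drizzle queries, and the lookup functions here are hypothetical stand-ins.

```typescript
// Sketch of resolveAnyTaskById: check tasks → task_configs → workflows in order.
// Lookup functions are illustrative; the real service queries Postgres.
type TaskKind = "repo-task" | "repo-blueprint" | "standalone";
type Lookup = (id: string) => Record<string, unknown> | null;

function resolveAnyTaskById(
  id: string,
  lookups: Record<TaskKind, Lookup>,
): { type: TaskKind; row: Record<string, unknown> } | null {
  // UUIDs are globally unique across the three tables, so the first hit
  // is the only possible hit and we can stop there.
  const order: TaskKind[] = ["repo-task", "repo-blueprint", "standalone"];
  for (const type of order) {
    const row = lookups[type](id);
    if (row) return { type, row };
  }
  return null;
}
```

Tagging the native row with a `type` discriminator is what lets one HTTP resource serve three tables without a shared schema.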
┌─────────────┐ ┌──────────────┐ ┌─────────────────────────────┐
│ Web UI │────→│ API Server │────→│ K8s Pods │
│ Next.js │ │ Fastify │ │ │
│ :30310 │ │ :30400 │ │ ┌─ Repo Pod A ──────────┐ │
│ │←ws──│ │ │ │ clone + sleep │ │
│ Run │ │ - BullMQ │ │ │ ├─ worktree 1 │ │
│ Tasks │ │ - Drizzle │ │ │ ├─ worktree 2 │ │
│ Jobs │ │ - WebSocket │ │ │ └─ worktree N │ │
│ Reviews │ │ - PR Watcher │ │ └────────────────────────┘ │
│ Issues │ │ - Workflow Q │ │ ┌─ Job Pod ─────────────┐ │
│ Scheduled │ │ - PA Worker │ │ │ isolated agent │ │
│ Live │ │ - Reconciler │ │ └────────────────────────┘ │
│ Agents │ │ - Health Mon │ │ ┌─ Persistent Agent Pod ┐ │
│ Sessions │ │ - Connection │ │ │ long-lived; turns │ │
│ │ │ Service │ │ │ wake on messages │ │
└─────────────┘ └──────┬───────┘ │ └────────────────────────┘ │
│ └──────────────────────────────┘
┌──────┴───────┐
│ Postgres │ State, logs, workflows, persistent agents,
│ │ inboxes, connections, secrets
│ Redis │ Job queue, pub/sub
└──────────────┘
All services run in Kubernetes (including API and web). Local dev uses
Docker Desktop K8s with Helm. See setup-local.sh.
Central optimization. Instead of one pod per task (slow, wasteful), one long-lived pod per repository:
- Pod clones repo once, runs `sleep infinity`. Tasks `exec` in: `git worktree add` → run agent → cleanup
- Multiple tasks run concurrently per pod (one per worktree)
- Pods use persistent volumes; idle for 10 min (`OPTIO_REPO_POD_IDLE_MS`) before cleanup
- Entrypoints: `scripts/repo-init.sh` (pod), `scripts/agent-entrypoint.sh` (legacy)
Multi-pod scaling: repos can have multiple pod instances for higher throughput.
- `maxPodInstances` (default 1, max 20) — pod replicas per repo
- `maxAgentsPerPod` (default 2, max 50) — concurrent agents per pod
- Total capacity = `maxPodInstances × maxAgentsPerPod`
- Pod scheduling: same-pod retry affinity → least-loaded → dynamic scale-up → queue overflow
- LIFO scaling: higher-index pods removed first on idle cleanup
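The scheduling order above can be sketched as a pure decision function. This is a hedged illustration under stated assumptions — the function name, `Pod` shape, and string sentinels are invented here, not Optio's actual scheduler API:

```typescript
// Illustrative sketch of the pod scheduling order: same-pod retry affinity,
// then least-loaded, then dynamic scale-up, then queue overflow.
type Pod = { id: string; running: number };

function pickPod(
  pods: Pod[],
  opts: { lastPodId?: string; maxAgentsPerPod: number; maxPodInstances: number },
): string /* pod id, or "scale-up" / "queue" sentinel */ {
  // 1. Same-pod retry affinity: reuse the last pod if it still has a free slot.
  const affinity = pods.find((p) => p.id === opts.lastPodId && p.running < opts.maxAgentsPerPod);
  if (affinity) return affinity.id;
  // 2. Least-loaded pod with a free agent slot.
  const open = pods.filter((p) => p.running < opts.maxAgentsPerPod);
  if (open.length) return open.sort((a, b) => a.running - b.running)[0].id;
  // 3. Dynamic scale-up while under maxPodInstances.
  if (pods.length < opts.maxPodInstances) return "scale-up";
  // 4. Otherwise the task waits in the queue.
  return "queue";
}
```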
Tasks track worktree state via tasks.worktreeState: active, dirty, reset, preserved, removed. tasks.lastPodId enables same-pod retry affinity. See repo-cleanup-worker for cleanup rules.
pending → queued → provisioning → running → pr_opened → completed
↓ ↑ ↓ ↑
needs_attention needs_attention
↓ ↓
cancelled cancelled
running → failed → queued (retry)
State machine in `packages/shared/src/utils/state-machine.ts`. All transitions are validated — invalid ones throw `InvalidTransitionError`. Always use `taskService.transitionTask()`.
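The validated-transition idea can be sketched as follows. The transition subset below is read off the diagram above; the real table in `packages/shared/src/utils/state-machine.ts` is authoritative and richer.

```typescript
// Minimal sketch of a validated task state machine (subset of the diagram above).
type TaskState =
  | "pending" | "queued" | "provisioning" | "running"
  | "pr_opened" | "completed" | "failed" | "needs_attention" | "cancelled";

const transitions: Partial<Record<TaskState, TaskState[]>> = {
  pending: ["queued"],
  queued: ["provisioning"],
  provisioning: ["running", "failed"],
  running: ["pr_opened", "failed", "needs_attention"],
  pr_opened: ["completed", "needs_attention"],
  needs_attention: ["running", "pr_opened", "cancelled"],
  failed: ["queued"], // retry path
};

class InvalidTransitionError extends Error {}

function assertTransition(from: TaskState, to: TaskState): void {
  // Reject any edge not in the table, mirroring how transitionTask() validates.
  if (!transitions[from]?.includes(to)) {
    throw new InvalidTransitionError(`${from} → ${to} is not allowed`);
  }
}
```

Centralizing the edge table is the design point: every producer goes through one validator, so an impossible transition fails loudly instead of silently corrupting task state.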
Tasks have an integer priority (lower number = higher priority). Two concurrency limits:
- Global: `OPTIO_MAX_CONCURRENT` (default 5) — total running/provisioning tasks
- Per-repo: `repos.maxConcurrentTasks` (default 2) — effective limit is `max(maxConcurrentTasks, maxPodInstances × maxAgentsPerPod)`
When a limit is hit, the task is re-queued with a 10s delay.
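The two limits combine as sketched below. `effectiveRepoLimit` follows the formula stated above; `canStart` and its parameter names are a hypothetical illustration of the admission check, not Optio's actual function.

```typescript
// Effective per-repo limit, per the formula above: pod scaling can raise the
// ceiling above repos.maxConcurrentTasks.
function effectiveRepoLimit(
  maxConcurrentTasks: number,
  maxPodInstances: number,
  maxAgentsPerPod: number,
): number {
  return Math.max(maxConcurrentTasks, maxPodInstances * maxAgentsPerPod);
}

// Hypothetical admission check: both the global and per-repo caps must have room.
// If either is hit, the caller re-queues the task with a 10s delay.
function canStart(globalActive: number, repoActive: number, globalMax: number, repoLimit: number): boolean {
  return globalActive < globalMax && repoActive < repoLimit;
}
```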
Web UI: Multi-provider OAuth (GitHub, Google, GitLab, generic OIDC). Enable by setting <PROVIDER>_OAUTH_CLIENT_ID + <PROVIDER>_OAUTH_CLIENT_SECRET (or OIDC_ISSUER_URL + OIDC_CLIENT_ID + OIDC_CLIENT_SECRET for generic OIDC). Sessions use SHA256-hashed tokens (30-day TTL). Local dev bypass: OPTIO_AUTH_DISABLED=true.
Claude Code (four modes, selected in setup wizard):
- API Key: `ANTHROPIC_API_KEY` env var injected into agent pods
- OAuth Token (recommended for k8s): `CLAUDE_CODE_OAUTH_TOKEN` encrypted secret injected into pods
- Vertex AI (GCP workloads): routes through Google Cloud Vertex AI. Uses `CLAUDE_VERTEX_PROJECT_ID`, `CLAUDE_VERTEX_REGION`, and optional `CLAUDE_VERTEX_SERVICE_ACCOUNT_KEY` (encrypted, global scope). Falls back to workload identity when no service account key is provided. Service account keys are written to `/home/agent/.config/gcloud/gsa-key.json` with chmod 600
- Max Subscription (legacy, local dev only): reads from the host macOS Keychain
These are well-documented in code; read the relevant service files for details:
- PR watcher (`pr-watcher-worker.ts`): polls PRs every 30s, tracks CI/review, triggers reviews, auto-resumes, handles merge/close
- Code review agent (`review-service.ts`): launches review as a blocking subtask, uses `repos.reviewModel` (defaults to sonnet)
- Subtask system: three types (child, step, review) via `parentTaskId`, with `blocksParent` for synchronization
- Prompt templates: `{{VARIABLE}}` + `{{#if VAR}}...{{/if}}` syntax. Priority: repo override → global default → hardcoded fallback
- Shared cache directories: per-repo persistent PVCs for tool caches (npm, pip, cargo, etc.), managed via `/api/repos/:id/shared-directories`
- Interactive sessions: persistent workspaces with terminal + agent chat, at `/sessions`
- Workspaces: multi-tenancy via `workspaceId` column. Roles (admin/member/viewer) in schema but not fully enforced
- Standalone Tasks / Jobs (`workflow-service.ts`, `workflow-worker.ts`): top-level Jobs nav item under "Run" (list at `/jobs`, detail at `/jobs/:id`, runs at `/jobs/:id/runs/:runId`). Agent runs with no repo, `{{PARAM}}` prompt templates, four trigger types (manual/schedule/webhook/ticket), pooled pod execution, real-time log streaming, auto-retry with exponential backoff. Pods are shared across runs within a workflow, keyed on `(workflow_id, instance_index)`: each workflow has `workflows.maxPodInstances` pod replicas (default 1, max 20) and `workflows.maxAgentsPerPod` concurrent runs per pod (default 2, max 50) — mirrors repo pod scaling. Runs track their assigned pod via `workflow_runs.pod_id` and remember it for retry affinity via `last_pod_id`. Schema: `workflows`, `workflow_triggers`, `workflow_runs`, `workflow_run_logs`, `workflow_pods`
- Repo Task Configs (`task-config-service.ts`, routes in `task-configs.ts`): reusable Repo Task blueprints that spawn tasks when triggers fire. `instantiateTask(configId, { triggerId, params })` creates a task with rendered prompt + title, transitions it to QUEUED, and enqueues the BullMQ job. UI at `/tasks/scheduled`. Schema: `task_configs`
- Triggers (`workflow-trigger-service.ts`, `workflow-trigger-worker.ts`): polymorphic trigger table (`workflow_triggers`) keyed by `(target_type, target_id)`. `target_type="job"` dispatches to `createWorkflowRun`; `target_type="task_config"` dispatches to `instantiateTask`. Schedule trigger worker polls every 60s (`OPTIO_WORKFLOW_TRIGGER_INTERVAL`)
- Prompts / Templates (`prompt-template-service.ts`, routes in `prompt-templates.ts`): reusable prompt templates with a `kind` discriminator (prompt/review/job/task). `renderTemplateString(template, params)` handles `{{param}}` substitution + `{{#if}}` blocks. UI at `/templates` (labeled Prompts in the Library nav as of v0.4 — the "Templates" name was freed for other use)
- Persistent Agents (`persistent-agent-service.ts`, workers `persistent-agent-worker.ts` / `persistent-agent-cleanup-worker.ts`, routes in `persistent-agents.ts` and `internal/persistent-agents.ts`): the third Task tier — long-lived, named, message-driven. Cyclic state machine (idle → queued → provisioning → running → idle), per-agent pod lifecycle modes (always-on/sticky/on-demand). Wake sources: user/agent messages, webhook, cron tick, ticket event, system. Inter-agent HTTP API at `/api/internal/persistent-agents/*` (auth via `X-Optio-Agent-Token`). UI at `/agents` (under "Live"). Schema: `persistent_agents`, `persistent_agent_turns`, `persistent_agent_turn_logs`, `persistent_agent_messages`, `persistent_agent_pods`. See `docs/persistent-agents.md` and the demo in `demos/the-forge/`
- Connections (`connection-service.ts`): external service integrations via MCP. Built-in providers: Notion, GitHub, Slack, Linear, PostgreSQL, Sentry, Filesystem. Also supports custom MCP servers and HTTP APIs. Three-layer model: providers (catalog) → connections (configured instances) → assignments (per-repo/agent-type rules). Injected into agent pods at task runtime via `getConnectionsForTask()` in task-worker
- Reconciliation control plane (`workers/reconcile-worker.ts`, `services/reconcile-{snapshot,executor,queue}.ts`, `packages/shared/src/reconcile/`): K8s-style reconciler with four `RunKind`s — `repo` (Repo Task runs in `tasks`), `standalone` (Job runs in `workflow_runs`), `pr-review` (external PR reviews), and `persistent-agent` (Persistent Agents in `persistent_agents`). Pure decision functions consume a frozen `WorldSnapshot` and return a typed `Action`; the executor applies it under CAS so concurrent passes can't trample each other. Producers — `taskService.transitionTask`, `workflow-worker`'s `transitionRun`, the `pr-watcher` poll cycle, `repo-cleanup` pod-health detection, and `wakeAgent()` (PA inbox + trigger dispatch) — wake the reconciler via `enqueueReconcile`. Periodic resync (`OPTIO_RECONCILE_RESYNC_INTERVAL`, 5 min) catches anything missed. The reconciler owns: PR-driven transitions (auto-merge, complete-on-merge, fail-on-close), auto-resume on CI/conflict/review (capped by `OPTIO_MAX_AUTO_RESUMES`), review launch, stall + pod-death detection, control-intent (cancel/retry/resume/restart), and the Persistent Agent turn cycle. Schema: `control_intent`, `reconcile_backoff_until`, `reconcile_attempts` columns on `tasks`, `workflow_runs`, and `persistent_agents`. See `docs/reconciliation.md`
- Task dependencies: `task_dependencies` table for multi-step pipelines
- Cost tracking: `GET /api/analytics/costs` with daily/repo/type breakdowns, UI at `/costs`
- Error classification: `packages/shared/src/error-classifier.ts` pattern-matches errors into categories with remedies
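The pattern-matching classifier can be sketched as below. The shape (ordered regex patterns, first match wins, unknown fallback) follows the description above, but the specific patterns, categories, and remedies here are invented for illustration — read `error-classifier.ts` for the real table.

```typescript
// Illustrative sketch of pattern-matched error classification.
// Patterns/categories/remedies are hypothetical examples, not Optio's real table.
const patterns: Array<{ re: RegExp; category: string; remedy: string }> = [
  { re: /ENOSPC|no space left/i, category: "disk", remedy: "increase PVC size" },
  { re: /OOMKilled|out of memory/i, category: "memory", remedy: "raise pod memory limits" },
];

function classify(message: string): { category: string; remedy: string } {
  // First matching pattern wins; unmatched errors fall back to "unknown".
  const hit = patterns.find((p) => p.re.test(message));
  return hit
    ? { category: hit.category, remedy: hit.remedy }
    : { category: "unknown", remedy: "inspect logs" };
}
```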
| Layer | Technology | Notes |
|---|---|---|
| Monorepo | Turborepo + pnpm 10 | 6 packages, workspace protocol |
| API | Fastify 5 | Plugins, schema validation, WebSocket |
| ORM | Drizzle | PostgreSQL, migrations in apps/api/src/db/migrations/ |
| Queue | BullMQ + Redis | Also used for pub/sub (log streaming to WebSocket clients) |
| Web | Next.js 15 App Router | Tailwind CSS v4, Zustand, Lucide icons, sonner toasts, Recharts |
| K8s client | @kubernetes/client-node | Pod lifecycle, exec, log streaming, metrics |
| Validation | Zod | API request schemas |
| Testing | Vitest | Test files across shared + api |
| CI | GitHub Actions | Format, typecheck, test, build-web, build-image |
| Deploy | Helm | Chart at helm/optio/, local dev via setup-local.sh |
| Hooks | Husky + lint-staged + commitlint | Pre-commit: lint-staged + format + typecheck. Commit-msg: conventional commits |
# Setup (first time — builds everything, deploys to local k8s via Helm)
./scripts/setup-local.sh
# Update (pull + rebuild + redeploy)
./scripts/update-local.sh
# Manual rebuild + redeploy
docker build -t optio-api:latest -f Dockerfile.api .
docker build -t optio-web:latest -f Dockerfile.web .
kubectl rollout restart deployment/optio-api deployment/optio-web -n optio
# Quality (these are what CI runs, and pre-commit hooks mirror them)
pnpm format:check # Check formatting (Prettier)
pnpm turbo typecheck # Typecheck all 6 packages
pnpm turbo test # Run tests (Vitest)
cd apps/web && npx next build # Verify production build
# Database
cd apps/api && npx drizzle-kit generate # Generate migration after schema change
cd apps/api && npx tsx src/db/migrate.ts # Apply migrations (standalone runner)
bash scripts/check-migration-prefixes.sh # Check for duplicate prefixes
# Agent images
./images/build.sh # Build all presets (base, node, python, go, rust, full)
# Helm
helm lint helm/optio --set encryption.key=test
helm upgrade optio helm/optio -n optio --reuse-values
# Teardown
helm uninstall optio -n optio

- ESM everywhere: all packages use `"type": "module"` with `.js` extensions in imports (TypeScript resolves them to `.ts`)
- Conventional commits: enforced by commitlint (e.g., `feat:`, `fix:`, `refactor:`)
- Pre-commit hooks: lint-staged (eslint + prettier), then `pnpm format:check` and `pnpm turbo typecheck`
- Tailwind CSS v4: `@import "tailwindcss"` + `@theme` block in CSS, no `tailwind.config` file
- Drizzle ORM: schema in `apps/api/src/db/schema.ts`, run `drizzle-kit generate` after changes. New migrations use unix-timestamp prefixes (`migrations.prefix: "unix"` in `drizzle.config.ts`). Existing `00xx_*` files are frozen — never rename them
- Zustand: use `useStore.getState()` in callbacks/effects, not hook selectors (avoids infinite re-renders)
- Next.js webpack: `extensionAlias` in `next.config.ts` resolves `.js` → `.ts` for workspace packages
- State transitions: always go through `taskService.transitionTask()` — validates, updates DB, records event, publishes WebSocket
- Secrets: never log or return secret values. Encrypted at rest with AES-256-GCM
- Cost tracking: stored as string (`costUsd`) to avoid float precision issues
- K8s RBAC: namespace-scoped Role (pods, exec, secrets, PVCs) + ClusterRole (nodes, namespaces, metrics)
Key values.yaml settings:
- Image defaults point to GHCR (`ghcr.io/jonwiggins/optio-*`). Set `agent.image.prefix` to `optio-` for local dev
- `postgresql.enabled` / `redis.enabled` — set to `false` and use `externalDatabase.url` / `externalRedis.url` for managed services
- `encryption.key` — required, generate with `openssl rand -hex 32`
- `serviceAccount.name` / `serviceAccount.annotations` — used by API/web pods (K8s API access) and agent pods (workload identity). Example for GKE: `iam.gke.io/gcp-service-account: optio@PROJECT_ID.iam.gserviceaccount.com`
- Local dev overrides in `helm/optio/values.local.yaml` (`setup-local.sh` applies automatically)
Pod won't start: check kubectl get pods -n optio, verify agent image exists (docker images | grep optio-agent), check OPTIO_IMAGE_PULL_POLICY=Never for local images.
Auth errors: verify CLAUDE_AUTH_MODE secret, check ANTHROPIC_API_KEY or CLAUDE_CODE_OAUTH_TOKEN exists, check GET /api/auth/status.
Tasks stuck in queued: check concurrency limits (OPTIO_MAX_CONCURRENT, per-repo maxConcurrentTasks), look for stuck provisioning/running tasks.
WebSocket drops: ensure Redis is running, check REDIS_URL and INTERNAL_API_URL config.
Pod OOM/crash: check pod_health_events, increase resource limits. Cleanup worker auto-detects and fails associated tasks.
OAuth login fails: verify PUBLIC_URL matches deployment URL, check provider callback URLs are registered.
Migration errors: migrations auto-run on startup. Historical duplicate prefixes (0016, 0018, 0019, 0026, 0039, 0042) are allowlisted. New migrations use unix-timestamp prefixes.
Repo init timeout: large repos may exceed 120s default. Increase OPTIO_REPO_INIT_TIMEOUT_MS.
- Generate encryption key: `openssl rand -hex 32`
- Configure at least one OAuth provider (`*_CLIENT_ID` + `*_CLIENT_SECRET`)
- Ensure `OPTIO_AUTH_DISABLED` is NOT set
- Use managed PostgreSQL/Redis (`externalDatabase.url`, `externalRedis.url`)
- Set `PUBLIC_URL` to the actual deployment URL
- Enable ingress with TLS
- Set `GITHUB_TOKEN` secret for PR watching, issue sync, repo detection
- Install `metrics-server` in cluster
- Workspace RBAC roles are in schema but not fully enforced in all routes
- API container runs via `tsx` rather than compiled JS (workspace packages export `./src/index.ts`)
- OAuth tokens from `claude setup-token` have limited scopes vs Keychain-extracted tokens