A small Go backend for an agent-native fine-tuning ledger.
Phase 0 is intentionally narrow: it targets a single machine with 2x RTX 3090 GPUs and runs one-GPU LoRA fine-tuning jobs from validated presets. The goal is not to expose a generic job runner; the goal is to make training runs submittable, inspectable, and reproducible by AI agents.
See SPEC.md for the full MVP contract.
CLAUDE.md is the canonical agent instruction file for this repository. Agents
that start from AGENTS.md should treat it as a pointer to CLAUDE.md and then
follow the same shared rules.
Use .claude/skills/README.md for the available project workflows and skills.
- Accept declarative run drafts built from preset refs and option parameters
- Validate presets and option policies before consuming queue or GPU capacity
- Persist every run in a local SQLite ledger
- Execute at most two single-GPU runs concurrently
- Keep Docker behind an agent-side workload backend
- Preserve logs, config, metrics, and artifacts for every terminal run
- Make failures machine-readable through explicit run states and failure_reason
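
As a rough illustration of the last goal, the sketch below models explicit run states and a machine-readable failure_reason. The state names, the `Run` struct, and the `oom_killed` reason are assumptions made for this example; SPEC.md remains the authority on the actual states and fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RunState is a hypothetical enumeration of run lifecycle states.
type RunState string

const (
	StateQueued    RunState = "queued"
	StateRunning   RunState = "running"
	StateSucceeded RunState = "succeeded"
	StateFailed    RunState = "failed"
)

// IsTerminal reports whether no further transitions are expected.
func (s RunState) IsTerminal() bool {
	return s == StateSucceeded || s == StateFailed
}

// Run is an illustrative ledger record; the field names are assumptions.
type Run struct {
	ID            string   `json:"id"`
	State         RunState `json:"state"`
	FailureReason string   `json:"failure_reason,omitempty"`
}

// Fail moves a non-terminal run to the failed state and records
// a machine-readable reason. Terminal runs are left untouched.
func (r *Run) Fail(reason string) error {
	if r.State.IsTerminal() {
		return fmt.Errorf("run %s is already terminal (%s)", r.ID, r.State)
	}
	r.State = StateFailed
	r.FailureReason = reason
	return nil
}

func main() {
	r := &Run{ID: "run-1", State: StateRunning}
	if err := r.Fail("oom_killed"); err != nil {
		fmt.Println("error:", err)
		return
	}
	out, err := json.Marshal(r)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Keeping the reason as a short stable string, rather than free-form text, lets an agent branch on it without parsing log output.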
AI agents are the primary consumer. Responses should be machine-readable first:
structured JSON envelopes, endpoint-specific data payloads, explicit statuses,
stable error.code values, and clear next-step hints where useful.
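
One possible shape for such an envelope is sketched below. The field names (`status`, `data`, `error`, `next_step`), the helper functions, and the `preset_not_found` code are hypothetical and only illustrate the idea of a stable, machine-readable response; they are not taken from SPEC.md.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// APIError carries a stable code that agents can branch on.
type APIError struct {
	Code    string `json:"code"`
	Message string `json:"message"`
}

// Envelope is a hypothetical response wrapper shared by all endpoints.
type Envelope struct {
	Status   string          `json:"status"`
	Data     json.RawMessage `json:"data,omitempty"`
	Error    *APIError       `json:"error,omitempty"`
	NextStep string          `json:"next_step,omitempty"`
}

// OK wraps an endpoint-specific payload in a success envelope.
func OK(data any) (Envelope, error) {
	raw, err := json.Marshal(data)
	if err != nil {
		return Envelope{}, err
	}
	return Envelope{Status: "ok", Data: raw}, nil
}

// Fail builds an error envelope with an optional next-step hint.
func Fail(code, message, nextStep string) Envelope {
	return Envelope{
		Status:   "error",
		Error:    &APIError{Code: code, Message: message},
		NextStep: nextStep,
	}
}

// Render serializes an envelope to a JSON string.
func Render(e Envelope) string {
	b, err := json.Marshal(e)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	ok, err := OK(map[string]string{"run_id": "run-1"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(Render(ok))
	fmt.Println(Render(Fail("preset_not_found", "unknown preset", "list presets")))
}
```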
Long-running operations expose pollable resources. For Phase 0, logs use cursor-based polling rather than WebSockets.
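
Cursor-based polling can be sketched as follows. This is a minimal in-memory model, assuming a numeric line-offset cursor; the `LogBuffer` and `LogPage` types and their fields are invented for illustration, and the real API may use a different cursor encoding.

```go
package main

import "fmt"

// LogPage is a hypothetical response shape for one poll.
type LogPage struct {
	Lines      []string
	NextCursor int  // pass this value as the cursor of the next poll
	Done       bool // true once the run is finished and all lines were read
}

// LogBuffer holds log lines in memory for the sketch.
type LogBuffer struct {
	lines    []string
	finished bool
}

// Read returns up to limit lines starting at cursor.
// Out-of-range cursors and non-positive limits are clamped.
func (b *LogBuffer) Read(cursor, limit int) LogPage {
	if cursor < 0 {
		cursor = 0
	}
	if cursor > len(b.lines) {
		cursor = len(b.lines)
	}
	if limit < 1 {
		limit = 1
	}
	end := cursor + limit
	if end > len(b.lines) {
		end = len(b.lines)
	}
	return LogPage{
		Lines:      append([]string(nil), b.lines[cursor:end]...),
		NextCursor: end,
		Done:       b.finished && end == len(b.lines),
	}
}

func main() {
	buf := &LogBuffer{
		lines:    []string{"step 1", "step 2", "step 3"},
		finished: true,
	}
	cursor := 0
	for {
		page := buf.Read(cursor, 2)
		for _, line := range page.Lines {
			fmt.Println(line)
		}
		cursor = page.NextCursor
		if page.Done {
			break
		}
		// A real client would wait between polls here.
	}
}
```

Because the client owns the cursor, a poll can be retried safely after a dropped connection: repeating a request with the same cursor returns the same lines.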
RunDraft
-> API preflight validation
-> preset registry / spec builder
-> immutable spec.Spec
-> SQLite run ledger
-> ScheduleCoordinator
-> WorkloadProvisioner / GPU claim
-> WorkloadPlan
-> WorkloadLauncher
-> DockerWorkloadBackend
-> local artifact store
| Component | Role |
|---|---|
| HTTP API | Submit and inspect runs, logs, and artifacts |
| Spec builder | Resolve preset refs, validate option parameters, and finalize immutable specs |
| ScheduleCoordinator | Own run lifecycle transitions and terminal reconciliation |
| WorkloadProvisioner | FIFO scheduling, 2-GPU assignment, and workload plan construction |
| WorkloadLauncher | Manager-side port for prepare/start/cleanup calls |
| DockerWorkloadBackend | Agent-side Docker container materialization and observation |
| SQLite | Durable source of truth for projects, runs, and artifact metadata |
| Local artifact store | Stores specs, resolved configs, logs, metrics, reports, adapters |
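
To make the WorkloadProvisioner row more concrete, here is a minimal sketch of FIFO scheduling over two GPU slots. The `Provisioner` type and its methods are hypothetical and ignore persistence, locking, and workload plan construction, which the actual component would need to handle.

```go
package main

import "fmt"

// Provisioner is an illustrative FIFO scheduler over two GPU slots.
type Provisioner struct {
	queue []string  // run IDs in arrival order
	gpus  [2]string // gpus[i] holds the run ID using GPU i, or "" if free
}

// Enqueue appends a run to the back of the queue.
func (p *Provisioner) Enqueue(runID string) {
	p.queue = append(p.queue, runID)
}

// Assign gives each free GPU to the oldest queued run and
// returns the new assignments as run ID -> GPU index.
func (p *Provisioner) Assign() map[string]int {
	assigned := make(map[string]int)
	for i := range p.gpus {
		if p.gpus[i] != "" || len(p.queue) == 0 {
			continue
		}
		run := p.queue[0]
		p.queue = p.queue[1:]
		p.gpus[i] = run
		assigned[run] = i
	}
	return assigned
}

// Release frees the GPU held by runID and reports whether one was held.
func (p *Provisioner) Release(runID string) bool {
	for i := range p.gpus {
		if p.gpus[i] == runID {
			p.gpus[i] = ""
			return true
		}
	}
	return false
}

func main() {
	p := &Provisioner{}
	for _, id := range []string{"run-a", "run-b", "run-c"} {
		p.Enqueue(id)
	}
	fmt.Println(p.Assign()) // run-a and run-b start; run-c waits
	p.Release("run-a")
	fmt.Println(p.Assign()) // run-c takes the freed GPU
}
```

With one run per GPU and a fixed slot array, the limit of two concurrent single-GPU runs falls out of the data structure rather than needing a separate counter.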
- Language: Go
- External API: HTTP + JSON REST
- Database: SQLite for Phase 0
- Workload substrate: Agent-side Docker backend behind REST/HTTP and a Go port
- Storage: Local filesystem artifact store
Postgres, Redis/Valkey hints, alternate manager-agent transports, multi-node scheduling, and richer cancellation/orphan cleanup semantics are future architecture directions, not Phase 0 requirements.
- Multi-tenant quota or policy enforcement
- Distributed training
- Kubernetes-native integration
- Real-time serving orchestration
- Web UI or dashboard
- Advanced scheduling or bin-packing
- Webhook or notification system
- W&B SaaS integration
- Cancel API implementation, deferred to Phase 2
├── CLAUDE.md # Canonical AI agent guidelines
├── AGENTS.md # Pointer for agents that read AGENTS.md first
├── cmd/ # Binary entry points
├── internal/ # Private packages
├── docs/ # Design, education, and learning notes
├── SPEC.md # Phase 0 MVP specification
└── Makefile # Build, test, lint, fmt targets
TBD