A Plan-and-Execute multi-agent orchestrator written as the smallest legible version that's still production-shaped. Type a question, the planner builds a DAG, specialists run in parallel, you get a synthesis with cost transparency.
~820 LoC kernel · 106 tests · 97% local coverage on src/core/*.ts · $0.05-0.12 per real-research run (Anthropic web_search_20250305 pricing as of 2026-05-09) · TypeScript · MIT
A sextant fixes position by combining sightings from multiple angles. The system is shaped the same way: the planner picks the angles, specialists take their sightings in parallel, the synthesis fixes the position. A judge that scores the fix and a memory that carries the chart forward land in Phases 4-5.
A reference implementation of the Plan-and-Execute pattern in four layers:
- Orchestrator - planner produces a DAG, kernel runs nodes in parallel under a concurrency cap, bounded replanner adapts when a precondition fails.
- Specialists - retrieval (BM25 + vector), web/API, synthesis. Pluggable adapters; no framework lock.
- Memory - working KV per run, vector store for embeddings, episodic store for full (goal, plan, trace, outcome) tuples.
- Observability - OpenTelemetry spans, Langfuse exporter env-gated, LLM-as-Judge harness with a frozen rubric.
Phase 1 (kernel) and Phase 2 (Anthropic provider + planner + working demo with real Anthropic web_search) are shipped on main. Phases 3-5 are gated.
- Not a framework. The kernel is ~820 LoC. Read it, fork it, replace any layer.
- Not a LangGraph replacement. LangGraph is battle-tested with parallel execution, persistence, and conditional edges. If you're already on it, stay there.
- Not a hosted product. Run it where your code already runs.
- Not Python. The provider abstraction is portable, but this repo is TS-first. Python port is on the table if there's pull for it.
pnpm install
cp .env.example .env # add your ANTHROPIC_API_KEY
pnpm demo:basic # default goal, real Anthropic web search
pnpm demo:basic "your question here" # ask anything; planner builds the DAG live
SEXTANT_FIXTURE_SEARCH=1 pnpm demo:basic # offline fixture mode for CI / no-network demos
pnpm demo:rag # 4-node DAG with hybrid retrieval (Phase 3, gated)
pnpm eval # full eval harness with LLM-as-Judge (Phase 4, gated)

Phase 2 (DAG kernel + planner + executor + working demo with real web search) is shipped. The default demo hits the Anthropic native web_search server tool, so any goal you pass becomes a real research run. Cost is roughly $0.05 for simple single-search queries, $0.12 for multi-angle comparisons. Phase 3+ (adaptive replanning + retrieval + observability + eval) is gated on inbound signal. See the build plan below.
The kernel and adapters are exported as a library. ~20 lines gets you a working multi-agent research run:
import {
run,
createAnthropicProvider,
createAnthropicWebSearch,
createSynthSummarize,
webSearchToolSchema,
synthSummarizeToolSchema,
} from '@bambushu/sextant';
const provider = createAnthropicProvider();
const search = createAnthropicWebSearch();
const synth = createSynthSummarize({ provider });
const result = await run('How do agentic orchestration frameworks compare in 2026?', {
provider,
tools: { 'web.search': search, 'synth.summarize': synth },
toolSchemas: [webSearchToolSchema, synthSummarizeToolSchema],
onPlan: ({ plan }) => console.log(`Plan: ${plan.nodes.length} nodes`),
});
const finalNodeId = result.plan.nodes.at(-1)!.id;
console.log(result.outputs.get(finalNodeId));
console.log(`Cost: $${(result.costSpentUsd + result.plannerCostUsd).toFixed(4)}`);

Bring your own tools by adding entries to tools and toolSchemas. The planner reads the schema descriptions and picks them when a goal calls for it. The kernel handles concurrency, abort signals, retries, cost accounting, and (in Phase 3) bounded replanning.
Currently install from GitHub:
pnpm add github:Bambushu/sextant. npm publish lands at v0.3.
Click to expand a recorded run of pnpm demo:basic
=== Sextant ===
Goal: How do recent agentic-AI orchestration frameworks compare in 2026?
Mode: live web search (Anthropic web_search_20250305, ~$0.01 per search call)
Cost cap: $0.20 (run aborts if exceeded)
[planning]
claude-sonnet-4-6 -> 4-node DAG ($0.006339, 1 attempt)
search1 (web.search)
search2 (web.search)
search3 (web.search)
summarize (synth.summarize) <- [search1, search2, search3]
[executing]
ok search2 11.6s $0.0358
ok search1 11.7s $0.0338
ok search3 13.1s $0.0349
ok summarize 5.3s $0.0059
[result]
# Agentic-AI Orchestration Frameworks in 2026: Key Comparison
Three frameworks dominate 2026 deployments: LangGraph, CrewAI, and AutoGen
(search2, search1). LangGraph uses graph-based state management for explicit
workflow control, excelling at cyclical and complex routing scenarios. CrewAI
emphasizes role-based agent orchestration with predefined collaboration
patterns... [full synthesis at examples/recorded-run.md]
[summary]
Status: succeeded
Wall clock: 23.4s
Planner cost: $0.006339
Specialist cost: $0.110326
Total cost: $0.116665 (under cap)
Captured 2026-05-09 against Claude Sonnet 4.6 (planner) + Haiku 4.5 (specialists) with the native web_search_20250305 server tool. Real research, real synthesis, real $0.116665 cost for a 3-search comparison. Sonnet read "compare" in the goal and fanned out into three parallel web.search nodes; the kernel held summarize back until all three resolved, then handed their outputs through working memory. The synthesis cites upstream node ids inline ((search2, search1)), surfacing how downstream specialists name and read upstream results.
Full trace + cost breakdown at examples/recorded-run.md. Run with your own goal: pnpm demo:basic "your question here".
flowchart TB
user([User goal]) --> planner
subgraph L1[Layer 1 - Orchestrator]
planner[Planner Agent]
kernel[DAG Kernel]
replanner[Replanner Hook]
planner --> kernel
kernel --> replanner
replanner -->|assumption broken| planner
end
subgraph L2[Layer 2 - Specialist Agents]
retrieval[Retrieval Agent<br/>BM25 + vector + RRF]
web[Web/API Agent]
synthesis[Synthesis Agent<br/>conflict resolution]
end
kernel -->|dispatch ready node| retrieval
kernel -->|dispatch ready node| web
kernel -->|dispatch ready node| synthesis
retrieval --> working
web --> working
synthesis --> working
subgraph L3[Layer 3 - Memory]
working[Working Memory<br/>run-scoped KV]
vector[(Vector Store<br/>pgvector default)]
episodic[(Episodic Store<br/>run history)]
retrieval -.read.-> vector
kernel -.write.-> episodic
end
working --> replanner
subgraph L4[Layer 4 - Observability and Eval]
otel[OpenTelemetry Spans]
langfuse[Langfuse Exporter<br/>env-gated]
judge[LLM-as-Judge]
otel --> langfuse
synthesis -.span.-> otel
retrieval -.span.-> otel
web -.span.-> otel
planner -.span.-> otel
replanner -.span.-> otel
end
synthesis --> output([Final answer + trace])
output --> judge
judge --> report[(Eval Report)]
classDef agent fill:#2c3a4a,stroke:#5a8,color:#fff
classDef store fill:#3a2c2c,stroke:#a85,color:#fff
classDef obs fill:#2c3a3a,stroke:#5aa,color:#fff
class planner,kernel,replanner,retrieval,web,synthesis,judge agent
class vector,episodic,report store
class otel,langfuse,working obs
GitHub renders the diagram above natively. The same source, plus a styled rendering pipeline for non-GitHub viewers, lives at docs/architecture.mmd.
Three patterns dominate senior agentic briefs.
| Pattern | Strength | Failure mode |
|---|---|---|
| ReAct (single loop) | Simple. Good for short tasks. | Public reports describe coherence loss past 5-7 steps. We have not benchmarked this ourselves. |
| Plan-and-Execute | Predictable cost. Parallelism is straightforward. | Brittle if the world changes mid-run. |
| LangGraph state-graph | Most flexible. Battle-tested. Parallel execution and checkpointing built in. | Framework-coupled. Orchestration logic is harder to read end-to-end. |
Sextant is Plan-and-Execute with bounded adaptive replanning. The planner produces a DAG. The kernel runs nodes in parallel up to a configured concurrency limit, in topological order. After each completed node, a replan hook checks downstream Zod preconditions against the new state. If a precondition fails, the planner is re-invoked with the state diff and the failed node id, subject to a per-run replan cap.
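The post-node precondition check can be sketched in a few lines. This is a dependency-free sketch with hypothetical RunState and PlanNode shapes (the real kernel uses Zod predicates over RunState), not the kernel's actual API:

```typescript
// Hypothetical shapes; the real kernel uses Zod predicates over RunState.
type RunState = Map<string, unknown>;
type Precondition = (state: RunState) => boolean;

interface PlanNode {
  id: string;
  dependsOn: string[];
  precondition?: Precondition;
}

// After a node completes, re-evaluate downstream preconditions against the
// updated state. The ids returned are what the caller hands to the planner
// (together with the state diff), subject to the per-run replan cap.
function failedDownstream(
  completedId: string,
  nodes: PlanNode[],
  state: RunState,
): string[] {
  return nodes
    .filter((n) => n.dependsOn.includes(completedId))
    .filter((n) => n.precondition !== undefined && !n.precondition(state))
    .map((n) => n.id);
}
```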
We are not claiming Sextant beats LangGraph at every workload. We are claiming it sits at a legible middle for teams who would rather own ~820 LoC of orchestration code than depend on a framework.
| If you want | Use | What Sextant offers instead |
|---|---|---|
| Web research from a chat UI | Claude.ai with web_search, Perplexity | A library you embed in your own app, with cost transparency, fan-out parallelism, and a typed kernel you can extend |
| Production multi-agent orchestration with checkpointing, persistence, conditional edges | LangGraph | A smaller, readable kernel for teams that would rather own ~820 LoC than depend on a framework |
| Role-based agent teams with structured handoffs | CrewAI | A DAG-first model where the planner picks the team for each goal, instead of you defining roles up front |
| Token-by-token reasoning with tool use | Anthropic computer use, OpenAI Agents SDK | A planner that commits to a graph up front, so you can budget cost and parallelize before execution |
| A polished framework with a community and a roadmap | LangGraph, CrewAI, Mastra | Sextant is a reference primitive, not a framework. Fork-and-modify is the intended consumption pattern |
This isn't a "best framework" claim. It's a positioning: Sextant is for teams who want the smallest readable Plan-and-Execute primitive that's still production-shaped, with provider portability and pluggable specialists.
The pieces below are what Phase 1 implements. They exist to keep the architecture from being hand-waved.
| Concern | Spec |
|---|---|
| Node schema | { id, tool, inputs (Zod), outputs (Zod), preconditions (Zod predicate over RunState), maxRetries, timeoutMs } |
| Replan trigger | Post-node hook re-evaluates downstream preconditions against updated RunState. Failed precondition fires the planner with state diff + failed node id. |
| Replan bound | maxReplans per run (default 3). maxReplansPerNode per node id. Run fails with ReplanExhausted after the cap. |
| Concurrency | Ready set runs in parallel up to concurrencyLimit (default 4). Slow nodes don't block siblings. |
| Backoff | Per-node retry with exponential backoff. Retries are separate from replans. |
| Cost guard | Per-run token budget enforced at the LLM layer. Aborts on exceed. Default ~$1.00 USD-equivalent. |
| Cancellation | AbortSignal plumbed through all nodes and tools. |
| Trace redaction | OTel spans redact tool inputs by default; verbose tracing is opt-in. |
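A minimal TypeScript rendering of the node spec from the table above, with the Zod schemas swapped for plain parse functions to keep the sketch dependency-free. Field names follow the spec; the concrete node and its parse logic are illustrative:

```typescript
// Sketch of the node spec; Zod parse calls replaced with plain functions.
interface NodeSpec<I, O> {
  id: string;
  tool: string;
  inputs: (raw: unknown) => I;   // Zod schema parse in the real kernel
  outputs: (raw: unknown) => O;
  preconditions?: (state: ReadonlyMap<string, unknown>) => boolean;
  maxRetries: number;
  timeoutMs: number;
}

// Illustrative node: a web search whose input must carry a string query.
const searchNode: NodeSpec<{ query: string }, { text: string }> = {
  id: "search1",
  tool: "web.search",
  inputs: (raw) => {
    const r = raw as { query?: unknown };
    if (typeof r?.query !== "string") throw new Error("query must be a string");
    return { query: r.query };
  },
  outputs: (raw) => ({ text: String((raw as { text?: unknown }).text ?? "") }),
  maxRetries: 2,
  timeoutMs: 30_000,
};
```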
- Replan thrash if a node's precondition keeps failing. Caps and tests guard against this; pathological tools can still hit the cap.
- This is currently TypeScript only. If your team's stack is LangChain Python, Sextant will be a poor fit; a Python port is open as a follow-up.
- Phase 1 + Phase 2 are shipped and live-verified. Real research demo runs at ~$0.05 (single-search queries) to ~$0.12 (3-search comparisons) per goal. Numbers for Phase 3-5 are still targets until each phase ships.
The DAG kernel is pure functions plus an execute loop. Plan validation, topological sort, parallel ready-set computation (capped by concurrencyLimit), and run-state transitions are all separable. The replanner hook is opt-in: pass { replan: true, maxReplans: 3 } to execute() and any post-node Zod precondition that fails will trigger a bounded re-plan with the state diff and the failed node id.
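The ready-set computation described above can be sketched as a pure function, assuming simplified node and state shapes (illustrative names, not the kernel's exports):

```typescript
interface DagNode { id: string; dependsOn: string[]; }

// A node is ready when all its dependencies are done and it is neither
// done nor running; the cap limits how many new nodes we hand out.
function readySet(
  nodes: DagNode[],
  done: Set<string>,
  running: Set<string>,
  concurrencyLimit: number,
): string[] {
  const slots = concurrencyLimit - running.size;
  if (slots <= 0) return [];
  return nodes
    .filter((n) => !done.has(n.id) && !running.has(n.id))
    .filter((n) => n.dependsOn.every((d) => done.has(d)))
    .slice(0, slots)
    .map((n) => n.id);
}
```

On the recorded demo's shape (three independent searches feeding one summarize), this dispatches the three searches in parallel and holds summarize back until all three are done.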
Each agent is a Tool the kernel can dispatch. The default specialists:
- Retrieval: pgvector for vector, an in-memory BM25 (lunr or similar) for keyword, Reciprocal Rank Fusion for the merge. Adapter interface so LanceDB, Qdrant, Pinecone, or OpenSearch slot in without touching the agent.
- Web/API: provider-agnostic adapter. First-party fixtures so demos run without network.
- Synthesis: aggregates upstream node outputs. Conflict resolution is a heuristic chain: source priority → recency → consensus → flagged-disagreement.
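The Reciprocal Rank Fusion merge used by the retrieval specialist is small enough to sketch outright, using the common k = 60 constant (document ids here are illustrative):

```typescript
// RRF: each ranking contributes 1 / (k + rank) per document; summing
// across rankings rewards documents that rank well in multiple lists.
function rrfMerge(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((doc, rank) => {
      scores.set(doc, (scores.get(doc) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([doc]) => doc);
}
```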
Three tiers, each with a clear job:
- Working memory: run-scoped key-value store. Drives node-input resolution.
- Vector store: long-term embeddings. Read by retrieval, written by ingestion jobs.
- Episodic store: full (goal, plan, trace, outcome) tuples. Used for offline analysis and few-shot prompting.
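A dependency-free sketch of the working-memory tier driving node-input resolution; class and method names here are assumptions, not the repo's API:

```typescript
// Run-scoped KV: upstream outputs are stored under node ids and read back
// by downstream nodes when their inputs are resolved.
class WorkingMemory {
  private kv = new Map<string, unknown>();

  write(nodeId: string, output: unknown): void {
    this.kv.set(nodeId, output);
  }

  read(nodeId: string): unknown {
    if (!this.kv.has(nodeId)) throw new Error(`no output for node ${nodeId}`);
    return this.kv.get(nodeId);
  }

  // Resolve a downstream node's inputs from its declared dependencies.
  resolveInputs(dependsOn: string[]): Record<string, unknown> {
    return Object.fromEntries(dependsOn.map((id) => [id, this.read(id)]));
  }
}
```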
OpenTelemetry spans wrap each agent invocation. The Langfuse exporter is env-gated: no key set, no calls made. The eval harness takes a (goal, output, rubric) triple and runs an LLM-as-Judge with a frozen rubric covering faithfulness, completeness, conflict resolution, plan efficiency, latency, and cost. Frozen rubric means scores stay comparable across runs and across model versions.
| Concern | Default | Swap |
|---|---|---|
| LLM provider | Anthropic Claude (Sonnet 4.6 plan, Haiku 4.5 specialists) | Vercel AI SDK provider interface |
| Vector store | pgvector | LanceDB, Qdrant, Pinecone via adapter |
| BM25 | in-memory (lunr) | OpenSearch, Tantivy |
| Tracing | OpenTelemetry → Langfuse | Honeycomb, Phoenix, LangSmith |
| Eval | LLM-as-Judge with frozen rubric | RAGAS, custom |
Sextant ships in five phases. Phases 1 and 2 are shipped on main. Phases 3-5 are gated on inbound signal.
| Phase | Lands | Estimate | Status |
|---|---|---|---|
| 1 | DAG kernel + core types (concurrency, replan budget, cost guard, tests) | 8-12 h | shipped |
| 2 | Anthropic provider + planner + executor + working demo with real web search | 12-20 h | shipped |
| 3 | Bounded adaptive replanning + retrieval specialist | 20-28 h | gated |
| 4 | Observability + LLM-as-Judge harness | 16-24 h | gated |
| 5 | Polish, docs, full public launch | 10-16 h | gated |
Phases 3-5 are gated. They start only after either Crucible (a prior public artifact) or the Sextant Phase 1+2 stub captures at least one senior-tier inbound lead within 4 weeks of launch. If neither does, the public-artifact-funnel hypothesis is falsified and Sextant ships at v0.2 (kernel + Plan-and-Execute) only.
Phase 1 (DAG kernel + types + 56 tests) and Phase 2 (Anthropic provider + planner + executor + working demo with real web research) are shipped and merged on main. The demo accepts any goal as a CLI arg and runs live against the Anthropic native web_search server tool. Real measured cost: $0.05-0.12 per run depending on how many search nodes the planner picks. Phase 3 (adaptive replanning + retrieval) and Phase 4-5 (observability + LLM-as-Judge + public launch) are gated on inbound signal per the build plan above. Stars and issues welcome.
Why TypeScript and not Python? Most agentic AI work is Python. TS gives end-to-end typing across the planner contract, kernel state, and tool schemas (Zod). It also matches the deployment context for many senior briefs (Next.js, edge functions, Tauri, browser extensions). Python port is on the table if there's pull for it.
Why no LangChain? LangChain has its place; it's not the right fit when the goal is to read the orchestration code in one sitting. Sextant is the inverse choice: minimum legible code, opinionated kernel, swap any layer.
How much does it cost to run? ~$0.05 for a single-search query, ~$0.12 for a multi-angle comparison (3 parallel web_searches + synthesis). $10 of Anthropic credit covers ~85-200 runs depending on complexity. Fixture mode is free.
Can I use a different LLM provider?
The LlmProvider interface is the seam. OpenAI / Bedrock / Vercel AI SDK adapters land in Phase 3+. Today, only Anthropic is wired up; the kernel itself doesn't know about any provider.
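A hypothetical minimal shape for that seam, with a no-network fixture implementation in the spirit of SEXTANT_FIXTURE_SEARCH; the repo's real interface likely differs:

```typescript
// Assumed provider seam: one completion method that reports its own cost.
interface LlmProvider {
  complete(opts: { system: string; prompt: string }): Promise<{ text: string; costUsd: number }>;
}

// Fixture provider: deterministic, free, no network. Useful for CI.
const fixtureProvider: LlmProvider = {
  async complete({ prompt }) {
    return { text: `[fixture] ${prompt.slice(0, 40)}`, costUsd: 0 };
  },
};
```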
Is the kernel really only ~820 LoC?
820 LoC measured for src/core/{types,plan,state,kernel}.ts (run wc -l src/core/*.ts to verify). Add the planner + executor + agents + Anthropic provider and the full footprint is ~2000 LoC. The kernel is what you read when you want to understand "how does this actually run a DAG safely under concurrency, replans, aborts, and cost budgets."
How do I add my own tool?
Implement the Tool signature (call: ToolCall) => Promise<ToolResult>, write a Zod schema for inputs, and register it in tools + toolSchemas. The planner sees the schema description and picks it when relevant. See src/agents/web.ts and src/agents/synthesis.ts for the two existing patterns (pure function and LLM-calling).
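As a concrete illustration, here is a hypothetical pure-function tool in that shape; the ToolCall and ToolResult fields are assumptions, not the library's exact types:

```typescript
// Assumed shapes for the tool seam.
interface ToolCall {
  nodeId: string;
  inputs: Record<string, unknown>;
  signal?: AbortSignal;
}
interface ToolResult { output: unknown; costUsd: number; }

// A pure-function tool: counts words in its input text. No LLM call,
// so it reports zero cost to the kernel's cost accounting.
const wordCount = async (call: ToolCall): Promise<ToolResult> => {
  const text = String(call.inputs.text ?? "");
  const words = text.trim() === "" ? 0 : text.trim().split(/\s+/).length;
  return { output: { words }, costUsd: 0 };
};
```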
What happens if the planner generates an invalid plan?
The planner has a retry budget (default 3 attempts). On a parse or schema-validation failure, the previous output and the validator's error message are appended to the conversation so the model can self-correct. After the budget, PlannerExhausted surfaces with the last raw output for debugging.
What happens if a tool fails or times out mid-run?
Per-node retries with exponential backoff (capped at 5s). Aborts propagate via AbortSignal. After the retry budget, the kernel records state.failed for that node id; downstream nodes that depended on it stay unscheduled, and the run lands in state.status === 'failed' with state.terminalError populated. Phase 3 wires the replanner so a downstream Zod precondition failure can swap in a new plan instead.
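The backoff schedule itself is one line; in this sketch only the 5s cap comes from the text above, and the base delay is an assumption:

```typescript
// Exponential backoff per retry attempt, capped at 5s.
// attempt 0 -> base, attempt 1 -> 2x base, ... up to capMs.
function backoffMs(attempt: number, baseMs = 250, capMs = 5_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}
```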
Sextant inherits design patterns from a small constellation of private tools (Sanhedrin, KeurSmid, ContentSmid) and one public artifact (Crucible at github.com/Bambushu/crucible). The pattern reuse is genuine; the code is green-field. Where a file's structure mirrors a prior tool, the file header notes the source.
MIT. See LICENSE.
Built by Maikel Slomp. For senior-tier agentic-AI work and partnerships, reach out via mad-it.agency.