Sextant

A Plan-and-Execute multi-agent orchestrator written as the smallest legible version that's still production-shaped. Type a question, the planner builds a DAG, specialists run in parallel, you get a synthesis with cost transparency.

~820 LoC kernel · 106 tests · 97% local coverage on src/core/*.ts · $0.05-0.12 per real-research run (Anthropic web_search_20250305 pricing as of 2026-05-09) · TypeScript · MIT

A sextant fixes position by combining sightings from multiple angles. The system is shaped the same way: the planner picks the angles, specialists take their sightings in parallel, the synthesis fixes the position. A judge that scores the fix and a memory that carries the chart forward land in Phases 4-5.

What this is

A reference implementation of the Plan-and-Execute pattern in four layers:

  1. Orchestrator - planner produces a DAG, kernel runs nodes in parallel under a concurrency cap, bounded replanner adapts when a precondition fails.
  2. Specialists - retrieval (BM25 + vector), web/API, synthesis. Pluggable adapters; no framework lock.
  3. Memory - working KV per run, vector store for embeddings, episodic store for full (goal, plan, trace, outcome) tuples.
  4. Observability - OpenTelemetry spans, Langfuse exporter env-gated, LLM-as-Judge harness with a frozen rubric.

Phase 1 (kernel) and Phase 2 (Anthropic provider + planner + working demo with real Anthropic web_search) are shipped on main. Phases 3-5 are gated.

What this isn't

  • Not a framework. The kernel is ~820 LoC. Read it, fork it, replace any layer.
  • Not a LangGraph replacement. LangGraph is battle-tested with parallel execution, persistence, and conditional edges. If you're already on it, stay there.
  • Not a hosted product. Run it where your code already runs.
  • Not Python. The provider abstraction is portable, but this repo is TS-first. Python port is on the table if there's pull for it.

Quick start

pnpm install
cp .env.example .env  # add your ANTHROPIC_API_KEY
pnpm demo:basic                                # default goal, real Anthropic web search
pnpm demo:basic "your question here"           # ask anything; planner builds the DAG live
SEXTANT_FIXTURE_SEARCH=1 pnpm demo:basic       # offline fixture mode for CI / no-network demos
pnpm demo:rag                                  # 4-node DAG with hybrid retrieval (Phase 3, gated)
pnpm eval                                      # full eval harness with LLM-as-Judge (Phase 4, gated)

Phase 2 (DAG kernel + planner + executor + working demo with real web search) is shipped. The default demo hits the Anthropic native web_search server tool, so any goal you pass becomes a real research run. Cost is roughly $0.05 for simple single-search queries, $0.12 for multi-angle comparisons. Phase 3+ (adaptive replanning + retrieval + observability + eval) is gated on inbound signal. See the build plan below.

Use it from your code

The kernel and adapters are exported as a library. ~20 lines gets you a working multi-agent research run:

import {
  run,
  createAnthropicProvider,
  createAnthropicWebSearch,
  createSynthSummarize,
  webSearchToolSchema,
  synthSummarizeToolSchema,
} from '@bambushu/sextant';

const provider = createAnthropicProvider();
const search = createAnthropicWebSearch();
const synth = createSynthSummarize({ provider });

const result = await run('How do agentic orchestration frameworks compare in 2026?', {
  provider,
  tools: { 'web.search': search, 'synth.summarize': synth },
  toolSchemas: [webSearchToolSchema, synthSummarizeToolSchema],
  onPlan: ({ plan }) => console.log(`Plan: ${plan.nodes.length} nodes`),
});

const finalNodeId = result.plan.nodes.at(-1)!.id;
console.log(result.outputs.get(finalNodeId));
console.log(`Cost: $${(result.costSpentUsd + result.plannerCostUsd).toFixed(4)}`);

Bring your own tools by adding entries to tools and toolSchemas. The planner reads the schema descriptions and picks them when a goal calls for it. The kernel handles concurrency, abort signals, retries, cost accounting, and (in Phase 3) bounded replanning.

For now, install from GitHub: pnpm add github:Bambushu/sextant. An npm publish lands at v0.3.

What does it look like?

Recorded run of pnpm demo:basic:
=== Sextant ===
Goal: How do recent agentic-AI orchestration frameworks compare in 2026?
Mode: live web search (Anthropic web_search_20250305, ~$0.01 per search call)
Cost cap: $0.20 (run aborts if exceeded)

[planning]
  claude-sonnet-4-6 -> 4-node DAG ($0.006339, 1 attempt)
    search1      (web.search)
    search2      (web.search)
    search3      (web.search)
    summarize    (synth.summarize) <- [search1, search2, search3]

[executing]
  ok  search2      11.6s   $0.0358
  ok  search1      11.7s   $0.0338
  ok  search3      13.1s   $0.0349
  ok  summarize    5.3s    $0.0059

[result]

# Agentic-AI Orchestration Frameworks in 2026: Key Comparison

Three frameworks dominate 2026 deployments: LangGraph, CrewAI, and AutoGen
(search2, search1). LangGraph uses graph-based state management for explicit
workflow control, excelling at cyclical and complex routing scenarios. CrewAI
emphasizes role-based agent orchestration with predefined collaboration
patterns... [full synthesis at examples/recorded-run.md]

[summary]
  Status:           succeeded
  Wall clock:       23.4s
  Planner cost:     $0.006339
  Specialist cost:  $0.110326
  Total cost:       $0.116665  (under cap)

Captured 2026-05-09 against Claude Sonnet 4.6 (planner) + Haiku 4.5 (specialists) with the native web_search_20250305 server tool. Real research, real synthesis, real $0.116665 cost for a 3-search comparison. Sonnet read "compare" in the goal and fanned out into three parallel web.search nodes; the kernel held summarize back until all three resolved, then handed their outputs through working memory. The synthesis cites upstream node ids inline ((search2, search1)), surfacing how downstream specialists name and read upstream results.

Full trace + cost breakdown at examples/recorded-run.md. Run with your own goal: pnpm demo:basic "your question here".

Architecture

flowchart TB
    user([User goal]) --> planner

    subgraph L1[Layer 1 - Orchestrator]
        planner[Planner Agent]
        kernel[DAG Kernel]
        replanner[Replanner Hook]
        planner --> kernel
        kernel --> replanner
        replanner -->|assumption broken| planner
    end

    subgraph L2[Layer 2 - Specialist Agents]
        retrieval[Retrieval Agent<br/>BM25 + vector + RRF]
        web[Web/API Agent]
        synthesis[Synthesis Agent<br/>conflict resolution]
    end

    kernel -->|dispatch ready node| retrieval
    kernel -->|dispatch ready node| web
    kernel -->|dispatch ready node| synthesis

    retrieval --> working
    web --> working
    synthesis --> working

    subgraph L3[Layer 3 - Memory]
        working[Working Memory<br/>run-scoped KV]
        vector[(Vector Store<br/>pgvector default)]
        episodic[(Episodic Store<br/>run history)]
        retrieval -.read.-> vector
        kernel -.write.-> episodic
    end

    working --> replanner

    subgraph L4[Layer 4 - Observability and Eval]
        otel[OpenTelemetry Spans]
        langfuse[Langfuse Exporter<br/>env-gated]
        judge[LLM-as-Judge]
        otel --> langfuse
        synthesis -.span.-> otel
        retrieval -.span.-> otel
        web -.span.-> otel
        planner -.span.-> otel
        replanner -.span.-> otel
    end

    synthesis --> output([Final answer + trace])
    output --> judge
    judge --> report[(Eval Report)]

    classDef agent fill:#2c3a4a,stroke:#5a8,color:#fff
    classDef store fill:#3a2c2c,stroke:#a85,color:#fff
    classDef obs fill:#2c3a3a,stroke:#5aa,color:#fff
    class planner,kernel,replanner,retrieval,web,synthesis,judge agent
    class vector,episodic,report store
    class otel,langfuse,working obs

GitHub renders the diagram above natively. The same source, plus a styled rendering pipeline for non-GitHub viewers, lives at docs/architecture.mmd.

Why this pattern

Three patterns dominate senior agentic briefs.

| Pattern | Strength | Failure mode |
| --- | --- | --- |
| ReAct (single loop) | Simple. Good for short tasks. | Public reports describe coherence loss past 5-7 steps. We have not benchmarked this ourselves. |
| Plan-and-Execute | Predictable cost. Parallelism is straightforward. | Brittle if the world changes mid-run. |
| LangGraph state-graph | Most flexible. Battle-tested. Parallel execution and checkpointing built in. | Framework-coupled. Orchestration logic is harder to read end-to-end. |

Sextant is Plan-and-Execute with bounded adaptive replanning. The planner produces a DAG. The kernel runs nodes in parallel up to a configured concurrency limit, in topological order. After each completed node, a replan hook checks downstream Zod preconditions against the new state. If a precondition fails, the planner is re-invoked with the state diff and the failed node id, subject to a per-run replan cap.
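The replan check can be sketched in a few lines (hypothetical names; the Zod preconditions are simplified to plain predicate functions so the sketch stays self-contained):

```typescript
// Post-node replan check: after a node completes, re-evaluate the
// preconditions of its downstream nodes against the updated run state.
type RunState = Map<string, unknown>;

interface PlanNode {
  id: string;
  deps: string[];
  // True when the node's inputs are satisfiable against current state.
  precondition: (state: RunState) => boolean;
}

interface ReplanRequest {
  failedNodeId: string;
  stateKeys: string[]; // state-diff summary handed back to the planner
}

const MAX_REPLANS = 3;

function checkDownstream(
  completedId: string,
  nodes: PlanNode[],
  state: RunState,
  replansUsed: number,
): ReplanRequest | null {
  // Only nodes downstream of the one that just finished are re-checked.
  const downstream = nodes.filter((n) => n.deps.includes(completedId));
  for (const node of downstream) {
    if (!node.precondition(state)) {
      if (replansUsed >= MAX_REPLANS) {
        throw new Error('ReplanExhausted');
      }
      return { failedNodeId: node.id, stateKeys: [...state.keys()] };
    }
  }
  return null; // all preconditions hold; keep executing the current plan
}
```

When the check returns a ReplanRequest, the planner is re-invoked with that payload; when it returns null, execution continues on the existing plan.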

We are not claiming Sextant beats LangGraph at every workload. We are claiming it sits at a legible middle for teams who would rather own ~820 LoC of orchestration code than depend on a framework.

Compared to other tools

| If you want | Use | What Sextant offers instead |
| --- | --- | --- |
| Web research from a chat UI | Claude.ai with web_search, Perplexity | A library you embed in your own app, with cost transparency, fan-out parallelism, and a typed kernel you can extend |
| Production multi-agent orchestration with checkpointing, persistence, conditional edges | LangGraph | A smaller, readable kernel for teams that would rather own ~820 LoC than depend on a framework |
| Role-based agent teams with structured handoffs | CrewAI | A DAG-first model where the planner picks the team for each goal, instead of you defining roles up front |
| Token-by-token reasoning with tool use | Anthropic computer use, OpenAI Agents SDK | A planner that commits to a graph up front, so you can budget cost and parallelize before execution |
| A polished framework with a community and a roadmap | LangGraph, CrewAI, Mastra | Sextant is a reference primitive, not a framework. Fork-and-modify is the intended consumption pattern |

This isn't a "best framework" claim. It's a positioning: Sextant is for teams who want the smallest readable Plan-and-Execute primitive that's still production-shaped, with provider portability and pluggable specialists.

DAG kernel contract

The pieces below are the contract the Phase 1 kernel implements. They exist to keep the architecture from being hand-waved.

| Concern | Spec |
| --- | --- |
| Node schema | { id, tool, inputs (Zod), outputs (Zod), preconditions (Zod predicate over RunState), maxRetries, timeoutMs } |
| Replan trigger | Post-node hook re-evaluates downstream preconditions against updated RunState. A failed precondition fires the planner with the state diff + failed node id. |
| Replan bound | maxReplans per run (default 3). maxReplansPerNode per node id. The run fails with ReplanExhausted after the cap. |
| Concurrency | The ready set runs in parallel up to concurrencyLimit (default 4). Slow nodes don't block siblings. |
| Backoff | Per-node retry with exponential backoff. Retries are separate from replans. |
| Cost guard | Per-run token budget enforced at the LLM layer. Aborts on exceed. Default ~$1.00 USD-equivalent. |
| Cancellation | AbortSignal plumbed through all nodes and tools. |
| Trace redaction | OTel spans redact tool inputs by default; verbose tracing is opt-in. |
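The cost-guard and cancellation concerns compose naturally: a minimal sketch, assuming a guard that every LLM call reports into and an AbortController shared with the kernel (names hypothetical, not the repo's actual API):

```typescript
// Per-run cost guard: every LLM call reports its cost; the guard
// aborts the whole run via the shared AbortController once the
// budget is exceeded, and the signal propagates to in-flight nodes.
class CostGuard {
  private spentUsd = 0;

  constructor(
    private readonly budgetUsd: number,
    private readonly controller: AbortController,
  ) {}

  record(costUsd: number): void {
    this.spentUsd += costUsd;
    if (this.spentUsd > this.budgetUsd) {
      this.controller.abort(new Error('CostBudgetExceeded'));
    }
  }

  get spent(): number {
    return this.spentUsd;
  }
}
```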

Limitations and known risks

  • Replan thrash if a node's precondition keeps failing. Caps and tests guard against this; pathological tools can still hit the cap.
  • This is currently TypeScript only. If your team's stack is LangChain Python, Sextant will be a poor fit; a Python port is open as a follow-up.
  • Phase 1 + Phase 2 are shipped and live-verified. Real research demo runs at ~$0.05 (single-search queries) to ~$0.12 (3-search comparisons) per goal. Numbers for Phase 3-5 are still targets until each phase ships.

Layered components

Orchestrator

The DAG kernel is pure functions plus an execute loop. Plan validation, topological sort, parallel ready-set computation (capped by concurrencyLimit), and run-state transitions are all separable. The replanner hook is opt-in: pass { replan: true, maxReplans: 3 } to execute() and any post-node Zod precondition that fails will trigger a bounded re-plan with the state diff and the failed node id.
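The ready-set step can be sketched as a pure function (hypothetical names; the real kernel's types differ):

```typescript
// A node is ready when every dependency is done and it is neither
// done nor already running; the set is capped by the free slots
// under the concurrency limit.
interface DagNode {
  id: string;
  deps: string[];
}

function readySet(
  nodes: DagNode[],
  done: Set<string>,
  running: Set<string>,
  concurrencyLimit = 4,
): string[] {
  const slots = concurrencyLimit - running.size;
  if (slots <= 0) return [];
  return nodes
    .filter(
      (n) =>
        !done.has(n.id) &&
        !running.has(n.id) &&
        n.deps.every((d) => done.has(d)),
    )
    .map((n) => n.id)
    .slice(0, slots);
}
```

Because the function is pure over (nodes, done, running), it can be re-run after every node completion to schedule the next batch.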

Specialist agents

Each agent is a Tool the kernel can dispatch. The default specialists:

  • Retrieval: pgvector for vector, an in-memory BM25 (lunr or similar) for keyword, Reciprocal Rank Fusion for the merge. Adapter interface so LanceDB, Qdrant, Pinecone, or OpenSearch slot in without touching the agent.
  • Web/API: provider-agnostic adapter. First-party fixtures so demos run without network.
  • Synthesis: aggregates upstream node outputs. Conflict resolution is a heuristic chain: source priority → recency → consensus → flagged-disagreement.
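Reciprocal Rank Fusion itself is small enough to show inline: each document scores the sum of 1/(k + rank) over the ranked lists it appears in, with k = 60 as the conventional constant. This is the standard RRF formula, not code from the repo:

```typescript
// Merge several ranked doc-id lists (e.g. BM25 and vector results)
// into one ranking via Reciprocal Rank Fusion.
function rrfMerge(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((docId, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

A document near the top of both lists beats one that is first in only one of them, which is exactly the behavior you want from a keyword + vector hybrid.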

Memory

Three tiers, each with a clear job:

  • Working memory: run-scoped key-value store. Drives node-input resolution.
  • Vector store: long-term embeddings. Read by retrieval, written by ingestion jobs.
  • Episodic store: full (goal, plan, trace, outcome) tuples. Used for offline analysis and few-shot prompting.
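A working-memory sketch, assuming outputs are keyed by node id and downstream inputs are resolved from declared dependencies (names hypothetical):

```typescript
// Run-scoped KV: upstream nodes write their outputs under their node
// id; the kernel resolves a downstream node's inputs by reading its
// dependencies back out.
class WorkingMemory {
  private readonly store = new Map<string, unknown>();

  write(nodeId: string, output: unknown): void {
    this.store.set(nodeId, output);
  }

  resolveInputs(deps: string[]): Record<string, unknown> {
    const inputs: Record<string, unknown> = {};
    for (const dep of deps) {
      if (!this.store.has(dep)) {
        throw new Error(`missing upstream output: ${dep}`);
      }
      inputs[dep] = this.store.get(dep);
    }
    return inputs;
  }
}
```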

Observability + eval

OpenTelemetry spans wrap each agent invocation. The Langfuse exporter is env-gated: no key set, no calls made. The eval harness takes a (goal, output, rubric) triple and runs an LLM-as-Judge with a frozen rubric covering faithfulness, completeness, conflict resolution, plan efficiency, latency, and cost. Frozen rubric means scores stay comparable across runs and across model versions.
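Env-gating can be as simple as never constructing the exporter without a key. A sketch (the env variable name here is an assumption, not the repo's actual config):

```typescript
// With no key set, the factory returns null and the tracing layer
// falls back to a no-op, so no network calls are ever made.
interface SpanExporter {
  export(span: { name: string }): void;
}

function createExporter(
  env: Record<string, string | undefined>,
): SpanExporter | null {
  // LANGFUSE_SECRET_KEY is an assumed variable name for illustration.
  if (!env.LANGFUSE_SECRET_KEY) return null; // gated off
  return {
    export: () => {
      /* would ship the span to Langfuse here */
    },
  };
}
```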

Defaults and swap paths

| Concern | Default | Swap |
| --- | --- | --- |
| LLM provider | Anthropic Claude (Sonnet 4.6 plan, Haiku 4.5 specialists) | Vercel AI SDK provider interface |
| Vector store | pgvector | LanceDB, Qdrant, Pinecone via adapter |
| BM25 | in-memory (lunr) | OpenSearch, Tantivy |
| Tracing | OpenTelemetry → Langfuse | Honeycomb, Phoenix, LangSmith |
| Eval | LLM-as-Judge with frozen rubric | RAGAS, custom |

Build plan

Sextant ships in five phases. Phases 1 and 2 are shipped on main. Phases 3-5 are gated on inbound signal.

| Phase | Lands | Estimate | Status |
| --- | --- | --- | --- |
| 1 | DAG kernel + core types (concurrency, replan budget, cost guard, tests) | 8-12 h | shipped |
| 2 | Anthropic provider + planner + executor + working demo with real web search | 12-20 h | shipped |
| 3 | Bounded adaptive replanning + retrieval specialist | 20-28 h | gated |
| 4 | Observability + LLM-as-Judge harness | 16-24 h | gated |
| 5 | Polish, docs, full public launch | 10-16 h | gated |

Phases 3-5 are gated. They start only after either Crucible (a prior public artifact) or the Sextant Phase 1+2 stub captures at least one senior-tier inbound lead within 4 weeks of launch. If neither does, the public-artifact-funnel hypothesis is falsified and Sextant ships at v0.2 (kernel + Plan-and-Execute) only.

Status

Phase 1 (DAG kernel + types + 56 tests) and Phase 2 (Anthropic provider + planner + executor + working demo with real web research) are shipped and merged on main. The demo accepts any goal as a CLI arg and runs live against the Anthropic native web_search server tool. Real measured cost: $0.05-0.12 per run depending on how many search nodes the planner picks. Phase 3 (adaptive replanning + retrieval) and Phase 4-5 (observability + LLM-as-Judge + public launch) are gated on inbound signal per the build plan above. Stars and issues welcome.

FAQ

Why TypeScript and not Python? Most agentic AI work is Python. TS gives end-to-end typing across the planner contract, kernel state, and tool schemas (Zod). It also matches the deployment context for many senior briefs (Next.js, edge functions, Tauri, browser extensions). Python port is on the table if there's pull for it.

Why no LangChain? LangChain has its place; it's not the right fit when the goal is to read the orchestration code in one sitting. Sextant is the inverse choice: minimum legible code, opinionated kernel, swap any layer.

How much does it cost to run? ~$0.05 for a single-search query, ~$0.12 for a multi-angle comparison (3 parallel web_searches + synthesis). $10 of Anthropic credit covers ~85-200 runs depending on complexity. Fixture mode is free.

Can I use a different LLM provider? The LlmProvider interface is the seam. OpenAI / Bedrock / Vercel AI SDK adapters land in Phase 3+. Today, only Anthropic is wired up; the kernel itself doesn't know about any provider.

Is the kernel really only ~820 LoC? 820 LoC measured for src/core/{types,plan,state,kernel}.ts (run wc -l src/core/*.ts to verify). Add the planner + executor + agents + Anthropic provider and the full footprint is ~2000 LoC. The kernel is what you read when you want to understand "how does this actually run a DAG safely under concurrency, replans, aborts, and cost budgets."

How do I add my own tool? Implement the Tool signature (call: ToolCall) => Promise<ToolResult>, write a Zod schema for inputs, and register it in tools + toolSchemas. The planner sees the schema description and picks it when relevant. See src/agents/web.ts and src/agents/synthesis.ts for the two existing patterns (pure function and LLM-calling).
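A minimal sketch of such a tool, with the ToolCall/ToolResult shapes inferred from this README rather than copied from the source:

```typescript
// Assumed shapes for illustration; the real types live in the repo.
interface ToolCall {
  nodeId: string;
  inputs: Record<string, unknown>;
}

interface ToolResult {
  output: unknown;
  costUsd: number;
}

// A pure-function tool (no LLM call): word-counts whatever text it gets.
const wordCount = async (call: ToolCall): Promise<ToolResult> => {
  const text = String(call.inputs['text'] ?? '');
  const words = text.trim() === '' ? 0 : text.trim().split(/\s+/).length;
  return { output: { words }, costUsd: 0 };
};
```

Register it under a name like 'text.wordCount' in tools, give it a schema entry in toolSchemas, and the planner can schedule it like any built-in.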

What happens if the planner generates an invalid plan? The planner has a retry budget (default 3 attempts). On a parse or schema-validation failure, the previous output and the validator's error message are appended to the conversation so the model can self-correct. After the budget, PlannerExhausted surfaces with the last raw output for debugging.

What happens if a tool fails or times out mid-run? Per-node retries with exponential backoff (capped at 5s). Aborts propagate via AbortSignal. After the retry budget, the kernel records state.failed for that node id; downstream nodes that depended on it stay unscheduled, and the run lands in state.status === 'failed' with state.terminalError populated. Phase 3 wires the replanner so a downstream Zod precondition failure can swap in a new plan instead.
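The retry half of that behavior, sketched with the delays parameterized so it stays testable (names hypothetical):

```typescript
// Retry a node's tool call with exponential backoff, capped at a max
// delay; an aborted signal short-circuits before the next attempt.
async function withRetries<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  signal?: AbortSignal,
  baseDelayMs = 250,
  maxDelayMs = 5_000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    if (signal?.aborted) throw new Error('aborted');
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // retry budget exhausted
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```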

Acknowledgements

Sextant inherits design patterns from a small constellation of private tools (Sanhedrin, KeurSmid, ContentSmid) and one public artifact (Crucible at github.com/Bambushu/crucible). The pattern reuse is genuine; the code is green-field. Where a file's structure mirrors a prior tool, the file header notes the source.

License

MIT. See LICENSE.

Contact

Built by Maikel Slomp. For senior-tier agentic-AI work and partnerships, reach out via mad-it.agency.
