Skip to content

Latest commit

 

History

History
100 lines (78 loc) · 5.42 KB

File metadata and controls

100 lines (78 loc) · 5.42 KB

Branch-and-fan-out demo — real results

The forkd "fork a thinking agent" demo, end-to-end on real hardware. The latest clean run is in results-2026-05-18/; the earlier results-2026-05-17/ is the same mechanism with a less-capable model (Qwen2.5-7B) — kept for comparison so you can see what changes when you swap models.

TL;DR for a tweet thread

🍴 forkd just forked a running ReAct agent: 163 ms pause on tmpfs-backed snapshot storage, 4 s on the SATA SSD this demo recorded against. Same code, only the disk differs.

A source agent had spent 2 steps gathering weather + place data for a Kyoto + Osaka trip. We BRANCHed it and spawned 3 grandchildren from the same cognitive state. Each got a different steering hint — "be thorough", "be minimal", "optimize for cost".

All 3 produced different itineraries, inheriting the same tool results, same conversation history, same Python heap. The only thing that diverged was the next thought.

Headline divergence: the parent (no hint) put Nishiki Market on Day 1. All three hinted children dropped it and substituted Arashiyama Bamboo Grove — a free outdoor activity. The cost-focused child even annotated dining stops with "may be pricey" warnings.

This is the speculative-parallel-exploration primitive Modal Sandboxes keeps closed-source. Now on KVM, open-source. ↓

The setup that produced the run

  • Host: yangdongxu-desktop, Ubuntu 24.04, Linux 6.14, 20 vCPU, 30 GiB RAM
  • forkd built from demo/summary-show-in-flight (see PR #66)
  • Source rootfs: python:3.12-slim + requests, ~206 MiB
  • LLM: DeepSeek-V3 via SiliconFlow's OpenAI-compatible API
  • Task: "Plan a 2-day trip to Kyoto and Osaka. Use the tools to check weather and find places."

Headline numbers

Metric Value
Daemon-measured pause window 4007 ms (SATA SSD storage; see RESULTS-v0.2.md for 163 ms on tmpfs)
Memory image size 513 MiB
Grandchildren spawned 3
Steering hints applied 3 (one per child)
Network retries this run 0 (clean)
Per-agent token cost 1395–1546
Snapshot tag (auditable) langgraph-fork-1779037370

The divergence at a glance

Agent Hint Day-1 afternoon (Kyoto) Notable framing
parent (none — control) Nishiki Market ($$) baseline; no special framing
thorough "cultural depth, slow" Arashiyama Bamboo (free) replaced shopping w/ cultural-nature
minimal "daylight outside, no shopping" Arashiyama Bamboo (free) replaced shopping w/ outdoor
cost "avoid $$$, prefer free or $" Arashiyama Bamboo (free) + warning labels on $$ stops, explicit cost-optimization footer

Worth highlighting: the model wasn't told to "drop Nishiki Market" or "add Arashiyama". It chose to re-rank based on the hint. All three hinted children independently agreed on the substitution. Cost went further and added meta-commentary like "though dining options may be pricey" and an explicit "Cost Optimization" footer that the others didn't.

Full itineraries

See results-2026-05-18/summary.md for the auto-generated render of all four agents' final answers. Raw per-event JSONL is in the same directory.

What this validates

  1. The BRANCH primitive works on a real agent workload. 4 s pause, 0 errors, all 4 agents completed cleanly with their respective post-branch reasoning.
  2. In-guest agents are pause-blind. No socket errors, no timeouts at wake-up, no retries needed in this run. Same pattern we measured synthetically in bench/pause-window/RESULTS-v0.2.md, now confirmed on a real LLM agent.
  3. Hint-based perturbation post-branch is real. Each child's NEXT LLM call sees a different system message; the inherited conversation history + tool results stay the same. This is the cheapest faithful model of speculative parallel exploration on a stateful agent.

What the earlier run 9 shows (and what we learned from it)

The first end-to-end run (committed in results-2026-05-17/) used Qwen2.5-7B-Instruct. The mechanism worked but the model:

  • Had network retries on first call after restore (~90 s wall before reaching branch)
  • Occasionally emitted tool-call arguments as freeform content
  • Kept calling search_places past the point where it should have produced a final answer

The hint side-channel STILL worked — the children's in-flight think events showed clear divergence (e.g. minimal's "Nishiki Market - food, $" vs the original "food, $$" — model self-downgraded the price). But the answers came out messy.

The fix landed in PR #66:

  1. Default model bumped to DeepSeek-V3 (much better tool discipline)
  2. System prompt explicit about "use each tool at most twice, then stop calling tools"
  3. branch_after_step=2 (DeepSeek converges in 2 steps; the prior =3 was unreachable)
  4. summarize.py falls back to last think when no answer exists, so future flaky runs still tell a story

run-12 (2026-05-18) reflects all of those. Same mechanism, cleaner output.

Reproducing

export FORKD_URL=http://127.0.0.1:8889
export FORKD_TOKEN=$(cat /etc/forkd/token)
export SILICONFLOW_API_KEY=...
bash recipes/langgraph-react/demo.sh

recipes/langgraph-react/README.md has the detailed recipe + design notes.