A curated, implementation-first list of agent harness engineering resources, with GitHub projects as the primary focus.
- Total entries: 270
- GitHub entries: 243 (90.0%)
- GitHub in project categories (excluding readings): 238/238 (100.0%)
- Categories: 9
- Last verified: 2026-06-09
- Language: English | 中文
- Scaling Managed Agents: Decoupling the brain from the hands: Anthropic's meta-harness architecture for decoupling session logs, harness loops, and sandboxes in long-horizon agents.
- What We Learned Building Cloud Agents: Cognition's field report on secure cloud-agent infrastructure, VM isolation, full-state snapshots, orchestration, governance, integrations, and enterprise adoption.
- Claude Code auto mode: Anthropic's write-up on classifier-backed approval delegation for safer high-autonomy coding-agent runs.
- Harness engineering (OpenAI): Field report on building reliable agent-first software via harness constraints and verification.
- The next evolution of the Agents SDK: OpenAI's product and engineering post on model-native agent harnesses, native sandbox execution, manifests, memory, and filesystem and shell tools.
- Building Effective AI Agents: Anthropic's practical guidance on when to use workflows vs. autonomous agents and how to structure them.
- Writing effective tools for AI agents: Best practices for tool interface design so agents call tools safely and reliably.
- Effective harnesses for long-running agents: Practical guide to maintaining state, resumability, and reliability over long agent runs.
- Harness design for long-running application development: Follow-up article on improving long-running app generation through harness structure.
- Improving Deep Agents with harness engineering: Evidence that harness improvements alone can move benchmark performance.
- Evaluating Deep Agents: Our Learnings: LangChain's practical lessons on evaluating stateful and long-horizon agents.
- Your Agent Needs a Harness, Not a Framework: Argument for reliability-first infrastructure around agents instead of framework-only thinking.
- Category Overview
- Featured Harness Blogs
- Catalog
  - Harness Architecture & Orchestration
  - Context & Working-State Engineering
  - Execution Substrates & Sandboxing
  - Protocols, Tool Interfaces & Agent Contracts
  - Evaluation Harnesses & Benchmarks
  - Observability & Reliability Operations
  - Guardrails, Security & Governance
  - Reference Harness Implementations
  - Essential Readings & Ecosystem Maps
- Maintenance Notes
- Citation
| Category | Entries |
|---|---|
| Harness Architecture & Orchestration | 44 |
| Context & Working-State Engineering | 16 |
| Execution Substrates & Sandboxing | 25 |
| Protocols, Tool Interfaces & Agent Contracts | 24 |
| Evaluation Harnesses & Benchmarks | 27 |
| Observability & Reliability Operations | 15 |
| Guardrails, Security & Governance | 19 |
| Reference Harness Implementations | 68 |
| Essential Readings & Ecosystem Maps | 32 |
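The summary figures at the top of this list can be re-derived from the per-category counts. The following is a minimal Python sketch of that arithmetic; the variable names are illustrative, and the numbers are copied from the table and summary list rather than read from any project file.

```python
# Entry counts per category, copied from the Category Overview table.
category_counts = {
    "Harness Architecture & Orchestration": 44,
    "Context & Working-State Engineering": 16,
    "Execution Substrates & Sandboxing": 25,
    "Protocols, Tool Interfaces & Agent Contracts": 24,
    "Evaluation Harnesses & Benchmarks": 27,
    "Observability & Reliability Operations": 15,
    "Guardrails, Security & Governance": 19,
    "Reference Harness Implementations": 68,
    "Essential Readings & Ecosystem Maps": 32,
}
GITHUB_ENTRIES = 243  # "GitHub entries" value from the summary list

# Total entries across all nine categories.
total = sum(category_counts.values())

# Project categories are every category except the readings category.
project_entries = total - category_counts["Essential Readings & Ecosystem Maps"]

# Share of entries hosted on GitHub, as a percentage with one decimal place.
github_share = round(100 * GITHUB_ENTRIES / total, 1)

print(total, project_entries, github_share)
```

The three printed values correspond to the total entry count, the project-category denominator, and the GitHub percentage reported in the summary.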
Notes:
- `Stars` are rendered as badges from snapshot values.
- Repository update dates are tracked in `data/projects.yaml` and validation reports.
- Entries are sorted by stars (descending) within each category.
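The sort rule from the notes can be sketched in Python as follows. The record fields (`name`, `category`, `stars`) and the sample values are hypothetical; the actual schema of `data/projects.yaml` is not shown in this list, so treat this as an illustration of the ordering rule rather than the repository's tooling.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical records; field names and values are assumptions for illustration.
entries = [
    {"name": "alpha", "category": "Sandboxing", "stars": 150},
    {"name": "beta", "category": "Evaluation", "stars": 900},
    {"name": "gamma", "category": "Sandboxing", "stars": 4200},
    {"name": "delta", "category": "Evaluation", "stars": 75},
]


def sort_catalog(records):
    """Group records by category, then order each group by stars, highest first."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["category"]].append(record)
    return {
        category: sorted(items, key=itemgetter("stars"), reverse=True)
        for category, items in grouped.items()
    }


catalog = sort_catalog(entries)
for category, items in catalog.items():
    print(category, [item["name"] for item in items])
```

Because the star values are snapshots, the order produced this way reflects the snapshot date and not live repository counts.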
Harness Architecture & Orchestration

| Project | Link | Tags | Summary | Stars |
|---|---|---|---|---|
| Superpowers | GitHub | skills, workflow, cross-agent | Cross-agent software development methodology built from composable skills, mandatory workflows, worktrees, planning, TDD, review, and subagent execution. | |
| ECC | GitHub | cross-harness, hooks, skills | Cross-harness operator system combining skills, hooks, memory optimization, security scanning, and validation workflows for agentic work. | |
| gstack | GitHub | skills, qa, release | Claude Code and cross-agent skill stack that turns product planning, architecture review, QA, security, release, and retrospectives into repeatable agent workflows. | |
| DeerFlow | GitHub | long-horizon, memory, subagents | Long-horizon super-agent harness integrating memory, tools, subagents, and sandboxes. | |
| oh-my-openagent | GitHub | multi-harness, team-mode, skills | Multi-harness agent OS for OpenCode, Codex, Claude Code, and other coding agents with team-mode orchestration, background agents, MCPs, and skills. | |
| AutoGen | GitHub | multi-agent, orchestration, framework | Programming framework for agentic AI with multi-agent interaction and orchestration. | |
| Ruflo | GitHub | multi-agent, swarm, mcp | Multi-agent orchestration platform for Claude Code with swarms, persistent memory, federation, plugins, and MCP hooks. | |
| CrewAI | GitHub | multi-agent, workflows, control-plane | Multi-agent automation framework with production Flows, autonomous Crews, event-driven control, tracing, guardrails, memory, and human review hooks. | |
| Addy's Agent Skills | GitHub | skills, quality-gates, coding-agents | Production-grade engineering skills for coding agents that package lifecycle workflows, quality gates, reviews, testing, debugging, security, and release practices. | |
| Agno | GitHub | scale, runtime, management | Agent software runtime focused on running and managing agentic systems at scale. | |
| LangGraph | GitHub | graph, workflow, runtime | Graph-based runtime for resilient stateful agents and deterministic workflow control. | |
| Semantic Kernel | GitHub | enterprise, orchestration, plugins | Enterprise-grade agentic application framework with orchestration and plugin patterns. | |
| OpenAI Agents SDK (Python) | GitHub | sdk, handoff, workflows | Lightweight framework for multi-agent workflows, handoffs, and production patterns. | |
| Symphony | GitHub | orchestration, control-plane, workflows | Ticket-driven orchestration layer that turns project work into isolated autonomous implementation runs. | |
| deepagents | GitHub | runtime, orchestration, long-running | Open-source harness for long-running, tool-using agents with planning and subagent patterns. | |
| Archon | GitHub | workflow-engine, worktrees, validation | Workflow engine for AI coding agents with YAML-defined phases, isolated worktrees, and validation gates. | |
| Google ADK (Python) | GitHub | toolkit, deployment, evaluation | Code-first toolkit to build, evaluate, and deploy advanced AI agents. | |
| elizaOS | GitHub | agent-os, plugins, benchmarks | Extensible agent runtime and operating system with CLI scaffolding, agent loop, plugins, memory/state primitives, dashboards, connectors, and benchmark suites. | |
| PydanticAI | GitHub | python, typing, schema | Type-safe Python framework for agents with strong schema contracts and tooling. | |
| Gas Town | GitHub | multi-agent, workspaces, coordination | Multi-agent workspace manager for coordinating coding agents with persistent work tracking, git-backed hooks, handoffs, supervision, and merge queues. | |
| Microsoft Agent Framework | GitHub | multi-agent, workflows, observability | Multi-language framework for building, orchestrating, and deploying AI agents with graph workflows and observability. | |
| Hive | GitHub | harness, orchestration, runtime | Outcome-driven agent runtime harness with explicit control loops and orchestration blocks. | |
| VoltAgent | GitHub | typescript, platform, runtime | TypeScript agent engineering platform built around open runtime abstractions. | |
| mcp-agent | GitHub | mcp, runtime, workflow | Practical agent framework centered on MCP tool ecosystems and workflow composition. | |
| PraisonAI | GitHub | multi-agent, workflow, memory | Multi-agent workforce framework with autonomous planning, execution, memory, RAG, dashboards, and multi-provider model support. | |
| Agent Squad | GitHub | routing, multi-agent, context | Multi-agent orchestration framework that routes requests, preserves conversation context, supports Python/TypeScript, and coordinates specialist agents. | |
| Yao | GitHub | single-binary, runtime, autonomous | Single-binary runtime for defining and running autonomous agents. | |
| Open Multi-Agent | GitHub | multi-agent, dag, tracing | TypeScript-native multi-agent orchestrator that turns goals into task DAGs with parallel execution, MCP integration, and live tracing. | |
| Strands Agents | GitHub | sdk, mcp, tools | Model-driven agent SDK and monorepo with Python/TypeScript agent loops, provider adapters, tools, MCP integration, multi-agent systems, and streaming. | |
| Cloudflare Agents | GitHub | platform, deployment, runtime | Platform runtime for building and deploying agents with production infrastructure primitives. | |
| Flue | GitHub | typescript, headless, sandbox | TypeScript harness framework for building headless agents with sessions, tools, skills, and pluggable sandboxes. | |
| Embabel Agent Framework | GitHub | jvm, planning, typed-flows | JVM agent framework for typed agentic flows with goals, actions, conditions, dynamic planning, platform modes, and testability. | |
| OpenAI Agents SDK (JS/TS) | GitHub | typescript, workflows, sandbox-agents | JavaScript/TypeScript framework for multi-agent workflows with handoffs, tools, guardrails, sessions, tracing, and sandbox agents. | |
| Docker Agent | GitHub | docker, runtime, container | Agent builder and runtime stack emphasizing container-native execution. | |
| NeMo Agent Toolkit | GitHub | multi-agent, optimization, toolkit | Open toolkit for connecting and optimizing teams of AI agents. | |
| Apache Burr | GitHub | state-machine, persistence, tracing | State-machine framework for decision-making agents and LLM apps with persistence, telemetry UI, tracing, and framework-agnostic execution. | |
| Scion | GitHub | multi-agent, containers, orchestration | Experimental multi-agent orchestration testbed that runs isolated agent harnesses in containers, worktrees, and remote runtimes. | |
| deepagentsjs | GitHub | typescript, langgraph, subagents | TypeScript agent harness with built-in planning, filesystem tools, subagents, and LangGraph-native runtime hooks. | |
| oh-my-agent | GitHub | multi-agent, skills, cross-runtime | Portable multi-agent harness that projects shared agents, skills, workflows, and rules into multiple coding-agent runtimes. | |
| Chorus | GitHub | ai-dlc, permissions, task-state | AI-human collaboration harness for session lifecycle, task state, sub-agent orchestration, observability, and recovery. | |
| Pydantic AI Harness | GitHub | capabilities, hooks, pydantic | Official Pydantic AI capability library for composing tools, lifecycle hooks, instructions, and model settings into reusable agent harnesses. | |
| Water | GitHub | python, framework, approval-gates | Python agent harness framework for orchestration, resilience, observability, guardrails, approval gates, sandboxing, and deployment. | |
| OmniCoreAgent | GitHub | python, mcp, serving | Python production harness with model loop, tools, MCP, memory, workspace files, guardrails, events, subagents, background tasks, and REST/SSE serving. | |
| hankweave | GitHub | long-horizon, runtime, checkpoints | Headless-first long-horizon runtime that orchestrates existing agent harnesses with sentinels, loops, checkpoints, and event journals. |
Context & Working-State Engineering

| Project | Link | Tags | Summary | Stars |
|---|---|---|---|---|
| claude-mem | GitHub | memory, context, session | Plugin-style memory layer that captures session history and reinjects relevant context into future coding runs. | |
| Beads | GitHub | memory, issue-tracking, work-state | Agent-optimized distributed issue tracker that stores long-horizon coding work as dependency-aware graph state with memory recall and multi-branch sync. | |
| planning-with-files | GitHub | planning, skills, persistence | Skill package for persistent file-based planning in coding-agent workflows. | |
| agentmemory | GitHub | memory, mcp, hooks | Persistent memory server for coding agents using hooks, MCP/REST integration, hybrid search, and shared session recall. | |
| Context Mode | GitHub | context, mcp, session | MCP context optimization server that sandboxes tool output, indexes session events, and restores continuity across agent compactions. | |
| Agent Skills for Context Engineering | GitHub | skills, context, production | Large skill library oriented around context engineering and production agents. | |
| Trellis | GitHub | specs, memory, workflow | Multi-platform coding-agent workflow framework with task context, project memory, and spec injection. | |
| Context-Engineering Handbook | GitHub | context-engineering, handbook, practices | First-principles handbook focused on practical context engineering for agent systems. | |
| CCPM | GitHub | planning, github-issues, parallel-execution | Spec-driven project-manager skill that turns PRDs and GitHub issues into persistent context and parallel agent execution. | |
| TencentDB Agent Memory | GitHub | memory, context-offloading, openclaw | Local agent memory plugin combining symbolic short-term state, layered long-term memory, traceability, and OpenClaw/Hermes integrations. | |
| Acontext | GitHub | skills, memory, progressive-disclosure | Skill-memory layer that distills agent runs into inspectable skill files and recalls them through agent-controlled tools. | |
| Awesome Context Engineering | GitHub | awesome-list, context, survey | Survey-style list for context engineering resources and frameworks. | |
| agentic-stack | GitHub | cross-harness, memory, skills | Portable memory, skills, protocols, and dashboard layer that keeps state across multiple coding-agent harnesses. | |
| context-space | GitHub | context, infrastructure, mcp | Infrastructure project focused on context engineering building blocks and MCP-centric integrations. | |
| Memorix | GitHub | memory, mcp, cross-agent | Local-first cross-agent memory control plane with MCP support, workspace sync, sessions, and orchestration state. | |
| sd0x-dev-flow | GitHub | hooks, state-machine, claude-code | Claude Code harness layer with hook-enforced dual review, durable state-machine gates, context-compaction recovery, and fail-closed safety. |
Execution Substrates & Sandboxing

| Project | Link | Tags | Summary | Stars |
|---|---|---|---|---|
| Daytona | GitHub | sandbox, execution, infra | Secure and elastic sandbox infrastructure for running AI-generated code with file, Git, LSP, and execution APIs. | |
| CUA | GitHub | computer-use, sandbox, infra | Infrastructure stack for computer-use agents with sandbox, SDK, and benchmark support. | |
| Browser Harness | GitHub | browser, cdp, self-healing | Thin editable CDP harness that connects LLMs directly to real browsers and lets agents extend helpers in flight. | |
| E2B | GitHub | cloud-sandbox, execution, enterprise | Secure cloud environments with real tools for production-grade agent execution. | |
| OpenSandbox | GitHub | sandbox, security, runtime | Secure and extensible sandbox runtime built for agent workloads. | |
| OpenShell | GitHub | sandbox, policy, runtime | Safe private runtime for autonomous agents with sandbox lifecycle control and declarative filesystem, network, process, and inference policies. | |
| Microsandbox | GitHub | sandbox, vm, mcp | Rootless local VM sandbox runtime with SDKs, detached long-running sessions, agent skills, and MCP server integration. | |
| CubeSandbox | GitHub | microvm, sandbox, e2b-compatible | MicroVM-based sandbox service for AI agents with sub-60ms startup, E2B-compatible APIs, and hardware-level isolation. | |
| Sandcastle | GitHub | sandbox, typescript, branch-strategy | TypeScript library for orchestrating coding agents inside isolated sandboxes with configurable branch strategies. | |
| agent-infra sandbox | GitHub | all-in-one, browser, shell | All-in-one sandbox combining browser, shell, files, MCP, and IDE server. | |
| Judge0 | GitHub | code-execution, sandbox, backend | Scalable sandboxed code execution system usable as an agent execution backend. | |
| Agent Sandbox | GitHub | kubernetes, sandbox, stateful | Kubernetes-native sandbox control plane for isolated, stateful agent runtimes with stable identity, persistence, and warm-pool support. | |
| stakpak/agent | GitHub | always-on, autonomous, ops | Always-on open agent that runs on your machines with autonomous operational loops. | |
| Sandbox Agent | GitHub | sandbox, coding-agents, session-schema | HTTP/SSE control server for running coding agents inside sandboxes with normalized sessions, permissions, event streaming, and replay. | |
| E2B Desktop Sandbox | GitHub | desktop, sandbox, computer-use | Secure virtual desktop sandbox for computer-use agents with SDK control and screen streaming. | |
| OSS-Fuzz Gen | GitHub | fuzzing, security, execution | LLM-powered fuzzing workflows integrated with controlled execution contexts. | |
| AgentBay SDK | GitHub | cloud-sandbox, computer-use, sdk | Cloud sandbox SDK for agents spanning browser, desktop, mobile, and code execution environments. | |
| Tensorlake | GitHub | microvm, sandbox, orchestration | Serverless runtime for agent sandboxes with MicroVM isolation, snapshots, suspend-resume, and background orchestration. | |
| AgentScope Runtime | GitHub | runtime, sandbox, deployment | Production runtime for agent apps with secure tool sandboxes, deployment APIs, observability, and state services. | |
| SWE-ReX | GitHub | sandbox, execution, coding-agent | Sandboxed execution infrastructure for AI coding agents at local and cloud scale. | |
| sandboxed.sh | GitHub | self-hosted, isolation, orchestrator | Self-hosted orchestrator running coding agents inside isolated Linux workspaces. | |
| Capsule | GitHub | wasm, sandbox, task-runtime | Durable runtime that coordinates agent tasks inside isolated WebAssembly sandboxes with retries and lifecycle tracking. | |
| agentbox | GitHub | sandbox, coding-agents, network-policy | Locked-down local sandbox for AI coding agents with scoped filesystem access, egress policy, secret injection, firewalling, and persistent agent state. | |
| HexAgent | GitHub | computer-layer, sandbox, runtime | Agent harness that separates the runtime from the computer it operates on through local, VM, and cloud sandbox backends. | |
| terminal-bench-env | GitHub | terminal, benchmark-env, sandbox | Environment layer for terminal-agent benchmark execution. |
Protocols, Tool Interfaces & Agent Contracts

| Project | Link | Tags | Summary | Stars |
|---|---|---|---|---|
| Anthropic Agent Skills | GitHub | skills, spec, claude | Official Agent Skills repository containing the skills specification, templates, and reference skill implementations for Claude. | |
| GitHub Spec Kit | GitHub | spec-driven, workflows, tooling | Toolkit for spec-driven development to guide deterministic agent execution. | |
| MCP Servers | GitHub | mcp, servers, implementations | Official collection of MCP server implementations across tools and domains. | |
| Chrome DevTools MCP | GitHub | mcp, browser, devtools | Official MCP server that gives coding agents Chrome DevTools access for reliable browser automation, debugging, and performance analysis. | |
| Playwright MCP | GitHub | mcp, browser, playwright | Official Playwright MCP server giving agents structured accessibility snapshots and deterministic browser automation tools. | |
| Claude Code Plugins Directory | GitHub | plugins, claude-code, marketplace | Anthropic-managed Claude Code plugin marketplace defining plugin manifests, MCP configuration, commands, agents, skills, and submission quality gates. | |
| FastMCP | GitHub | mcp, python, framework | Python framework for building MCP servers and clients with generated schemas, validation, documentation, production deployment patterns, and governance hooks. | |
| Serena | GitHub | mcp, coding-agents, semantic-tools | MCP toolkit that gives coding agents IDE-like semantic retrieval, editing, refactoring, debugging, and memory tools. | |
| MCP Python SDK | GitHub | mcp, python, sdk | Official Python implementation of MCP for building clients and servers that expose tools, resources, prompts, protocol lifecycle events, and standard transports. | |
| AGENTS.md | GitHub | spec, agent-file, instructions | Open format for repository-local instructions that coding agents can follow. | |
| Agent Skills Specification | GitHub | skills, spec, progressive-disclosure | Open specification and documentation for packaging reusable agent capabilities, workflows, scripts, references, and assets behind progressive disclosure. | |
| MCP TypeScript SDK | GitHub | mcp, typescript, sdk | Official TypeScript MCP SDK with server and client packages, transports, auth helpers, middleware adapters, and runnable examples. | |
| Model Context Protocol | GitHub | mcp, protocol, interoperability | Core specification and docs for MCP-based tool and context interoperability. | |
| directories (rules and MCP indexes) | GitHub | directories, mcp, rules | Curated directories of agent rules and MCP servers for tool discovery. | |
| Atmosphere | GitHub | jvm, multi-protocol, governance | JVM runtime for streaming governable AI agents across MCP, A2A, AG-UI, and browser-facing transports. | |
| LangChain MCP Adapters | GitHub | mcp, adapters, integration | Adapters connecting LangChain components with MCP servers. | |
| SkillHub | GitHub | skills, registry, governance | Self-hosted enterprise agent skill registry with package publishing, versioning, discovery, namespaces, RBAC, reviews, and audit logs. | |
| Agent Client Protocol | GitHub | acp, protocol, coding-agents | Open protocol that standardizes communication between code editors and coding agents. | |
| Microsoft MCP Servers | GitHub | mcp, enterprise, servers | Microsoft's official MCP server catalog for enterprise data and tools. | |
| ACPX | GitHub | acp, client, sessions | Headless CLI client for stateful Agent Client Protocol sessions. | |
| GitAgentProtocol | GitHub | standard, git-native, workflows | Git-native, framework-agnostic standard for defining agents, skills, workflows, tools, and runtime memory in repositories. | |
| Microsoft Learn MCP | GitHub | mcp, docs, grounding | MCP server and CLI for grounding agents with Microsoft documentation sources. | |
| IBM MCP | GitHub | mcp, clients, tooling | IBM collection of MCP servers, clients, and developer tooling. | |
| AGENT.md | GitHub | standard, agent-file, interoperability | Standardized machine-readable file format for agentic coding tools. |
Evaluation Harnesses & Benchmarks

| Project | Link | Tags | Summary | Stars |
|---|---|---|---|---|
| Promptfoo | GitHub | eval, red-team, ci | Config-driven prompt/agent/RAG testing, comparison, and red-team evaluation tool. | |
| DeepEval | GitHub | evaluation, framework, testing | LLM evaluation framework supporting agent and workflow quality testing. | |
| RAGAS | GitHub | rag, metrics, evaluation | Open evaluation toolkit for LLM and RAG quality metrics. | |
| lm-evaluation-harness | GitHub | benchmark, harness, llm | Popular benchmark harness for consistent LLM evaluation across tasks. | |
| SWE-bench | GitHub | benchmark, swe, evaluation | Standard benchmark for evaluating issue-fixing software engineering agents. | |
| verifiers | GitHub | verifier, rl, evaluation | Library for RL environments and verifier-based evaluation loops. | |
| AgentBench | GitHub | benchmark, cross-domain, agent | Cross-environment benchmark for evaluating LLM agents as tool-using systems. | |
| LangWatch | GitHub | simulation, evaluation, testing | End-to-end platform for agent simulations, evaluation loops, and production testing. | |
| EvalScope | GitHub | benchmark, framework, llm | Customizable framework for large-model benchmarking and performance evaluation. | |
| Harbor | GitHub | evaluation, harness, rl-env | Framework for running agent evaluations and constructing RL-style environments. | |
| Terminal-Bench | GitHub | terminal, benchmark, long-horizon | Terminal-native benchmark suite for long-horizon, verification-heavy agent tasks. | |
| WebArena | GitHub | web-agent, benchmark, environment | Self-hostable web environment and evaluation harness for autonomous web agents with reproducible end-to-end tasks. | |
| tau2-bench | GitHub | tool-use, interaction, benchmark | Tool-agent-user interaction benchmark emphasizing multi-step execution quality. | |
| Meta-Harness | GitHub | harness-search, optimization, terminal-bench | Framework for automated search over task-specific model harnesses, with reference experiments for memory systems and terminal-agent scaffolds. | |
| NeMo Gym | GitHub | rl-env, training, evaluation | Toolkit for building RL environments suitable for LLM/agent training and eval. | |
| TheAgentCompany | GitHub | benchmark, workplace, multi-step | Agent benchmark with simulated software-company tasks for evaluating multi-step workplace autonomy. | |
| Claw-Eval | GitHub | benchmark, trajectory, safety | Evaluation harness and benchmark for autonomous agents with human-verified tasks, trajectory auditing, and completion, safety, and robustness rubrics. | |
| Inspect Evals | GitHub | inspect, eval-suite, reproducibility | Evaluation suite collection for Inspect AI workflows. | |
| auto-harness | GitHub | optimization, regression, evals | Benchmark-gated optimization loop that mines failures, edits agent code, and guards against regressions overnight. | |
| SWE-Bench Pro | GitHub | swe, benchmark, long-horizon | Long-horizon software-engineering benchmark with reproducible Docker-based evaluation for issue-driven coding agents. | |
| WildClawBench | GitHub | benchmark, harness-comparison, multimodal | In-the-wild benchmark that compares multiple agent harnesses on end-to-end multimodal, coding, safety, and productivity tasks inside a live OpenClaw environment. | |
| ClawBench | GitHub | browser-agent, benchmark, recording | Browser-agent benchmark with live-site tasks, isolated containers, five-layer recording, and agentic scoring. | |
| Agent Evaluation | GitHub | evaluation, testing, ci | AWS framework for testing virtual agents with evaluator-driven multi-turn conversations, hooks, and CI-friendly workflows. | |
| WorkArena | GitHub | browser, benchmark, enterprise | Browser benchmark for practical enterprise-like knowledge work tasks. | |
| OpenHands Benchmarks | GitHub | openhands, eval, harness | Evaluation harness and benchmark definitions for OpenHands systems. | |
| WebArena-Verified | GitHub | web-agent, benchmark, deterministic | Verified web-agent benchmark with deterministic evaluators. | |
| HarnessBench | GitHub | harness-comparison, browser-agent, benchmark | Benchmark for comparing agent harnesses on the same everyday web tasks with fixed models and per-harness containers. |
Observability & Reliability Operations

| Project | Link | Tags | Summary | Stars |
|---|---|---|---|---|
| Langfuse | GitHub | llmops, tracing, metrics | Open-source LLM engineering platform for traces, metrics, prompts, and evals. | |
| MLflow | GitHub | platform, monitoring, evaluation | Broad AI engineering platform with monitoring and evaluation support for agents. | |
| Opik | GitHub | monitoring, eval, tracing | End-to-end debug/eval/monitoring stack for LLM apps and agent workflows. | |
| RagaAI Catalyst | GitHub | agentops, analytics, monitoring | Agent observability and monitoring framework with timeline and graph analytics. | |
| TensorZero | GitHub | llmops, gateway, optimization | Open LLMOps stack unifying gateway, observability, evaluation, and optimization. | |
| Arize Phoenix | GitHub | observability, tracing, evaluation | Open platform for AI observability, tracing, and evaluation analytics. | |
| OpenLLMetry | GitHub | opentelemetry, instrumentation, tracing | OpenTelemetry-based instrumentation for GenAI and LLM applications. | |
| Helicone | GitHub | monitoring, traffic, production | Lightweight platform for monitoring and evaluating LLM traffic in production. | |
| AgentOps SDK | GitHub | agentops, monitoring, cost | Monitoring and benchmarking SDK for agent workflows with cost and trace tracking. | |
| Latitude | GitHub | platform, eval, observability | Open-source agent engineering platform with eval and observability capabilities. | |
| Laminar | GitHub | observability, tracing, evals | Agent-focused observability stack with tracing, evaluation runs, monitoring, and dashboards. | |
| Desloppify | GitHub | quality-gates, codebase-health, ci | Agent-facing codebase quality harness with scans, scoring, LLM review, prioritized fix loops, persistent state, and CI gates. | |
| claude-code-reverse | GitHub | trace, visualization, debugging | Tooling to visualize and inspect Claude Code LLM interaction traces. | |
| Future AGI | GitHub | observability, evaluation, guardrails | Self-hostable platform that closes the loop across agent tracing, evaluation, simulation, guardrails, and gateway operations. | |
| OpenInference | GitHub | spec, instrumentation, observability | Open instrumentation specification and tooling for AI observability. |
Guardrails, Security & Governance

| Project | Link | Tags | Summary | Stars |
|---|---|---|---|---|
| LiteLLM | GitHub | gateway, proxy, guardrails | Unified LLM gateway/proxy with cost tracking, load balancing, and guardrails. | |
| Kong | GitHub | gateway, policy, infra | API and AI gateway infrastructure useful for policy enforcement in agent systems. | |
| Parlant | GitHub | interaction-control, guardrails, customer-agents | Interaction-control harness for customer-facing agents focused on consistent, predictable, and governed LLM behavior. | |
| Portkey Gateway | GitHub | gateway, guardrails, routing | AI gateway with routing and guardrails for multi-model production traffic. | |
| CAI (Cybersecurity AI) | GitHub | security, governance, framework | Security-focused agent framework for offensive/defensive AI workflows. | |
| OpenAI Realtime Agents | GitHub | realtime, orchestration, control | Advanced agentic realtime patterns with structured control and interaction loops. | |
| Plano | GitHub | proxy, safety, data-plane | AI-native proxy and data plane with orchestration, safety, and observability. | |
| OpenAI CS Agents Demo | GitHub | demo, handoffs, governance | Customer-service multi-agent demo highlighting handoffs and guardrail-like control points. | |
| Agent Governance Toolkit | GitHub | governance, policy, sandboxing | Runtime governance toolkit that deterministically enforces agent policy, identity, sandboxing, and audit controls before actions execute. | |
| ContextForge | GitHub | gateway, governance, observability | Registry and proxy layer that unifies MCP, A2A, and REST/gRPC endpoints with centralized governance and observability. | |
| Archestra | GitHub | enterprise, guardrails, governance | Enterprise AI platform with guardrails, MCP registry, and orchestration services. | |
| Tracecat | GitHub | security, automation, policy | AI automation platform for security teams with policy and workflow controls. | |
| AgentGateway | GitHub | gateway, mcp, proxy | Agentic proxy gateway for AI agents and MCP server ecosystems. | |
| ClawManager | GitHub | control-plane, governance, runtimes | Kubernetes-native control plane for governing agent runtimes, AI gateway access, and reusable skills across multiple agent backends. | |
| Agent Vault | GitHub | credentials, egress-policy, proxy | Credential proxy and vault that brokers agent API access without exposing real secrets, with egress filtering and request logging. | |
| Haft | GitHub | governance, decisions, mcp | Decision-governance harness that records falsifiable contracts, evidence, and commissions before agents execute. | |
| Sponsio | GitHub | contracts, runtime-safety, guardrails | Runtime enforcement layer that checks every agent action against deterministic contracts before execution. | |
| DashClaw | GitHub | approvals, policy, audit | Governance layer that intercepts risky agent actions, enforces policy, routes approvals, and records audit-ready decision trails. | |
| Tandem | GitHub | runtime-authority, approvals, audit | Governed runtime authority layer for agents with scoped execution, tool visibility, permissioned memory, approval gates, and audit trails. |
Reference Harness Implementations

| Project | Link | Tags | Summary | Stars |
|---|---|---|---|---|
| OpenClaw | GitHub | gateway, channels, sandboxing | Local-first personal assistant harness with a gateway control plane for sessions, channels, tools, events, skills, and sandboxed non-main agents. | |
| Claw Code | GitHub | rust, cli, sessions | Public Rust implementation of the claw CLI agent harness with auth, sessions, parity checks, container workflows, and terminal execution guidance. | |
| Hermes Agent | GitHub | memory, skills, subagents | Self-improving agent runtime with memory, skill creation, subagents, scheduled automations, and pluggable terminal backends. | |
| OpenCode | GitHub | terminal, coding-agent, subagents | Open-source coding agent with built-in plan/build roles, subagents, LSP support, and a client-server runtime. | |
| Claude Code | GitHub | terminal, coding-agent, git-workflows | Official terminal coding agent that understands codebases and executes editing, debugging, and Git workflows through natural language. | |
| Gemini CLI | GitHub | terminal, coding-agent, mcp | Open-source terminal agent with built-in tools, MCP support, checkpointing, and sandboxing controls. | |
| Browser Use | GitHub | browser-agent, automation, benchmarks | Browser-agent framework that exposes websites to LLMs through browser state, tools, cloud browsers, and benchmarked task runs. | |
| Codex CLI | GitHub | terminal, coding-agent, local-execution | Terminal-native coding agent that runs locally and exposes practical agent workflows for software tasks. | |
| LobeHub | GitHub | operator, multi-agent, scheduling | Chief-agent-operator platform for scheduling, running, and reporting on multi-agent workstreams. | |
| OpenHands | GitHub | coding-agent, software-engineering, repo | Open-source AI software engineer focused on repo-level coding task execution. | |
| Paperclip | GitHub | managed-agents, control-plane, governance | Managed-agent control plane with org charts, ticketing, budgets, heartbeats, and audit trails for coordinating agent teams. | |
| learn-claude-code | GitHub | tutorial, harness, claude-code | Hands-on harness tutorial for building Claude Code-like systems from scratch. | |
| Cline | GitHub | coding-agent, mcp, checkpoints | Open-source coding agent spanning IDE, terminal, SDK, and kanban surfaces with shared approvals, MCP, checkpoints, and agent teams. | |
| pi | GitHub | coding-agent, runtime, monorepo | Agent harness monorepo combining a coding-agent CLI, shared runtime, and multi-provider LLM stack. | |
| OpenManus | GitHub | general-agent, autonomy, workflows | Open foundation for broad autonomous agent workflows with coding-heavy use cases. | |
| aider | GitHub | terminal, repo-map, testing | Terminal coding assistant with repo mapping, git-aware edits, and built-in lint/test feedback loops. | |
| CowAgent | GitHub | reference, skills, multi-channel | Reference agent harness implementation with planning, memory, knowledge, skills, tools, MCP integration, schedulers, browser automation, and multi-channel delivery. | |
| nanobot | GitHub | runtime, memory, multi-channel | Ultra-lightweight agent runtime with WebUI, chat channels, tools, memory, MCP, model routing, deployment, and long-running goal support. | |
| CLI-Anything | GitHub | cli, tool-use, automation | CLI agent system that unifies command-line tool usage in agent loops. | |
| Claude Code Plugins: Orchestration and Automation | GitHub | claude-code, plugins, orchestration | Production-ready Claude Code plugin marketplace bundling agents, skills, tools, and multi-agent workflow orchestrators. | |
| Agent TARS | GitHub | computer-use, browser-agent, mcp | Multimodal computer and browser agent stack with CLI/Web UI, hybrid GUI/DOM browser control, MCP tools, event streams, and sandbox support. | |
| oh-my-claudecode | GitHub | claude-code, multi-agent, worktrees | Team-first orchestration layer for Claude Code with staged multi-agent execution, worktree-aware setup, and persistent session artifacts. | |
| Multica | GitHub | managed-agents, coding-agent, runtimes | Managed-agents platform that assigns issues to coding agents, routes execution through runtimes, and compounds reusable skills. | |
| ZeroClaw | GitHub | runtime, approval-gates, sandboxing | Single-binary agent runtime with providers, channels, tools, memory, SOPs, approval gates, sandboxing, ACP, and tool receipts. | |
| oh-my-codex | GitHub | codex, workflow, worktrees | Workflow layer for OpenAI Codex CLI with stronger session startup, standard planning-to-completion flows, durable state, skills, hooks, and worktree launches. | |
| NanoClaw | GitHub | containers, claude-sdk, scheduling | Container-isolated Claude agent harness with channel routing, scheduled jobs, per-group memory, and small-codebase customization. | |
| Vibe Kanban | GitHub | coding-agent, workspaces, review | Kanban control plane for planning, running, reviewing, and merging work from coding agents in isolated workspaces. | |
| Qwen Code | GitHub | terminal, coding-agent, cli | Terminal-native open-source coding agent tuned for practical dev loops. | |
| SuperClaude Framework | GitHub | config, personas, workflow | Configuration framework adding commands, personas, and method templates to coding agents. | |
| cmux | GitHub | macos, workspace, browser | Native macOS terminal and browser workspace for AI coding agents with notifications, split panes, and scriptable control. | |
| Compound Engineering | GitHub | plugins, worktrees, review | Cross-agent engineering plugin that codifies brainstorming, planning, worktree execution, review, and knowledge compounding loops. | |
| Devika | GitHub | assistant, planning, coding | Open-source coding assistant system for planning and implementing development tasks. | |
| SWE-agent | GitHub | swe, issue-fixing, tooling | Research-grade coding agent that resolves GitHub issues with explicit tooling loops. | |
| OpenFang | GitHub | agent-os, guardrails, rust | Rust agent operating system with autonomous capability packages, manifests, guardrails, tools, memory, sandboxing, audit trails, and channel adapters. | |
| Aperant | GitHub | coding-agent, parallel, memory | Autonomous multi-agent coding framework with parallel execution, isolated workspaces, QA loops, and persistent memory. | |
| Eigent | GitHub | desktop, cowork, productivity | Open-source desktop cowork agent for autonomous task execution and productivity. | |
| OpenHarness | GitHub | tool-use, memory, multi-agent | Open agent harness implementation covering tool use, skills, memory, permissions, and multi-agent coordination. | |
| IronClaw | GitHub | security, wasm, routines | Security-first personal agent harness with WASM sandboxing, routines, tool plugins, and persistent memory. | |
| Agent S | GitHub | computer-use, gui-agent, evaluation | Open-source computer-use agent framework with grounding models, reflection, local code execution option, and OSWorld-style evaluation support. | |
| Superset | GitHub | worktrees, desktop, parallel | Worktree-based desktop orchestrator for running and reviewing parallel CLI coding agents from one workspace. | |
| oh-my-pi | GitHub | terminal, lsp, subagents | Terminal AI coding agent with edit safety, LSP integration, and subagent support. | |
| GitHub Copilot CLI | GitHub | terminal, coding-agent, mcp | Official terminal coding agent built on GitHub's Copilot harness with MCP extensibility, approval controls, and GitHub-native context. | |
| Open SWE | GitHub | async, coding-agent, swe | Asynchronous open-source coding agent focused on software issue workflows. | |
| Paseo | GitHub | coding-agent, daemon, multi-device | Multi-device coding-agent daemon and client stack for orchestrating local agents, parallel runs, and cross-provider workflows. | |
| Agent Orchestrator | GitHub | worktrees, parallel, dashboard | Worktree-based orchestration layer for parallel coding agents with autonomous CI and review feedback handling. | |
| jcode | GitHub | coding-agent, terminal, rust | Rust coding-agent harness built for multi-session workflows, customization, memory, and terminal performance. | |
| Harness | GitHub | claude-code, meta-factory, agent-teams | Claude Code meta-factory that generates domain-specific agent teams, skills, orchestration patterns, and validation steps from a project description. | |
| OSAURUS | GitHub | macos, local-first, memory | Native macOS harness for autonomous coding agents with persistent memory. | |
| 1Code | GitHub | coding-agent, orchestration, worktrees | Desktop-first coding-agent orchestrator with worktree isolation, background sandboxes, MCP tooling, and automation triggers. | |
| holaOS | GitHub | long-horizon, desktop, durable-state | Desktop-first long-horizon agent environment with runtime, memory, tools, apps, and durable state. | |
| Webwright | GitHub | browser-agent, code-as-action, playwright | Minimal browser-agent harness that lets coding models solve web tasks by writing rerunnable Playwright scripts in a workspace. | |
| mini-swe-agent | GitHub | minimal, swe, coding-agent | Minimal coding agent implementation with strong benchmark competitiveness. | |
| HiClaw | GitHub | multi-agent, human-in-the-loop, shared-state | Collaborative multi-agent OS with manager-worker coordination, shared state, and human-in-the-loop oversight via Matrix rooms. | |
| gptme | GitHub | terminal, tools, mcp | Terminal-native personal agent with local tools, shell and web access, provider-agnostic models, plugins, skills, MCP, guardrails, and autonomous loops. | |
| TinyAGI | GitHub | team-orchestration, autonomous, workflows | Team-style agent orchestrator for one-person-company style autonomous workflows. | |
| Open Claude Cowork | GitHub | desktop, ui, orchestration | Desktop coding cowork assistant that turns agent orchestration into GUI workflows. | |
| Amazon Bedrock AgentCore Samples | GitHub | aws, runtime, operations | Official sample suite for deploying and operating agents with runtime, gateway, memory, observability, evaluation, and policy layers. | |
| Maestro | GitHub | desktop, worktrees, orchestration | Desktop command center for parallel coding agents with worktree isolation, queued tasks, auto-run playbooks, and reusable sessions. | |
| AI-DLC Workflows | GitHub | workflow-rules, quality-gates, steering | Official AWS workflow ruleset that steers coding agents through adaptive phases, quality gates, and IDE-specific context files. | |
| Google Agents CLI | GitHub | google-cloud, lifecycle, skills | Google Cloud CLI and skill bundle that gives coding agents scaffold, evaluation, deployment, publishing, and observability workflows. | |
| Open Cowork | GitHub | desktop, sandbox, mcp | Desktop agent app with VM-backed sandboxing, MCP connectors, GUI control, and built-in skill workflows. | |
| thClaws | GitHub | rust, workspace, skills | Native Rust agent workspace with shared agent loop, sessions, tools, skills, MCP, memory, hooks, and sandboxing. | |
| mini-coding-agent | GitHub | coding-agent, minimal, approvals | Minimal coding agent harness illustrating approvals, memory, bounded delegation, and durable transcripts. | |
| codex-autorunner | GitHub | meta-harness, tickets, long-running | Meta-harness that treats tickets as the control plane for long-running coding agents, with queue execution, hub UI, and chat notifications. | |
| CheetahClaws | GitHub | coding-agent, python, mcp | Python agent harness infrastructure for long-horizon, multi-model, tool-using coding assistants with MCP, skills, memory, approvals, checkpoints, and bridges. | |
| MateClaw | GitHub | self-hosted, approvals, channels | Self-hosted multi-user agent harness with StateGraph reasoning, skills, MCP/ACP registry, approvals, audit trail, and channel adapters. | |
| OpenClaw.NET | GitHub | dotnet, gateway, governance | NativeAOT-friendly .NET agent runtime and gateway with tools, memory, MCP, governance ledger, evidence bundles, and harness regression tests. | |
| Utah | GitHub | durable-execution, event-driven, multi-channel | Inngest-powered durable agent harness with a think-act-observe loop, step-level retries, singleton concurrency, cancellation, and multi-channel adapters. |

| Project | Link | Stars | Tags | Summary |
|---|---|---|---|---|
| awesome-claude-code | GitHub | | awesome-list, claude-code, skills | Community collection of Claude Code skills, hooks, and orchestrator tooling. |
| Awesome Agent Skills | GitHub | | awesome-list, skills, cross-harness | Curated cross-harness map of official and community Agent Skills for Claude Code, Codex, Gemini CLI, Cursor, OpenCode, Copilot, and related hosts. |
| awesome-agentic-patterns | GitHub | | awesome-list, patterns, design | Catalog of reusable agentic design patterns and implementation motifs. |
| awesome-mcp-servers | GitHub | | awesome-list, mcp, tools | Curated MCP server index for tool interoperability in agent systems. |
| awesome-harness-engineering | GitHub | | awesome-list, curation, harness | Curated list focused on harness engineering articles, benchmarks, and implementations. |
| 12 Factor Agents | Reference | - | reading, operations, principles | Operations-oriented principles for building maintainable production agents. |
| Agent Frameworks, Runtimes, and Harnesses, oh my! | Reference | - | reading, langchain, architecture | Clear decomposition of framework vs runtime vs harness responsibilities. |
| An open-source spec for Codex orchestration: Symphony. | Reference | - | reading, openai, orchestration | OpenAI's orchestration write-up on turning issue trackers into always-on control planes for coding agents. |
| Building agents with the Claude Agent SDK | Reference | - | reading, claude, sdk | Claude blog on production-oriented SDK usage for sessions, tools, and orchestration. |
| Building Effective AI Agents | Reference | - | reading, anthropic, agents | Anthropic's practical guidance on when to use workflows vs. autonomous agents and how to structure them. |
| Claude Code auto mode | Reference | - | reading, anthropic, permissions | Anthropic's write-up on classifier-backed approval delegation for safer high-autonomy coding-agent runs. |
| Code execution with MCP | Reference | - | reading, anthropic, mcp | Anthropic's design notes on controlled code execution via MCP boundaries. |
| Demystifying Evals for AI Agents | Reference | - | reading, evals, anthropic | Methodology for designing robust agent evals in non-deterministic trajectories. |
| Effective context engineering for AI agents | Reference | - | reading, context, anthropic | Guidance on context-window budgeting and working-state management for agents. |
| Effective harnesses for long-running agents | Reference | - | reading, long-running, anthropic | Practical guide to maintaining state, resumability, and reliability over long agent runs. |
| Evaluating Deep Agents: Our Learnings | Reference | - | reading, langchain, evaluation | LangChain's practical lessons on evaluating stateful and long-horizon agents. |
| Harness design for long-running application development | Reference | - | reading, app-dev, anthropic | Follow-up article on improving long-running app generation through harness structure. |
| Harness Engineering (Martin Fowler) | Reference | - | reading, architecture, fowler | Architectural perspective on harness engineering and entropy control. |
| Harness engineering (OpenAI) | Reference | - | reading, methodology, openai | Field report on building reliable agent-first software via harness constraints and verification. |
| How we built our multi-agent research system | Reference | - | reading, anthropic, multi-agent | Anthropic architecture write-up on role separation and coordination in multi-agent systems. |
| Improving Deep Agents with harness engineering | Reference | - | reading, langchain, harness | Evidence that harness improvements alone can move benchmark performance. |
| Making Claude Code more secure and autonomous with sandboxing | Reference | - | reading, anthropic, sandboxing | How Anthropic uses sandbox boundaries to raise agent autonomy without giving up security controls. |
| Quantifying infrastructure noise in agentic coding evals | Reference | - | reading, anthropic, evaluation | Analysis of how infrastructure choices impact coding-agent benchmark outcomes. |
| Scaling Managed Agents: Decoupling the brain from the hands | Reference | - | reading, anthropic, architecture | Anthropic's meta-harness architecture for decoupling session logs, harness loops, and sandboxes in long-horizon agents. |
| Skill Issue: Harness Engineering for Coding Agents | Reference | - | reading, humanlayer, coding-agents | Practical breakdown of why coding-agent quality depends heavily on harness setup. |
| Testing Agent Skills Systematically with Evals | Reference | - | reading, openai, evals | OpenAI Developers guide for turning agent traces into repeatable skill evaluations. |
| The Anatomy of an Agent Harness | Reference | - | reading, architecture, langchain | Conceptual decomposition of agent harness components and their responsibilities. |
| The next evolution of the Agents SDK | Reference | - | reading, openai, sdk | OpenAI's product and engineering post on model-native agent harnesses, native sandbox execution, manifests, memory, and filesystem and shell tools. |
| Unrolling the Codex agent loop | Reference | - | reading, openai, architecture | OpenAI engineering deep dive into the Codex harness loop, prompt growth, tool-call replay, and stateless execution tradeoffs. |
| What We Learned Building Cloud Agents | Reference | - | reading, cognition, cloud-agents | Cognition's field report on secure cloud-agent infrastructure, VM isolation, full-state snapshots, orchestration, governance, integrations, and enterprise adoption. |
| Writing effective tools for AI agents | Reference | - | reading, anthropic, tools | Best practices for tool interface design so agents call tools safely and reliably. |
| Your Agent Needs a Harness, Not a Framework | Reference | - | reading, inngest, reliability | Argument for reliability-first infrastructure around agents instead of framework-only thinking. |

- Source of truth: `data/projects.yaml`
- Regenerate README files: `python3 scripts/render_readme.py`
- Verify catalog and links: `python3 scripts/verify_catalog.py`

@misc{li2026agentharness,
title={Agent Harness Engineering: A Survey},
author={Li, Junjie and Xiao, Xi and Zhang, Yunbei and Liu, Chen and Zhao, Lin and Liao, Xiaoying and Ji, Yingrui and Wang, Janet and Gu, Jianyang and Ge, Yingqiang and Xu, Weijie and Fang, Xi and Xu, Xiang and Zhao, Tianchen and Kim, Youngeun and Wang, Tianyang and Hamm, Jihun and Krishnaswamy, Smita and Huan, Jun and Reddy, Chandan},
url={https://openreview.net/pdf?id=eONq7FdiHa},
year={2026}
}