- 1. What Cyber Pilot generates into host tools
- 2. Quick recommendation by tool
- 3. Support matrix
- 4. Shared best practices across all tools
- 5. Claude Code
- 6. Cursor
- 7. GitHub Copilot
- 8. OpenAI Codex
- 9. Windsurf
- 10. Common problems and fixes
- Problem: subagents are not available where you expected them
- Problem: review quality is poor after a long generation session
- Problem: the host appears to support read-only review, but you still do not trust it fully
- Problem: one giant task keeps going off the rails
- Problem: Windsurf feels worse than tools with subagents
- 11. How to think about subagents vs manual chat separation
- Further reading
How to use Cyber Pilot with different AI agent hosts, what each tool supports, where the rough edges are, and how to work around them.
For the canonical product model, workflow model, and setup paths, start with README. This guide is specifically about host differences and operational trade-offs.
Convention: 💬 = paste into AI coding tool chat. 🖥️ = run in terminal.
Cyber Pilot is not tied to one AI host.
Instead, it projects its workflows and instructions into the host tool you use.
In practice, `cpt generate-agents --agent <tool>` generates some combination of:
- workflow commands
  - entry points for `plan`, `generate`, `analyze`, workspace flows, and kit workflows
- skill outputs
  - host-tool-visible Cyber Pilot skill entry points that route into the core instructions
- subagents
  - isolated task-specific agents with scoped permissions and dedicated prompts, where the host supports them
Claude Code is the canonical full-fidelity format for generated subagents.
Other tools receive the best adaptation their host format supports, with graceful degradation where a capability has no equivalent.
Typical setup:
🖥️ Terminal:
cpt generate-agents --agent claude
cpt generate-agents --agent cursor
cpt generate-agents --agent copilot
cpt generate-agents --agent openai
cpt generate-agents --agent windsurf
Subagents are not equally supported across all tools.
That is one of the most important practical differences when using Cyber Pilot.
- Claude Code
  - Best overall fit when you want the fullest Cyber Pilot experience, especially for generation: strong subagent support, read-only review isolation, model selection, and worktree isolation.
- Cursor
  - Good general-purpose IDE host for Cyber Pilot. Supports subagents, and its multi-model nature is a real advantage because you can pair Anthropic for generation with GPT-style models for review. The isolation model is still weaker than Claude Code's.
- GitHub Copilot
  - Usable with Cyber Pilot and supports subagents. Especially attractive when you want strong review behavior from GPT-style models, though host-level control is less expressive than Claude Code's.
- OpenAI Codex
  - Strong option for review, bounded analysis, and artifact-heavy work when you want GPT-style strictness and carefulness. Works best when tasks stay narrow and validation is explicit.
- Windsurf
  - Still usable with Cyber Pilot workflows and skills, and its multi-model nature is a practical plus. But it does not support subagents, so treat it as a single-agent host and manually separate contexts.
| Capability | Claude Code | Cursor | GitHub Copilot | OpenAI Codex | Windsurf |
|---|---|---|---|---|---|
| Workflow / skill integration | Yes | Yes | Yes | Yes | Yes |
| Host-native subagents | Yes | Yes | Yes | Yes | No |
| Read-only review enforcement | Strong | Strong | Partial | Weaker / prompt-led | No host-native subagent enforcement |
| Worktree isolation | Yes | No | No | No | No |
| Model selection in generated subagents | Yes | Yes | No equivalent | Tool-dependent / less central | N/A |
| Subagent-scoped hooks | Yes | No | Tool-specific / narrower current surface | No | No |
| Multi-model host advantage | No | Yes | Yes | No | Yes |
| Best use mode with Cyber Pilot | Full orchestration | Strong daily-driver | Structured assistance | Bounded execution | Manual separation |
Important distinction:
- if a host does not support subagents, that is a host limitation, not a Cyber Pilot limitation
- Cyber Pilot still gives you workflows, skill routing, validation, traceability, and planning
- Regenerate integrations after setup changes
  - If you changed workflows, skills, kits, or agent config, rerun `cpt generate-agents --agent <tool>`.
- Use `plan` before large work
  - Do not push a large ambiguous request straight into `generate`.
- Separate generation from review
  - If the host supports subagents, use them.
  - If it does not, use separate chats.
- Use a fresh chat for a new task
  - Especially for generation and review.
  - If you stay in the same session, clear the context first.
  - In tools such as Claude Code or Codex-style shells, 💬 `/clear` is the practical reset.
- Validate after generation
  - A successful generation step is not the end.
  - Run validation, review, fix, and validate again.
- Keep tasks bounded
  - Narrow requests are easier for every host tool to execute safely and correctly.
- Require human review at the end
  - Host tooling, subagents, validation, and AI review raise confidence.
  - They do not replace final human judgment.
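The "regenerate integrations after setup changes" practice above is easy to script. The sketch below loops `cpt generate-agents` over every host this guide covers; the `CPT` override is a convention of this sketch, not a feature of the `cpt` CLI, so you can print the commands before actually running them.

```shell
# Sketch: regenerate Cyber Pilot integrations for every host in one pass.
# By default this only prints the commands; set CPT=cpt to actually run them.
CPT="${CPT:-echo cpt}"

regen_all() {
  for agent in claude cursor copilot openai windsurf; do
    # Unquoted $CPT on purpose: the dry-run default is two words ("echo cpt").
    $CPT generate-agents --agent "$agent"
  done
}

regen_all
```

Run this after any change to workflows, skills, kits, or agent config, so no host is left on stale definitions.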
Use these tiers as the practical default:
- fast-tier model
  - good for high-volume, lower-complexity work where speed matters more than deep reasoning
- default / inherit model
  - use the user's current session model; this should usually be the default choice
- strong reasoning model required
  - use your best available model tier for tasks with ambiguity, architecture judgment, long dependencies, or expensive mistakes
The current subagent defaults already follow this rule of thumb:
- prefer `inherit` unless a cheaper or faster tier is clearly enough
- use `fast` only for high-volume, lower-complexity tasks such as structured PR review
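The rule of thumb can be sketched as a small lookup. The task-category names below are made up for this sketch; they are not Cyber Pilot identifiers.

```shell
# Illustrative helper (not part of cpt): map a task category to the
# model tier suggested above. Category names are invented for this sketch.
pick_tier() {
  case "$1" in
    lookup|formatting-fix|pr-review-first-pass)
      echo "fast" ;;      # high-volume, lower-complexity work
    plan|artifact-generation|brownfield-analysis)
      echo "strong" ;;    # ambiguity, architecture judgment, expensive mistakes
    *)
      echo "inherit" ;;   # default: the user's current session model
  esac
}

pick_tier plan                  # -> strong
pick_tier pr-review-first-pass  # -> fast
pick_tier codegen               # -> inherit (falls through to the default)
```

The point of the default branch is that `inherit` stays the safe answer unless a cheaper tier is clearly enough or the task clearly needs escalation.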
In current practical use with Cyber Pilot, the model-family tendency is usually:
- GPT-thinking models
  - usually better for review, especially when you want stricter, more careful, more exact analysis
  - often better for artifact generation too, though usually not by a dramatic margin; the main difference is that they tend to make fewer mistakes
- Claude / Anthropic models
  - usually better for generation, especially code generation
  - often faster and stronger when turning an approved spec into implementation
  - usually somewhat weaker than GPT-style models for artifact-heavy work, though the gap is not huge
This is a practical default, not a hard law.
The right choice still depends on the host, the task shape, and the exact model version available in your setup.
| Operation or prompt type | Recommended model tier | When fast-tier is acceptable | Notes |
|---|---|---|---|
| `show config`, `where-defined`, `list-ids`, simple lookup, shallow repo navigation | fast-tier or default / inherit | usually yes | Mostly bounded retrieval and formatting work. |
| Small formatting fixes, marker recovery, narrow edits with explicit target | fast-tier or default / inherit | yes, if scope is tight | Still validate after the change. |
| Structured PR review first pass | fast-tier by default, ideally GPT-style review model | yes | This matches the current `cypilot-pr-review` default. Escalate if review becomes architectural or cross-artifact. |
| Deterministic validation triage and checklist scanning | fast-tier or default / inherit | often yes | Good for first-pass issue grouping, but escalate when interpretation becomes non-trivial. |
| `plan` for large, risky, or multi-phase work | strong reasoning model required | usually no | Decomposition quality matters; weak planning creates downstream drift. |
| Artifact generation or transformation such as PRD -> DESIGN -> DECOMPOSITION -> FEATURE | strong reasoning model required, often GPT-style model preferred | only for tiny bounded rewrites | These tasks mix structure, interpretation, and consistency pressure. GPT-style models often make slightly fewer artifact-level mistakes. |
| Brownfield understanding, reverse engineering, architecture review, migration planning | strong reasoning model required | usually no | High ambiguity and hidden dependencies make weaker tiers risky. |
| Code generation from an approved FEATURE or DESIGN | default / inherit, often Claude / Anthropic preferred in practice | only for small well-specified local changes | For anything non-trivial, prefer the best code-generation model available in the session. Claude-style models are often the stronger default here. |
| Phase compilation, phase execution planning, RalphEx orchestration handoff | default / inherit or strong reasoning | rarely | These are coordination-heavy tasks; mistakes compound across steps. |
| Final acceptance decision | human review required | no | No model tier replaces final human judgment. |
- Claude Code
  - best fit when you want the host integration to carry explicit model hints together with isolation and tool scoping
- Cursor
  - supports generated model selection and is useful as a multi-model host; a common practical split is Anthropic for generation and GPT-style models for review. It still lacks Claude-style worktree isolation.
- GitHub Copilot
  - generated agent files do not have the same direct model-selection surface, so choose the right model in the host environment or session before invoking Cyber Pilot workflows. Its multi-model nature is useful when you want stronger review behavior from GPT-style models.
- OpenAI Codex
  - treat model choice as mostly session-level and prompt-level rather than strongly host-enforced metadata; use stronger GPT-style models for planning, review, architecture, and brownfield work
- Windsurf
  - because there are no subagents, model choice is entirely manual per chat; its multi-model nature is still useful because you can choose one model family for generation and another for review
Claude Code is the strongest fit for the full Cyber Pilot model.
It is the highest-fidelity host for:
- subagent definitions
- scoped tools
- read-only review setup
- model selection
- worktree isolation
- subagent-level hooks
This makes it the closest match to how Cyber Pilot wants generation and review to be separated.
- fully specified code generation
- isolated PR review
- phase compilation and phase execution
- complex multi-step work where context hygiene matters
- You mix generation and review in one long parent chat
  - Even with subagents available, the parent session can still accumulate stale context.
- You expect codegen edits to behave like direct inline edits
  - Claude Code is the one host where worktree isolation is available, so isolation behavior can differ from your expectations.
- You forget to regenerate integrations after config changes
  - Then the host may still be using old workflow or subagent definitions.
- use a fresh parent chat for a new major task
- prefer subagents for codegen and review instead of doing everything in one session
- rerun `cpt generate-agents --agent claude` after integration-relevant changes
- still validate and review after codegen
If you want the most complete Cyber Pilot host today, use Claude Code.
It is also the strongest default when the main task is code generation.
Cursor supports Cyber Pilot subagents and is a strong day-to-day host for normal coding workflows.
It supports:
- subagent definitions
- read-only review flag
- model selection in generated definitions
Cursor does not give you the same worktree-isolated model as Claude Code.
That means the overall separation is still useful, but not as strong as the Claude Code path.
- You assume Cursor subagents give the same isolation guarantees as Claude Code
- You let one large task sprawl instead of using phased execution
- You use subagents, but still keep too much stale context in the parent chat
- keep tasks narrower
- prefer `plan` for large work
- validate after each meaningful implementation step
- start a new chat before review work
- treat Cursor as strong, but not “full Claude-equivalent isolation”
Use Cursor when you want solid Cyber Pilot support inside an interactive IDE workflow and do not strictly depend on Claude-style worktree isolation.
It becomes especially attractive when you want one host where:
- Anthropic models can handle generation
- GPT-style models can handle review
GitHub Copilot supports generated Cyber Pilot subagents and can participate in structured generation and review flows.
It is useful for:
- structured implementation tasks
- read-oriented review flows
- teams already standardized on Copilot
Copilot's generated agent surface is less expressive than Claude Code's.
The result is still useful, but some controls that Claude Code can express directly are thinner here.
- You expect Copilot subagents to enforce as much as Claude Code does
- You rely too much on the host tool and not enough on Cyber Pilot validation
- You have existing GitHub-side instruction files and assume Cyber Pilot will overwrite everything
- keep the tasks explicit and bounded
- lean more on Cyber Pilot workflows and deterministic validation
- inspect generated `.github/agents/` outputs when debugging setup issues
- do not treat host integration as the source of truth; treat Cyber Pilot workflows and validation as the source of truth
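When debugging setup issues, a quick existence check on the generated output directories can save time. Only `.github/agents/` is named by this guide; any other path you pass in is an assumption, so substitute whatever directories your `cpt generate-agents` run actually produces.

```shell
# Sketch: report which generated agent directories exist and how many
# files each contains. Directory paths are caller-supplied; only
# .github/agents/ (Copilot) is confirmed by this guide.
check_generated() {
  for dir in "$@"; do
    if [ -d "$dir" ]; then
      echo "$dir: $(ls "$dir" | wc -l | tr -d ' ') file(s)"
    else
      echo "$dir: missing -- rerun cpt generate-agents for this host"
    fi
  done
}

# Prints one status line per directory checked.
check_generated .github/agents
```

If the directory is missing or empty, regenerate before blaming the host integration.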
Use GitHub Copilot when it is already your team standard, but compensate for weaker host-level expressiveness with tighter prompts and stronger validation discipline.
It is especially reasonable when your review culture benefits from GPT-style thinking models.
OpenAI Codex can be used with Cyber Pilot and supports generated agent definitions.
It works best for:
- bounded execution tasks
- well-specified implementation work
- clear, explicit, low-ambiguity instructions
Compared with Claude Code, more of the intended behavior is carried by the prompt and workflow instructions rather than by rich host-native enforcement.
That means you should expect less safety from the host layer itself.
- You give Codex an oversized or ambiguous task and expect the host to keep it on rails
- You reuse a polluted session for review after generation
- You treat one clean run as sufficient evidence of correctness
- keep tasks tightly scoped
- use `plan` before big execution
- validate after each phase
- use a new chat for a new task
- in Codex-style shells, use 💬 `/clear` before switching task type if you remain in the same session
Use OpenAI Codex when the task is explicit and bounded.
Do not rely on the host alone to provide the same isolation guarantees you would expect from Claude Code.
It is particularly strong for:
- review
- strict analysis
- artifact-heavy work
Windsurf does not support subagents.
This is the most important thing to understand before using Cyber Pilot there.
Cyber Pilot still works through:
- workflow integrations
- skill outputs
- workflow routing
- validation and traceability
But Windsurf does not give you host-native isolated child agents for codegen and review.
In Windsurf, you should manually do what subagents would otherwise help with automatically:
- use one chat for planning
- use another chat for generation
- use another chat for review
- keep validation separate and explicit
- You expect `cypilot-codegen` or `cypilot-pr-review` to exist as host-native subagents
- You run generation and review in the same long session
- You let generation context contaminate review quality
- You expect least-privilege separation that the host cannot enforce
- use fresh chats aggressively
- keep one role per chat
- use `plan` for larger work so each phase stays bounded
- run validation and review explicitly after generation
- accept that Windsurf is a single-agent host from Cyber Pilot's point of view
Use Windsurf with Cyber Pilot when you want the workflow and skill layer, but do not expect host-native subagent orchestration.
Think of Windsurf as manual context orchestration rather than subagent orchestration.
Its main practical upside is that it can still be valuable as a multi-model host, even without subagents.
Likely causes:
- the host does not support subagents
- integrations were not regenerated
- you assumed host parity that does not exist
Fix:
- rerun 🖥️ `cpt generate-agents --agent <tool>`
- check whether that host actually supports subagents
- if not, switch to manual chat separation
Likely cause:
- stale generation context polluted the review context
Fix:
- use a separate review subagent where supported
- otherwise start a new chat
- in Claude Code or Codex-style sessions, use 💬 `/clear` before the review task if you stay in the same shell
Likely cause:
- host-level control is not equally strong across tools
Fix:
- treat host permissions as helpful, not as your only safety mechanism
- rely on Cyber Pilot validation, review workflow, and final human review
Likely cause:
- the task should have gone through `plan` first
Fix:
- use 💬 `cypilot plan: ...`
- execute phase by phase
- validate after each meaningful step
Likely cause:
- Windsurf lacks the host-native isolation layer
Fix:
- split work by chat manually
- separate generation from review
- keep tasks smaller and more explicit
One practical workaround is to use Cyber Pilot to generate the next-chat prompt for you.
Example:
- in the current chat:
  - 💬 `cypilot analyze: generate a bounded prompt for a fresh Windsurf chat that should implement only phase 2 of the approved plan, list the exact files to inspect first, preserve @cpt-* markers, and end by running validation and summarizing any remaining issues`
  - 💬 `cypilot analyze: generate a bounded prompt for a fresh Windsurf chat that should review the code changed for phase 2 against the approved FEATURE and DESIGN, check for missing @cpt-* markers, and return a structured list of issues by severity`
- then open a new Windsurf chat and paste the generated prompt there
This partially simulates what a dedicated codegen or review subagent would have given you:
- a cleaner context boundary
- a bounded task
- explicit files and constraints
- a stronger separation between orchestration and execution
Prefer:
- parent chat for orchestration
- subagent for generation
- subagent for review
- fresh parent chat when switching to a new major task
This gives you:
- cleaner context boundaries
- better least-privilege separation
- more stable long-running workflows
Manually simulate the same separation:
- one chat for planning
- one chat for generation
- one chat for review
- explicit validation between them
This is especially important in Windsurf.
Whether the host supports subagents or not, the operating model should still be:
- plan or generate
- validate
- review
- fix
- validate again
- human review before acceptance