
Complete the skill evolution loop and wire up API/SDK/MCP compatibility #9

Open
w31r4 wants to merge 26 commits into main from feature/skill-extraction-intelligence

Conversation


@w31r4 w31r4 commented Mar 12, 2026

Background

This change set does three things around the skill evolution control plane:

  • Completes the goal / rubric / evaluator / auto-promote loop, so the evolution path goes from "mutate only" to "evaluate against a goal and promote automatically".
  • Fixes duplicate scheduler triggering and the missing outcome-attribution check that were polluting the signal.
  • Makes the API / SDK / MCP compatible with the new evolution candidate data format, so list-form preconditions / postconditions no longer break downstream parsing.

It also covers a legacy-data edge case: if a row historically wrote rubric_json before rubric_summary was being backfilled, rubric_summary is now written automatically on cache hit, keeping /v1/skills/goals responses consistent.
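The backfill described above can be sketched roughly as follows. The field names come from the PR description, but the function name, the plain-dict record shape, and the summarisation rule are illustrative assumptions, not Bay's actual implementation:

```python
import json


def ensure_rubric_summary(goal: dict) -> dict:
    """On cache hit, derive and persist a missing rubric_summary (sketch)."""
    if goal.get("rubric_json") and not goal.get("rubric_summary"):
        rubric = json.loads(goal["rubric_json"])
        # assume the cached rubric carries named criteria we can summarise
        goal["rubric_summary"] = "; ".join(
            criterion["name"] for criterion in rubric.get("criteria", [])
        )
    return goal
```

Rows that already have a rubric_summary pass through untouched, so the backfill is idempotent.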

Main changes

1. Complete the goal-driven skill evolution loop

  • SkillMutationAgent now reads the skill goal at mutation time and injects it into the mutation prompt.
  • Added two modules, RubricGenerator and GoalConditionedEvaluator, which handle rubric generation and mutation evaluation respectively over an OpenAI-compatible HTTP interface.
  • The rubric is cached in SkillGoal.rubric_json, with rubric_summary maintained in sync. When the goal text changes, the stale cache is cleared and the rubric is regenerated on the next round.
  • The evaluator now scores the post-mutation skill content instead of mistakenly scoring the parent version.
  • When the evaluator returns passed=true and score >= auto_promote_threshold, the candidate is automatically promoted to canary.
  • llm.enabled and the outer evolution.enabled now have clearly separated responsibilities: the former gates LLM-driven evolution logic, the latter gates the scheduling loop.
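A minimal sketch of the auto-promote guardrail described in the last two bullets; every name below is illustrative, not the actual SkillMutationAgent or evaluator API:

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical evaluator verdict: an explicit pass flag plus a score."""
    passed: bool
    score: float


def should_auto_promote(
    result: EvalResult,
    *,
    evolution_enabled: bool,
    llm_enabled: bool,
    auto_promote_threshold: float,
) -> bool:
    """Promote to canary only when every guardrail holds."""
    if not (evolution_enabled and llm_enabled):
        return False  # both feature flags gate the whole path
    if not result.passed:
        return False  # the evaluator must explicitly pass the mutation
    return result.score >= auto_promote_threshold
```

Candidates that fail any check are simply not promoted; per the PR they stay DRAFT rather than being auto-rejected.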

2. Fix duplicate scheduler triggering and outcome signal pollution

  • Before triggering a mutation, the scheduler checks whether the most recent system:evolution candidate already covers the current failure window, so the same batch of failures is not mutated again across cycles.
  • When a new failure appears, the scheduler triggers mutation again instead of locking the skill into a "ran only once" state.
  • record_outcome now validates that (release_id, owner, skill_key) are mutually consistent, preventing an execution outcome from being written to a release or skill it does not belong to.
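The failure-window check above reduces to a timestamp comparison. The following is an illustrative sketch; the function and argument names are assumptions, not the scheduler's real interface:

```python
from datetime import datetime
from typing import Optional


def should_trigger_mutation(
    last_mutation_at: Optional[datetime],
    last_failure_at: Optional[datetime],
) -> bool:
    """Trigger only when failures exist that no mutation has covered yet."""
    if last_failure_at is None:
        return False  # no failures in the window, nothing to do
    if last_mutation_at is None:
        return True  # failures pending and the skill was never mutated
    # a failure newer than the last mutation re-opens the window
    return last_failure_at > last_mutation_at
```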

3. API / SDK / MCP compatibility for evolution candidates

  • The preconditions / postconditions of an evolution candidate have been extended from a single dict form to accept list[str] | dict[str, Any] | null.
  • Bay API create-request and response serialization both support the dual format.
  • SDK type definitions, create requests, and parsing tests have been updated in step.
  • The MCP create_skill_candidate tool schema, handler, and regression tests have been updated in step; neither the create nor the read path truncates the list format to null anymore.
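A hedged sketch of what the dual-format handling amounts to on the parsing side; coerce_conditions is an illustrative name, not an actual Bay or SDK helper:

```python
from typing import Any, Optional, Union

Conditions = Optional[Union[list, dict]]


def coerce_conditions(value: Any) -> Conditions:
    """Accept list[str], dict[str, Any], or None without truncating to null."""
    if value is None:
        return None
    if isinstance(value, dict):
        return value  # legacy dict form passes through unchanged
    if isinstance(value, list):
        return [str(item) for item in value]  # agent-generated list form
    raise TypeError(f"unsupported conditions type: {type(value).__name__}")
```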

4. Tests and regression

  • Added and strengthened Bay unit and integration tests covering rubric caching, goal-change invalidation, mutation evaluation, auto-promote, scheduler dedup, outcome attribution validation, and other key paths.
  • Added SDK/MCP regression tests covering the new evolution candidate field formats.
  • Added a regression test for legacy caches where rubric_json exists but rubric_summary is missing, verifying that historical data self-heals.

Verification results

  • pkgs/bay: 550 passed, 191 skipped
  • shipyard-neo-sdk: 46 passed
  • shipyard-neo-mcp: 52 passed
  • ruff check passes on the changed files

Scope of impact

  • The changes are contained to the Bay control plane and its SDK/MCP compatibility layer.
  • No runtime-side agent decision logic is introduced.
  • No breaking API changes; this is a backward-compatible enhancement.

Suggested reviewer focus

  • Whether the rubric caching/invalidation in SkillMutationAgent and its interplay with the evaluator match expectations.
  • Whether the scheduler's failure-window dedup criterion satisfies the intended production policy.
  • Whether the API / SDK / MCP dual-format handling of preconditions / postconditions matches consumer expectations.

Reviewer guide

This PR's diff against main is fairly large because the current branch is a feature branch with accumulated historical commits. Files changed therefore shows the extraction pipeline, skill evolution, API/SDK/MCP compatibility, and some earlier branch-baseline changes side by side.

If you mainly want to confirm that this round of skill evolution / skill extraction logic is mergeable, don't read Files changed from the top; review in the order below instead:

Recommended review order

  1. Extraction pipeline mainline

    • Start with commit 3e02d60.
    • Key files: pkgs/bay/app/services/skills/extraction/*, pkgs/bay/app/services/skills/scheduler.py, pkgs/bay/tests/unit/managers/test_extraction_strategies.py, pkgs/bay/tests/unit/managers/test_browser_learning_scheduler.py
    • Focus: rule/LLM extraction strategies, variable extraction, payload_hash dedup, fallback paths.
  2. Skill evolution API and base pipeline

    • Then 592b27d.
    • Key files: pkgs/bay/app/api/v1/skills.py, pkgs/bay/app/services/skills/evolution/*, pkgs/bay/app/services/skills/service.py, pkgs/bay/tests/integration/core/test_skill_evolution_api.py
    • Focus: goal declaration, outcome reporting, scheduler triggering, mutation candidate creation and the release flow.
  3. Hardening and loop completion

    • Then 23de36d, 9f068d2, a255127.
    • Key files: pkgs/bay/app/services/skills/evolution/agent.py, pkgs/bay/app/services/skills/evolution/llm.py, pkgs/bay/app/services/skills/evolution/scheduler.py, shipyard-neo-sdk/shipyard_neo/types.py, shipyard-neo-mcp/tests/test_server.py
    • Focus: the goal/rubric/evaluator/auto-promote loop, scheduler dedup, record_outcome attribution validation, API/SDK/MCP list/dict/null dual-format compatibility, legacy rubric_json cache self-healing.

Anticipated reviewer questions

  • Why isn't the rubric generated immediately at declare_goal time?

    • By design, declare_goal stays a pure control-plane write with no hard dependency on the LLM. The rubric is generated only when the evolve/mutate path actually runs, and it is cached, so with llm.enabled=false a goal declaration never pulls in an extra external dependency.
  • Why do preconditions / postconditions accept list | dict | null?

    • Candidates created by humans or via the API historically use the dict structure, while candidates generated by the evolution agent use a list. This is a compatible extension, not a replacement of the old format.
  • Why is scheduler dedup judged by "is the latest mutation later than the latest failure"?

    • The goal is to keep the same batch of failures from being mutated repeatedly across scheduled cycles. Whenever a new failure appears, the timeline advances and the scheduler triggers mutation again.
  • Can auto-promote misfire?

    • There are currently three guardrail layers: evolution + llm must be explicitly enabled; the evaluator must return passed=true; and score >= auto_promote_threshold must hold. Candidates below the threshold stay DRAFT rather than being auto-rejected, and remain open to human review.
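The lazy rubric generation and goal-change invalidation described in the first Q&A item can be sketched as follows, assuming a plain-dict goal record and a generate_rubric callable standing in for the LLM call (names are illustrative, not the Bay API):

```python
from typing import Callable


def declare_goal(goal: dict, text: str) -> dict:
    """Pure control-plane write: no LLM call; a text change drops the cache."""
    if goal.get("text") != text:
        goal["rubric_json"] = None  # force regeneration on the next cycle
    goal["text"] = text
    return goal


def get_or_create_rubric(goal: dict, generate_rubric: Callable[[str], str]) -> str:
    """Runs on the evolve/mutate path; generates and caches only when missing."""
    if goal.get("rubric_json") is None:
        goal["rubric_json"] = generate_rubric(goal["text"])  # the LLM call lives here
    return goal["rubric_json"]
```

With llm.enabled=false, only declare_goal ever runs, so no external dependency is touched.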

Suggested tests to review first

  • pkgs/bay/tests/unit/managers/test_skill_mutation_agent.py
  • pkgs/bay/tests/unit/managers/test_evolution_scheduler.py
  • pkgs/bay/tests/unit/managers/test_skill_evolution_service.py
  • pkgs/bay/tests/integration/core/test_skill_evolution_api.py
  • shipyard-neo-sdk/tests/test_skills_and_history.py
  • shipyard-neo-mcp/tests/test_server.py

w31r4 and others added 24 commits February 21, 2026 02:26
Extend skill candidate create/read flows with optional summary, usage
notes, and structured pre/post conditions.

Add release promotion metadata fields for upgrade lineage and change
context, including parent release ID, upgrade reason, and change
summary.

Propagate these fields through Bay API models/services, SDK types and
client methods, MCP handlers/tool schemas, and integration/unit tests.
Document the project's two-layer self-iteration model separating
execution evidence from versioned release decisions.

Expand lifecycle guidance to show optional metadata fields for
candidate creation (`summary`, `usage_notes`, `preconditions`,
`postconditions`) and release promotion (`upgrade_of_release_id`,
`upgrade_reason`, `change_summary`) to improve explainability and
auditability.
Introduce soft-delete support across Bay skill lifecycle APIs, service
logic, and persistence models with delete metadata fields
(is_deleted, deleted_at, deleted_by, delete_reason).

Add DELETE endpoints for candidates and releases with guardrails:
active releases cannot be deleted, and candidates referenced by active
releases cannot be deleted. Exclude deleted records from get/list and
related lifecycle queries to keep APIs consistent.

Expose delete operations in shipyard-neo-sdk and shipyard-neo-mcp,
including optional delete reasons in DELETE request bodies, and add
integration/unit tests covering end-to-end behavior and tool output.
Make `delete_release` and `delete_candidate` accept an optional
`SkillDeleteRequest` so clients can call DELETE endpoints without sending
a body.

Update shipyard-neo docs to reflect cleanup operations in the lifecycle
flow and document delete tools plus optional candidate/release metadata
fields.
Update version metadata in pyproject.toml for bay, gull, ship,
shipyard-neo-mcp, and shipyard-neo-sdk.
Regenerate uv.lock entries to keep package versions aligned.
soft-delete no longer rejects active releases and now deactivates them
as part of deletion so active lookups skip deleted records.

update integration and unit tests to cover the new behavior and align
SDK/docs wording with server-side semantics.
Document valid `create_skill_payload` formats and note that top-level
scalar JSON values are not accepted.

Explain `payload_ref` usage for reusable storage and candidate attachment,
and add replay behavior details:
- browser replay is supported via the skill run endpoint and requires a
  JSON object payload with a non-empty `commands` array
- python/shell currently have no release-based replay endpoints
Ensure candidate promotion_release_id does not reference deleted or
missing releases.

When deleting a release, clear promotion pointers on matching
candidates in the same transaction. Also sanitize candidates on read
(get/list) to clean up historical dangling references and keep API
responses consistent.
Allow `create_payload` to take a JSON string and normalize it to an
object/array before sending the request.

Raise a `ValueError` for invalid or non-object/array JSON payloads to
fail fast, and add client tests for both success and failure paths.
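The normalization in the commit above can be sketched like so; normalize_payload is an illustrative stand-in for the SDK's create_payload handling, not its actual signature:

```python
import json
from typing import Any


def normalize_payload(payload: Any) -> Any:
    """Accept a JSON string and coerce it to an object/array before sending."""
    if isinstance(payload, str):
        try:
            payload = json.loads(payload)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON payload: {exc}") from exc
    if not isinstance(payload, (dict, list)):
        # fail fast on top-level scalars, mirroring the server-side rule
        raise ValueError("payload must be a JSON object or array")
    return payload
```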
promote requests from the test extra to core ship dependencies
and add tenacity, cachetools, tqdm, orjson, python-slugify,
and tomli to the runtime set.

regenerate uv.lock and requirements.txt to reflect the new
dependency graph, and update python-sandbox and shipyard-neo
skill docs to list the expanded web and utility libraries.
enqueue a warmup hook on successful sandbox creation so runtime
startup can begin without delaying API response completion.

skip warmup when returning an idempotency cache hit, and add unit
tests covering warmup scheduling and idempotent create paths.
introduce a warm pool system to reduce sandbox cold-start latency by
pre-warming instances and claiming them on create when available.

add global and per-profile warm pool configuration, startup/shutdown
lifecycle hooks, an in-process bounded warmup queue with workers, and a
periodic scheduler to replenish and rotate warm instances.

update sandbox creation flow to check idempotency first, attempt warm
claim before normal create, and enqueue warmup work through the queue
(with background-task fallback when queue is unavailable).

extend sandbox model/manager with warm pool state and atomic claim logic,
exclude warm pool instances from user listing and GC expiry/idle scans,
and add unit tests for claim behavior, manager methods, and queue
lifecycle/statistics.
Align bay, gull, ship, shipyard-neo-sdk, and shipyard-neo-mcp
versions and lockfiles for the 0.2.0 release.

Update Bay API v1 docs to clarify sandbox create behavior with
idempotency cache, warm pool claim fallback, and warmup queue
semantics.
Patch `utcnow` in warm pool manager unit tests to control `warm_ready_at`
and `warm_rotate_at` values directly.

This removes timing tolerance and sleep-based ordering, reducing flakiness
and making readiness/claim assertions exact and repeatable.

@sourcery-ai sourcery-ai bot left a comment


Sorry @w31r4, your pull request is larger than the review limit of 150000 diff characters

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant enhancements to the skill evolution control plane, focusing on completing the evolution loop, fixing signal pollution, and ensuring compatibility across various components. Additionally, it implements a warm pool feature to reduce sandbox startup latency, improving overall system performance and reliability.

Highlights

  • Skill Evolution Loop: Completed the skill evolution loop by adding goal-driven evaluation and automated promotion, enhancing the system's ability to improve skills continuously.
  • Signal Pollution Fixes: Addressed issues with scheduler re-triggering and outcome attribution, preventing signal pollution and ensuring accurate skill evolution.
  • API/SDK/MCP Compatibility: Ensured compatibility across API, SDK, and MCP for the new evolution candidate data format, resolving parsing errors and supporting diverse data structures.
  • Warm Pool Implementation: Implemented a warm pool feature to pre-start standby sandbox instances, significantly reducing cold-start latency for new sandboxes.
Changelog
  • demo_evolution.py
    • Added a demo script to showcase the skill evolution pipeline end-to-end, including goal declaration, seeding, failure reporting, and LLM mutation.
  • deploy/docker/config.yaml
    • Modified the configuration file to include warm pool settings, such as enabling the warm pool, setting the number of workers, and defining queue policies.
  • doc/bay_api_v1.md
    • Updated the API documentation to reflect the new warm pool implementation, detailing the creation process and API responses.
  • doc/skills_enhancement_roadmap_zh.md
    • Updated the skills enhancement roadmap to indicate the planning status of skill extraction intelligence.
  • doc/skills_self_update_guide_zh.md
    • Added documentation for the extraction pipeline configuration, including LLM-assisted extraction settings.
  • pkgs/bay/app/api/v1/sandboxes.py
    • Implemented warm pool claiming logic in the sandbox creation endpoint, prioritizing warm instances and adding background tasks for warmup.
  • pkgs/bay/app/api/v1/skills.py
    • Added API endpoints for declaring skill goals, retrieving active skills, reporting skill outcomes, and deleting releases/candidates.
  • pkgs/bay/app/config.py
    • Added configuration classes for LLM-assisted extraction, skill evolution, and warm pool settings.
  • pkgs/bay/app/main.py
    • Initialized and shut down the evolution scheduler and warm pool services during application lifespan.
  • pkgs/bay/app/managers/sandbox/sandbox.py
    • Implemented methods for claiming warm sandboxes, creating warm sandboxes, and marking sandboxes as available or retiring.
  • pkgs/bay/app/models/__init__.py
    • Imported and exposed new models related to skill evolution and warm pool.
  • pkgs/bay/app/models/sandbox.py
    • Added attributes to the Sandbox model to support warm pool functionality, such as warm state and rotation times.
  • pkgs/bay/app/models/skill.py
    • Extended the SkillCandidate model with fields for human-readable documentation, evolution lineage, and deletion tracking; added models for SkillGoal and SkillOutcome.
  • pkgs/bay/app/services/gc/tasks/expired_sandbox.py
    • Modified the garbage collection task to exclude warm pool sandboxes.
  • pkgs/bay/app/services/gc/tasks/idle_session.py
    • Modified the idle session cleanup task to exclude warm pool sandboxes.
  • pkgs/bay/app/services/skills/evolution/agent.py
    • Implemented the SkillMutationAgent to generate mutated skill candidates using LLM and MetaPrompt strategies.
  • pkgs/bay/app/services/skills/evolution/evaluator.py
    • Implemented the GoalConditionedEvaluator to evaluate mutation candidates against declared goals and rubrics.
  • pkgs/bay/app/services/skills/evolution/lifecycle.py
    • Implemented lifecycle management for the evolution scheduler, including initialization, running, and shutdown.
  • pkgs/bay/app/services/skills/evolution/llm.py
    • Implemented the LlmEvolutionClient for interacting with LLMs to generate mutations and evaluate skills.
  • pkgs/bay/app/services/skills/evolution/meta_prompt.py
    • Implemented the MetaPromptService to manage the archive of mutation instructions.
  • pkgs/bay/app/services/skills/evolution/rubric.py
    • Implemented the SkillRubric data class and RubricGenerator for creating evaluation rubrics from natural language goals.
  • pkgs/bay/app/services/skills/evolution/scheduler.py
    • Implemented the EvolutionScheduler to periodically scan for skills with sufficient failures and trigger mutations.
  • pkgs/bay/app/services/skills/extraction/__init__.py
    • Added browser learning extraction strategies.
  • pkgs/bay/app/services/skills/extraction/base.py
    • Added extraction strategy interfaces and shared data structures.
  • pkgs/bay/app/services/skills/extraction/llm_strategy.py
    • Implemented LLM-assisted extraction strategy with automatic fallback.
  • pkgs/bay/app/services/skills/extraction/rule_strategy.py
    • Implemented rule-based extraction strategy for browser learning.
  • pkgs/bay/app/services/skills/scheduler.py
    • Refactored browser learning scheduler to use extraction strategies and handle candidate deduplication.
  • pkgs/bay/app/services/skills/service.py
    • Modified skill lifecycle service to include methods for declaring goals, retrieving active skills, recording outcomes, and managing skill releases and candidates.
  • pkgs/bay/app/services/warm_pool/__init__.py
    • Added warm pool service for pre-warming sandbox instances.
  • pkgs/bay/app/services/warm_pool/lifecycle.py
    • Implemented warm pool lifecycle management for FastAPI lifespan integration.
  • pkgs/bay/app/services/warm_pool/queue.py
    • Implemented WarmupQueue for warmup throttling.
  • pkgs/bay/app/services/warm_pool/scheduler.py
    • Implemented WarmPoolScheduler for periodic pool maintenance.
  • pkgs/bay/config.yaml.example
    • Added example configuration for warm pool settings.
  • pkgs/bay/pyproject.toml
    • Incremented version to 0.2.0.
  • pkgs/bay/tests/integration/core/test_browser_skill_e2e.py
    • Added tests for browser skill execution with variable payloads and browser learning deduplication.
  • pkgs/bay/tests/integration/core/test_skill_evolution_api.py
    • Added integration tests for skill evolution API endpoints, covering goal declaration, active skill retrieval, and outcome reporting.
  • pkgs/bay/tests/unit/api/test_sandbox_create_warmup.py
    • Added unit tests for sandbox creation endpoint warmup behavior.
  • pkgs/bay/tests/unit/managers/test_browser_learning_scheduler.py
    • Added unit tests for browser learning scheduler, covering deduplication and LLM strategy integration.
  • pkgs/bay/tests/unit/managers/test_skill_evolution_service.py
    • Added unit tests for skill evolution service methods, covering goal declaration, outcome reporting, and active skill retrieval.
  • pkgs/bay/tests/unit/managers/test_skill_mutation_agent.py
    • Added unit tests for SkillMutationAgent, covering mutation creation, linking, reasoning, and LLM integration.
  • pkgs/bay/tests/unit/warm_pool/__init__.py
    • Added warm pool unit tests.
  • pkgs/bay/tests/unit/warm_pool/test_create_with_claim.py
    • Added unit tests for sandbox creation with warm pool claim behavior.
  • pkgs/gull/pyproject.toml
    • Incremented version to 0.2.0.
  • pkgs/ship/pyproject.toml
    • Incremented version to 0.2.0 and added new dependencies.
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/ci.yml
    • .github/workflows/nightly-browser-e2e.yml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR makes a series of important improvements to the skill evolution control plane, successfully building a complete loop from goal definition through mutation and evaluation to automatic promotion. Main changes:

  1. Completed the evolution loop: introduced SkillGoal, RubricGenerator, and GoalConditionedEvaluator, so the evolution process is driven by human intent and candidate skills can be promoted automatically based on evaluation results.
  2. Fixed scheduling and signal issues: scheduler dedup and the record_outcome attribution check effectively prevent duplicate triggering and signal pollution.
  3. Full-stack compatibility: the API, SDK, and MCP all accept the new list format for evolution candidate preconditions/postconditions, strengthening backward compatibility.
  4. Introduced a warm pool: pre-warming sandbox instances significantly reduces cold-start latency.

Overall the changes are well designed and cleanly modularized, with solid unit and integration test coverage; this is an important step forward in the project's intelligence and robustness. Code quality is high, with a few minor suggestions below, mainly about exception-handling precision and demo-script completeness.

Comment on lines +374 to +377

```python
if isinstance(preconditions, list):
    print(f"    Pre: " + "\n    ".join(preconditions))
if isinstance(postconditions, list):
    print(f"    Post: " + "\n    ".join(postconditions))
```


medium

The logic that displays evolved candidates currently handles only the case where preconditions and postconditions are lists. Per the PR description these fields also support the dict format. To make the demo fully reflect the dual-format compatibility, consider adding print logic for the dict case as well, e.g. elif isinstance(preconditions, dict):.

```python
    return None
try:
    parsed = json.loads(raw)
except Exception:
```


medium

In _json_field_to_obj, catching the overly broad Exception can mask unexpected errors and silently turn them into None. Consider narrowing the exception type to the more specific json.JSONDecodeError, so expected parse failures are handled precisely while other unexpected errors can surface.

Suggested change

```diff
-except Exception:
+except json.JSONDecodeError:
```

Comment on lines +325 to +326

```python
except Exception:
    pass
```


medium

In _decode_conditions, the broad except Exception: pass silently swallows all parse errors, which can lead to hard-to-debug problems. When JSON parsing fails, consider at least logging a warning so potential data-format issues can be traced.

Suggested change

```diff
-except Exception:
-    pass
+except Exception as exc:
+    logger.warning("skill_mutation.decode_conditions_failed", json_str=json_str, error=str(exc))
+    pass
```

