Add WebSkills support to demos and evals CLI #49

Draft
MiguelsPizza wants to merge 1 commit into GoogleChromeLabs:main from MiguelsPizza:webmcp-skills

@MiguelsPizza MiguelsPizza commented Mar 2, 2026

This PR adds WebSkills support across the demos and the evals CLI. Pages declare skills in <script type="agent-context"> blocks, which agents can discover and read progressively.
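
To make the discovery step concrete, here is a minimal sketch of how a consumer might collect the raw text of the <script type="agent-context"> blocks from a page's HTML. The function name extractAgentContextBlocks is hypothetical, and the payload format inside each block is not described here, so the sketch returns the text unparsed. A regular expression keeps the example self-contained; a real DOM parser would be more robust.

```typescript
// Sketch only: collect the raw text of <script type="agent-context"> blocks.
// Assumption: the payload format is unspecified, so it is returned unparsed.
// extractAgentContextBlocks is a hypothetical helper, not part of this PR.
function extractAgentContextBlocks(html: string): string[] {
  // Matches an opening <script> tag whose type attribute is "agent-context"
  // (single or double quoted), then lazily captures up to the closing tag.
  const pattern =
    /<script\b[^>]*\btype\s*=\s*(["'])agent-context\1[^>]*>([\s\S]*?)<\/script\s*>/gi;
  const blocks: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    // Group 2 is the block body; trim surrounding whitespace.
    blocks.push((match[2] ?? "").trim());
  }
  return blocks;
}
```

Scripts with any other type, or with no type at all, are ignored, so ordinary JavaScript on the page is never returned.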

  • All three demos (french-bistro, pizza-maker, react-flightsearch) embed <script type="agent-context"> skill blocks with inline reference documents
  • --skill flag on runevals loads a SKILL.md for benchmark comparison (baseline vs skill-augmented)
  • Multi-step agent loop handles read_site_context disclosure calls using agent-skills-ts-sdk
  • Extracted loadSkillFromFile() into evaluator/skill.ts for reuse
  • Deterministic eval date via WEBMCP_EVAL_DATE env var
  • Pizza-maker and travel skill eval test sets with SKILL.md files and reference docs
  • Updated README with --skill flag docs and benchmark mode
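
As an illustration of the deterministic eval date item above, the following sketch resolves the date that evals would treat as "today". Only the WEBMCP_EVAL_DATE variable name comes from this PR; the function name resolveEvalDate, the YYYY-MM-DD format, and the UTC interpretation are assumptions and may differ from the actual implementation.

```typescript
// Sketch only: resolve the date the evals treat as "today".
// WEBMCP_EVAL_DATE is named in this PR; the YYYY-MM-DD format, the UTC
// interpretation, and the function name are assumptions for illustration.
function resolveEvalDate(
  env: Record<string, string | undefined>,
  now: () => Date = () => new Date(),
): Date {
  const raw = env["WEBMCP_EVAL_DATE"];
  if (raw === undefined || raw.trim() === "") {
    return now(); // no override set: fall back to the wall clock
  }
  const match = /^(\d{4})-(\d{2})-(\d{2})$/.exec(raw.trim());
  if (match === null) {
    throw new Error(`WEBMCP_EVAL_DATE must be YYYY-MM-DD, got "${raw}"`);
  }
  const year = Number(match[1]);
  const month = Number(match[2]);
  const day = Number(match[3]);
  const date = new Date(Date.UTC(year, month - 1, day));
  // Reject values that roll over into another month, such as 2026-02-30.
  if (date.getUTCMonth() !== month - 1 || date.getUTCDate() !== day) {
    throw new Error(`WEBMCP_EVAL_DATE is not a real calendar date: "${raw}"`);
  }
  return date;
}
```

In Node.js this would be called as resolveEvalDate(process.env). Pinning the date this way keeps prompts that mention relative dates ("next Friday") stable across runs, which matters when comparing baseline and skill-augmented results.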

Extension submodule updated: beaufortfrancois/model-context-tool-inspector#9 (must merge first)

Eval results (gemini-2.5-flash, 2026-03-03)

Single-call skill evals (4 runs each):

| Scenario | Baseline | With Skill | Delta     |
|----------|----------|------------|-----------|
| Pizza    | 87.5%    | 100.0%     | +12.5 pts |
| Travel   | 54.2%    | 58.3%      | +4.2 pts  |

Results vary from run to run because model output is stochastic.
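
For readers checking the arithmetic, this sketch shows one way a pass rate and a delta in percentage points could be computed. The helper names are hypothetical, and the PR does not report raw pass counts. Computing the delta from unrounded rates would explain why the Travel delta reads +4.2 pts even though the rounded rates differ by 4.1.

```typescript
// Sketch only: pass rate and delta in percentage points.
// Helper names are hypothetical; raw counts are not reported in this PR.
function passRate(passed: number, total: number): number {
  if (total <= 0) {
    throw new Error("total must be positive");
  }
  return (passed / total) * 100;
}

// Delta is taken on the unrounded rates and only rounded for display.
function deltaPoints(baselineRate: number, skillRate: number): number {
  return skillRate - baselineRate;
}

// Format to one decimal place, matching the precision used in the table.
function formatRate(value: number): string {
  return value.toFixed(1);
}
```

For example, with purely hypothetical counts of 13 and 14 passes out of 24, the rates format as 54.2 and 58.3 while the delta formats as 4.2, because 4.1666... rounds up.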

Test plan

  • cd evals-cli && npm run build && npm test
  • Pizza skill benchmark: node dist/bin/runevals.js --tools=examples/pizza-maker/schema.json --evals=examples/pizza-maker/skill-evals.json --skill=examples/pizza-maker/skill/SKILL.md
  • Travel skill benchmark: node dist/bin/runevals.js --tools=examples/travel/schema.json --evals=examples/travel/skill-evals.json --skill=examples/travel/skill/SKILL.md
  • Verify demo HTML skill blocks are well-formed

google-cla bot commented Mar 2, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@MiguelsPizza MiguelsPizza marked this pull request as draft March 2, 2026 18:19
- All three demos embed <script type="agent-context"> skill blocks
- --skill flag on runevals for benchmark comparison
- Multi-step agent loop handles read_site_context disclosure calls
- Extracted loadSkillFromFile() into evaluator/skill.ts
- Deterministic eval date via WEBMCP_EVAL_DATE env var
- Pizza-maker and travel skill eval test sets with SKILL.md files
- Drop resource names from disclosure prompt entries
- Type safety improvements (Record<string,unknown> over any)