
feat: add latest models, Google API support, and Openfort MCP server evals#5

Open
emauja wants to merge 1 commit into main from feat/latest-models-google-mcp


@emauja emauja commented Feb 28, 2026


Summary

  • Updated the model list to the latest models across 4 providers:

    • OpenAI: gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-5, o3, o3-mini, o4-mini
    • Anthropic: claude-opus-4-6, claude-sonnet-4-5-20250929, claude-haiku-4-5-20251001
    • Google (NEW): gemini-2.5-pro, gemini-2.5-flash
    • Vercel: v0-1.5-md
  • Added Google/Gemini provider support:

    • New @ai-sdk/google dependency for Gemini models
    • GOOGLE_API_KEY environment variable
    • Google rate limiting configuration (GOOGLE_RATE_LIMIT_RPM)
  • Added --provider flag to filter evals by model provider (-p openai, -p google, etc.)

  • Added --mcp flag and MCP runner for tool-assisted evaluations:

    • New src/runners/mcp.ts — MCP runner that connects to Openfort MCP server and provides tools to models during evaluation
    • New src/utils/mcp-client.ts — MCP client utility using @ai-sdk/mcp and Streamable HTTP transport
    • Default MCP server URL: https://mcp.openfort.io/sse (override via MCP_SERVER_URL_OVERRIDE)
    • bun start:mcp shortcut script
  • Created Openfort MCP server eval (evals/mcp-server):

    • Tests LLM ability to use Openfort MCP tools for project management, policy creation, contract registration, user/account management, and transaction simulation
    • 13 graders covering all MCP tool categories + 2 LLM-as-judge quality checks
  • Infrastructure improvements:

    • Created src/runners/shared.ts with reusable runner utilities (loadPrompt, loadGraders, runGraders, computeScore)
    • Fixed Clerk references in src/graders/catalog.ts and src/scorers/constants.ts — now Openfort-specific
    • Fixed LanguageModel type compatibility with AI SDK v5
    • Updated Framework type to support 'React' | 'Node.js' | 'Next.js' | 'MCP'
    • Updated README with new models table, CLI options, and MCP documentation
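
As a rough illustration of the --provider filter described above, the following is a minimal sketch and not the code in this PR. The `ModelEntry` shape, the `MODELS` array, and `filterByProvider` are hypothetical names chosen for the example; only the provider names and model IDs come from the summary.

```typescript
// Hypothetical registry shape; the actual model list in this PR may be structured differently.
type Provider = 'openai' | 'anthropic' | 'google' | 'vercel';

interface ModelEntry {
  provider: Provider;
  model: string;
}

// A subset of the models named in the summary.
const MODELS: ModelEntry[] = [
  { provider: 'openai', model: 'gpt-4.1' },
  { provider: 'openai', model: 'o4-mini' },
  { provider: 'anthropic', model: 'claude-opus-4-6' },
  { provider: 'google', model: 'gemini-2.5-pro' },
  { provider: 'google', model: 'gemini-2.5-flash' },
  { provider: 'vercel', model: 'v0-1.5-md' },
];

// Returns every model when no provider is given, otherwise only the
// models whose provider matches (case-insensitively).
function filterByProvider(models: ModelEntry[], provider?: string): ModelEntry[] {
  if (provider === undefined) return models;
  const wanted = provider.toLowerCase();
  return models.filter((m) => m.provider === wanted);
}

// Roughly what `-p google` would select in this sketch.
const googleModels = filterByProvider(MODELS, 'google');
```

In this sketch an unrecognized provider simply yields an empty list; a real CLI might prefer to reject it with an error message.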

Environment Variables Required

To run all models, you need:

OPENAI_API_KEY=
ANTHROPIC_API_KEY=
GOOGLE_API_KEY=      # NEW - for Gemini models
V0_API_KEY=
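
A pre-flight check for these keys could look like the sketch below. It is an assumption-laden example, not part of this PR: `PROVIDER_ENV_KEYS` and `missingApiKeys` are hypothetical names, and the pairing of `V0_API_KEY` with the Vercel provider is inferred from the model list above.

```typescript
// Assumed mapping from provider to the environment variable listed above.
const PROVIDER_ENV_KEYS = {
  openai: 'OPENAI_API_KEY',
  anthropic: 'ANTHROPIC_API_KEY',
  google: 'GOOGLE_API_KEY',
  vercel: 'V0_API_KEY',
} as const;

type ProviderName = keyof typeof PROVIDER_ENV_KEYS;

// Returns the providers whose API key is unset or blank in the given environment.
// The environment is passed in explicitly so the function is easy to test.
function missingApiKeys(env: Record<string, string | undefined>): ProviderName[] {
  return (Object.keys(PROVIDER_ENV_KEYS) as ProviderName[]).filter((provider) => {
    const value = env[PROVIDER_ENV_KEYS[provider]];
    return value === undefined || value.trim() === '';
  });
}
```

Calling `missingApiKeys(process.env)` before a full run would list the providers to skip or fix, which may be friendlier than failing partway through the suite.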

Test plan

  • bun run lint — passes (only pre-existing warnings remain)
  • bunx tsc --noEmit — TypeScript compiles cleanly
  • bun start --help — shows updated help with all models and new flags
  • bun start --provider google --eval evals/basic-setup — runs Google models only
  • bun start --mcp --eval evals/mcp-server — runs MCP eval with tool support
  • Full suite: bun start with all API keys configured

🤖 Generated with Claude Code

…evals

- Update model list to latest: GPT-4.1/4.1-mini/4.1-nano, GPT-5, o3/o3-mini/o4-mini,
  Claude Opus 4.6/Sonnet 4.5/Haiku 4.5, Gemini 2.5 Pro/Flash, v0-1.5-md
- Add Google provider support with @ai-sdk/google and GOOGLE_API_KEY
- Add --provider flag to filter evals by model provider (openai, anthropic, google, vercel)
- Add --mcp flag and MCP runner for tool-assisted evaluations via Openfort MCP server
- Create MCP server eval (evals/mcp-server) testing LLM ability to use Openfort MCP tools
- Port MCP client utility from upstream clerk-evals using @ai-sdk/mcp and StreamableHTTP
- Create shared runner utilities (runners/shared.ts) for code reuse
- Update graders catalog and scorers to use Openfort references instead of Clerk
- Fix LanguageModel type compatibility with AI SDK v5
- Add Google rate limiting support
- Update README with new models, CLI options, and MCP documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@socket-security


Review the following changes in direct dependencies. Learn more about Socket for GitHub.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | @ai-sdk/mcp@0.0.12 | 98 | 100 | 73 | 98 | 100 |
| Added | @ai-sdk/google@3.0.34 | 76 | 100 | 87 | 98 | 100 |
| Added | @modelcontextprotocol/sdk@1.27.1 | 99 | 100 | 100 | 99 | 100 |
