This is the public roadmap for the Agentic AI-Guided Evaluation Platform. Knowing what's coming helps users and contributors plan, and lets the community give direct feedback on direction.
| Category | Description |
|---|---|
| Shipped | Generally available today. |
| In progress | Actively being worked on. Implementation details may still shift. |
| Coming soon | Next up. Think a couple of months out, give or take. |
| Researching | Being evaluated. The best phase to share use cases or design ideas. |
The roadmap can change at any time, and items here do not guarantee a feature ships as proposed. We don't publish target dates — operational stability and quality come first.
The evaluation engine (Inspect AI) already routes to OpenAI, Anthropic, Google, and Bedrock when the matching API key is present. The remaining provider lock-in is in dataset / judge / report synthesis, and in agent capture.
- Multi-provider eval execution via Inspect AI (Bedrock, OpenAI, Anthropic, Google)
- Cross-IDE install: Claude Code, Cursor, Kiro, Codex, VS Code (`uvx` flow)
- S3 team sharing with auto-replication and debounced pull-on-read
- File-based per-user storage at `~/.eval-mcp/users/{user}/`
- Provider-agnostic synthesis: decouple `generate_qa_pairs`, `generate_judge`, and `generate_report` from `BedrockClient` so users without AWS can bootstrap end-to-end on Anthropic or OpenAI keys alone
- OpenAI / Anthropic SDK capture: extend `bedrock_capture.py` with bridges for `openai-python` and the Anthropic SDK so agents calling those providers get captured into Inspect logs without code changes
- Cross-OS / cross-IDE install canary in CI — automated verification on macOS, Linux, and Windows with each release
- Concurrency hardening — load-test the 100-call Bedrock semaphore and the HTTP transport beyond the current ~43 lines of coverage
- Multi-tenant eval queue for shared deployments
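
The concurrency-hardening item above mentions load-testing a 100-call semaphore. As a rough illustration of what such a test could measure, here is a minimal sketch that bounds fake provider calls with `asyncio.Semaphore` and records the peak number in flight. The names `run_bounded` and `fake_call` are hypothetical and are not taken from the platform's code; the sleep stands in for a real provider round trip.

```python
import asyncio

MAX_IN_FLIGHT = 100  # mirrors the 100-call limit named in the roadmap


async def run_bounded(n_calls: int, limit: int = MAX_IN_FLIGHT) -> int:
    """Run n_calls fake provider calls; return the peak concurrency observed."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fake_call() -> None:
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.001)  # stand-in for a real provider round trip
            in_flight -= 1

    await asyncio.gather(*(fake_call() for _ in range(n_calls)))
    return peak


if __name__ == "__main__":
    # The peak should never exceed the configured limit.
    print(asyncio.run(run_bounded(500)))
```

A real load test would swap the sleep for actual calls and also track latency and error rates; this sketch only checks that the bound holds.
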
Today the frontend ships a working streaming chat with markdown and tool-call display, plus a results dashboard built on colored text grids. There is no charting library installed and no dedicated pages for Datasets or Judges.
- SSE streaming chat with markdown, tool call → result trace, and file uploads
- Chat history viewer with split-pane session replay
- Results dashboard with per-criterion / per-model score grid and pipeline stage overview
- Data view with real charts — bar / line / scatter / heatmap for score trends, latency-vs-accuracy, criterion × model matrices, and cost projections
- Past evals, made browsable — search, filter, sort, and tagging on the results list; replace 10s polling with push updates
- Datasets and Judges pages — first-class browsing UI for both, mirroring what the MCP exposes (today they're agent-only)
- Side-by-side eval comparison view (pick two past runs, diff scores per sample)
- Code-block syntax highlighting, copy buttons, and message regenerate in chat
- Inline document attachment previews
- Eval scheduling and saved templates from the UI
- Shareable per-eval public links (read-only) backed by S3
Inspect AI is invoked as a subprocess with `--adaptive-connections=true`. There's no judge-level cache, throughput hasn't been benchmarked, and the canary pre-flight has not been profiled.
- Inspect AI integration with adaptive connection tuning
- Jury system: multiple judges across model families, binary per-criterion scoring
- PDF report generation with narrative analysis and cost projection
- `retry_evaluation` for resuming incomplete samples
- Judge / scorer caching — avoid re-scoring identical (response, criterion) pairs across reruns and comparisons
- Throughput benchmarks — establish baselines for samples/minute across providers; profile canary and subprocess startup overhead
- Multi-judge integration tests at the eval-engine level (currently we test capture and subprocess wiring, not full jury runs)
- Cost-aware sample sizing recommendations from the agent
- Distributed eval workers backed by SQS / Step Functions for very large runs
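
To make the judge / scorer caching item above concrete, here is a minimal sketch of a verdict cache keyed on the (response, criterion) pair. Everything in it (`cache_key`, `JudgeCache`, `toy_judge`) is hypothetical and for illustration only; it is not the platform's API. It assumes binary verdicts, matching the per-criterion scoring described above.

```python
import hashlib
import json
from typing import Callable, Dict


def cache_key(response: str, criterion: str) -> str:
    """Stable key for a (response, criterion) pair."""
    # JSON-encoding the pair avoids collisions from plain string concatenation.
    payload = json.dumps([response, criterion], ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class JudgeCache:
    """In-memory verdict cache wrapping a judge callable (hypothetical)."""

    def __init__(self, judge: Callable[[str, str], bool]) -> None:
        self._judge = judge
        self._verdicts: Dict[str, bool] = {}
        self.hits = 0
        self.misses = 0

    def score(self, response: str, criterion: str) -> bool:
        key = cache_key(response, criterion)
        if key in self._verdicts:
            self.hits += 1
            return self._verdicts[key]
        self.misses += 1
        verdict = self._judge(response, criterion)
        self._verdicts[key] = verdict
        return verdict


def toy_judge(response: str, criterion: str) -> bool:
    """Stand-in judge: passes when the criterion text appears in the response."""
    return criterion.lower() in response.lower()


if __name__ == "__main__":
    cache = JudgeCache(toy_judge)
    cache.score("Cites Section 4 of the contract.", "section 4")
    cache.score("Cites Section 4 of the contract.", "section 4")  # served from cache
    print(cache.hits, cache.misses)
```

A production key would likely also need the judge model and prompt version, so that changing either invalidates old verdicts.
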
Two agentic frameworks are tested end-to-end (Strands, LangChain+LangGraph). RAG today is a single citation criterion in the legal example — there are no first-class retrieval, faithfulness, or grounding scorers.
- Strands and LangChain / LangGraph agent eval via OTLP capture
- Auto-detection of agent entry points (`run_agent`) and OTel bootstrap
- Document-grounded QA generation from PDFs and knowledge bases
- First-class RAG scorers — deterministic retrieval correctness (right docs fetched), faithfulness (answer grounded in context), hallucination flags
- Framework-agnostic capture, verified end-to-end — once the OpenAI and Anthropic SDK bridges land (see MCP & Providers), confirm agents built with CrewAI, AutoGen, OpenAI Agents SDK, and the native Claude SDK work without code changes. The goal is the agnostic claim holding, not per-framework code paths
- 30-second quickstart eval: one bundled example under `examples/` so a new user has something runnable immediately after install
- Framework-aware pipeline stages (RAG agents → retrieval + context_usage; tool-calling agents → tool_selection)
- Public-benchmark import recipes (MMLU, GSM8K, SQuAD-style) for cross-model comparison
- Multi-turn conversation evaluation with checkpoint scoring
- Tool-call sequence scoring (was the right tool picked, in the right order, with the right args?)
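
The tool-call sequence scoring item above asks three questions: right tool, right order, right args. One way these could map to deterministic binary checks is sketched below. `ToolCall` and `score_tool_calls` are hypothetical names chosen for illustration, and the exact-match comparison is an assumption rather than the planned design.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class ToolCall:
    """One tool invocation: tool name plus keyword arguments."""
    name: str
    args: Dict[str, Any] = field(default_factory=dict)


def score_tool_calls(expected: List[ToolCall], actual: List[ToolCall]) -> Dict[str, bool]:
    """Return one binary verdict per question: tools, order, args."""
    expected_names = [call.name for call in expected]
    actual_names = [call.name for call in actual]

    # Same tools, counting repeats, regardless of order.
    right_tools = sorted(expected_names) == sorted(actual_names)
    # Same tools in the same positions.
    right_order = expected_names == actual_names
    # Arguments are only comparable once the order lines up.
    right_args = right_order and all(
        e.args == a.args for e, a in zip(expected, actual)
    )
    return {
        "right_tools": right_tools,
        "right_order": right_order,
        "right_args": right_args,
    }


if __name__ == "__main__":
    expected = [ToolCall("search", {"query": "lease terms"}), ToolCall("summarize")]
    swapped = [ToolCall("summarize"), ToolCall("search", {"query": "lease terms"})]
    print(score_tool_calls(expected, swapped))
```

Exact matching is strict on purpose. A real scorer might allow equivalent argument values or optional extra calls, which would need looser comparison rules.
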
- Unit and adapter coverage for OTel receiver, agent detection, Bedrock capture, model discovery
- LangChain end-to-end integration test
- Subprocess isolation per agent invocation via ephemeral `uv` venvs
- End-to-end eval-engine integration tests across multiple frameworks and judges
- Real failure-mode regression tests (today `test_run_eval_fail_loud.py` only tests the predicate)
- Concurrent eval scaling tests against the 100-call Bedrock semaphore
- HTTP transport coverage parity with stdio
- Public benchmark harness so contributors can compare scoring stability across releases
Found a bug, want a feature, or have a use case the roadmap doesn't cover? Open an issue. Community-submitted issues get tagged `proposed` and reviewed by the team. See `CONTRIBUTING.md` for development setup.
- Inspect AI — evaluation framework
- Inspect AI documentation