This repository serves as the benchmarking system for efforts to provide an agentic interface to cBioPortal.org. It is designed to evaluate various "agents" (like MCP servers or standalone APIs) that answer questions about cancer genomics data.
🏆 View the [Leaderboard](LEADERBOARD.md) to see current benchmark results.
The system provides a modular CLI to:
- Ask single questions to different agents.
- Batch process a set of questions.
- Benchmark agents against a gold-standard dataset, automatically evaluating their accuracy using an LLM judge.
The system currently supports the following agent types via the `--agent-type` flag:
- `mcp-clickhouse`: The original Model Context Protocol (MCP) agent, connected to a ClickHouse database.
- `cbio-agent-null`: A baseline/testing agent (or a specific implementation hosted at a URL).
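For example, the same question can be sent to either agent just by switching this flag (a preview of the `ask` command shown in full below):

```bash
# Route the same question to a different agent by changing --agent-type
cbioportal-mcp-qa ask "How many studies are there?" --agent-type mcp-clickhouse
cbioportal-mcp-qa ask "How many studies are there?" --agent-type cbio-agent-null
```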
To set up the environment:

```bash
# Create Python 3.13 virtual environment
uv venv .venv --python 3.13
source .venv/bin/activate
# Install dependencies in editable mode
uv sync --editable
```

Create a `.env` file or export the following environment variables:
- General:
  - `ANTHROPIC_API_KEY`: Required for the LLM judge (evaluation) and the `mcp-clickhouse` agent.
- For `cbio-agent-null`:
  - `CBIO_NULL_AGENT_URL`: URL of the agent API (e.g., `http://localhost:8000`).
- For `mcp-clickhouse`:
  - `CLICKHOUSE_HOST`, `CLICKHOUSE_USER`, `CLICKHOUSE_PASSWORD`, `CLICKHOUSE_DATABASE`: Connection details.
- Optional (tracing):
  - `PHOENIX_API_KEY`: For Arize Phoenix tracing.
  - `PHOENIX_COLLECTOR_ENDPOINT`: Tracing endpoint.
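For example, a `.env`-style setup might look like the following sketch; the variable names come from the list above, while all values are placeholders rather than real defaults:

```bash
# LLM judge + mcp-clickhouse agent
export ANTHROPIC_API_KEY="sk-ant-..."

# cbio-agent-null
export CBIO_NULL_AGENT_URL="http://localhost:8000"

# mcp-clickhouse connection (placeholder values)
export CLICKHOUSE_HOST="localhost"
export CLICKHOUSE_USER="default"
export CLICKHOUSE_PASSWORD="changeme"
export CLICKHOUSE_DATABASE="cbioportal"

# Optional Arize Phoenix tracing
export PHOENIX_API_KEY="..."
export PHOENIX_COLLECTOR_ENDPOINT="https://your-phoenix-collector.example"
```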
The `benchmark` command is the main way to evaluate an agent. It automates answer generation, evaluation, and leaderboard updates.
```bash
# Run benchmark for the null agent
cbioportal-mcp-qa benchmark --agent-type cbio-agent-null --questions 1-5
# Run benchmark for the MCP agent
cbioportal-mcp-qa benchmark --agent-type mcp-clickhouse
```

What happens:
- Questions are loaded from `input/autosync-public.csv`.
- The specified agent generates answers.
- Answers are saved to `results/{agent_type}/{YYYYMMDD}/answers/`.
- `simple_eval.py` evaluates the answers against the expected output (using `Navbot Expected Link` as the ground truth).
- Results are saved to `results/{agent_type}/{YYYYMMDD}/eval/`.
- `LEADERBOARD.md` is updated with the latest scores.
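For instance, a null-agent run on 20250115 (an illustrative date) would leave its output under paths like these; individual file names inside `answers/` and `eval/` are omitted:

```text
results/
└── cbio-agent-null/
    └── 20250115/
        ├── answers/   # generated answers
        └── eval/      # LLM-judge evaluation results
```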
You can also run individual components manually.
Ask a single question:

```bash
cbioportal-mcp-qa ask "How many studies are there?" --agent-type cbio-agent-null
```

Generate answers without running the full benchmark evaluation:
```bash
cbioportal-mcp-qa batch input/autosync-public.csv --questions 1-10 --output-dir my_results/
```

Run the evaluation script on existing output files:
```bash
python simple_eval.py \
  --input-csv input/autosync-public.csv \
  --answers-dir my_results/ \
  --answer-column "Navbot Expected Link"
```

Project layout:

- `src/cbioportal_mcp_qa/`: Source code.
  - `main.py`: CLI entry point.
  - `benchmark.py`: Benchmarking workflow logic.
  - `evaluation.py`: Core evaluation logic (LLM judge).
  - `base_client.py`: Abstract base class for agents.
  - `null_agent_client.py`: Client for `cbio-agent-null`.
  - `llm_client.py`: Client for `mcp-clickhouse`.
- `input/`: Benchmark datasets (e.g., `autosync-public.csv`).
- `results/`: Generated answers and evaluation reports.
- `simple_eval.py`: Wrapper script for running evaluation manually.
- `agents/`: Docker Compose configurations for running external agent services, such as a `docker-compose.yml` for `cbio-null-agent`.
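As a sketch, the null agent can be started locally before benchmarking; the compose file path below is an assumption (adjust it to wherever the `cbio-null-agent` configuration actually lives under `agents/`):

```bash
# Start the cbio-null-agent service (compose file path is an assumption)
docker compose -f agents/docker-compose.yml up -d

# Point the benchmark at the running service
export CBIO_NULL_AGENT_URL="http://localhost:8000"
cbioportal-mcp-qa benchmark --agent-type cbio-agent-null --questions 1-5
```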