AGENTS.md

EvalScope — LLM evaluation framework with a registry-based plugin architecture. This file is the contract for AI coding agents working in this repo.

Setup

pip install -e .       # basic install
make dev               # dev + perf + docs extras + pre-commit

Python ≥ 3.10 (3.10 / 3.11 / 3.12). Dependencies: requirements/framework.txt + pyproject.toml [project.optional-dependencies] (extras: opencompass, vlmeval, rag, perf, app, aigc, sandbox, service, dev, docs, all, plus per-benchmark extras).

Build, lint, test

make lint                                                                       # required before commit (yapf + isort + flake8 + basic pre-commit hooks)
pytest tests/cli/test_all.py::TestRun::test_ci_lite -v -s -p no:warnings        # CI smoke test
pytest tests/perf/test_perf_basic.py::TestPerfBasic::test_multi_parallel_sweep -v -s    # perf

Commits failing make lint are rejected on main.

Docs generation

Benchmark detail pages (docs/{zh,en}/benchmarks/<name>.md) and meta cache (evalscope/benchmarks/_meta/<name>.json) are auto-generated from each adapter's BenchmarkMeta.description + dataset statistics. Do not hand-edit those files.

When you add a benchmark or change its BenchmarkMeta.description, run:

make docs-pipeline BENCHMARK="<name1> <name2>" FORCE=1   # update _meta JSON + translate descriptions to zh
make docs-generate                                        # render .md files from _meta

Targets: docs-update (meta only), docs-update-stats (+ dataset statistics), docs-translate (zh), docs-pipeline (stats + translate), docs-generate (.md), docs (full Sphinx HTML build).

Conventions:

BENCHMARK="a b c" selects specific benchmarks; omit it to process all of them (--all).
FORCE=1 appends --force, which recomputes even if data is cached.
WORKERS=N sets the number of parallel workers (default 4).
--translate calls an LLM and needs DASHSCOPE_API_KEY (or an equivalent key) in the environment.

Quick eval

evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --limit 5

from evalscope import run_task, TaskConfig
run_task(TaskConfig(model='Qwen/Qwen2.5-0.5B-Instruct', datasets=['gsm8k'], limit=5))

Code style (enforced)

Line width 120, 4-space indent, LF endings, trailing newline at EOF.
Quotes are governed by the double-quote-string-fixer hook; follow the existing file style and do not mix quote styles.
f-strings for formatting (no % or .format() unless necessary).
Imports: isort with first_party = evalscope, groups STDLIB / THIRDPARTY / LOCALFOLDER, multi_line_output=3.
Type hints required on every function signature.
English only for comments and docstrings.
Public APIs need docstrings; internal helpers only when intent is non-obvious.
Mark pending work with a # TODO: prefix.

| Element | Style |
| --- | --- |
| Class | `PascalCase` |
| Function / variable | `snake_case` |
| Constant | `UPPER_SNAKE_CASE` |
| Private | `_leading_underscore` |
| Handler function | `handle_` prefix |
| Benchmark adapter file | `<name>_adapter.py` |

flake8 ignore list (setup.cfg): F401, F403, F405, F821, W503, E251, W504, F824, F541, E501, E226, E121-E129, E131, E741. Do not expand — new ignores must be justified in the PR.

Design rules

Early returns over nested conditionals.
Minimal changes: only touch code related to the current task; no drive-by cleanup.
Pydantic-first: cross-module data contracts use Pydantic models. Use TaskConfig / Arguments for configuration — never raw dicts at module boundaries.
Reuse existing patterns: new benchmarks / models / metrics go through existing registries and adapter base classes — no parallel mechanisms.
DRY but don't over-abstract just to remove minor duplication.
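The "early returns" rule can be sketched as follows. Both functions are hypothetical and exist only for contrast: the first nests its conditions, the second expresses the same behavior with guard clauses.

```python
from typing import Optional


def describe_limit_nested(limit: Optional[float]) -> str:
    """Nested version, shown only for contrast."""
    if limit is not None:
        if limit > 0:
            return f'limit={limit}'
        else:
            return 'invalid'
    else:
        return 'no limit'


def describe_limit(limit: Optional[float]) -> str:
    """Same behavior, written with guard clauses and early returns."""
    if limit is None:
        return 'no limit'
    if limit <= 0:
        return 'invalid'
    return f'limit={limit}'
```

The second form keeps the main path at one indentation level, which is the shape this repo prefers.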

Tests

Tests live under tests/; files are named *test*.py, classes Test*, and functions test_*.
Every new benchmark, model, or metric must ship a minimal runnable test (pattern: tests/cli/test_all.py::TestRun::test_ci_lite).
Mock external services — no reliance on real network / paid APIs.
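One way to follow the "mock external services" rule is `unittest.mock` from the standard library. The sketch below is an assumption about how such a test might look, not a test taken from this repo: `ask_model` and the `complete()` method on the client are hypothetical.

```python
import unittest
from typing import Any
from unittest.mock import Mock


def ask_model(client: Any, prompt: str) -> str:
    """Return the stripped completion text from a chat-style client (hypothetical helper)."""
    response = client.complete(prompt)
    return response['text'].strip()


class TestAskModel(unittest.TestCase):

    def test_ask_model_uses_mocked_client(self) -> None:
        # The Mock stands in for a remote API, so no network call is made.
        client = Mock()
        client.complete.return_value = {'text': '  42 \n'}

        self.assertEqual(ask_model(client, 'What is 6 * 7?'), '42')
        client.complete.assert_called_once_with('What is 6 * 7?')
```

The file, class, and function names match the `*test*.py` / `Test*` / `test_*` patterns, so pytest would collect the test without extra configuration.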

Architecture pointers

Don't try to learn the architecture from this file — read these and grep:

| Topic | Source of truth |
| --- | --- |
| Main flow | `evalscope/run.py` → `evalscope/evaluator/evaluator.py` |
| Config schema | `evalscope/config.py` (`TaskConfig`) |
| Registries | `evalscope/api/registry.py` |
| Benchmark contract | `evalscope/api/benchmark/benchmark.py` (`DataAdapter`, `BenchmarkMeta`) |
| Model layer | `evalscope/api/model/model.py`, `evalscope/models/model_apis.py` |
| CLI dispatch | `evalscope/cli/` |
| Cache schema | `evalscope/api/evaluator/cache.py` |

Registry decorators: @register_benchmark, @register_model_api, @register_metric, @register_aggregation, @register_filter, @register_evaluator.
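For readers unfamiliar with the registry-decorator pattern, here is a simplified, self-contained sketch. It is not EvalScope's implementation (read `evalscope/api/registry.py` for that): `register_demo_benchmark`, `get_demo_benchmark`, and `DemoQAAdapter` are hypothetical, and the real `@register_benchmark` takes a `BenchmarkMeta` rather than a bare name.

```python
from typing import Callable, Dict

# Hypothetical stand-in registry; the real one lives in evalscope/api/registry.py.
_BENCHMARKS: Dict[str, type] = {}


def register_demo_benchmark(name: str) -> Callable[[type], type]:
    """Return a class decorator that stores the class under ``name``."""

    def decorator(cls: type) -> type:
        if name in _BENCHMARKS:
            raise ValueError(f'Benchmark {name!r} is already registered')
        _BENCHMARKS[name] = cls
        return cls

    return decorator


@register_demo_benchmark('demo_qa')
class DemoQAAdapter:
    """Toy adapter used only to show registration."""


def get_demo_benchmark(name: str) -> type:
    """Look up a registered class by name."""
    return _BENCHMARKS[name]
```

Registration happens as a side effect of importing the module that defines the class, which is why new components should go through the existing decorators instead of a parallel lookup table.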

Adapter base classes (extend, don't reinvent): DefaultDataAdapter, MultiChoiceAdapter, VisionLanguageAdapter, Text2ImageAdapter, ImageEditAdapter, NERAdapter, AgentAdapter. Optional capabilities via mixins: LLMJudgeMixin, SandboxMixin.

Non-native backends live under evalscope/backend/ (OpenCompass, VLMEvalKit, RAGEval) and are dispatched from run.py with their own BackendManager.

Adding a benchmark

Create evalscope/benchmarks/<name>/<name>_adapter.py.
Extend DefaultDataAdapter and override record_to_sample() (and optionally sample_to_fewshot(), extract_answer()).
Decorate the class with @register_benchmark(BenchmarkMeta(name=..., ...)).
The adapter is auto-discovered by globbing evalscope/benchmarks/*/**/*_adapter.py, so no manual import is needed.
Add a smoke test.
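To check which file locations the discovery pattern from the steps above would pick up, the sketch below applies the same pattern with Python's `glob` module to a throwaway directory. It assumes discovery behaves like `glob.glob(..., recursive=True)`; the helper names and the demo file layout are hypothetical.

```python
import glob
import os
import tempfile
from typing import List


def find_adapter_files(benchmarks_dir: str) -> List[str]:
    """Return files matching <benchmarks_dir>/*/**/*_adapter.py, relative to that directory."""
    pattern = os.path.join(benchmarks_dir, '*', '**', '*_adapter.py')
    matches = glob.glob(pattern, recursive=True)
    return sorted(os.path.relpath(path, benchmarks_dir).replace(os.sep, '/') for path in matches)


def build_demo_tree(root: str) -> None:
    """Create a throwaway layout with two matching files and two that should be skipped."""
    for rel in ('gsm8k/gsm8k_adapter.py', 'suite/sub/task_adapter.py', 'gsm8k/utils.py', 'top_adapter.py'):
        path = os.path.join(root, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'w', encoding='utf-8') as f:
            f.write('')


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    found = find_adapter_files(tmp)
```

Under that assumption, an adapter must sit at least one directory below `benchmarks/` and its filename must end in `_adapter.py`; a file placed directly in `benchmarks/` or named differently would be skipped.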

Conventions & gotchas

eval_type: openai_api, llm_ckpt, mock_llm, text2image, image_editing. Deprecated aliases: server → openai_api, checkpoint → llm_ckpt.
limit: int = count, float = fraction.
repeats: duplicates each item for k-metrics. generation_config.n is deprecated and mapped to repeats.
Use generation_config for runtime params. TaskConfig.timeout / stream are deprecated — forwarded with a warning.
dataset_args is merged through BenchmarkMeta._update(); it supports local_path, and filters (an OrderedDict) are prepended.
Models are memoized by (name, config, base_url, api_key, args).
Use @thread_safe for model creation, run_in_threads_with_progress for concurrent eval.
Outputs land in outputs/<timestamp>/{logs,predictions,reviews,reports,configs}/ (see OutputsStructure). use_cache resumes runs; rerun_review recomputes scores only.
evalscope app CLI command is deprecated (see evalscope/cli/start_app.py) — use evalscope service for the Web dashboard.
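The `eval_type` aliases and the `limit` semantics listed above can be sketched as two small helpers. Both functions are hypothetical and are not EvalScope code. Treating `None` as "no limit", truncating the fraction with `int()`, and capping the result at the dataset size are assumptions the text above does not state.

```python
from typing import Dict, Union

# Deprecated aliases listed above, mapped to their current names.
DEPRECATED_EVAL_TYPES: Dict[str, str] = {'server': 'openai_api', 'checkpoint': 'llm_ckpt'}


def normalize_eval_type(eval_type: str) -> str:
    """Translate a deprecated alias to its current eval_type; pass other values through."""
    return DEPRECATED_EVAL_TYPES.get(eval_type, eval_type)


def resolve_limit(limit: Union[int, float, None], total: int) -> int:
    """Return how many of ``total`` samples to keep (hypothetical helper).

    An int is an absolute count; a float is a fraction of the dataset.
    Truncation via int() and the None default are assumptions, not documented behavior.
    """
    if limit is None:
        return total
    if isinstance(limit, float):
        return min(total, int(total * limit))
    return min(total, limit)
```

Note that under this reading `limit=1` keeps one sample while `limit=1.0` keeps the whole dataset, so the Python type of the value matters.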

Submission

make dev      # once
make lint     # before every commit
pytest tests/cli/test_all.py::TestRun::test_ci_lite -v -s -p no:warnings
