EvalScope — LLM evaluation framework with a registry-based plugin architecture. This file is the contract for AI coding agents working in this repo.
pip install -e . # basic install
make dev # dev + perf + docs extras + pre-commit

Python ≥ 3.10 (3.10 / 3.11 / 3.12). Dependencies: requirements/framework.txt + pyproject.toml [project.optional-dependencies] (extras: opencompass, vlmeval, rag, perf, app, aigc, sandbox, service, dev, docs, all, plus per-benchmark extras).
make lint # required before commit (yapf + isort + flake8 + basic pre-commit hooks)
pytest tests/cli/test_all.py::TestRun::test_ci_lite -v -s -p no:warnings # CI smoke test
pytest tests/perf/test_perf_basic.py::TestPerfBasic::test_multi_parallel_sweep -v -s # perf

Commits failing make lint are rejected on main.
Benchmark detail pages (docs/{zh,en}/benchmarks/<name>.md) and meta cache (evalscope/benchmarks/_meta/<name>.json) are auto-generated from each adapter's BenchmarkMeta.description + dataset statistics. Do not hand-edit those files.
When you add a benchmark or change its BenchmarkMeta.description, run:
make docs-pipeline BENCHMARK="<name1> <name2>" FORCE=1 # update _meta JSON + translate descriptions to zh
make docs-generate # render .md files from _meta

Targets: docs-update (meta only), docs-update-stats (+ dataset statistics), docs-translate (zh), docs-pipeline (stats + translate), docs-generate (.md), docs (full Sphinx HTML build).
Conventions:
- `BENCHMARK="a b c"` selects benchmarks; omit for `--all`.
- `FORCE=1` appends `--force` to recompute even if data is cached.
- `WORKERS=N`: parallelism (default 4).
- `--translate` calls an LLM; needs `DASHSCOPE_API_KEY` (or equivalent) in env.
evalscope eval --model Qwen/Qwen2.5-0.5B-Instruct --datasets gsm8k --limit 5

from evalscope import run_task, TaskConfig
run_task(TaskConfig(model='Qwen/Qwen2.5-0.5B-Instruct', datasets=['gsm8k'], limit=5))

- Line width 120, 4-space indent, LF endings, trailing newline at EOF.
- Quotes are governed by the `double-quote-string-fixer` hook: follow existing file style; do not mix.
- f-strings for formatting (no `%` or `.format()` unless necessary).
- Imports: isort with `first_party = evalscope`, groups `STDLIB / THIRDPARTY / LOCALFOLDER`, `multi_line_output=3`.
- Type hints required on every function signature.
- English only for comments and docstrings.
- Public APIs need docstrings; internal helpers only when intent is non-obvious.
- `# TODO:` prefix for pending work.
| Element | Style |
|---|---|
| Class | PascalCase |
| Function / variable | snake_case |
| Constant | UPPER_SNAKE_CASE |
| Private | _leading_underscore |
| Handler function | handle_ prefix |
| Benchmark adapter file | <name>_adapter.py |
flake8 ignore list (setup.cfg): F401, F403, F405, F821, W503, E251, W504, F824, F541, E501, E226, E121-E129, E131, E741. Do not expand — new ignores must be justified in the PR.
- Early returns over nested conditionals.
- Minimal changes: only touch code related to the current task; no drive-by cleanup.
- Pydantic-first: cross-module data contracts use Pydantic models. Use `TaskConfig` / `Arguments` for configuration; never raw dicts at module boundaries.
- Reuse existing patterns: new benchmarks / models / metrics go through existing registries and adapter base classes, with no parallel mechanisms.
- DRY but don't over-abstract just to remove minor duplication.
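
The style rules and the early-return principle above can be seen together in a small sketch. `format_score` and `DEFAULT_PRECISION` are hypothetical names chosen for illustration; they are not part of EvalScope.

```python
from typing import Optional

DEFAULT_PRECISION = 4  # constant: UPPER_SNAKE_CASE


def format_score(name: str, score: Optional[float], precision: int = DEFAULT_PRECISION) -> str:
    """Render a metric line such as 'acc: 0.7500' (hypothetical helper)."""
    # Early returns handle the edge cases first, so the main path stays flat.
    if not name:
        return ''
    if score is None:
        return f'{name}: n/a'
    # f-string formatting, with the precision passed as a nested field.
    return f'{name}: {score:.{precision}f}'
```
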
- Tests live under `tests/`; files `*test*.py`, classes `Test*`, functions `test_*`.
- New benchmark / model / metric must ship a minimal runnable test (pattern: `tests/cli/test_all.py::TestRun::test_ci_lite`).
- Mock external services: no reliance on real network / paid APIs.
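
A minimal sketch of the mocking rule, using only the standard library. `ask_model` and the `client.complete` method are hypothetical stand-ins for whatever code calls an external service; they are not EvalScope APIs.

```python
from typing import Any
from unittest.mock import Mock


def ask_model(client: Any, prompt: str) -> str:
    """Hypothetical code under test: sends a prompt through an injected client."""
    return client.complete(prompt).strip()


class TestAskModel:
    """Class and function names follow the Test* / test_* naming rule."""

    def test_ask_model_strips_whitespace(self) -> None:
        # The Mock replaces the real client, so no network call or paid API is involved.
        client = Mock()
        client.complete.return_value = '  42 \n'

        assert ask_model(client, 'What is 6 * 7?') == '42'
        client.complete.assert_called_once_with('What is 6 * 7?')
```

Injecting the client as an argument is one way to keep the external dependency replaceable in tests; patching a module-level client with `unittest.mock.patch` would serve the same purpose.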
Don't try to learn the architecture from this file — read these and grep:
| Topic | Source of truth |
|---|---|
| Main flow | evalscope/run.py → evalscope/evaluator/evaluator.py |
| Config schema | evalscope/config.py (TaskConfig) |
| Registries | evalscope/api/registry.py |
| Benchmark contract | evalscope/api/benchmark/benchmark.py (DataAdapter, BenchmarkMeta) |
| Model layer | evalscope/api/model/model.py, evalscope/models/model_apis.py |
| CLI dispatch | evalscope/cli/ |
| Cache schema | evalscope/api/evaluator/cache.py |
Registry decorators: @register_benchmark, @register_model_api, @register_metric, @register_aggregation, @register_filter, @register_evaluator.
Adapter base classes (extend, don't reinvent): DefaultDataAdapter, MultiChoiceAdapter, VisionLanguageAdapter, Text2ImageAdapter, ImageEditAdapter, NERAdapter, AgentAdapter. Optional capabilities via mixins: LLMJudgeMixin, SandboxMixin.
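
The registry-plus-base-class pattern can be illustrated without importing EvalScope. Everything below is a local stand-in written for illustration: `BENCHMARK_REGISTRY`, this simplified `register_benchmark`, and `BaseAdapter` only mimic the shape of the pattern, and a plain dict stands in for the framework's sample type. The real signatures are in `evalscope/api/registry.py` and `evalscope/api/benchmark/benchmark.py`.

```python
from typing import Any, Callable, Dict, Type


class BaseAdapter:
    """Stand-in base class; subclasses override only the hooks they need."""

    def record_to_sample(self, record: Dict[str, Any]) -> Dict[str, Any]:
        raise NotImplementedError


# Stand-in registry: maps a benchmark name to its adapter class.
BENCHMARK_REGISTRY: Dict[str, Type[BaseAdapter]] = {}


def register_benchmark(name: str) -> Callable[[Type[BaseAdapter]], Type[BaseAdapter]]:
    """Stand-in decorator: stores the class under `name` and returns it unchanged."""

    def decorator(cls: Type[BaseAdapter]) -> Type[BaseAdapter]:
        if name in BENCHMARK_REGISTRY:
            raise ValueError(f'Benchmark {name!r} is already registered')
        BENCHMARK_REGISTRY[name] = cls
        return cls

    return decorator


@register_benchmark('toy_qa')
class ToyQAAdapter(BaseAdapter):

    def record_to_sample(self, record: Dict[str, Any]) -> Dict[str, Any]:
        # Map one raw dataset row to the fields an evaluator would consume.
        return {'input': record['question'], 'target': str(record['answer'])}


# Look the adapter up by name, as a framework would, and convert one record.
adapter = BENCHMARK_REGISTRY['toy_qa']()
sample = adapter.record_to_sample({'question': '1 + 1 = ?', 'answer': 2})
```

Because lookup goes through the registry, adding a benchmark means adding one decorated class rather than editing a central dispatch table.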
Non-native backends live under evalscope/backend/ (OpenCompass, VLMEvalKit, RAGEval) and are dispatched from run.py with their own BackendManager.
- Create `evalscope/benchmarks/<name>/<name>_adapter.py`.
- Extend `DefaultDataAdapter`, override `record_to_sample()` (and optionally `sample_to_fewshot()`, `extract_answer()`).
- Decorate with `@register_benchmark(BenchmarkMeta(name=..., ...))`.
- Auto-discovered by globbing `evalscope/benchmarks/*/**/*_adapter.py`.
- Add a smoke test.
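
Which files the discovery glob picks up can be checked with `pathlib` on a throwaway directory tree. This is only a sketch of the pattern matching: `find_adapter_files` is a hypothetical helper, the file names are made up, and how EvalScope imports the matched modules is not shown.

```python
import tempfile
from pathlib import Path
from typing import List


def find_adapter_files(benchmarks_dir: Path) -> List[str]:
    """Return relative POSIX-style paths matching the */**/*_adapter.py pattern."""
    matches = benchmarks_dir.glob('*/**/*_adapter.py')
    # A set guards against duplicate hits from the recursive wildcard.
    return sorted({path.relative_to(benchmarks_dir).as_posix() for path in matches})


SAMPLE_FILES = [
    'gsm8k/gsm8k_adapter.py',     # standard layout: should match
    'multi/sub/deep_adapter.py',  # nested deeper: should match
    'gsm8k/utils.py',             # wrong suffix: should not match
    'stray_adapter.py',           # not inside a benchmark directory: should not match
]

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in SAMPLE_FILES:
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    found = find_adapter_files(root)

print(found)
```

The practical consequence is that an adapter file must sit at least one directory below `evalscope/benchmarks/` and end in `_adapter.py`, otherwise the pattern will not match it.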
- `eval_type`: `openai_api`, `llm_ckpt`, `mock_llm`, `text2image`, `image_editing`. Deprecated aliases: `server` → `openai_api`, `checkpoint` → `llm_ckpt`.
- `limit`: `int` = count, `float` = fraction.
- `repeats`: duplicates items for k-metrics. `generation_config.n` is deprecated and mapped.
- Use `generation_config` for runtime params. `TaskConfig.timeout` / `stream` are deprecated and forwarded with a warning.
- `dataset_args` merges into `BenchmarkMeta._update()` (supports `local_path`, `filters` OrderedDict prepended).
- Models are memoized by `(name, config, base_url, api_key, args)`.
- Use `@thread_safe` for model creation, `run_in_threads_with_progress` for concurrent eval.
- Outputs land in `outputs/<timestamp>/{logs,predictions,reviews,reports,configs}/` (see `OutputsStructure`). `use_cache` resumes runs; `rerun_review` recomputes scores only.
- `evalscope app` CLI command is deprecated (see `evalscope/cli/start_app.py`); use `evalscope service` for the Web dashboard.
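
Two of the rules above, the deprecated `eval_type` aliases and the `int` / `float` meaning of `limit`, can be sketched as plain functions. `normalize_eval_type` and `resolve_limit` are hypothetical names, and both the truncation of fractional limits and the treatment of `None` as "no limit" are assumptions; check `evalscope/config.py` for the real behavior.

```python
from typing import Optional, Union

# Deprecated alias -> current name, as listed above.
DEPRECATED_EVAL_TYPES = {'server': 'openai_api', 'checkpoint': 'llm_ckpt'}


def normalize_eval_type(eval_type: str) -> str:
    """Map a deprecated alias to its current name; pass other values through."""
    return DEPRECATED_EVAL_TYPES.get(eval_type, eval_type)


def resolve_limit(limit: Optional[Union[int, float]], total: int) -> int:
    """Turn `limit` into a sample count: int = count, float = fraction of `total`."""
    if limit is None:
        # Assumption: no limit means every sample is used.
        return total
    if isinstance(limit, float):
        # Assumption: fractional results are truncated toward zero.
        return min(total, int(total * limit))
    return min(total, limit)
```

Under these assumptions, `limit=5` and `limit=0.5` select the same number of samples from a 10-item dataset, which is why the type of the value matters: `1` means one sample, while `1.0` means the whole dataset.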
make dev # once
make lint # before every commit
pytest tests/cli/test_all.py::TestRun::test_ci_lite -v -s -p no:warnings