
Commit 94da674

test: unified CLI dispatch and task pipeline tests (#1203)
* refactor: remove dead read_video_pyav_pil and deduplicate _resize_image in load_video

* refactor: rename read_video_pyav -> read_video, remove dead code
  - Rename read_video_pyav to read_video in load_video.py with backward-compat alias
  - Delete _resize_image and read_video_pyav_base64 dead functions
  - Update all 12 caller files to use read_video directly
  - Inline base64 encoding logic in qwen2_5_omni.py (was read_video_pyav_base64)
  - Fix missing import in vila.py (latent bug)
  - Remove use_custom_video_loader dead code from 5 models that declared but never checked it (qwen2_5_vl, qwen3_vl, qwen3_omni, llava_onevision1_5, huggingface)

* docs: rewrite Section 7.1 to document read_video backends, remove dead Section 7.2

* feat: unified CLI with subcommand dispatch and interactive wizard
  Add lmms_eval/cli/ package with subcommand-based architecture:
  - eval - run evaluation (wizard mode when no args)
  - tasks - list/groups/subtasks/tags browser
  - models - list backends with optional --aliases
  - ui - launch Web UI
  - serve - start HTTP eval server
  - power - statistical power analysis
  - version - version and environment info
  - tui - terminal UI (textual)
  Full backward compat: lmms-eval --model X --tasks Y still works.
  Entrypoint rewired through cli.dispatch:main in pyproject.toml.

* docs: add external usage guide for CLI and library access
  Add docs/external_usage.md covering CLI subcommands (tasks, models, eval wizard, ui, serve, power, version) and Python library usage (TaskManager, datasets, evaluator, metrics). Update docs index link. Polish v0.7 release notes for consistency.

* test: add unified CLI dispatch and task pipeline tests
  Add test/cli/test_cli_dispatch.py (28 tests):
  - _is_legacy_invocation and _is_eval_wizard edge cases
  - main() routing: banner, help, subcommand dispatch
  - models subcommand --aliases gating
  - version output, tasks parser actions, legacy backward compat
  Add test/eval/test_task_pipeline.py (14 tests, 82 subtests):
  - 8 mainstream tasks: registration, YAML integrity, utils imports
  - process_results and doc_to_text callability
  - Cross-task output_type consistency, no duplicate names
  Remove per-task test files now covered by the unified pipeline:
  - test_tvbench_task.py, test_neptune_task.py, test_mmsi_bench_utils.py
1 parent 16021b4 commit 94da674

39 files changed

Lines changed: 2276 additions & 600 deletions

docs/README.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@ Majority of this documentation is adapted from [lm-eval-harness](https://github.
 
 * **[Commands Guide](commands.md)** - Learn about command line flags and options
 * **[Quick Start](quickstart.md)** - Evaluate your model in 5 minutes
+* **[External Usage](external_usage.md)** - CLI task browsing, Web UI, and Python library access to tasks, datasets, and evaluations
 * **[Model Guide](model_guide.md)** - How to add and integrate new models
 * **[Task Guide](task_guide.md)** - Create custom evaluation tasks
 * **[Current Tasks](current_tasks.md)** - List of all supported evaluation tasks

docs/external_usage.md

Lines changed: 281 additions & 0 deletions
@@ -0,0 +1,281 @@
# External Usage

`lmms_eval` can be used in two ways: as a **CLI tool** for quick tasks like browsing
benchmarks and launching the Web UI, or as a **Python library** for programmatic
access to tasks, datasets, and evaluations.

## Installation

```bash
# From PyPI
pip install lmms-eval

# From GitHub (latest development version)
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

# With all optional extras
pip install "lmms-eval[all]"
```

---

# Part I - CLI

## 1) Preview Available Tasks

```bash
# Flat list of every registered name (tasks + groups + tags)
lmms-eval tasks list

# Markdown table of task groups only
lmms-eval tasks groups

# Markdown table of leaf tasks only (with config path and output type)
lmms-eval tasks subtasks

# Tags only
lmms-eval tasks tags
```

Example output for `tasks subtasks` (truncated):

```
| Task | Config Location | Output Type |
|------|-----------------|-------------|
| mme | lmms_eval/tasks/mme/mme.yaml | generate_until |
| mmmu_val | lmms_eval/tasks/mmmu/mmmu_val.yaml | generate_until |
| ... | ... | ... |
```

These commands only read YAML configs - no dataset download happens.

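If an external script needs the task table as structured data, the Markdown output shown above can be parsed with a few lines of Python. This is a minimal sketch that assumes the table format matches the example; `parse_markdown_table` is a hypothetical helper, not part of `lmms_eval`.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse a pipe-delimited Markdown table into a list of row dicts."""
    header = None
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore log lines or blank lines around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if header is None:
            header = cells  # the first table row is the header
            continue
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue  # skip the |---|---| separator row
        rows.append(dict(zip(header, cells)))
    return rows


# Sample text copied from the example output above.
sample = """
| Task | Config Location | Output Type |
|------|-----------------|-------------|
| mme | lmms_eval/tasks/mme/mme.yaml | generate_until |
| mmmu_val | lmms_eval/tasks/mmmu/mmmu_val.yaml | generate_until |
"""

tasks = parse_markdown_table(sample)
print([row["Task"] for row in tasks])
```

In practice the text could come from the standard output of `lmms-eval tasks subtasks`, for example captured with `subprocess.run(..., capture_output=True, text=True)`. The real output may contain extra lines, which this parser skips as long as they do not start with `|`.
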
## 2) List Available Models

```bash
# Show all registered model backends (chat, simple, dual-mode)
lmms-eval models

# Include aliases
lmms-eval models --aliases
```

## 3) Launch the Web UI

The Web UI provides a browser-based interface for configuring and running
evaluations interactively. It requires Node.js 18+; the frontend is built
automatically on first launch.

```bash
# Start the Web UI (opens browser automatically)
lmms-eval ui

# Custom port
lmms-eval ui --port 3000
```

## 4) Interactive Evaluation Wizard

Run `lmms-eval eval` with no arguments to launch a step-by-step wizard that
guides you through model selection, task selection, and options:

```bash
lmms-eval eval
```

The wizard lets you search/filter tasks, shows a command preview, and runs
the evaluation after confirmation.

## 5) Direct Evaluation

Pass arguments directly (same flags as before, fully backward-compatible):

```bash
# New style (with eval subcommand)
lmms-eval eval --model qwen2_5_vl --tasks mme --batch_size 1 --limit 8

# Old style (still works, routes to eval automatically)
lmms-eval --model qwen2_5_vl --tasks mme --batch_size 1 --limit 8
```

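To make the backward-compatible routing concrete, here is one way such a dispatcher could decide between the old and new styles. This is a sketch under assumptions, not the code in `lmms_eval/cli/`: the subcommand names come from this guide, while `normalize_argv`, `is_eval_wizard`, and the rule "a leading flag means legacy mode" are hypothetical.

```python
# Subcommand names as listed in this guide; the routing rule is an assumption.
SUBCOMMANDS = {"eval", "tasks", "models", "ui", "serve", "power", "version", "tui"}


def normalize_argv(argv: list[str]) -> list[str]:
    """Prepend 'eval' when argv looks like a legacy, flag-only invocation."""
    if not argv:
        return argv  # bare invocation: left to the dispatcher
    first = argv[0]
    if first in SUBCOMMANDS or first in ("-h", "--help"):
        return argv  # already new-style, or a top-level help request
    if first.startswith("-"):
        return ["eval", *argv]  # e.g. --model X --tasks Y
    return argv  # unknown word: let the argument parser report the error


def is_eval_wizard(argv: list[str]) -> bool:
    """Wizard mode is 'eval' with no further arguments."""
    return argv == ["eval"]


legacy = ["--model", "qwen2_5_vl", "--tasks", "mme"]
print(normalize_argv(legacy))
print(is_eval_wizard(["eval"]))
```

Under these assumptions both command lines in the example above normalize to the same argument list, so a single `eval` parser can handle both.
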
## 6) Start the HTTP Eval Server

```bash
lmms-eval serve --host 0.0.0.0 --port 8000
```

## 7) Other Commands

```bash
# Version and environment info
lmms-eval version

# Statistical power analysis for benchmark planning
lmms-eval power --effect-size 0.03 --tasks mme

# Terminal UI (requires textual package)
lmms-eval tui
```

---

# Part II - Python Library

Beyond the CLI, `lmms_eval` can be imported as a Python library. This lets
external projects list benchmarks, load task definitions, download datasets,
iterate over samples, and run evaluations - all programmatically.

## 8) List Available Benchmarks (Python)

Use `TaskManager` to index all built-in tasks without downloading any data:

```python
from lmms_eval.tasks import TaskManager

tm = TaskManager()

# All registered names (tasks + groups + tags)
print(tm.all_tasks)

# Leaf tasks only
print(tm.all_subtasks)  # e.g. ['mme', 'mmmu_val', 'mathvista', ...]

# Groups only
print(tm.all_groups)

print(tm.list_all_tasks())
```

No dataset download happens at this stage. `TaskManager` only reads YAML configs
from the `lmms_eval/tasks/` directory to build its index.
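
Because these attributes are plain lists of names, they can be filtered with standard-library tools. The sketch below selects names by glob pattern; `match_tasks` is a hypothetical helper and the name list is a placeholder standing in for `tm.all_subtasks` (`mmmu_test` is included only for illustration).

```python
from fnmatch import fnmatchcase


def match_tasks(names, patterns):
    """Return the sorted, de-duplicated names that match any glob pattern."""
    return sorted({n for n in names for p in patterns if fnmatchcase(n, p)})


# Placeholder list; in real use this could be tm.all_subtasks.
names = ["mme", "mmmu_val", "mmmu_test", "mathvista"]

print(match_tasks(names, ["mmmu_*"]))
print(match_tasks(names, ["mme", "math*"]))
```

The result can then be passed on as a task list, for example to `get_task_dict` or `simple_evaluate`.
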

## 9) Load a Task and Download Its Dataset

`get_task_dict` instantiates task objects. During construction each task calls
`download()`, which triggers `datasets.load_dataset()` under the hood.

```python
from lmms_eval.tasks import TaskManager, get_task_dict

tm = TaskManager()
task_dict = get_task_dict(["mme"], task_manager=tm)
task = task_dict["mme"]
```

After this call the HuggingFace dataset has been downloaded (or loaded from
cache) and is stored in `task.dataset` as a `datasets.DatasetDict`.

## 10) Iterate Over Benchmark Samples

Each task exposes its splits through accessor methods:

```python
# Check which splits exist
task.has_test_docs()        # True / False
task.has_validation_docs()  # True / False
task.has_training_docs()    # True / False

# Access the documents
test_data = task.test_docs()                # full dataset with images/audio
test_data_lite = task.test_docs_no_media()  # same rows, media columns removed

# Inspect the first sample
for doc in test_data:
    print(doc.keys())  # e.g. dict_keys(['question', 'answer', 'image', ...])
    break
```

There is also a convenience property that returns whichever split the task uses
for evaluation (test if available, otherwise validation):

```python
eval_data = task.eval_docs                # datasets.Dataset
eval_data_lite = task.eval_docs_no_media  # without media columns
```

## 11) Access Task Configuration

Every task carries its full YAML config as a `TaskConfig` dataclass:

```python
cfg = task.config
cfg.test_split                 # "test"
cfg.output_type                # "generate_until"
cfg.metric_list                # [{"metric": "mme_perception_score", ...}, ...]
cfg.generation_kwargs          # {"max_new_tokens": 16, "temperature": 0, ...}
cfg.lmms_eval_specific_kwargs  # per-model prompt variants
```

You can also read a raw YAML config without instantiating the task (and
therefore without downloading data):

```python
raw = tm._get_config("mme")  # returns the parsed YAML as a dict
```

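As an illustration of how "per-model prompt variants" might be consumed, the sketch below resolves the prompt settings for a given model name. It assumes `lmms_eval_specific_kwargs` is a dict keyed by model name with a `"default"` fallback entry; that layout, the helper `resolve_prompt_kwargs`, and the example keys and values are assumptions for illustration, so check the task YAML you are working with.

```python
def resolve_prompt_kwargs(specific_kwargs, model_name):
    """Pick the prompt kwargs for model_name, falling back to 'default'."""
    if not specific_kwargs:
        return {}
    if model_name in specific_kwargs:
        return dict(specific_kwargs[model_name])
    return dict(specific_kwargs.get("default", {}))


# Illustrative values only; real tasks define their own keys.
example = {
    "default": {"pre_prompt": "", "post_prompt": "\nAnswer with a single word."},
    "my_model": {"pre_prompt": "", "post_prompt": "\nAnswer yes or no."},
}

print(repr(resolve_prompt_kwargs(example, "my_model")["post_prompt"]))
print(repr(resolve_prompt_kwargs(example, "other_model")["post_prompt"]))
```
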
## 12) Load Tasks from a Custom Path

External projects can maintain their own task YAMLs and load them alongside
(or instead of) the built-in tasks:

```python
# Include custom tasks in addition to built-in ones
tm = TaskManager(include_path="/path/to/my/tasks")

# Only custom tasks (skip the built-in ones)
tm = TaskManager(include_path="/path/to/my/tasks", include_defaults=False)

# Multiple custom paths
tm = TaskManager(include_path=["/path/a", "/path/b"])
```

Task YAMLs in the custom directory follow the same format as built-in tasks.
See the [Task Guide](task_guide.md) for the full specification.

## 13) Run an Evaluation Programmatically

`simple_evaluate` is the same function the CLI calls internally:

```python
from lmms_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="qwen2_5_vl",
    model_args={"pretrained": "Qwen/Qwen2.5-VL-3B-Instruct"},
    tasks=["mme"],
    batch_size=1,
    limit=8,           # set to None for full evaluation
    log_samples=True,  # save per-sample outputs
)

# results["results"] contains per-task metrics
# results["samples"] contains per-sample model outputs (if log_samples=True)
print(results["results"]["mme"])
```

Key parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | `str` | Registered model name (e.g. `"qwen2_5_vl"`, `"vllm"`, `"openai"`) |
| `model_args` | `str \| dict` | Model constructor arguments |
| `tasks` | `list` | Task names, dicts, or Task objects |
| `limit` | `int \| float` | Cap the number of samples per task (useful for testing) |
| `batch_size` | `int` | Inference batch size |
| `task_manager` | `TaskManager` | Pre-configured TaskManager (optional) |
| `gen_kwargs` | `str` | Override generation parameters |
| `predict_only` | `bool` | Generate outputs without computing metrics |
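
The `int | float` type of `limit` deserves a short illustration. The sketch below assumes `lmms_eval` follows the lm-eval-harness convention, in which a value below 1.0 is read as a fraction of the dataset and any other value as an absolute sample count; `effective_limit` is a hypothetical helper written for this guide, so verify the behaviour against the version you use.

```python
import math


def effective_limit(limit, n_docs):
    """Number of samples evaluated for a given limit (assumed convention)."""
    if limit is None:
        return n_docs  # no cap: evaluate everything
    if limit < 1.0:
        # Fractional limit: a share of the dataset, rounded up.
        return min(n_docs, math.ceil(n_docs * limit))
    return min(n_docs, int(limit))  # absolute cap


print(effective_limit(8, 200))     # absolute cap
print(effective_limit(0.5, 200))   # half of the dataset
print(effective_limit(None, 200))  # full dataset
```
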

## 14) Remote Evaluation via HTTP Server

For async workflows (e.g. triggering evaluations during training), use the
eval server and client:

```python
# Server side
from lmms_eval.entrypoints import ServerArgs, launch_server

# Configure ServerArgs and start the server with launch_server;
# see the v0.6 release notes for the exact fields.
```

```python
# Client side
from lmms_eval.entrypoints import EvalClient

client = EvalClient("http://eval-server:8000")

# Submit a non-blocking evaluation job
job = client.evaluate(
    model="qwen2_5_vl",
    tasks=["mme", "mmmu_val"],
    model_args={"pretrained": "Qwen/Qwen2.5-VL-7B-Instruct"},
)

# Poll or wait for results
result = client.wait_for_job(job["job_id"])
print(result["result"])
```

An async client (`AsyncEvalClient`) is also available for use in async
training loops. See the [v0.6 release notes](lmms-eval-0.6.md) for full
server API documentation.

---

## Quick Reference

| What you need | CLI / Import | Downloads data? |
|---------------|--------------|-----------------|
| List tasks | `lmms-eval tasks list` | No |
| Task table | `lmms-eval tasks subtasks` | No |
| List models | `lmms-eval models` | No |
| Interactive wizard | `lmms-eval eval` (no args) | No |
| Direct evaluation | `lmms-eval eval --model X --tasks Y` | **Yes** |
| Web UI | `lmms-eval ui` | No |
| HTTP server | `lmms-eval serve` | Server-side |
| Power analysis | `lmms-eval power` | No |
| Version info | `lmms-eval version` | No |
| List benchmarks (Python) | `TaskManager().all_subtasks` | No |
| Read raw YAML config | `TaskManager()._get_config(name)` | No |
| Instantiate task + download | `get_task_dict([name])` | **Yes** |
| Iterate samples | `task.test_docs()` | (at construction) |
| Full evaluation (Python) | `simple_evaluate(...)` | **Yes** |
| Remote evaluation (Python) | `EvalClient(url).evaluate(...)` | Server-side |

## Data Flow

```
TaskManager()
  └─ initialize_tasks()                    # scan lmms_eval/tasks/**/*.yaml
       └─ index: {name -> yaml_path, type}
  └─ TaskManager.load_task_or_group()
       └─ ConfigurableTask(config)
            └─ download()                  # datasets.load_dataset("lmms-lab/MME")
                 └─ self.dataset           # DatasetDict with all splits
                 └─ self.dataset_no_image  # same, media columns stripped
task.config -> TaskConfig dataclass        # all YAML fields as attributes
```
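
The indexing step at the top of this flow can be approximated in a few lines: walk a directory, find YAML files that declare a task name, and record where each one lives. This is a simplified sketch, not the `TaskManager` implementation. It assumes each task file has a top-level `task: <name>` line and uses a regular expression instead of a YAML parser; `build_index` and `_defaults.yaml` are hypothetical, and groups, tags, and includes are ignored.

```python
import re
import tempfile
from pathlib import Path

# Matches a top-level "task: name" line (assumed key; quotes optional).
TASK_KEY = re.compile(r"^task:\s*['\"]?([\w.-]+)['\"]?\s*$", re.MULTILINE)


def build_index(root):
    """Map task name -> YAML path for every file under root that declares a task."""
    index = {}
    for path in sorted(Path(root).rglob("*.yaml")):
        match = TASK_KEY.search(path.read_text(encoding="utf-8"))
        if match:
            index[match.group(1)] = str(path)
    return index


# Demonstration on a throwaway directory; no dataset is touched.
with tempfile.TemporaryDirectory() as tmp:
    task_dir = Path(tmp) / "mme"
    task_dir.mkdir()
    (task_dir / "mme.yaml").write_text(
        "task: mme\noutput_type: generate_until\n", encoding="utf-8"
    )
    # A file without a task key (hypothetical shared defaults) is skipped.
    (task_dir / "_defaults.yaml").write_text(
        "output_type: generate_until\n", encoding="utf-8"
    )
    index = build_index(tmp)
    print(sorted(index))
```

Only file contents are read here, which matches the guide's point that listing tasks never triggers a dataset download.
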
