# External Usage

`lmms_eval` can be used in two ways: as a **CLI tool** for quick tasks like browsing
benchmarks and launching the Web UI, or as a **Python library** for programmatic
access to tasks, datasets, and evaluations.

## Installation

```bash
# From PyPI
pip install lmms-eval

# From source (GitHub repository)
pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git

# With the `all` extra
pip install "lmms-eval[all]"
```

---

# Part I - CLI

## 1) Preview Available Tasks

```bash
# Flat list of every registered name (tasks + groups + tags)
lmms-eval tasks list

# Markdown table of task groups only
lmms-eval tasks groups

# Markdown table of leaf tasks only (with config path and output type)
lmms-eval tasks subtasks

# Tags only
lmms-eval tasks tags
```

Example output for `tasks subtasks` (truncated):

```
| Task | Config Location | Output Type |
|------|-----------------|-------------|
| mme | lmms_eval/tasks/mme/mme.yaml | generate_until |
| mmmu_val | lmms_eval/tasks/mmmu/mmmu_val.yaml | generate_until |
| ... | ... | ... |
```

These commands only read YAML configs - no dataset download happens.
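
Because `tasks subtasks` prints a Markdown table, a script can post-process its output. The sketch below shows one way to do that, assuming the output has the three-column shape shown above. `parse_markdown_table` is a hypothetical helper written for this example, not part of `lmms_eval`.

```python
def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Turn a Markdown table into one dict per data row (hypothetical helper)."""
    rows = []
    for line in text.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip blank lines or log output around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(cells)
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the |---|---| separator
    return [dict(zip(header, cells)) for cells in body]


# Sample text copied from the example output above
sample = """
| Task | Config Location | Output Type |
|------|-----------------|-------------|
| mme | lmms_eval/tasks/mme/mme.yaml | generate_until |
| mmmu_val | lmms_eval/tasks/mmmu/mmmu_val.yaml | generate_until |
"""

tasks = parse_markdown_table(sample)
print([t["Task"] for t in tasks])  # ['mme', 'mmmu_val']
```

In practice the input text would come from capturing the command's standard output, for example with `subprocess.run(..., capture_output=True, text=True)`.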

## 2) List Available Models

```bash
# Show all registered model backends (chat, simple, dual-mode)
lmms-eval models

# Include aliases
lmms-eval models --aliases
```

## 3) Launch the Web UI

The Web UI provides a browser-based interface for configuring and running
evaluations interactively. It requires Node.js 18+; the UI is built
automatically on first launch.

```bash
# Start the Web UI (opens browser automatically)
lmms-eval ui

# Custom port
lmms-eval ui --port 3000
```

## 4) Interactive Evaluation Wizard

Run `lmms-eval eval` with no arguments to launch a step-by-step wizard that
guides you through model selection, task selection, and options:

```bash
lmms-eval eval
```

The wizard lets you search/filter tasks, shows a command preview, and runs
the evaluation after confirmation.

## 5) Direct Evaluation

Pass arguments directly (same flags as before, fully backward-compatible):

```bash
# New style (with eval subcommand)
lmms-eval eval --model qwen2_5_vl --tasks mme --batch_size 1 --limit 8

# Old style (still works, routes to eval automatically)
lmms-eval --model qwen2_5_vl --tasks mme --batch_size 1 --limit 8
```
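
When evaluations are launched from a script, the same flags can be assembled as an argument list instead of a hand-written shell string. This is a minimal sketch: `build_eval_command` is a hypothetical helper, and joining several task names with commas is an assumption about how `--tasks` accepts multiple values.

```python
import shlex


def build_eval_command(model, tasks, batch_size=1, limit=None):
    """Assemble an `lmms-eval eval` argument list (hypothetical helper)."""
    cmd = [
        "lmms-eval", "eval",
        "--model", model,
        "--tasks", ",".join(tasks),  # assumption: comma-separated task names
        "--batch_size", str(batch_size),
    ]
    if limit is not None:
        cmd += ["--limit", str(limit)]
    return cmd


cmd = build_eval_command("qwen2_5_vl", ["mme"], batch_size=1, limit=8)
print(shlex.join(cmd))
# lmms-eval eval --model qwen2_5_vl --tasks mme --batch_size 1 --limit 8
```

Passing the list to `subprocess.run(cmd, check=True)` would start the run. A list avoids shell-quoting problems when model arguments contain spaces or special characters.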

## 6) Start the HTTP Eval Server

```bash
lmms-eval serve --host 0.0.0.0 --port 8000
```

## 7) Other Commands

```bash
# Version and environment info
lmms-eval version

# Statistical power analysis for benchmark planning
lmms-eval power --effect-size 0.03 --tasks mme

# Terminal UI (requires textual package)
lmms-eval tui
```

---

# Part II - Python Library

Beyond the CLI, `lmms_eval` can be imported as a Python library. This lets
external projects list benchmarks, load task definitions, download datasets,
iterate over samples, and run evaluations - all programmatically.

## 8) List Available Benchmarks (Python)

Use `TaskManager` to index all built-in tasks without downloading any data:

```python
from lmms_eval.tasks import TaskManager

tm = TaskManager()

# All registered names (tasks + groups + tags)
print(tm.all_tasks)

# Leaf tasks only
print(tm.all_subtasks)  # e.g. ['mme', 'mmmu_val', 'mathvista', ...]

# Groups only
print(tm.all_groups)

# Full listing of everything registered
print(tm.list_all_tasks())
```

No dataset download happens at this stage. `TaskManager` only reads YAML configs
from the `lmms_eval/tasks/` directory to build its index.

## 9) Load a Task and Download Its Dataset

`get_task_dict` instantiates task objects. During construction each task calls
`download()`, which triggers `datasets.load_dataset()` under the hood.

```python
from lmms_eval.tasks import TaskManager, get_task_dict

tm = TaskManager()
task_dict = get_task_dict(["mme"], task_manager=tm)
task = task_dict["mme"]
```

After this call the Hugging Face dataset has been downloaded (or loaded from
cache) and is stored in `task.dataset` as a `datasets.DatasetDict`.

## 10) Iterate Over Benchmark Samples

Each task exposes its splits through accessor methods:

```python
# Check which splits exist
task.has_test_docs()        # True / False
task.has_validation_docs()  # True / False
task.has_training_docs()    # True / False

# Load the test split
test_data = task.test_docs()                # full dataset with images/audio
test_data_lite = task.test_docs_no_media()  # same rows, media columns removed

# Inspect the fields of the first sample
for doc in test_data:
    print(doc.keys())  # e.g. dict_keys(['question', 'answer', 'image', ...])
    break
```

There are also convenience properties that return whichever split the task uses
for evaluation (test if available, otherwise validation):

```python
eval_data = task.eval_docs                # datasets.Dataset
eval_data_lite = task.eval_docs_no_media  # without media columns
```
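
The fallback rule described above (test if available, otherwise validation) can be written out explicitly. This sketch only illustrates the rule and is not the library's implementation. `pick_eval_docs` and `StubTask` are hypothetical, and `validation_docs()` is assumed to exist by analogy with `test_docs()`.

```python
def pick_eval_docs(task):
    """Return the test split if the task has one, otherwise validation."""
    if task.has_test_docs():
        return task.test_docs()
    if task.has_validation_docs():
        return task.validation_docs()  # assumed accessor, mirrors test_docs()
    raise ValueError("task has neither a test nor a validation split")


class StubTask:
    """Hypothetical stand-in for a real task, so the sketch runs without data."""

    def __init__(self, test=None, validation=None):
        self._test = test
        self._validation = validation

    def has_test_docs(self):
        return self._test is not None

    def has_validation_docs(self):
        return self._validation is not None

    def test_docs(self):
        return self._test

    def validation_docs(self):
        return self._validation


print(pick_eval_docs(StubTask(test=["t1"], validation=["v1"])))  # ['t1']
print(pick_eval_docs(StubTask(validation=["v1"])))               # ['v1']
```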

## 11) Access Task Configuration

Every task carries its full YAML config as a `TaskConfig` dataclass:

```python
cfg = task.config

cfg.test_split                 # "test"
cfg.output_type                # "generate_until"
cfg.metric_list                # [{"metric": "mme_perception_score", ...}, ...]
cfg.generation_kwargs          # {"max_new_tokens": 16, "temperature": 0, ...}
cfg.lmms_eval_specific_kwargs  # per-model prompt variants
```

You can also read a raw YAML config without instantiating the task (and
therefore without downloading data). Note that `_get_config` has a leading
underscore, which by Python convention marks it as private, so it may change
between releases:

```python
raw = tm._get_config("mme")  # returns the parsed YAML as a dict
```

## 12) Load Tasks from a Custom Path

External projects can maintain their own task YAMLs and load them alongside
(or instead of) the built-in tasks:

```python
# Include custom tasks in addition to built-in ones
tm = TaskManager(include_path="/path/to/my/tasks")

# Custom tasks only (skip the built-in ones)
tm = TaskManager(include_path="/path/to/my/tasks", include_defaults=False)

# Multiple custom directories
tm = TaskManager(include_path=["/path/a", "/path/b"])
```

Task YAMLs in the custom directory follow the same format as built-in tasks.
See the [Task Guide](task_guide.md) for the full specification.
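
Before pointing `include_path` at a directory, it can help to check which YAML files it actually contains. The sketch below lists them with the standard library only. `find_task_yamls` is a hypothetical helper, and the real discovery rules in `TaskManager` may differ (for example, it may skip some files), so treat this as a sanity check rather than a replica of its scan.

```python
import tempfile
from pathlib import Path


def find_task_yamls(root):
    """Return all *.yaml files under `root`, as sorted paths relative to it."""
    root = Path(root)
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*.yaml"))


# Demo with a throwaway directory; the file names and contents are made up
with tempfile.TemporaryDirectory() as tmp:
    bench_dir = Path(tmp) / "my_bench"
    bench_dir.mkdir()
    (bench_dir / "my_bench.yaml").write_text("task: my_bench\n")
    (bench_dir / "utils.py").write_text("# helper code\n")
    found = find_task_yamls(tmp)

print(found)  # ['my_bench/my_bench.yaml']
```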

## 13) Run an Evaluation Programmatically

`simple_evaluate` is the same function the CLI calls internally:

```python
from lmms_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="qwen2_5_vl",
    model_args={"pretrained": "Qwen/Qwen2.5-VL-3B-Instruct"},
    tasks=["mme"],
    batch_size=1,
    limit=8,           # set to None for full evaluation
    log_samples=True,  # save per-sample outputs
)

# results["results"] contains per-task metrics
# results["samples"] contains per-sample model outputs (if log_samples=True)
print(results["results"]["mme"])
```

Key parameters:

| Parameter | Type | Description |
|-----------|------|-------------|
| `model` | `str` | Registered model name (e.g. `"qwen2_5_vl"`, `"vllm"`, `"openai"`) |
| `model_args` | `str \| dict` | Model constructor arguments |
| `tasks` | `list` | Task names, dicts, or Task objects |
| `limit` | `int \| float` | Cap the number of samples per task (useful for testing) |
| `batch_size` | `int` | Inference batch size |
| `task_manager` | `TaskManager` | Pre-configured TaskManager (optional) |
| `gen_kwargs` | `str` | Override generation parameters |
| `predict_only` | `bool` | Generate outputs without computing metrics |
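
Since `results["results"]` holds per-task metrics, a common follow-up step is to flatten it into rows for logging or CSV export. The sketch below assumes a nested `{task: {metric: value}}` layout. `flatten_results` is a hypothetical helper, and the metric names and numbers in the demo are placeholders, not real scores.

```python
def flatten_results(results: dict) -> list[tuple[str, str, object]]:
    """Flatten {"results": {task: {metric: value}}} into (task, metric, value) rows."""
    rows = []
    for task_name, metrics in sorted(results.get("results", {}).items()):
        for metric_name, value in sorted(metrics.items()):
            rows.append((task_name, metric_name, value))  # value passed through unchanged
    return rows


# Placeholder data standing in for the dict returned by simple_evaluate
fake_results = {
    "results": {
        "mme": {"metric_b": 2, "metric_a": 1},
        "mmmu_val": {"metric_a": 3},
    }
}

for task_name, metric_name, value in flatten_results(fake_results):
    print(f"{task_name}\t{metric_name}\t{value}")
```

Sorting both levels keeps the row order stable across runs, which makes diffs between result logs easier to read.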

## 14) Remote Evaluation via HTTP Server

For async workflows (e.g. triggering evaluations during training), use the
eval server and client:

```python
# Server side
from lmms_eval.entrypoints import ServerArgs, launch_server
# Assumed field names, mirroring the CLI's --host/--port flags
launch_server(ServerArgs(host="0.0.0.0", port=8000))
```

```python
# Client side
from lmms_eval.entrypoints import EvalClient

client = EvalClient("http://eval-server:8000")

# Submit a non-blocking evaluation job
job = client.evaluate(
    model="qwen2_5_vl",
    tasks=["mme", "mmmu_val"],
    model_args={"pretrained": "Qwen/Qwen2.5-VL-7B-Instruct"},
)

# Poll or wait for results
result = client.wait_for_job(job["job_id"])
print(result["result"])
```

An async client (`AsyncEvalClient`) is also available for use in async
training loops. See the [v0.6 release notes](lmms-eval-0.6.md) for full
server API documentation.

---

## Quick Reference

| What you need | CLI / Import | Downloads data? |
|---------------|--------------|-----------------|
| List tasks | `lmms-eval tasks list` | No |
| Task table | `lmms-eval tasks subtasks` | No |
| List models | `lmms-eval models` | No |
| Interactive wizard | `lmms-eval eval` (no args) | No |
| Direct evaluation | `lmms-eval eval --model X --tasks Y` | **Yes** |
| Web UI | `lmms-eval ui` | No |
| HTTP server | `lmms-eval serve` | Server-side |
| Power analysis | `lmms-eval power` | No |
| Version info | `lmms-eval version` | No |
| List benchmarks (Python) | `TaskManager().all_subtasks` | No |
| Read raw YAML config | `TaskManager()._get_config(name)` | No |
| Instantiate task + download | `get_task_dict([name])` | **Yes** |
| Iterate samples | `task.test_docs()` | (at construction) |
| Full evaluation (Python) | `simple_evaluate(...)` | **Yes** |
| Remote evaluation (Python) | `EvalClient(url).evaluate(...)` | Server-side |

## Data Flow

```
TaskManager()
  └─ initialize_tasks()            # scan lmms_eval/tasks/**/*.yaml
       └─ index: {name -> yaml_path, type}

get_task_dict(["mme"], task_manager=tm)
  └─ TaskManager.load_task_or_group()
       └─ ConfigurableTask(config)
            └─ download()          # datasets.load_dataset("lmms-lab/MME")
                 └─ self.dataset           # DatasetDict with all splits
                 └─ self.dataset_no_image  # same, media columns stripped

task.config  -> TaskConfig dataclass  # all YAML fields as attributes
```