feat: SGLang refactor, distributed eval fixes, and cache simplification (#1253)
* refactor(models): replace dummy_video_reader with unified dummy model
Merge dummy_video_reader into a single dummy model that serves both use cases:
- Default mode: instant no-op responses for dataset hydration and task smoke tests
- Video-bench mode (read_bytes/decode_num_frames > 0): full IO/decode latency tracking
The old name dummy_video_reader is kept as a MODEL_ALIASES alias for backward compat.
* fix(sglang): prevent double multimodal processing for image and video inputs
SGLang's Engine runs its own Qwen3-VL processor internally. When
lmms-eval pre-tokenized inputs with the HF processor and passed the
expanded input_ids to SGLang, pad tokens were expanded twice, causing
IndexError on image inputs and potential failures on video inputs.
- Image path: pass prompt text directly to Engine.generate() instead of
pre-tokenized input_ids, letting SGLang handle tokenization end-to-end
- Video path: pass prompt text + video_data to Engine.generate() using
SGLang's native video support instead of pre-tokenizing and swapping
video tokens to image tokens
- Fix tools check: use truthy check instead of 'is not None' so empty
list from disabled MCP does not trigger tool-handling code paths
- Fix tools param: pass tools=None instead of tools=[] to
apply_chat_template to avoid unexpected preprocessing
- Lazy-import MCP deps: avoid ImportError at module load when mcp
package is not installed
- Broaden optional metric imports: catch Exception instead of
ImportError so numpy/spacy binary incompatibilities do not crash
metric aggregation for unrelated tasks
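For illustration, the tools check and the import-related bullets above can be sketched as follows. This is a minimal sketch, not the code from this change: `select_tools` and `load_optional` are hypothetical names, and the real wrapper may structure these checks differently.

```python
import importlib
from typing import Any, List, Optional


def select_tools(tools: Optional[List[Any]]) -> Optional[List[Any]]:
    """Collapse both None and [] to None (hypothetical helper)."""
    # `tools is not None` would treat [] as "tools enabled";
    # a truthy check treats an empty list the same as None.
    return tools if tools else None


def load_optional(module_name: str) -> Optional[Any]:
    """Import a module at call time; return None if it cannot be loaded."""
    try:
        return importlib.import_module(module_name)
    except Exception:
        # Deliberately broad: a binary incompatibility can raise
        # something other than ImportError during import.
        return None
```

With these helpers, `select_tools([])` returns `None`, so an empty tool list from a disabled integration never reaches tool-handling code, and a missing optional package only matters at the point where it is actually requested.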
* fix: land layered cache support on main worktree
* fix: stabilize dataset loading and mmmu pro prompts
* fix: add eval batch watchdog heartbeats
* feat: promote sealed cache segments during eval
* style: auto-fix lint (black + isort)
* feat: SGLang refactor, distributed eval fixes, and cache simplification
SGLang model wrapper:
- Remove qwen_vl_utils dependency from generic wrapper
- Pass per-request image_data instead of flattening across batch
- Initialize _config with AutoConfig instead of returning processor
- Patch torchvision read_video missing video_fps fallback
- Pass flat image list to Engine.generate instead of nested lists
Distributed eval:
- Use global rank in model wrappers for correct TP+DP dispatch
- Add Slurm-aware progress reporting for batch jobs
- Redirect HF datasets cache to local scratch on remote FS
Response cache:
- Simplify to single create/finalize API
- Context-length and batch-size tuning for thinking models
Tests:
- Expanded cache tests for simplified API
- Filelock cross-class singleton regression test
- Task dataset cache redirect test
Deps:
- Add torchcodec to pyproject.toml
---------
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
|[Caching](advanced/caching.md)| SQLite-backed response cache for deterministic requests. Store, replay, merge shards across distributed ranks, and recover from crashes via JSONL audit log. |
|[Throughput Metrics](advanced/throughput_metrics.md)| Inference timing metrics logged by chat models — end-to-end latency, time to first token, tokens per second, and batch-level summaries. |
-The response cache stores only deterministic requests (`temperature=0`, `do_sample=False`). Enable it with `--use_cache ./eval_cache` to skip redundant model calls on repeated runs:
+The response cache stores only deterministic requests (`temperature=0`, `do_sample=False`). Enable it with `--use_cache ./eval_cache` or `--use_cache ./eval_cache/cache.db` to skip redundant model calls on repeated runs. In layered mode, lmms-eval keeps the shared root DB at `./eval_cache/cache.db` and writes each run into `./eval_cache/runs/<run_id>/` before merging it back:
docs/advanced/caching.md: 33 additions & 12 deletions
@@ -15,6 +15,20 @@ python -m lmms_eval \
On a second run with the same command, cached responses are loaded and the model is only called for new or changed requests.
+When `--use_cache` points to a directory, or to an explicit root `cache.db`, lmms-eval uses a layered layout:
+
+```text
+eval_cache/
+  cache.db
+  cache.audit.jsonl
+  runs/
+    <run_id>/
+      cache.db
+      cache.audit.jsonl
+```
+
+The root `cache.db` is the shared read cache. Each evaluation run writes to its own UUID-scoped directory and rank 0 merges completed runs back into the root database under an exclusive lock. That gives you cache reuse without asking concurrent jobs to write into the same SQLite file.
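The path rules above can be sketched as a small resolver. This is a minimal sketch under stated assumptions, not lmms-eval's implementation: `CacheLayout` and `resolve_cache_layout` are hypothetical names, and the run ID is assumed to be a UUID hex string.

```python
import uuid
from pathlib import Path
from typing import NamedTuple, Optional


class CacheLayout(NamedTuple):
    layered: bool            # True for directory / root cache.db targets
    root_db: Path            # shared read cache (or the legacy target DB)
    run_db: Optional[Path]   # per-run write DB; None in legacy file mode


def resolve_cache_layout(use_cache: str, run_id: Optional[str] = None) -> CacheLayout:
    """Map a --use_cache value to the layered or legacy layout (hypothetical helper)."""
    target = Path(use_cache)
    if target.suffix == ".db" and target.name != "cache.db":
        # Any other .db filename keeps the legacy single-target behavior.
        return CacheLayout(False, target, None)
    # A directory, or an explicit <dir>/cache.db, selects the layered layout.
    root_dir = target.parent if target.name == "cache.db" else target
    run_id = run_id or uuid.uuid4().hex
    return CacheLayout(True, root_dir / "cache.db", root_dir / "runs" / run_id / "cache.db")
```

Under these assumptions, `./eval_cache` and `./eval_cache/cache.db` resolve to the same root DB, while a target such as `results/my.db` is left alone as a legacy file.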
### What gets cached
Only **deterministic** requests are cached. A request is considered non-deterministic (and skipped) when any of:
@@ -56,16 +70,23 @@ Float/int normalization: `temperature=0.0` and `temperature=0` produce the same
### File layout
+Layered directory mode (recommended for shared or long-running jobs):
Per-rank files avoid write contention in distributed runs.
+Legacy file mode keeps the older behavior where a direct `.db` target may receive per-rank shard files next to the target DB.
### Cache invalidation
@@ -92,14 +113,16 @@ Responses are validated before caching:
### Merge distributed shards
-After a multi-GPU run, merge per-rank DBs into one:
+Layered directory mode merges distributed shards automatically on successful completion. Rank 0 acquires an exclusive merge lock, folds every ready run under `runs/` into the root `cache.db`, and marks the run directory as merged.
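The merge step described above can be sketched as follows. This is a minimal sketch under stated assumptions, not lmms-eval's implementation: the `responses(key, value)` table, the `merge.lock` file, and the `.merged` marker are all hypothetical, and the lock is a plain standard-library lock file rather than whatever locking the project uses.

```python
import os
import sqlite3
from contextlib import contextmanager
from pathlib import Path

# Hypothetical schema; the real cache table may differ.
SCHEMA = "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, value TEXT)"


@contextmanager
def merge_lock(root_dir: Path):
    """Hold an exclusive lock by atomically creating a lock file."""
    lock_path = root_dir / "merge.lock"
    # O_EXCL makes creation fail with FileExistsError if another process holds the lock.
    fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(lock_path)


def merge_runs(root_dir: Path) -> int:
    """Fold every unmerged run DB under runs/ into the root cache.db; return the count."""
    merged = 0
    with merge_lock(root_dir):
        root = sqlite3.connect(root_dir / "cache.db")
        try:
            root.execute(SCHEMA)
            for run_db in sorted(root_dir.glob("runs/*/cache.db")):
                marker = run_db.parent / ".merged"
                if marker.exists():
                    continue  # already folded into the root DB
                root.execute("ATTACH DATABASE ? AS run", (str(run_db),))
                # Keep the existing root entry when a key is already cached.
                root.execute("INSERT OR IGNORE INTO responses SELECT key, value FROM run.responses")
                root.commit()
                root.execute("DETACH DATABASE run")
                marker.touch()
                merged += 1
        finally:
            root.close()
    return merged
```

Because only one process ever writes to the root database, and only while holding the lock, concurrent jobs never contend for the same SQLite file. In this sketch a second caller fails fast with `FileExistsError` instead of waiting for the lock.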
+
+If you are using legacy file mode, you can still merge shard DBs manually:
```python
from lmms_eval.caching.response_cache import ResponseCache
@@ -211,5 +234,3 @@ On a second run with the same task/docs, cached responses will be loaded and onl
### Optional: legacy SQLite cache wrapper
There is also a separate optional wrapper `CachingLMM` (see `lmms_eval.api.model.CachingLMM`) that caches by hashing the entire call arguments to a SQLite DB (via `SqliteDict`). It is independent from the JSONL cache above and can be useful for broader API‑level caching. For most users, enabling `LMMS_EVAL_USE_CACHE=True` is sufficient and simpler.
docs/getting-started/commands.md: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ This mode supports a number of command-line arguments, the details of which can
- `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.

-- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
+- `--use_cache` : Accepts either a cache root directory or a SQLite `.db` file path. When you pass a directory, or an explicit root `<dir>/cache.db`, lmms-eval stores the shared cache at `<dir>/cache.db` and writes each evaluation run into `<dir>/runs/<run_id>/...` before merging back into the root DB. This isolates concurrent writers automatically while keeping a single shared cache for reuse across runs. Other `.db` filenames keep the legacy single-target behavior.

- `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under lmms_eval/cache/.cache unless you specify a different path via the environment variable: `LM_HARNESS_CACHE_PATH`. e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`.
"HF datasets cache '{}' is on a remote filesystem but no local scratch directory was found; continuing with the remote cache, so file-lock errors may still occur.",