Commit 9e69834

feat: SGLang refactor, distributed eval fixes, and cache simplification (#1253)
* refactor(models): replace dummy_video_reader with unified dummy model

  Merge dummy_video_reader into a single dummy model that serves both use cases:
  - Default mode: instant no-op responses for dataset hydration and task smoke tests
  - Video-bench mode (read_bytes/decode_num_frames > 0): full IO/decode latency tracking

  The old name dummy_video_reader is kept as a MODEL_ALIASES alias for backward compat.

* fix(sglang): prevent double multimodal processing for image and video inputs

  SGLang's Engine runs its own Qwen3-VL processor internally. When lmms-eval pre-tokenized inputs with the HF processor and passed the expanded input_ids to SGLang, pad tokens were expanded twice, causing IndexError on image inputs and potential failures on video inputs.
  - Image path: pass prompt text directly to Engine.generate() instead of pre-tokenized input_ids, letting SGLang handle tokenization end-to-end
  - Video path: pass prompt text + video_data to Engine.generate() using SGLang's native video support instead of pre-tokenizing and swapping video tokens to image tokens
  - Fix tools check: use truthy check instead of 'is not None' so empty list from disabled MCP does not trigger tool-handling code paths
  - Fix tools param: pass tools=None instead of tools=[] to apply_chat_template to avoid unexpected preprocessing
  - Lazy-import MCP deps: avoid ImportError at module load when mcp package is not installed
  - Broaden optional metric imports: catch Exception instead of ImportError so numpy/spacy binary incompatibilities do not crash metric aggregation for unrelated tasks

* fix: land layered cache support on main worktree
* fix: stabilize dataset loading and mmmu pro prompts
* fix: add eval batch watchdog heartbeats
* feat: promote sealed cache segments during eval
* style: auto-fix lint (black + isort)
* feat: SGLang refactor, distributed eval fixes, and cache simplification

  SGLang model wrapper:
  - Remove qwen_vl_utils dependency from generic wrapper
  - Pass per-request image_data instead of flattening across batch
  - Initialize _config with AutoConfig instead of returning processor
  - Patch torchvision read_video missing video_fps fallback
  - Pass flat image list to Engine.generate instead of nested lists

  Distributed eval:
  - Use global rank in model wrappers for correct TP+DP dispatch
  - Add Slurm-aware progress reporting for batch jobs
  - Redirect HF datasets cache to local scratch on remote FS

  Response cache:
  - Simplify to single create/finalize API
  - Context-length and batch-size tuning for thinking models

  Tests:
  - Expanded cache tests for simplified API
  - Filelock cross-class singleton regression test
  - Task dataset cache redirect test

  Deps:
  - Add torchcodec to pyproject.toml

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent a130a0c commit 9e69834

25 files changed

Lines changed: 1749 additions & 741 deletions

docs/README.md

Lines changed: 1 addition & 1 deletion
@@ -189,7 +189,7 @@ results = evaluator.simple_evaluate(
 | [Caching](advanced/caching.md) | SQLite-backed response cache for deterministic requests. Store, replay, merge shards across distributed ranks, and recover from crashes via JSONL audit log. |
 | [Throughput Metrics](advanced/throughput_metrics.md) | Inference timing metrics logged by chat models — end-to-end latency, time to first token, tokens per second, and batch-level summaries. |
 
-The response cache stores only deterministic requests (`temperature=0`, `do_sample=False`). Enable it with `--use_cache ./eval_cache` to skip redundant model calls on repeated runs:
+The response cache stores only deterministic requests (`temperature=0`, `do_sample=False`). Enable it with `--use_cache ./eval_cache` or `--use_cache ./eval_cache/cache.db` to skip redundant model calls on repeated runs. In layered mode, lmms-eval keeps the shared root DB at `./eval_cache/cache.db` and writes each run into `./eval_cache/runs/<run_id>/` before merging it back:
 
 ```bash
 python -m lmms_eval \

docs/advanced/caching.md

Lines changed: 33 additions & 12 deletions
@@ -15,6 +15,20 @@ python -m lmms_eval \
 
 On a second run with the same command, cached responses are loaded and the model is only called for new or changed requests.
 
+When `--use_cache` points to a directory, or to an explicit root `cache.db`, lmms-eval uses a layered layout:
+
+```text
+eval_cache/
+  cache.db
+  cache.audit.jsonl
+  runs/
+    <run_id>/
+      cache.db
+      cache.audit.jsonl
+```
+
+The root `cache.db` is the shared read cache. Each evaluation run writes to its own UUID-scoped directory and rank 0 merges completed runs back into the root database under an exclusive lock. That gives you cache reuse without asking concurrent jobs to write into the same SQLite file.
+
 ### What gets cached
 
 Only **deterministic** requests are cached. A request is considered non-deterministic (and skipped) when any of:
@@ -56,16 +70,23 @@ Float/int normalization: `temperature=0.0` and `temperature=0` produce the same
 
 ### File layout
 
+Layered directory mode (recommended for shared or long-running jobs):
+
 ```
-{use_cache}/
-  {model_hash}/           # sha256("{model}|{model_args}")[:16]
-    rank0.db              # SQLite (WAL mode) - primary lookup
-    rank0.jsonl           # write-ahead audit log - crash recovery
-    rank1.db              # (if multi-GPU)
-    rank1.jsonl
+{cache_root}/
+  cache.db
+  cache.audit.jsonl
+  runs/
+    {run_id}/
+      cache.db                            # single-rank writes
+      cache.audit.jsonl
+      cache.db.shard.{rank}               # multi-rank writes
+      cache.db.audit.shard.{rank}.jsonl
+      .ready
+      .merged
 ```
 
-Per-rank files avoid write contention in distributed runs.
+Legacy file mode keeps the older behavior where a direct `.db` target may receive per-rank shard files next to the target DB.
 
 ### Cache invalidation
 
@@ -92,14 +113,16 @@ Responses are validated before caching:
 
 ### Merge distributed shards
 
-After a multi-GPU run, merge per-rank DBs into one:
+Layered directory mode merges distributed shards automatically on successful completion. Rank 0 acquires an exclusive merge lock, folds every ready run under `runs/` into the root `cache.db`, and marks the run directory as merged.
+
+If you are using legacy file mode, you can still merge shard DBs manually:
 
 ```python
 from lmms_eval.caching.response_cache import ResponseCache
 
 ResponseCache.merge_shards(
-    shard_paths=["eval_cache/abc123/rank0.db", "eval_cache/abc123/rank1.db"],
-    output_path="eval_cache/abc123/merged.db",
+    shard_paths=["eval_cache/cache.db.shard.0", "eval_cache/cache.db.shard.1"],
+    output_path="eval_cache/cache.db",
 )
 ```
 
@@ -211,5 +234,3 @@ On a second run with the same task/docs, cached responses will be loaded and onl
 ### Optional: legacy SQLite cache wrapper
 
 There is also a separate optional wrapper `CachingLMM` (see `lmms_eval.api.model.CachingLMM`) that caches by hashing the entire call arguments to a SQLite DB (via `SqliteDict`). It is independent from the JSONL cache above and can be useful for broader API‑level caching. For most users, enabling `LMMS_EVAL_USE_CACHE=True` is sufficient and simpler.
-
-
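The automatic merge that the caching.md changes describe (rank 0 folds every ready run under `runs/` into the root `cache.db` and marks the run directory as merged) can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `responses(key, value)` table, the `merge_ready_runs` helper, and the `INSERT OR IGNORE` policy are hypothetical and are not taken from lmms-eval's `ResponseCache`. The exclusive merge lock is left out to keep the example short.

```python
import sqlite3
from pathlib import Path

# Hypothetical schema; the real cache tables are not shown in this commit.
SCHEMA = "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, value TEXT)"


def _is_run_db(path: Path) -> bool:
    """True for cache.db and cache.db.shard.{rank}, False for audit logs."""
    return path.name == "cache.db" or path.name.startswith("cache.db.shard.")


def merge_ready_runs(cache_root: Path) -> int:
    """Fold every ready, not-yet-merged run into the root DB; return the count."""
    runs_dir = cache_root / "runs"
    if not runs_dir.is_dir():
        return 0
    merged = 0
    root = sqlite3.connect(str(cache_root / "cache.db"))
    try:
        root.execute(SCHEMA)
        for run_dir in sorted(runs_dir.iterdir()):
            ready = (run_dir / ".ready").exists()
            done = (run_dir / ".merged").exists()
            if not ready or done:
                continue
            for db_path in sorted(p for p in run_dir.iterdir() if _is_run_db(p)):
                src = sqlite3.connect(str(db_path))
                try:
                    rows = src.execute("SELECT key, value FROM responses").fetchall()
                finally:
                    src.close()
                # Keys already in the root cache are kept; new keys are added.
                root.executemany("INSERT OR IGNORE INTO responses VALUES (?, ?)", rows)
            root.commit()
            # Marker file so the same run is not merged twice.
            (run_dir / ".merged").touch()
            merged += 1
    finally:
        root.close()
    return merged
```

Calling `merge_ready_runs(Path("eval_cache"))` a second time is a no-op in this sketch, because every run merged on the first pass now carries a `.merged` marker.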

docs/getting-started/commands.md

Lines changed: 1 addition & 1 deletion
@@ -29,7 +29,7 @@ This mode supports a number of command-line arguments, the details of which can
 
 - `--limit` : Accepts an integer, or a float between 0.0 and 1.0 . If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.
 
-- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
+- `--use_cache` : Accepts either a cache root directory or a SQLite `.db` file path. When you pass a directory, or an explicit root `<dir>/cache.db`, lmms-eval stores the shared cache at `<dir>/cache.db` and writes each evaluation run into `<dir>/runs/<run_id>/...` before merging back into the root DB. This isolates concurrent writers automatically while keeping a single shared cache for reuse across runs. Other `.db` filenames keep the legacy single-target behavior.
 
 - `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under lmms_eval/cache/.cache unless you specify a different path via the environment variable: `LM_HARNESS_CACHE_PATH`. e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`.
 
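The new `--use_cache` rule (a directory or an explicit `<dir>/cache.db` selects the layered layout, any other `.db` filename keeps legacy behavior) can be sketched as a small path resolver. This is a sketch under stated assumptions, not lmms-eval's code: `CacheLayout`, `resolve_cache_layout`, and `shard_db_path` are hypothetical names, and treating "no `.db` suffix" as "directory" is a guess at how the two cases are told apart.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class CacheLayout:
    layered: bool             # True for the layered directory layout
    root_db: Path             # shared root DB (or the legacy target DB)
    run_dir: Optional[Path]   # per-run write directory, None in legacy mode


def resolve_cache_layout(use_cache: str, run_id: str) -> CacheLayout:
    """Map a --use_cache value to the layout described in the docs."""
    path = Path(use_cache)
    if path.suffix != ".db":
        root = path                  # assumption: no .db suffix means a directory
    elif path.name == "cache.db":
        root = path.parent           # explicit root <dir>/cache.db
    else:
        # Any other .db filename keeps the legacy single-target behavior.
        return CacheLayout(layered=False, root_db=path, run_dir=None)
    return CacheLayout(layered=True, root_db=root / "cache.db", run_dir=root / "runs" / run_id)


def shard_db_path(run_dir: Path, rank: int, world_size: int) -> Path:
    """Single-rank runs write cache.db; multi-rank runs write one shard per rank."""
    if world_size <= 1:
        return run_dir / "cache.db"
    return run_dir / f"cache.db.shard.{rank}"
```

Under these assumptions, `./eval_cache` and `./eval_cache/cache.db` resolve to the same layered layout, while `./my_cache.db` stays in legacy mode.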

examples/models/minerva_dummy_video_reader.sh renamed to examples/models/minerva_dummy.sh

Lines changed: 2 additions & 2 deletions
@@ -4,14 +4,14 @@ set -euo pipefail
 
 LIMIT="${LIMIT:-50}"
 BATCH_SIZE="${BATCH_SIZE:-1}"
-OUTPUT_PATH="${OUTPUT_PATH:-./logs/minerva_dummy_video_reader/}"
+OUTPUT_PATH="${OUTPUT_PATH:-./logs/minerva_dummy/}"
 VERBOSITY="${VERBOSITY:-INFO}"
 
 echo "[INFO] MINERVA dummy video-reader benchmark"
 echo "[INFO] tasks=minerva limit=${LIMIT} batch_size=${BATCH_SIZE}"
 
 uv run --with pylance --with pyarrow python -m lmms_eval \
-  --model dummy_video_reader \
+  --model dummy \
   --model_args "read_bytes=65536,response=A,allow_remote=false,fail_on_missing=true" \
   --tasks minerva \
   --batch_size "${BATCH_SIZE}" \

lmms_eval/__main__.py

Lines changed: 5 additions & 4 deletions
@@ -220,10 +220,11 @@ def parse_eval_args() -> tuple[argparse.ArgumentParser, argparse.Namespace]:
         type=str,
         default=None,
         metavar="PATH",
-        help="Path to a SQLite .db file for response-level caching (e.g. ./my_cache.db). "
-        "Caches deterministic model responses (temperature=0) for reuse across runs. "
-        "In distributed mode, temporary per-rank shards are auto-merged into this file. "
-        "A .db suffix is appended automatically if missing. `None` to disable.",
+        help="Path to a response-cache root directory or SQLite .db file. "
+        "If PATH is a directory, or an explicit PATH/cache.db root file, lmms-eval keeps the shared root cache at "
+        "PATH/cache.db and writes each run into PATH/runs/<run_id>/ before auto-merging back into the root DB. "
+        "This avoids write contention across concurrent jobs while preserving cache reuse. "
+        "If PATH is a file, legacy single-target behavior is preserved. `None` disables response caching.",
     )
     parser.add_argument(
         "--cache_requests",

lmms_eval/api/metrics.py

Lines changed: 5 additions & 3 deletions
@@ -565,7 +565,9 @@ def stderr_for_metric(metric, bootstrap_iters: int):
         ter,
     ]
 
-    # Optional imports for tasks with extra dependencies (spacy, etc.)
+    # Optional imports for tasks with extra dependencies (spacy, etc.).
+    # Catch Exception (not just ImportError) because transitive deps may raise
+    # ValueError/RuntimeError from binary incompatibilities (e.g. numpy/spacy).
     try:
         from lmms_eval.tasks.amber_g.utils import (
             amber_g_aggregate_chair,
@@ -582,7 +584,7 @@ def stderr_for_metric(metric, bootstrap_iters: int):
                 amber_g_aggregate_cog,
             ]
         )
-    except ImportError:
+    except Exception:
         pass
 
     try:
@@ -599,7 +601,7 @@ def stderr_for_metric(metric, bootstrap_iters: int):
                 coco_cap_chair_aggregate_results_recall,
             ]
         )
-    except ImportError:
+    except Exception:
         pass
 
     if metric in bootstrappable:
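The switch from `except ImportError` to `except Exception` matters because a module can fail at import time with something other than `ImportError`, for example a `ValueError` raised by a binary-incompatible extension. A minimal sketch of the same idea as a reusable helper follows; `optional_import` is a hypothetical name used only for illustration and is not part of lmms-eval.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Import a module if possible; return None when the import fails for any reason."""
    try:
        return importlib.import_module(module_name)
    except Exception as exc:  # deliberately broad: not only ImportError
        # Record why the optional dependency was skipped instead of crashing.
        logger.debug("Skipping optional module %s: %r", module_name, exc)
        return None
```

The trade-off of the broad catch is that genuine bugs inside an optional module are also swallowed, which is why this sketch logs the exception rather than discarding it silently.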

lmms_eval/api/task.py

Lines changed: 52 additions & 6 deletions
@@ -11,7 +11,7 @@
 import subprocess
 from collections.abc import Callable
 from dataclasses import asdict, dataclass
-from functools import partial
+from functools import lru_cache, partial
 from glob import glob
 from typing import (
     Any,
@@ -46,6 +46,7 @@
     is_higher_better,
 )
 from lmms_eval.caching.cache import load_from_cache, save_to_cache
+from lmms_eval.caching.fs_detect import FsType, detect_fs_type, find_local_scratch
 from lmms_eval.filters import build_filter_ensemble
 
 # HuggingfaceM4/NoCaps contains truncated image in test split
@@ -61,6 +62,42 @@
 ]
 
 
+def _expand_cache_path(path: str) -> str:
+    return os.path.expanduser(os.path.expandvars(path))
+
+
+@lru_cache(maxsize=1)
+def _resolve_hf_datasets_cache_dir() -> str:
+    """Pick a datasets cache directory that is safe for file locks."""
+
+    explicit_cache_dir = os.getenv("LMMS_EVAL_DATASETS_CACHE", "").strip()
+    if explicit_cache_dir:
+        resolved_cache_dir = _expand_cache_path(explicit_cache_dir)
+        os.makedirs(resolved_cache_dir, exist_ok=True)
+        return resolved_cache_dir
+
+    hf_home = _expand_cache_path(os.getenv("HF_HOME", "~/.cache/huggingface"))
+    target_cache_dir = _expand_cache_path(os.getenv("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets")))
+
+    if detect_fs_type(target_cache_dir) != FsType.REMOTE:
+        os.makedirs(target_cache_dir, exist_ok=True)
+        return target_cache_dir
+
+    local_scratch = find_local_scratch()
+    if local_scratch is None:
+        eval_logger.warning(
+            "HF datasets cache '{}' is on a remote filesystem but no local scratch directory was found; continuing with the remote cache, so file-lock errors may still occur.",
+            target_cache_dir,
+        )
+        os.makedirs(target_cache_dir, exist_ok=True)
+        return target_cache_dir
+
+    local_cache_dir = os.path.join(local_scratch, "lmms_eval_hf_datasets", os.getenv("USER", "unknown"))
+    os.makedirs(local_cache_dir, exist_ok=True)
+    eval_logger.info("HF datasets cache '{}' is on a remote filesystem; using node-local cache '{}'.", target_cache_dir, local_cache_dir)
+    return local_cache_dir
+
+
 @dataclass
 class TaskConfig(dict):
     # task naming/registry
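The precedence that `_resolve_hf_datasets_cache_dir` applies can be restated as a pure function that is easy to test without touching the filesystem. This is a simplified sketch, not the committed code: `pick_datasets_cache_dir` is a hypothetical name, the `is_remote` callback stands in for `detect_fs_type(...) == FsType.REMOTE`, the `local_scratch` argument stands in for `find_local_scratch()`, and environment-variable expansion and directory creation are omitted.

```python
import os
from typing import Callable, Mapping, Optional


def pick_datasets_cache_dir(
    env: Mapping[str, str],
    is_remote: Callable[[str], bool],
    local_scratch: Optional[str],
) -> str:
    """Return the datasets cache directory chosen by the precedence rules."""
    # 1. An explicit override always wins.
    explicit = env.get("LMMS_EVAL_DATASETS_CACHE", "").strip()
    if explicit:
        return os.path.expanduser(explicit)

    # 2. Otherwise start from HF_DATASETS_CACHE, defaulting to $HF_HOME/datasets.
    hf_home = os.path.expanduser(env.get("HF_HOME", "~/.cache/huggingface"))
    target = os.path.expanduser(env.get("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets")))

    # 3. A local target is used as-is.
    if not is_remote(target):
        return target

    # 4. A remote target with no scratch space is kept (lock errors remain possible).
    if local_scratch is None:
        return target

    # 5. Otherwise redirect to a per-user directory on node-local scratch.
    return os.path.join(local_scratch, "lmms_eval_hf_datasets", env.get("USER", "unknown"))
```

Passing the environment and the filesystem checks as arguments is a choice made for this sketch only, so each of the five branches can be exercised with plain dictionaries and lambdas.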
@@ -263,18 +300,19 @@ def download(self, data_dir=None, cache_dir=None, download_mode=None) -> None:
             - `datasets.DownloadMode.FORCE_REDOWNLOAD`
                 Fresh download and fresh dataset.
         """
+        resolved_cache_dir = cache_dir if cache_dir is not None else _resolve_hf_datasets_cache_dir()
         self.dataset = datasets.load_dataset(
             path=self.DATASET_PATH,
             name=self.DATASET_NAME,
             data_dir=data_dir,
-            cache_dir=cache_dir,
+            cache_dir=resolved_cache_dir,
             download_mode=download_mode,
         )
         self.dataset_no_image = datasets.load_dataset(
             path=self.DATASET_PATH,
             name=self.DATASET_NAME,
             data_dir=data_dir,
-            cache_dir=cache_dir,
+            cache_dir=resolved_cache_dir,
             download_mode=download_mode,
         )
         for doc_name in self.dataset_no_image:
@@ -921,8 +959,9 @@ def download(self, dataset_kwargs=None) -> None:
         # Recursively search whether their is a zip and unzip it to the huggingface home
         download_config = DownloadConfig()
         download_config.max_retries = dataset_kwargs.get("max_retries", 10) if dataset_kwargs is not None else 10
-        download_config.num_proc = dataset_kwargs.get("num_proc", 8) if dataset_kwargs is not None else 8
+        download_config.num_proc = dataset_kwargs.get("num_proc", 1) if dataset_kwargs is not None else 1
         download_config.local_files_only = dataset_kwargs.get("local_files_only", False) if dataset_kwargs is not None else False
+        resolved_dataset_cache_dir = _resolve_hf_datasets_cache_dir()
         if dataset_kwargs is not None:
             if "From_YouTube" in dataset_kwargs:
 
@@ -946,11 +985,14 @@ def _download_from_youtube(path):
                 if accelerator.is_main_process:
                     dataset_kwargs.pop("From_YouTube")
                     assert "load_from_disk" not in dataset_kwargs, "load_from_disk must not be True when From_YouTube is True"
+                    youtube_dataset_kwargs = dict(dataset_kwargs)
+                    youtube_cache_dir = youtube_dataset_kwargs.pop("cache_dir", resolved_dataset_cache_dir)
                     self.all_dataset = datasets.load_dataset(
                         path=self.DATASET_PATH,
                         name=self.DATASET_NAME,
+                        cache_dir=youtube_cache_dir,
                         download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
-                        **dataset_kwargs if dataset_kwargs is not None else {},
+                        **youtube_dataset_kwargs,
                     )
                     dataset_kwargs["From_YouTube"] = True
                     cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset")  # download_parquet
@@ -1098,12 +1140,16 @@ def concat_tar_parts(tar_parts, output_tar):
             # `ds = load_datasets("lmms-lab/MMMU")`
             self.dataset = datasets.load_from_disk(dataset_path=self.DATASET_PATH)
         else:
+            load_dataset_kwargs = dict(dataset_kwargs) if dataset_kwargs is not None else {}
+            load_dataset_cache_dir = load_dataset_kwargs.pop("cache_dir", resolved_dataset_cache_dir)
             self.dataset = datasets.load_dataset(
                 path=self.DATASET_PATH,
                 name=self.DATASET_NAME,
+                cache_dir=load_dataset_cache_dir,
                 download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
                 download_config=download_config,
-                **dataset_kwargs if dataset_kwargs is not None else {},
+                num_proc=1,
+                **load_dataset_kwargs,
             )
 
         if self.config.process_docs is not None:
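The copy-then-pop pattern used for `cache_dir` in the hunks above can be shown in isolation. In this sketch `fake_load_dataset` is a hypothetical stand-in for `datasets.load_dataset` that simply echoes its arguments, and `load_with_default_cache` is an illustrative helper name; neither exists in lmms-eval.

```python
from typing import Any, Dict, Optional


def fake_load_dataset(path: str, cache_dir: Optional[str] = None, **kwargs: Any) -> Dict[str, Any]:
    """Stand-in loader that returns the arguments it received."""
    return {"path": path, "cache_dir": cache_dir, **kwargs}


def load_with_default_cache(
    path: str,
    dataset_kwargs: Optional[Dict[str, Any]],
    default_cache_dir: str,
) -> Dict[str, Any]:
    # Copy first so the caller's dict (often task config) is never mutated.
    kwargs = dict(dataset_kwargs) if dataset_kwargs is not None else {}
    # A user-supplied cache_dir wins; otherwise fall back to the resolved default.
    cache_dir = kwargs.pop("cache_dir", default_cache_dir)
    # cache_dir is no longer in kwargs, so passing it explicitly cannot collide.
    return fake_load_dataset(path, cache_dir=cache_dir, **kwargs)
```

Popping the key before the call is what avoids Python's "got multiple values for keyword argument" `TypeError`, which is raised whenever a name is passed both explicitly and through `**kwargs`. The same reasoning applies to any other keyword that is passed explicitly alongside an unpacked dict.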
