EvolvingLMMs-Lab
diff --git a/docs/README.md b/docs/README.md
Lines changed: 1 addition & 1 deletion
diff --git a/docs/advanced/caching.md b/docs/advanced/caching.md
Lines changed: 33 additions & 12 deletions
diff --git a/docs/getting-started/commands.md b/docs/getting-started/commands.md
Lines changed: 1 addition & 1 deletion
diff --git a/examples/models/minerva_dummy_video_reader.sh b/examples/models/minerva_dummy.sh
rename from examples/models/minerva_dummy_video_reader.sh
rename to examples/models/minerva_dummy.sh
Lines changed: 2 additions & 2 deletions
diff --git a/lmms_eval/__main__.py b/lmms_eval/__main__.py
Lines changed: 5 additions & 4 deletions
diff --git a/lmms_eval/api/metrics.py b/lmms_eval/api/metrics.py
Lines changed: 5 additions & 3 deletions
diff --git a/lmms_eval/api/task.py b/lmms_eval/api/task.py
Lines changed: 52 additions & 6 deletions
@@ -189,7 +189,7 @@ results = evaluator.simple_evaluate(
 | [Caching](advanced/caching.md) | SQLite-backed response cache for deterministic requests. Store, replay, merge shards across distributed ranks, and recover from crashes via JSONL audit log. |
 | [Throughput Metrics](advanced/throughput_metrics.md) | Inference timing metrics logged by chat models — end-to-end latency, time to first token, tokens per second, and batch-level summaries. |
 
-The response cache stores only deterministic requests (`temperature=0`, `do_sample=False`). Enable it with `--use_cache ./eval_cache` to skip redundant model calls on repeated runs:
+The response cache stores only deterministic requests (`temperature=0`, `do_sample=False`). Enable it with `--use_cache ./eval_cache` or `--use_cache ./eval_cache/cache.db` to skip redundant model calls on repeated runs. In layered mode, lmms-eval keeps the shared root DB at `./eval_cache/cache.db` and writes each run into `./eval_cache/runs/<run_id>/` before merging it back:
 
 ```bash
 python -m lmms_eval \
 
@@ -15,6 +15,20 @@ python -m lmms_eval \
 
 On a second run with the same command, cached responses are loaded and the model is only called for new or changed requests.
 
+When `--use_cache` points to a directory, or to an explicit root `cache.db`, lmms-eval uses a layered layout:
+
+```text
+eval_cache/
+  cache.db
+  cache.audit.jsonl
+  runs/
+    <run_id>/
+      cache.db
+      cache.audit.jsonl
+```
+
+The root `cache.db` is the shared read cache. Each evaluation run writes to its own UUID-scoped directory and rank 0 merges completed runs back into the root database under an exclusive lock. That gives you cache reuse without asking concurrent jobs to write into the same SQLite file.
+
 ### What gets cached
 
 Only **deterministic** requests are cached. A request is considered non-deterministic (and skipped) when any of:
@@ -56,16 +70,23 @@ Float/int normalization: `temperature=0.0` and `temperature=0` produce the same
 
 ### File layout
 
+Layered directory mode (recommended for shared or long-running jobs):
+
 ```
-{use_cache}/
-  {model_hash}/          # sha256("{model}|{model_args}")[:16]
-    rank0.db             # SQLite (WAL mode) - primary lookup
-    rank0.jsonl          # write-ahead audit log - crash recovery
-    rank1.db             # (if multi-GPU)
-    rank1.jsonl
+{cache_root}/
+  cache.db
+  cache.audit.jsonl
+  runs/
+    {run_id}/
+      cache.db                     # single-rank writes
+      cache.audit.jsonl
+      cache.db.shard.{rank}        # multi-rank writes
+      cache.db.audit.shard.{rank}.jsonl
+      .ready
+      .merged
 ```
 
-Per-rank files avoid write contention in distributed runs.
+Legacy file mode keeps the older behavior where a direct `.db` target may receive per-rank shard files next to the target DB.
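The mapping from a `--use_cache` value to the two layouts described above can be sketched as follows. This is an illustration only, not the lmms-eval implementation: `CacheLayout` and `resolve_cache_layout` are hypothetical names, and treating any path without a `.db` suffix as a directory is an assumption.

```python
import os
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CacheLayout:
    mode: str                # "layered" or "legacy"
    root_db: str             # shared DB that later runs read from
    run_dir: Optional[str]   # per-run write directory (layered mode only)


def resolve_cache_layout(use_cache: str, run_id: Optional[str] = None) -> CacheLayout:
    """Hypothetical helper: classify a --use_cache value as layered or legacy."""
    path = os.path.normpath(use_cache)
    name = os.path.basename(path)

    # Any .db file other than the root cache.db keeps the legacy single-target behavior.
    if name.endswith(".db") and name != "cache.db":
        return CacheLayout("legacy", path, None)

    # A directory, or an explicit <dir>/cache.db, selects the layered layout.
    root = os.path.dirname(path) if name == "cache.db" else path
    run_id = run_id or uuid.uuid4().hex  # assumption: one fresh id per evaluation run
    return CacheLayout(
        "layered",
        os.path.join(root, "cache.db"),
        os.path.join(root, "runs", run_id),
    )
```

Under these assumptions, `./eval_cache` and `./eval_cache/cache.db` resolve to the same root DB, while `./my_cache.db` stays in legacy mode with no run directory.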
 
 ### Cache invalidation
 
@@ -92,14 +113,16 @@ Responses are validated before caching:
 
 ### Merge distributed shards
 
-After a multi-GPU run, merge per-rank DBs into one:
+Layered directory mode merges distributed shards automatically on successful completion. Rank 0 acquires an exclusive merge lock, folds every ready run under `runs/` into the root `cache.db`, and marks the run directory as merged.
+
+If you are using legacy file mode, you can still merge shard DBs manually:
 
 ```python
 from lmms_eval.caching.response_cache import ResponseCache
 
 ResponseCache.merge_shards(
-    shard_paths=["eval_cache/abc123/rank0.db", "eval_cache/abc123/rank1.db"],
-    output_path="eval_cache/abc123/merged.db",
+    shard_paths=["eval_cache/cache.db.shard.0", "eval_cache/cache.db.shard.1"],
+    output_path="eval_cache/cache.db",
 )
 ```
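For readers curious what a shard merge involves at the SQLite level, here is a minimal sketch using only the standard library. It is not the body of `ResponseCache.merge_shards`: the `responses(key, value)` schema and the `merge_sqlite_shards` name are hypothetical, and the first-writer-wins rule for duplicate keys is an assumption.

```python
import sqlite3

# Hypothetical schema; the real cache tables may differ.
SCHEMA = "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, value TEXT)"


def merge_sqlite_shards(shard_paths, output_path):
    """Fold each shard DB into output_path, keeping the first value seen per key."""
    # isolation_level=None gives autocommit, so ATTACH runs outside any transaction.
    out = sqlite3.connect(output_path, isolation_level=None)
    try:
        out.execute(SCHEMA)
        for shard in shard_paths:
            out.execute("ATTACH DATABASE ? AS shard", (shard,))
            out.execute("BEGIN IMMEDIATE")  # take the write lock up front
            try:
                out.execute(
                    "INSERT OR IGNORE INTO responses "
                    "SELECT key, value FROM shard.responses"
                )
                out.execute("COMMIT")
            except Exception:
                out.execute("ROLLBACK")
                raise
            finally:
                out.execute("DETACH DATABASE shard")
    finally:
        out.close()
```

Copying one shard per transaction means a failure part-way through leaves the output holding only fully merged shards.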
 
@@ -211,5 +234,3 @@ On a second run with the same task/docs, cached responses will be loaded and onl
 ### Optional: legacy SQLite cache wrapper
 
 There is also a separate optional wrapper `CachingLMM` (see `lmms_eval.api.model.CachingLMM`) that caches by hashing the entire call arguments to a SQLite DB (via `SqliteDict`). It is independent from the JSONL cache above and can be useful for broader API‑level caching. For most users, enabling `LMMS_EVAL_USE_CACHE=True` is sufficient and simpler.
-
-
 
@@ -29,7 +29,7 @@ This mode supports a number of command-line arguments, the details of which can
 
 - `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, will limit the number of documents to evaluate to the first X documents (if an integer) per task or first X% of documents per task. Useful for debugging, especially on costly API models.
 
-- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
+- `--use_cache` : Accepts either a cache root directory or a SQLite `.db` file path. When you pass a directory, or an explicit root `<dir>/cache.db`, lmms-eval stores the shared cache at `<dir>/cache.db` and writes each evaluation run into `<dir>/runs/<run_id>/...` before merging back into the root DB. This isolates concurrent writers automatically while keeping a single shared cache for reuse across runs. Other `.db` filenames keep the legacy single-target behavior.
 
 - `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under lmms_eval/cache/.cache unless you specify a different path via the environment variable: `LM_HARNESS_CACHE_PATH`. e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`.
 
 
@@ -4,14 +4,14 @@ set -euo pipefail
 
 LIMIT="${LIMIT:-50}"
 BATCH_SIZE="${BATCH_SIZE:-1}"
-OUTPUT_PATH="${OUTPUT_PATH:-./logs/minerva_dummy_video_reader/}"
+OUTPUT_PATH="${OUTPUT_PATH:-./logs/minerva_dummy/}"
 VERBOSITY="${VERBOSITY:-INFO}"
 
 echo "[INFO] MINERVA dummy video-reader benchmark"
 echo "[INFO] tasks=minerva limit=${LIMIT} batch_size=${BATCH_SIZE}"
 
 uv run --with pylance --with pyarrow python -m lmms_eval \
-    --model dummy_video_reader \
+    --model dummy \
     --model_args "read_bytes=65536,response=A,allow_remote=false,fail_on_missing=true" \
     --tasks minerva \
     --batch_size "${BATCH_SIZE}" \
 
@@ -220,10 +220,11 @@ def parse_eval_args() -> tuple[argparse.ArgumentParser, argparse.Namespace]:
         type=str,
         default=None,
         metavar="PATH",
-        help="Path to a SQLite .db file for response-level caching (e.g. ./my_cache.db). "
-        "Caches deterministic model responses (temperature=0) for reuse across runs. "
-        "In distributed mode, temporary per-rank shards are auto-merged into this file. "
-        "A .db suffix is appended automatically if missing. `None` to disable.",
+        help="Path to a response-cache root directory or SQLite .db file. "
+        "If PATH is a directory, or an explicit PATH/cache.db root file, lmms-eval keeps the shared root cache at "
+        "PATH/cache.db and writes each run into PATH/runs/<run_id>/ before auto-merging back into the root DB. "
+        "This avoids write contention across concurrent jobs while preserving cache reuse. "
+        "If PATH is a file, legacy single-target behavior is preserved. `None` disables response caching.",
     )
     parser.add_argument(
         "--cache_requests",
 
@@ -565,7 +565,9 @@ def stderr_for_metric(metric, bootstrap_iters: int):
         ter,
     ]
 
-    # Optional imports for tasks with extra dependencies (spacy, etc.)
+    # Optional imports for tasks with extra dependencies (spacy, etc.).
+    # Catch Exception (not just ImportError) because transitive deps may raise
+    # ValueError/RuntimeError from binary incompatibilities (e.g. numpy/spacy).
     try:
         from lmms_eval.tasks.amber_g.utils import (
             amber_g_aggregate_chair,
@@ -582,7 +584,7 @@ def stderr_for_metric(metric, bootstrap_iters: int):
                 amber_g_aggregate_cog,
             ]
         )
-    except ImportError:
+    except Exception:
         pass
 
     try:
@@ -599,7 +601,7 @@ def stderr_for_metric(metric, bootstrap_iters: int):
                 coco_cap_chair_aggregate_results_recall,
             ]
         )
-    except ImportError:
+    except Exception:
         pass
 
     if metric in bootstrappable:
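The broad `except Exception` guard above can be factored into a small helper that also records why an optional module was skipped. This is a sketch only; `optional_import` is a hypothetical name and is not part of lmms-eval.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional

logger = logging.getLogger(__name__)


def optional_import(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None if importing it fails for any reason."""
    try:
        return importlib.import_module(module_name)
    except Exception as exc:
        # Deliberately broad: a broken binary dependency can raise ValueError or
        # RuntimeError at import time, not only ImportError.
        logger.debug("Skipping optional module %s: %r", module_name, exc)
        return None
```

Logging at debug level keeps the failure discoverable, which a bare `pass` does not.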
 
@@ -11,7 +11,7 @@
 import subprocess
 from collections.abc import Callable
 from dataclasses import asdict, dataclass
-from functools import partial
+from functools import lru_cache, partial
 from glob import glob
 from typing import (
     Any,
@@ -46,6 +46,7 @@
     is_higher_better,
 )
 from lmms_eval.caching.cache import load_from_cache, save_to_cache
+from lmms_eval.caching.fs_detect import FsType, detect_fs_type, find_local_scratch
 from lmms_eval.filters import build_filter_ensemble
 
 # HuggingfaceM4/NoCaps contains truncated image in test split
@@ -61,6 +62,42 @@
 ]
 
 
+def _expand_cache_path(path: str) -> str:
+    return os.path.expanduser(os.path.expandvars(path))
+
+
+@lru_cache(maxsize=1)
+def _resolve_hf_datasets_cache_dir() -> str:
+    """Pick a datasets cache directory that is safe for file locks."""
+
+    explicit_cache_dir = os.getenv("LMMS_EVAL_DATASETS_CACHE", "").strip()
+    if explicit_cache_dir:
+        resolved_cache_dir = _expand_cache_path(explicit_cache_dir)
+        os.makedirs(resolved_cache_dir, exist_ok=True)
+        return resolved_cache_dir
+
+    hf_home = _expand_cache_path(os.getenv("HF_HOME", "~/.cache/huggingface"))
+    target_cache_dir = _expand_cache_path(os.getenv("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets")))
+
+    if detect_fs_type(target_cache_dir) != FsType.REMOTE:
+        os.makedirs(target_cache_dir, exist_ok=True)
+        return target_cache_dir
+
+    local_scratch = find_local_scratch()
+    if local_scratch is None:
+        eval_logger.warning(
+            "HF datasets cache '{}' is on a remote filesystem but no local scratch directory was found; continuing with the remote cache, so file-lock errors may still occur.",
+            target_cache_dir,
+        )
+        os.makedirs(target_cache_dir, exist_ok=True)
+        return target_cache_dir
+
+    local_cache_dir = os.path.join(local_scratch, "lmms_eval_hf_datasets", os.getenv("USER", "unknown"))
+    os.makedirs(local_cache_dir, exist_ok=True)
+    eval_logger.info("HF datasets cache '{}' is on a remote filesystem; using node-local cache '{}'.", target_cache_dir, local_cache_dir)
+    return local_cache_dir
+
+
 @dataclass
 class TaskConfig(dict):
     # task naming/registry
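The fallback order in `_resolve_hf_datasets_cache_dir` can be exercised without a real network filesystem by injecting the two probes. The sketch below restates that order under assumptions: `pick_cache_dir`, `is_remote`, and `find_scratch` are hypothetical stand-ins for the function above, `detect_fs_type`, and `find_local_scratch`, and directory creation and logging are left out.

```python
import os
from typing import Callable, Mapping, Optional


def pick_cache_dir(
    default_dir: str,
    is_remote: Callable[[str], bool],
    find_scratch: Callable[[], Optional[str]],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Choose a datasets cache directory, preferring one that is safe for file locks."""
    env = os.environ if env is None else env

    # 1. An explicit override always wins.
    explicit = env.get("LMMS_EVAL_DATASETS_CACHE", "").strip()
    if explicit:
        return os.path.expanduser(os.path.expandvars(explicit))

    # 2. A default directory on a local filesystem is used as-is.
    if not is_remote(default_dir):
        return default_dir

    # 3. On a remote filesystem, fall back to node-local scratch if there is any.
    scratch = find_scratch()
    if scratch is None:
        return default_dir  # nothing better available; lock errors remain possible
    return os.path.join(scratch, "lmms_eval_hf_datasets", env.get("USER", "unknown"))
```

Passing the probes as arguments is a choice made for the sketch so each branch can be checked in isolation; the function in the diff calls the helpers directly and caches its result with `lru_cache`.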
@@ -263,18 +300,19 @@ def download(self, data_dir=None, cache_dir=None, download_mode=None) -> None:
             - `datasets.DownloadMode.FORCE_REDOWNLOAD`
                 Fresh download and fresh dataset.
         """
+        resolved_cache_dir = cache_dir if cache_dir is not None else _resolve_hf_datasets_cache_dir()
         self.dataset = datasets.load_dataset(
             path=self.DATASET_PATH,
             name=self.DATASET_NAME,
             data_dir=data_dir,
-            cache_dir=cache_dir,
+            cache_dir=resolved_cache_dir,
             download_mode=download_mode,
         )
         self.dataset_no_image = datasets.load_dataset(
             path=self.DATASET_PATH,
             name=self.DATASET_NAME,
             data_dir=data_dir,
-            cache_dir=cache_dir,
+            cache_dir=resolved_cache_dir,
             download_mode=download_mode,
         )
         for doc_name in self.dataset_no_image:
@@ -921,8 +959,9 @@ def download(self, dataset_kwargs=None) -> None:
         # Recursively search whether there is a zip and unzip it to the huggingface home
         download_config = DownloadConfig()
         download_config.max_retries = dataset_kwargs.get("max_retries", 10) if dataset_kwargs is not None else 10
-        download_config.num_proc = dataset_kwargs.get("num_proc", 8) if dataset_kwargs is not None else 8
+        download_config.num_proc = dataset_kwargs.get("num_proc", 1) if dataset_kwargs is not None else 1
         download_config.local_files_only = dataset_kwargs.get("local_files_only", False) if dataset_kwargs is not None else False
+        resolved_dataset_cache_dir = _resolve_hf_datasets_cache_dir()
         if dataset_kwargs is not None:
             if "From_YouTube" in dataset_kwargs:
 
@@ -946,11 +985,14 @@ def _download_from_youtube(path):
                 if accelerator.is_main_process:
                     dataset_kwargs.pop("From_YouTube")
                     assert "load_from_disk" not in dataset_kwargs, "load_from_disk must not be True when From_YouTube is True"
+                    youtube_dataset_kwargs = dict(dataset_kwargs)
+                    youtube_cache_dir = youtube_dataset_kwargs.pop("cache_dir", resolved_dataset_cache_dir)
                     self.all_dataset = datasets.load_dataset(
                         path=self.DATASET_PATH,
                         name=self.DATASET_NAME,
+                        cache_dir=youtube_cache_dir,
                         download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
-                        **dataset_kwargs if dataset_kwargs is not None else {},
+                        **youtube_dataset_kwargs,
                     )
                     dataset_kwargs["From_YouTube"] = True
                     cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset")  # download_parquet
@@ -1098,12 +1140,16 @@ def concat_tar_parts(tar_parts, output_tar):
             # `ds = load_datasets("lmms-lab/MMMU")`
             self.dataset = datasets.load_from_disk(dataset_path=self.DATASET_PATH)
         else:
+            load_dataset_kwargs = dict(dataset_kwargs) if dataset_kwargs is not None else {}
+            load_dataset_cache_dir = load_dataset_kwargs.pop("cache_dir", resolved_dataset_cache_dir)
             self.dataset = datasets.load_dataset(
                 path=self.DATASET_PATH,
                 name=self.DATASET_NAME,
+                cache_dir=load_dataset_cache_dir,
                 download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
                 download_config=download_config,
-                **dataset_kwargs if dataset_kwargs is not None else {},
+                num_proc=1,
+                **load_dataset_kwargs,
             )
 
         if self.config.process_docs is not None:
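The copy-then-pop pattern used for `cache_dir` in the hunks above can be shown in isolation. This is a sketch only: `call_with_default_cache_dir` and the `loader` callable are hypothetical stand-ins for the `datasets.load_dataset` call sites.

```python
from typing import Any, Callable, Mapping, Optional


def call_with_default_cache_dir(
    loader: Callable[..., Any],
    dataset_kwargs: Optional[Mapping[str, Any]],
    default_cache_dir: str,
) -> Any:
    """Call loader with a cache_dir, letting the caller's kwargs override the default."""
    # Copy first so the caller's dict is never mutated.
    kwargs = dict(dataset_kwargs) if dataset_kwargs is not None else {}
    # Pop so cache_dir is passed exactly once; otherwise Python raises TypeError
    # for a duplicate keyword argument.
    cache_dir = kwargs.pop("cache_dir", default_cache_dir)
    return loader(cache_dir=cache_dir, **kwargs)
```

Any other keyword passed explicitly next to `**kwargs` needs the same handling, because a key that appears in both places raises `TypeError` at call time.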