Merged
2 changes: 1 addition & 1 deletion docs/README.md
Expand Up @@ -189,7 +189,7 @@ results = evaluator.simple_evaluate(
| [Caching](advanced/caching.md) | SQLite-backed response cache for deterministic requests. Store, replay, merge shards across distributed ranks, and recover from crashes via JSONL audit log. |
| [Throughput Metrics](advanced/throughput_metrics.md) | Inference timing metrics logged by chat models — end-to-end latency, time to first token, tokens per second, and batch-level summaries. |

-The response cache stores only deterministic requests (`temperature=0`, `do_sample=False`). Enable it with `--use_cache ./eval_cache` to skip redundant model calls on repeated runs:
+The response cache stores only deterministic requests (`temperature=0`, `do_sample=False`). Enable it with `--use_cache ./eval_cache` or `--use_cache ./eval_cache/cache.db` to skip redundant model calls on repeated runs. In layered mode, lmms-eval keeps the shared root DB at `./eval_cache/cache.db` and writes each run into `./eval_cache/runs/<run_id>/` before merging it back:

```bash
python -m lmms_eval \
Expand Down
45 changes: 33 additions & 12 deletions docs/advanced/caching.md
Expand Up @@ -15,6 +15,20 @@ python -m lmms_eval \

On a second run with the same command, cached responses are loaded and the model is only called for new or changed requests.

When `--use_cache` points to a directory, or to an explicit root `cache.db`, lmms-eval uses a layered layout:

```text
eval_cache/
  cache.db
  cache.audit.jsonl
  runs/
    <run_id>/
      cache.db
      cache.audit.jsonl
```

The root `cache.db` is the shared read cache. Each evaluation run writes to its own UUID-scoped directory and rank 0 merges completed runs back into the root database under an exclusive lock. That gives you cache reuse without asking concurrent jobs to write into the same SQLite file.
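
The path handling described above can be sketched roughly as follows. This is an illustration only: `CacheLayout` and `resolve_cache_layout` are hypothetical names rather than lmms-eval API, and treating every path without a `.db` suffix as a directory is an assumption of the sketch.

```python
import uuid
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class CacheLayout:
    """Hypothetical description of where one run reads and writes."""

    root_db: Path  # shared read cache
    run_dir: Optional[Path]  # per-run write directory, None in legacy mode


def resolve_cache_layout(use_cache: str, run_id: Optional[str] = None) -> CacheLayout:
    """Map a --use_cache value to a layered or legacy layout (sketch)."""
    path = Path(use_cache)
    if path.suffix != ".db":
        # Assumption: anything without a .db suffix is a cache root directory.
        root = path
    elif path.name == "cache.db":
        # An explicit root cache.db selects layered mode as well.
        root = path.parent
    else:
        # Any other *.db target keeps the legacy single-target behavior.
        return CacheLayout(root_db=path, run_dir=None)
    # Each run gets its own UUID-scoped directory under runs/.
    run_id = run_id or uuid.uuid4().hex
    return CacheLayout(root_db=root / "cache.db", run_dir=root / "runs" / run_id)
```

With this sketch, `./eval_cache` and `./eval_cache/cache.db` resolve to the same root database, while a target such as `my_cache.db` has no run directory.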

### What gets cached

Only **deterministic** requests are cached. A request is considered non-deterministic (and skipped) when any of:
Expand Down Expand Up @@ -56,16 +70,23 @@ Float/int normalization: `temperature=0.0` and `temperature=0` produce the same

### File layout

Layered directory mode (recommended for shared or long-running jobs):

```
-{use_cache}/
-  {model_hash}/             # sha256("{model}|{model_args}")[:16]
-    rank0.db                # SQLite (WAL mode) - primary lookup
-    rank0.jsonl             # write-ahead audit log - crash recovery
-    rank1.db                # (if multi-GPU)
-    rank1.jsonl
+{cache_root}/
+  cache.db
+  cache.audit.jsonl
+  runs/
+    {run_id}/
+      cache.db                           # single-rank writes
+      cache.audit.jsonl
+      cache.db.shard.{rank}              # multi-rank writes
+      cache.db.audit.shard.{rank}.jsonl
+      .ready
+      .merged
```

-Per-rank files avoid write contention in distributed runs.
+Legacy file mode keeps the older behavior where a direct `.db` target may receive per-rank shard files next to the target DB.

### Cache invalidation

Expand All @@ -92,14 +113,16 @@ Responses are validated before caching:

### Merge distributed shards

-After a multi-GPU run, merge per-rank DBs into one:
+Layered directory mode merges distributed shards automatically on successful completion. Rank 0 acquires an exclusive merge lock, folds every ready run under `runs/` into the root `cache.db`, and marks the run directory as merged.
+
+If you are using legacy file mode, you can still merge shard DBs manually:

```python
from lmms_eval.caching.response_cache import ResponseCache

ResponseCache.merge_shards(
-    shard_paths=["eval_cache/abc123/rank0.db", "eval_cache/abc123/rank1.db"],
-    output_path="eval_cache/abc123/merged.db",
+    shard_paths=["eval_cache/cache.db.shard.0", "eval_cache/cache.db.shard.1"],
+    output_path="eval_cache/cache.db",
)
```
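
For intuition, folding shard databases into one target can be sketched with the standard-library `sqlite3` module. This is not the `ResponseCache.merge_shards` implementation: the `responses` table, its columns, and the function name `merge_sqlite_shards` are assumptions made for the example, and the exclusive merge lock described above is omitted.

```python
import sqlite3
from pathlib import Path
from typing import Iterable

# Hypothetical schema; the real lmms-eval tables may differ.
SCHEMA = "CREATE TABLE IF NOT EXISTS responses (key TEXT PRIMARY KEY, value TEXT NOT NULL)"


def merge_sqlite_shards(shard_paths: Iterable[str], output_path: str) -> int:
    """Copy rows from each shard into output_path and return the merged row count."""
    out = sqlite3.connect(output_path)
    try:
        out.execute(SCHEMA)
        for shard in shard_paths:
            if not Path(shard).exists():
                continue  # skip shards that were never written
            out.execute("ATTACH DATABASE ? AS shard", (shard,))
            # INSERT OR IGNORE keeps the first value stored for a given key.
            out.execute("INSERT OR IGNORE INTO responses SELECT key, value FROM shard.responses")
            out.commit()  # DETACH is not allowed inside an open transaction
            out.execute("DETACH DATABASE shard")
        return out.execute("SELECT COUNT(*) FROM responses").fetchone()[0]
    finally:
        out.close()
```

Keeping the first value per key is one possible conflict policy; since only deterministic requests are cached, duplicate keys across shards would normally carry the same response.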

Expand Down Expand Up @@ -211,5 +234,3 @@ On a second run with the same task/docs, cached responses will be loaded and onl
### Optional: legacy SQLite cache wrapper

There is also a separate optional wrapper `CachingLMM` (see `lmms_eval.api.model.CachingLMM`) that caches by hashing the entire call arguments to a SQLite DB (via `SqliteDict`). It is independent from the JSONL cache above and can be useful for broader API‑level caching. For most users, enabling `LMMS_EVAL_USE_CACHE=True` is sufficient and simpler.


2 changes: 1 addition & 1 deletion docs/getting-started/commands.md
Expand Up @@ -29,7 +29,7 @@ This mode supports a number of command-line arguments, the details of which can

- `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, limits evaluation to the first X documents per task (if an integer) or the first X% of documents per task (if a float). Useful for debugging, especially on costly API models.

-- `--use_cache` : Should be a path where a sqlite db file can be written to. Takes a string of format `/path/to/sqlite_cache_` in order to create a cache db at `/path/to/sqlite_cache_rank{i}.db` for each process (0-NUM_GPUS). This allows results of prior runs to be cached, so that there is no need to re-run results in order to re-score or re-run a given (model, task) pair again.
+- `--use_cache` : Accepts either a cache root directory or a SQLite `.db` file path. When you pass a directory, or an explicit root `<dir>/cache.db`, lmms-eval stores the shared cache at `<dir>/cache.db` and writes each evaluation run into `<dir>/runs/<run_id>/...` before merging back into the root DB. This isolates concurrent writers automatically while keeping a single shared cache for reuse across runs. Other `.db` filenames keep the legacy single-target behavior.

- `--cache_requests` : Can be "true", "refresh", or "delete". "true" means that the cache should be used. "refresh" means that you wish to regenerate the cache, which you should run if you change your dataset configuration for a given task. "delete" will delete the cache. Cached files are stored under lmms_eval/cache/.cache unless you specify a different path via the environment variable: `LM_HARNESS_CACHE_PATH`. e.g. `LM_HARNESS_CACHE_PATH=~/Documents/cache_for_lm_harness`.
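
As a rough illustration of the `--limit` semantics described in the list above, the sketch below turns a limit into a document count. `resolve_limit` is a hypothetical helper, and rounding a fractional limit up is an assumption; the actual harness may round differently.

```python
import math
from typing import Optional, Union


def resolve_limit(limit: Optional[Union[int, float]], num_docs: int) -> int:
    """Return how many leading documents to evaluate for one task (sketch)."""
    if limit is None:
        return num_docs  # no limit: evaluate everything
    if isinstance(limit, float) and 0.0 < limit < 1.0:
        # Fractional limit: a share of the task's documents (rounded up here).
        return min(num_docs, math.ceil(num_docs * limit))
    # Integer limit: the first N documents, capped at the task size.
    return min(num_docs, int(limit))
```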

Expand Down
Expand Up @@ -4,14 +4,14 @@ set -euo pipefail

LIMIT="${LIMIT:-50}"
BATCH_SIZE="${BATCH_SIZE:-1}"
-OUTPUT_PATH="${OUTPUT_PATH:-./logs/minerva_dummy_video_reader/}"
+OUTPUT_PATH="${OUTPUT_PATH:-./logs/minerva_dummy/}"
VERBOSITY="${VERBOSITY:-INFO}"

echo "[INFO] MINERVA dummy video-reader benchmark"
echo "[INFO] tasks=minerva limit=${LIMIT} batch_size=${BATCH_SIZE}"

uv run --with pylance --with pyarrow python -m lmms_eval \
---model dummy_video_reader \
+--model dummy \
--model_args "read_bytes=65536,response=A,allow_remote=false,fail_on_missing=true" \
--tasks minerva \
--batch_size "${BATCH_SIZE}" \
Expand Down
9 changes: 5 additions & 4 deletions lmms_eval/__main__.py
Expand Up @@ -220,10 +220,11 @@ def parse_eval_args() -> tuple[argparse.ArgumentParser, argparse.Namespace]:
type=str,
default=None,
metavar="PATH",
-help="Path to a SQLite .db file for response-level caching (e.g. ./my_cache.db). "
-"Caches deterministic model responses (temperature=0) for reuse across runs. "
-"In distributed mode, temporary per-rank shards are auto-merged into this file. "
-"A .db suffix is appended automatically if missing. `None` to disable.",
+help="Path to a response-cache root directory or SQLite .db file. "
+"If PATH is a directory, or an explicit PATH/cache.db root file, lmms-eval keeps the shared root cache at "
+"PATH/cache.db and writes each run into PATH/runs/<run_id>/ before auto-merging back into the root DB. "
+"This avoids write contention across concurrent jobs while preserving cache reuse. "
+"If PATH is a file, legacy single-target behavior is preserved. `None` disables response caching.",
)
parser.add_argument(
"--cache_requests",
Expand Down
8 changes: 5 additions & 3 deletions lmms_eval/api/metrics.py
Expand Up @@ -565,7 +565,9 @@ def stderr_for_metric(metric, bootstrap_iters: int):
ter,
]

-# Optional imports for tasks with extra dependencies (spacy, etc.)
+# Optional imports for tasks with extra dependencies (spacy, etc.).
+# Catch Exception (not just ImportError) because transitive deps may raise
+# ValueError/RuntimeError from binary incompatibilities (e.g. numpy/spacy).
try:
from lmms_eval.tasks.amber_g.utils import (
amber_g_aggregate_chair,
Expand All @@ -582,7 +584,7 @@ def stderr_for_metric(metric, bootstrap_iters: int):
amber_g_aggregate_cog,
]
)
-except ImportError:
+except Exception:
pass

try:
Expand All @@ -599,7 +601,7 @@ def stderr_for_metric(metric, bootstrap_iters: int):
coco_cap_chair_aggregate_results_recall,
]
)
-except ImportError:
+except Exception:
pass

if metric in bootstrappable:
Expand Down
58 changes: 52 additions & 6 deletions lmms_eval/api/task.py
Expand Up @@ -11,7 +11,7 @@
import subprocess
from collections.abc import Callable
from dataclasses import asdict, dataclass
-from functools import partial
+from functools import lru_cache, partial
from glob import glob
from typing import (
Any,
Expand Down Expand Up @@ -46,6 +46,7 @@
is_higher_better,
)
from lmms_eval.caching.cache import load_from_cache, save_to_cache
from lmms_eval.caching.fs_detect import FsType, detect_fs_type, find_local_scratch
from lmms_eval.filters import build_filter_ensemble

# HuggingfaceM4/NoCaps contains truncated image in test split
Expand All @@ -61,6 +62,42 @@
]


def _expand_cache_path(path: str) -> str:
    return os.path.expanduser(os.path.expandvars(path))


@lru_cache(maxsize=1)
def _resolve_hf_datasets_cache_dir() -> str:
    """Pick a datasets cache directory that is safe for file locks."""

    explicit_cache_dir = os.getenv("LMMS_EVAL_DATASETS_CACHE", "").strip()
    if explicit_cache_dir:
        resolved_cache_dir = _expand_cache_path(explicit_cache_dir)
        os.makedirs(resolved_cache_dir, exist_ok=True)
        return resolved_cache_dir

    hf_home = _expand_cache_path(os.getenv("HF_HOME", "~/.cache/huggingface"))
    target_cache_dir = _expand_cache_path(os.getenv("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets")))

    if detect_fs_type(target_cache_dir) != FsType.REMOTE:
        os.makedirs(target_cache_dir, exist_ok=True)
        return target_cache_dir

    local_scratch = find_local_scratch()
    if local_scratch is None:
        eval_logger.warning(
            "HF datasets cache '{}' is on a remote filesystem but no local scratch directory was found; continuing with the remote cache, so file-lock errors may still occur.",
            target_cache_dir,
        )
        os.makedirs(target_cache_dir, exist_ok=True)
        return target_cache_dir

    local_cache_dir = os.path.join(local_scratch, "lmms_eval_hf_datasets", os.getenv("USER", "unknown"))
    os.makedirs(local_cache_dir, exist_ok=True)
    eval_logger.info("HF datasets cache '{}' is on a remote filesystem; using node-local cache '{}'.", target_cache_dir, local_cache_dir)
    return local_cache_dir
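
The environment-variable precedence in the function above can be isolated in a small, side-effect-free sketch. `pick_datasets_cache_dir` is a hypothetical helper written for illustration; it deliberately omits path expansion, directory creation, and the remote-filesystem fallback that `_resolve_hf_datasets_cache_dir` performs.

```python
import os
from typing import Mapping


def pick_datasets_cache_dir(env: Mapping[str, str]) -> str:
    """Return the preferred datasets cache directory for a given environment (sketch)."""
    # 1. An explicit lmms-eval override wins when it is set and non-blank.
    explicit = env.get("LMMS_EVAL_DATASETS_CACHE", "").strip()
    if explicit:
        return explicit
    # 2. Otherwise use HF_DATASETS_CACHE, defaulting to <HF_HOME>/datasets.
    hf_home = env.get("HF_HOME", "~/.cache/huggingface")
    return env.get("HF_DATASETS_CACHE", os.path.join(hf_home, "datasets"))
```

Passing the environment as a mapping keeps the precedence easy to check without touching `os.environ` or the filesystem.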


@dataclass
class TaskConfig(dict):
# task naming/registry
Expand Down Expand Up @@ -263,18 +300,19 @@ def download(self, data_dir=None, cache_dir=None, download_mode=None) -> None:
- `datasets.DownloadMode.FORCE_REDOWNLOAD`
Fresh download and fresh dataset.
"""
resolved_cache_dir = cache_dir if cache_dir is not None else _resolve_hf_datasets_cache_dir()
self.dataset = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
data_dir=data_dir,
-cache_dir=cache_dir,
+cache_dir=resolved_cache_dir,
download_mode=download_mode,
)
self.dataset_no_image = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
data_dir=data_dir,
-cache_dir=cache_dir,
+cache_dir=resolved_cache_dir,
download_mode=download_mode,
)
for doc_name in self.dataset_no_image:
Expand Down Expand Up @@ -921,8 +959,9 @@ def download(self, dataset_kwargs=None) -> None:
# Recursively search whether there is a zip and unzip it to the huggingface home
download_config = DownloadConfig()
download_config.max_retries = dataset_kwargs.get("max_retries", 10) if dataset_kwargs is not None else 10
-download_config.num_proc = dataset_kwargs.get("num_proc", 8) if dataset_kwargs is not None else 8
+download_config.num_proc = dataset_kwargs.get("num_proc", 1) if dataset_kwargs is not None else 1
download_config.local_files_only = dataset_kwargs.get("local_files_only", False) if dataset_kwargs is not None else False
resolved_dataset_cache_dir = _resolve_hf_datasets_cache_dir()
if dataset_kwargs is not None:
if "From_YouTube" in dataset_kwargs:

Expand All @@ -946,11 +985,14 @@ def _download_from_youtube(path):
if accelerator.is_main_process:
dataset_kwargs.pop("From_YouTube")
assert "load_from_disk" not in dataset_kwargs, "load_from_disk must not be True when From_YouTube is True"
youtube_dataset_kwargs = dict(dataset_kwargs)
youtube_cache_dir = youtube_dataset_kwargs.pop("cache_dir", resolved_dataset_cache_dir)
self.all_dataset = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
cache_dir=youtube_cache_dir,
download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
-**dataset_kwargs if dataset_kwargs is not None else {},
+**youtube_dataset_kwargs,
)
dataset_kwargs["From_YouTube"] = True
cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset") # download_parquet
Expand Down Expand Up @@ -1098,12 +1140,16 @@ def concat_tar_parts(tar_parts, output_tar):
# `ds = load_datasets("lmms-lab/MMMU")`
self.dataset = datasets.load_from_disk(dataset_path=self.DATASET_PATH)
else:
load_dataset_kwargs = dict(dataset_kwargs) if dataset_kwargs is not None else {}
load_dataset_cache_dir = load_dataset_kwargs.pop("cache_dir", resolved_dataset_cache_dir)
self.dataset = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
cache_dir=load_dataset_cache_dir,
download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
download_config=download_config,
-**dataset_kwargs if dataset_kwargs is not None else {},
+num_proc=1,
+**load_dataset_kwargs,
)

if self.config.process_docs is not None:
Expand Down