
lm-evaluation-harness container: shared aiohttp ClientSession with ClientTimeout(total=...) aborts long benchmark runs (e.g. MMLU-Pro 12k prompts) #974

@shauryr

Description


Summary

The nvidia_lm_eval 26.3 wheel (shipped in the eval-factory/lm-evaluation-harness:26.03 container, and a transitive dependency of nemo-evaluator) contains a known-broken aiohttp session pattern in lm_eval/models/api_models.py:510-548. On long-running benchmarks (e.g. MMLU-Pro, 12,032 prompts) every run aborts with asyncio.TimeoutError raised from aiohttp/helpers.py:713. EleutherAI has an upstream fix in PR #2795, but that PR is still open/unmerged and the NVIDIA repackage has not picked it up.

We have hit this five times in a row across two endpoints, two max_tokens values, and two parallelism settings (full matrix below). I'm filing here because nvidia_lm_eval does not itself have a public issue tracker (no Project-URL in its METADATA, no GitHub repo we could find), and the lm-evaluation-harness container in this repo is the natural integration point — feel free to reroute if there's a more appropriate channel.

Repro

  • lm-evaluation-harness derived package: nvidia_lm_eval==26.3 (*.dist-info/METADATA confirms NVIDIA repackage, MIT, top-level package still lm_eval)
  • Common-utilities package: nemo-evaluator==0.2.7
  • HTTP client: aiohttp==3.13.5
  • Inference server: vLLM, OpenAI-compatible endpoint, H200 nodes
  • Benchmark: mmlu_pro via local-completions, num_concurrent ∈ {8, 64}, max_retries=5, request_timeout ∈ {600, 7200} seconds

The exact broken pattern in nvidia_lm_eval-26.3/lm_eval/models/api_models.py

# lines 510-548 (verbatim from installed wheel)
async def get_batched_requests(self, requests: list, cache_keys: list, ...):
    ...
    conn = TCPConnector(limit=self._concurrent, ssl=self.verify_certificate)
    async with ClientSession(
        connector=conn, timeout=ClientTimeout(total=self.timeout)
    ) as session:
        retry_: Callable[..., Awaitable[Any]] = retry(
            stop=stop_after_attempt(self.max_retries),
            wait=wait_exponential(multiplier=0.5, min=1, max=10),
            reraise=True,
        )(self.amodel_call)
        tasks = [asyncio.create_task(retry_(session=session, ...)) for ...]
        return await tqdm_asyncio.gather(*tasks, desc="Requesting API")

A single ClientSession is opened once and asked to service every task for the entire generate_until chunk; for MMLU-Pro that is ~12,032 concurrent tasks queued against TCPConnector(limit=64). Each request's ClientTimeout(total=self.timeout) clock starts the moment the request is issued and keeps running while the task waits for a free connector slot, and because all tasks are created up front, the session-level total behaves like one deadline over the whole batch rather than a per-request budget. It also overrides aiohttp's DEFAULT_TIMEOUT, removing the implicit sock_connect=30 guard (aiohttp/client.py:225). Once the timers fire, every task still waiting on a pooled connector slot surfaces as asyncio.TimeoutError from BaseTimerContext.__exit__, with zero Retrying/API request failed warnings preceding the cascade: the timeout hits inside __aenter__ before any bytes are sent. (Cross-confirmed by aiohttp #2538.)
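
To make the mechanism concrete, here is a self-contained aiohttp sketch (illustrative only, not the harness code; the handler name, port, and timings are invented): three requests share a session whose ClientTimeout(total=4) starts counting the moment each request is issued, so the requests queued behind TCPConnector(limit=1) exhaust their budget before the server has even answered them.

# Illustration of the failure mode, not the harness code. A handler that takes ~3 s,
# a TCPConnector limited to 1 slot, and a session-level ClientTimeout(total=4):
# requests queued behind the pool burn their "total" budget waiting for a slot and
# surface asyncio.TimeoutError before (or just after) any bytes are sent.
import asyncio
from aiohttp import ClientSession, ClientTimeout, TCPConnector, web

async def slow_handler(request: web.Request) -> web.Response:
    await asyncio.sleep(3)  # stands in for a long reasoning-model completion
    return web.json_response({"ok": True})

async def main() -> None:
    app = web.Application()
    app.router.add_get("/", slow_handler)
    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, "127.0.0.1", 8099).start()

    async def fetch(session: ClientSession, i: int) -> None:
        try:
            async with session.get("http://127.0.0.1:8099/") as resp:
                await resp.json()
            print(f"request {i}: ok")
        except asyncio.TimeoutError:
            print(f"request {i}: asyncio.TimeoutError (total budget spent mostly queued for a slot)")

    conn = TCPConnector(limit=1)  # stand-in for limit=self._concurrent
    async with ClientSession(connector=conn, timeout=ClientTimeout(total=4)) as session:
        # request 0 completes in ~3 s; requests 1 and 2 hit total=4 s while queued or barely started
        await asyncio.gather(*(fetch(session, i) for i in range(3)))

    await runner.cleanup()

if __name__ == "__main__":
    asyncio.run(main())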

Failure matrix (5/5 fail on mmlu_pro with reasoning models)

| Attempt | Model class | max_tokens | Parallelism | Wall-clock to crash | Last progress |
| --- | --- | --- | --- | --- | --- |
| 1 | 7-9B reasoning | 32768 | 64 | 10h 01m | 39% (4737/12032) |
| 2 | 70B+ reasoning | 32768 | 64 | 10h 01m | 18% (2192/12032) |
| 3 (resume) | 70B+ reasoning | 8192 | 64 | <1h | 17% (3278 timeouts in log) |
| 4 (resume) | 7-9B reasoning | 8192 | 64 | <1h | 10% (418 timeouts in log) |
| 5 | 7-9B reasoning | 8192 | 8 | 2h 31m | 14% (1626/12032) |

Identical trace fragment in every crash:

File ".../aiohttp/connector.py", line 1251, in _create_connection
    _, proto = await self._create_connection_internal(...)
File ".../aiohttp/client.py", line 632, in _request
    with timer:
File ".../aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError

Non-reasoning models on the same hardware (Qwen3.5-9B, MiniMax-M2.7) completed MMLU-Pro successfully: the failures track how long the shared ClientSession stays alive (reasoning models emit far more tokens per request, so the batch runs for hours), not any property of a particular model.

Suggested fix

Cherry-pick / vendor EleutherAI/lm-evaluation-harness PR #2795 into the nvidia_lm_eval 26.x branch. That PR replaces the shared session with a per-call ClientSession + Semaphore, which addresses exactly the failure mode diagnosed above. If waiting on the upstream merge is preferred, an interim patch could:

  1. Pass timeout=None to ClientSession(...) and put a per-request ClientTimeout(total=..., sock_connect=30, sock_read=...) on the session.post(...) call at api_models.py:443-447, and
  2. Set enable_cleanup_closed=True on the TCPConnector (currently inherits aiohttp's unsafe default of False).

Either of these would unblock long benchmarks for downstream users today; a rough sketch of the interim variant follows.
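
The sketch below is hedged and is not a drop-in diff for api_models.py: post_all, url, payloads, concurrent, and timeout_s are stand-ins for the harness's own attributes (self._concurrent, self.timeout, ...), tenacity retry wiring is omitted, and the semaphore is an optional extra so a request's total clock only starts once it can actually run.

# Sketch of interim suggestions (1) and (2), under the assumptions stated above.
import asyncio
from typing import Any
from aiohttp import ClientSession, ClientTimeout, TCPConnector

async def post_all(url: str, payloads: list[dict], concurrent: int, timeout_s: float) -> list[Any]:
    per_request_timeout = ClientTimeout(
        total=timeout_s,   # (1) budget now applies to each request (and each retry) on its own
        sock_connect=30,   # restores the connect guard lost when total=... replaced DEFAULT_TIMEOUT
    )
    conn = TCPConnector(
        limit=concurrent,
        enable_cleanup_closed=True,  # (2) don't inherit the default of False
    )
    sem = asyncio.Semaphore(concurrent)  # queue in a semaphore, not inside the connector pool

    # No custom session-level total: the per-request timeout below overrides the session default.
    async with ClientSession(connector=conn) as session:
        async def one(payload: dict) -> Any:
            async with sem:  # the total clock only starts once a slot is effectively available
                async with session.post(url, json=payload, timeout=per_request_timeout) as resp:
                    resp.raise_for_status()
                    return await resp.json()

        return await asyncio.gather(*(one(p) for p in payloads))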

Happy to provide more logs / re-run with extra instrumentation if useful. Thanks for maintaining nemo-evaluator!
