Summary
The nvidia_lm_eval 26.3 wheel (shipped in the eval-factory/lm-evaluation-harness:26.03 container, and a transitive dependency of nemo-evaluator) ships a known-broken aiohttp session pattern in lm_eval/models/api_models.py:510-548. On long-running benchmarks (e.g. MMLU-Pro, 12,032 prompts) every run aborts with asyncio.TimeoutError from aiohttp/helpers.py:713. EleutherAI fixed this upstream in PR #2795, but that PR is still open / unmerged and the NVIDIA repackage has not picked it up.
We have hit this five times in a row across two endpoints, two max_tokens values, and two parallelism settings (full matrix below). I'm filing here because nvidia_lm_eval does not itself have a public issue tracker (no Project-URL in its METADATA, no GitHub repo we could find), and the lm-evaluation-harness container in this repo is the natural integration point — feel free to reroute if there's a more appropriate channel.
Repro
- lm-evaluation-harness derived package: nvidia_lm_eval==26.3 (*.dist-info/METADATA confirms NVIDIA repackage, MIT license, top-level package still lm_eval)
- Common-utilities package: nemo-evaluator==0.2.7
- HTTP client: aiohttp==3.13.5
- Inference server: vLLM, OpenAI-compatible endpoint, H200 nodes
- Benchmark: mmlu_pro via local-completions, num_concurrent ∈ {8, 64}, max_retries=5, request_timeout ∈ {600, 7200}
The exact broken pattern in nvidia_lm_eval-26.3/lm_eval/models/api_models.py
```python
# lines 510-548 (verbatim from installed wheel, elisions marked with ...)
async def get_batched_requests(self, requests: list, cache_keys: list, ...):
    ...
    conn = TCPConnector(limit=self._concurrent, ssl=self.verify_certificate)
    async with ClientSession(
        connector=conn, timeout=ClientTimeout(total=self.timeout)
    ) as session:
        retry_: Callable[..., Awaitable[Any]] = retry(
            stop=stop_after_attempt(self.max_retries),
            wait=wait_exponential(multiplier=0.5, min=1, max=10),
            reraise=True,
        )(self.amodel_call)
        tasks = [asyncio.create_task(retry_(session=session, ...)) for ...]
        return await tqdm_asyncio.gather(*tasks, desc="Requesting API")
```
A single ClientSession is opened once and asked to service all tasks for the entire generate_until chunk — for MMLU-Pro that's ~12,032 concurrent tasks against TCPConnector(limit=64). The session-level ClientTimeout(total=self.timeout) is a single budget across the whole batch, not per-request. It also overrides aiohttp's DEFAULT_TIMEOUT, removing the implicit sock_connect=30 guard (aiohttp/client.py:225). Once the timer fires, every task waiting on a pooled connector slot surfaces as asyncio.TimeoutError from BaseTimerContext.__exit__, with zero Retrying/API request failed warnings preceding the cascade — the timeout hits inside __aenter__ before any bytes are sent. (Cross-confirmed by aiohttp #2538.)
Failure matrix (5/5 fail on mmlu_pro with reasoning models)
| Attempt | model class | max_tokens | parallelism | wall-clock to crash | last progress |
|---|---|---|---|---|---|
| 1 | 7-9b reasoning | 32768 | 64 | 10h 01m | 39% (4737/12032) |
| 2 | 70b+ reasoning | 32768 | 64 | 10h 01m | 18% (2192/12032) |
| 3 (resume) | 70b+ reasoning | 8192 | 64 | <1h | 17% (3,278 timeouts in log) |
| 4 (resume) | 7-9b reasoning | 8192 | 64 | <1h | 10% (418 timeouts in log) |
| 5 | 7-9b reasoning | 8192 | 8 | 2h 31m | 14% (1626/12032) |
Identical trace fragment in every crash:
```
  File ".../aiohttp/connector.py", line 1251, in _create_connection
    _, proto = await self._create_connection_internal(...)
  File ".../aiohttp/client.py", line 632, in _request
    with timer:
  File ".../aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError
```
Non-reasoning models on the same hardware (Qwen3.5-9B, MiniMax-M2.7) completed MMLU-Pro successfully — the bug is correlated with how long the shared ClientSession stays alive, not with model correctness.
Suggested fix
Cherry-pick / vendor EleutherAI/lm-evaluation-harness PR #2795 into the nvidia_lm_eval 26.x branch. The PR replaces the shared session with a per-call ClientSession + Semaphore, which is exactly the diagnosis here. If waiting on the upstream merge is preferred, an interim patch could:
- Pass timeout=None to ClientSession(...) and put a per-request ClientTimeout(total=..., sock_connect=30, sock_read=...) on the session.post(...) call at api_models.py:443-447, and
- Set enable_cleanup_closed=True on the TCPConnector (currently inherits aiohttp's unsafe default of False).
Either of these would unblock long benchmarks for downstream users today.
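For concreteness, the interim patch could look roughly like the sketch below. This is not the actual nvidia_lm_eval code: function and parameter names are illustrative, and it uses ClientTimeout(total=None) at the session level (rather than timeout=None) to disable the shared budget, since that is the spelling aiohttp documents for "no total timeout":

```python
# Hypothetical sketch of the interim fix: no session-wide total timeout,
# a per-request ClientTimeout on each post() call instead.
import asyncio
from aiohttp import ClientSession, ClientTimeout, TCPConnector

async def amodel_call(session: ClientSession, url: str, payload: dict,
                      request_timeout: float) -> dict:
    # Per-request budget: this clock starts when *this* request starts,
    # not when the whole batch was scheduled.
    per_request = ClientTimeout(total=request_timeout, sock_connect=30)
    async with session.post(url, json=payload, timeout=per_request) as resp:
        resp.raise_for_status()
        return await resp.json()

async def get_batched_requests(url: str, payloads: list[dict],
                               concurrent: int,
                               request_timeout: float) -> list[dict]:
    # (optionally also enable_cleanup_closed=True here, per the note above)
    conn = TCPConnector(limit=concurrent)
    # total=None removes the shared budget; only the per-request
    # timeouts passed to post() above still apply.
    async with ClientSession(connector=conn,
                             timeout=ClientTimeout(total=None)) as session:
        tasks = [asyncio.create_task(
            amodel_call(session, url, p, request_timeout))
            for p in payloads]
        return await asyncio.gather(*tasks)
```

Tasks still queue for connector slots, but a task that waits a long time for a slot no longer inherits a budget that started ticking when the batch was created.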
Happy to provide more logs / re-run with extra instrumentation if useful. Thanks for maintaining nemo-evaluator!