
lm-evaluation-harness container: shared aiohttp ClientSession with ClientTimeout(total=...) aborts long benchmark runs (e.g. MMLU-Pro 12k prompts) #974

@shauryr

Description


Summary

The nvidia_lm_eval 26.3 wheel (shipped in the eval-factory/lm-evaluation-harness:26.03 container, and a transitive dependency of nemo-evaluator) contains a known-broken aiohttp session pattern in lm_eval/models/api_models.py:510-548. On long-running benchmarks (e.g. MMLU-Pro, 12,032 prompts) every run aborts with asyncio.TimeoutError raised from aiohttp/helpers.py:713. EleutherAI has an upstream fix in PR #2795, but that PR is still open/unmerged and the NVIDIA repackage has not picked it up.

We have hit this five times in a row across two endpoints, two max_tokens values, and two parallelism settings (full matrix below). I'm filing here because nvidia_lm_eval does not itself have a public issue tracker (no Project-URL in its METADATA, no GitHub repo we could find), and the lm-evaluation-harness container in this repo is the natural integration point — feel free to reroute if there's a more appropriate channel.

Repro

  • lm-evaluation-harness derived package: nvidia_lm_eval==26.3 (*.dist-info/METADATA confirms NVIDIA repackage, MIT, top-level package still lm_eval)
  • Common-utilities package: nemo-evaluator==0.2.7
  • HTTP client: aiohttp==3.13.5
  • Inference server: vLLM, OpenAI-compatible endpoint, H200 nodes
  • Benchmark: mmlu_pro via local-completions, num_concurrent ∈ {8, 64}, max_retries=5, request_timeout ∈ {600, 7200} seconds

The exact broken pattern in nvidia_lm_eval-26.3/lm_eval/models/api_models.py

# lines 510-548 (verbatim from installed wheel)
async def get_batched_requests(self, requests: list, cache_keys: list, ...):
    ...
    conn = TCPConnector(limit=self._concurrent, ssl=self.verify_certificate)
    async with ClientSession(
        connector=conn, timeout=ClientTimeout(total=self.timeout)
    ) as session:
        retry_: Callable[..., Awaitable[Any]] = retry(
            stop=stop_after_attempt(self.max_retries),
            wait=wait_exponential(multiplier=0.5, min=1, max=10),
            reraise=True,
        )(self.amodel_call)
        tasks = [asyncio.create_task(retry_(session=session, ...)) for ...]
        return await tqdm_asyncio.gather(*tasks, desc="Requesting API")

A single ClientSession is opened once and asked to service every task for the entire generate_until chunk; for MMLU-Pro that is ~12,032 concurrent tasks queued against TCPConnector(limit=64). Each request's ClientTimeout(total=self.timeout) clock starts the moment the request is issued and keeps running while the task waits for a free connector slot, and because all tasks are created up front, the session-level total behaves like one deadline over the whole batch rather than a per-request budget. It also overrides aiohttp's DEFAULT_TIMEOUT, removing the implicit sock_connect=30 guard (aiohttp/client.py:225). Once the timers fire, every task still waiting on a pooled connector slot surfaces as asyncio.TimeoutError from BaseTimerContext.__exit__, with zero Retrying/API request failed warnings preceding the cascade: the timeout hits inside __aenter__ before any bytes are sent. (Cross-confirmed by aiohttp #2538.)
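
To make the mechanism concrete, here is a self-contained aiohttp sketch (illustrative only, not the harness code; the handler name, port, and timings are invented): three requests share a session whose ClientTimeout(total=4) starts counting the moment each request is issued, so the requests queued behind TCPConnector(limit=1) exhaust their budget before the server has even answered them.

# Illustration of the failure mode, not the harness code. A handler that takes ~3 s,
# a TCPConnector limited to 1 slot, and a session-level ClientTimeout(total=4):
# requests queued behind the pool burn their "total" budget waiting for a slot and
# surface asyncio.TimeoutError before (or just after) any bytes are sent.
import asyncio
from aiohttp import ClientSession, ClientTimeout, TCPConnector, web

async def slow_handler(request: web.Request) -> web.Response:
    await asyncio.sleep(3)  # stands in for a long reasoning-model completion
    return web.json_response({"ok": True})

async def main() -> None:
    app = web.Application()
    app.router.add_get("/", slow_handler)
    runner = web.AppRunner(app)
    await runner.setup()
    await web.TCPSite(runner, "127.0.0.1", 8099).start()

    async def fetch(session: ClientSession, i: int) -> None:
        try:
            async with session.get("http://127.0.0.1:8099/") as resp:
                await resp.json()
            print(f"request {i}: ok")
        except asyncio.TimeoutError:
            print(f"request {i}: asyncio.TimeoutError (total budget spent mostly queued for a slot)")

    conn = TCPConnector(limit=1)  # stand-in for limit=self._concurrent
    async with ClientSession(connector=conn, timeout=ClientTimeout(total=4)) as session:
        # request 0 completes in ~3 s; requests 1 and 2 hit total=4 s while queued or barely started
        await asyncio.gather(*(fetch(session, i) for i in range(3)))

    await runner.cleanup()

if __name__ == "__main__":
    asyncio.run(main())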

Failure matrix (5/5 fail on mmlu_pro with reasoning models)

| Attempt | Model class | max_tokens | Parallelism | Wall-clock to crash | Last progress |
| --- | --- | --- | --- | --- | --- |
| 1 | 7-9B reasoning | 32768 | 64 | 10h 01m | 39% (4737/12032) |
| 2 | 70B+ reasoning | 32768 | 64 | 10h 01m | 18% (2192/12032) |
| 3 (resume) | 70B+ reasoning | 8192 | 64 | <1h | 17% (3278 timeouts in log) |
| 4 (resume) | 7-9B reasoning | 8192 | 64 | <1h | 10% (418 timeouts in log) |
| 5 | 7-9B reasoning | 8192 | 8 | 2h 31m | 14% (1626/12032) |

Identical trace fragment in every crash:

File ".../aiohttp/connector.py", line 1251, in _create_connection
    _, proto = await self._create_connection_internal(...)
File ".../aiohttp/client.py", line 632, in _request
    with timer:
File ".../aiohttp/helpers.py", line 713, in __exit__
    raise asyncio.TimeoutError from exc_val
TimeoutError

Non-reasoning models on the same hardware (Qwen3.5-9B, MiniMax-M2.7) completed MMLU-Pro successfully: the failures track how long the shared ClientSession stays alive (reasoning models emit far more tokens per request, so the batch runs for hours), not any property of a particular model.

Suggested fix

Cherry-pick / vendor EleutherAI/lm-evaluation-harness PR #2795 into the nvidia_lm_eval 26.x branch. That PR replaces the shared session with a per-call ClientSession + Semaphore, which addresses exactly the failure mode diagnosed above. If waiting on the upstream merge is preferred, an interim patch could:

  1. Pass timeout=None to ClientSession(...) and put a per-request ClientTimeout(total=..., sock_connect=30, sock_read=...) on the session.post(...) call at api_models.py:443-447, and
  2. Set enable_cleanup_closed=True on the TCPConnector (currently inherits aiohttp's unsafe default of False).

Either of these would unblock long benchmarks for downstream users today; a rough sketch of the interim variant follows.
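
The sketch below is hedged and is not a drop-in diff for api_models.py: post_all, url, payloads, concurrent, and timeout_s are stand-ins for the harness's own attributes (self._concurrent, self.timeout, ...), tenacity retry wiring is omitted, and the semaphore is an optional extra so a request's total clock only starts once it can actually run.

# Sketch of interim suggestions (1) and (2), under the assumptions stated above.
import asyncio
from typing import Any
from aiohttp import ClientSession, ClientTimeout, TCPConnector

async def post_all(url: str, payloads: list[dict], concurrent: int, timeout_s: float) -> list[Any]:
    per_request_timeout = ClientTimeout(
        total=timeout_s,   # (1) budget now applies to each request (and each retry) on its own
        sock_connect=30,   # restores the connect guard lost when total=... replaced DEFAULT_TIMEOUT
    )
    conn = TCPConnector(
        limit=concurrent,
        enable_cleanup_closed=True,  # (2) don't inherit the default of False
    )
    sem = asyncio.Semaphore(concurrent)  # queue in a semaphore, not inside the connector pool

    # No custom session-level total: the per-request timeout below overrides the session default.
    async with ClientSession(connector=conn) as session:
        async def one(payload: dict) -> Any:
            async with sem:  # the total clock only starts once a slot is effectively available
                async with session.post(url, json=payload, timeout=per_request_timeout) as resp:
                    resp.raise_for_status()
                    return await resp.json()

        return await asyncio.gather(*(one(p) for p in payloads))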

Happy to provide more logs / re-run with extra instrumentation if useful. Thanks for maintaining nemo-evaluator!
