vllm_chat / vllm_generate still apply the last request's sampling_params to the whole batch (kwargs leakage across tasks) #1325

### Checklist

- [x] I have searched for similar issues before opening this one.
- [x] I am using the latest version of lmms-eval.

### Bug Description

In upstream EvolvingLMMs-Lab/lmms-eval, vllm_chat and vllm_generate appear to use the last request’s sampling params for the entire batch.

  Files:

  - lmms_eval/models/chat/vllm.py
  - lmms_eval/models/chat/vllm_generate.py

  Pattern:

  - per-request sampling_params are built in a loop
  - loop variable gets overwritten each iteration
  - after loop, one SamplingParams(**sampling_params) is created and used for the full batch
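
A reduced sketch of the pattern described above, not the actual lmms-eval code. `SamplingParams` here is a local stand-in so the snippet runs without vLLM, `build_batch_params_buggy` is a hypothetical name, and the mapping from `max_new_tokens` to `max_tokens` is an assumption about how the kwargs are translated:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingParams:
    """Local stand-in for vllm.SamplingParams (assumption, for illustration)."""
    max_tokens: int = 16
    temperature: float = 0.0


def build_batch_params_buggy(requests):
    """Mirror the reported pattern: one SamplingParams for the whole batch."""
    sampling_params = {}
    for gen_kwargs in requests:
        # Overwritten on every iteration, so only the last request survives.
        sampling_params = {
            "max_tokens": gen_kwargs.get("max_new_tokens", 16),
            "temperature": gen_kwargs.get("temperature", 0.0),
        }
    # Built once, after the loop, and then applied to every request.
    return SamplingParams(**sampling_params)


# Mixed batch: an OCR-style request at 128 tokens, VQA-style requests at 32 and 16.
batch = [{"max_new_tokens": 128}, {"max_new_tokens": 32}, {"max_new_tokens": 16}]
shared = build_batch_params_buggy(batch)
print(shared.max_tokens)  # 16: the 128-token budget of the first request is lost
```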

  This causes cross-task generation kwargs leakage in mixed-task runs (e.g., max_new_tokens from one task affecting another).

  Expected: sampling params should be grouped by compatible kwargs (or enforced homogeneous per batch).
  Actual: final request params are broadcast to all requests in the batch.

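
One possible shape for the expected behavior, as a sketch under assumptions rather than a proposed patch. `SamplingParams` is again a local stand-in, and `to_sampling_params`, `per_request_params`, and `group_by_params` are hypothetical helper names. The first helper keeps one params object per request (usable if the vLLM entry point accepts a list of sampling params aligned with the prompts); the second groups request indices by identical params so each group can be submitted as a homogeneous batch:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingParams:
    """Local stand-in for vllm.SamplingParams (frozen, so it is hashable)."""
    max_tokens: int = 16
    temperature: float = 0.0


def to_sampling_params(gen_kwargs):
    """Translate one request's generation kwargs (assumed key names)."""
    return SamplingParams(
        max_tokens=gen_kwargs.get("max_new_tokens", 16),
        temperature=gen_kwargs.get("temperature", 0.0),
    )


def per_request_params(requests):
    """Option A: one SamplingParams per request, in request order."""
    return [to_sampling_params(gen_kwargs) for gen_kwargs in requests]


def group_by_params(requests):
    """Option B: map each distinct SamplingParams to the request indices using it."""
    groups = defaultdict(list)
    for index, gen_kwargs in enumerate(requests):
        groups[to_sampling_params(gen_kwargs)].append(index)
    return dict(groups)


batch = [{"max_new_tokens": 128}, {"max_new_tokens": 32}, {"max_new_tokens": 128}]
params_list = per_request_params(batch)
groups = group_by_params(batch)
print([p.max_tokens for p in params_list])  # [128, 32, 128]
print(groups[SamplingParams(max_tokens=128)])  # [0, 2]
```

Grouping keeps each generate call homogeneous at the cost of more calls per batch; the per-request list avoids regrouping but depends on the backend accepting per-prompt params.
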
### Steps to Reproduce

1. Run vllm_chat or vllm_generate with multiple tasks that have different generation_kwargs.max_new_tokens (e.g., OCR task at 128 and VQA task at 32/16).
2. Enable --log_samples.
3. Compare outputs of the OCR task in:
   - single-task run
   - mixed-task run
4. Observe systematic shortening/truncation in the mixed-task run and a metric drop.
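
To make the comparison in step 3 less subjective, response lengths from the two runs can be compared directly. This is a minimal sketch that assumes the OCR-task responses have already been extracted from the logged samples into two lists of strings; the log file layout is not assumed here, and `mean_length` and `truncation_ratio` are hypothetical helper names:

```python
from statistics import mean


def mean_length(responses):
    """Average response length in whitespace-separated words."""
    return mean(len(response.split()) for response in responses)


def truncation_ratio(single_task, mixed_task):
    """Ratio of mixed-run to single-run mean length; well below 1.0 suggests truncation."""
    return mean_length(mixed_task) / mean_length(single_task)


# Toy data standing in for responses to the same OCR samples in the two runs.
single_run = ["a b c d", "a b c d e f"]
mixed_run = ["a b", "a b c"]
ratio = truncation_ratio(single_run, mixed_run)
print(ratio)  # 0.5
```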

### Error Message / Traceback

No Python exception/traceback (functional correctness bug).

Symptoms are metric/output regressions in multitask runs consistent with cross-task decode-cap contamination.

### Environment

- OS: Ubuntu 24.04.2 LTS
- Python: 3.12.3
- lmms-eval: 0.5.0
- vllm: 0.19.1.dev3+gb44274e2e.precompiled
- GPU: NVIDIA GH200 120GB
- NVIDIA Driver: 590.48.01
- CUDA (driver-reported): 13.1
- Torch: 2.10.0+cu130
- accelerate: 1.13.0

### Additional Context

Observed regression pattern in real runs:

  - ocrbench_v2 alone: higher score
  - ocrbench_v2 mixed with tasks that use shorter generation budgets: lower score, responses look truncated/shorter

  This is consistent with “last request sampling params win for entire batch”.
