Checklist
Bug Description
In upstream EvolvingLMMs-Lab/lmms-eval, vllm_chat and vllm_generate appear to use the last request’s sampling params for the entire batch.
Files:
- lmms_eval/models/chat/vllm.py
- lmms_eval/models/chat/vllm_generate.py
Pattern:
- per-request sampling_params are built in a loop
- loop variable gets overwritten each iteration
- after loop, one SamplingParams(**sampling_params) is created and used for the full batch
This leaks generation kwargs across tasks in mixed-task runs (e.g., max_new_tokens from one task capping the output of another).
Expected: requests should be grouped by compatible sampling kwargs (or each batch should be enforced to be homogeneous).
Actual: final request params are broadcast to all requests in the batch.
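A minimal sketch of the reported pattern and of one possible grouping fix. This is not the code from the files listed above: the `SamplingParams` dataclass is a stand-in so the example runs without vLLM, and `build_params_buggy`, `group_by_params`, and the request dict layout are hypothetical names used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingParams:
    """Stand-in for vllm.SamplingParams (assumption: only two fields matter here)."""
    max_tokens: int = 16
    temperature: float = 0.0


def build_params_buggy(requests):
    """Reported pattern: the dict is overwritten each iteration, so the last request wins."""
    sampling_params = {}
    for req in requests:
        sampling_params = {"max_tokens": req["max_new_tokens"], "temperature": 0.0}
    shared = SamplingParams(**sampling_params)  # built once, after the loop
    return [shared for _ in requests]           # broadcast to the whole batch


def group_by_params(requests):
    """Possible fix: bucket request indices by their own (hashable) sampling params."""
    groups = defaultdict(list)
    for idx, req in enumerate(requests):
        params = SamplingParams(max_tokens=req["max_new_tokens"], temperature=0.0)
        groups[params].append(idx)
    return dict(groups)


# Hypothetical mixed-task batch: one OCR request and two VQA requests.
requests = [
    {"task": "ocr", "max_new_tokens": 128},
    {"task": "vqa", "max_new_tokens": 32},
    {"task": "vqa", "max_new_tokens": 16},
]

buggy = build_params_buggy(requests)
grouped = group_by_params(requests)

print([p.max_tokens for p in buggy])                      # every request gets the last cap
print({p.max_tokens: idxs for p, idxs in grouped.items()})  # each cap keeps its own requests
```

Each group could then be sent to the engine as its own sub-batch and the outputs scattered back by index. As far as I know, vLLM's `LLM.generate` also accepts a list of `SamplingParams` (one per prompt), which may be a simpler alternative to grouping; I have not verified that against the version listed under Environment.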
Steps to Reproduce
1. Run vllm_chat or vllm_generate with multiple tasks that have different generation_kwargs.max_new_tokens (e.g., OCR task at 128 and VQA task at 32/16).
2. Enable --log_samples.
3. Compare outputs of the OCR task in:
- single-task run
- mixed-task run
4. Observe systematically shorter or truncated outputs and a metric drop in the mixed-task run.
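The comparison in steps 3 and 4 can be roughly quantified with a helper like the one below. It is only a sketch: it assumes the OCR task's responses from both runs have already been loaded into plain lists of strings (the layout of the --log_samples output is not assumed here), and `mean_length` and `truncation_ratio` are hypothetical names.

```python
from statistics import mean


def mean_length(responses):
    """Average whitespace-separated token count per response (0.0 for an empty list)."""
    return mean(len(r.split()) for r in responses) if responses else 0.0


def truncation_ratio(single_run, mixed_run):
    """Mean response length in the mixed run relative to the single-task run.

    A value well below 1.0 suggests the mixed run is producing shorter outputs.
    """
    base = mean_length(single_run)
    return mean_length(mixed_run) / base if base else float("nan")


# Toy data standing in for responses of the same task from the two runs.
single_run = ["total amount due 42.00 USD", "invoice number 12345 dated today"]
mixed_run = ["total amount", "invoice number"]

ratio = truncation_ratio(single_run, mixed_run)
print(f"mixed/single mean length ratio: {ratio:.2f}")
```

Whitespace token counts are only a proxy for the decode cap; comparing tokenizer-level lengths against each task's max_new_tokens would be a stricter check.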
Error Message / Traceback
No Python exception/traceback (functional correctness bug).
Symptoms are metric/output regressions in multitask runs consistent with cross-task decode-cap contamination.
Environment
- OS: Ubuntu 24.04.2 LTS
- Python: 3.12.3
- lmms-eval: 0.5.0
- vllm: 0.19.1.dev3+gb44274e2e.precompiled
- GPU: NVIDIA GH200 120GB
- NVIDIA Driver: 590.48.01
- CUDA (driver-reported): 13.1
- Torch: 2.10.0+cu130
- accelerate: 1.13.0
Additional Context
Observed regression pattern in real runs:
- ocrbench_v2 alone: higher score
- ocrbench_v2 mixed with tasks that use shorter generation budgets: lower score, responses look truncated/shorter
This is consistent with “last request sampling params win for entire batch”.