vllm_chat / vllm_generate still apply last request sampling_params to whole batch (kwargs leakage across tasks) #1325

@Anunay-Yadav

Checklist

  • I have searched for similar issues before opening this one.
  • I am using the latest version of lmms-eval.

Bug Description

In upstream EvolvingLMMs-Lab/lmms-eval, vllm_chat and vllm_generate appear to use the last request’s sampling params for the entire batch.

Files:

  • lmms_eval/models/chat/vllm.py
  • lmms_eval/models/chat/vllm_generate.py

Pattern:

  • per-request sampling_params are built in a loop
  • loop variable gets overwritten each iteration
  • after loop, one SamplingParams(**sampling_params) is created and used for the full batch

This causes cross-task generation kwargs leakage in mixed-task runs (e.g., max_new_tokens from one task affecting another).
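The pattern above can be illustrated with a minimal, vLLM-free sketch. Plain dicts stand in for requests and for the kwargs that would be passed to SamplingParams; the request layout, the function names, and the max_new_tokens to max_tokens mapping are assumptions made for illustration, not code copied from lmms-eval.

```python
# Hypothetical stand-ins for per-request generation kwargs in a mixed-task batch.
requests = [
    {"task": "ocr", "gen_kwargs": {"max_new_tokens": 128}},
    {"task": "vqa", "gen_kwargs": {"max_new_tokens": 32}},
    {"task": "vqa_short", "gen_kwargs": {"max_new_tokens": 16}},
]


def build_params_buggy(batch):
    """Mimic the reported pattern: the loop variable is overwritten each iteration."""
    sampling_params = {}
    for req in batch:
        # Each iteration replaces the previous request's params.
        sampling_params = {"max_tokens": req["gen_kwargs"]["max_new_tokens"]}
    # Only the last request's params survive and are broadcast to the whole batch.
    return [sampling_params for _ in batch]


def build_params_per_request(batch):
    """Keep one params dict per request instead of reusing the last one."""
    return [{"max_tokens": req["gen_kwargs"]["max_new_tokens"]} for req in batch]


buggy = build_params_buggy(requests)
per_request = build_params_per_request(requests)

print([p["max_tokens"] for p in buggy])        # [16, 16, 16]
print([p["max_tokens"] for p in per_request])  # [128, 32, 16]
```

In the buggy variant the OCR request is capped at 16 tokens instead of 128, which matches the truncation described in this report. One possible direction for a fix, if it suits the model wrappers, is to pass a list of SamplingParams (one per prompt) to vLLM rather than a single shared object.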

Expected: sampling params should be grouped by compatible kwargs (or enforced homogeneous per batch).
Actual: final request params are broadcast to all requests in the batch.
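One way to get the expected behavior is to split each batch into sub-batches whose generation kwargs are identical, then issue one generate call per sub-batch. The sketch below shows only the grouping step; the request layout and the helper name group_by_gen_kwargs are hypothetical, and it assumes the kwargs are JSON-serializable.

```python
import json
from collections import defaultdict


def group_by_gen_kwargs(batch):
    """Group request indices by identical generation kwargs.

    Returns a list of (gen_kwargs, indices) pairs, ordered by first appearance.
    """
    groups = defaultdict(list)
    for index, req in enumerate(batch):
        # A sorted JSON dump gives a stable, hashable key for a kwargs dict.
        key = json.dumps(req["gen_kwargs"], sort_keys=True)
        groups[key].append(index)
    return [(json.loads(key), indices) for key, indices in groups.items()]


# Hypothetical mixed-task batch.
mixed_batch = [
    {"task": "ocr", "gen_kwargs": {"max_new_tokens": 128}},
    {"task": "vqa", "gen_kwargs": {"max_new_tokens": 32}},
    {"task": "ocr", "gen_kwargs": {"max_new_tokens": 128}},
    {"task": "vqa_short", "gen_kwargs": {"max_new_tokens": 16}},
]

grouped = group_by_gen_kwargs(mixed_batch)
for gen_kwargs, indices in grouped:
    # Each group could be sent to the engine with its own sampling params,
    # and the outputs written back to the original positions via `indices`.
    print(gen_kwargs, indices)
```

Keeping the original indices makes it possible to restore request order after the per-group calls. Enforcing homogeneous batches up front would be the alternative: assert that this function returns exactly one group.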

Steps to Reproduce

  1. Run vllm_chat or vllm_generate with multiple tasks that have different generation_kwargs.max_new_tokens (e.g., OCR task at 128 and VQA task at 32/16).
  2. Enable --log_samples.
  3. Compare outputs of the OCR task in:
      - single-task run
      - mixed-task run
  4. Observe systematic shortening/truncation in mixed-task run and metric drop.
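For step 4, a rough length comparison can make the truncation visible. The sketch below assumes the OCR task's response strings have already been extracted from the two logged-sample files and aligned by document; the loading step is omitted because the log layout is not shown in this report. The helper name length_report and the example strings are placeholders, not real outputs.

```python
from statistics import mean


def length_report(single_run, mixed_run):
    """Compare whitespace-token lengths of aligned responses from two runs."""
    single_len = [len(resp.split()) for resp in single_run]
    mixed_len = [len(resp.split()) for resp in mixed_run]
    # Count documents whose mixed-run response is shorter than the single-run one.
    n_shorter = sum(m < s for s, m in zip(single_len, mixed_len))
    return {
        "mean_single": mean(single_len),
        "mean_mixed": mean(mixed_len),
        "n_shorter": n_shorter,
    }


# Placeholder strings for illustration only.
single_task_responses = ["total amount due 42 dollars", "invoice number A 17"]
mixed_task_responses = ["total amount", "invoice number A 17"]

report = length_report(single_task_responses, mixed_task_responses)
print(report)
```

A large share of shorter responses in the mixed run, together with a mean length close to the smaller max_new_tokens budget, would be consistent with the leakage described here.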

Error Message / Traceback

No Python exception/traceback (functional correctness bug).

Symptoms are metric/output regressions in multitask runs consistent with cross-task decode-cap contamination.

Environment

  • OS: Ubuntu 24.04.2 LTS
  • Python: 3.12.3
  • lmms-eval: 0.5.0
  • vllm: 0.19.1.dev3+gb44274e2e.precompiled
  • GPU: NVIDIA GH200 120GB
  • NVIDIA Driver: 590.48.01
  • CUDA (driver-reported): 13.1
  • Torch: 2.10.0+cu130
  • accelerate: 1.13.0

Additional Context

Observed regression pattern in real runs:

  • ocrbench_v2 alone: higher score
  • ocrbench_v2 mixed with tasks that use shorter generation budgets: lower score, responses look truncated/shorter

This is consistent with “last request sampling params win for entire batch”.

Labels: bug
