Abnormal GPU Utilization When Evaluating Qwen 2.5 VL 72B

# Low GPU Utilization and Errors During Qwen 2.5 VL 72B Evaluation

First, I'd like to commend the lmms-eval team for developing such a robust evaluation framework - it's been incredibly valuable for multimodal research.

## Problem Description
I am evaluating Qwen 2.5 VL 72B on 8x H100 80G GPUs with the following command:
```bash
accelerate launch --num_processes=1 --main_process_port=12346 -m lmms_eval \
  --model qwen2_5_vl \
  --model_args="pretrained=/mnt/mypath/models/Qwen2.5-VL-72B-Instruct,max_pixels=12845056,attn_implementation=flash_attention_2,interleave_visuals=False,device_map=auto" \
  --tasks super_clevr \
  --output_path /mnt/mypath/qwen72b \
  --batch_size 1 \
  --verbosity DEBUG
```

GPU utilization peaks at only 10%.

<img width="1480" height="483" alt="Image" src="https://github.com/user-attachments/assets/c4bf93a3-a7bc-4834-8eae-ee1062b7f321" />
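One possible explanation, which I have not confirmed by profiling: with `--num_processes=1` and `device_map=auto`, the layers are spread across the 8 GPUs and a single process runs them one after another, so only one GPU is computing at any given moment. Under that assumption (and ignoring image preprocessing and transfer overhead), the average busy fraction per GPU is bounded as sketched below. The helper name `max_avg_utilization` is hypothetical and only serves the illustration.

```python
def max_avg_utilization(num_gpus: int) -> float:
    """Upper bound on the per-GPU busy fraction when the model's stages
    run strictly one at a time across `num_gpus` devices."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be >= 1")
    return 1.0 / num_gpus


# 8 GPUs, as in the setup above.
bound = max_avg_utilization(8)
print(f"{bound:.1%}")  # 12.5%
```

If this reasoning holds, a reading of around 10% is close to what sequential execution across 8 devices can deliver, rather than a sign of a misconfigured driver or kernel.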

## Attempted Solutions and Errors
1. Increasing `num_processes` to 8: results in OOM (out-of-memory) errors, despite using H100 80G GPUs.

This is most likely because [this code](../blob/main/lmms_eval/models/qwen2_5_vl.py#L75) ignores the `device_map` passed in by the user when `num_processes` is greater than 1 and uses `f"cuda:{accelerator.local_process_index}"` instead. As a result, the model is not sharded, and each process tries to place the whole model on a single GPU:

```python
accelerator = Accelerator()
if accelerator.num_processes > 1:
    self._device = torch.device(f"cuda:{accelerator.local_process_index}")
    self.device_map = f"cuda:{accelerator.local_process_index}"
else:
    self._device = torch.device(device)
    self.device_map = device_map if device_map else device
```
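As a rough sanity check on this explanation, the arithmetic below compares the memory needed for the weights alone against one 80G GPU. It assumes a nominal 72e9 parameters stored as 2-byte (bf16/fp16) values; both figures are my assumptions, and activations, the KV cache, and image features would add to them.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


NUM_PARAMS = 72e9      # assumption: nominal "72B" parameter count
BYTES_PER_PARAM = 2    # assumption: bf16/fp16 weights
GPU_MEMORY_GB = 80     # H100 80G, from the setup above
NUM_GPUS = 8

# One full copy per process, as happens when device_map is "cuda:<index>".
full_copy = weight_memory_gb(NUM_PARAMS, BYTES_PER_PARAM)
# One copy split evenly over all GPUs, as device_map=auto roughly does.
per_gpu_if_sharded = full_copy / NUM_GPUS

print(f"full copy on one GPU: {full_copy:.0f} GB, fits: {full_copy <= GPU_MEMORY_GB}")
print(f"sharded over {NUM_GPUS} GPUs: {per_gpu_if_sharded:.0f} GB each, "
      f"fits: {per_gpu_if_sharded <= GPU_MEMORY_GB}")
```

Under these assumptions a full copy needs about 144 GB, which cannot fit on one 80G device, while an even split needs about 18 GB per GPU. That is consistent with `num_processes=8` running out of memory while `num_processes=1` with `device_map=auto` loads.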

2. Increasing `batch_size` above 1: throws a `ValueError`:
```text
ValueError: You are attempting to perform batched generation with padding_side='right' this may lead to unexpected behaviour for Flash Attention version of Qwen2_5_VL.
Make sure to call `tokenizer.padding_side = 'left'` before tokenizing the input
```

After that, I tried changing [this line](../blob/main/lmms_eval/models/qwen2_5_vl.py#L102) in `qwen2_5_vl.py` from `self._tokenizer = AutoTokenizer.from_pretrained(pretrained)` to `self._tokenizer = AutoTokenizer.from_pretrained(pretrained, padding_side="left")`, but it didn't work.
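For reference, here is a minimal, library-free sketch of what the padding side changes for batched generation. The helper `pad_batch`, the pad id `0`, and the toy token ids are hypothetical and only illustrate the idea; this is not the lmms-eval or Transformers implementation.

```python
from typing import List


def pad_batch(seqs: List[List[int]], pad_id: int, side: str) -> List[List[int]]:
    """Pad token-id sequences to a common length on the given side."""
    if side not in ("left", "right"):
        raise ValueError(f"side must be 'left' or 'right', got {side!r}")
    width = max(len(s) for s in seqs)
    padded = []
    for s in seqs:
        fill = [pad_id] * (width - len(s))
        padded.append(fill + s if side == "left" else s + fill)
    return padded


PAD = 0                       # hypothetical pad token id
batch = [[11, 12, 13], [21]]  # toy prompts of different lengths

right = pad_batch(batch, PAD, "right")
left = pad_batch(batch, PAD, "left")

# A decoder-only model continues each row from its last position.
print([row[-1] for row in right])  # [13, 0]: the short prompt ends in padding
print([row[-1] for row in left])   # [13, 21]: every row ends in a real token
```

One guess as to why my change had no effect, which I have not verified: if the batch is actually padded by the processor's own tokenizer rather than by `self._tokenizer`, the setting would need to be applied there instead, for example via `processor.tokenizer.padding_side = "left"`.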

## Request for Assistance
I'd appreciate guidance on any of the following:
1. How to configure multi-GPU evaluation to avoid OOM.
2. How to configure batched evaluation with the correct padding side.
3. Any other recommended parameters for efficiently evaluating this large vision-language model.

I'd be grateful for any insights from the team or community members who have successfully evaluated similar large multimodal models.

Abnormal GPU Utilization When Evaluating Qwen 2.5 VL 72B #738