Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
123 changes: 123 additions & 0 deletions docs/throughput_metrics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
# Throughput Metrics Documentation

This document describes the inference throughput metrics automatically logged by LMMS-Eval chat models during evaluation.

## Overview

LMMS-Eval chat models automatically log detailed timing metrics during inference to help users understand model performance characteristics. These metrics are logged at the INFO level and provide insights into end-to-end latency, token generation speed, and other performance indicators.

## Metrics Explained

### Core Timing Metrics

- **E2E (End-to-End Latency)**: Total time from request submission to response completion (in seconds)
- **TTFT (Time to First Token)**: Time from request submission until the first token is generated (in seconds)
- **TPOT (Time Per Output Token)**: Average time to generate each output token after the first (in seconds)
- **Speed (Inference Speed)**: Token generation rate calculated as 1/TPOT (tokens per second)
- **Output Tokens**: Number of tokens generated in the response

### Batch Metrics

For models that process multiple requests in batches:

- **Batch Summary**: Aggregated metrics across all outputs in a batch
- **Total Time**: Total batch processing time
- **Total Tokens**: Sum of all output tokens in the batch
- **Avg Speed**: Average throughput across the entire batch (tokens/s)

## Log Format Examples

### Individual Output Metrics
```
Output 0 - E2E: 2.145s, TTFT: 0.215s, TPOT: 0.048s, Speed: 20.8 tokens/s, Output tokens: 42
```

### Batch Summary Metrics
```
Batch summary - Total time: 2.145s, Total tokens: 128, Avg speed: 59.7 tokens/s
```

### Single Request Metrics (Non-batched)
```
Inference metrics - E2E: 1.823s, TTFT: 0.182s, TPOT: 0.052s, Speed: 19.2 tokens/s, Output tokens: 32
```

## Supported Models

The following chat models automatically log throughput metrics:

- **sglang_runtime** (`/lmms_eval/models/chat/sglang.py`)
- **vllm_chat** (`/lmms_eval/models/chat/vllm.py`)
- **llava_hf_chat** (`/lmms_eval/models/chat/llava_hf.py`)
- **openai_compatible_chat** (`/lmms_eval/models/chat/openai_compatible.py`)
- **qwen2_5_vl_chat** (`/lmms_eval/models/chat/qwen2_5_vl.py`)
- **huggingface_chat** (`/lmms_eval/models/chat/huggingface.py`)

## Usage

Throughput metrics are automatically logged during evaluation - no additional configuration is required. To view the metrics:

1. **Command Line Output**: Metrics appear in real-time during evaluation
2. **Log Files**: Metrics are written to log files if logging is configured
3. **Log Level**: Ensure logging level is set to INFO or lower to see metrics

### Example Evaluation Command
```bash
python -m lmms_eval \
--model sglang_runtime \
--model_args model=Qwen/Qwen2.5-VL-3B-Instruct \
--tasks mme \
--batch_size 4 \
--log_samples \
--output_path ./results
```

## Metric Calculation Details

### TTFT Calculation
- **Available from model**: Uses actual first token timestamp when provided
- **Estimated**: When unavailable, estimated as 10% of total inference time

### TPOT Calculation
- **Formula**: `(E2E_latency - TTFT) / (output_tokens - 1)`
- **Single token responses**: Uses full E2E latency as TPOT
- **Batch processing**: Divides total batch time by number of outputs

### Speed Calculation
- **Formula**: `1 / TPOT` (when TPOT > 0)
- **Edge cases**: Set to 0 for single-token responses or zero TPOT

## Performance Analysis

### Interpreting Metrics

- **High TTFT**: May indicate model loading, prompt processing, or scheduling delays
- **High TPOT**: Suggests slower token generation, possibly due to model size or hardware limitations
- **Low Speed**: Indicates throughput bottlenecks in token generation
- **E2E vs TTFT+TPOT**: Large differences may suggest batching overhead or system delays

### Optimization Insights

- **Reduce TTFT**: Optimize prompt processing, use model caching, improve scheduling
- **Reduce TPOT**: Use faster hardware, optimize model inference, adjust batch sizes
- **Batch Efficiency**: Compare individual vs batch metrics to assess batching benefits

## Troubleshooting

### Missing Metrics
- Ensure model supports throughput logging (see supported models list)
- Check log level is set to INFO or lower
- Verify model implementation includes timing instrumentation

### Inaccurate Metrics
- TTFT estimates may be imprecise when actual timing unavailable
- Batch metrics average across multiple outputs, individual variance not captured
- Network latency may affect metrics for API-based models

## Implementation Notes

Throughput metrics are implemented consistently across chat models using:
- `time.time()` for wall-clock timing measurements
- Model-specific metadata when available (e.g., SGLang, VLLM native metrics)
- Fallback estimation methods for missing timing data
- Structured logging format for consistent parsing and analysis
2 changes: 1 addition & 1 deletion examples/models/vllm_qwen2vl.sh
Original file line number Diff line number Diff line change
Expand Up @@ -10,7 +10,7 @@ export NCCL_DEBUG=DEBUG

python3 -m lmms_eval \
--model vllm \
--model_args model_version=Qwen/Qwen2-VL-7B-Instruct,tensor_parallel_size=4 \
--model_args model=Qwen/Qwen2-VL-7B-Instruct,tensor_parallel_size=4 \
--tasks mme,gsm8k_cot_self_consistency,mmmu_val \
--batch_size 64 \
--log_samples \
Expand Down
19 changes: 19 additions & 0 deletions lmms_eval/models/chat/huggingface.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
import base64
import re
import time
from io import BytesIO
from typing import List, Optional, Tuple, Union

Expand All @@ -16,6 +17,7 @@
from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms
from lmms_eval.api.registry import register_model
from lmms_eval.models.model_utils.gen_metrics import log_metrics
from lmms_eval.models.model_utils.reasoning_model_utils import (
parse_reasoning_model_answer,
)
Expand Down Expand Up @@ -188,6 +190,8 @@ def _collate(x):
chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None)
num_iters = len(requests) // self.batch_size if len(requests) % self.batch_size == 0 else len(requests) // self.batch_size + 1
pbar = tqdm(total=num_iters, disable=(self.rank != 0), desc="Model Responding")
e2e_latency = 0
total_tokens = 0
for chunk in chunks:
ctx, doc_to_messages, all_gen_kwargs, doc_id, task, split = zip(*chunk)
chat_messages = [doc_to_messages[0](self.task_dict[task][split][ids]) for ids, task, split in zip(doc_id, task, split)]
Expand Down Expand Up @@ -242,6 +246,7 @@ def _collate(x):
current_gen_kwargs["temperature"] = None
current_gen_kwargs["top_p"] = None

start_time = time.time()
cont = self.model.generate(
**inputs,
eos_token_id=self.tokenizer.eos_token_id,
Expand All @@ -253,10 +258,15 @@ def _collate(x):
max_new_tokens=current_gen_kwargs["max_new_tokens"],
use_cache=self.use_cache,
)
end_time = time.time()

generated_ids_trimmed = [out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, cont)]
answers = self.processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)

# Calculate timing metrics for batch
e2e_latency += end_time - start_time
total_tokens += sum(len(ids) for ids in generated_ids_trimmed)

for ans, context in zip(answers, texts):
clean_ans = parse_reasoning_model_answer(ans)
res.append(clean_ans)
Expand All @@ -268,6 +278,15 @@ def _collate(x):
eval_logger.debug(f"Model Clean Response: {clean_ans}")
# reorder this group of results back to original unsorted form
res = re_ords.get_original(res)
# Calculate average speed
avg_speed = total_tokens / e2e_latency if e2e_latency > 0 else 0
# Log metrics
metric_dict = {
"total_tokens": total_tokens,
"e2e_latency": e2e_latency,
"avg_speed": avg_speed,
}
log_metrics(**metric_dict)

pbar.close()
return res
Expand Down
39 changes: 26 additions & 13 deletions lmms_eval/models/chat/llava_hf.py
Original file line number Diff line number Diff line change
@@ -1,24 +1,14 @@
import time
import warnings
from typing import List, Optional, Tuple, Union

import numpy as np
import PIL
import torch
from accelerate import Accelerator, DistributedType
from accelerate.state import AcceleratorState
from decord import VideoReader, cpu
from tqdm import tqdm
from transformers import (
AutoConfig,
AutoProcessor,
LlavaForConditionalGeneration,
LlavaNextForConditionalGeneration,
)

from lmms_eval import utils
from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms
from lmms_eval.api.registry import register_model
from lmms_eval.models.model_utils.gen_metrics import log_metrics
from lmms_eval.protocol import ChatMessages

warnings.filterwarnings("ignore")
Expand Down Expand Up @@ -53,6 +43,8 @@ def _collate(x):
chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None)
num_iters = len(requests) // self.batch_size if len(requests) % self.batch_size == 0 else len(requests) // self.batch_size + 1
pbar = tqdm(total=num_iters, disable=(self.rank != 0), desc="Model Responding")
e2e_latency = 0
total_tokens = 0
for chunk in chunks:
ctx, doc_to_messages, all_gen_kwargs, doc_id, task, split = zip(*chunk)
task = task[0]
Expand Down Expand Up @@ -94,6 +86,7 @@ def _collate(x):
gen_kwargs["num_beams"] = 1
do_sample = True if gen_kwargs["temperature"] > 0 else False
try:
start_time = time.time()
cont = self.model.generate(
**inputs,
do_sample=do_sample,
Expand All @@ -105,11 +98,21 @@ def _collate(x):
pad_token_id=self.eot_token_id,
eos_token_id=self.eot_token_id,
)
end_time = time.time()
cont = cont[:, inputs["input_ids"].shape[-1] :]

# Calculate timing metrics
e2e_latency += end_time - start_time
total_tokens += cont.shape[-1] if len(cont.shape) > 1 else len(cont)

except Exception as e:
eval_logger.error(f"Error {e} in generating")
cont = ""
text_outputs = self.tokenizer.batch_decode(cont, skip_special_tokens=True)[0]
e2e_latency += 0
total_tokens += 0

text_outputs = self.tokenizer.batch_decode(cont, skip_special_tokens=True)[0] if cont != "" else ""

if self.accelerator.is_main_process and doc_id[0] % 100 == 0:
eval_logger.debug(f"Generated text for doc ID {doc_id[0]}:\n\n{text_outputs}\n")

Expand All @@ -119,5 +122,15 @@ def _collate(x):
# reorder this group of results back to original unsorted form
res = re_ords.get_original(res)

metric_dict = {
"total_tokens": total_tokens,
"e2e_latency": e2e_latency,
"avg_speed": total_tokens / e2e_latency if e2e_latency > 0 else 0,
"additional_metrics": {
"rank": self.rank,
},
}
log_metrics(**metric_dict)

pbar.close()
return res
27 changes: 27 additions & 0 deletions lmms_eval/models/chat/openai_compatible.py
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
from openai import AzureOpenAI, OpenAI
from PIL import Image

from lmms_eval.models.model_utils.gen_metrics import log_metrics
from lmms_eval.models.simple.openai_compatible import (
OpenAICompatible as OpenAICompatibleSimple,
)
Expand All @@ -40,6 +41,8 @@ def generate_until(self, requests) -> List[str]:
res = []
pbar = tqdm(total=len(requests), disable=(self.rank != 0), desc="Model Responding")

e2e_latency = 0
total_tokens = 0
for ctx, doc_to_messages, gen_kwargs, doc_id, task, split in [reg.args for reg in requests]:
if self.continual_mode is True and self.cache_mode == "resume":
doc_uuid = f"{task}___{split}___{doc_id}"
Expand Down Expand Up @@ -81,8 +84,22 @@ def generate_until(self, requests) -> List[str]:

for attempt in range(self.max_retries):
try:
start_time = time.time()
response = self.client.chat.completions.create(**payload)
end_time = time.time()

response_text = response.choices[0].message.content

# Calculate timing metrics
e2e_latency += end_time - start_time

# Get token counts from response if available
if hasattr(response, "usage"):
total_tokens += response.usage.completion_tokens
else:
# Approximate token count if not provided
total_tokens += len(response_text.split())

break # If successful, break out of the loop

except Exception as e:
Expand All @@ -105,5 +122,15 @@ def generate_until(self, requests) -> List[str]:
with open(self.response_persistent_file, "w") as f:
json.dump(self.response_cache, f)

# Calculate average speed
avg_speed = total_tokens / e2e_latency if e2e_latency > 0 else 0
# Log metrics
metric_dict = {
"total_tokens": total_tokens,
"e2e_latency": e2e_latency,
"avg_speed": avg_speed,
}
log_metrics(**metric_dict)

pbar.close()
return res
Loading