feat: Add longvila-r1 and benchmarks by kcz358 · Pull Request #819 · EvolvingLMMs-Lab/lmms-eval

kcz358 · 2025-09-11T08:39:48Z

Before you open a pull-request, please check if a similar issue already exists or has been closed before.

When you open a pull-request, please be sure to include the following

A descriptive title: [xxx] XXXX
A detailed description

If you meet the lint warnings, you can use following scripts to reformat code.

pip install pre-commit
pre-commit install
pre-commit run --all-files

Thank you for your contributions!

Summary by CodeRabbit

New Features
- Added LongVILA chat-based multimodal model support (vLLM-backed) with batching, progress reporting, caching, and embedding-based prompts.
- Added LVBench evaluation task with video support, prompt templates, result processing, and scoring utilities.
Improvements
- Tool call outputs now include clearer, delimited responses with explicit image and text handling.
- Model argument parsing now recognizes additional model keys (version/name/id).
Bug Fixes
- Improved time-range extraction with broader patterns and a robust fallback.
Chores
- remote_code/ added to .gitignore.

coderabbitai · 2025-09-11T08:39:55Z

Walkthrough

Adds a LongVila chat-model wrapper (vLLM + prompt embeddings) and registers it, introduces LVBench task config and utilities for video MC evaluation, tweaks async OpenAI tool output handling, expands model-name parsing, and ignores remote_code/* in Git.

Changes

Cohort / File(s)	Summary
Git ignore updates `\.gitignore`	Add ignore rule for `remote_code/*`.
Model registry update `lmms_eval/models/__init__.py`	Register chat-template key `'longvila': 'LongVila'` in `AVAILABLE_CHAT_TEMPLATE_MODELS`.
LongVila integration `lmms_eval/models/chat/longvila.py`	Add `LongVila` model class (registered via `@register_model("longvila")`) implementing vLLM-based generation with prompt embeddings, media preprocessing via model `remote_code`, encoder-only embedding loading, per-request embedding construction, batching, caching, and generation pipeline; includes WORKERS env config and stubbed/unsupported methods.
LVBench task additions `lmms_eval/tasks/lvbench/lvbench.yaml`, `lmms_eval/tasks/lvbench/utils.py`	Add LVBench task config (dataset, generation params, metric `lvbench_score`) and utilities: HF cache resolution, `lvbench_doc_to_visual`, `lvbench_doc_to_text`, `extract_characters_regex`, and `lvbench_process_results` for prompt building and answer extraction/scoring.
Logger tweak `lmms_eval/loggers/evaluation_tracker.py`	Extend model-argument prefix recognition in `_get_model_name` to include `model_version=`, `model_name=`, and `model_id=`.
Async OpenAI tool handling `lmms_eval/models/chat/async_openai.py`	Improve tool-call output wrapping and messaging: create `tool`-role messages, append structured tool outputs (image URLs/text) into responses, and close tool response blocks with `</tool_response>`.
Charades timestamp fallback `lmms_eval/tasks/charades_sta/eval_tvg.py`	Broaden time-range regex to accept en dash and hyphen; add fallback regex to extract numeric time pairs when primary pattern finds none.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Eval as Evaluator
  participant LV as LongVila
  participant Enc as Encoder (embed-only)
  participant vLLM as vLLM Engine
  participant Cache as Cache

  Eval->>LV: generate_until(requests)
  loop per request
    LV->>LV: _to_remote_conversation(request)
    LV->>Enc: process media + tokenize conversation
    Enc-->>LV: inputs_embeds
  end
  LV->>Cache: check cached responses
  alt cache hit
    Cache-->>LV: cached responses
  else cache miss
    LV->>vLLM: generate(inputs_embeds, SamplingParams)
    vLLM-->>LV: outputs
    LV->>Cache: store outputs
  end
  LV-->>Eval: return texts

sequenceDiagram
  autonumber
  participant Runner as Task Runner
  participant DS as LVBench Dataset
  participant Utils as lvbench.utils
  participant Model as LongVila
  participant Metrics as Scorer

  Runner->>DS: load sample
  Runner->>Utils: lvbench_doc_to_visual / doc_to_text
  Utils-->>Runner: video path(s) + prompt
  Runner->>Model: generate_until(prompt + media)
  Model-->>Runner: prediction
  Runner->>Utils: lvbench_process_results(doc, [prediction])
  Utils-->>Runner: { lvbench_score: 0/1 }

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

Luodian

Poem

In burrows of code I quietly nibble,
embeddings and videos stacked in a dribble.
LVBench carrots, choose A–D with care,
vLLM hums, answers float on air.
A rabbit hops in — "Good job!" — nibble, nibble. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The PR description is just the repository contributor template checklist and does not describe the actual changes, motivation, files modified, testing performed, or any runtime/data requirements, so it fails to meet the repository's required description template. Because reviewers cannot see a summary of the LongVILA model addition, the LVBench benchmarks, or any instructions to reproduce or validate the changes, the description is insufficient for review.	Please replace the template placeholder with a full description that summarizes the changes (e.g., new LongVILA model wrapper, lvbench task/config and utils, logger adjustments, and any other file updates), explains the motivation and any breaking/runtime implications, and lists testing steps or example commands to run the benchmarks. Also include required environment or data details (for example remote_code expectations, HF cache paths, or where video chunks are stored), note known limitations or TODOs, and mark the checklist boxes (or explain why they remain unchecked) so reviewers can validate and reproduce the work. Finally, run and report pre-commit/lint checks and update the PR body before requesting review.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The PR title "feat: Add longvila-r1 and benchmarks" concisely and accurately summarizes the main changes in the diff—adding a LongVILA model wrapper and benchmark/config files—so it is relevant and clear for reviewers scanning history. It is short, focused, and free of noisy elements. The title captures the primary developer intent without needing to list every modified file.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/longvila_r1

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e7f19d and e324e19.

📒 Files selected for processing (1)

lmms_eval/loggers/evaluation_tracker.py (1 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

lmms_eval/loggers/evaluation_tracker.py

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

kcz358 · 2025-09-11T08:43:45Z

I have the video chunks on my local server. However, I haven't yet find ways to upload to hub...

coderabbitai

Actionable comments posted: 8

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

lmms_eval/tasks/lvbench/utils.py (1)

59-75: Fix docstring, remove extraneous f-string, add types

Docstring mentions “videomme”; should be lvbench.
Remove f prefix (Ruff F541).
Add type hints.

-def lvbench_process_results(doc, results):
-    """
-    Args:
-        doc: a instance of the eval dataset
-        results: [pred]
-    Returns:
-        a dictionary with key: metric name (in this case videomme score), value: metric value
-    """
+from typing import Dict, List
+def lvbench_process_results(doc: Dict, results: List[str]) -> Dict[str, bool]:
+    """
+    Args:
+        doc: one eval dataset instance
+        results: [pred]
+    Returns:
+        {"lvbench_score": bool}
+    """
     pred = results[0]
     pred_ans = extract_characters_regex(pred)
-    # gt_ans = doc["answer"].lower().strip().replace(".", "")
     gt_ans = doc["answer"]
     score = pred_ans == gt_ans
 
-    # return {f"videomme_perception_score": data_dict for metric in matrices}
-    return {f"lvbench_score": score}
+    return {"lvbench_score": score}

🧹 Nitpick comments (5)

lmms_eval/tasks/lvbench/utils.py (1)

27-35: Don’t mutate caller-provided dict; set defaults immutably + type hints

-from typing import Dict, Optional
+from typing import Dict, Optional
@@
-def lvbench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
-    if lmms_eval_specific_kwargs is None:
-        lmms_eval_specific_kwargs = {}
-    if "pre_prompt" not in lmms_eval_specific_kwargs:
-        lmms_eval_specific_kwargs["pre_prompt"] = ""
-    if "post_prompt" not in lmms_eval_specific_kwargs:
-        lmms_eval_specific_kwargs["post_prompt"] = "\nAnswer the question with the option letter"
-    return lmms_eval_specific_kwargs["pre_prompt"] + doc["question"] + lmms_eval_specific_kwargs["post_prompt"]
+def lvbench_doc_to_text(
+    doc: Dict, lmms_eval_specific_kwargs: Optional[Dict[str, str]] = None
+) -> str:
+    cfg = {"pre_prompt": "", "post_prompt": "\nAnswer the question with the option letter"}
+    if lmms_eval_specific_kwargs:
+        cfg.update(lmms_eval_specific_kwargs)
+    return f'{cfg["pre_prompt"]}{doc["question"]}{cfg["post_prompt"]}'

lmms_eval/models/chat/longvila.py (4)

21-21: Unused constant

WORKERS isn’t used. Remove or use to control batching/threads.

-WORKERS = int(os.getenv("WORKERS", "32"))
+# WORKERS env reserved for future parallelism tuning

77-96: Add return type hints and docstring

Public helper should be typed and documented per guidelines.

-    def _to_remote_conversation(self, chat_messages: ChatMessages) -> list:
+    def _to_remote_conversation(self, chat_messages: ChatMessages) -> list:
         """
         Convert ChatMessages to LongVILA remote_code conversation format.
         [{"from": "human"|"gpt", "value": [str | {"path": media_path}, ...]}, ...]
         """

If you’re open, I can expand types to List[Dict[str, List[Union[str, Dict[str, str]]]]].

165-175: Cache write error handling

If a generation fails for one item, caching the rest silently could mask issues. Consider try/except around per-item cache writes.

-            for req, text in zip(batch_requests, response_text):
-                self.add_request_response_to_cache(req, text)
+            for req, text in zip(batch_requests, response_text):
+                try:
+                    self.add_request_response_to_cache(req, text)
+                except Exception:
+                    pass  # best-effort cache

1-15: Clean unused imports and align with guidelines

Remove unused: sys, ThreadPoolExecutor, Union, log_metrics. Keep line length ≤ 88.

-import sys
@@
-from concurrent.futures import ThreadPoolExecutor
-from typing import List, Optional, Tuple, Union
+from typing import List, Optional, Tuple
@@
-from lmms_eval.models.model_utils.gen_metrics import log_metrics

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b87aa4d and 1e5a57f.

📒 Files selected for processing (5)

.gitignore (1 hunks)
lmms_eval/models/__init__.py (1 hunks)
lmms_eval/models/chat/longvila.py (1 hunks)
lmms_eval/tasks/lvbench/lvbench.yaml (1 hunks)
lmms_eval/tasks/lvbench/utils.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (2)

**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

lmms_eval/tasks/lvbench/utils.py
lmms_eval/models/__init__.py
lmms_eval/models/chat/longvila.py

**/*.{yml,yaml,json}

📄 CodeRabbit inference engine (CLAUDE.md)

Use Prettier (via pre-commit) to format YAML and JSON files

Files:

lmms_eval/tasks/lvbench/lvbench.yaml

🧬 Code graph analysis (1)

lmms_eval/models/chat/longvila.py (5)

lmms_eval/api/instance.py (1)

Instance (6-29)

lmms_eval/api/registry.py (1)

register_model (11-24)

lmms_eval/protocol.py (2)

ChatMessages (41-116)

extract_media (44-58)

lmms_eval/api/task.py (1)

doc_to_messages (1648-1680)

lmms_eval/api/model.py (3)

load_cache (107-112)

get_response_from_cache (208-223)

add_request_response_to_cache (201-206)

🪛 Ruff (0.12.2)

lmms_eval/tasks/lvbench/utils.py

12-12: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

74-74: f-string without any placeholders

Remove extraneous f prefix

(F541)

lmms_eval/models/chat/longvila.py

60-60: Do not catch blind exception: Exception

(BLE001)

61-61: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

61-61: Avoid specifying long messages outside the exception class

(TRY003)

179-179: Unused method argument: requests

(ARG002)

181-181: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🔇 Additional comments (5)

.gitignore (1)

53-53: Ignoring remote_code/* may hide required runtime code

LongVILA tries to import remote_code.* at runtime. If you plan to place remote_code/ under the repo root during local runs, ignoring it means devs may forget to include it and hit ImportError. Prefer loading remote_code from the model folder (see comments in longvila.py) and keep the repo clean, or add a README note about where remote_code/ must live at runtime.

Would you like me to add a short README section clarifying where remote_code/ is expected?

lmms_eval/tasks/lvbench/lvbench.yaml (1)

1-29: Config reads clean; consider pinning dataset revision for reproducibility

LGTM overall. To ensure reproducible caching, consider adding a dataset_revision (commit hash or tag) and double-check whether using train as test_split is intentional.

Would you like me to add a dataset_revision: field and switch test_split if needed?

lmms_eval/models/__init__.py (1)

91-92: Model registry mapping LGTM — runtime import blocked by missing dependency

"longvila": "LongVila" is correct; LongVila is defined at lmms_eval/models/chat/longvila.py:25.
Dynamic import check failed with ModuleNotFoundError: No module named 'loguru' — install dependencies (pip install -r requirements.txt or pip install loguru) and re-run:

python - <<'PY'
from lmms_eval.models import get_model
print(get_model("longvila").name)
PY

lmms_eval/models/chat/longvila.py (2)

67-75: Type-safety and config: forward trust_remote_code; optional tokenizers

If AutoModel doesn’t expose .tokenizer / .vision_tower, this will break. At minimum, pass trust_remote_code=trust_remote_code and document the requirement.

Would you confirm the remote model exposes tokenizer and vision_tower.image_processor?

140-177: Confirm and guard for vLLM prompt_embeds; add fallback to token-ID flow

vLLM accepts prompt embeddings via the "prompt_embeds" parameter (offline: pass a torch tensor as vllm.inputs.EmbedsPrompt; online/OpenAI-compatible server: base64-encoded tensor and the server must be started with engine=v0 and --enable-prompt-embeds). Prompt-embeds are supported only in the v0 engine.

In lmms_eval/models/chat/longvila.py (lines 140–177) wrap the generate call with a capability check or try/except; if prompt_embeds is unsupported, fall back to constructing token-ID prompts (or raise a clear error instructing operators to enable --enable-prompt-embeds).

coderabbitai · 2025-09-11T08:56:37Z

+try:
+    from vllm import LLM, SamplingParams
+except ImportError:
+    vllm = None
+


🛠️ Refactor suggestion

vLLM import fallback can crash later; fail fast with actionable error

If vllm isn’t installed, SamplingParams is undefined but later used. Guard and raise with guidance.

-try: - from vllm import LLM, SamplingParams -except ImportError: - vllm = None +try: + from vllm import LLM, SamplingParams +except ImportError as _vllm_err: + LLM = None # type: ignore[assignment] + SamplingParams = None # type: ignore[assignment] + _VLLM_IMPORT_ERROR = _vllm_err

And in __init__:

def __init__( @@ ): + if LLM is None or SamplingParams is None: + raise ImportError( + "vLLM is required for LongVila. Install with `pip install vllm`." + ) from _VLLM_IMPORT_ERROR

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

try:

from vllm import LLM, SamplingParams

except ImportError:

vllm = None

try:

from vllm import LLM, SamplingParams

except ImportError as _vllm_err:

LLM = None # type: ignore[assignment]

SamplingParams = None # type: ignore[assignment]

_VLLM_IMPORT_ERROR = _vllm_err

🤖 Prompt for AI Agents

In lmms_eval/models/chat/longvila.py around lines 16-20, the vllm import fallback leaves SamplingParams undefined which will cause a crash later; modify the import block to fail fast by raising an ImportError (or RuntimeError) with an actionable message instructing to install vllm (and optionally provide pip install instructions and mention SamplingParams/LLM are required), and update the class __init__ to check that vllm and SamplingParams are available before using them, raising the same clear error if they are missing so callers get immediate, actionable feedback instead of a later NameError.

coderabbitai · 2025-09-11T08:56:37Z

+    def __init__(
+        self,
+        model="Efficient-Large-Model/LongVILA-R1-7B",
+        tensor_parallel_size=1,
+        data_parallel_size=1,
+        gpu_memory_utilization=0.5,
+        batch_size=1,
+        max_frame_num=32,
+        trust_remote_code=True,
+        chat_template=None,
+        max_pixels: int = 1605632,
+        min_image_pixels=28,
+        fps: Optional[int] = None,
+        device_map: Optional[str] = "cuda",
+        **kwargs,
+    ):
+        # vLLM requires the path to the autoregressive llm weights under the model root
+        model_root = model
+        llm_path = os.path.join(model_root, "llm")
+        # Enable prompt embeddings so we can pass encoder-produced embeddings directly


🛠️ Refactor suggestion

Validate model paths and pass-through trust flag

Check that llm_path exists to fail early with a clear message; also forward trust_remote_code consistently.

model_root = model llm_path = os.path.join(model_root, "llm") + if not os.path.isdir(llm_path): + raise FileNotFoundError( + f"Expected autoregressive LLM under '{llm_path}'. " + "Verify your LongVILA model layout." + ) @@ - kwargs["enable_prompt_embeds"] = True + kwargs["enable_prompt_embeds"] = True @@ - super().__init__(llm_path, tensor_parallel_size, data_parallel_size, gpu_memory_utilization, batch_size, max_frame_num, trust_remote_code, chat_template, min_image_pixels, **kwargs) + super().__init__( + llm_path, + tensor_parallel_size, + data_parallel_size, + gpu_memory_utilization, + batch_size, + max_frame_num, + trust_remote_code, + chat_template, + min_image_pixels, + **kwargs, + )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def __init__(

self,

model="Efficient-Large-Model/LongVILA-R1-7B",

tensor_parallel_size=1,

data_parallel_size=1,

gpu_memory_utilization=0.5,

batch_size=1,

max_frame_num=32,

trust_remote_code=True,

chat_template=None,

max_pixels: int = 1605632,

min_image_pixels=28,

fps: Optional[int] = None,

device_map: Optional[str] = "cuda",

**kwargs,

):

# vLLM requires the path to the autoregressive llm weights under the model root

model_root = model

llm_path = os.path.join(model_root, "llm")

# Enable prompt embeddings so we can pass encoder-produced embeddings directly

def __init__(

self,

model="Efficient-Large-Model/LongVILA-R1-7B",

tensor_parallel_size=1,

data_parallel_size=1,

gpu_memory_utilization=0.5,

batch_size=1,

max_frame_num=32,

trust_remote_code=True,

chat_template=None,

max_pixels: int = 1605632,

min_image_pixels=28,

fps: Optional[int] = None,

device_map: Optional[str] = "cuda",

**kwargs,

):

# vLLM requires the path to the autoregressive llm weights under the model root

model_root = model

llm_path = os.path.join(model_root, "llm")

if not os.path.isdir(llm_path):

raise FileNotFoundError(

f"Expected autoregressive LLM under '{llm_path}'. "

"Verify your LongVILA model layout."

)

# Enable prompt embeddings so we can pass encoder-produced embeddings directly

kwargs["enable_prompt_embeds"] = True

super().__init__(

llm_path,

tensor_parallel_size,

data_parallel_size,

gpu_memory_utilization,

batch_size,

max_frame_num,

trust_remote_code,

chat_template,

min_image_pixels,

**kwargs,

)

🤖 Prompt for AI Agents

In lmms_eval/models/chat/longvila.py around lines 28 to 47, add an explicit existence check for the computed llm_path and raise a clear FileNotFoundError if it doesn't exist so the class fails fast with a helpful message; also ensure the incoming trust_remote_code flag is stored on self (e.g., self.trust_remote_code = trust_remote_code) and passed through to any downstream model-loading calls that accept trust_remote_code so the same behavior is consistently applied.

coderabbitai · 2025-09-11T08:56:38Z

+        # Set up imports from the model's remote_code directory
+        # The LongVILA repo provides preprocessing utilities we must call directly
+        try:
+            from remote_code.media import extract_media as _extract_media
+            from remote_code.mm_utils import process_images as _process_images
+            from remote_code.tokenizer_utils import (
+                tokenize_conversation as _tokenize_conversation,
+            )
+        except Exception as e:
+            raise ImportError(f"Failed to import LongVILA remote_code utilities from '{model_root}'. Ensure the model path contains remote_code. Original error: {e}")
+
+        self.extract_media = _extract_media
+        self.process_images = _process_images
+        self.tokenize_conversation = _tokenize_conversation
+


⚠️ Potential issue

remote_code import path will fail unless it’s on sys.path

The code imports remote_code.* without adding model_root to sys.path. Load explicitly from the model folder to avoid relying on a repo-level remote_code/.

- try: - from remote_code.media import extract_media as _extract_media - from remote_code.mm_utils import process_images as _process_images - from remote_code.tokenizer_utils import ( - tokenize_conversation as _tokenize_conversation, - ) - except Exception as e: - raise ImportError(f"Failed to import LongVILA remote_code utilities from '{model_root}'. Ensure the model path contains remote_code. Original error: {e}") + try: + remote_code_dir = os.path.join(model_root) + if remote_code_dir not in sys.path: + sys.path.insert(0, remote_code_dir) + from remote_code.media import extract_media as _extract_media # type: ignore + from remote_code.mm_utils import process_images as _process_images # type: ignore + from remote_code.tokenizer_utils import ( # type: ignore + tokenize_conversation as _tokenize_conversation, + ) + except Exception as err: + raise ImportError( + f"Failed to import LongVILA remote_code from '{model_root}'. " + "Ensure the model path contains a 'remote_code/' package." + ) from err

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# Set up imports from the model's remote_code directory

# The LongVILA repo provides preprocessing utilities we must call directly

try:

from remote_code.media import extract_media as _extract_media

from remote_code.mm_utils import process_images as _process_images

from remote_code.tokenizer_utils import (

tokenize_conversation as _tokenize_conversation,

)

except Exception as e:

raise ImportError(f"Failed to import LongVILA remote_code utilities from '{model_root}'. Ensure the model path contains remote_code. Original error: {e}")

self.extract_media = _extract_media

self.process_images = _process_images

self.tokenize_conversation = _tokenize_conversation

# Set up imports from the model's remote_code directory

# The LongVILA repo provides preprocessing utilities we must call directly

try:

remote_code_dir = os.path.join(model_root)

if remote_code_dir not in sys.path:

sys.path.insert(0, remote_code_dir)

from remote_code.media import extract_media as _extract_media # type: ignore

from remote_code.mm_utils import process_images as _process_images # type: ignore

from remote_code.tokenizer_utils import ( # type: ignore

tokenize_conversation as _tokenize_conversation,

)

except Exception as err:

raise ImportError(

f"Failed to import LongVILA remote_code from '{model_root}'. "

"Ensure the model path contains a 'remote_code/' package."

) from err

self.extract_media = _extract_media

self.process_images = _process_images

self.tokenize_conversation = _tokenize_conversation

🧰 Tools

🪛 Ruff (0.12.2)

60-60: Do not catch blind exception: Exception

(BLE001)

61-61: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling

(B904)

61-61: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents

In lmms_eval/models/chat/longvila.py around lines 52 to 66, the imports from remote_code will fail unless model_root is on sys.path; prepend the absolute model_root to sys.path before the try block (and optionally pop it afterward) so Python can import remote_code from the model folder, ensure model_root exists and is resolved with os.path.abspath, and keep the existing try/except and attribute assignments intact.

coderabbitai · 2025-09-11T08:56:38Z

+    def make_one_request(self, request: Instance) -> Tuple["object", dict]:
+        """
+        Build prompt embeddings and per-request sampling params from an Instance.
+        Returns (inputs_embeds, params_dict). Does not mutate input.
+        """
+        ctx, doc_to_messages, gen_kwargs, doc_id, task, split = request.arguments
+        raw_messages = doc_to_messages(self.task_dict[task][split][doc_id])
+        chat_messages = ChatMessages(messages=raw_messages)
+
+        # Copy to avoid side-effects across threads
+        _gen = dict(gen_kwargs or {})
+        _gen.setdefault("max_new_tokens", 4096)
+        _gen.setdefault("temperature", 0)
+        _gen.setdefault("top_p", 0.95)
+
+        params = {
+            "temperature": _gen["temperature"],
+            "max_tokens": _gen["max_new_tokens"],
+            "top_p": _gen["top_p"],
+        }
+
+        # Convert to LongVILA remote_code conversation format
+        conversation = self._to_remote_conversation(chat_messages)
+
+        # Extract and preprocess media
+        if self.fps:
+            self.model_encoder.config.fps = self.fps
+        else:
+            self.model_encoder.config.num_video_frames = self.max_frame_num
+            self.model_encoder.config.fps = 0
+        media = self.extract_media(conversation, self.model_encoder.config)
+        if "video" in media and media["video"] is not None:
+            media["video"] = [self.process_images(images, self.model_encoder.vision_tower.image_processor, self.model_encoder.config).half() for images in media["video"]]
+
+        # Tokenize conversation and move to CUDA for embedding
+        input_ids = self.tokenize_conversation(conversation, self.model_encoder.tokenizer, add_generation_prompt=True).unsqueeze(0).cuda()
+
+        # Create prompt embeddings using the model encoder
+        inputs_embeds, _, _ = self.model_encoder._embed(input_ids, media, {"video": {}}, None, None)
+
+        return inputs_embeds, params
+


⚠️ Potential issue

Device placement: avoid unconditional .cuda()

Calling .cuda() breaks on CPU-only setups or non-CUDA devices. Derive device from the encoder.

- input_ids = self.tokenize_conversation(conversation, self.model_encoder.tokenizer, add_generation_prompt=True).unsqueeze(0).cuda() + input_ids = ( + self.tokenize_conversation( + conversation, + self.model_encoder.tokenizer, + add_generation_prompt=True, + ) + .unsqueeze(0) + .to(next(self.model_encoder.parameters()).device) + )

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def make_one_request(self, request: Instance) -> Tuple["object", dict]:

"""

Build prompt embeddings and per-request sampling params from an Instance.

Returns (inputs_embeds, params_dict). Does not mutate input.

"""

ctx, doc_to_messages, gen_kwargs, doc_id, task, split = request.arguments

raw_messages = doc_to_messages(self.task_dict[task][split][doc_id])

chat_messages = ChatMessages(messages=raw_messages)

# Copy to avoid side-effects across threads

_gen = dict(gen_kwargs or {})

_gen.setdefault("max_new_tokens", 4096)

_gen.setdefault("temperature", 0)

_gen.setdefault("top_p", 0.95)

params = {

"temperature": _gen["temperature"],

"max_tokens": _gen["max_new_tokens"],

"top_p": _gen["top_p"],

}

# Convert to LongVILA remote_code conversation format

conversation = self._to_remote_conversation(chat_messages)

# Extract and preprocess media

if self.fps:

self.model_encoder.config.fps = self.fps

else:

self.model_encoder.config.num_video_frames = self.max_frame_num

self.model_encoder.config.fps = 0

media = self.extract_media(conversation, self.model_encoder.config)

if "video" in media and media["video"] is not None:

media["video"] = [self.process_images(images, self.model_encoder.vision_tower.image_processor, self.model_encoder.config).half() for images in media["video"]]

# Tokenize conversation and move to CUDA for embedding

input_ids = self.tokenize_conversation(conversation, self.model_encoder.tokenizer, add_generation_prompt=True).unsqueeze(0).cuda()

# Create prompt embeddings using the model encoder

inputs_embeds, _, _ = self.model_encoder._embed(input_ids, media, {"video": {}}, None, None)

return inputs_embeds, params

def make_one_request(self, request: Instance) -> Tuple["object", dict]:

"""

Build prompt embeddings and per-request sampling params from an Instance.

Returns (inputs_embeds, params_dict). Does not mutate input.

"""

ctx, doc_to_messages, gen_kwargs, doc_id, task, split = request.arguments

raw_messages = doc_to_messages(self.task_dict[task][split][doc_id])

chat_messages = ChatMessages(messages=raw_messages)

# Copy to avoid side-effects across threads

_gen = dict(gen_kwargs or {})

_gen.setdefault("max_new_tokens", 4096)

_gen.setdefault("temperature", 0)

_gen.setdefault("top_p", 0.95)

params = {

"temperature": _gen["temperature"],

"max_tokens": _gen["max_new_tokens"],

"top_p": _gen["top_p"],

}

# Convert to LongVILA remote_code conversation format

conversation = self._to_remote_conversation(chat_messages)

# Extract and preprocess media

if self.fps:

self.model_encoder.config.fps = self.fps

else:

self.model_encoder.config.num_video_frames = self.max_frame_num

self.model_encoder.config.fps = 0

media = self.extract_media(conversation, self.model_encoder.config)

if "video" in media and media["video"] is not None:

media["video"] = [self.process_images(images, self.model_encoder.vision_tower.image_processor, self.model_encoder.config).half() for images in media["video"]]

# Tokenize conversation and move to CUDA for embedding

input_ids = (

self.tokenize_conversation(

conversation,

self.model_encoder.tokenizer,

add_generation_prompt=True,

)

.unsqueeze(0)

.to(next(self.model_encoder.parameters()).device)

)

# Create prompt embeddings using the model encoder

inputs_embeds, _, _ = self.model_encoder._embed(input_ids, media, {"video": {}}, None, None)

return inputs_embeds, params

coderabbitai · 2025-09-11T08:56:38Z

+    def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
+        # TODO
+        assert False, "GPT4V not support"
+
+    def generate_until_multi_round(self, requests) -> List[str]:
+        raise NotImplementedError("TODO: Implement multi-round generation")


🛠️ Refactor suggestion

Replace assert False and remove unused arg warning

Use a proper exception and underscore the unused parameter.

- def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: - # TODO - assert False, "GPT4V not support" + def loglikelihood(self, _requests: List[Instance]) -> List[Tuple[float, bool]]: # noqa: ARG002 + raise NotImplementedError("LongVila does not support loglikelihood.")

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:

# TODO

assert False, "GPT4V not support"

def generate_until_multi_round(self, requests) -> List[str]:

raise NotImplementedError("TODO: Implement multi-round generation")

def loglikelihood(self, _requests: List[Instance]) -> List[Tuple[float, bool]]: # noqa: ARG002

raise NotImplementedError("LongVila does not support loglikelihood.")

def generate_until_multi_round(self, requests) -> List[str]:

raise NotImplementedError("TODO: Implement multi-round generation")

🧰 Tools

🪛 Ruff (0.12.2)

179-179: Unused method argument: requests

(ARG002)

181-181: Do not assert False (python -O removes these calls), raise AssertionError()

Replace assert False

(B011)

🤖 Prompt for AI Agents

In lmms_eval/models/chat/longvila.py around lines 179 to 184, replace the assert False with a proper exception and silence the unused-parameter warning by renaming the argument to _requests: implement loglikelihood to raise NotImplementedError("GPT4V not supported") instead of asserting, and change generate_until_multi_round(self, requests) to generate_until_multi_round(self, _requests) (or similarly prefix the parameter with an underscore) and keep its NotImplementedError; this removes the assert and the unused-arg warning while clearly signaling unimplemented functionality.

coderabbitai · 2025-09-11T08:56:38Z

+hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/")
+base_cache_dir = os.path.expanduser(hf_home)
+with open(Path(__file__).parent / "lvbench.yaml", "r") as f:
+    raw_data = f.readlines()
+    safe_data = []
+    for i, line in enumerate(raw_data):
+        # remove function definition since yaml load cannot handle it
+        if "!function" not in line:
+            safe_data.append(line)
+cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]


🛠️ Refactor suggestion

Harden YAML parsing and fix unused loop var

Rename unused i to _i (Ruff B007).

Add defensive checks around YAML keys to avoid KeyError if the schema changes.

Apply:

-raw_data = f.readlines() -safe_data = [] -for i, line in enumerate(raw_data): +raw_data = f.readlines() +safe_data = [] +for _i, line in enumerate(raw_data): # remove function definition since yaml load cannot handle it if "!function" not in line: safe_data.append(line) -cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"] +data = yaml.safe_load("".join(safe_data)) or {} +try: + cache_name = data["dataset_kwargs"]["cache_dir"] +except KeyError as exc: + raise KeyError("Expected 'dataset_kwargs.cache_dir' in lvbench.yaml") from exc

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/")

base_cache_dir = os.path.expanduser(hf_home)

with open(Path(__file__).parent / "lvbench.yaml", "r") as f:

raw_data = f.readlines()

safe_data = []

for i, line in enumerate(raw_data):

# remove function definition since yaml load cannot handle it

if "!function" not in line:

safe_data.append(line)

cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]

hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/")

base_cache_dir = os.path.expanduser(hf_home)

with open(Path(__file__).parent / "lvbench.yaml", "r") as f:

raw_data = f.readlines()

safe_data = []

for _i, line in enumerate(raw_data):

# remove function definition since yaml load cannot handle it

if "!function" not in line:

safe_data.append(line)

data = yaml.safe_load("".join(safe_data)) or {}

try:

cache_name = data["dataset_kwargs"]["cache_dir"]

except KeyError as exc:

raise KeyError("Expected 'dataset_kwargs.cache_dir' in lvbench.yaml") from exc

🧰 Tools

🪛 Ruff (0.12.2)

12-12: Loop control variable i not used within loop body

Rename unused i to _i

(B007)

🤖 Prompt for AI Agents

In lmms_eval/tasks/lvbench/utils.py around lines 7 to 16, the loop uses an unused variable `i` and the YAML access assumes keys exist which can raise KeyError; rename `i` to `_i` to satisfy Ruff B007 and add defensive checks when loading the YAML (e.g., check that the top-level mapping exists and contains "dataset_kwargs" and "cache_dir" keys, raising a clear error or falling back to a default) before assigning cache_name so the code won't crash if the schema changes.

coderabbitai · 2025-09-11T08:56:38Z

+def lvbench_doc_to_visual(doc):
+    cache_dir = os.path.join(base_cache_dir, cache_name)
+    video_path = doc["video_path"]
+    assert os.path.exists(os.path.join(cache_dir, video_path))
+    video_path = os.path.join(cache_dir, video_path)
+    return [video_path]


🛠️ Refactor suggestion

Avoid asserts for I/O; add types and clearer errors

Use Path checks and raise FileNotFoundError. Add type hints.

-def lvbench_doc_to_visual(doc): - cache_dir = os.path.join(base_cache_dir, cache_name) - video_path = doc["video_path"] - assert os.path.exists(os.path.join(cache_dir, video_path)) - video_path = os.path.join(cache_dir, video_path) - return [video_path] +from typing import Dict, List + +def lvbench_doc_to_visual(doc: Dict) -> List[str]: + cache_dir = Path(base_cache_dir) / cache_name + video_path = Path(doc["video_path"]) + abs_path = video_path if video_path.is_absolute() else (cache_dir / video_path) + if not abs_path.exists(): + raise FileNotFoundError(f"Video not found: {abs_path}") + return [str(abs_path)]

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents

In lmms_eval/tasks/lvbench/utils.py around lines 19 to 24, the function uses an assert for an I/O check and lacks type hints; replace the assert with pathlib.Path checks and raise FileNotFoundError with a clear message if the file is missing, add type hints (e.g., doc: Mapping[str, Any] -> List[Path]) on the function signature, construct video_path as a Path using base_cache_dir/cache_name and doc["video_path"], and return a list of Path objects; ensure pathlib.Path is imported and adjust any callers if they expect strings.

coderabbitai · 2025-09-11T08:56:38Z

+def extract_characters_regex(s):
+    s = s.strip()
+    answer_prefixes = [
+        "The best answer is",
+        "The correct answer is",
+        "The answer is",
+        "The answer",
+        "The best option is" "The correct option is",
+        "Best answer:" "Best option:",
+    ]
+    for answer_prefix in answer_prefixes:
+        s = s.replace(answer_prefix, "")
+
+    if len(s.split()) > 10 and not re.search("[ABCD]", s):
+        return ""
+
+    matches = re.search(r"[ABCD]", s)
+    if matches is None:
+        return ""
+    return matches[0]


⚠️ Potential issue

Fix concatenated strings bug and improve extraction robustness

Missing commas are concatenating literals (e.g., "Best answer:" "Best option:"). Also prefer case-insensitive matching and strip punctuation.

-def extract_characters_regex(s): +from typing import Optional +def extract_characters_regex(s: str) -> str: s = s.strip() answer_prefixes = [ "The best answer is", "The correct answer is", "The answer is", "The answer", - "The best option is" "The correct option is", - "Best answer:" "Best option:", + "The best option is", + "The correct option is", + "Best answer:", + "Best option:", ] for answer_prefix in answer_prefixes: - s = s.replace(answer_prefix, "") + s = re.sub(re.escape(answer_prefix), "", s, flags=re.IGNORECASE).strip() - if len(s.split()) > 10 and not re.search("[ABCD]", s): + if len(s.split()) > 10 and not re.search(r"\b[ABCD]\b", s, flags=re.I): return "" - matches = re.search(r"[ABCD]", s) + matches = re.search(r"\b([ABCD])\b", s, flags=re.I) if matches is None: return "" - return matches[0] + return matches.group(1).upper()

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

def extract_characters_regex(s):

s = s.strip()

answer_prefixes = [

"The best answer is",

"The correct answer is",

"The answer is",

"The answer",

"The best option is" "The correct option is",

"Best answer:" "Best option:",

]

for answer_prefix in answer_prefixes:

s = s.replace(answer_prefix, "")

if len(s.split()) > 10 and not re.search("[ABCD]", s):

return ""

matches = re.search(r"[ABCD]", s)

if matches is None:

return ""

return matches[0]

from typing import Optional

def extract_characters_regex(s: str) -> str:

s = s.strip()

answer_prefixes = [

"The best answer is",

"The correct answer is",

"The answer is",

"The answer",

"The best option is",

"The correct option is",

"Best answer:",

"Best option:",

]

for answer_prefix in answer_prefixes:

s = re.sub(re.escape(answer_prefix), "", s, flags=re.IGNORECASE).strip()

if len(s.split()) > 10 and not re.search(r"\b[ABCD]\b", s, flags=re.I):

return ""

matches = re.search(r"\b([ABCD])\b", s, flags=re.I)

if matches is None:

return ""

return matches.group(1).upper()

🤖 Prompt for AI Agents

In lmms_eval/tasks/lvbench/utils.py around lines 37 to 56, the answer_prefix list has missing commas causing string concatenation and the character extraction is brittle; fix by inserting the missing commas between list literals, perform prefix removal and then normalize s (lower/upper consistently) and strip surrounding punctuation, use a case-insensitive regex (e.g., re.search(r"\b[ABCD]\b", s, flags=re.IGNORECASE) or re.search(r"[ABCD]", s, flags=re.IGNORECASE)) to find the first A/B/C/D, and return the uppercase character (or "" if none); also keep the existing short-word-length guard but apply it after normalization.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

lmms_eval/models/chat/async_openai.py (4)
174-179: Guard against None content from tool_calls responses

When finish_reason == "tool_calls", message.content is often None. Current concatenation and message append can raise TypeError or send invalid payload.
-        last_response = response.choices[0].message.content
-        all_response += last_response
+        last_response = response.choices[0].message.content or ""
+        all_response += last_response
 ...
-            messages.append({"role": "assistant", "content": last_response})
+            if last_response:
+                messages.append({"role": "assistant", "content": last_response})
67-70: Possible deadlock: num_cpus can be 0 on single‑core machines

cpu_count() // 2 yields 0 when cpu_count() == 1, causing Semaphore(0) to block forever.
-            self.num_cpus = cpu_count() // 2
+            self.num_cpus = max(1, cpu_count() // 2)
45-61: Type hints: fix Optional annotations and add return type

Defaults of None require Optional[...] and the async method should declare its return type. This also satisfies Pyright.
-    def __init__(
+    def __init__(
         self,
         model_version: str = "grok-2-latest",
-        base_url: str = None,
-        api_key: str = None,
+        base_url: Optional[str] = None,
+        api_key: Optional[str] = None,
         timeout: int = 600,
         max_retries: int = 5,
         max_size_in_mb: int = 20,
-        mcp_server_path: str = None,
-        num_cpus: int = None,
-        work_dir: str = None,
+        mcp_server_path: Optional[str] = None,
+        num_cpus: Optional[int] = None,
+        work_dir: Optional[str] = None,
         fps: Optional[int] = None,
         nframes: Optional[int] = 64,
         max_pixels: Optional[int] = 151200,
         min_pixels: Optional[int] = 28 * 28,
         **kwargs,
     ) -> None:
-    async def maybe_forward_with_tool(self, request: Instance, idx: int):
+    async def maybe_forward_with_tool(self, request: Instance, idx: int) -> Tuple[str, int]:
Also applies to: 124-124

212-237: Guard asyncio.run and fix broken wrapper

async_openai.generate_until (lmms_eval/models/chat/async_openai.py:212-237) calls asyncio.run unconditionally — this will raise if invoked on an existing event loop. lmms_eval/models/simple/from_log.py:119 is the only direct caller found, but its wrapper is broken (it does return generate_until(self, requests) — NameError).

lmms_eval/models/simple/from_log.py:119 — change return generate_until(self, requests) to return self.generate_until(requests) (critical fix).

lmms_eval/models/chat/async_openai.py:212-237 — do not call asyncio.run unconditionally; either make the method fully async (and let callers await) or detect a running loop (asyncio.get_running_loop / loop.is_running) and use an in-loop approach (create tasks + gather) as the fallback.

🧹 Nitpick comments (5)

lmms_eval/models/chat/async_openai.py (5)
1-37: Clean up duplicate/unused imports and satisfy Ruff (I001/F401)

Removes duplicate lmms import and unused modules.
 import asyncio
-import base64
 import json
 import os
 import tempfile
-import time
 import uuid
-from io import BytesIO
 from multiprocessing import cpu_count
 from typing import List, Optional, Tuple, Union
-
-import numpy as np
-import requests as url_requests
 from accelerate import Accelerator, DistributedType
 from tqdm import tqdm
@@
-try:
-    from decord import VideoReader, cpu
-except ImportError:
-    pass
+# decord not used here; import where needed to avoid F401
@@
-from PIL import Image
+# PIL.Image not used directly in this module; ChatMessages provides PIL images
@@
-from lmms_eval.api.model import lmms
+from lmms_eval.api.model import lmms  # (deduped)
86-89: Adjust assertion message to match allowed DistributedTypes

Message mentions DDP, but code allows FSDP, MULTI_GPU, DEEPSPEED.
-            assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported."
+            assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type. Supported: FSDP, MULTI_GPU, DEEPSPEED."
173-206: Resilience: add retries/backoff on API and tool calls

max_retries is unused; transient 5xx/429s will fail the whole run. Consider exponential backoff for self.client.chat.completions.create and mcp_client.run_tool.

I can draft a tenacity-based wrapper if you want.

41-44: Docs: add minimal docstrings for public API

Class and init lack docstrings; guidelines require them.

Happy to propose short docstrings aligned with the registry’s expectations.

141-150: Ensure messages[-1]['content'] exists and is a list before appending media-path hints.

to_openai_messages builds each message with "content": [] so appending is safe when messages is non-empty; still guard in lmms_eval/models/chat/async_openai.py around the append to (1) handle an empty messages list (create a final message with 'content': []) and (2) coerce non-list content to a list before appending to avoid IndexError or invalid payloads.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1e5a57f and 57055c8.

📒 Files selected for processing (2)

lmms_eval/loggers/evaluation_tracker.py (1 hunks)
lmms_eval/models/chat/async_openai.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

lmms_eval/loggers/evaluation_tracker.py
lmms_eval/models/chat/async_openai.py

🧬 Code graph analysis (1)

lmms_eval/models/chat/async_openai.py (1)

lmms_eval/mcp/client.py (1)

convert_result_to_openai_format (49-67)

🔇 Additional comments (1)

lmms_eval/loggers/evaluation_tracker.py (1)

71-75: Prefer model identity keys over engine=; sanitize extraction and wrap list.

Engine URLs can be logged as the model name when both are present — extract quoted/whitespace values with a regex, prefer overlays (peft/delta) then explicit identity (model_id/model_name) and treat engine= as a last-resort fallback. File: lmms_eval/loggers/evaluation_tracker.py (around lines 69–75).

@@
-        def extract_model_name(model_args: str, key: str) -> str:
-            """Extracts the model name from the model arguments using a key."""
-            args_after_key = model_args.split(key)[1]
-            return args_after_key.split(",")[0]
+        def extract_model_name(model_args: str, key: str) -> str:
+            """Extract the value for `key` (supports quoted values) up to the next comma."""
+            m = re.search(
+                rf"{re.escape(key)}\s*(?P<val>(?:'[^']*'|\"[^\"]*\"|[^,]+))",
+                model_args,
+            )
+            if not m:
+                return ""
+            return m.group("val").strip().strip("'\"")
@@
-        # order does matter, e.g. peft and delta are provided together with pretrained
-        prefixes = ["peft=", "delta=", "pretrained=", "model=", "path=", "engine=", "model_version=", "model_name=", "model_id="]
+        # Order matters:
+        # - Prefer overlays (peft/delta) over base identifiers.
+        # - Prefer explicit model identity (model_id/name) over generic fields.
+        # - Treat engine as a last-resort fallback to avoid logging URLs as model names.
+        prefixes = [
+            "peft=",
+            "delta=",
+            "pretrained=",
+            "model_id=",
+            "model_name=",
+            "model=",
+            "path=",
+            "model_version=",
+            "engine=",
+        ]

coderabbitai · 2025-09-14T05:34:18Z

@@ -184,11 +184,17 @@ async def maybe_forward_with_tool(self, request: Instance, idx: int):
                for call in message.tool_calls:
                    eval_logger.debug(f"Calling {call.function.name}...")
                    result = await self.mcp_client.run_tool(call.function.name, eval(call.function.arguments))


⚠️ Potential issue

Blocker: Replace eval(...) of tool arguments with safe JSON parsing

Evaluating LLM-provided strings is an RCE vector. Parse as JSON and fall back safely.

- result = await self.mcp_client.run_tool(call.function.name, eval(call.function.arguments)) + raw_args = call.function.arguments + if isinstance(raw_args, str): + try: + args = json.loads(raw_args) + except Exception as e: + eval_logger.warning(f"Failed to parse tool args for {call.function.name}: {e}") + args = {} + elif raw_args is None: + args = {} + else: + args = raw_args + result = await self.mcp_client.run_tool(call.function.name, args)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

result = await self.mcp_client.run_tool(call.function.name, eval(call.function.arguments))

raw_args = call.function.arguments

if isinstance(raw_args, str):

try:

args = json.loads(raw_args)

except Exception as e:

eval_logger.warning(f"Failed to parse tool args for {call.function.name}: {e}")

args = {}

elif raw_args is None:

args = {}

else:

args = raw_args

result = await self.mcp_client.run_tool(call.function.name, args)

🧰 Tools

🪛 Ruff (0.12.2)

186-186: Use of possibly insecure function; consider using ast.literal_eval

(S307)

🤖 Prompt for AI Agents

In lmms_eval/models/chat/async_openai.py around line 186, replace the unsafe eval(call.function.arguments) with safe JSON parsing: import json at the top, then attempt to parse call.function.arguments via json.loads in a try/except (catch json.JSONDecodeError, TypeError, ValueError) and on success pass the parsed object to run_tool; on failure log/warn about the parse error and fall back to a safe default (e.g., {} or the raw string wrapped appropriately) so no arbitrary code is executed.

coderabbitai · 2025-09-14T05:34:18Z

+                    all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call></tool_response>"
                    tool_messages.append({"role": "tool", "name": call.function.name, "content": []})
                    for content in result.content:
                        tool_message = self.mcp_client.convert_result_to_openai_format(content)
+                        for content in tool_message:
+                            if content["type"] == "image_url":
+                                all_response += "<image_url>"
+                            elif content["type"] == "text":
+                                all_response += content["text"]
                        tool_messages[-1]["content"].extend(tool_message)
+                    all_response += "</tool_response>"


⚠️ Potential issue

Fix malformed tool_response markup and include tool_call_id on tool messages

The first line appends a stray closing </tool_response> without an opener (and you close again later).

Add tool_call_id to tool messages to align with OpenAI Chat Completions tool schema.

Minor: avoid reusing the name content inside the inner loop; add audio_url handling.

- all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call></tool_response>" - tool_messages.append({"role": "tool", "name": call.function.name, "content": []}) + all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call><tool_response>" + tool_messages.append( + { + "role": "tool", + "tool_call_id": call.id, + "name": call.function.name, + "content": [], + } + ) for content in result.content: tool_message = self.mcp_client.convert_result_to_openai_format(content) - for content in tool_message: - if content["type"] == "image_url": + for item in tool_message: + if item["type"] == "image_url": all_response += "<image_url>" - elif content["type"] == "text": - all_response += content["text"] - tool_messages[-1]["content"].extend(tool_message) + elif item["type"] == "audio_url": + all_response += "<audio_url>" + elif item["type"] == "text": + all_response += item["text"] + tool_messages[-1]["content"].extend(tool_message) all_response += "</tool_response>"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call></tool_response>"

tool_messages.append({"role": "tool", "name": call.function.name, "content": []})

for content in result.content:

tool_message = self.mcp_client.convert_result_to_openai_format(content)

for content in tool_message:

if content["type"] == "image_url":

all_response += "<image_url>"

elif content["type"] == "text":

all_response += content["text"]

tool_messages[-1]["content"].extend(tool_message)

all_response += "</tool_response>"

all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call><tool_response>"

tool_messages.append(

{

"role": "tool",

"tool_call_id": call.id,

"name": call.function.name,

"content": [],

}

)

for content in result.content:

tool_message = self.mcp_client.convert_result_to_openai_format(content)

for item in tool_message:

if item["type"] == "image_url":

all_response += "<image_url>"

elif item["type"] == "audio_url":

all_response += "<audio_url>"

elif item["type"] == "text":

all_response += item["text"]

tool_messages[-1]["content"].extend(tool_message)

all_response += "</tool_response>"

🤖 Prompt for AI Agents

In lmms_eval/models/chat/async_openai.py around lines 187 to 197, the code emits a stray closing </tool_response> before the opening tag, doesn't include tool_call_id on tool messages, reuses the variable name "content" in nested loops, and lacks audio_url handling; fix by removing the premature </tool_response>, prepend a proper <tool_response> (including tool_call id and call.function.name/arguments) before writing the tool call tag, add "tool_call_id": call.id to the tool_messages entries, rename the inner loop variable to avoid shadowing (e.g., result_item), add handling for content types "image_url" and "audio_url" (append appropriate markers and include them in tool_messages[-1]["content"]), and ensure the final closing </tool_response> remains only at the end.

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)

lmms_eval/tasks/charades_sta/eval_tvg.py (4)

43-49: Tighten the primary time-range regex and remove literal en dash to satisfy Ruff.

The current pattern allows multiple dots (e.g., "3....5") and includes a literal EN DASH, which Ruff flags (RUF001). Use an explicit char class with U+2013 and a precise decimal pattern.

Apply this diff:

-    patterns = [r"(\d+\.*\d*)\s*[–-]\s*(\d+\.*\d*)"]
+    # Allow hyphen-minus and EN DASH (U+2013) without embedding the Unicode literal
+    patterns = [r"(\d+(?:\.\d+)?)\s*[\u2013-]\s*(\d+(?:\.\d+)?)"]

70-86: HH:MM:SS parsing is broken (tuple from nested groups; .split on tuple).

re.findall returns tuples due to the inner capture, resulting in runtime errors when calling t.count(":") and t.split(":"). Also, the code never processes multiple matches per sentence.

Apply this diff to fix parsing and support both MM:SS and HH:MM:SS:

-        times = []
-        time_regex = re.compile(r"\b((\d{1,2}:\d{2}:\d{2}))\b")  # time formats (e.g., 18:00, 00:18:05)
-        for sentence in candidates:
-            time = re.findall(time_regex, sentence)
-            if time:
-                t = time[0]
-            else:
-                continue
-            # If time is in HH:MM:SS format, convert to seconds
-            if t.count(":") == 2:
-                h, m, s = map(int, t.split(":"))
-                time_in_sec = h * 3600 + m * 60 + s
-            elif t.count(":") == 1:
-                m, s = map(int, t.split(":"))
-                time_in_sec = m * 60 + s
-            times.append(time_in_sec)
+        times = []
+        # Support HH:MM:SS or MM:SS without nested capture groups
+        time_regex = re.compile(r"\b(?:\d{1,2}:)?\d{2}:\d{2}\b")
+        for sentence in candidates:
+            matches = time_regex.findall(sentence)
+            for t in matches:
+                parts = list(map(int, t.split(":")))
+                if len(parts) == 3:
+                    h, m, s = parts
+                    time_in_sec = h * 3600 + m * 60 + s
+                else:
+                    m, s = parts
+                    time_in_sec = m * 60 + s
+                times.append(time_in_sec)

130-130: Avoid eval on untrusted input; use ast.literal_eval.

Executing eval(gt) is a security risk and unnecessary.

Apply this diff:

-        gt = eval(gt)
+        gt = ast.literal_eval(gt)

Additionally, add this import at the top of the file:

+import ast

13-17: Add type hints and minimal docstrings to public functions.

Required by the repo guidelines; improves readability and Pyright coverage.

Apply this diff:

-def read_json(path):
+def read_json(path: str) -> dict:
+    """Read and parse a JSON file from path."""
     with open(path, "r") as fin:
         datas = json.load(fin)
     return datas
@@
-def write_json(path, data):
+def write_json(path: str, data: Any) -> None:
+    """Write data as JSON to path."""
     with open(path, "w") as fout:
         json.dump(data, fout)
-    print("The format file has been saved at:{}".format(path))
+    print(f"The format file has been saved at: {path}")
     return
@@
-def extract_time(paragraph):
+def extract_time(paragraph: str) -> list[list[float]]:
+    """Extract a single [start, end] interval (in seconds) from free text."""
@@
-def iou(A, B):
+def iou(a: Sequence[float], b: Sequence[float]) -> float:
+    """IoU of two 1D segments [start, end]."""
-    max0 = max((A[0]), (B[0]))
-    min0 = min((A[0]), (B[0]))
-    max1 = max((A[1]), (B[1]))
-    min1 = min((A[1]), (B[1]))
+    max0 = max(a[0], b[0])
+    min0 = min(a[0], b[0])
+    max1 = max(a[1], b[1])
+    min1 = min(a[1], b[1])
     return max(min1 - max0, 0) / (max1 - min0)

Also applies to: 19-24, 26-26, 108-114

🧹 Nitpick comments (3)

lmms_eval/tasks/charades_sta/eval_tvg.py (3)
28-28: Don’t blindly replace “to”; use a word-boundary substitution.

.replace("to", "-") will mangle words like “tomorrow” → “-morrow”.

Apply this diff:
-    paragraph = paragraph.lower().replace(prompt, "").replace("to", "-")
+    paragraph = paragraph.lower().replace(prompt, "")
+    paragraph = re.sub(r"\bto\b", "-", paragraph)
1-10: Trim unused imports and add typing imports per guidelines.

os, pdb, numpy, deepcopy, and Path are unused. Ruff will flag these, and guidelines require type hints.

Apply this diff:
-import argparse
-import json
-import os
-import pdb
-import re
-from copy import deepcopy
-from pathlib import Path
-
-import numpy as np
+import argparse
+import json
+import re
+from typing import Any, Sequence
145-145: Wrap the long print to respect the 88-char limit and prefer f-strings.

Apply this diff:
-    print("IOU 0.3: {0}\nIOU 0.5: {1}\nIOU 0.7: {2}\nmIOU".format(Result[0.3] * 100 / num, Result[0.5] * 100 / num, Result[0.7] * 100 / num), sum(ious) * 100 / num)
+    print(
+        f"IOU 0.3: {Result[0.3] * 100 / num}\n"
+        f"IOU 0.5: {Result[0.5] * 100 / num}\n"
+        f"IOU 0.7: {Result[0.7] * 100 / num}\n"
+        f"mIOU {sum(ious) * 100 / num}"
+    )

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 57055c8 and 5e7f19d.

📒 Files selected for processing (1)

lmms_eval/tasks/charades_sta/eval_tvg.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly

Files:

lmms_eval/tasks/charades_sta/eval_tvg.py

🪛 Ruff (0.12.2)

lmms_eval/tasks/charades_sta/eval_tvg.py

43-43: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)

91-91: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)

92-92: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)

coderabbitai · 2025-09-15T08:31:39Z

+    # Fallback: if no timestamps found, search for any two number patterns with dash
+    if len(timestamps) == 0:
+        # More comprehensive pattern to match various formats like:
+        # xx - xx, x.xx s - x.xx s, x.xxs - x.xxs, etc.
+        # Also handle en dash (–) and regular dash (-)
+        fallback_pattern = r"(\d+(?:\.\d+)?)\s*s?\s*[–-]\s*(\d+(?:\.\d+)?)\s*s?"
+        fallback_matches = re.findall(fallback_pattern, paragraph)
+        if fallback_matches:
+            timestamps = [[float(start), float(end)] for start, end in fallback_matches]
+


⚠️ Potential issue

Fallback regex: remove literal en dash; reduce false positives; handle units.

Replace the literal EN DASH with \u2013 to avoid Ruff RUF001/RUF003.

Limit search scope to keyword-filtered candidates to avoid catching unrelated numeric ranges in the whole paragraph.

Optionally accept "s/sec/seconds".

Apply this diff:

- if len(timestamps) == 0: - # More comprehensive pattern to match various formats like: - # xx - xx, x.xx s - x.xx s, x.xxs - x.xxs, etc. - # Also handle en dash (–) and regular dash (-) - fallback_pattern = r"(\d+(?:\.\d+)?)\s*s?\s*[–-]\s*(\d+(?:\.\d+)?)\s*s?" - fallback_matches = re.findall(fallback_pattern, paragraph) + if len(timestamps) == 0: + # More comprehensive pattern to match various formats like: + # xx - xx, x.xx s - x.xx s, x.xxs - x.xxs, etc. + # Also handle hyphen-minus (-) and EN DASH (U+2013) + fallback_pattern = ( + r"(\d+(?:\.\d+)?)\s*(?:s|sec|secs|second|seconds)?\s*" + r"[\u2013-]\s*" + r"(\d+(?:\.\d+)?)\s*(?:s|sec|secs|second|seconds)?\b" + ) + scope = ' '.join(candidates) if candidates else paragraph + fallback_matches = re.findall(fallback_pattern, scope)

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools

🪛 Ruff (0.12.2)

91-91: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF003)

92-92: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?

(RUF001)

* Add longvila * Allow long vila pass in video kwargs * Add lvbench * Fix format better * Better tvg eval * resolve model name

kcz358 added 2 commits September 8, 2025 07:12

Add longvila

01d24b9

Allow long vila pass in video kwargs

40bfd79

Add lvbench

1e5a57f

coderabbitai Bot reviewed Sep 11, 2025

View reviewed changes

Fix format better

57055c8

coderabbitai Bot reviewed Sep 14, 2025

View reviewed changes

Better tvg eval

5e7f19d

coderabbitai Bot reviewed Sep 15, 2025

View reviewed changes

resolve model name

e324e19

kcz358 merged commit e8b44ff into main Sep 17, 2025
3 checks passed

kcz358 deleted the feat/longvila_r1 branch September 17, 2025 07:11

Luodian pushed a commit that referenced this pull request Feb 28, 2026

feat: Add longvila-r1 and benchmarks (#819)

5bc4a76

* Add longvila * Allow long vila pass in video kwargs * Add lvbench * Fix format better * Better tvg eval * resolve model name

stisiTT pushed a commit to bgoelTT/lmms-eval that referenced this pull request Mar 6, 2026

feat: Add longvila-r1 and benchmarks (EvolvingLMMs-Lab#819)

158bf93

* Add longvila * Allow long vila pass in video kwargs * Add lvbench * Fix format better * Better tvg eval * resolve model name

-                    result = await self.mcp_client.run_tool(call.function.name, eval(call.function.arguments))
+                    raw_args = call.function.arguments
+                    if isinstance(raw_args, str):
+                        try:
+                            args = json.loads(raw_args)
+                        except Exception as e:
+                            eval_logger.warning(f"Failed to parse tool args for {call.function.name}: {e}")
+                            args = {}
+                    elif raw_args is None:
+                        args = {}
+                    else:
+                        args = raw_args
+                    result = await self.mcp_client.run_tool(call.function.name, args)

Conversation

kcz358 commented Sep 11, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

When you open a pull-request, please be sure to include the following

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Sep 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

kcz358 commented Sep 11, 2025

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 11, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 14, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kcz358 commented Sep 11, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Sep 11, 2025 •

edited

Loading