feat: Add longvila-r1 and benchmarks#819
Conversation
WalkthroughAdds a LongVila chat-model wrapper (vLLM + prompt embeddings) and registers it, introduces LVBench task config and utilities for video MC evaluation, tweaks async OpenAI tool output handling, expands model-name parsing, and ignores Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Eval as Evaluator
participant LV as LongVila
participant Enc as Encoder (embed-only)
participant vLLM as vLLM Engine
participant Cache as Cache
Eval->>LV: generate_until(requests)
loop per request
LV->>LV: _to_remote_conversation(request)
LV->>Enc: process media + tokenize conversation
Enc-->>LV: inputs_embeds
end
LV->>Cache: check cached responses
alt cache hit
Cache-->>LV: cached responses
else cache miss
LV->>vLLM: generate(inputs_embeds, SamplingParams)
vLLM-->>LV: outputs
LV->>Cache: store outputs
end
LV-->>Eval: return texts
sequenceDiagram
autonumber
participant Runner as Task Runner
participant DS as LVBench Dataset
participant Utils as lvbench.utils
participant Model as LongVila
participant Metrics as Scorer
Runner->>DS: load sample
Runner->>Utils: lvbench_doc_to_visual / doc_to_text
Utils-->>Runner: video path(s) + prompt
Runner->>Model: generate_until(prompt + media)
Model-->>Runner: prediction
Runner->>Utils: lvbench_process_results(doc, [prediction])
Utils-->>Runner: { lvbench_score: 0/1 }
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
I have the video chunks on my local server. However, I haven't yet find ways to upload to hub... |
There was a problem hiding this comment.
Actionable comments posted: 8
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
lmms_eval/tasks/lvbench/utils.py (1)
59-75: Fix docstring, remove extraneous f-string, add types
- Docstring mentions “videomme”; should be lvbench.
- Remove
fprefix (Ruff F541).- Add type hints.
-def lvbench_process_results(doc, results): - """ - Args: - doc: a instance of the eval dataset - results: [pred] - Returns: - a dictionary with key: metric name (in this case videomme score), value: metric value - """ +from typing import Dict, List +def lvbench_process_results(doc: Dict, results: List[str]) -> Dict[str, bool]: + """ + Args: + doc: one eval dataset instance + results: [pred] + Returns: + {"lvbench_score": bool} + """ pred = results[0] pred_ans = extract_characters_regex(pred) - # gt_ans = doc["answer"].lower().strip().replace(".", "") gt_ans = doc["answer"] score = pred_ans == gt_ans - # return {f"videomme_perception_score": data_dict for metric in matrices} - return {f"lvbench_score": score} + return {"lvbench_score": score}
🧹 Nitpick comments (5)
lmms_eval/tasks/lvbench/utils.py (1)
27-35: Don’t mutate caller-provided dict; set defaults immutably + type hints-from typing import Dict, Optional +from typing import Dict, Optional @@ -def lvbench_doc_to_text(doc, lmms_eval_specific_kwargs=None): - if lmms_eval_specific_kwargs is None: - lmms_eval_specific_kwargs = {} - if "pre_prompt" not in lmms_eval_specific_kwargs: - lmms_eval_specific_kwargs["pre_prompt"] = "" - if "post_prompt" not in lmms_eval_specific_kwargs: - lmms_eval_specific_kwargs["post_prompt"] = "\nAnswer the question with the option letter" - return lmms_eval_specific_kwargs["pre_prompt"] + doc["question"] + lmms_eval_specific_kwargs["post_prompt"] +def lvbench_doc_to_text( + doc: Dict, lmms_eval_specific_kwargs: Optional[Dict[str, str]] = None +) -> str: + cfg = {"pre_prompt": "", "post_prompt": "\nAnswer the question with the option letter"} + if lmms_eval_specific_kwargs: + cfg.update(lmms_eval_specific_kwargs) + return f'{cfg["pre_prompt"]}{doc["question"]}{cfg["post_prompt"]}'lmms_eval/models/chat/longvila.py (4)
21-21: Unused constant
WORKERSisn’t used. Remove or use to control batching/threads.-WORKERS = int(os.getenv("WORKERS", "32")) +# WORKERS env reserved for future parallelism tuning
77-96: Add return type hints and docstringPublic helper should be typed and documented per guidelines.
- def _to_remote_conversation(self, chat_messages: ChatMessages) -> list: + def _to_remote_conversation(self, chat_messages: ChatMessages) -> list: """ Convert ChatMessages to LongVILA remote_code conversation format. [{"from": "human"|"gpt", "value": [str | {"path": media_path}, ...]}, ...] """If you’re open, I can expand types to
List[Dict[str, List[Union[str, Dict[str, str]]]]].
165-175: Cache write error handlingIf a generation fails for one item, caching the rest silently could mask issues. Consider try/except around per-item cache writes.
- for req, text in zip(batch_requests, response_text): - self.add_request_response_to_cache(req, text) + for req, text in zip(batch_requests, response_text): + try: + self.add_request_response_to_cache(req, text) + except Exception: + pass # best-effort cache
1-15: Clean unused imports and align with guidelinesRemove unused:
sys,ThreadPoolExecutor,Union,log_metrics. Keep line length ≤ 88.-import sys @@ -from concurrent.futures import ThreadPoolExecutor -from typing import List, Optional, Tuple, Union +from typing import List, Optional, Tuple @@ -from lmms_eval.models.model_utils.gen_metrics import log_metrics
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
.gitignore(1 hunks)lmms_eval/models/__init__.py(1 hunks)lmms_eval/models/chat/longvila.py(1 hunks)lmms_eval/tasks/lvbench/lvbench.yaml(1 hunks)lmms_eval/tasks/lvbench/utils.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly
Files:
lmms_eval/tasks/lvbench/utils.pylmms_eval/models/__init__.pylmms_eval/models/chat/longvila.py
**/*.{yml,yaml,json}
📄 CodeRabbit inference engine (CLAUDE.md)
Use Prettier (via pre-commit) to format YAML and JSON files
Files:
lmms_eval/tasks/lvbench/lvbench.yaml
🧬 Code graph analysis (1)
lmms_eval/models/chat/longvila.py (5)
lmms_eval/api/instance.py (1)
Instance(6-29)lmms_eval/api/registry.py (1)
register_model(11-24)lmms_eval/protocol.py (2)
ChatMessages(41-116)extract_media(44-58)lmms_eval/api/task.py (1)
doc_to_messages(1648-1680)lmms_eval/api/model.py (3)
load_cache(107-112)get_response_from_cache(208-223)add_request_response_to_cache(201-206)
🪛 Ruff (0.12.2)
lmms_eval/tasks/lvbench/utils.py
12-12: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
74-74: f-string without any placeholders
Remove extraneous f prefix
(F541)
lmms_eval/models/chat/longvila.py
60-60: Do not catch blind exception: Exception
(BLE001)
61-61: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
61-61: Avoid specifying long messages outside the exception class
(TRY003)
179-179: Unused method argument: requests
(ARG002)
181-181: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
🔇 Additional comments (5)
.gitignore (1)
53-53: Ignoringremote_code/*may hide required runtime codeLongVILA tries to import
remote_code.*at runtime. If you plan to placeremote_code/under the repo root during local runs, ignoring it means devs may forget to include it and hitImportError. Prefer loadingremote_codefrom the model folder (see comments inlongvila.py) and keep the repo clean, or add a README note about whereremote_code/must live at runtime.Would you like me to add a short README section clarifying where
remote_code/is expected?lmms_eval/tasks/lvbench/lvbench.yaml (1)
1-29: Config reads clean; consider pinning dataset revision for reproducibilityLGTM overall. To ensure reproducible caching, consider adding a
dataset_revision(commit hash or tag) and double-check whether usingtrainastest_splitis intentional.Would you like me to add a
dataset_revision:field and switchtest_splitif needed?lmms_eval/models/__init__.py (1)
91-92: Model registry mapping LGTM — runtime import blocked by missing dependency"longvila": "LongVila" is correct; LongVila is defined at lmms_eval/models/chat/longvila.py:25.
Dynamic import check failed with ModuleNotFoundError: No module named 'loguru' — install dependencies (pip install -r requirements.txt or pip install loguru) and re-run:python - <<'PY'
from lmms_eval.models import get_model
print(get_model("longvila").name)
PYlmms_eval/models/chat/longvila.py (2)
67-75: Type-safety and config: forwardtrust_remote_code; optional tokenizersIf
AutoModeldoesn’t expose.tokenizer/.vision_tower, this will break. At minimum, passtrust_remote_code=trust_remote_codeand document the requirement.Would you confirm the remote model exposes
tokenizerandvision_tower.image_processor?
140-177: Confirm and guard for vLLM prompt_embeds; add fallback to token-ID flowvLLM accepts prompt embeddings via the "prompt_embeds" parameter (offline: pass a torch tensor as vllm.inputs.EmbedsPrompt; online/OpenAI-compatible server: base64-encoded tensor and the server must be started with engine=v0 and --enable-prompt-embeds). Prompt-embeds are supported only in the v0 engine.
In lmms_eval/models/chat/longvila.py (lines 140–177) wrap the generate call with a capability check or try/except; if prompt_embeds is unsupported, fall back to constructing token-ID prompts (or raise a clear error instructing operators to enable --enable-prompt-embeds).
| try: | ||
| from vllm import LLM, SamplingParams | ||
| except ImportError: | ||
| vllm = None | ||
|
|
There was a problem hiding this comment.
🛠️ Refactor suggestion
vLLM import fallback can crash later; fail fast with actionable error
If vllm isn’t installed, SamplingParams is undefined but later used. Guard and raise with guidance.
-try:
- from vllm import LLM, SamplingParams
-except ImportError:
- vllm = None
+try:
+ from vllm import LLM, SamplingParams
+except ImportError as _vllm_err:
+ LLM = None # type: ignore[assignment]
+ SamplingParams = None # type: ignore[assignment]
+ _VLLM_IMPORT_ERROR = _vllm_errAnd in __init__:
def __init__(
@@
):
+ if LLM is None or SamplingParams is None:
+ raise ImportError(
+ "vLLM is required for LongVila. Install with `pip install vllm`."
+ ) from _VLLM_IMPORT_ERROR📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try: | |
| from vllm import LLM, SamplingParams | |
| except ImportError: | |
| vllm = None | |
| try: | |
| from vllm import LLM, SamplingParams | |
| except ImportError as _vllm_err: | |
| LLM = None # type: ignore[assignment] | |
| SamplingParams = None # type: ignore[assignment] | |
| _VLLM_IMPORT_ERROR = _vllm_err |
🤖 Prompt for AI Agents
In lmms_eval/models/chat/longvila.py around lines 16-20, the vllm import
fallback leaves SamplingParams undefined which will cause a crash later; modify
the import block to fail fast by raising an ImportError (or RuntimeError) with
an actionable message instructing to install vllm (and optionally provide pip
install instructions and mention SamplingParams/LLM are required), and update
the class __init__ to check that vllm and SamplingParams are available before
using them, raising the same clear error if they are missing so callers get
immediate, actionable feedback instead of a later NameError.
| def __init__( | ||
| self, | ||
| model="Efficient-Large-Model/LongVILA-R1-7B", | ||
| tensor_parallel_size=1, | ||
| data_parallel_size=1, | ||
| gpu_memory_utilization=0.5, | ||
| batch_size=1, | ||
| max_frame_num=32, | ||
| trust_remote_code=True, | ||
| chat_template=None, | ||
| max_pixels: int = 1605632, | ||
| min_image_pixels=28, | ||
| fps: Optional[int] = None, | ||
| device_map: Optional[str] = "cuda", | ||
| **kwargs, | ||
| ): | ||
| # vLLM requires the path to the autoregressive llm weights under the model root | ||
| model_root = model | ||
| llm_path = os.path.join(model_root, "llm") | ||
| # Enable prompt embeddings so we can pass encoder-produced embeddings directly |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Validate model paths and pass-through trust flag
Check that llm_path exists to fail early with a clear message; also forward trust_remote_code consistently.
model_root = model
llm_path = os.path.join(model_root, "llm")
+ if not os.path.isdir(llm_path):
+ raise FileNotFoundError(
+ f"Expected autoregressive LLM under '{llm_path}'. "
+ "Verify your LongVILA model layout."
+ )
@@
- kwargs["enable_prompt_embeds"] = True
+ kwargs["enable_prompt_embeds"] = True
@@
- super().__init__(llm_path, tensor_parallel_size, data_parallel_size, gpu_memory_utilization, batch_size, max_frame_num, trust_remote_code, chat_template, min_image_pixels, **kwargs)
+ super().__init__(
+ llm_path,
+ tensor_parallel_size,
+ data_parallel_size,
+ gpu_memory_utilization,
+ batch_size,
+ max_frame_num,
+ trust_remote_code,
+ chat_template,
+ min_image_pixels,
+ **kwargs,
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def __init__( | |
| self, | |
| model="Efficient-Large-Model/LongVILA-R1-7B", | |
| tensor_parallel_size=1, | |
| data_parallel_size=1, | |
| gpu_memory_utilization=0.5, | |
| batch_size=1, | |
| max_frame_num=32, | |
| trust_remote_code=True, | |
| chat_template=None, | |
| max_pixels: int = 1605632, | |
| min_image_pixels=28, | |
| fps: Optional[int] = None, | |
| device_map: Optional[str] = "cuda", | |
| **kwargs, | |
| ): | |
| # vLLM requires the path to the autoregressive llm weights under the model root | |
| model_root = model | |
| llm_path = os.path.join(model_root, "llm") | |
| # Enable prompt embeddings so we can pass encoder-produced embeddings directly | |
| def __init__( | |
| self, | |
| model="Efficient-Large-Model/LongVILA-R1-7B", | |
| tensor_parallel_size=1, | |
| data_parallel_size=1, | |
| gpu_memory_utilization=0.5, | |
| batch_size=1, | |
| max_frame_num=32, | |
| trust_remote_code=True, | |
| chat_template=None, | |
| max_pixels: int = 1605632, | |
| min_image_pixels=28, | |
| fps: Optional[int] = None, | |
| device_map: Optional[str] = "cuda", | |
| **kwargs, | |
| ): | |
| # vLLM requires the path to the autoregressive llm weights under the model root | |
| model_root = model | |
| llm_path = os.path.join(model_root, "llm") | |
| if not os.path.isdir(llm_path): | |
| raise FileNotFoundError( | |
| f"Expected autoregressive LLM under '{llm_path}'. " | |
| "Verify your LongVILA model layout." | |
| ) | |
| # Enable prompt embeddings so we can pass encoder-produced embeddings directly | |
| kwargs["enable_prompt_embeds"] = True | |
| super().__init__( | |
| llm_path, | |
| tensor_parallel_size, | |
| data_parallel_size, | |
| gpu_memory_utilization, | |
| batch_size, | |
| max_frame_num, | |
| trust_remote_code, | |
| chat_template, | |
| min_image_pixels, | |
| **kwargs, | |
| ) |
🤖 Prompt for AI Agents
In lmms_eval/models/chat/longvila.py around lines 28 to 47, add an explicit
existence check for the computed llm_path and raise a clear FileNotFoundError if
it doesn't exist so the class fails fast with a helpful message; also ensure the
incoming trust_remote_code flag is stored on self (e.g., self.trust_remote_code
= trust_remote_code) and passed through to any downstream model-loading calls
that accept trust_remote_code so the same behavior is consistently applied.
| # Set up imports from the model's remote_code directory | ||
| # The LongVILA repo provides preprocessing utilities we must call directly | ||
| try: | ||
| from remote_code.media import extract_media as _extract_media | ||
| from remote_code.mm_utils import process_images as _process_images | ||
| from remote_code.tokenizer_utils import ( | ||
| tokenize_conversation as _tokenize_conversation, | ||
| ) | ||
| except Exception as e: | ||
| raise ImportError(f"Failed to import LongVILA remote_code utilities from '{model_root}'. Ensure the model path contains remote_code. Original error: {e}") | ||
|
|
||
| self.extract_media = _extract_media | ||
| self.process_images = _process_images | ||
| self.tokenize_conversation = _tokenize_conversation | ||
|
|
There was a problem hiding this comment.
remote_code import path will fail unless it’s on sys.path
The code imports remote_code.* without adding model_root to sys.path. Load explicitly from the model folder to avoid relying on a repo-level remote_code/.
- try:
- from remote_code.media import extract_media as _extract_media
- from remote_code.mm_utils import process_images as _process_images
- from remote_code.tokenizer_utils import (
- tokenize_conversation as _tokenize_conversation,
- )
- except Exception as e:
- raise ImportError(f"Failed to import LongVILA remote_code utilities from '{model_root}'. Ensure the model path contains remote_code. Original error: {e}")
+ try:
+ remote_code_dir = os.path.join(model_root)
+ if remote_code_dir not in sys.path:
+ sys.path.insert(0, remote_code_dir)
+ from remote_code.media import extract_media as _extract_media # type: ignore
+ from remote_code.mm_utils import process_images as _process_images # type: ignore
+ from remote_code.tokenizer_utils import ( # type: ignore
+ tokenize_conversation as _tokenize_conversation,
+ )
+ except Exception as err:
+ raise ImportError(
+ f"Failed to import LongVILA remote_code from '{model_root}'. "
+ "Ensure the model path contains a 'remote_code/' package."
+ ) from err📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| # Set up imports from the model's remote_code directory | |
| # The LongVILA repo provides preprocessing utilities we must call directly | |
| try: | |
| from remote_code.media import extract_media as _extract_media | |
| from remote_code.mm_utils import process_images as _process_images | |
| from remote_code.tokenizer_utils import ( | |
| tokenize_conversation as _tokenize_conversation, | |
| ) | |
| except Exception as e: | |
| raise ImportError(f"Failed to import LongVILA remote_code utilities from '{model_root}'. Ensure the model path contains remote_code. Original error: {e}") | |
| self.extract_media = _extract_media | |
| self.process_images = _process_images | |
| self.tokenize_conversation = _tokenize_conversation | |
| # Set up imports from the model's remote_code directory | |
| # The LongVILA repo provides preprocessing utilities we must call directly | |
| try: | |
| remote_code_dir = os.path.join(model_root) | |
| if remote_code_dir not in sys.path: | |
| sys.path.insert(0, remote_code_dir) | |
| from remote_code.media import extract_media as _extract_media # type: ignore | |
| from remote_code.mm_utils import process_images as _process_images # type: ignore | |
| from remote_code.tokenizer_utils import ( # type: ignore | |
| tokenize_conversation as _tokenize_conversation, | |
| ) | |
| except Exception as err: | |
| raise ImportError( | |
| f"Failed to import LongVILA remote_code from '{model_root}'. " | |
| "Ensure the model path contains a 'remote_code/' package." | |
| ) from err | |
| self.extract_media = _extract_media | |
| self.process_images = _process_images | |
| self.tokenize_conversation = _tokenize_conversation |
🧰 Tools
🪛 Ruff (0.12.2)
60-60: Do not catch blind exception: Exception
(BLE001)
61-61: Within an except clause, raise exceptions with raise ... from err or raise ... from None to distinguish them from errors in exception handling
(B904)
61-61: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In lmms_eval/models/chat/longvila.py around lines 52 to 66, the imports from
remote_code will fail unless model_root is on sys.path; prepend the absolute
model_root to sys.path before the try block (and optionally pop it afterward) so
Python can import remote_code from the model folder, ensure model_root exists
and is resolved with os.path.abspath, and keep the existing try/except and
attribute assignments intact.
| def make_one_request(self, request: Instance) -> Tuple["object", dict]: | ||
| """ | ||
| Build prompt embeddings and per-request sampling params from an Instance. | ||
| Returns (inputs_embeds, params_dict). Does not mutate input. | ||
| """ | ||
| ctx, doc_to_messages, gen_kwargs, doc_id, task, split = request.arguments | ||
| raw_messages = doc_to_messages(self.task_dict[task][split][doc_id]) | ||
| chat_messages = ChatMessages(messages=raw_messages) | ||
|
|
||
| # Copy to avoid side-effects across threads | ||
| _gen = dict(gen_kwargs or {}) | ||
| _gen.setdefault("max_new_tokens", 4096) | ||
| _gen.setdefault("temperature", 0) | ||
| _gen.setdefault("top_p", 0.95) | ||
|
|
||
| params = { | ||
| "temperature": _gen["temperature"], | ||
| "max_tokens": _gen["max_new_tokens"], | ||
| "top_p": _gen["top_p"], | ||
| } | ||
|
|
||
| # Convert to LongVILA remote_code conversation format | ||
| conversation = self._to_remote_conversation(chat_messages) | ||
|
|
||
| # Extract and preprocess media | ||
| if self.fps: | ||
| self.model_encoder.config.fps = self.fps | ||
| else: | ||
| self.model_encoder.config.num_video_frames = self.max_frame_num | ||
| self.model_encoder.config.fps = 0 | ||
| media = self.extract_media(conversation, self.model_encoder.config) | ||
| if "video" in media and media["video"] is not None: | ||
| media["video"] = [self.process_images(images, self.model_encoder.vision_tower.image_processor, self.model_encoder.config).half() for images in media["video"]] | ||
|
|
||
| # Tokenize conversation and move to CUDA for embedding | ||
| input_ids = self.tokenize_conversation(conversation, self.model_encoder.tokenizer, add_generation_prompt=True).unsqueeze(0).cuda() | ||
|
|
||
| # Create prompt embeddings using the model encoder | ||
| inputs_embeds, _, _ = self.model_encoder._embed(input_ids, media, {"video": {}}, None, None) | ||
|
|
||
| return inputs_embeds, params | ||
|
|
There was a problem hiding this comment.
Device placement: avoid unconditional .cuda()
Calling .cuda() breaks on CPU-only setups or non-CUDA devices. Derive device from the encoder.
- input_ids = self.tokenize_conversation(conversation, self.model_encoder.tokenizer, add_generation_prompt=True).unsqueeze(0).cuda()
+ input_ids = (
+ self.tokenize_conversation(
+ conversation,
+ self.model_encoder.tokenizer,
+ add_generation_prompt=True,
+ )
+ .unsqueeze(0)
+ .to(next(self.model_encoder.parameters()).device)
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def make_one_request(self, request: Instance) -> Tuple["object", dict]: | |
| """ | |
| Build prompt embeddings and per-request sampling params from an Instance. | |
| Returns (inputs_embeds, params_dict). Does not mutate input. | |
| """ | |
| ctx, doc_to_messages, gen_kwargs, doc_id, task, split = request.arguments | |
| raw_messages = doc_to_messages(self.task_dict[task][split][doc_id]) | |
| chat_messages = ChatMessages(messages=raw_messages) | |
| # Copy to avoid side-effects across threads | |
| _gen = dict(gen_kwargs or {}) | |
| _gen.setdefault("max_new_tokens", 4096) | |
| _gen.setdefault("temperature", 0) | |
| _gen.setdefault("top_p", 0.95) | |
| params = { | |
| "temperature": _gen["temperature"], | |
| "max_tokens": _gen["max_new_tokens"], | |
| "top_p": _gen["top_p"], | |
| } | |
| # Convert to LongVILA remote_code conversation format | |
| conversation = self._to_remote_conversation(chat_messages) | |
| # Extract and preprocess media | |
| if self.fps: | |
| self.model_encoder.config.fps = self.fps | |
| else: | |
| self.model_encoder.config.num_video_frames = self.max_frame_num | |
| self.model_encoder.config.fps = 0 | |
| media = self.extract_media(conversation, self.model_encoder.config) | |
| if "video" in media and media["video"] is not None: | |
| media["video"] = [self.process_images(images, self.model_encoder.vision_tower.image_processor, self.model_encoder.config).half() for images in media["video"]] | |
| # Tokenize conversation and move to CUDA for embedding | |
| input_ids = self.tokenize_conversation(conversation, self.model_encoder.tokenizer, add_generation_prompt=True).unsqueeze(0).cuda() | |
| # Create prompt embeddings using the model encoder | |
| inputs_embeds, _, _ = self.model_encoder._embed(input_ids, media, {"video": {}}, None, None) | |
| return inputs_embeds, params | |
| def make_one_request(self, request: Instance) -> Tuple["object", dict]: | |
| """ | |
| Build prompt embeddings and per-request sampling params from an Instance. | |
| Returns (inputs_embeds, params_dict). Does not mutate input. | |
| """ | |
| ctx, doc_to_messages, gen_kwargs, doc_id, task, split = request.arguments | |
| raw_messages = doc_to_messages(self.task_dict[task][split][doc_id]) | |
| chat_messages = ChatMessages(messages=raw_messages) | |
| # Copy to avoid side-effects across threads | |
| _gen = dict(gen_kwargs or {}) | |
| _gen.setdefault("max_new_tokens", 4096) | |
| _gen.setdefault("temperature", 0) | |
| _gen.setdefault("top_p", 0.95) | |
| params = { | |
| "temperature": _gen["temperature"], | |
| "max_tokens": _gen["max_new_tokens"], | |
| "top_p": _gen["top_p"], | |
| } | |
| # Convert to LongVILA remote_code conversation format | |
| conversation = self._to_remote_conversation(chat_messages) | |
| # Extract and preprocess media | |
| if self.fps: | |
| self.model_encoder.config.fps = self.fps | |
| else: | |
| self.model_encoder.config.num_video_frames = self.max_frame_num | |
| self.model_encoder.config.fps = 0 | |
| media = self.extract_media(conversation, self.model_encoder.config) | |
| if "video" in media and media["video"] is not None: | |
| media["video"] = [self.process_images(images, self.model_encoder.vision_tower.image_processor, self.model_encoder.config).half() for images in media["video"]] | |
| # Tokenize conversation and move to CUDA for embedding | |
| input_ids = ( | |
| self.tokenize_conversation( | |
| conversation, | |
| self.model_encoder.tokenizer, | |
| add_generation_prompt=True, | |
| ) | |
| .unsqueeze(0) | |
| .to(next(self.model_encoder.parameters()).device) | |
| ) | |
| # Create prompt embeddings using the model encoder | |
| inputs_embeds, _, _ = self.model_encoder._embed(input_ids, media, {"video": {}}, None, None) | |
| return inputs_embeds, params |
| def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: | ||
| # TODO | ||
| assert False, "GPT4V not support" | ||
|
|
||
| def generate_until_multi_round(self, requests) -> List[str]: | ||
| raise NotImplementedError("TODO: Implement multi-round generation") |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Replace assert False and remove unused arg warning
Use a proper exception and underscore the unused parameter.
- def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
- # TODO
- assert False, "GPT4V not support"
+ def loglikelihood(self, _requests: List[Instance]) -> List[Tuple[float, bool]]: # noqa: ARG002
+ raise NotImplementedError("LongVila does not support loglikelihood.")📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]: | |
| # TODO | |
| assert False, "GPT4V not support" | |
| def generate_until_multi_round(self, requests) -> List[str]: | |
| raise NotImplementedError("TODO: Implement multi-round generation") | |
| def loglikelihood(self, _requests: List[Instance]) -> List[Tuple[float, bool]]: # noqa: ARG002 | |
| raise NotImplementedError("LongVila does not support loglikelihood.") | |
| def generate_until_multi_round(self, requests) -> List[str]: | |
| raise NotImplementedError("TODO: Implement multi-round generation") |
🧰 Tools
🪛 Ruff (0.12.2)
179-179: Unused method argument: requests
(ARG002)
181-181: Do not assert False (python -O removes these calls), raise AssertionError()
Replace assert False
(B011)
🤖 Prompt for AI Agents
In lmms_eval/models/chat/longvila.py around lines 179 to 184, replace the assert
False with a proper exception and silence the unused-parameter warning by
renaming the argument to _requests: implement loglikelihood to raise
NotImplementedError("GPT4V not supported") instead of asserting, and change
generate_until_multi_round(self, requests) to generate_until_multi_round(self,
_requests) (or similarly prefix the parameter with an underscore) and keep its
NotImplementedError; this removes the assert and the unused-arg warning while
clearly signaling unimplemented functionality.
| hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/") | ||
| base_cache_dir = os.path.expanduser(hf_home) | ||
| with open(Path(__file__).parent / "lvbench.yaml", "r") as f: | ||
| raw_data = f.readlines() | ||
| safe_data = [] | ||
| for i, line in enumerate(raw_data): | ||
| # remove function definition since yaml load cannot handle it | ||
| if "!function" not in line: | ||
| safe_data.append(line) | ||
| cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"] |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Harden YAML parsing and fix unused loop var
- Rename unused
ito_i(Ruff B007). - Add defensive checks around YAML keys to avoid
KeyErrorif the schema changes.
Apply:
-raw_data = f.readlines()
-safe_data = []
-for i, line in enumerate(raw_data):
+raw_data = f.readlines()
+safe_data = []
+for _i, line in enumerate(raw_data):
# remove function definition since yaml load cannot handle it
if "!function" not in line:
safe_data.append(line)
-cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]
+data = yaml.safe_load("".join(safe_data)) or {}
+try:
+ cache_name = data["dataset_kwargs"]["cache_dir"]
+except KeyError as exc:
+ raise KeyError("Expected 'dataset_kwargs.cache_dir' in lvbench.yaml") from exc📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/") | |
| base_cache_dir = os.path.expanduser(hf_home) | |
| with open(Path(__file__).parent / "lvbench.yaml", "r") as f: | |
| raw_data = f.readlines() | |
| safe_data = [] | |
| for i, line in enumerate(raw_data): | |
| # remove function definition since yaml load cannot handle it | |
| if "!function" not in line: | |
| safe_data.append(line) | |
| cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"] | |
| hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/") | |
| base_cache_dir = os.path.expanduser(hf_home) | |
| with open(Path(__file__).parent / "lvbench.yaml", "r") as f: | |
| raw_data = f.readlines() | |
| safe_data = [] | |
| for _i, line in enumerate(raw_data): | |
| # remove function definition since yaml load cannot handle it | |
| if "!function" not in line: | |
| safe_data.append(line) | |
| data = yaml.safe_load("".join(safe_data)) or {} | |
| try: | |
| cache_name = data["dataset_kwargs"]["cache_dir"] | |
| except KeyError as exc: | |
| raise KeyError("Expected 'dataset_kwargs.cache_dir' in lvbench.yaml") from exc |
🧰 Tools
🪛 Ruff (0.12.2)
12-12: Loop control variable i not used within loop body
Rename unused i to _i
(B007)
🤖 Prompt for AI Agents
In lmms_eval/tasks/lvbench/utils.py around lines 7 to 16, the loop uses an
unused variable `i` and the YAML access assumes keys exist which can raise
KeyError; rename `i` to `_i` to satisfy Ruff B007 and add defensive checks when
loading the YAML (e.g., check that the top-level mapping exists and contains
"dataset_kwargs" and "cache_dir" keys, raising a clear error or falling back to
a default) before assigning cache_name so the code won't crash if the schema
changes.
| def lvbench_doc_to_visual(doc): | ||
| cache_dir = os.path.join(base_cache_dir, cache_name) | ||
| video_path = doc["video_path"] | ||
| assert os.path.exists(os.path.join(cache_dir, video_path)) | ||
| video_path = os.path.join(cache_dir, video_path) | ||
| return [video_path] |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Avoid asserts for I/O; add types and clearer errors
Use Path checks and raise FileNotFoundError. Add type hints.
-def lvbench_doc_to_visual(doc):
- cache_dir = os.path.join(base_cache_dir, cache_name)
- video_path = doc["video_path"]
- assert os.path.exists(os.path.join(cache_dir, video_path))
- video_path = os.path.join(cache_dir, video_path)
- return [video_path]
+from typing import Dict, List
+
+def lvbench_doc_to_visual(doc: Dict) -> List[str]:
+ cache_dir = Path(base_cache_dir) / cache_name
+ video_path = Path(doc["video_path"])
+ abs_path = video_path if video_path.is_absolute() else (cache_dir / video_path)
+ if not abs_path.exists():
+ raise FileNotFoundError(f"Video not found: {abs_path}")
+ return [str(abs_path)]Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In lmms_eval/tasks/lvbench/utils.py around lines 19 to 24, the function uses an
assert for an I/O check and lacks type hints; replace the assert with
pathlib.Path checks and raise FileNotFoundError with a clear message if the file
is missing, add type hints (e.g., doc: Mapping[str, Any] -> List[Path]) on the
function signature, construct video_path as a Path using
base_cache_dir/cache_name and doc["video_path"], and return a list of Path
objects; ensure pathlib.Path is imported and adjust any callers if they expect
strings.
| def extract_characters_regex(s): | ||
| s = s.strip() | ||
| answer_prefixes = [ | ||
| "The best answer is", | ||
| "The correct answer is", | ||
| "The answer is", | ||
| "The answer", | ||
| "The best option is" "The correct option is", | ||
| "Best answer:" "Best option:", | ||
| ] | ||
| for answer_prefix in answer_prefixes: | ||
| s = s.replace(answer_prefix, "") | ||
|
|
||
| if len(s.split()) > 10 and not re.search("[ABCD]", s): | ||
| return "" | ||
|
|
||
| matches = re.search(r"[ABCD]", s) | ||
| if matches is None: | ||
| return "" | ||
| return matches[0] |
There was a problem hiding this comment.
Fix concatenated strings bug and improve extraction robustness
Missing commas are concatenating literals (e.g., "Best answer:" "Best option:"). Also prefer case-insensitive matching and strip punctuation.
-def extract_characters_regex(s):
+from typing import Optional
+def extract_characters_regex(s: str) -> str:
s = s.strip()
answer_prefixes = [
"The best answer is",
"The correct answer is",
"The answer is",
"The answer",
- "The best option is" "The correct option is",
- "Best answer:" "Best option:",
+ "The best option is",
+ "The correct option is",
+ "Best answer:",
+ "Best option:",
]
for answer_prefix in answer_prefixes:
- s = s.replace(answer_prefix, "")
+ s = re.sub(re.escape(answer_prefix), "", s, flags=re.IGNORECASE).strip()
- if len(s.split()) > 10 and not re.search("[ABCD]", s):
+ if len(s.split()) > 10 and not re.search(r"\b[ABCD]\b", s, flags=re.I):
return ""
- matches = re.search(r"[ABCD]", s)
+ matches = re.search(r"\b([ABCD])\b", s, flags=re.I)
if matches is None:
return ""
- return matches[0]
+ return matches.group(1).upper()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def extract_characters_regex(s): | |
| s = s.strip() | |
| answer_prefixes = [ | |
| "The best answer is", | |
| "The correct answer is", | |
| "The answer is", | |
| "The answer", | |
| "The best option is" "The correct option is", | |
| "Best answer:" "Best option:", | |
| ] | |
| for answer_prefix in answer_prefixes: | |
| s = s.replace(answer_prefix, "") | |
| if len(s.split()) > 10 and not re.search("[ABCD]", s): | |
| return "" | |
| matches = re.search(r"[ABCD]", s) | |
| if matches is None: | |
| return "" | |
| return matches[0] | |
| from typing import Optional | |
| def extract_characters_regex(s: str) -> str: | |
| s = s.strip() | |
| answer_prefixes = [ | |
| "The best answer is", | |
| "The correct answer is", | |
| "The answer is", | |
| "The answer", | |
| "The best option is", | |
| "The correct option is", | |
| "Best answer:", | |
| "Best option:", | |
| ] | |
| for answer_prefix in answer_prefixes: | |
| s = re.sub(re.escape(answer_prefix), "", s, flags=re.IGNORECASE).strip() | |
| if len(s.split()) > 10 and not re.search(r"\b[ABCD]\b", s, flags=re.I): | |
| return "" | |
| matches = re.search(r"\b([ABCD])\b", s, flags=re.I) | |
| if matches is None: | |
| return "" | |
| return matches.group(1).upper() |
🤖 Prompt for AI Agents
In lmms_eval/tasks/lvbench/utils.py around lines 37 to 56, the answer_prefix
list has missing commas causing string concatenation and the character
extraction is brittle; fix by inserting the missing commas between list
literals, perform prefix removal and then normalize s (lower/upper consistently)
and strip surrounding punctuation, use a case-insensitive regex (e.g.,
re.search(r"\b[ABCD]\b", s, flags=re.IGNORECASE) or re.search(r"[ABCD]", s,
flags=re.IGNORECASE)) to find the first A/B/C/D, and return the uppercase
character (or "" if none); also keep the existing short-word-length guard but
apply it after normalization.
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
lmms_eval/models/chat/async_openai.py (4)
174-179: Guard against None content from tool_calls responsesWhen finish_reason == "tool_calls", message.content is often None. Current concatenation and message append can raise TypeError or send invalid payload.
- last_response = response.choices[0].message.content - all_response += last_response + last_response = response.choices[0].message.content or "" + all_response += last_response ... - messages.append({"role": "assistant", "content": last_response}) + if last_response: + messages.append({"role": "assistant", "content": last_response})
67-70: Possible deadlock: num_cpus can be 0 on single‑core machinescpu_count() // 2 yields 0 when cpu_count() == 1, causing Semaphore(0) to block forever.
- self.num_cpus = cpu_count() // 2 + self.num_cpus = max(1, cpu_count() // 2)
45-61: Type hints: fix Optional annotations and add return typeDefaults of None require Optional[...] and the async method should declare its return type. This also satisfies Pyright.
- def __init__( + def __init__( self, model_version: str = "grok-2-latest", - base_url: str = None, - api_key: str = None, + base_url: Optional[str] = None, + api_key: Optional[str] = None, timeout: int = 600, max_retries: int = 5, max_size_in_mb: int = 20, - mcp_server_path: str = None, - num_cpus: int = None, - work_dir: str = None, + mcp_server_path: Optional[str] = None, + num_cpus: Optional[int] = None, + work_dir: Optional[str] = None, fps: Optional[int] = None, nframes: Optional[int] = 64, max_pixels: Optional[int] = 151200, min_pixels: Optional[int] = 28 * 28, **kwargs, ) -> None:- async def maybe_forward_with_tool(self, request: Instance, idx: int): + async def maybe_forward_with_tool(self, request: Instance, idx: int) -> Tuple[str, int]:Also applies to: 124-124
212-237: Guard asyncio.run and fix broken wrapperasync_openai.generate_until (lmms_eval/models/chat/async_openai.py:212-237) calls asyncio.run unconditionally — this will raise if invoked on an existing event loop. lmms_eval/models/simple/from_log.py:119 is the only direct caller found, but its wrapper is broken (it does
return generate_until(self, requests)— NameError).
- lmms_eval/models/simple/from_log.py:119 — change
return generate_until(self, requests)toreturn self.generate_until(requests)(critical fix).- lmms_eval/models/chat/async_openai.py:212-237 — do not call
asyncio.rununconditionally; either make the method fully async (and let callers await) or detect a running loop (asyncio.get_running_loop / loop.is_running) and use an in-loop approach (create tasks + gather) as the fallback.
🧹 Nitpick comments (5)
lmms_eval/models/chat/async_openai.py (5)
1-37: Clean up duplicate/unused imports and satisfy Ruff (I001/F401)Removes duplicate lmms import and unused modules.
import asyncio -import base64 import json import os import tempfile -import time import uuid -from io import BytesIO from multiprocessing import cpu_count from typing import List, Optional, Tuple, Union - -import numpy as np -import requests as url_requests from accelerate import Accelerator, DistributedType from tqdm import tqdm @@ -try: - from decord import VideoReader, cpu -except ImportError: - pass +# decord not used here; import where needed to avoid F401 @@ -from PIL import Image +# PIL.Image not used directly in this module; ChatMessages provides PIL images @@ -from lmms_eval.api.model import lmms +from lmms_eval.api.model import lmms # (deduped)
86-89: Adjust assertion message to match allowed DistributedTypesMessage mentions DDP, but code allows FSDP, MULTI_GPU, DEEPSPEED.
- assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP and FSDP are supported." + assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type. Supported: FSDP, MULTI_GPU, DEEPSPEED."
173-206: Resilience: add retries/backoff on API and tool callsmax_retries is unused; transient 5xx/429s will fail the whole run. Consider exponential backoff for self.client.chat.completions.create and mcp_client.run_tool.
I can draft a tenacity-based wrapper if you want.
41-44: Docs: add minimal docstrings for public APIClass and init lack docstrings; guidelines require them.
Happy to propose short docstrings aligned with the registry’s expectations.
141-150: Ensure messages[-1]['content'] exists and is a list before appending media-path hints.to_openai_messages builds each message with "content": [] so appending is safe when messages is non-empty; still guard in lmms_eval/models/chat/async_openai.py around the append to (1) handle an empty messages list (create a final message with 'content': []) and (2) coerce non-list content to a list before appending to avoid IndexError or invalid payloads.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
lmms_eval/loggers/evaluation_tracker.py(1 hunks)lmms_eval/models/chat/async_openai.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly
Files:
lmms_eval/loggers/evaluation_tracker.pylmms_eval/models/chat/async_openai.py
🧬 Code graph analysis (1)
lmms_eval/models/chat/async_openai.py (1)
lmms_eval/mcp/client.py (1)
convert_result_to_openai_format(49-67)
🔇 Additional comments (1)
lmms_eval/loggers/evaluation_tracker.py (1)
71-75: Prefer model identity keys overengine=; sanitize extraction and wrap list.Engine URLs can be logged as the model name when both are present — extract quoted/whitespace values with a regex, prefer overlays (peft/delta) then explicit identity (model_id/model_name) and treat
engine=as a last-resort fallback. File: lmms_eval/loggers/evaluation_tracker.py (around lines 69–75).@@ - def extract_model_name(model_args: str, key: str) -> str: - """Extracts the model name from the model arguments using a key.""" - args_after_key = model_args.split(key)[1] - return args_after_key.split(",")[0] + def extract_model_name(model_args: str, key: str) -> str: + """Extract the value for `key` (supports quoted values) up to the next comma.""" + m = re.search( + rf"{re.escape(key)}\s*(?P<val>(?:'[^']*'|\"[^\"]*\"|[^,]+))", + model_args, + ) + if not m: + return "" + return m.group("val").strip().strip("'\"") @@ - # order does matter, e.g. peft and delta are provided together with pretrained - prefixes = ["peft=", "delta=", "pretrained=", "model=", "path=", "engine=", "model_version=", "model_name=", "model_id="] + # Order matters: + # - Prefer overlays (peft/delta) over base identifiers. + # - Prefer explicit model identity (model_id/name) over generic fields. + # - Treat engine as a last-resort fallback to avoid logging URLs as model names. + prefixes = [ + "peft=", + "delta=", + "pretrained=", + "model_id=", + "model_name=", + "model=", + "path=", + "model_version=", + "engine=", + ]
| @@ -184,11 +184,17 @@ async def maybe_forward_with_tool(self, request: Instance, idx: int): | |||
| for call in message.tool_calls: | |||
| eval_logger.debug(f"Calling {call.function.name}...") | |||
| result = await self.mcp_client.run_tool(call.function.name, eval(call.function.arguments)) | |||
There was a problem hiding this comment.
Blocker: Replace eval(...) of tool arguments with safe JSON parsing
Evaluating LLM-provided strings is an RCE vector. Parse as JSON and fall back safely.
- result = await self.mcp_client.run_tool(call.function.name, eval(call.function.arguments))
+ raw_args = call.function.arguments
+ if isinstance(raw_args, str):
+ try:
+ args = json.loads(raw_args)
+ except Exception as e:
+ eval_logger.warning(f"Failed to parse tool args for {call.function.name}: {e}")
+ args = {}
+ elif raw_args is None:
+ args = {}
+ else:
+ args = raw_args
+ result = await self.mcp_client.run_tool(call.function.name, args)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| result = await self.mcp_client.run_tool(call.function.name, eval(call.function.arguments)) | |
| raw_args = call.function.arguments | |
| if isinstance(raw_args, str): | |
| try: | |
| args = json.loads(raw_args) | |
| except Exception as e: | |
| eval_logger.warning(f"Failed to parse tool args for {call.function.name}: {e}") | |
| args = {} | |
| elif raw_args is None: | |
| args = {} | |
| else: | |
| args = raw_args | |
| result = await self.mcp_client.run_tool(call.function.name, args) |
🧰 Tools
🪛 Ruff (0.12.2)
186-186: Use of possibly insecure function; consider using ast.literal_eval
(S307)
🤖 Prompt for AI Agents
In lmms_eval/models/chat/async_openai.py around line 186, replace the unsafe
eval(call.function.arguments) with safe JSON parsing: import json at the top,
then attempt to parse call.function.arguments via json.loads in a try/except
(catch json.JSONDecodeError, TypeError, ValueError) and on success pass the
parsed object to run_tool; on failure log/warn about the parse error and fall
back to a safe default (e.g., {} or the raw string wrapped appropriately) so no
arbitrary code is executed.
| all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call></tool_response>" | ||
| tool_messages.append({"role": "tool", "name": call.function.name, "content": []}) | ||
| for content in result.content: | ||
| tool_message = self.mcp_client.convert_result_to_openai_format(content) | ||
| for content in tool_message: | ||
| if content["type"] == "image_url": | ||
| all_response += "<image_url>" | ||
| elif content["type"] == "text": | ||
| all_response += content["text"] | ||
| tool_messages[-1]["content"].extend(tool_message) | ||
| all_response += "</tool_response>" |
There was a problem hiding this comment.
Fix malformed tool_response markup and include tool_call_id on tool messages
- The first line appends a stray closing </tool_response> without an opener (and you close again later).
- Add tool_call_id to tool messages to align with OpenAI Chat Completions tool schema.
- Minor: avoid reusing the name content inside the inner loop; add audio_url handling.
- all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call></tool_response>"
- tool_messages.append({"role": "tool", "name": call.function.name, "content": []})
+ all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call><tool_response>"
+ tool_messages.append(
+ {
+ "role": "tool",
+ "tool_call_id": call.id,
+ "name": call.function.name,
+ "content": [],
+ }
+ )
for content in result.content:
tool_message = self.mcp_client.convert_result_to_openai_format(content)
- for content in tool_message:
- if content["type"] == "image_url":
+ for item in tool_message:
+ if item["type"] == "image_url":
all_response += "<image_url>"
- elif content["type"] == "text":
- all_response += content["text"]
- tool_messages[-1]["content"].extend(tool_message)
+ elif item["type"] == "audio_url":
+ all_response += "<audio_url>"
+ elif item["type"] == "text":
+ all_response += item["text"]
+ tool_messages[-1]["content"].extend(tool_message)
all_response += "</tool_response>"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call></tool_response>" | |
| tool_messages.append({"role": "tool", "name": call.function.name, "content": []}) | |
| for content in result.content: | |
| tool_message = self.mcp_client.convert_result_to_openai_format(content) | |
| for content in tool_message: | |
| if content["type"] == "image_url": | |
| all_response += "<image_url>" | |
| elif content["type"] == "text": | |
| all_response += content["text"] | |
| tool_messages[-1]["content"].extend(tool_message) | |
| all_response += "</tool_response>" | |
| all_response += f"<tool_call>{call.function.name} {call.function.arguments}</tool_call><tool_response>" | |
| tool_messages.append( | |
| { | |
| "role": "tool", | |
| "tool_call_id": call.id, | |
| "name": call.function.name, | |
| "content": [], | |
| } | |
| ) | |
| for content in result.content: | |
| tool_message = self.mcp_client.convert_result_to_openai_format(content) | |
| for item in tool_message: | |
| if item["type"] == "image_url": | |
| all_response += "<image_url>" | |
| elif item["type"] == "audio_url": | |
| all_response += "<audio_url>" | |
| elif item["type"] == "text": | |
| all_response += item["text"] | |
| tool_messages[-1]["content"].extend(tool_message) | |
| all_response += "</tool_response>" |
🤖 Prompt for AI Agents
In lmms_eval/models/chat/async_openai.py around lines 187 to 197, the code emits
a stray closing </tool_response> before the opening tag, doesn't include
tool_call_id on tool messages, reuses the variable name "content" in nested
loops, and lacks audio_url handling; fix by removing the premature
</tool_response>, prepend a proper <tool_response> (including tool_call id and
call.function.name/arguments) before writing the tool call tag, add
"tool_call_id": call.id to the tool_messages entries, rename the inner loop
variable to avoid shadowing (e.g., result_item), add handling for content types
"image_url" and "audio_url" (append appropriate markers and include them in
tool_messages[-1]["content"]), and ensure the final closing </tool_response>
remains only at the end.
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
lmms_eval/tasks/charades_sta/eval_tvg.py (4)
43-49: Tighten the primary time-range regex and remove literal en dash to satisfy Ruff.The current pattern allows multiple dots (e.g., "3....5") and includes a literal EN DASH, which Ruff flags (RUF001). Use an explicit char class with U+2013 and a precise decimal pattern.
Apply this diff:
- patterns = [r"(\d+\.*\d*)\s*[–-]\s*(\d+\.*\d*)"] + # Allow hyphen-minus and EN DASH (U+2013) without embedding the Unicode literal + patterns = [r"(\d+(?:\.\d+)?)\s*[\u2013-]\s*(\d+(?:\.\d+)?)"]
70-86: HH:MM:SS parsing is broken (tuple from nested groups;.spliton tuple).
re.findallreturns tuples due to the inner capture, resulting in runtime errors when callingt.count(":")andt.split(":"). Also, the code never processes multiple matches per sentence.Apply this diff to fix parsing and support both MM:SS and HH:MM:SS:
- times = [] - time_regex = re.compile(r"\b((\d{1,2}:\d{2}:\d{2}))\b") # time formats (e.g., 18:00, 00:18:05) - for sentence in candidates: - time = re.findall(time_regex, sentence) - if time: - t = time[0] - else: - continue - # If time is in HH:MM:SS format, convert to seconds - if t.count(":") == 2: - h, m, s = map(int, t.split(":")) - time_in_sec = h * 3600 + m * 60 + s - elif t.count(":") == 1: - m, s = map(int, t.split(":")) - time_in_sec = m * 60 + s - times.append(time_in_sec) + times = [] + # Support HH:MM:SS or MM:SS without nested capture groups + time_regex = re.compile(r"\b(?:\d{1,2}:)?\d{2}:\d{2}\b") + for sentence in candidates: + matches = time_regex.findall(sentence) + for t in matches: + parts = list(map(int, t.split(":"))) + if len(parts) == 3: + h, m, s = parts + time_in_sec = h * 3600 + m * 60 + s + else: + m, s = parts + time_in_sec = m * 60 + s + times.append(time_in_sec)
130-130: Avoidevalon untrusted input; useast.literal_eval.Executing
eval(gt)is a security risk and unnecessary.Apply this diff:
- gt = eval(gt) + gt = ast.literal_eval(gt)Additionally, add this import at the top of the file:
+import ast
13-17: Add type hints and minimal docstrings to public functions.Required by the repo guidelines; improves readability and Pyright coverage.
Apply this diff:
-def read_json(path): +def read_json(path: str) -> dict: + """Read and parse a JSON file from path.""" with open(path, "r") as fin: datas = json.load(fin) return datas @@ -def write_json(path, data): +def write_json(path: str, data: Any) -> None: + """Write data as JSON to path.""" with open(path, "w") as fout: json.dump(data, fout) - print("The format file has been saved at:{}".format(path)) + print(f"The format file has been saved at: {path}") return @@ -def extract_time(paragraph): +def extract_time(paragraph: str) -> list[list[float]]: + """Extract a single [start, end] interval (in seconds) from free text.""" @@ -def iou(A, B): +def iou(a: Sequence[float], b: Sequence[float]) -> float: + """IoU of two 1D segments [start, end].""" - max0 = max((A[0]), (B[0])) - min0 = min((A[0]), (B[0])) - max1 = max((A[1]), (B[1])) - min1 = min((A[1]), (B[1])) + max0 = max(a[0], b[0]) + min0 = min(a[0], b[0]) + max1 = max(a[1], b[1]) + min1 = min(a[1], b[1]) return max(min1 - max0, 0) / (max1 - min0)Also applies to: 19-24, 26-26, 108-114
🧹 Nitpick comments (3)
lmms_eval/tasks/charades_sta/eval_tvg.py (3)
28-28: Don’t blindly replace “to”; use a word-boundary substitution.
.replace("to", "-")will mangle words like “tomorrow” → “-morrow”.Apply this diff:
- paragraph = paragraph.lower().replace(prompt, "").replace("to", "-") + paragraph = paragraph.lower().replace(prompt, "") + paragraph = re.sub(r"\bto\b", "-", paragraph)
1-10: Trim unused imports and add typing imports per guidelines.
os,pdb,numpy,deepcopy, andPathare unused. Ruff will flag these, and guidelines require type hints.Apply this diff:
-import argparse -import json -import os -import pdb -import re -from copy import deepcopy -from pathlib import Path - -import numpy as np +import argparse +import json +import re +from typing import Any, Sequence
145-145: Wrap the long print to respect the 88-char limit and prefer f-strings.Apply this diff:
- print("IOU 0.3: {0}\nIOU 0.5: {1}\nIOU 0.7: {2}\nmIOU".format(Result[0.3] * 100 / num, Result[0.5] * 100 / num, Result[0.7] * 100 / num), sum(ious) * 100 / num) + print( + f"IOU 0.3: {Result[0.3] * 100 / num}\n" + f"IOU 0.5: {Result[0.5] * 100 / num}\n" + f"IOU 0.7: {Result[0.7] * 100 / num}\n" + f"mIOU {sum(ious) * 100 / num}" + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
lmms_eval/tasks/charades_sta/eval_tvg.py(2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
**/*.py: Type hints are required for all Python code
Public APIs must have docstrings
Maximum line length is 88 characters
Use PEP 8 naming: snake_case for functions/variables
Class names must use PascalCase
Constants should be in UPPER_SNAKE_CASE
Use f-strings for string formatting
Use early returns to avoid nested conditions
Use descriptive names; prefix handler functions with 'handle'
Prefer constants over functions where possible
Prefer functional, immutable approaches when not verbose
Define composing (higher-level) functions before their components
Mark issues in existing code with TODO: prefix in comments
Use functional and stateless approaches where they improve clarity
Use Ruff to enforce: import sorting (I001) and no unused imports
For long strings, wrap using parentheses rather than backslashes
Format long function calls over multiple lines with proper indentation
Split long import lists across multiple lines
Use Pyright type checking: add explicit None checks for Optional values
Use Pyright type narrowing for strings where applicable
Use Ruff (via pre-commit) to format and lint Python files
Document public APIs and test thoroughly
Files:
lmms_eval/tasks/charades_sta/eval_tvg.py
🪛 Ruff (0.12.2)
lmms_eval/tasks/charades_sta/eval_tvg.py
43-43: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
91-91: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
92-92: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
| # Fallback: if no timestamps found, search for any two number patterns with dash | ||
| if len(timestamps) == 0: | ||
| # More comprehensive pattern to match various formats like: | ||
| # xx - xx, x.xx s - x.xx s, x.xxs - x.xxs, etc. | ||
| # Also handle en dash (–) and regular dash (-) | ||
| fallback_pattern = r"(\d+(?:\.\d+)?)\s*s?\s*[–-]\s*(\d+(?:\.\d+)?)\s*s?" | ||
| fallback_matches = re.findall(fallback_pattern, paragraph) | ||
| if fallback_matches: | ||
| timestamps = [[float(start), float(end)] for start, end in fallback_matches] | ||
|
|
There was a problem hiding this comment.
Fallback regex: remove literal en dash; reduce false positives; handle units.
- Replace the literal EN DASH with
\u2013to avoid Ruff RUF001/RUF003. - Limit search scope to keyword-filtered
candidatesto avoid catching unrelated numeric ranges in the whole paragraph. - Optionally accept "s/sec/seconds".
Apply this diff:
- if len(timestamps) == 0:
- # More comprehensive pattern to match various formats like:
- # xx - xx, x.xx s - x.xx s, x.xxs - x.xxs, etc.
- # Also handle en dash (–) and regular dash (-)
- fallback_pattern = r"(\d+(?:\.\d+)?)\s*s?\s*[–-]\s*(\d+(?:\.\d+)?)\s*s?"
- fallback_matches = re.findall(fallback_pattern, paragraph)
+ if len(timestamps) == 0:
+ # More comprehensive pattern to match various formats like:
+ # xx - xx, x.xx s - x.xx s, x.xxs - x.xxs, etc.
+ # Also handle hyphen-minus (-) and EN DASH (U+2013)
+ fallback_pattern = (
+ r"(\d+(?:\.\d+)?)\s*(?:s|sec|secs|second|seconds)?\s*"
+ r"[\u2013-]\s*"
+ r"(\d+(?:\.\d+)?)\s*(?:s|sec|secs|second|seconds)?\b"
+ )
+ scope = ' '.join(candidates) if candidates else paragraph
+ fallback_matches = re.findall(fallback_pattern, scope)Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 Ruff (0.12.2)
91-91: Comment contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF003)
92-92: String contains ambiguous – (EN DASH). Did you mean - (HYPHEN-MINUS)?
(RUF001)
* Add longvila * Allow long vila pass in video kwargs * Add lvbench * Fix format better * Better tvg eval * resolve model name
* Add longvila * Allow long vila pass in video kwargs * Add lvbench * Fix format better * Better tvg eval * resolve model name
Before you open a pull-request, please check if a similar issue already exists or has been closed before.
When you open a pull-request, please be sure to include the following
If you meet the lint warnings, you can use following scripts to reformat code.
Thank you for your contributions!
Summary by CodeRabbit
New Features
Improvements
Bug Fixes
Chores