kaln27 (Contributor) commented Dec 29, 2025

What does this PR do?

Support video data in Agent Loop.
Currently, Agent Loop only supports images as multimodal input. This PR adds video support.

Successfully tested on Qwen3 VL.
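
For illustration, this is roughly how video data reaches the vLLM engine after this change (a sketch based on the diffs reviewed below; prompt_ids and video_data stand in for whatever the rollout actually produces):

from vllm.inputs import TokensPrompt

prompt_ids = [1, 2, 3]  # placeholder token ids
video_data = [...]      # e.g. decoded frames plus metadata from the rollout

# Mirrors the diff below: "video" joins "image" as a multi_modal_data entry.
prompt = TokensPrompt(
    prompt_token_ids=prompt_ids,
    multi_modal_data={"video": video_data},
)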

Checklist Before Submitting

Important: Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

gemini-code-assist bot left a comment

Code Review

This pull request adds valuable support for video data in the Agent Loop, which is a significant feature enhancement. The implementation correctly threads the video_data through the various components. My review focuses on improving the robustness and maintainability of the new code. I've identified a few areas with high-severity issues, including brittle logic based on class name strings, potential bugs related to attribute access and data handling, and duplicated code that could be refactored. Addressing these points will make the new functionality more reliable and easier to maintain.

Comment on lines +510 to +516
videos = getattr(output, "multi_modal_data", {}).get("video", None)
if videos is not None:
    videos, video_metadatas = zip(*videos, strict=False)
    videos, video_metadatas = list(videos), list(video_metadatas)
    videos_kwargs = {"video_metadata": video_metadatas, "do_sample_frames": False}
else:
    videos_kwargs = {}

Severity: high

This block of code for processing video data is nearly identical to the one in verl/experimental/agent_loop/tool_agent_loop.py at lines 223-228. Duplicating this logic increases maintenance overhead and the risk of introducing inconsistencies if one is updated and the other is not. Consider refactoring this into a shared helper function to promote code reuse and simplify future modifications. For example, a function like _prepare_video_kwargs(videos) could encapsulate this logic.
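
A sketch of such a helper, mirroring the block above (the name _prepare_video_kwargs comes from this suggestion, not from the PR itself; zip(..., strict=False) requires Python 3.10+):

def _prepare_video_kwargs(videos):
    """Split (video, metadata) pairs into (videos, processor kwargs).

    Mirrors the duplicated block in both agent loops: returns empty
    kwargs and passes videos through unchanged when there is no video data.
    """
    if videos is None:
        return None, {}
    videos, video_metadatas = zip(*videos, strict=False)
    videos_kwargs = {"video_metadata": list(video_metadatas), "do_sample_frames": False}
    return list(videos), videos_kwargs

Both call sites would then reduce to:

videos = getattr(output, "multi_modal_data", {}).get("video", None)
videos, videos_kwargs = _prepare_video_kwargs(videos)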

Comment on lines 527 to 531
  if self.processor is not None and "Qwen2VLImageProcessor" in self.processor.image_processor.__class__.__name__:
-     from verl.models.transformers.qwen2_vl import get_rope_index
+     if self.processor.__class__.__name__ == "Qwen3VLProcessor":
+         from verl.models.transformers.qwen3_vl import get_rope_index
+     else:
+         from verl.models.transformers.qwen2_vl import get_rope_index

Severity: high

The current implementation for selecting the get_rope_index function relies on string comparisons of class names ("Qwen2VLImageProcessor", "Qwen3VLProcessor"). This approach is brittle and not easily extensible. A more robust design would be to use polymorphism. For instance, you could add a method to the processor classes (e.g., get_rope_index_func) that returns the correct function. This would eliminate the need for these conditional checks and make the code cleaner and easier to maintain.

Example of a more robust approach:

# In your processor class (e.g., Qwen3VLProcessor)
@staticmethod
def get_rope_index_func():
    from verl.models.transformers.qwen3_vl import get_rope_index
    return get_rope_index

# In this file, the logic would become:
if self.processor is not None and hasattr(self.processor, "get_rope_index_func"):
    get_rope_index = self.processor.get_rope_index_func()
    # ... rest of the logic for using get_rope_index
else:
    # fall back to the original behavior when the method is not available
    from verl.models.transformers.qwen2_vl import get_rope_index

multi_modal_data["image"] = image_data
if video_data is not None:
multi_modal_data["video"] = video_data
prompt = TokensPrompt(prompt_token_ids=prompt_ids, multi_modal_data=multi_modal_data)

Severity: high

The previous implementation passed None for multi_modal_data when no image data was present. The new logic passes an empty dictionary {} when both image_data and video_data are None. This change in behavior (from None to {}) might be unintended and could lead to unexpected issues downstream in the vLLM engine, which expects Optional[Dict]. To maintain the original behavior, consider passing None if the multi_modal_data dictionary is empty.

Suggested change:
- prompt = TokensPrompt(prompt_token_ids=prompt_ids, multi_modal_data=multi_modal_data)
+ prompt = TokensPrompt(prompt_token_ids=prompt_ids, multi_modal_data=multi_modal_data if multi_modal_data else None)
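
For context, the end state the suggestion produces looks roughly like this (a sketch; the image_data guard is assumed from surrounding code that is not shown in the diff):

multi_modal_data = {}
if image_data is not None:  # assumed guard; only the video branch appears in the diff
    multi_modal_data["image"] = image_data
if video_data is not None:
    multi_modal_data["video"] = video_data

# `multi_modal_data or None` restores the pre-PR behavior of passing
# None to vLLM when there is no multimodal input at all.
prompt = TokensPrompt(
    prompt_token_ids=prompt_ids,
    multi_modal_data=multi_modal_data or None,
)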


# Find where the array equals the value
- is_value = prompt_ids == processor.image_token_id
+ is_value = (prompt_ids == processor.image_token_id) | (prompt_ids == processor.video_token_id)

Severity: high

Accessing processor.video_token_id directly could raise an AttributeError if the processor does not support video and thus lacks this attribute. To make this code more robust, you should check for the existence of video_token_id before using it.

A safer implementation would be:

is_value = prompt_ids == processor.image_token_id
if hasattr(processor, "video_token_id"):
    is_value |= (prompt_ids == processor.video_token_id)
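
An equivalent variant using getattr (a sketch; it behaves the same, and also tolerates processors that define the attribute but leave it set to None):

video_token_id = getattr(processor, "video_token_id", None)
is_value = prompt_ids == processor.image_token_id
if video_token_id is not None:
    is_value |= prompt_ids == video_token_id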

kaln27 changed the title from "fix: Add support for video data in Agent Loop" to "fix: add support for video data in Agent Loop and Qwen3 VL" on Dec 31, 2025
Miku2G commented Dec 31, 2025

Please look at #4513; it may help you solve this issue.
