fix: add support for video data in Agent Loop and Qwen3 VL
#4727
base: main
Conversation
Code Review
This pull request adds valuable support for video data in the Agent Loop, which is a significant feature enhancement. The implementation correctly threads the video_data through the various components. My review focuses on improving the robustness and maintainability of the new code. I've identified a few areas with high-severity issues, including brittle logic based on class name strings, potential bugs related to attribute access and data handling, and duplicated code that could be refactored. Addressing these points will make the new functionality more reliable and easier to maintain.
```python
videos = getattr(output, "multi_modal_data", {}).get("video", None)
if videos is not None:
    videos, video_metadatas = zip(*videos, strict=False)
    videos, video_metadatas = list(videos), list(video_metadatas)
    videos_kwargs = {"video_metadata": video_metadatas, "do_sample_frames": False}
else:
    videos_kwargs = {}
```
This block of code for processing video data is nearly identical to the one in verl/experimental/agent_loop/tool_agent_loop.py at lines 223-228. Duplicating this logic increases maintenance overhead and the risk of introducing inconsistencies if one is updated and the other is not. Consider refactoring this into a shared helper function to promote code reuse and simplify future modifications. For example, a function like _prepare_video_kwargs(videos) could encapsulate this logic.
```diff
 if self.processor is not None and "Qwen2VLImageProcessor" in self.processor.image_processor.__class__.__name__:
-    from verl.models.transformers.qwen2_vl import get_rope_index
+    if self.processor.__class__.__name__ == "Qwen3VLProcessor":
+        from verl.models.transformers.qwen3_vl import get_rope_index
+    else:
+        from verl.models.transformers.qwen2_vl import get_rope_index
```
The current implementation for selecting the get_rope_index function relies on string comparisons of class names ("Qwen2VLImageProcessor", "Qwen3VLProcessor"). This approach is brittle and not easily extensible. A more robust design would be to use polymorphism. For instance, you could add a method to the processor classes (e.g., get_rope_index_func) that returns the correct function. This would eliminate the need for these conditional checks and make the code cleaner and easier to maintain.
Example of a more robust approach:

```python
# In your processor class (e.g., Qwen3VLProcessor)
@staticmethod
def get_rope_index_func():
    from verl.models.transformers.qwen3_vl import get_rope_index
    return get_rope_index

# In this file, the logic would become:
if self.processor is not None and hasattr(self.processor, "get_rope_index_func"):
    get_rope_index = self.processor.get_rope_index_func()
    # ... rest of the logic for using get_rope_index
else:
    # handle cases where the function is not available
    ...
```

```python
multi_modal_data["image"] = image_data
if video_data is not None:
    multi_modal_data["video"] = video_data
prompt = TokensPrompt(prompt_token_ids=prompt_ids, multi_modal_data=multi_modal_data)
```
The previous implementation passed None for multi_modal_data when no image data was present. The new logic passes an empty dictionary {} when both image_data and video_data are None. This change in behavior (from None to {}) might be unintended and could lead to unexpected issues downstream in the vLLM engine, which expects Optional[Dict]. To maintain the original behavior, consider passing None if the multi_modal_data dictionary is empty.
Suggested change:

```diff
-prompt = TokensPrompt(prompt_token_ids=prompt_ids, multi_modal_data=multi_modal_data)
+prompt = TokensPrompt(prompt_token_ids=prompt_ids, multi_modal_data=multi_modal_data if multi_modal_data else None)
```
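The suggested guard relies on empty dicts being falsy in Python; a quick illustration (the function name is made up for the example):

```python
def normalize_multi_modal_data(multi_modal_data):
    # An empty dict is falsy, so this restores the Optional[dict] contract
    # that the previous implementation upheld: {} becomes None.
    return multi_modal_data if multi_modal_data else None

print(normalize_multi_modal_data({}))              # None
print(normalize_multi_modal_data({"video": [0]}))  # {'video': [0]}
```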
```diff
 # Find where the array equals the value
-is_value = prompt_ids == processor.image_token_id
+is_value = (prompt_ids == processor.image_token_id) | (prompt_ids == processor.video_token_id)
```
Accessing processor.video_token_id directly could raise an AttributeError if the processor does not support video and thus lacks this attribute. To make this code more robust, you should check for the existence of video_token_id before using it.
A safer implementation would be:

```python
is_value = prompt_ids == processor.image_token_id
if hasattr(processor, "video_token_id"):
    is_value |= (prompt_ids == processor.video_token_id)
```
Please look at #4513; it may help you solve this issue.
What does this PR do?
Support video data in Agent Loop.
Currently, Agent Loop only supports images as multimodal input. This PR adds video support.
Successfully tested on Qwen3 VL.
Checklist Before Submitting
Important
Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.
- Apply pre-commit checks: `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- `ci-request` channel in the `verl` Slack workspace. (If not accessible, please try the Feishu group (飞书群).)