kaln27 (Contributor) commented Dec 29, 2025

What does this PR do?

Support video data in Agent Loop.
Currently, Agent Loop only supports images as multimodal input. This PR adds video support.

Successfully tested on Qwen3 VL.
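
For illustration, this is roughly how video data reaches the vLLM engine after this change (a sketch based on the diffs reviewed below; prompt_ids and video_data stand in for whatever the rollout actually produces):

from vllm.inputs import TokensPrompt

prompt_ids = [1, 2, 3]  # placeholder token ids
video_data = [...]      # e.g. decoded frames plus metadata from the rollout

# Mirrors the diff below: "video" joins "image" as a multi_modal_data entry.
prompt = TokensPrompt(
    prompt_token_ids=prompt_ids,
    multi_modal_data={"video": video_data},
)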

Checklist Before Submitting

Important: Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

gemini-code-assist bot left a comment

Code Review

This pull request adds valuable support for video data in the Agent Loop, which is a significant feature enhancement. The implementation correctly threads the video_data through the various components. My review focuses on improving the robustness and maintainability of the new code. I've identified a few areas with high-severity issues, including brittle logic based on class name strings, potential bugs related to attribute access and data handling, and duplicated code that could be refactored. Addressing these points will make the new functionality more reliable and easier to maintain.

Comment on lines +510 to +516
videos = getattr(output, "multi_modal_data", {}).get("video", None)
if videos is not None:
    videos, video_metadatas = zip(*videos, strict=False)
    videos, video_metadatas = list(videos), list(video_metadatas)
    videos_kwargs = {"video_metadata": video_metadatas, "do_sample_frames": False}
else:
    videos_kwargs = {}

Severity: high

This block of code for processing video data is nearly identical to the one in verl/experimental/agent_loop/tool_agent_loop.py at lines 223-228. Duplicating this logic increases maintenance overhead and the risk of introducing inconsistencies if one is updated and the other is not. Consider refactoring this into a shared helper function to promote code reuse and simplify future modifications. For example, a function like _prepare_video_kwargs(videos) could encapsulate this logic.
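
A sketch of such a helper, mirroring the block above (the name _prepare_video_kwargs comes from this suggestion, not from the PR itself; zip(..., strict=False) requires Python 3.10+):

def _prepare_video_kwargs(videos):
    """Split (video, metadata) pairs into (videos, processor kwargs).

    Mirrors the duplicated block in both agent loops: returns empty
    kwargs and passes videos through unchanged when there is no video data.
    """
    if videos is None:
        return None, {}
    videos, video_metadatas = zip(*videos, strict=False)
    videos_kwargs = {"video_metadata": list(video_metadatas), "do_sample_frames": False}
    return list(videos), videos_kwargs

Both call sites would then reduce to:

videos = getattr(output, "multi_modal_data", {}).get("video", None)
videos, videos_kwargs = _prepare_video_kwargs(videos)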

Comment on lines 527 to 531
  if self.processor is not None and "Qwen2VLImageProcessor" in self.processor.image_processor.__class__.__name__:
-     from verl.models.transformers.qwen2_vl import get_rope_index
+     if self.processor.__class__.__name__ == "Qwen3VLProcessor":
+         from verl.models.transformers.qwen3_vl import get_rope_index
+     else:
+         from verl.models.transformers.qwen2_vl import get_rope_index

Severity: high

The current implementation for selecting the get_rope_index function relies on string comparisons of class names ("Qwen2VLImageProcessor", "Qwen3VLProcessor"). This approach is brittle and not easily extensible. A more robust design would be to use polymorphism. For instance, you could add a method to the processor classes (e.g., get_rope_index_func) that returns the correct function. This would eliminate the need for these conditional checks and make the code cleaner and easier to maintain.

Example of a more robust approach:

# In your processor class (e.g., Qwen3VLProcessor)
@staticmethod
def get_rope_index_func():
    from verl.models.transformers.qwen3_vl import get_rope_index
    return get_rope_index

# In this file, the logic would become:
if self.processor is not None and hasattr(self.processor, "get_rope_index_func"):
    get_rope_index = self.processor.get_rope_index_func()
    # ... rest of the logic for using get_rope_index
else:
    # fall back to the original behavior when the method is not available
    from verl.models.transformers.qwen2_vl import get_rope_index

multi_modal_data["image"] = image_data
if video_data is not None:
multi_modal_data["video"] = video_data
prompt = TokensPrompt(prompt_token_ids=prompt_ids, multi_modal_data=multi_modal_data)

Severity: high

The previous implementation passed None for multi_modal_data when no image data was present. The new logic passes an empty dictionary {} when both image_data and video_data are None. This change in behavior (from None to {}) might be unintended and could lead to unexpected issues downstream in the vLLM engine, which expects Optional[Dict]. To maintain the original behavior, consider passing None if the multi_modal_data dictionary is empty.

Suggested change:
- prompt = TokensPrompt(prompt_token_ids=prompt_ids, multi_modal_data=multi_modal_data)
+ prompt = TokensPrompt(prompt_token_ids=prompt_ids, multi_modal_data=multi_modal_data if multi_modal_data else None)
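
For context, the end state the suggestion produces looks roughly like this (a sketch; the image_data guard is assumed from surrounding code that is not shown in the diff):

multi_modal_data = {}
if image_data is not None:  # assumed guard; only the video branch appears in the diff
    multi_modal_data["image"] = image_data
if video_data is not None:
    multi_modal_data["video"] = video_data

# `multi_modal_data or None` restores the pre-PR behavior of passing
# None to vLLM when there is no multimodal input at all.
prompt = TokensPrompt(
    prompt_token_ids=prompt_ids,
    multi_modal_data=multi_modal_data or None,
)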


# Find where the array equals the value
- is_value = prompt_ids == processor.image_token_id
+ is_value = (prompt_ids == processor.image_token_id) | (prompt_ids == processor.video_token_id)

Severity: high

Accessing processor.video_token_id directly could raise an AttributeError if the processor does not support video and thus lacks this attribute. To make this code more robust, you should check for the existence of video_token_id before using it.

A safer implementation would be:

is_value = prompt_ids == processor.image_token_id
if hasattr(processor, "video_token_id"):
    is_value |= (prompt_ids == processor.video_token_id)
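
An equivalent variant using getattr (a sketch; it behaves the same, and also tolerates processors that define the attribute but leave it set to None):

video_token_id = getattr(processor, "video_token_id", None)
is_value = prompt_ids == processor.image_token_id
if video_token_id is not None:
    is_value |= prompt_ids == video_token_id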

kaln27 changed the title from "fix: Add support for video data in Agent Loop" to "fix: add support for video data in Agent Loop and Qwen3 VL" on Dec 31, 2025
Miku2G commented Dec 31, 2025

Please look at #4513; it may help you solve this issue.
