针对VLM的数据encoding的问题 | Questions about image filtering and prompt formatting in RLVR VLM training

我最近在使用 ROLL 做 VLM 的 RLVR 训练时，有两个问题想请教。

1. 关于 encoding VLM 数据集时的 image filtering
我注意到在这里的实现中：
[encoding vlm](https://github.com/alibaba/ROLL/blob/main/roll/pipeline/rlvr/rlvr_vlm_pipeline.py#L118)

相比 verl 的实现：
[verl dataset logic](https://github.com/verl-project/verl/blob/main/verl/utils/dataset/rl_dataset.py#L182)

似乎少了一些针对 image 的过滤逻辑。
比如，verl 这边会把 prompt 和 images 一起交给 processor，再基于处理结果做过滤。

我基于这个思路做了一点修改，目前在我自己的使用场景里是可以正常工作的。
后面我也会补充一些数据和额外测试，看是否可以整理成一个更完整的改动。

我想确认一下：这里没有加入这部分 image filtering，是有意为之吗？还是说目前这部分逻辑还没有补齐？

2. 关于 format prompt 的处理
另外我还注意到这里：
[format prompt](https://github.com/alibaba/ROLL/blob/main/roll/pipeline/rlvr/rlvr_vlm_pipeline.py#L53)

这里会对 user prompt 做一次额外处理。
我有点想了解，这样设计的原因是什么？

因为在我的实际训练里，我已经有自己预先设置好的 prompt。
加上这里的处理之后，会对训练和推理效果产生影响。

所以我想请教两点：

- 这里对 user prompt 的修改，主要是为了解决什么问题？
- 这部分行为是否应该在 README 里做更明确的说明？

如果我理解有误，也欢迎直接指出。
如果需要的话，我后面也可以把我这边的修改和测试结果补充上来。
这个是[初步](https://github.com/Damon-GSY/ROLL/commit/80fd5e19e09edd49cde2fff8cecc9d4412fa18ec)的代码修改

---

Hi, thanks for open-sourcing ROLL.
I recently ran into a couple of questions while using ROLL for RLVR training on VLMs.

1. Image filtering in VLM dataset encoding
I noticed that in this implementation:
[encoding vlm](https://github.com/alibaba/ROLL/blob/main/roll/pipeline/rlvr/rlvr_vlm_pipeline.py#L118)

compared with verl’s implementation here:
[verl dataset logic](https://github.com/verl-project/verl/blob/main/verl/utils/dataset/rl_dataset.py#L182)

there seems to be some missing image filtering logic.
For example, in verl, the prompt and images are passed to the processor together, and filtering is done based on the processed results.

I made a small change based on this idea, and it works in my own use case so far.
I plan to add more data and some extra tests later to validate it more carefully.

I wanted to ask: was the absence of this image filtering logic intentional, or is this part not fully implemented yet?

2. Prompt formatting behavior
I also noticed this part here:
[format prompt](https://github.com/alibaba/ROLL/blob/main/roll/pipeline/rlvr/rlvr_vlm_pipeline.py#L53)

It seems to automatically modify the user prompt.
I wanted to understand the reason behind this behavior.

In my training setup, I already use a custom prompt format.
With this extra modification, it affects both training and inference behavior.

So I wanted to ask:

What problem is this prompt modification intended to solve?
Should this behavior be documented more explicitly in the README?

If I misunderstood anything, please feel free to correct me.
If helpful, I can also follow up later with my local changes and some test results.
[This](https://github.com/Damon-GSY/ROLL/commit/80fd5e19e09edd49cde2fff8cecc9d4412fa18ec) code is my initial attempt.





Provide feedback

Saved searches

Use saved searches to filter your results more quickly

针对VLM的数据encoding的问题 | Questions about image filtering and prompt formatting in RLVR VLM training #365

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

针对VLM的数据encoding的问题 | Questions about image filtering and prompt formatting in RLVR VLM training #365

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions