Overview
Sky-RL currently supports RL for LLMs. This issue outlines the progress and path towards multi-modal RL, i.e., multi-modal inputs and text outputs. The project scope and tasks are very much a WIP and will likely change.
Example Use Case
To test learning dynamics, we can use the multimodal-open-r1-8k dataset and reference the corresponding GitHub repo, which has W&B logs to compare against (a loading sketch follows the dataset list below).
Other potential datasets / tasks:
- Geometry8k: geometric visual reasoning.
- MathVision: Visual math problems.
- GRL: Reinforcement learning on games (textual representation). We can use a visual renderer of a game such as Sokoban as a point of comparison.
The above tasks only evaluate visual reasoning; it would be useful to find tasks that incorporate other modalities.
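For a quick smoke test, loading the dataset could look roughly like the sketch below. The dataset ID and column names are assumptions about the Hugging Face mirror of multimodal-open-r1-8k; substitute whichever copy the reference repo actually uses.

```python
# Minimal loading sketch for smoke tests. The dataset ID and column names are
# assumptions; swap in whichever mirror of multimodal-open-r1-8k you use.
from datasets import load_dataset

dataset = load_dataset("lmms-lab/multimodal-open-r1-8k-verified", split="train")

sample = dataset[0]
print(sample.keys())          # expected fields: something like image, problem, solution
print(type(sample["image"]))  # typically decoded to a PIL.Image.Image
```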
Implementation Tasks
The following tasks have been identified through prototype implementations.
- It is conventional to treat multi-modal inputs as a dictionary with keys such as vision_inputs. TensorBatch needs to support either nested dictionaries or nested TensorBatches; I believe the latter is easiest (see the layout sketch after this list).
- The inference engine needs to support passing multi-modal inputs. vllm's multi-modal support is a work in progress but seems to be a good starting point (see the generation sketch after this list). We should also consider using vllm-omni.
- Data processing: Huggingface provides a processor implementation, but this is typically insufficient for multi-modal inputs. Taking vision as an example, different models have different input pre-processing methods (see Qwen-3-VL). This might not be an issue for all models; Qwen-3-omni seems to have a unified processor.
- Ensuring consistency between vllm's multi-modal processor and the training backend processor. vllm expects PIL images or bytes, not processed image patches. As a result, we need to make sure the training backend receives the same inputs; otherwise training dynamics will be off (see the consistency sketch after this list).
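To make the nesting concrete, here is a minimal sketch of the nested-batch layout. The key names and tensor shapes are illustrative only (loosely following a Qwen2-VL-style patch layout) and do not reflect SkyRL's actual TensorBatch API.

```python
# Illustrative layout only; key names and shapes are assumptions, not SkyRL's
# actual TensorBatch API. Shapes loosely follow a Qwen2-VL-style patch layout.
import torch

batch_size, seq_len = 4, 128

# Flat text fields, as the batch holds them today.
text_fields = {
    "input_ids": torch.randint(0, 32000, (batch_size, seq_len)),
    "attention_mask": torch.ones(batch_size, seq_len, dtype=torch.long),
}

# Per-modality tensors grouped under one key so they can be routed as a unit.
vision_inputs = {
    "pixel_values": torch.randn(batch_size * 256, 1176),        # flattened image patches
    "image_grid_thw": torch.tensor([[1, 16, 16]] * batch_size),  # (t, h, w) grid per image
}

# Option 1: nested dictionary inside the existing container.
nested_dict_batch = {**text_fields, "vision_inputs": vision_inputs}

# Option 2 (favoured above): a nested TensorBatch, i.e. the container stores
# another container of the same type under "vision_inputs":
#   TensorBatch(**text_fields, vision_inputs=TensorBatch(**vision_inputs))
```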
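For the inference side, vllm's offline generate() API already accepts a multi_modal_data field alongside the prompt. A rough sketch follows; the model name, prompt template, and image path are placeholders, and the Qwen2-VL special tokens are just one example of how a VLM marks image positions.

```python
# Rough sketch of passing an image to vllm's offline generate() API.
# Model name, prompt template, and image path are placeholders.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", limit_mm_per_prompt={"image": 1})
image = Image.open("example.png").convert("RGB")

prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe the figure.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# vllm takes the raw PIL image via multi_modal_data and runs the model's own
# preprocessing internally.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```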
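For the consistency point, one workable pattern is to keep the raw image around and derive both code paths from it: vllm receives the raw PIL image, while the training backend receives the Hugging Face processor's output on that same image. Again, the model ID, prompt, and image path below are placeholders.

```python
# Sketch of keeping the two code paths consistent: vllm gets the raw PIL image,
# the training backend gets the HF processor's output on that *same* image.
# Model ID, prompt, and image path are placeholders.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
raw_image = Image.open("example.png").convert("RGB")
prompt = "Describe the figure."

# Inference path: hand vllm the raw image plus the prompt (as in the sketch above).
vllm_request = {"prompt": prompt, "multi_modal_data": {"image": raw_image}}

# Training path: run the identical raw image through the HF processor so the
# trainer's pixel values come from the same preprocessing pipeline.
train_inputs = processor(text=[prompt], images=[raw_image], return_tensors="pt")
print(train_inputs.keys())  # e.g. input_ids, attention_mask, pixel_values, image_grid_thw
```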