Overview
Sky-RL currently supports RL for LLMs. This issue outlines the progress and path towards multi-modal RL, i.e., multi-modal inputs and text outputs. The project scope and tasks are very much a WIP and will likely change.
Example Use Case
To test learning dynamics, we can use the multimodal-open-r1-8k dataset and reference the corresponding GitHub repo, which has W&B logs to compare against (a loading sketch follows the dataset list below).
Other potential datasets / tasks:
- Geometry8k: geometric visual reasoning.
- MathVision: Visual math problems.
- GRL: Reinforcement learning on games (textual representation). We can use a visual renderer of a game such as Sokoban as a point of comparison.
The above tasks only evaluate visual reasoning; it would be useful to find tasks that incorporate other modalities.
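For a quick smoke test, loading the dataset could look roughly like the sketch below. The dataset ID and column names are assumptions about the Hugging Face mirror of multimodal-open-r1-8k; substitute whichever copy the reference repo actually uses.

```python
# Minimal loading sketch for smoke tests. The dataset ID and column names are
# assumptions; swap in whichever mirror of multimodal-open-r1-8k you use.
from datasets import load_dataset

dataset = load_dataset("lmms-lab/multimodal-open-r1-8k-verified", split="train")

sample = dataset[0]
print(sample.keys())          # expected fields: something like image, problem, solution
print(type(sample["image"]))  # typically decoded to a PIL.Image.Image
```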
Implementation Tasks
The following tasks have been identified through prototype implementations.
- It is conventional to treat multi-modal inputs as a dictionary with keys such as vision_inputs. TensorBatch needs to support either nested dictionaries or nested TensorBatches; I believe the latter is easiest (see the layout sketch after this list).
- The inference engine needs to support passing multi-modal inputs. vllm's multi-modal support is a work in progress but seems to be a good starting point (see the generation sketch after this list). We should also consider using vllm-omni.
- Data processing: Huggingface provides a processor implementation, but this is typically insufficient for multi-modal inputs. Taking vision as an example, different models have different input pre-processing methods (see Qwen-3-VL). This might not be an issue for all models; Qwen-3-omni seems to have a unified processor.
- Ensuring consistency between vllm's multi-modal processor and the training backend processor. vllm expects PIL images or bytes, not processed image patches. As a result, we need to make sure the training backend receives the same inputs; otherwise training dynamics will be off (see the consistency sketch after this list).
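To make the nesting concrete, here is a minimal sketch of the nested-batch layout. The key names and tensor shapes are illustrative only (loosely following a Qwen2-VL-style patch layout) and do not reflect SkyRL's actual TensorBatch API.

```python
# Illustrative layout only; key names and shapes are assumptions, not SkyRL's
# actual TensorBatch API. Shapes loosely follow a Qwen2-VL-style patch layout.
import torch

batch_size, seq_len = 4, 128

# Flat text fields, as the batch holds them today.
text_fields = {
    "input_ids": torch.randint(0, 32000, (batch_size, seq_len)),
    "attention_mask": torch.ones(batch_size, seq_len, dtype=torch.long),
}

# Per-modality tensors grouped under one key so they can be routed as a unit.
vision_inputs = {
    "pixel_values": torch.randn(batch_size * 256, 1176),        # flattened image patches
    "image_grid_thw": torch.tensor([[1, 16, 16]] * batch_size),  # (t, h, w) grid per image
}

# Option 1: nested dictionary inside the existing container.
nested_dict_batch = {**text_fields, "vision_inputs": vision_inputs}

# Option 2 (favoured above): a nested TensorBatch, i.e. the container stores
# another container of the same type under "vision_inputs":
#   TensorBatch(**text_fields, vision_inputs=TensorBatch(**vision_inputs))
```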
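For the inference side, vllm's offline generate() API already accepts a multi_modal_data field alongside the prompt. A rough sketch follows; the model name, prompt template, and image path are placeholders, and the Qwen2-VL special tokens are just one example of how a VLM marks image positions.

```python
# Rough sketch of passing an image to vllm's offline generate() API.
# Model name, prompt template, and image path are placeholders.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-VL-7B-Instruct", limit_mm_per_prompt={"image": 1})
image = Image.open("example.png").convert("RGB")

prompt = (
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>Describe the figure.<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# vllm takes the raw PIL image via multi_modal_data and runs the model's own
# preprocessing internally.
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```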
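For the consistency point, one workable pattern is to keep the raw image around and derive both code paths from it: vllm receives the raw PIL image, while the training backend receives the Hugging Face processor's output on that same image. Again, the model ID, prompt, and image path below are placeholders.

```python
# Sketch of keeping the two code paths consistent: vllm gets the raw PIL image,
# the training backend gets the HF processor's output on that *same* image.
# Model ID, prompt, and image path are placeholders.
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
raw_image = Image.open("example.png").convert("RGB")
prompt = "Describe the figure."

# Inference path: hand vllm the raw image plus the prompt (as in the sketch above).
vllm_request = {"prompt": prompt, "multi_modal_data": {"image": raw_image}}

# Training path: run the identical raw image through the HF processor so the
# trainer's pixel values come from the same preprocessing pipeline.
train_inputs = processor(text=[prompt], images=[raw_image], return_tensors="pt")
print(train_inputs.keys())  # e.g. input_ids, attention_mask, pixel_values, image_grid_thw
```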