
[skyrl-train] Multi-modal support #876

@nithinvc

Overview

Sky-RL currently supports RL for LLMs. This issue outlines the progress and path towards multi-modal RL, meaning multi-modal inputs with text outputs. The project scope and tasks are very much a WIP and will likely change.

Example Use Case

To test learning dynamics, we can use the multimodal-open-r1-8k dataset and reference the corresponding GitHub repo, which publishes W&B logs we can compare against.

Other potential datasets / tasks:

  • Geometry8k: Geometric visual reasoning.
  • MathVision: Visual math problems.
  • GRL: Reinforcement learning on games (textual representation). We can use a visual renderer of a game such as Sokoban as a point of comparison against the text-only version.

The above tasks only evaluate visual reasoning. It would be useful to find tasks that incorporate other modalities.
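For the Sokoban comparison mentioned in the GRL item, a visual renderer can be very small. The sketch below is only an illustration: it assumes the common Sokoban text symbols (`#`, `$`, `.`, `@`, `*`, `+`), an arbitrary color palette, and a hypothetical `render_board` helper. None of this is taken from GRL, and it returns nested lists of RGB tuples so it runs with the standard library alone.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Assumed symbol set (common Sokoban text notation) with arbitrary colors.
PALETTE: Dict[str, RGB] = {
    "#": (60, 60, 60),     # wall
    " ": (230, 230, 230),  # floor
    "$": (180, 120, 40),   # box
    ".": (240, 200, 60),   # goal
    "@": (40, 90, 200),    # player
    "*": (60, 170, 80),    # box on goal
    "+": (120, 90, 220),   # player on goal
}


def render_board(rows: List[str], cell: int = 4) -> List[List[RGB]]:
    """Render a textual board as a height x width grid of RGB pixels.

    Each board symbol becomes a solid cell x cell square. Unknown symbols
    raise a KeyError rather than being silently drawn as floor.
    """
    width = max(len(row) for row in rows)
    image: List[List[RGB]] = []
    for row in rows:
        padded = row.ljust(width)  # pad ragged rows with floor
        pixel_row = [PALETTE[ch] for ch in padded for _ in range(cell)]
        # Repeat the pixel row `cell` times to get square cells.
        image.extend(list(pixel_row) for _ in range(cell))
    return image


board = ["#####", "#@$.#", "#####"]
image = render_board(board, cell=2)
```

The resulting grid could be converted to a PIL image before being handed to the inference engine, so that the same game state is available both as text and as pixels.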

Implementation Tasks

The following tasks were identified through prototype implementations.

  • By convention, multi-modal inputs are passed as a dictionary under a key such as vision_inputs. TensorBatch needs to support either nested dictionaries or nested TensorBatches. I believe the latter is easiest.
  • The inference engine needs to support passing multi-modal inputs. vllm multi-modal support is a work in progress, but seems to be a good starting point. We should also consider using vllm-omni.
  • Data processing: Hugging Face provides a processor implementation, but this is typically insufficient for multi-modal inputs. Taking vision as an example, different models have different input pre-processing methods (see Qwen-3-VL). This might not be an issue for all models; Qwen-3-omni seems to have a unified processor.
  • Processor consistency: We need to ensure consistency between vllm’s multi-modal processor and the training backend processor. vllm expects PIL or bytes for images, not processed image patches. As a result, we need to make sure the training backend receives the same inputs; otherwise training dynamics will be off.
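To make the nested-TensorBatch option from the first item concrete, here is a minimal sketch. `NestedBatch` is a hypothetical stand-in and not the actual skyrl-train `TensorBatch` class. Plain Python lists stand in for tensors so the example runs without torch, and the sketch assumes every leaf shares the same leading batch dimension, which may not hold for every model's vision inputs.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union

# Stand-in for torch.Tensor: anything that supports len() and slicing.
Leaf = Any


@dataclass
class NestedBatch:
    """Hypothetical batch container whose values are leaves or nested batches."""

    data: Dict[str, Union["NestedBatch", Leaf]] = field(default_factory=dict)

    def __len__(self) -> int:
        # Every entry, nested or not, must agree on the batch size.
        sizes = {len(value) for value in self.data.values()}
        if len(sizes) > 1:
            raise ValueError(f"inconsistent batch sizes: {sorted(sizes)}")
        return sizes.pop() if sizes else 0

    def __getitem__(self, key: Union[str, slice]):
        if isinstance(key, str):
            return self.data[key]
        if isinstance(key, slice):
            # Slicing recurses: nested batches slice themselves, leaves slice natively.
            return NestedBatch({k: v[key] for k, v in self.data.items()})
        raise TypeError(f"unsupported key type: {type(key).__name__}")


batch = NestedBatch({
    "input_ids": [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    "vision_inputs": NestedBatch({
        "pixel_values": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    }),
})

# Micro-batching slices text and vision inputs together.
micro = batch[0:2]
```

The appeal of nesting the same container type is that operations such as slicing, length checks, and (in a real implementation) device transfer are written once and recurse, rather than being special-cased for dictionaries.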
