🚀 AGNOSTOS: Exploring the Limits of Vision-Language-Action Manipulations in Cross-task Generalization

Project Page · Paper · Hugging Face Data · Hugging Face Model · Video
Jiaming Zhou¹, Ke Ye¹, Jiayi Liu¹, Teli Ma¹, Zifan Wang¹, Ronghe Qiu¹, Kun-Yu Lin², Zhilin Zhao³, Junwei Liang¹,⁴
¹HKUST (Guangzhou), ²HKU, ³SYSU, ⁴HKUST

πŸ“ Overview

This project introduces AGNOSTOS, a simulation manipulation benchmark designed to rigorously evaluate the cross-task zero-shot generalization of Vision-Language-Action (VLA) models, and proposes Cross-Task In-Context Manipulation (X-ICM), a method that significantly improves cross-task generalization.

🔧 Environment Setup

🐳 Option 1: Using Docker

Please refer to INSTALL_docker.md to initialize your environment.

βš™οΈ Option 2: Local Installation

For a simplified installation using modern package management, we recommend Pixi. Install it via the official guide, then set up the dependencies with a few commands:

git clone https://github.com/jiaming-zhou/X-ICM.git && cd X-ICM
pixi shell  # Install dependencies and enter virtual environment
pixi run setup_env  # Install additional system dependencies (xvfb, CoppeliaSim, flash-attention, etc.)

For more options, run pixi run --list.

⚠️ Important: You need to install CUDA 12.4 before running the commands above.

💡 For mainland China users: In pixi.toml, comment out the default lines and uncomment the mirror lines in [workspace] and [pypi-options] sections for faster installation.
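
After setup, an optional sanity check like the sketch below (run inside pixi shell) confirms that the environment can see CUDA and your GPUs. It assumes PyTorch is among the dependencies Pixi resolves; treat it as an illustrative check rather than part of the official setup.

# Optional check; assumes PyTorch was installed by the Pixi environment.
import torch

print("torch version:", torch.__version__)
print("CUDA runtime :", torch.version.cuda)       # expected to report 12.4
print("GPUs visible :", torch.cuda.device_count())
assert torch.cuda.is_available(), "CUDA not available; re-check the driver/toolkit install."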

📊 AGNOSTOS Benchmark Data

The benchmark consists of two parts. To download the data, use the following Pixi tasks:

pixi run get_seen_tasks  # Downloads and extracts the 18 seen tasks (~140 GB)
pixi run get_unseen_tasks  # Downloads and extracts the 23 unseen tasks (~20.2 GB)

Data will be placed in the data/ directory. For manual download instructions, see MANUAL_DATA_DOWNLOAD.md.
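
To roughly verify the download, a check like the sketch below counts the task folders in each split under data/. The split folder names follow the repository structure section later in this README, and the expected counts (18 seen, 23 unseen) come from the benchmark description; the per-task folder names themselves are not assumed.

# Rough post-download check: count task folders in each split under data/.
from pathlib import Path

for split, expected in [("seen_tasks", 18), ("unseen_tasks", 23)]:
    root = Path("data") / split
    n = sum(1 for p in root.iterdir() if p.is_dir()) if root.is_dir() else 0
    print(f"{split}: {n} task folders (expected {expected})")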

🤖 X-ICM Method

1️⃣ Model Download

To download the pre-trained dynamics diffusion model, run:

pixi run get_model

The model will be extracted to data/dynamics_diffusion/. For manual download instructions, see MANUAL_DATA_DOWNLOAD.md.

2️⃣ Evaluation

Parameters (Click to expand)
### set seed numbers for different runs
seeds: [example: "0,99"]
### set the number of rollouts for each run
episodes: [example: 25]
### set the LLM to use
modelname: [example: Qwen2.5.7B.instruct]
### set the number of cross-task in-context samples
num_icls: [example: 18]
### set the gpu list
gpu_ids: [example: 0,1]
### set the in-context sample selection method
ranking_method: [example: "lang_vis.out"]

For dynamics-guided in-context manipulation:

pixi run eval_xicm "0,99" 25 Qwen2.5.7B.instruct 18 0,1 "lang_vis.out"

For random selection of cross-task samples:

pixi run eval_xicm "0,99" 25 Qwen2.5.7B.instruct 18 0,1 "random"

After testing, use gather_score.py to collect and analyze results.

💡 Note: Download the required models (Stable Diffusion and the Qwen LLM) from Hugging Face and configure their paths in main.py and rlbench_inference_dynamics_diffusion.py.
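
A hedged sketch of how these downloads could be scripted with huggingface_hub is shown below. The repository IDs and local directories are placeholders chosen for illustration (a Qwen2.5 7B instruct model matching the Qwen2.5.7B.instruct setting above, and an example Stable Diffusion checkpoint), not values prescribed by this repo; substitute the checkpoints your setup actually uses and point the paths in main.py and rlbench_inference_dynamics_diffusion.py at the resulting folders.

# Illustrative download helper; repo IDs and local_dir paths are placeholders.
from huggingface_hub import snapshot_download

# Qwen LLM (example: a 7B instruct model matching the Qwen2.5.7B.instruct setting)
snapshot_download(repo_id="Qwen/Qwen2.5-7B-Instruct",
                  local_dir="checkpoints/Qwen2.5-7B-Instruct")

# Stable Diffusion checkpoint for the dynamics diffusion module (example repo ID)
snapshot_download(repo_id="stable-diffusion-v1-5/stable-diffusion-v1-5",
                  local_dir="checkpoints/stable-diffusion-v1-5")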

🎯 Benchmarking Results on All 23 Unseen Tasks

We provide the test results of our X-ICM (7B) and X-ICM (72B) models in the logs/ sub-folder.

  • X-ICM (7B) achieves a 23.5% average success rate and X-ICM (72B) achieves 30.1%; both versions outperform all existing VLA models.
  • X-ICM (7B) fails on only two tasks, while X-ICM (72B) succeeds on all tasks.

🔬 Benchmarking Your VLA Model on AGNOSTOS

1️⃣ Fine-tuning

Due to the embodiment gap, existing VLA models need to be fine-tuned on RLBench data.

Please follow your VLA model's fine-tuning guidelines to fine-tune it on our 18 seen tasks, then test it on our 23 unseen tasks.

Example: Fine-tuning Qwen2.5-VL (Click to expand)

We provide a complete fine-tuning pipeline for Qwen2.5-VL, using the Qwen2-VL-Finetune framework:

# Download seen tasks data (18 tasks, ~140GB)
pixi run get_seen_tasks

# Basic training with default parameters
bash scripts/train_qwen2.5VL_sft.sh

# Custom training: bash scripts/train_qwen2.5VL_sft.sh [LR_LLM] [LR_VISION] [LR_MERGER] [EPOCHS] [BS] [GPUS]
# LR_LLM: LLM learning rate (default: 1e-4)
# LR_VISION: Vision tower learning rate (default: 2e-5)
# LR_MERGER: MLP learning rate (default: 1e-5)
# EPOCHS: Training epochs (default: 20)
# BS: Batch size per GPU (default: 128)
# GPUS: GPU IDs (default: 0,1,2,3,4,5,6,7)

# Example with custom parameters:
bash scripts/train_qwen2.5VL_sft.sh 1e-4 2e-5 1e-5 20 128 0,1,2,3,4,5,6,7

2️⃣ Testing Fine-tuned VLA Models

For Generic VLA Models

Modify the custom_agent.py file:

  1. Load your VLA model in the load_weights function;

  2. Implement VLA model inference in the _inference function, including input construction and output format conversion (a rough skeleton is sketched after the note below);

  3. Run the evaluation:

    bash scripts/eval_CustomModel.sh seeds episodes gpu_ids

    Example:

    bash scripts/eval_CustomModel.sh "0,99" 25 0,1

💡 Note: Different VLA models may require different input image sizes (the default is 256×256). Please modify IMAGE_SIZE in main_custom.py accordingly.
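
For orientation, a minimal skeleton of these two functions is sketched below. The class name, method signatures, and observation/action formats are illustrative assumptions rather than the actual interface of custom_agent.py; follow the template in that file when wiring up your model.

# Illustrative skeleton only; mirror the actual interface defined in custom_agent.py.
import numpy as np

class CustomVLAAgent:
    def load_weights(self, checkpoint_path):
        # Step 1: load your fine-tuned VLA model (plus any processor/tokenizer).
        self.model = None  # e.g. YourVLAModel.from_pretrained(checkpoint_path)

    def _inference(self, observation):
        # Step 2a: input construction, e.g. pack the RGB images, proprioception and
        # language instruction from the observation into your model's input format.
        model_inputs = observation  # placeholder
        # Step 2b: run the model, e.g. raw_output = self.model(model_inputs).
        # Step 2c: output conversion, e.g. map the prediction to an end-effector
        # pose plus gripper state in the format the evaluator expects.
        action = np.zeros(8)  # placeholder layout: [x, y, z, qx, qy, qz, qw, gripper]
        return action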

Example: Testing Qwen2.5-VL on AGNOSTOS (Click to expand)

After fine-tuning, evaluate your Qwen2.5-VL model on AGNOSTOS:

# bash scripts/eval_qwen2.5VL_sft.sh [MODE] [CHECKPOINT] [SEEDS] [EPISODES] [GPU_ID] [H_LEN] [T_LEN] [STEPS] [START] [NUM]
# MODE: ood (23 unseen tasks) or withintask (18 seen tasks)
# CHECKPOINT: Path to model checkpoint (e.g., outputs/checkpoint-1860)
# SEEDS: Random seeds (e.g., 0,1,2)
# EPISODES: Number of rollouts per task
# GPU_ID: GPU device ID
# H_LEN: History length | T_LEN: Target length | STEPS: Episode steps
# START: Start task index | NUM: Number of tasks to evaluate

# OOD evaluation on all 23 unseen tasks
bash scripts/eval_qwen2.5VL_sft.sh ood outputs/checkpoint-1860 0 25 0 1 1 25 0 23

# WithinTask evaluation on all 18 seen tasks
bash scripts/eval_qwen2.5VL_sft.sh withintask outputs/checkpoint-1860 0 25 0 1 1 25 0 18

# OOD with multiple seeds
bash scripts/eval_qwen2.5VL_sft.sh ood outputs/checkpoint-1860 0,1,2 25 0 1 1 25 0 23

📂 Repository Structure

Directory Overview (Click to expand)
X-ICM/
├── data/                          # Dataset and models
│   ├── seen_tasks/                # 18 seen training tasks (~140 GB)
│   ├── unseen_tasks/              # 23 unseen evaluation tasks (~20.2 GB)
│   └── dynamics_diffusion/        # Pre-trained dynamics diffusion model
├── scripts/                       # Training and evaluation scripts
│   ├── train_qwen2.5VL_sft.sh     # Qwen2.5-VL fine-tuning script
│   ├── eval_qwen2.5VL_sft.sh      # Qwen2.5-VL evaluation script
│   └── eval_XICM.sh               # X-ICM evaluation script
├── qwen2vl_finetune/              # Qwen2.5-VL fine-tuning module
├── RLBench/                       # RLBench manipulation benchmark
├── YARR/                          # YARR rollout/training framework
├── PyRep/                         # PyRep Python API for CoppeliaSim
├── CoppeliaSim/                   # CoppeliaSim simulator
├── main.py                        # X-ICM inference entry point
├── main_custom.py                 # Generic VLA model evaluation
├── custom_agent.py                # Custom VLA agent template
├── rlbench_inference_dynamics_diffusion.py   # Dynamics diffusion inference on RLBench
└── gather_score.py                # Result aggregation script

πŸ™ Acknowledgments

This repository is built upon RoboPrompt and Qwen2-VL-Finetune. Some resources from RVT and RLBench are also used in this work.

📄 Citation

If you find our work helpful to your research, please give us a star and cite our paper.

@inproceedings{zhou2025exploring,
    title={Exploring the Limits of Vision-Language-Action Manipulation in Cross-task Generalization},
    author={Jiaming Zhou and Ke Ye and Jiayi Liu and Teli Ma and Zifan Wang and Ronghe Qiu and Kun-Yu Lin and Zhilin Zhao and Junwei Liang},
    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
    year={2025},
    url={https://openreview.net/forum?id=h6xQClTm4W}
}
