AMD Micro-World: Action-Controlled Interactive World Model

Introduction

Micro-World is a series of action-controlled interactive world models developed by the AMD AIG team and trained on AMD Instinct™ MI250/MI325 GPUs. It includes both text-to-video (T2W) and image-to-video (I2W) variants, enabling a wide range of application scenarios. Built on Wan as the base model, Micro-World is trained on a Minecraft dataset and is designed to generate high-quality, open-domain visual environments.

Quick Start

Installation

Our model is built for AMD GPUs and the ROCm environment.

a. From Docker

We strongly recommend using the Docker environment.

# build image
docker build -t microworld:latest .

# run a container from the image
docker run -it --rm --name=agent --network=host \
  --device=/dev/kfd --device=/dev/dri --group-add=video --group-add=render \
  --ipc=host \
  -e NCCL_IB_DISABLE=1 \
  -e NCCL_SOCKET_IFNAME=lo \
  -e NCCL_SOCKET_FAMILY=AF_INET \
  -e NCCL_DEBUG=WARN \
  -e RCCL_MSCCLPP_ENABLE=0 \
  -e RCCL_MSCCL_ENABLE=0 \
  -e NCCL_MIN_NCHANNELS=16 -e NCCL_MAX_NCHANNELS=32 \
  microworld:latest
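
Once inside the container, a quick sanity check can confirm that the MI250/MI325 devices are visible; this is a minimal sketch assuming the image ships a ROCm build of PyTorch:

# gpu_check.py -- minimal sanity check (assumes a ROCm build of PyTorch)
import torch

print(torch.__version__)          # PyTorch version inside the image
print(torch.version.hip)          # ROCm/HIP version string (None on CUDA builds)
print(torch.cuda.is_available())  # True when the AMD GPUs are visible
print(torch.cuda.device_count())  # number of visible devices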

b. Conda

conda create -n AMD_microworld python=3.12
conda activate AMD_microworld

# Install torch/torchvision
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/rocm6.4

# Important: install the ROCm flash-attn build following the official guideline: https://github.com/Dao-AILab/flash-attention
pip install -r requirements.txt
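
To verify the Conda environment, the following sketch (assuming the packages above installed cleanly) checks that the ROCm PyTorch build and flash-attn import as expected:

# env_check.py -- assumes torch (ROCm wheel) and flash-attn are installed
import torch
import flash_attn

print(torch.version.hip)       # ROCm/HIP version, e.g. a 6.4.x string
print(flash_attn.__version__)  # the flash-attn build installed for ROCm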

Inference

  • Step 1: Download the corresponding model weights and place them in the models folder.
# Download T2W model weights:
hf download amd/Micro-World-T2W --local-dir models/T2W

# Download I2W model weights:
hf download amd/Micro-World-I2W --local-dir models/I2W
  • Step 2: We provide action-controlled model inference scripts under the examples folder.
    • Modify the config in the script, e.g. transformer_path, lora_path, prompt, neg_prompt, GPU_memory_mode, validation_image_start, and action_list.
      • validation_image_start is the path to the reference image for image-to-video generation.
      • action_list follows the format [[{end frame}, "w, s, a, d, shift, ctrl, _, mouse_y, mouse_x"], ..., "{space frames}"]. For example, [[10, "0 1 0 0 0 0 0 0 0"], [80, "0 0 0 1 0 0 0 0 0"], "30 65"] means: press s from frame 0 to frame 10, press d from frame 10 to frame 80, and press space at frames 30 and 65 (see the sketch after the command below).
    • For example, you can run T2W action-controlled model inference with the following command:
python examples/wan2.1/predict_t2w_action_control.py
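
For illustration, the same action configuration written out in Python. This is only a sketch: the variable names mirror the config fields listed above and are not a verified excerpt from examples/wan2.1/predict_t2w_action_control.py.

# Hypothetical config values; field names follow the README description above.
# Each entry is [end_frame, "w s a d shift ctrl _ mouse_y mouse_x"], and the
# final string lists the frames at which space is pressed.
action_list = [
    [10, "0 1 0 0 0 0 0 0 0"],  # hold `s` from frame 0 to frame 10
    [80, "0 0 0 1 0 0 0 0 0"],  # hold `d` from frame 10 to frame 80
    "30 65",                    # press space at frames 30 and 65
]

validation_image_start = "path/to/reference.png"  # only needed for image-to-video (I2W)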

Training

We provide our collected Minecraft action-controlled dataset, the game LoRA, and the action-controlled model weights.

  • Step 1: Download your target base model weights and place them in the models folder (a Python alternative to the hf CLI is sketched after this list).
# Download Wan2.1 T2V model weights:
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir models/Diffusion_Transformer/Wan2.1-T2V-1.3B

# Download pretrained model weights and lora
hf download amd/Micro-World-T2W --local-dir models/T2W

# Download Wan2.1 I2V model weights:
hf download Wan-AI/Wan2.1-I2V-14B-480P --local-dir models/Diffusion_Transformer/Wan2.1-I2V-14B-480P

# Download pretrained model weights and lora
hf download amd/Micro-World-I2W --local-dir models/I2W

# Download train dataset:
hf download amd/Micro-World-MC-Dataset --local-dir datasets --repo-type=dataset
  • Step 2: We provide action-controlled model training scripts under the scripts folder.
    • Modify the config in the bash script.
    • For example, you can run T2W action-controlled model training with the following command:
bash scripts/wan2.1/train_game_action_t2w.sh
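
The hf download commands above can also be run from Python; a minimal sketch using huggingface_hub, with the same repo IDs and target folders as the CLI commands:

# download_assets.py -- Python equivalent of the `hf download` commands above
from huggingface_hub import snapshot_download

snapshot_download("Wan-AI/Wan2.1-T2V-1.3B",
                  local_dir="models/Diffusion_Transformer/Wan2.1-T2V-1.3B")
snapshot_download("amd/Micro-World-T2W", local_dir="models/T2W")
snapshot_download("amd/Micro-World-MC-Dataset", repo_type="dataset",
                  local_dir="datasets")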

Evaluation

We follow the evaluation protocol from Matrix-Game to assess image quality, action controllability, and temporal stability.

  • Step 1: Download the IDM model and weights from VPT, and place them under test_metrics/idm_model.
  • Step 2: Clone GameWorldScore and move the folder into test_metrics/GameWorldScore.
  • Step 3: Run the evaluation script.
cd test_metrics
# --visual_metrics evaluates image quality (PSNR, LPIPS, FVD)
# --control_metrics evaluates action controllability
# --temporal_metrics evaluates temporal stability
python evaluate.py \
        --model idm_model/4x_idm.model \
        --weights idm_model/4x_idm.weights \
        --visual_metrics \
        --control_metrics \
        --temporal_metrics \
        --infer-demo-num 0 --n-frames 81 \
        --video-path your_testing_output_path \
        --ref-path dataset_val/video \
        --output-file "metrics_log/ground_truth/idm_res_pred.json" \
        --json-path dataset_val/metadata-detection/
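
After the run completes, the metrics are written to the JSON file given by --output-file. A minimal sketch for inspecting it, assuming the file holds a single JSON object (the exact keys depend on evaluate.py):

# inspect_metrics.py -- prints whatever top-level entries evaluate.py wrote
import json

with open("metrics_log/ground_truth/idm_res_pred.json") as f:
    results = json.load(f)

for name, value in results.items():
    print(f"{name}: {value}")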

Video Results

T2W Model

In Domain

mc-w.mp4
W
mc-s.mp4
S
mc-a.mp4
A
mc-d.mp4
D
mc-w-ctrl.mp4
W+Ctrl
mc-w-shift.mp4
W+Shift
mc-multicontrol.mp4
Multiple controls
mc-mouse_du.mp4
Mouse down and up
mc-mouse_rl.mp4
Mouse right and left

Open Domain

livingroom-m.mp4
View Prompt
A cozy living room with sunlight streaming through window, vintage furniture, soft shadows.
livingroom-w.mp4
View Prompt
A cozy living room with sunlight streaming through window, vintage furniture, soft shadows.
cliff.mp4
View Prompt
Running along a cliffside path in a tropical island in first person perspective, with turquoise waters crashing against the rocks far below, the salty scent of the ocean carried by the breeze, and the sound of distant waves blending with the calls of seagulls as the path twists and turns along the jagged cliffs.
bear.mp4
View Prompt
A young bear stands next to a large tree in a grassy meadow, its dark fur catching the soft daylight. The bear seems poised, observing its surroundings in a tranquil landscape, with rolling hills and sparse trees dotting the background under a pale blue sky.
panda.mp4
View Prompt
A giant panda rests peacefully under a blooming cherry blossom tree, its black and white fur contrasting beautifully with the delicate pink petals. The ground is lightly sprinkled with fallen blossoms, and the tranquil setting is framed by the soft hues of the blossoms and the grassy field surrounding the tree.
ruin.mp4
View Prompt
Exploring an ancient jungle ruin in first person perspective surrounded by towering stone statues covered in moss and vines.

I2W Model

We observe that fully decoupling the action module from game-specific styles in large-scale models remains challenging. As a result, we apply both the LoRA weights and the action module during inference for the I2W results.

20251219-1210.mp4
View Prompt
First-person perspective walking down a lively city street at night. Neon signs and bright billboards glow on both sides, cars drive past with headlights and taillights streaking slightly. camera motion directly aligned with user actions, immersive urban night scene.
20251219-1201.mp4
View Prompt
First-person perspective standing in front of an ornate traditional Chinese temple. The symmetrical facade features red lanterns, intricate carvings, and a curved tiled roof decorated with dragons. Bright daytime lighting, consistent environment, camera motion directly aligned with user actions, immersive and interactive exploration.
20251218-1053.mp4
View Prompt
First-person perspective of standing in a rocky desert valley, looking at a camel a few meters ahead. The camel stands calmly on uneven stones, its long legs and single hump clearly visible. Bright midday sunlight, dry air, muted earth tones, distant barren mountains. Natural handheld camera feeling, camera motion controlled by user actions, smooth movement, cinematic realism.
20251218-1031.mp4
View Prompt
First-person perspective walking through a narrow urban alley, old red brick industrial buildings on both sides, cobblestone street stretching forward with strong depth, metal walkways connecting buildings above, overcast daylight, soft diffused lighting, cool and muted color tones, quiet and empty environment, no people, camera motion controlled by user actions, smooth movement, stable horizon, realistic scale and geometry, high realism, cinematic urban scene.
20251218-0352.mp4
View Prompt
First-person perspective coastal exploration scene, walking along a cliffside stone path with wooden railings, green bushes lining the walkway, ocean to the left with gentle waves, distant islands visible under a clear sky, realistic head-mounted camera view, smooth forward motion, stable horizon, natural human eye level, high realism, consistent environment, camera motion directly aligned with user actions, immersive and interactive exploration.
20251218-0815.mp4
View Prompt
First-person perspective inside a cozy living room, walking around a warm fireplace, soft carpet underfoot, furniture arranged neatly, bookshelves, plants, and warm table lamps on both sides, warm indoor lighting, calm and quiet atmosphere, natural head-level camera movement, camera motion driven by user actions, realistic scale and depth, high realism, cinematic lighting, no people, no distortion.

🤗 Resources

Technical blog

Pre-trained models

Dataset

Acknowledgement

Our codebase is built upon Wan2.1 and VideoX-Fun. We sincerely thank the authors for open-sourcing their excellent codebases.

Our datasets are collected using MineDojo and captioned with MiniCPM-V. We also extend our appreciation to the respective teams for their high-quality tools and contributions.
