Micro-World is a series of action-controlled interactive world models developed by the AMD AIG team and trained on AMD Instinct™ MI250/MI325 GPUs. It includes both text-to-video and image-to-video variants, enabling a wide range of application scenarios. Built on Wan as the base model, Micro-World is trained on a Minecraft dataset and is designed to generate high-quality, open-domain visual environments.
Our model is built for AMD GPUs and the ROCm environment.
We strongly recommend using the provided Docker environment.
# build image
docker build -t microworld:latest .
# enter image
docker run -it --rm --name=agent --network=host \
--device=/dev/kfd --device=/dev/dri --group-add=video --group-add=render \
--ipc=host \
-e NCCL_IB_DISABLE=1 \
-e NCCL_SOCKET_IFNAME=lo \
-e NCCL_SOCKET_FAMILY=AF_INET \
-e NCCL_DEBUG=WARN \
-e RCCL_MSCCLPP_ENABLE=0 \
-e RCCL_MSCCL_ENABLE=0 \
-e NCCL_MIN_NCHANNELS=16 -e NCCL_MAX_NCHANNELS=32 \
microworld:latest
conda create -n AMD_microworld python=3.12
conda activate AMD_microworld
# Install torch/torchvision
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/rocm6.4
# Important: Install the ROCm build of flash-attn following the official guide: https://github.com/Dao-AILab/flash-attention
pip install -r requirements.txt
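After installing the dependencies, it can be useful to confirm that PyTorch sees the ROCm runtime before moving on. The helper below is a small sketch (not part of the repository); it relies on the fact that ROCm builds of PyTorch set `torch.version.hip` and expose GPUs through the CUDA API.

```python
# Hypothetical sanity check for the ROCm PyTorch install (not part of Micro-World).
def check_rocm_env():
    """Return a small report on the PyTorch/ROCm setup."""
    report = {"torch_installed": False, "hip_version": None, "gpu_available": False}
    try:
        import torch
    except ImportError:
        # torch is not installed yet; report that instead of raising.
        return report
    report["torch_installed"] = True
    # On ROCm wheels torch.version.hip is a version string; on CUDA wheels it is None.
    report["hip_version"] = getattr(torch.version, "hip", None)
    # ROCm devices are exposed through torch's CUDA API.
    report["gpu_available"] = torch.cuda.is_available()
    return report

if __name__ == "__main__":
    print(check_rocm_env())
```

If `hip_version` is `None` or `gpu_available` is `False` inside the container, re-check the `--device=/dev/kfd --device=/dev/dri` flags in the `docker run` command above.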
- Step 1: Download the corresponding model weights and place them in the `models` folder.
# Download T2W model weights:
hf download amd/Micro-World-T2W --local-dir models/T2W
# Download I2W model weights:
hf download amd/Micro-World-I2W --local-dir models/I2W
- Step 2: We provide action-controlled model inference scripts under the `examples` folder.
  - Modify the config in the script, such as `transformer_path`, `lora_path`, `prompt`, `neg_prompt`, `GPU_memory_mode`, `validation_image_start`, and `action_list`. `validation_image_start` is the reference image path for image-to-video generation. `action_list` follows the format `[[{end frame}, "w, s, a, d, shift, ctrl, _, mouse_y, mouse_x"], ..., "{space frames}"]`. For example, `[[10, "0 1 0 0 0 0 0 0 0"], [80, "0 0 0 1 0 0 0 0 0"], "30 65"]` means: press `s` from frame 0 to frame 10, press `d` from frame 10 to frame 80, and press space at frames 30 and 65.
- For example, you can run T2W action-controlled model inference using the following command:
python examples/wan2.1/predict_t2w_action_control.py
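To make the `action_list` format concrete, here is a hypothetical helper (not part of the repository) that expands an action list into a per-frame table of key states. It assumes the key order given above (`w, s, a, d, shift, ctrl, _, mouse_y, mouse_x`) and treats each segment's end frame as exclusive, which is one reading of the example's frame ranges.

```python
# Hypothetical expansion of the Micro-World action_list format (an illustration,
# not the repository's own parser). Key order follows the README.
KEYS = ["w", "s", "a", "d", "shift", "ctrl", "_", "mouse_y", "mouse_x"]

def expand_action_list(action_list, num_frames):
    """Return, per frame, a dict of key values plus a boolean 'space' flag."""
    # Segments are [end_frame, "v v v ..."] lists; the trailing string holds
    # the frames at which space is pressed.
    segments = [item for item in action_list if isinstance(item, list)]
    space_frames = set()
    for item in action_list:
        if isinstance(item, str):
            space_frames = {int(f) for f in item.split()}
    frames = []
    seg_idx = 0
    for frame in range(num_frames):
        # Advance to the segment whose (exclusive) end frame covers this frame.
        while seg_idx < len(segments) and frame >= segments[seg_idx][0]:
            seg_idx += 1
        if seg_idx < len(segments):
            values = [int(v) for v in segments[seg_idx][1].split()]
        else:
            values = [0] * len(KEYS)  # past the last segment: no keys pressed
        state = dict(zip(KEYS, values))
        state["space"] = frame in space_frames
        frames.append(state)
    return frames

# The example from the README: s for frames 0-10, d for frames 10-80,
# space at frames 30 and 65.
actions = [[10, "0 1 0 0 0 0 0 0 0"], [80, "0 0 0 1 0 0 0 0 0"], "30 65"]
table = expand_action_list(actions, 81)
```

Whether boundary frames belong to the ending or the starting segment is an assumption here; check the repository's own parsing code before relying on it.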
We provide our collected Minecraft action-controlled dataset, the game LoRA, and the action-controlled model weights.
- Step 1: Download your target base model weights and place them in the `models` folder.
# Download Wan2.1 T2V model weights:
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir models/Diffusion_Transformer/Wan2.1-T2V-1.3B
# Download pretrained model weights and lora
hf download amd/Micro-World-T2W --local-dir models/T2W
# Download Wan2.1 I2V model weights:
hf download Wan-AI/Wan2.1-I2V-14B-480P --local-dir models/Diffusion_Transformer/Wan2.1-I2V-14B-480P
# Download pretrained model weights and lora
hf download amd/Micro-World-I2W --local-dir models/I2W
# Download train dataset:
hf download amd/Micro-World-MC-Dataset --local-dir datasets --repo-type=dataset
- Step 2: We provide action-controlled model training scripts under the `scripts` folder.
  - Modify the config in the bash script.
- For example, you can run T2W action-controlled model training using the following command:
bash scripts/wan2.1/train_game_action_t2w.sh
We follow the evaluation protocol from Matrix-Game to assess image quality, action controllability, and temporal stability.
- Step 1: Download the IDM model and weights from VPT, and place them under `test_metrics/idm_model`.
- Step 2: Clone GameWorldScore and move the folder into `test_metrics/GameWorldScore`.
- Step 3: Run the evaluation script.
cd test_metrics
# --visual_metrics: evaluate image quality (PSNR, LPIPS, FVD)
# --control_metrics: evaluate action controllability
# --temporal_metrics: evaluate temporal stability
python evaluate.py \
    --model idm_model/4x_idm.model \
    --weights idm_model/4x_idm.weights \
    --visual_metrics \
    --control_metrics \
    --temporal_metrics \
    --infer-demo-num 0 --n-frames 81 \
    --video-path your_testing_output_path \
    --ref-path dataset_val/video \
    --output-file "metrics_log/ground_truth/idm_res_pred.json" \
    --json-path dataset_val/metadata-detection/
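Of the image-quality metrics above, PSNR is the simplest to state precisely. The snippet below is a minimal reference implementation for two same-shape frames (it is not the GameWorldScore code, which should be used for actual evaluation).

```python
# Minimal PSNR reference for two frames, as an illustration of the metric only;
# use the GameWorldScore implementation for real evaluation runs.
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio between two same-shape arrays, in dB."""
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher is better; identical frames give infinite PSNR, and a frame pair differing by the full dynamic range scores 0 dB.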
| Video | Action |
| --- | --- |
| mc-w.mp4 | W |
| mc-s.mp4 | S |
| mc-a.mp4 | A |
| mc-d.mp4 | D |
| mc-w-ctrl.mp4 | W+Ctrl |
| mc-w-shift.mp4 | W+Shift |
| mc-multicontrol.mp4 | Multiple controls |
| mc-mouse_du.mp4 | Mouse down and up |
| mc-mouse_rl.mp4 | Mouse right and left |
| Video | Prompt |
| --- | --- |
| livingroom-m.mp4 | A cozy living room with sunlight streaming through window, vintage furniture, soft shadows. |
| livingroom-w.mp4 | A cozy living room with sunlight streaming through window, vintage furniture, soft shadows. |
| cliff.mp4 | Running along a cliffside path in a tropical island in first person perspective, with turquoise waters crashing against the rocks far below, the salty scent of the ocean carried by the breeze, and the sound of distant waves blending with the calls of seagulls as the path twists and turns along the jagged cliffs. |
| bear.mp4 | A young bear stands next to a large tree in a grassy meadow, its dark fur catching the soft daylight. The bear seems poised, observing its surroundings in a tranquil landscape, with rolling hills and sparse trees dotting the background under a pale blue sky. |
| panda.mp4 | A giant panda rests peacefully under a blooming cherry blossom tree, its black and white fur contrasting beautifully with the delicate pink petals. The ground is lightly sprinkled with fallen blossoms, and the tranquil setting is framed by the soft hues of the blossoms and the grassy field surrounding the tree. |
| ruin.mp4 | Exploring an ancient jungle ruin in first person perspective surrounded by towering stone statues covered in moss and vines. |
We observe that fully decoupling the action module from game-specific styles in large-scale models remains challenging. As a result, we apply both the LoRA weights and the action module during inference for the I2W results.
| Video | Prompt |
| --- | --- |
| 20251219-1210.mp4 | First-person perspective walking down a lively city street at night. Neon signs and bright billboards glow on both sides, cars drive past with headlights and taillights streaking slightly. camera motion directly aligned with user actions, immersive urban night scene. |
| 20251219-1201.mp4 | First-person perspective standing in front of an ornate traditional Chinese temple. The symmetrical facade features red lanterns, intricate carvings, and a curved tiled roof decorated with dragons. Bright daytime lighting, consistent environment, camera motion directly aligned with user actions, immersive and interactive exploration. |
| 20251218-1053.mp4 | First-person perspective of standing in a rocky desert valley, looking at a camel a few meters ahead. The camel stands calmly on uneven stones, its long legs and single hump clearly visible. Bright midday sunlight, dry air, muted earth tones, distant barren mountains. Natural handheld camera feeling, camera motion controlled by user actions, smooth movement, cinematic realism. |
| 20251218-1031.mp4 | First-person perspective walking through a narrow urban alley, old red brick industrial buildings on both sides, cobblestone street stretching forward with strong depth, metal walkways connecting buildings above, overcast daylight, soft diffused lighting, cool and muted color tones, quiet and empty environment, no people, camera motion controlled by user actions, smooth movement, stable horizon, realistic scale and geometry, high realism, cinematic urban scene. |
| 20251218-0352.mp4 | First-person perspective coastal exploration scene, walking along a cliffside stone path with wooden railings, green bushes lining the walkway, ocean to the left with gentle waves, distant islands visible under a clear sky, realistic head-mounted camera view, smooth forward motion, stable horizon, natural human eye level, high realism, consistent environment, camera motion directly aligned with user actions, immersive and interactive exploration. |
| 20251218-0815.mp4 | First-person perspective inside a cozy living room, walking around a warm fireplace, soft carpet underfoot, furniture arranged neatly, bookshelves, plants, and warm table lamps on both sides, warm indoor lighting, calm and quiet atmosphere, natural head-level camera movement, camera motion driven by user actions, realistic scale and depth, high realism, cinematic lighting, no people, no distortion. |
- Text-to-World: Micro-World-T2W
- Image-to-World: Micro-World-I2W
Our codebase is built upon Wan2.1 and VideoX-Fun. We sincerely thank the authors for open-sourcing their excellent codebases.
Our datasets are collected using MineDojo and captioned with MiniCPM-V. We also extend our appreciation to the respective teams for their high-quality tools and contributions.