Micro-World is a series of action-controlled interactive world models developed by the AMD AIG team and trained on AMD Instinct™ MI250/MI325 GPUs. It includes both text-to-video and image-to-video variants, enabling a wide range of application scenarios. Built on Wan as the base model, Micro-World is trained on a Minecraft dataset and is designed to generate high-quality, open-domain visual environments.
Our model is built for AMD GPUs and the ROCm environment.
We strongly recommend using the provided Docker environment.
# build image
docker build -t microworld:latest .
# enter image
docker run -it --rm --name=agent --network=host \
--device=/dev/kfd --device=/dev/dri --group-add=video --group-add=render \
--ipc=host \
-e NCCL_IB_DISABLE=1 \
-e NCCL_SOCKET_IFNAME=lo \
-e NCCL_SOCKET_FAMILY=AF_INET \
-e NCCL_DEBUG=WARN \
-e RCCL_MSCCLPP_ENABLE=0 \
-e RCCL_MSCCL_ENABLE=0 \
-e NCCL_MIN_NCHANNELS=16 -e NCCL_MAX_NCHANNELS=32 \
microworld:latest
conda create -n AMD_microworld python=3.12
conda activate AMD_microworld
# Install torch/torchvision
pip install torch==2.8.0 torchvision==0.23.0 torchaudio==2.8.0 --index-url https://download.pytorch.org/whl/rocm6.4
# Important: Install the ROCm build of flash-attn following the official guide: https://github.com/Dao-AILab/flash-attention
pip install -r requirements.txt
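After installing the dependencies, it can be useful to confirm that PyTorch sees the ROCm runtime before moving on. The helper below is a small sketch (not part of the repository); it relies on the fact that ROCm builds of PyTorch set `torch.version.hip` and expose GPUs through the CUDA API.

```python
# Hypothetical sanity check for the ROCm PyTorch install (not part of Micro-World).
def check_rocm_env():
    """Return a small report on the PyTorch/ROCm setup."""
    report = {"torch_installed": False, "hip_version": None, "gpu_available": False}
    try:
        import torch
    except ImportError:
        # torch is not installed yet; report that instead of raising.
        return report
    report["torch_installed"] = True
    # On ROCm wheels torch.version.hip is a version string; on CUDA wheels it is None.
    report["hip_version"] = getattr(torch.version, "hip", None)
    # ROCm devices are exposed through torch's CUDA API.
    report["gpu_available"] = torch.cuda.is_available()
    return report

if __name__ == "__main__":
    print(check_rocm_env())
```

If `hip_version` is `None` or `gpu_available` is `False` inside the container, re-check the `--device=/dev/kfd --device=/dev/dri` flags in the `docker run` command above.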
- Step 1: Download the corresponding model weights and place them in the `models` folder.
# Download T2W model weights:
hf download amd/Micro-World-T2W --local-dir models/T2W
# Download I2W model weights:
hf download amd/Micro-World-I2W --local-dir models/I2W
- Step 2: We provide action-controlled model inference scripts under the `examples` folder.
  - Modify the config in the script, such as `transformer_path`, `lora_path`, `prompt`, `neg_prompt`, `GPU_memory_mode`, `validation_image_start`, and `action_list`. `validation_image_start` is the reference image path for image-to-video generation. `action_list` follows the format `[[{end frame}, "w, s, a, d, shift, ctrl, _, mouse_y, mouse_x"], ..., "{space frames}"]`. For example, `[[10, "0 1 0 0 0 0 0 0 0"], [80, "0 0 0 1 0 0 0 0 0"], "30 65"]` means: press `s` from frame 0 to frame 10, press `d` from frame 10 to frame 80, and press space at frames 30 and 65.
- For example, you can run T2W action-controlled model inference using the following command:
python examples/wan2.1/predict_t2w_action_control.py
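To make the `action_list` format concrete, here is a hypothetical helper (not part of the repository) that expands an action list into a per-frame table of key states. It assumes the key order given above (`w, s, a, d, shift, ctrl, _, mouse_y, mouse_x`) and treats each segment's end frame as exclusive, which is one reading of the example's frame ranges.

```python
# Hypothetical expansion of the Micro-World action_list format (an illustration,
# not the repository's own parser). Key order follows the README.
KEYS = ["w", "s", "a", "d", "shift", "ctrl", "_", "mouse_y", "mouse_x"]

def expand_action_list(action_list, num_frames):
    """Return, per frame, a dict of key values plus a boolean 'space' flag."""
    # Segments are [end_frame, "v v v ..."] lists; the trailing string holds
    # the frames at which space is pressed.
    segments = [item for item in action_list if isinstance(item, list)]
    space_frames = set()
    for item in action_list:
        if isinstance(item, str):
            space_frames = {int(f) for f in item.split()}
    frames = []
    seg_idx = 0
    for frame in range(num_frames):
        # Advance to the segment whose (exclusive) end frame covers this frame.
        while seg_idx < len(segments) and frame >= segments[seg_idx][0]:
            seg_idx += 1
        if seg_idx < len(segments):
            values = [int(v) for v in segments[seg_idx][1].split()]
        else:
            values = [0] * len(KEYS)  # past the last segment: no keys pressed
        state = dict(zip(KEYS, values))
        state["space"] = frame in space_frames
        frames.append(state)
    return frames

# The example from the README: s for frames 0-10, d for frames 10-80,
# space at frames 30 and 65.
actions = [[10, "0 1 0 0 0 0 0 0 0"], [80, "0 0 0 1 0 0 0 0 0"], "30 65"]
table = expand_action_list(actions, 81)
```

Whether boundary frames belong to the ending or the starting segment is an assumption here; check the repository's own parsing code before relying on it.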
We provide our collected Minecraft action-controlled dataset, the game LoRA, and the action-controlled model weights.
- Step 1: Download your target base model weights and place them in the `models` folder.
# Download Wan2.1 T2V model weights:
hf download Wan-AI/Wan2.1-T2V-1.3B --local-dir models/Diffusion_Transformer/Wan2.1-T2V-1.3B
# Download pretrained model weights and lora
hf download amd/Micro-World-T2W --local-dir models/T2W
# Download Wan2.1 I2V model weights:
hf download Wan-AI/Wan2.1-I2V-14B-480P --local-dir models/Diffusion_Transformer/Wan2.1-I2V-14B-480P
# Download pretrained model weights and lora
hf download amd/Micro-World-I2W --local-dir models/I2W
# Download train dataset:
hf download amd/Micro-World-MC-Dataset --local-dir datasets --repo-type=dataset
- Step 2: We provide action-controlled model training scripts under the `scripts` folder.
  - Modify the config in the bash script.
- For example, you can run T2W action-controlled model training using the following command:
bash scripts/wan2.1/train_game_action_t2w.sh
We follow the evaluation protocol from Matrix-Game to assess image quality, action controllability, and temporal stability.
- Step 1: Download the IDM model and weights from VPT, and place them under `test_metrics/idm_model`.
- Step 2: Clone GameWorldScore and move the folder into `test_metrics/GameWorldScore`.
- Step 3: Run the evaluation script.
cd test_metrics
# --visual_metrics: evaluate image quality (PSNR, LPIPS, FVD)
# --control_metrics: evaluate action controllability
# --temporal_metrics: evaluate temporal stability
python evaluate.py \
    --model idm_model/4x_idm.model \
    --weights idm_model/4x_idm.weights \
    --visual_metrics \
    --control_metrics \
    --temporal_metrics \
    --infer-demo-num 0 --n-frames 81 \
    --video-path your_testing_output_path \
    --ref-path dataset_val/video \
    --output-file "metrics_log/ground_truth/idm_res_pred.json" \
    --json-path dataset_val/metadata-detection/
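Of the image-quality metrics above, PSNR is the simplest to state precisely. The snippet below is a minimal reference implementation for two same-shape frames (it is not the GameWorldScore code, which should be used for actual evaluation).

```python
# Minimal PSNR reference for two frames, as an illustration of the metric only;
# use the GameWorldScore implementation for real evaluation runs.
import numpy as np

def psnr(img_a, img_b, max_val=255.0):
    """Peak signal-to-noise ratio between two same-shape arrays, in dB."""
    a = np.asarray(img_a, dtype=np.float64)
    b = np.asarray(img_b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher is better; identical frames give infinite PSNR, and a frame pair differing by the full dynamic range scores 0 dB.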
| Video | Action |
| --- | --- |
| mc-w.mp4 | W |
| mc-s.mp4 | S |
| mc-a.mp4 | A |
| mc-d.mp4 | D |
| mc-w-ctrl.mp4 | W+Ctrl |
| mc-w-shift.mp4 | W+Shift |
| mc-multicontrol.mp4 | Multiple controls |
| mc-mouse_du.mp4 | Mouse down and up |
| mc-mouse_rl.mp4 | Mouse right and left |
| Video | Prompt |
| --- | --- |
| livingroom-m.mp4 | A cozy living room with sunlight streaming through window, vintage furniture, soft shadows. |
| livingroom-w.mp4 | A cozy living room with sunlight streaming through window, vintage furniture, soft shadows. |
| cliff.mp4 | Running along a cliffside path in a tropical island in first person perspective, with turquoise waters crashing against the rocks far below, the salty scent of the ocean carried by the breeze, and the sound of distant waves blending with the calls of seagulls as the path twists and turns along the jagged cliffs. |
| bear.mp4 | A young bear stands next to a large tree in a grassy meadow, its dark fur catching the soft daylight. The bear seems poised, observing its surroundings in a tranquil landscape, with rolling hills and sparse trees dotting the background under a pale blue sky. |
| panda.mp4 | A giant panda rests peacefully under a blooming cherry blossom tree, its black and white fur contrasting beautifully with the delicate pink petals. The ground is lightly sprinkled with fallen blossoms, and the tranquil setting is framed by the soft hues of the blossoms and the grassy field surrounding the tree. |
| ruin.mp4 | Exploring an ancient jungle ruin in first person perspective surrounded by towering stone statues covered in moss and vines. |
We observe that fully decoupling the action module from game-specific styles in large-scale models remains challenging. As a result, we apply both the LoRA weights and the action module during inference for the I2W results.
| Video | Prompt |
| --- | --- |
| 20251219-1210.mp4 | First-person perspective walking down a lively city street at night. Neon signs and bright billboards glow on both sides, cars drive past with headlights and taillights streaking slightly. camera motion directly aligned with user actions, immersive urban night scene. |
| 20251219-1201.mp4 | First-person perspective standing in front of an ornate traditional Chinese temple. The symmetrical facade features red lanterns, intricate carvings, and a curved tiled roof decorated with dragons. Bright daytime lighting, consistent environment, camera motion directly aligned with user actions, immersive and interactive exploration. |
| 20251218-1053.mp4 | First-person perspective of standing in a rocky desert valley, looking at a camel a few meters ahead. The camel stands calmly on uneven stones, its long legs and single hump clearly visible. Bright midday sunlight, dry air, muted earth tones, distant barren mountains. Natural handheld camera feeling, camera motion controlled by user actions, smooth movement, cinematic realism. |
| 20251218-1031.mp4 | First-person perspective walking through a narrow urban alley, old red brick industrial buildings on both sides, cobblestone street stretching forward with strong depth, metal walkways connecting buildings above, overcast daylight, soft diffused lighting, cool and muted color tones, quiet and empty environment, no people, camera motion controlled by user actions, smooth movement, stable horizon, realistic scale and geometry, high realism, cinematic urban scene. |
| 20251218-0352.mp4 | First-person perspective coastal exploration scene, walking along a cliffside stone path with wooden railings, green bushes lining the walkway, ocean to the left with gentle waves, distant islands visible under a clear sky, realistic head-mounted camera view, smooth forward motion, stable horizon, natural human eye level, high realism, consistent environment, camera motion directly aligned with user actions, immersive and interactive exploration. |
| 20251218-0815.mp4 | First-person perspective inside a cozy living room, walking around a warm fireplace, soft carpet underfoot, furniture arranged neatly, bookshelves, plants, and warm table lamps on both sides, warm indoor lighting, calm and quiet atmosphere, natural head-level camera movement, camera motion driven by user actions, realistic scale and depth, high realism, cinematic lighting, no people, no distortion. |
- Text-to-World: Micro-World-T2W
- Image-to-World: Micro-World-I2W
Our codebase is built upon Wan2.1 and VideoX-Fun. We sincerely thank the authors for open-sourcing their excellent codebases.
Our datasets are collected using MineDojo and captioned with MiniCPM-V. We also extend our appreciation to the respective teams for their high-quality tools and contributions.