Next, for the Prepare Prompts step, I added this entry:
"city_square": (
"A bullet-time effect video in a frozen 3D photography style. The entire small-town courthouse-square intersection is captured as a single, perfectly static moment in time. The red-brick civic building with a clock tower, the surrounding storefronts and awnings, the gazebo, the roads, cars, pedestrians, flag, trees, shadows, and warm afternoon atmosphere all remain completely motionless, with no movement anywhere in the environment. The only change is the camera, which moves smoothly and stably in a gentle aerial arc while maintaining a high-angle perspective. The scene should remain a coherent small-town urban center throughout the shot: newly revealed areas should continue the courthouse-square layout with more connected streets, sidewalks, rooftops, storefronts, parked vehicles, and town-block structures consistent with the existing intersection. The surrounding view should preserve the continuity of the town rather than collapsing into generic woodland or empty natural scenery."
),
Then I ran Video Inference with this script:
#!/bin/bash
# WorldForge (WAN) Video Generation - Batch Inference Script
# Usage: bash wan_for_worldforge/run_test_case.sh (run from project root)
# export HF_ENDPOINT=https://hf-mirror.com # Uncomment if you need a HuggingFace mirror
cd "$(dirname "$0")"
# ==================== Basic Configuration ====================
MODELS_DIR="/.cache/huggingface/models" # Model weights directory (must contain Wan2.1-I2V-14B-480P-Diffusers etc.)
VIDEO_REF="vggt/output_images_1_vggt_warp_degree90/warped_images" # Input frames + masks directory
OUTPUT_DIR="./output_image_1_wan_worldforge_degree90" # Output directory
SCENE="city_square" # Scene name (see utils/prompts.py)
NUM_FRAMES=49 # Number of output frames
RESOLUTION="720p" # 480p or 720p
STATIC="True" # True for static scenes, False for dynamic
NUM_INFERENCE_STEPS=50 # Diffusion sampling steps
# ==================== Parameter Grid ====================
# Modify these arrays to sweep different parameter combinations
omegas=(4) # Auto-guidance strength (recommended: (4 6))
guidance_scales=(4) # CFG scale (recommended: (4) )
transition_distances=(15) # Mask softening distance in pixels (0=hard edge; recommended: (15 20 25))
resample_steps=(2) # Resampling iterations per step (recommended: (2))
guide_steps=(10 18 23) # Guide steps: apply guided fusion for first N steps (recommended: (10 15 18 20 23))
step_additions=(0) # Extra steps for resample_round = guide_steps + addition (recommended: (0 1))
# ==================== Batch Inference ====================
mkdir -p "$OUTPUT_DIR"
for omega in "${omegas[@]}"; do
  for cfg in "${guidance_scales[@]}"; do
    for mask in "${transition_distances[@]}"; do
      for resample in "${resample_steps[@]}"; do
        for guide in "${guide_steps[@]}"; do
          for add in "${step_additions[@]}"; do
            round=$((guide + add))
            output="${OUTPUT_DIR}/o${omega}_guide${guide}_round${round}_mask${mask}_cfg${cfg}.mp4"
            echo "========================================"
            echo "omega=$omega, guide=$guide, round=$round, mask=$mask, cfg=$cfg"
            echo "output: $output"
            echo "========================================"
            python infer_worldforge.py \
              --model "$RESOLUTION" \
              --models-dir "$MODELS_DIR" \
              --video-ref "$VIDEO_REF" \
              --scene "$SCENE" \
              --num-frames $NUM_FRAMES \
              --num-inference-steps $NUM_INFERENCE_STEPS \
              --guidance-scale $cfg \
              --static "$STATIC" \
              --guided \
              --resample-steps $resample \
              --guide-steps $guide \
              --resample-round $round \
              --omega $omega \
              --omega_resample $omega \
              --soften-mask \
              --transition-distance $mask \
              --use-pca-channel-selection \
              --output "$output"
          done
        done
      done
    done
  done
done
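A dry run of the sweep can help confirm which combinations and output files the loops produce before spending GPU time. The sketch below only enumerates the grid from the script above and prints the file names; it does not call `infer_worldforge.py`. The `list_runs` function and the `_rs${resample}` suffix are my own assumptions, not part of the project: the original file name omits the resample value, so outputs would overwrite each other if `resample_steps` held more than one entry.

```shell
#!/bin/bash
# Dry run: enumerate the parameter grid and print one output name per run.
# Same grid values as the inference script above.
omegas=(4)
guidance_scales=(4)
transition_distances=(15)
resample_steps=(2)
guide_steps=(10 18 23)
step_additions=(0)

# list_runs is a hypothetical helper, not part of WorldForge.
list_runs() {
  local omega cfg mask resample guide add round
  for omega in "${omegas[@]}"; do
    for cfg in "${guidance_scales[@]}"; do
      for mask in "${transition_distances[@]}"; do
        for resample in "${resample_steps[@]}"; do
          for guide in "${guide_steps[@]}"; do
            for add in "${step_additions[@]}"; do
              round=$((guide + add))
              # Assumption: "_rs${resample}" is added here to keep names unique.
              echo "o${omega}_guide${guide}_round${round}_mask${mask}_rs${resample}_cfg${cfg}.mp4"
            done
          done
        done
      done
    done
  done
}

list_runs
```

With the grid shown, this prints three names, one per `guide_steps` value.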
However, the results look very strange no matter which guide_steps value I use.
When I use custom data to generate videos, I sometimes get poor results. Could you please provide some guidance on how to improve them?
For example, if I give an input image like this:
First step: VGGT, 3D scene warping (single / few images)
with command:
Example output (the results look very strange no matter which guide_steps value I use):
o4_guide23_round23_mask15_cfg4.mp4