
[Fix] Ensure HuggingFace downloads complete before initialization. #1213

Open
zzhuoxin1508 wants to merge 14 commits into vllm-project:main from zzhuoxin1508:fix/load-before-init

Conversation

@zzhuoxin1508
Contributor

Purpose

This PR improves the startup stability of multimodal models in multi-stage pipelines. By ensuring the Orchestrator completes all critical file downloads before spawning any Stage Workers, it eliminates concurrent download conflicts and initialization timeouts in multi-process environments.

Solution

  • Enabled recursive patterns (**/*.ext) in omni_snapshot_download. This forces the Orchestrator to fully pull and verify all model files (including those in subdirectories) before initializing child processes.
  • Implemented require_all logic in download_weights_from_hf_specific. When enabled, the downloader validates that every configured pattern is matched and downloaded successfully.
  • Refactored omni_snapshot_download to check for a local path first and return it directly (sketched just after this list).
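
A minimal sketch of the resulting flow, assuming huggingface_hub's snapshot_download; the names follow this PR's description, but the bodies are illustrative rather than the actual vllm-omni implementation:

import os

from huggingface_hub import snapshot_download

# Recursive patterns so files in subfolders (text_encoder/, transformer/,
# vae/, tokenizer/, ...) are prefetched along with the top-level configs.
ALLOW_PATTERNS = [
    "**/*.json", "**/*.bin", "**/*.safetensors", "**/*.pt",
    "**/*.txt", "**/*.model", "**/*.yaml",
]


def omni_snapshot_download(model_id: str) -> str:
    # 1. Local path validation comes first: an existing directory is used as-is.
    if os.path.exists(model_id):
        return model_id
    # 2. Otherwise the Orchestrator pulls the complete snapshot up front,
    #    before any Stage Worker process is spawned, and returns the local
    #    snapshot path that the workers will load from.
    return snapshot_download(model_id, allow_patterns=ALLOW_PATTERNS)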

Test Plan

Validated the fix using the Tongyi-MAI/Z-Image-Turbo model.

python text_to_image.py \
  --model Tongyi-MAI/Z-Image-Turbo \
  --prompt "a cup of coffee on the table" \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 50 \
  --height 1024 \
  --width 1024 \
  --output outputs/coffee.png

Test Result

(vllm-omni-env) root@91f3d7b48993:/workspace/vllm-omni/examples/offline_inference/text_to_image# python text_to_image.py \
  --model Tongyi-MAI/Z-Image-Turbo \
  --prompt "a cup of coffee on the table" \
  --seed 42 \
  --cfg_scale 4.0 \
  --num_images_per_prompt 1 \
  --num_inference_steps 50 \
  --height 1024 \
  --width 1024 \
  --output outputs/coffee.png
INFO 02-05 02:12:36 [weight_utils.py:49] Using model weights format ['*.json', '*.bin', '*.safetensors', '*.pt', '*.txt', '*.model', '*.yaml']
model_index.json: 100%|████████████████████████████████████████████████████████████████████████████████████| 467/467 [00:00<00:00, 4.79MB/s]
scheduler_config.json: 100%|███████████████████████████████████████████████████████████████████████████████| 173/173 [00:00<00:00, 1.37MB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 726/726 [00:00<00:00, 9.06MB/s]
generation_config.json: 100%|██████████████████████████████████████████████████████████████████████████████| 239/239 [00:00<00:00, 1.88MB/s]
model.safetensors.index.json: 32.8kB [00:00, 77.9MB/s]
tokenizer/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 25.9MB/s]
tokenizer_config.json: 9.73kB [00:00, 28.8MB/s]
vocab.json: 2.78MB [00:00, 95.6MB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 473/473 [00:00<00:00, 3.35MB/s]
(…)ion_pytorch_model.safetensors.index.json: 49.0kB [00:00, 130MB/s]
config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 805/805 [00:00<00:00, 10.3MB/s]
text_encoder/model-00001-of-00003.safete(…): 100%|██████████████████████████████████████████████████████| 3.96G/3.96G [00:30<00:00, 131MB/s]
text_encoder/model-00002-of-00003.safete(…): 100%|██████████████████████████████████████████████████████| 3.99G/3.99G [00:30<00:00, 132MB/s]
text_encoder/model-00003-of-00003.safete(…): 100%|██████████████████████████████████████████████████████| 99.6M/99.6M [00:00<00:00, 123MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|██████████████████████████████████████████████████████| 9.97G/9.97G [00:39<00:00, 252MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|██████████████████████████████████████████████████████| 9.97G/9.97G [00:40<00:00, 247MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|██████████████████████████████████████████████████████| 4.67G/4.67G [00:19<00:00, 243MB/s]
vae/diffusion_pytorch_model.safetensors: 100%|███████████████████████████████████████████████████████████| 168M/168M [00:01<00:00, 98.0MB/s]
merges.txt: 1.67MB [00:00, 75.0MB/s]
INFO 02-05 02:15:22 [weight_utils.py:70] Time spent downloading weights for Tongyi-MAI/Z-Image-Turbo: 165.555125 seconds
INFO 02-05 02:15:22 [omni.py:137] Initializing stages for model: /workspace/.cache/huggingface/hub/models--Tongyi-MAI--Z-Image-Turbo/snapshots/f332072aa78be7aecdf3ee76d5c247082da564a6
INFO 02-05 02:15:22 [initialization.py:35] No OmniTransferConfig provided
INFO 02-05 02:15:22 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'diffusion', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'enable_layerwise_offload': False, 'layerwise_num_gpu_layers': 1, 'vae_use_slicing': False, 'vae_use_tiling': False, 'cache_backend': None, 'cache_config': None, 'enable_cache_dit_summary': False, 'parallel_config': {'pipeline_parallel_size': 1, 'data_parallel_size': 1, 'tensor_parallel_size': 1, 'sequence_parallel_size': 1, 'ulysses_degree': 1, 'ring_degree': 1, 'cfg_parallel_size': 1}, 'enforce_eager': False, 'enable_cpu_offload': False, 'model': '/workspace/.cache/huggingface/hub/models--Tongyi-MAI--Z-Image-Turbo/snapshots/f332072aa78be7aecdf3ee76d5c247082da564a6', 'model_stage': 'diffusion'}, 'final_output': True, 'final_output_type': 'image'}
INFO 02-05 02:15:22 [omni.py:356] [Orchestrator] Waiting for 1 stages to initialize (timeout: 300s)
[Stage-0] INFO 02-05 02:15:31 [omni_stage.py:497] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--Tongyi-MAI--Z-Image-Turbo/snapshots/f332072aa78be7aecdf3ee76d5c247082da564a6
[Stage-0] INFO 02-05 02:15:31 [omni_stage.py:510] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
[Stage-0] INFO 02-05 02:15:52 [multiproc_executor.py:74] Starting server...
[Stage-0] INFO 02-05 02:16:02 [diffusion_worker.py:269] Worker 0 created result MessageQueue
[Stage-0] INFO 02-05 02:16:02 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=2048.
[Stage-0] INFO 02-05 02:16:02 [vllm.py:624] Asynchronous scheduling is enabled.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 02-05 02:16:02 [diffusion_worker.py:95] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Stage-0] INFO 02-05 02:16:02 [parallel_state.py:565] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
[Stage-0] INFO 02-05 02:16:02 [parallel_state.py:607] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.74it/s]
[Stage-0] INFO 02-05 02:16:05 [z_image_transformer.py:619] Z-Image init: dim=3840 n_heads=30 n_kv_heads=30 ffn_hidden_dim=10240 final_out_dims=(64,) tp=1 (supported_tp=(1, 2))
[Stage-0] INFO 02-05 02:16:05 [platform.py:77] Defaulting to diffusion attention backend FLASH_ATTN
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:04<00:09,  4.79s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:08<00:04,  4.40s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:10<00:00,  3.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:10<00:00,  3.62s/it]

[Stage-0] INFO 02-05 02:16:17 [diffusers_loader.py:227] Loading weights took 11.11 seconds
[Stage-0] INFO 02-05 02:16:17 [diffusion_model_runner.py:103] Model loading took 19.1516 GiB and 14.864855 seconds
[Stage-0] INFO 02-05 02:16:17 [diffusion_model_runner.py:108] Model runner: Model loaded successfully.
[Stage-0] INFO 02-05 02:16:17 [diffusion_model_runner.py:122] Model runner: Model compiled with torch.compile.
[Stage-0] INFO 02-05 02:16:17 [diffusion_model_runner.py:137] Model runner: Initialization complete.
[Stage-0] INFO 02-05 02:16:17 [manager.py:90] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
[Stage-0] INFO 02-05 02:16:17 [diffusion_worker.py:126] Worker 0: Initialization complete.
[Stage-0] INFO 02-05 02:16:17 [diffusion_worker.py:393] Worker 0: Scheduler loop started.
[Stage-0] INFO 02-05 02:16:17 [diffusion_worker.py:320] Worker 0 ready to receive requests via shared memory
[Stage-0] INFO 02-05 02:16:17 [scheduler.py:38] SyncScheduler initialized result MessageQueue
[Stage-0] INFO 02-05 02:16:17 [diffusion_engine.py:337] dummy run to warm up the model
[Stage-0] INFO 02-05 02:16:17 [manager.py:538] Deactivating all adapters: 0 layers
[Stage-0] WARNING 02-05 02:16:17 [kv_transfer_manager.py:452] Request has no ID, cannot receive KV cache
[Stage-0] INFO 02-05 02:16:31 [omni_stage.py:740] Max batch size: 1
INFO 02-05 02:16:31 [omni.py:349] [Orchestrator] Stage-0 reported ready
INFO 02-05 02:16:31 [omni.py:375] [Orchestrator] All stages initialized successfully

Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft.



@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1edaa8168


Comment on lines 91 to 93
allow_patterns=[
    "**/*.json", "**/*.bin", "**/*.safetensors", "**/*.pt",
    "**/*.txt", "**/*.model", "**/*.yaml"


P2: Include legacy .npy assets in HF prefetch

The new prefetch list omits *.npy, but the Qwen2.5 Omni loader explicitly supports legacy speaker assets stored under inputs/*spk_emb.npy and inputs/*ref_mel.npy in the model directory (see _init_token2wav_model in vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py). Because omni_snapshot_download now converts repo IDs into a local snapshot path, downstream loaders will treat it as a local directory and won’t fall back to Hugging Face to fetch missing files. For models that only ship the legacy .npy assets (no spk_dict.pt), this change silently drops conditioning data and forces the fallback zeros path, which breaks speaker conditioning quality. Consider adding **/*.npy (or using * for the prefetch) to avoid losing these files.
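
If that suggestion is adopted, the prefetch list would roughly become the following (illustrative only; the real list lives in omni_snapshot_download):

allow_patterns = [
    "**/*.json", "**/*.bin", "**/*.safetensors", "**/*.pt",
    "**/*.txt", "**/*.model", "**/*.yaml",
    "**/*.npy",  # legacy Qwen2.5-Omni speaker assets (inputs/*spk_emb.npy, inputs/*ref_mel.npy)
]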


@hsliuustc0106 hsliuustc0106 requested a review from Copilot February 5, 2026 04:33
@hsliuustc0106
Collaborator

Please fix the pre-commit checks.

@hsliuustc0106
Collaborator

Could you please also test the Qwen2.5-Omni model?

@zzhuoxin1508
Contributor Author

Could you please also test the Qwen2.5-Omni model?

OK, I'll try it.

Contributor

Copilot AI left a comment


Pull request overview

This pull request addresses initialization issues in multimodal models within multi-stage pipelines by ensuring that the Orchestrator completes all critical file downloads before spawning Stage Workers. This eliminates concurrent download conflicts and initialization timeouts in multi-process environments.

Changes:

  • Added require_all parameter to download_weights_from_hf_specific to force downloading all matching patterns (see the sketch after this list)
  • Refactored omni_snapshot_download to use recursive glob patterns (**/*.ext) and the new require_all functionality
  • Added local path validation to omni_snapshot_download for optimization
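
A rough sketch of what the require_all switch changes, assuming huggingface_hub; this is a hypothetical body, not the code in vllm_omni/model_executor/model_loader/weight_utils.py:

import fnmatch

from huggingface_hub import list_repo_files, snapshot_download


def download_weights_from_hf_specific(
    model_name_or_path: str,
    allow_patterns: list[str],
    require_all: bool = False,
) -> str:
    if require_all:
        # New behavior: fetch every pattern in one pass so tokenizer files,
        # per-subfolder configs and *.pt assets all end up in the snapshot.
        return snapshot_download(model_name_or_path, allow_patterns=allow_patterns)
    # Old behavior (roughly): keep only the first pattern that matches
    # something in the repo listing; later patterns may never be fetched.
    repo_files = list_repo_files(model_name_or_path)
    for pattern in allow_patterns:
        if any(fnmatch.fnmatch(f, pattern) for f in repo_files):
            return snapshot_download(model_name_or_path, allow_patterns=[pattern])
    return snapshot_download(model_name_or_path, allow_patterns=allow_patterns)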

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File | Description
vllm_omni/model_executor/model_loader/weight_utils.py | Added require_all parameter to control whether all patterns should be downloaded
vllm_omni/entrypoints/omni.py | Refactored snapshot download to use recursive patterns and ensure complete downloads


# If it's already a local path, just return it
if os.path.exists(model_id):
    return model_id
# TODO: this is just a workaround for quickly use modelscope, we should support

Copilot AI Feb 5, 2026


Line 77 has trailing whitespace. Remove the trailing space after "return model_id".

else:
    return _dummy_snapshot_download(model_id)
# For other cases (Hugging Face), perform a real download to ensure all
# necessary files (including *.pt for audio/diffusion) are available locally

Copilot AI Feb 5, 2026


Line 84 has trailing whitespace. Remove the trailing space after "return snapshot_download(model_id)".

@zzhuoxin1508
Contributor Author

Qwen2.5-Omni test

@hsliuustc0106 @lishunyang12 @nussejzz PTAL

(workspace) root@72e628fb7449:/workspace/vllm-omni/examples/offline_inference/qwen2_5_omni# bash run_single_prompt.sh
INFO 02-05 05:55:11 [weight_utils.py:49] Using model weights format ['*.json', '*.bin', '*.safetensors', '*.pt', '*.txt', '*.model', '*.yaml']
added_tokens.json: 100%|███████████████████████████████████████████████████████████████████████████████████| 579/579 [00:00<00:00, 5.77MB/s]
chat_template.json: 1.31kB [00:00, 4.78MB/s]
config.json: 13.2kB [00:00, 64.8MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████| 74.0/74.0 [00:00<00:00, 910kB/s]
model.safetensors.index.json: 233kB [00:00, 326MB/s]
preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████| 667/667 [00:00<00:00, 4.76MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████████████████████████████████████| 832/832 [00:00<00:00, 5.19MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:00<00:00, 22.7MB/s]
tokenizer_config.json: 6.47kB [00:00, 22.5MB/s]
vocab.json: 2.78MB [00:00, 85.8MB/s]
model-00001-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:38<00:00, 129MB/s]
model-00002-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:38<00:00, 128MB/s]
model-00003-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:38<00:00, 131MB/s]
model-00004-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 4.97G/4.97G [00:37<00:00, 132MB/s]
model-00005-of-00005.safetensors: 100%|█████████████████████████████████████████████████████████████████| 2.43G/2.43G [00:12<00:00, 196MB/s]
spk_dict.pt: 100%|████████████████████████████████████████████████████████████████████████████████████████| 260k/260k [00:00<00:00, 925kB/s]
merges.txt: 1.67MB [00:00, 141MB/s]
INFO 02-05 05:58:00 [weight_utils.py:70] Time spent downloading weights for Qwen/Qwen2.5-Omni-7B: 169.257966 seconds
INFO 02-05 05:58:00 [omni.py:137] Initializing stages for model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
INFO 02-05 05:58:00 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
INFO 02-05 05:58:00 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
INFO 02-05 05:58:00 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
INFO 02-05 05:58:00 [factory.py:46] Created connector: SharedMemoryConnector
INFO 02-05 05:58:00 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
INFO 02-05 05:58:00 [factory.py:46] Created connector: SharedMemoryConnector
INFO 02-05 05:58:00 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
INFO 02-05 05:58:00 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'llm', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'thinker', 'model_arch': 'Qwen2_5OmniForConditionalGeneration', 'worker_type': 'ar', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.8, 'enforce_eager': True, 'trust_remote_code': True, 'engine_output_type': 'latent', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'max_num_seqs': 1, 'async_chunk': False}, 'is_comprehension': True, 'final_output': True, 'final_output_type': 'text', 'default_sampling_params': {'temperature': 0.0, 'top_p': 1.0, 'top_k': -1, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.1}}
INFO 02-05 05:58:00 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 1, 'stage_type': 'llm', 'runtime': {'process': True, 'devices': '1', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'talker', 'model_arch': 'Qwen2_5OmniForConditionalGeneration', 'worker_type': 'ar', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.8, 'enforce_eager': True, 'trust_remote_code': True, 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'engine_output_type': 'latent', 'max_num_seqs': 1, 'async_chunk': False}, 'engine_input_source': [0], 'custom_process_input_func': 'vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker', 'default_sampling_params': {'temperature': 0.9, 'top_p': 0.8, 'top_k': 40, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.05, 'stop_token_ids': [8294]}}
INFO 02-05 05:58:00 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 2, 'stage_type': 'llm', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'code2wav', 'model_arch': 'Qwen2_5OmniForConditionalGeneration', 'worker_type': 'generation', 'scheduler_cls': 'vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler', 'gpu_memory_utilization': 0.15, 'enforce_eager': True, 'trust_remote_code': True, 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'async_scheduling': False, 'engine_output_type': 'audio', 'max_num_seqs': 1, 'async_chunk': False}, 'engine_input_source': [1], 'final_output': True, 'final_output_type': 'audio', 'default_sampling_params': {'temperature': 0.0, 'top_p': 1.0, 'top_k': -1, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.1}}
INFO 02-05 05:58:00 [omni.py:356] [Orchestrator] Waiting for 3 stages to initialize (timeout: 300s)
[Stage-2] INFO 02-05 05:58:09 [omni_stage.py:497] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00
[Stage-1] INFO 02-05 05:58:09 [omni_stage.py:497] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00
[Stage-1] INFO 02-05 05:58:09 [omni_stage.py:510] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
[Stage-2] INFO 02-05 05:58:09 [omni_stage.py:510] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-2] INFO 02-05 05:58:09 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
[Stage-2] INFO 02-05 05:58:09 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
[Stage-2] INFO 02-05 05:58:09 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
[Stage-2] INFO 02-05 05:58:09 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:09 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:09 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:09 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
[Stage-1] INFO 02-05 05:58:09 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
[Stage-1] INFO 02-05 05:58:09 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
[Stage-1] INFO 02-05 05:58:09 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
[Stage-1] INFO 02-05 05:58:09 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 02-05 05:58:09 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 02-05 05:58:09 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 02-05 05:58:09 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-0] INFO 02-05 05:58:10 [omni_stage.py:497] Starting stage worker with model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00
[Stage-0] INFO 02-05 05:58:10 [omni_stage.py:510] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
[Stage-2] INFO 02-05 05:58:18 [model.py:541] Resolved architecture: Qwen2_5OmniModel
[Stage-2] INFO 02-05 05:58:18 [model.py:1561] Using max model len 32768
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-1] INFO 02-05 05:58:18 [model.py:541] Resolved architecture: Qwen2_5OmniModel
[Stage-1] INFO 02-05 05:58:18 [model.py:1561] Using max model len 32768
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-2] INFO 02-05 05:58:29 [model.py:222] Resolved architecture: Qwen2_5OmniForConditionalGeneration
[Stage-2] INFO 02-05 05:58:29 [model.py:1561] Using max model len 32768
[Stage-2] INFO 02-05 05:58:29 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-2] INFO 02-05 05:58:29 [vllm.py:624] Asynchronous scheduling is disabled.
[Stage-2] WARNING 02-05 05:58:29 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
[Stage-2] INFO 02-05 05:58:29 [vllm.py:762] Cudagraph is disabled under eager mode
[Stage-1] INFO 02-05 05:58:29 [model.py:222] Resolved architecture: Qwen2_5OmniForConditionalGeneration
[Stage-1] INFO 02-05 05:58:29 [model.py:1561] Using max model len 32768
[Stage-1] INFO 02-05 05:58:29 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-1] INFO 02-05 05:58:29 [vllm.py:624] Asynchronous scheduling is enabled.
[Stage-1] WARNING 02-05 05:58:29 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
[Stage-1] INFO 02-05 05:58:29 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:39 [core.py:96] Initializing a V1 LLM engine (v0.15.0) with config: model='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', speculative_config=None, tokenizer='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:39 [core.py:96] Initializing a V1 LLM engine (v0.15.0) with config: model='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', speculative_config=None, tokenizer='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:40 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.0.2:58453 backend=nccl
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:40 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.0.2:54859 backend=nccl
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:40 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:40 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=4615) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=4681) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:43 [gpu_model_runner.py:4021] Starting to load model /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00...
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:43 [gpu_model_runner.py:4021] Starting to load model /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00...
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:43 [vllm.py:624] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:43 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:43 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:43 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:43 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:43 [utils.py:59] Trying to guess the arguments for old-style model class <class 'vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_token2wav.Qwen2_5OmniToken2WavModel'>
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:43 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=4681) [Stage-1] WARNING 02-05 05:58:43 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:43 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4681) [Stage-1] WARNING 02-05 05:58:43 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:58:43 [vllm.py:762] Cudagraph is disabled under eager mode
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:00<00:00, 98.47it/s]
(EngineCore_DP0 pid=4615) 
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:45 [qwen2_5_omni_token2wav.py:1759] [Model Loaded] name=Qwen2_5OmniToken2WavForConditionalGenerationVLLM, success=True, size=1492.80 MB, device=cuda:0
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:45 [default_loader.py:291] Loading weights took 1.34 seconds
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:46 [gpu_model_runner.py:4118] Model loading took 1.46 GiB memory and 2.011382 seconds
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:46 [qwen2_5_omni.py:943] Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:47 [gpu_generation_model_runner.py:418] Dummy sampler run is not implemented for generation model
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:47 [core.py:272] init engine (profile, create kv cache, warmup model) took 1.35 seconds
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:47 [scheduler.py:168] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:47 [core.py:129] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=4615) [Stage-2] WARNING 02-05 05:58:47 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4615) [Stage-2] INFO 02-05 05:58:47 [vllm.py:762] Cudagraph is disabled under eager mode
[Stage-2] INFO 02-05 05:58:48 [omni_llm.py:172] Supported_tasks: ['generate']
[Stage-2] INFO 02-05 05:58:48 [initialization.py:288] [Stage-2] Initializing OmniConnectors with config keys: ['from_stage_1']
[Stage-2] INFO 02-05 05:58:48 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:48 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
[Stage-2] INFO 02-05 05:58:48 [omni_stage.py:740] Max batch size: 1
INFO 02-05 05:58:48 [omni.py:349] [Orchestrator] Stage-2 reported ready
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-0] INFO 02-05 05:58:48 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
[Stage-0] INFO 02-05 05:58:48 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
[Stage-0] INFO 02-05 05:58:48 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
[Stage-0] INFO 02-05 05:58:48 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-0] INFO 02-05 05:58:48 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-0] INFO 02-05 05:58:48 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-0] INFO 02-05 05:58:48 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-0] INFO 02-05 05:58:48 [model.py:541] Resolved architecture: Qwen2_5OmniModel
[Stage-0] INFO 02-05 05:58:48 [model.py:1561] Using max model len 32768
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section'}
[Stage-0] INFO 02-05 05:58:58 [model.py:222] Resolved architecture: Qwen2_5OmniForConditionalGeneration
[Stage-0] INFO 02-05 05:58:58 [model.py:1561] Using max model len 32768
[Stage-0] INFO 02-05 05:58:58 [scheduler.py:226] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-0] INFO 02-05 05:58:58 [vllm.py:624] Asynchronous scheduling is enabled.
[Stage-0] WARNING 02-05 05:58:58 [vllm.py:662] Enforce eager set, overriding optimization level to -O0
[Stage-0] INFO 02-05 05:58:58 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:04 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:04 [mm_encoder_attention.py:77] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:00<00:00, 96.70it/s]
(EngineCore_DP0 pid=4681) 
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:08 [core.py:96] Initializing a V1 LLM engine (v0.15.0) with config: model='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', speculative_config=None, tokenizer='/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:09 [parallel_state.py:1212] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.20.0.2:33551 backend=nccl
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:09 [parallel_state.py:1423] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(EngineCore_DP0 pid=5809) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:12 [gpu_model_runner.py:4021] Starting to load model /workspace/.cache/huggingface/hub/models--Qwen--Qwen2.5-Omni-7B/snapshots/ae9e1690543ffd5c0221dc27f79834d0294cba00...
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [vllm.py:624] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:13 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:13 [qwen2_5_omni_thinker.py:272] flash_attn is not available, the model may not yield the exactly same result as the transformers implementation in the audio tower part.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [mm_encoder_attention.py:77] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:13 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [vllm.py:762] Cudagraph is disabled under eager mode
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:13 [cuda.py:364] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:00<00:00, 98.96it/s]
(EngineCore_DP0 pid=5809) 
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:17 [default_loader.py:291] Loading weights took 3.89 seconds
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:18 [qwen2_5_omni_talker.py:196] [Model Loaded] name=Qwen2_5OmniTalkerForConditionalGeneration, success=True, size=5087.96 MB, device=cuda:0
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:18 [gpu_model_runner.py:4118] Model loading took 16.74 GiB memory and 4.539376 seconds
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:18 [default_loader.py:291] Loading weights took 14.11 seconds
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:18 [gpu_model_runner.py:4946] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:19 [gpu_model_runner.py:4118] Model loading took 6.03 GiB memory and 34.718209 seconds
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:19 [gpu_model_runner.py:4946] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [gpu_worker.py:356] Available KV cache memory: 28.23 GiB
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [kv_cache_utils.py:1307] GPU KV cache size: 616,784 tokens
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 18.82x
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [core.py:272] init engine (profile, create kv cache, warmup model) took 4.18 seconds
(EngineCore_DP0 pid=4681) [Stage-1] WARNING 02-05 05:59:23 [scheduler.py:168] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=4681) [Stage-1] WARNING 02-05 05:59:23 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=4681) [Stage-1] INFO 02-05 05:59:23 [vllm.py:762] Cudagraph is disabled under eager mode
[Stage-1] INFO 02-05 05:59:24 [omni_llm.py:172] Supported_tasks: ['generate']
[Stage-1] INFO 02-05 05:59:24 [initialization.py:288] [Stage-1] Initializing OmniConnectors with config keys: ['from_stage_0']
[Stage-1] INFO 02-05 05:59:24 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 02-05 05:59:24 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 02-05 05:59:24 [omni_stage.py:740] Max batch size: 1
INFO 02-05 05:59:24 [omni.py:349] [Orchestrator] Stage-1 reported ready
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:25 [gpu_worker.py:356] Available KV cache memory: 17.09 GiB
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:25 [kv_cache_utils.py:1307] GPU KV cache size: 320,000 tokens
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:25 [kv_cache_utils.py:1312] Maximum concurrency for 32,768 tokens per request: 9.77x
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:25 [core.py:272] init engine (profile, create kv cache, warmup model) took 7.36 seconds
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:25 [scheduler.py:168] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=5809) [Stage-0] WARNING 02-05 05:59:26 [vllm.py:669] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=5809) [Stage-0] INFO 02-05 05:59:26 [vllm.py:762] Cudagraph is disabled under eager mode
[Stage-0] INFO 02-05 05:59:26 [omni_llm.py:172] Supported_tasks: ['generate']
[Stage-0] INFO 02-05 05:59:26 [initialization.py:288] [Stage-0] Initializing OmniConnectors with config keys: ['to_stage_1']
[Stage-0] INFO 02-05 05:59:26 [omni_stage.py:740] Max batch size: 1
INFO 02-05 05:59:26 [omni.py:349] [Orchestrator] Stage-0 reported ready
INFO 02-05 05:59:26 [omni.py:375] [Orchestrator] All stages initialized successfully

zzhuoxin1508 and others added 2 commits February 5, 2026 14:26
Contributor

@lishunyang12 lishunyang12 left a comment


Tested with Qwen-Image-Edit; it breaks:

(workspace) root@925981d52983:/workspace/vllm-omni/examples/offline_inference/image_to_image# python image_edit.py \
  --image qwen-bear.png \
  --prompt "Let this mascot dance under the moon, surrounded by floating stars and poetic bubbles such as 'Be Kind'" \
  --output output_image_edit.png \
  --num_inference_steps 50 \
  --cfg_scale 4.0
INFO 02-06 15:54:04 [weight_utils.py:50] Using model weights format ['**/*.json', '**/*.bin', '**/*.safetensors', '**/*.pt', '**/*.txt', '**/*.model', '**/*.yaml']
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 605/605 [00:00<00:00, 1.68MB/s]
preprocessor_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 788/788 [00:00<00:00, 2.61MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 2.54MB/s]
processor/tokenizer.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 11.0MB/s]
tokenizer_config.json: 4.73kB [00:00, 16.3MB/s]
video_preprocessor_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 904/904 [00:00<00:00, 7.05MB/s]
vocab.json: 2.78MB [00:00, 18.2MB/s]
scheduler_config.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 485/485 [00:00<00:00, 2.00MB/s]
config.json: 3.22kB [00:00, 7.97MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 244/244 [00:00<00:00, 1.02MB/s]
model.safetensors.index.json: 57.7kB [00:00, 102MB/s]
tokenizer_config.json: 4.69kB [00:00, 12.3MB/s]
vocab.json: 3.38MB [00:00, 53.7MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 339/339 [00:00<00:00, 1.37MB/s]
(…)ion_pytorch_model.safetensors.index.json: 199kB [00:00, 110MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 730/730 [00:00<00:00, 3.07MB/s]
text_encoder/model-00001-of-00004.safete(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.97G/4.97G [00:12<00:00, 398MB/s]
text_encoder/model-00002-of-00004.safete(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:12<00:00, 408MB/s]
text_encoder/model-00003-of-00004.safete(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.93G/4.93G [00:12<00:00, 399MB/s]
text_encoder/model-00004-of-00004.safete(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1.69G/1.69G [00:04<00:00, 353MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.99G/4.99G [00:11<00:00, 422MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [00:11<00:00, 421MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.95G/4.95G [00:33<00:00, 149MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [00:12<00:00, 411MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.95G/4.95G [00:11<00:00, 422MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.95G/4.95G [00:11<00:00, 427MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.91G/4.91G [00:12<00:00, 394MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 4.98G/4.98G [00:12<00:00, 398MB/s]
transformer/diffusion_pytorch_model-0000(…): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████| 1.17G/1.17G [00:03<00:00, 317MB/s]
vae/diffusion_pytorch_model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 254M/254M [00:01<00:00, 156MB/s]
merges.txt: 1.67MB [00:00, 41.0MB/s]
INFO 02-06 15:56:58 [weight_utils.py:71] Time spent downloading weights for Qwen/Qwen-Image-Edit: 174.753237 seconds
INFO 02-06 15:56:58 [omni.py:132] Initializing stages for model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen-Image-Edit/snapshots/ac7f9318f633fc4b5778c59367c8128225f1e3de
Traceback (most recent call last):
  File "/workspace/.venv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 604, in get_config
    raise ValueError(
ValueError: Could not detect config format for no config file found. With config_format 'auto', ensure your model has either config.json (HF format) or params.json (Mistral format). Otherwise please specify your_custom_config_format in engine args for customized config parser.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/workspace/vllm-omni/vllm_omni/entrypoints/utils.py", line 139, in resolve_model_config_path
    hf_config = get_config(model, trust_remote_code=True)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.12/site-packages/vllm/transformers_utils/config.py", line 625, in get_config
    raise ValueError(error_message) from e
ValueError: Invalid repository ID or local directory specified: '/workspace/.cache/huggingface/hub/models--Qwen--Qwen-Image-Edit/snapshots/ac7f9318f633fc4b5778c59367c8128225f1e3de'.
Please verify the following requirements:
1. Provide a valid Hugging Face repository ID.
2. Specify a local directory that contains a recognized configuration file.
   - For Hugging Face models: ensure the presence of a 'config.json'.
   - For Mistral models: ensure the presence of a 'params.json'.


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/workspace/vllm-omni/examples/offline_inference/image_to_image/image_edit.py", line 492, in <module>
    main()
  File "/workspace/vllm-omni/examples/offline_inference/image_to_image/image_edit.py", line 362, in main
    omni = Omni(
           ^^^^^
  File "/workspace/vllm-omni/vllm_omni/entrypoints/omni.py", line 535, in __init__
    super().__init__(model, **kwargs)
  File "/workspace/vllm-omni/vllm_omni/entrypoints/omni.py", line 133, in __init__
    self._initialize_stages(model, kwargs)
  File "/workspace/vllm-omni/vllm_omni/entrypoints/omni.py", line 221, in _initialize_stages
    self.config_path = resolve_model_config_path(model)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspace/vllm-omni/vllm_omni/entrypoints/utils.py", line 162, in resolve_model_config_path
    raise ValueError(
ValueError: Could not determine model_type for model: /workspace/.cache/huggingface/hub/models--Qwen--Qwen-Image-Edit/snapshots/ac7f9318f633fc4b5778c59367c8128225f1e3de. Model is not in standard transformers format and does not have model_index.json. Please ensure the model has proper configuration files with 'model_type' field
(workspace) root@925981d52983:/workspace/vllm-omni/examples/offline_inference/image_to_image# 


@lishunyang12
Contributor

lishunyang12 commented Feb 6, 2026

Can you take a look at how diffusers and vLLM handle this situation? Trace the relevant code and try running their examples.

@zzhuoxin1508
Contributor Author

Can you take a look at how diffusers and vLLM handle this situation? Trace the relevant code and try running their examples.

ok
