CUDA_VISIBLE_DEVICES=0 auto-round --scheme w4a16_mixed --iters 0 --disable_opt_rtn --model_name /data7/models/MiMo-V2.5-Pro --output_dir /data7/
2026-05-11 21:55:39 INFO entry.py L401: Auto-routing to model-free quantization (iters=0, disable_opt_rtn=True, supported scheme). Pass disable_model_free=True to use the regula
/home/uttest/miniforge3/envs/test/lib/python3.12/site-packages/transformers/modeling_rope_utils.py:1034: FutureWarning: `rope_config_validation` is deprecated and has been remov
warnings.warn(
2026-05-11 21:55:51 WARNING model_free.py L1144: Detected 2 layer(s) incompatible with model-free RTN: embed_tokens, rotary_emb, swa_rotary_emb.
These layers have been automatically added to ignore_layers and will be kept in full precision.
To override, pass --ignore_layers explicitly or disable model-free mode (remove --model_free).
2026-05-11 21:55:51 INFO model_free.py L1177: Detected FP8 source model (block_size=[128, 128], scale_fmt=N/A). FP8 weights will be dequantized before quantization.
2026-05-11 21:55:51 INFO model_free.py L1395: Model-free quantization: /data7/models/MiMo-V2.5-Pro
Scheme: QuantizationScheme(bits=4, group_size=128, sym=True, data_type='int', act_bits=16, act_group_size=None, act_sym=None, act_data_type=None, act_dynamic=None, super_bits=
Output: /data7/saved/MiMo-V2.5-Pro-int4-mixed
Shards: 34
Streaming download: False
Diffusion model: False
Quant lm_head: False
Quant nontext module: False
Device: cuda:0
Processing shards: 0%| | 0/34 [00:00<?, ?shard/s]
2026-05-11 21:56:37 INFO model_free.py L1271: Memory usage: 'peak_ram': 32.91GB, 'peak_vram': 4.78GB
Processing shards: 3%|███▌ | 1/34 [00:46<25:40, 46.67s/shard]
2026-05-11 21:58:08 INFO model_free.py L1271: Memory usage: 'peak_ram': 71.42GB, 'peak_vram': 4.78GB
Processing shards: 6%|███████ | 2/34 [02:17<38:49, 72.81s/shard]
2026-05-11 21:59:55 INFO model_free.py L1271: Memory usage: 'peak_ram': 71.42GB, 'peak_vram': 4.78GB
Processing shards: 9%|██████████▌ | 3/34 [04:04<45:38, 88.33s/shard]
2026-05-11 22:01:07 INFO model_free.py L1271: Memory usage: 'peak_ram': 96.1GB, 'peak_vram': 4.78GB
Processing shards: 12%|██████████████ | 4/34 [05:16<40:49, 81.66s/shard]
2026-05-11 22:02:36 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.04GB, 'peak_vram': 4.78GB
Processing shards: 15%|█████████████████▌ | 5/34 [06:44<40:43, 84.26s/shard]
2026-05-11 22:03:54 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.04GB, 'peak_vram': 4.78GB
Processing shards: 18%|█████████████████████ | 6/34 [08:03<38:21, 82.20s/shard]
2026-05-11 22:05:07 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 21%|████████████████████████▌ | 7/34 [09:16<35:42, 79.36s/shard]
2026-05-11 22:06:55 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 24%|████████████████████████████ | 8/34 [11:04<38:20, 88.46s/shard]
2026-05-11 22:08:25 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 26%|███████████████████████████████▌ | 9/34 [12:34<36:59, 88.78s/shard]
2026-05-11 22:09:45 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 29%|██████████████████████████████████▋ | 10/34 [13:54<34:31, 86.29s/shard]
2026-05-11 22:11:22 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 32%|██████████████████████████████████████▏ | 11/34 [15:31<34:20, 89.57s/shard]
2026-05-11 22:12:37 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 35%|█████████████████████████████████████████▋ | 12/34 [16:46<31:08, 84.94s/shard]
2026-05-11 22:13:59 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 38%|█████████████████████████████████████████████ | 13/34 [18:08<29:27, 84.18s/shard]
2026-05-11 22:15:14 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 41%|████████████████████████████████████████████████▌ | 14/34 [19:23<27:06, 81.34s/shard]
2026-05-11 22:16:37 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 44%|████████████████████████████████████████████████████ | 15/34 [20:46<25:56, 81.94s/shard]
2026-05-11 22:18:07 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 47%|███████████████████████████████████████████████████████▌ | 16/34 [22:16<25:19, 84.40s/shard]
2026-05-11 22:19:25 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 50%|███████████████████████████████████████████████████████████ | 17/34 [23:34<23:20, 82.37s/shard]
2026-05-11 22:20:39 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 53%|██████████████████████████████████████████████████████████████▍ | 18/34 [24:48<21:18, 79.90s/shard]
2026-05-11 22:21:51 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 56%|█████████████████████████████████████████████████████████████████▉ | 19/34 [26:00<19:20, 77.38s/shard]
2026-05-11 22:23:18 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 59%|█████████████████████████████████████████████████████████████████████▍ | 20/34 [27:27<18:44, 80.33s/shard]
2026-05-11 22:25:00 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 62%|████████████████████████████████████████████████████████████████████████▉ | 21/34 [29:09<18:49, 86.89s/shard]
2026-05-11 22:26:52 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 65%|████████████████████████████████████████████████████████████████████████████▎ | 22/34 [31:01<18:53, 94.48s/shard]
2026-05-11 22:28:17 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 68%|███████████████████████████████████████████████████████████████████████████████▊ | 23/34 [32:25<16:45, 91.41s/shard]
2026-05-11 22:29:46 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 71%|███████████████████████████████████████████████████████████████████████████████████▎ | 24/34 [33:54<15:07, 90.71s/shard]
2026-05-11 22:31:11 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 74%|██████████████████████████████████████████████████████████████████████████████████████▊ | 25/34 [35:20<13:22, 89.14s/shard]2026-05-11 22:31:30 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:32:39 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 76%|██████████████████████████████████████████████████████████████████████████████████████████▏ | 26/34 [36:47<11:48, 88.62s/shard]2026-05-11 22:32:57 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:33:59 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 79%|█████████████████████████████████████████████████████████████████████████████████████████████▋ | 27/34 [38:08<10:03, 86.25s/shard]2026-05-11 22:34:21 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:35:37 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 82%|█████████████████████████████████████████████████████████████████████████████████████████████████▏ | 28/34 [39:46<08:57, 89.61s/shard]2026-05-11 22:35:58 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:37:12 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 85%|████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 29/34 [41:21<07:37, 91.43s/shard]2026-05-11 22:37:35 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:38:57 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 88%|████████████████████████████████████████████████████████████████████████████████████████████████████████ | 30/34 [43:06<06:21, 95.46s/shard]2026-05-11 22:39:29 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:40:46 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 31/34 [44:55<04:58, 99.54s/shard]2026-05-11 22:41:09 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:42:27 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 32/34 [46:35<03:19, 99.76s/shard]2026-05-11 22:42:52 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:44:10 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards: 97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 33/34 [48:19<01:40, 101.00s/shard]2026-05-11 22:44:11 INFO model_free.py L639: Dequantizing 12 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:44:36 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 5.69GB
Processing shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [48:45<00:00, 86.04s/shard]
2026-05-11 22:44:40 INFO model_free.py L1360:
Model-free quantization complete.
Output directory: /data7/saved/MiMo-V2.5-Pro-int4-mixed
Total time: 2928.87 seconds
Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 5.69GB
Quantized layers (79649): model.layers.0.mlp.down_proj, model.layers.0.mlp.gate_proj, model.layers.0.mlp.up_proj, model.layers.[0-69].self_attn.o_proj, model.layers.[0-69].self_attn.qkv_proj, model.layers.[1-69].mlp.experts.[0-383].down_proj, model.layers.[1-69].mlp.experts.[0-383].gate_proj, model.layers.[1-69].mlp.experts.[0-383].up_proj, model.mtp.layers.[0-2].eh_proj, model.mtp.layers.[0-2].mlp.down_proj, model.mtp.layers.[0-2].mlp.gate_proj, model.mtp.layers.[0-2].mlp.up_proj, model.mtp.layers.[0-2].self_attn.o_proj, model.mtp.layers.[0-2].self_attn.qkv_proj
Ignored layers (71): lm_head, model.embed_tokens, model.layers.[1-69].mlp.gate
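The `Memory usage` lines in the quantization log above can be summarized with a few lines of Python. This is a minimal sketch, not part of auto-round: the regular expression is written against the line format shown in this log, and the helper name `peak_memory` is hypothetical.

```python
import re

# Matches the memory report emitted in the log above, e.g.
#   ... Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
MEM_RE = re.compile(r"'peak_ram': ([\d.]+)GB, 'peak_vram': ([\d.]+)GB")


def peak_memory(log_text: str) -> tuple[float, float]:
    """Return (max peak_ram, max peak_vram) in GB over all matching lines."""
    pairs = [(float(ram), float(vram)) for ram, vram in MEM_RE.findall(log_text)]
    if not pairs:
        raise ValueError("no 'Memory usage' lines found")
    return max(p[0] for p in pairs), max(p[1] for p in pairs)


# Three lines copied from the log above.
sample = """\
2026-05-11 21:56:37 INFO model_free.py L1271: Memory usage: 'peak_ram': 32.91GB, 'peak_vram': 4.78GB
2026-05-11 22:01:07 INFO model_free.py L1271: Memory usage: 'peak_ram': 96.1GB, 'peak_vram': 4.78GB
2026-05-11 22:44:36 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 5.69GB
"""

print(peak_memory(sample))
```

For the full run this should agree with the final summary line (`'peak_ram': 109.48GB, 'peak_vram': 5.69GB`).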
docker run --gpus '"device=0,1,2,3"' -ti --rm --name="test" \
--privileged --ipc=host -p 8000:8000 \
-v /data7/saved/MiMo-V2.5-Pro-int4-mixed:/MiMo-V2.5-Pro-int4-mixed \
-e no_proxy="localhost,127.0.0.1,::1,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12" \
-e NCCL_P2P_DISABLE=1 \
vllm/vllm-openai:mimov25-cu129 \
/MiMo-V2.5-Pro-int4-mixed \
--trust-remote-code \
--generation-config vllm \
--tensor-parallel-size 4 \
--cpu-offload-gb 80 \
--gpu-memory-utilization 0.98 \
--max-model-len 512 \
--enforce-eager
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worke
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] return func(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] self.worker.load_model()
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] return func(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4793, in load_model
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] self.model = model_loader.load_model(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] return func(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] model = initialize_model(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] return func(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 61, in initialize_model
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] model = model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2.py", line 685, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] self.model = MiMoV2Model(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 379, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] old_init(self, *args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2.py", line 461, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] self.embed_tokens = VocabParallelEmbedding(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 284, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] raise NotImplementedError(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] NotImplementedError: The class UnquantizedLinearMethod must implement the 'embedding' method, see UnquantizedEmbeddingMethod.
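The traceback ends in `VocabParallelEmbedding`, and the quantization log reports that `embed_tokens` was added to `ignore_layers`, so one thing worth checking is how the exported checkpoint describes that layer. The following is a minimal sketch under assumptions: it assumes the output directory contains a `config.json` with a `quantization_config` entry, makes no assumption about the key names inside it, and simply walks the JSON looking for any key or string that mentions `embed_tokens`. The helper names `find_mentions` and `inspect_checkpoint` are hypothetical, and this is a diagnostic aid rather than a confirmed root cause.

```python
import json
import sys
from pathlib import Path


def find_mentions(node, needle, path="quantization_config"):
    """Yield (json_path, value) for every key or string value containing needle."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if needle in str(key):
                yield child, value
            yield from find_mentions(value, needle, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_mentions(item, needle, f"{path}[{i}]")
    elif isinstance(node, str) and needle in node:
        yield path, node


def inspect_checkpoint(model_dir, needle="embed_tokens"):
    """Return all mentions of needle in quantization_config, or None if absent."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    qcfg = config.get("quantization_config")
    if qcfg is None:
        return None
    return list(find_mentions(qcfg, needle))


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python inspect_quant_config.py /data7/saved/MiMo-V2.5-Pro-int4-mixed
    hits = inspect_checkpoint(sys.argv[1])
    if hits is None:
        print("config.json has no quantization_config entry")
    else:
        for json_path, value in hits:
            print(f"{json_path}: {value!r}")
```

If the script prints an entry for `embed_tokens`, that entry is a candidate for what leads vLLM to select `UnquantizedLinearMethod` for the embedding layer; if it prints nothing, the layer is probably being matched some other way.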
Model: https://huggingface.co/INC4AI/MiMo-V2.5-Pro-int4-mixed
Quant log:
Inference command:
Inference log: