MiMo-V2.5-Pro int4 inference: NotImplementedError: The class UnquantizedLinearMethod must implement the 'embedding' method, see UnquantizedEmbeddingMethod. #1826

@XuehaoSun

Description

Model: https://huggingface.co/INC4AI/MiMo-V2.5-Pro-int4-mixed

Quant log:

 CUDA_VISIBLE_DEVICES=0 auto-round --scheme w4a16_mixed --iters 0 --disable_opt_rtn --model_name /data7/models/MiMo-V2.5-Pro --output_dir /data7/
2026-05-11 21:55:39 INFO entry.py L401: Auto-routing to model-free quantization (iters=0, disable_opt_rtn=True, supported scheme). Pass disable_model_free=True to use the regula
/home/uttest/miniforge3/envs/test/lib/python3.12/site-packages/transformers/modeling_rope_utils.py:1034: FutureWarning: `rope_config_validation` is deprecated and has been remov
  warnings.warn(
2026-05-11 21:55:51 WARNING model_free.py L1144: Detected 2 layer(s) incompatible with model-free RTN: embed_tokens, rotary_emb, swa_rotary_emb.
These layers have been automatically added to ignore_layers and will be kept in full precision.
To override, pass --ignore_layers explicitly or disable model-free mode (remove --model_free).
2026-05-11 21:55:51 INFO model_free.py L1177: Detected FP8 source model (block_size=[128, 128], scale_fmt=N/A). FP8 weights will be dequantized before quantization.
2026-05-11 21:55:51 INFO model_free.py L1395: Model-free quantization: /data7/models/MiMo-V2.5-Pro
  Scheme: QuantizationScheme(bits=4, group_size=128, sym=True, data_type='int', act_bits=16, act_group_size=None, act_sym=None, act_data_type=None, act_dynamic=None, super_bits=
  Output: /data7/saved/MiMo-V2.5-Pro-int4-mixed
  Shards: 34
  Streaming download: False
  Diffusion model: False
  Quant lm_head: False
  Quant nontext module: False
  Device: cuda:0
Processing shards:   0%|                                                                                                                               | 0/34 [00:00<?, ?shard/s]
2026-05-11 21:56:37 INFO model_free.py L1271: Memory usage: 'peak_ram': 32.91GB, 'peak_vram': 4.78GB
Processing shards:   3%|███▌                                                                                                                   | 1/34 [00:46<25:40, 46.67s/shard]
2026-05-11 21:58:08 INFO model_free.py L1271: Memory usage: 'peak_ram': 71.42GB, 'peak_vram': 4.78GB
Processing shards:   6%|███████                                                                                                                | 2/34 [02:17<38:49, 72.81s/shard]
2026-05-11 21:59:55 INFO model_free.py L1271: Memory usage: 'peak_ram': 71.42GB, 'peak_vram': 4.78GB
Processing shards:   9%|██████████▌                                                                                                            | 3/34 [04:04<45:38, 88.33s/shard]
2026-05-11 22:01:07 INFO model_free.py L1271: Memory usage: 'peak_ram': 96.1GB, 'peak_vram': 4.78GB
Processing shards:  12%|██████████████                                                                                                         | 4/34 [05:16<40:49, 81.66s/shard]
2026-05-11 22:02:36 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.04GB, 'peak_vram': 4.78GB
Processing shards:  15%|█████████████████▌                                                                                                     | 5/34 [06:44<40:43, 84.26s/shard]
2026-05-11 22:03:54 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.04GB, 'peak_vram': 4.78GB
Processing shards:  18%|█████████████████████                                                                                                  | 6/34 [08:03<38:21, 82.20s/shard]
2026-05-11 22:05:07 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  21%|████████████████████████▌                                                                                              | 7/34 [09:16<35:42, 79.36s/shard]
2026-05-11 22:06:55 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  24%|████████████████████████████                                                                                           | 8/34 [11:04<38:20, 88.46s/shard]
2026-05-11 22:08:25 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  26%|███████████████████████████████▌                                                                                       | 9/34 [12:34<36:59, 88.78s/shard]
2026-05-11 22:09:45 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  29%|██████████████████████████████████▋                                                                                   | 10/34 [13:54<34:31, 86.29s/shard]
2026-05-11 22:11:22 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  32%|██████████████████████████████████████▏                                                                               | 11/34 [15:31<34:20, 89.57s/shard]
2026-05-11 22:12:37 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  35%|█████████████████████████████████████████▋                                                                            | 12/34 [16:46<31:08, 84.94s/shard]
2026-05-11 22:13:59 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  38%|█████████████████████████████████████████████                                                                         | 13/34 [18:08<29:27, 84.18s/shard]
2026-05-11 22:15:14 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  41%|████████████████████████████████████████████████▌                                                                     | 14/34 [19:23<27:06, 81.34s/shard]
2026-05-11 22:16:37 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  44%|████████████████████████████████████████████████████                                                                  | 15/34 [20:46<25:56, 81.94s/shard]
2026-05-11 22:18:07 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  47%|███████████████████████████████████████████████████████▌                                                              | 16/34 [22:16<25:19, 84.40s/shard]
2026-05-11 22:19:25 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  50%|███████████████████████████████████████████████████████████                                                           | 17/34 [23:34<23:20, 82.37s/shard]
2026-05-11 22:20:39 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  53%|██████████████████████████████████████████████████████████████▍                                                       | 18/34 [24:48<21:18, 79.90s/shard]
2026-05-11 22:21:51 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  56%|█████████████████████████████████████████████████████████████████▉                                                    | 19/34 [26:00<19:20, 77.38s/shard]
2026-05-11 22:23:18 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  59%|█████████████████████████████████████████████████████████████████████▍                                                | 20/34 [27:27<18:44, 80.33s/shard]
2026-05-11 22:25:00 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  62%|████████████████████████████████████████████████████████████████████████▉                                             | 21/34 [29:09<18:49, 86.89s/shard]
2026-05-11 22:26:52 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  65%|████████████████████████████████████████████████████████████████████████████▎                                         | 22/34 [31:01<18:53, 94.48s/shard]
2026-05-11 22:28:17 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  68%|███████████████████████████████████████████████████████████████████████████████▊                                      | 23/34 [32:25<16:45, 91.41s/shard]
2026-05-11 22:29:46 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  71%|███████████████████████████████████████████████████████████████████████████████████▎                                  | 24/34 [33:54<15:07, 90.71s/shard]
2026-05-11 22:31:11 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  74%|██████████████████████████████████████████████████████████████████████████████████████▊                               | 25/34 [35:20<13:22, 89.14s/shard]2026-05-11 22:31:30 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:32:39 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  76%|██████████████████████████████████████████████████████████████████████████████████████████▏                           | 26/34 [36:47<11:48, 88.62s/shard]2026-05-11 22:32:57 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:33:59 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  79%|█████████████████████████████████████████████████████████████████████████████████████████████▋                        | 27/34 [38:08<10:03, 86.25s/shard]2026-05-11 22:34:21 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:35:37 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  82%|█████████████████████████████████████████████████████████████████████████████████████████████████▏                    | 28/34 [39:46<08:57, 89.61s/shard]2026-05-11 22:35:58 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:37:12 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  85%|████████████████████████████████████████████████████████████████████████████████████████████████████▋                 | 29/34 [41:21<07:37, 91.43s/shard]2026-05-11 22:37:35 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:38:57 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  88%|████████████████████████████████████████████████████████████████████████████████████████████████████████              | 30/34 [43:06<06:21, 95.46s/shard]2026-05-11 22:39:29 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:40:46 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  91%|███████████████████████████████████████████████████████████████████████████████████████████████████████████▌          | 31/34 [44:55<04:58, 99.54s/shard]2026-05-11 22:41:09 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:42:27 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  94%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████       | 32/34 [46:35<03:19, 99.76s/shard]2026-05-11 22:42:52 INFO model_free.py L639: Dequantizing 2484 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:44:10 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 4.78GB
Processing shards:  97%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌   | 33/34 [48:19<01:40, 101.00s/shard]2026-05-11 22:44:11 INFO model_free.py L639: Dequantizing 12 FP8 weight tensor(s) to bfloat16.
2026-05-11 22:44:36 INFO model_free.py L1271: Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 5.69GB
Processing shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 34/34 [48:45<00:00, 86.04s/shard]
2026-05-11 22:44:40 INFO model_free.py L1360:
Model-free quantization complete.
  Output directory: /data7/saved/MiMo-V2.5-Pro-int4-mixed
  Total time: 2928.87 seconds
  Memory usage: 'peak_ram': 109.48GB, 'peak_vram': 5.69GB
  Quantized layers (79649): model.layers.0.mlp.down_proj, model.layers.0.mlp.gate_proj, model.layers.0.mlp.up_proj, model.layers.[0-69].self_attn.o_proj, model.layers.[0-69].self_attn.qkv_proj, model.layers.[1-69].mlp.experts.[0-383].down_proj, model.layers.[1-69].mlp.experts.[0-383].gate_proj, model.layers.[1-69].mlp.experts.[0-383].up_proj, model.mtp.layers.[0-2].eh_proj, model.mtp.layers.[0-2].mlp.down_proj, model.mtp.layers.[0-2].mlp.gate_proj, model.mtp.layers.[0-2].mlp.up_proj, model.mtp.layers.[0-2].self_attn.o_proj, model.mtp.layers.[0-2].self_attn.qkv_proj
  Ignored layers (71): lm_head, model.embed_tokens, model.layers.[1-69].mlp.gate

Inference command:

docker run --gpus '"device=0,1,2,3"' -ti --rm --name="test" \
  --privileged --ipc=host -p 8000:8000 \
  -v /data7/saved/MiMo-V2.5-Pro-int4-mixed:/MiMo-V2.5-Pro-int4-mixed \
  -e no_proxy="localhost,127.0.0.1,::1,192.168.0.0/16,10.0.0.0/8,172.16.0.0/12" \
  -e NCCL_P2P_DISABLE=1 \
  vllm/vllm-openai:mimov25-cu129 \
  /MiMo-V2.5-Pro-int4-mixed \
  --trust-remote-code \
  --generation-config vllm \
  --tensor-parallel-size 4 \
  --cpu-offload-gb 80 \
  --gpu-memory-utilization 0.98 \
  --max-model-len 512 \
  --enforce-eager

Inference log:

(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] WorkerProc failed to start.
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] Traceback (most recent call last):
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 837, in worke
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     worker = WorkerProc(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 619, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     self.worker.load_model()
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4793, in load_model
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     self.model = model_loader.load_model(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/base_loader.py", line 55, in load_model
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     model = initialize_model(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]             ^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     return func(*args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/utils.py", line 61, in initialize_model
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2.py", line 685, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     self.model = MiMoV2Model(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]                  ^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 379, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     old_init(self, *args, **kwargs)
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2.py", line 461, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     self.embed_tokens = VocabParallelEmbedding(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]                         ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/vocab_parallel_embedding.py", line 284, in __init__
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870]     raise NotImplementedError(
(Worker_TP1 pid=601) ERROR 05-16 07:05:28 [multiproc_executor.py:870] NotImplementedError: The class UnquantizedLinearMethod must implement the 'embedding' method, see UnquantizedEmbeddingMethod.
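
Reading the traceback, the failure happens while `VocabParallelEmbedding` is being constructed for `embed_tokens`: the quantization method selected for that layer is `UnquantizedLinearMethod`, which has no `embedding` implementation. Since the quant log lists `model.embed_tokens` among the ignored layers, one possible explanation is that the ignored layer is being handed the linear fallback instead of an embedding-capable one. The sketch below only illustrates that kind of check with stand-in classes. It is not the vLLM source, and `QuantizeMethodBase` as written here, `implements_embedding`, and `pick_embedding_method` are hypothetical names used for illustration.

```python
# Illustrative stand-ins only; these are NOT the real vLLM classes.
class QuantizeMethodBase:
    def embedding(self, layer, *args, **kwargs):
        # Placeholder that subclasses are expected to override.
        raise NotImplementedError


class UnquantizedLinearMethod(QuantizeMethodBase):
    """Linear-only method: does not override `embedding`."""

    def apply(self, layer, x):
        return x


class UnquantizedEmbeddingMethod(QuantizeMethodBase):
    """Embedding-capable method: overrides `embedding`."""

    def embedding(self, layer, input_ids):
        return [layer["weight"][i] for i in input_ids]


def implements_embedding(method_cls):
    # True only if a subclass replaced the base-class placeholder.
    return method_cls.embedding is not QuantizeMethodBase.embedding


def pick_embedding_method(quant_method):
    # Hypothetical helper: fall back to the embedding method when the
    # quantization config returns nothing for this layer.
    if quant_method is None:
        quant_method = UnquantizedEmbeddingMethod()
    if not implements_embedding(type(quant_method)):
        raise NotImplementedError(
            f"The class {type(quant_method).__name__} must implement "
            "the 'embedding' method, see UnquantizedEmbeddingMethod."
        )
    return quant_method
```

Under this sketch, passing `UnquantizedLinearMethod()` raises the same message as in the log, while passing `None` falls back to the embedding method. If the real code path behaves similarly, the fix would presumably be for the quantization config to return no method (or an embedding-capable one) for ignored embedding layers.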
