
Commit 2496f9c

tc-mb and CISC authored
mtmd : support MiniCPM-V 4.6 (ggml-org#22529)
* Support MiniCPM-V 4.6 in new branch
* fix code bug
* fix pre-commit
* fix convert
* rename clip_graph_minicpmv4_6
* use new TYPE_MINICPMV4_6
* use build_attn to allow flash attention support
* no use legacy code, restored here.
* use the existing tensors name
* unused ctx->model.hparams.minicpmv_version
* use n_merge for slice alignment
* borrow wa_layer_indexes for vit_merger insertion point
* fix code style
* Update convert_hf_to_gguf.py (co-authored with Sigbjørn Skjæret)
* use filter_tensors and add model.vision_tower
* fix chkhsh
* fix type check

Signed-off-by: tc-mb <tianchi_cai@icloud.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
1 parent 5207d12 · commit 2496f9c

13 files changed · 701 additions & 3 deletions


convert_hf_to_gguf.py

Lines changed: 90 additions & 2 deletions
```diff
@@ -1360,6 +1360,9 @@ def get_vocab_base_pre(self, tokenizer) -> str:
         if chkhsh == "d4540891389ea895b53b399da6ac824becc30f2fba0e9ddbb98f92e55ca0e97c":
             # ref: https://huggingface.co/Qwen/Qwen3-Embedding-0.6B
             res = "qwen2"
+        if chkhsh == "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f":
+            # ref: https://huggingface.co/openbmb/MiniCPM-V-4_6
+            res = "qwen35"
         if chkhsh == "66b8d4e19ab16c3bfd89bce5d785fb7e0155e8648708a1f42077cb9fe002c273":
             # ref: https://huggingface.co/alvarobartt/grok-2-tokenizer
             res = "grok-2"
@@ -5499,16 +5502,101 @@ def modify_tensors(self, data_torch: Tensor, name: str, bid: int | None) -> Iter
         yield from super().modify_tensors(data_torch, name, bid)


+class _Qwen35MRopeMixin:
+    # Qwen3.5 always applies interleaved MRoPE (see Qwen3_5RotaryEmbedding in transformers);
+    # the upstream default mrope_section is [11, 11, 10] and llama.cpp's QWEN35 / QWEN35MOE
+    # loaders treat qwen35.rope.dimension_sections as required, so make sure it is always
+    # written even when a particular checkpoint omits the field in `rope_parameters`.
+    _QWEN35_DEFAULT_MROPE_SECTION = [11, 11, 10, 0]
+
+    gguf_writer: gguf.GGUFWriter
+    rope_parameters: dict
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()  # ty: ignore[unresolved-attribute]
+        if "mrope_section" not in self.rope_parameters:
+            self.gguf_writer.add_rope_dimension_sections(self._QWEN35_DEFAULT_MROPE_SECTION)
+
+
 @ModelBase.register("Qwen3_5ForConditionalGeneration", "Qwen3_5ForCausalLM")
-class Qwen3_5TextModel(_LinearAttentionVReorderBase):
+class Qwen3_5TextModel(_Qwen35MRopeMixin, _LinearAttentionVReorderBase):
     model_arch = gguf.MODEL_ARCH.QWEN35


 @ModelBase.register("Qwen3_5MoeForConditionalGeneration", "Qwen3_5MoeForCausalLM")
-class Qwen3_5MoeTextModel(_LinearAttentionVReorderBase):
+class Qwen3_5MoeTextModel(_Qwen35MRopeMixin, _LinearAttentionVReorderBase):
     model_arch = gguf.MODEL_ARCH.QWEN35MOE


+# MiniCPM-V 4.6: text tower is Qwen3.5 (linear+full hybrid attention) wrapped under
+# `model.language_model.*`; vision tower is SigLIP + a window-attention ViT merger
+# + a final DownsampleMLP merger. The same HF arch is registered twice below: once as
+# the LM (text mode) and once as the mmproj (vision mode), mirroring the Qwen3-VL setup.
+
+@ModelBase.register("MiniCPMV4_6ForConditionalGeneration")
+class MiniCPMV4_6TextModel(Qwen3_5TextModel):
+    model_arch = gguf.MODEL_ARCH.QWEN35
+
+    @classmethod
+    def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
+        name, gen = item
+
+        if name.startswith("model.merger."):
+            return None
+        # MTP tensors are not used at inference yet; align with Qwen3Next behaviour
+        if name.startswith("mtp"):
+            return None
+
+        return super().filter_tensors(item)
+
+
+@ModelBase.register("MiniCPMV4_6ForConditionalGeneration")
+class MiniCPMV4_6VisionModel(MmprojModel):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        if self.hparams_vision is not None:
+            # In MiniCPM-V 4.6 `vision_config.image_size` (980) describes the SigLIP
+            # positional embedding bucket grid (70 x 70), while the per-slice processing
+            # resolution is the preprocessor's `scale_resolution` (typically 448).
+            # The CLIP loader in tools/mtmd/clip.cpp consumes `clip.vision.image_size`
+            # as the slice size and warmup resolution, so report `scale_resolution` there
+            # to match the upstream MiniCPMV4_6ImageProcessorPil slicing rules.
+            scale_resolution = self.preprocessor_config.get("scale_resolution")
+            if scale_resolution is not None:
+                self.hparams_vision["image_size"] = int(scale_resolution)
+
+    def set_gguf_parameters(self):
+        super().set_gguf_parameters()
+        assert self.hparams_vision is not None
+
+        # projector type string is consumed by clip_projector_type_from_string() in clip.cpp
+        # (mapped to PROJECTOR_TYPE_MINICPMV4_6).
+        self.gguf_writer.add_clip_projector_type(gguf.VisionProjectorType.MINICPMV4_6)
+
+        # ViT merger 2x2 + final merger 2x2 = 4x spatial merge per dimension; used for slice alignment
+        self.gguf_writer.add_vision_projector_scale_factor(4)
+
+        # borrow wa_layer_indexes for vit_merger insertion point
+        insert_layer_id = int(self.global_config.get(
+            "insert_layer_id", self.hparams_vision.get("insert_layer_id", 6)))
+        self.gguf_writer.add_vision_wa_layer_indexes([insert_layer_id])
+
+        # SigLIP vision body uses gelu_pytorch_tanh, which matches ggml_gelu (tanh approx).
+        self.gguf_writer.add_vision_use_gelu(True)
+        self.gguf_writer.add_vision_attention_layernorm_eps(
+            self.hparams_vision.get("layer_norm_eps", 1e-6))
+
+    @classmethod
+    def filter_tensors(cls, item: tuple[str, Callable[[], Tensor]]) -> tuple[str, Callable[[], Tensor]] | None:
+        name, gen = item
+
+        # lm_head / MTP -> belong to the LM file
+        if name.startswith(("lm_head.", "mtp")):
+            return None
+
+        return super().filter_tensors(item)
+
+
 @ModelBase.register("GPT2LMHeadModel")
 class GPT2Model(TextModel):
     model_arch = gguf.MODEL_ARCH.GPT2
```
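To make the slice-alignment numbers in the converter concrete: SigLIP patches are 14 px (assumed from the model family), and the 2x2 ViT merger followed by the 2x2 DownsampleMLP merger gives the 4x per-dimension merge written via `add_vision_projector_scale_factor(4)`. A minimal arithmetic sketch, illustrative only and not code from the commit (the 448 slice size is the `scale_resolution` mentioned above):

```python
# Illustrative token-count arithmetic for one image slice.
PATCH_SIZE = 14   # SigLIP patch size (assumption, not read from the commit)
N_MERGE = 4       # 2x2 ViT merger * 2x2 DownsampleMLP merger, per dimension

def tokens_per_slice(slice_px: int = 448) -> int:
    """Embeddings a square slice contributes after both merge stages."""
    patches_per_side = slice_px // PATCH_SIZE      # 448 // 14 = 32
    merged_per_side = patches_per_side // N_MERGE  # 32 // 4 = 8
    return merged_per_side * merged_per_side       # 8 * 8 = 64

# Slice dimensions must therefore align to PATCH_SIZE * N_MERGE = 56 px so the
# merged grid divides evenly; this is what "use n_merge for slice alignment"
# in the commit log refers to.
assert tokens_per_slice(448) == 64
```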

convert_hf_to_gguf_update.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -175,6 +175,7 @@ class TOKENIZER_TYPE(IntEnum):
     {"name": "falcon-h1", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/tiiuae/Falcon-H1-34B-Base", "chkhsh": "48f8e02c0359c0bbdd82f26909171fac1c18a457bb47573ed1fe3bbb2c1cfd4b"},
     {"name": "kimi-k2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/moonshotai/Kimi-K2-Base", "chkhsh": "81212dc7cdb7e0c1074ca62c5aeab0d43c9f52b8a737be7b12a777c953027890"},
     {"name": "qwen2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/Qwen/Qwen3-Embedding-0.6B", "chkhsh": "d4540891389ea895b53b399da6ac824becc30f2fba0e9ddbb98f92e55ca0e97c"},
+    {"name": "qwen35", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/openbmb/MiniCPM-V-4_6", "chkhsh": "1444df51289cfa8063b96f0e62b1125440111bc79a52003ea14b6eac7016fd5f"},
     {"name": "grok-2", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/alvarobartt/grok-2-tokenizer", "chkhsh": "66b8d4e19ab16c3bfd89bce5d785fb7e0155e8648708a1f42077cb9fe002c273"},
     # jina-v2-de variants
     {"name": "jina-v2-de", "tokt": TOKENIZER_TYPE.BPE, "repo": "https://huggingface.co/aari1995/German_Semantic_V3", "chkhsh": "b3d1dd861f1d4c5c0d2569ce36baf3f90fe8a102db3de50dd71ff860d91be3df"},
```

docs/multimodal/minicpmv4.6.md

Lines changed: 49 additions & 0 deletions
````diff
@@ -0,0 +1,49 @@
+## MiniCPM-V 4.6
+
+### Prepare models and code
+
+Download the [MiniCPM-V-4_6](https://huggingface.co/openbmb/MiniCPM-V-4_6) PyTorch model from Hugging Face into a `MiniCPM-V-4_6` folder.
+
+The model must be the standard `transformers` v5.7.0+ checkpoint (no `trust_remote_code`); the architecture in `config.json` is `MiniCPMV4_6ForConditionalGeneration` with a `qwen3_5_text` text model and a SigLIP-based vision tower plus a window-attention `vit_merger`.
+
+### Build llama.cpp
+
+If your setup differs, refer to the official build [documentation](https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md).
+
+Clone llama.cpp:
+```bash
+git clone https://github.com/ggml-org/llama.cpp
+cd llama.cpp
+```
+
+Build llama.cpp using `CMake`:
+```bash
+cmake -B build
+cmake --build build --config Release
+```
+
+
+### Usage of MiniCPM-V 4.6
+
+Unlike older MiniCPM-V variants, MiniCPM-V 4.6 is converted directly through `convert_hf_to_gguf.py`. The same script is invoked twice on the original Hugging Face directory: once to produce the language-model GGUF and once with `--mmproj` to produce the multimodal projector GGUF.
+
+```bash
+# language model
+python ./convert_hf_to_gguf.py ../MiniCPM-V-4_6 --outfile ../MiniCPM-V-4_6/ggml-model-f16.gguf
+
+# multimodal projector (vision tower + window-attention vit_merger + DownsampleMLP merger)
+python ./convert_hf_to_gguf.py ../MiniCPM-V-4_6 --mmproj --outfile ../MiniCPM-V-4_6/mmproj-model-f16.gguf
+
+# optional: quantize to Q4_K_M
+./build/bin/llama-quantize ../MiniCPM-V-4_6/ggml-model-f16.gguf ../MiniCPM-V-4_6/ggml-model-Q4_K_M.gguf Q4_K_M
+```
+
+
+Inference on Linux or macOS:
+```bash
+# run in single-turn mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-V-4_6/ggml-model-f16.gguf --mmproj ../MiniCPM-V-4_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"
+
+# run in conversation mode
+./build/bin/llama-mtmd-cli -m ../MiniCPM-V-4_6/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-V-4_6/mmproj-model-f16.gguf
+```
````
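After conversion, the mmproj metadata written by the converter can be spot-checked with gguf-py's reader. A hedged sketch: the `clip.*` key names follow the convention consumed by `tools/mtmd/clip.cpp`, and `ReaderField.contents()` assumes a reasonably recent gguf-py:

```python
from gguf.gguf_reader import GGUFReader

reader = GGUFReader("../MiniCPM-V-4_6/mmproj-model-f16.gguf")
for key in ("clip.projector_type", "clip.vision.image_size"):
    field = reader.get_field(key)
    if field is not None:
        print(key, "=", field.contents())
# Per the converter above this should report projector_type "minicpmv4_6"
# and image_size 448 (the preprocessor's scale_resolution).
```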

gguf-py/gguf/constants.py

Lines changed: 25 additions & 0 deletions
```diff
@@ -773,6 +773,14 @@ class MODEL_TENSOR(IntEnum):
     V_DS_NORM = auto() # qwen3vl
     V_DS_FC1 = auto() # qwen3vl
     V_DS_FC2 = auto() # qwen3vl
+    V_MERGER_LN1 = auto() # minicpmv4_6
+    V_MERGER_ATTN_Q = auto() # minicpmv4_6
+    V_MERGER_ATTN_K = auto() # minicpmv4_6
+    V_MERGER_ATTN_V = auto() # minicpmv4_6
+    V_MERGER_ATTN_O = auto() # minicpmv4_6
+    V_MERGER_DS_LN = auto() # minicpmv4_6
+    V_MERGER_DS_UP = auto() # minicpmv4_6
+    V_MERGER_DS_DOWN = auto() # minicpmv4_6
     V_MM_POST_FC_NORM = auto() # cogvlm
     V_MM_UP = auto() # cogvlm
     V_MM_DOWN = auto() # cogvlm
@@ -1277,6 +1285,14 @@ class MODEL_TENSOR(IntEnum):
     MODEL_TENSOR.V_DS_NORM: "v.deepstack.{bid}.norm",
     MODEL_TENSOR.V_DS_FC1: "v.deepstack.{bid}.fc1",
     MODEL_TENSOR.V_DS_FC2: "v.deepstack.{bid}.fc2",
+    MODEL_TENSOR.V_MERGER_LN1: "v.vit_merger.ln1",
+    MODEL_TENSOR.V_MERGER_ATTN_Q: "v.vit_merger.attn_q",
+    MODEL_TENSOR.V_MERGER_ATTN_K: "v.vit_merger.attn_k",
+    MODEL_TENSOR.V_MERGER_ATTN_V: "v.vit_merger.attn_v",
+    MODEL_TENSOR.V_MERGER_ATTN_O: "v.vit_merger.attn_out",
+    MODEL_TENSOR.V_MERGER_DS_LN: "v.vit_merger.ds_ln",
+    MODEL_TENSOR.V_MERGER_DS_UP: "v.vit_merger.ds_ffn_up",
+    MODEL_TENSOR.V_MERGER_DS_DOWN: "v.vit_merger.ds_ffn_down",
     MODEL_TENSOR.V_MM_POST_FC_NORM: "mm.post_fc_norm", # cogvlm
     MODEL_TENSOR.V_MM_UP: "mm.up",
     MODEL_TENSOR.V_MM_DOWN: "mm.down",
@@ -1449,6 +1465,14 @@ class MODEL_TENSOR(IntEnum):
         MODEL_TENSOR.V_DS_NORM,
         MODEL_TENSOR.V_DS_FC1,
         MODEL_TENSOR.V_DS_FC2,
+        MODEL_TENSOR.V_MERGER_LN1,
+        MODEL_TENSOR.V_MERGER_ATTN_Q,
+        MODEL_TENSOR.V_MERGER_ATTN_K,
+        MODEL_TENSOR.V_MERGER_ATTN_V,
+        MODEL_TENSOR.V_MERGER_ATTN_O,
+        MODEL_TENSOR.V_MERGER_DS_LN,
+        MODEL_TENSOR.V_MERGER_DS_UP,
+        MODEL_TENSOR.V_MERGER_DS_DOWN,
         MODEL_TENSOR.V_MM_POST_FC_NORM,
         MODEL_TENSOR.V_MM_UP,
         MODEL_TENSOR.V_MM_DOWN,
@@ -4224,6 +4248,7 @@ class VisionProjectorType:
     NEMOTRON_V2_VL = "nemotron_v2_vl"
     HUNYUANOCR = "hunyuanocr"
     HUNYUANVL = "hunyuanvl"
+    MINICPMV4_6 = "minicpmv4_6"
     GRANITE_SPEECH = "granite_speech" # audio

```
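A quick way to check what the new enum members serialize to is to read them back from the `TENSOR_NAMES` table (a sketch assuming a gguf-py checkout that includes this commit):

```python
from gguf.constants import MODEL_TENSOR, TENSOR_NAMES

# The merger entries resolve to flat "v.vit_merger.*" names; none of them carry
# a {bid} placeholder because the vit_merger is a single inserted block.
for t in (MODEL_TENSOR.V_MERGER_LN1,
          MODEL_TENSOR.V_MERGER_ATTN_Q,
          MODEL_TENSOR.V_MERGER_DS_DOWN):
    print(t.name, "->", TENSOR_NAMES[t])
# V_MERGER_LN1 -> v.vit_merger.ln1
# V_MERGER_ATTN_Q -> v.vit_merger.attn_q
# V_MERGER_DS_DOWN -> v.vit_merger.ds_ffn_down
```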
gguf-py/gguf/tensor_mapping.py

Lines changed: 46 additions & 0 deletions
```diff
@@ -1399,6 +1399,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_EMBD_PATCH: (
             "vision_tower.vision_model.embeddings.patch_embedding",
+            "model.vision_tower.embeddings.patch_embedding", # minicpmv4_6
             "model.vision_tower.embeddings.patch_embeddings.projection", # Intern-S1
             "vpm.embeddings.patch_embedding",
             "model.vision_model.embeddings.patch_embedding", # SmolVLM
@@ -1424,6 +1425,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_EMBD_POS: (
             "vision_tower.vision_model.embeddings.position_embedding",
+            "model.vision_tower.embeddings.position_embedding", # minicpmv4_6
             "model.vision_tower.embeddings.position_embeddings", # Intern-S1
             "vpm.embeddings.position_embedding",
             "model.vision_model.embeddings.position_embedding", # SmolVLM
@@ -1460,6 +1462,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_ATTN_Q: (
             "vision_tower.vision_model.encoder.layers.{bid}.self_attn.q_proj",
+            "model.vision_tower.encoder.layers.{bid}.self_attn.q_proj", # minicpmv4_6
             "model.vision_tower.encoder.layer.{bid}.attention.q_proj", # Intern-S1
             "vpm.encoder.layers.{bid}.self_attn.q_proj",
             "model.vision_model.encoder.layers.{bid}.self_attn.q_proj", # SmolVLM
@@ -1483,6 +1486,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_ATTN_K: (
             "vision_tower.vision_model.encoder.layers.{bid}.self_attn.k_proj",
+            "model.vision_tower.encoder.layers.{bid}.self_attn.k_proj", # minicpmv4_6
             "model.vision_tower.encoder.layer.{bid}.attention.k_proj", # Intern-S1
             "vpm.encoder.layers.{bid}.self_attn.k_proj",
             "model.vision_model.encoder.layers.{bid}.self_attn.k_proj", # SmolVLM
@@ -1506,6 +1510,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_ATTN_V: (
             "vision_tower.vision_model.encoder.layers.{bid}.self_attn.v_proj",
+            "model.vision_tower.encoder.layers.{bid}.self_attn.v_proj", # minicpmv4_6
             "model.vision_tower.encoder.layer.{bid}.attention.v_proj", # Intern-S1
             "vpm.encoder.layers.{bid}.self_attn.v_proj",
             "model.vision_model.encoder.layers.{bid}.self_attn.v_proj", # SmolVLM
@@ -1522,6 +1527,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_INPUT_NORM: (
             "vision_tower.vision_model.encoder.layers.{bid}.layer_norm1",
+            "model.vision_tower.encoder.layers.{bid}.layer_norm1", # minicpmv4_6
             "vision_tower.vision_model.encoder.layers.{bid}.norm1", # InternVL
             "model.vision_tower.encoder.layer.{bid}.layernorm_before", # Intern-S1
             "vpm.encoder.layers.{bid}.layer_norm1",
@@ -1542,6 +1548,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_ATTN_O: (
             "vision_tower.vision_model.encoder.layers.{bid}.self_attn.out_proj",
+            "model.vision_tower.encoder.layers.{bid}.self_attn.out_proj", # minicpmv4_6
             "vision_tower.vision_model.encoder.layers.{bid}.attn.proj", # InternVL
             "model.vision_tower.encoder.layer.{bid}.attention.projection_layer", # Intern-S1
             "vpm.encoder.layers.{bid}.self_attn.out_proj",
@@ -1564,6 +1571,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_POST_ATTN_NORM: (
             "vision_tower.vision_model.encoder.layers.{bid}.layer_norm2",
+            "model.vision_tower.encoder.layers.{bid}.layer_norm2", # minicpmv4_6
             "vision_tower.vision_model.encoder.layers.{bid}.norm2", # InternVL
             "model.vision_tower.encoder.layer.{bid}.layernorm_after", # Intern-S1
             "vpm.encoder.layers.{bid}.layer_norm2",
@@ -1585,6 +1593,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_FFN_UP: (
             "vision_tower.vision_model.encoder.layers.{bid}.mlp.fc1",
+            "model.vision_tower.encoder.layers.{bid}.mlp.fc1", # minicpmv4_6
             "model.vision_tower.encoder.layer.{bid}.mlp.fc1", # Intern-S1
             "vpm.encoder.layers.{bid}.mlp.fc1",
             "model.vision_model.encoder.layers.{bid}.mlp.fc1", # SmolVLM, gemma3
@@ -1613,6 +1622,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_ENC_FFN_DOWN: (
             "vision_tower.vision_model.encoder.layers.{bid}.mlp.fc2",
+            "model.vision_tower.encoder.layers.{bid}.mlp.fc2", # minicpmv4_6
             "model.vision_tower.encoder.layer.{bid}.mlp.fc2", # Intern-S1
             "vpm.encoder.layers.{bid}.mlp.fc2",
             "model.vision_model.encoder.layers.{bid}.mlp.fc2", # SmolVLM, gemma3
@@ -1668,6 +1678,7 @@ class TensorNameMap:

         MODEL_TENSOR.V_POST_NORM: (
             "vision_tower.vision_model.post_layernorm",
+            "model.vision_tower.post_layernorm", # minicpmv4_6
             "model.vision_model.post_layernorm", # SmolVLM
             "vision_model.layernorm_post", # llama4
             "visual.merger.ln_q", # qwen2vl
@@ -1696,6 +1707,7 @@ class TensorNameMap:
             "mlp_AR.pre_norm", # PaddleOCR-VL
             "merger.ln_q",
             "vision_tower.merger.ln_q", # dots.ocr
+            "model.merger.mlp.0.pre_norm", # minicpmv4_6
         ),

         MODEL_TENSOR.V_MM_SOFT_EMB_NORM: (
@@ -1769,6 +1781,38 @@ class TensorNameMap:
             "model.visual.deepstack_merger_list.{bid}.linear_fc2", # deepstack in qwen3vl
         ),

+        MODEL_TENSOR.V_MERGER_LN1: (
+            "model.vision_tower.vit_merger.layer_norm1", # minicpmv4_6
+        ),
+
+        MODEL_TENSOR.V_MERGER_ATTN_Q: (
+            "model.vision_tower.vit_merger.self_attn.q_proj", # minicpmv4_6
+        ),
+
+        MODEL_TENSOR.V_MERGER_ATTN_K: (
+            "model.vision_tower.vit_merger.self_attn.k_proj", # minicpmv4_6
+        ),
+
+        MODEL_TENSOR.V_MERGER_ATTN_V: (
+            "model.vision_tower.vit_merger.self_attn.v_proj", # minicpmv4_6
+        ),
+
+        MODEL_TENSOR.V_MERGER_ATTN_O: (
+            "model.vision_tower.vit_merger.self_attn.out_proj", # minicpmv4_6
+        ),
+
+        MODEL_TENSOR.V_MERGER_DS_LN: (
+            "model.vision_tower.vit_merger.pre_norm", # minicpmv4_6
+        ),
+
+        MODEL_TENSOR.V_MERGER_DS_UP: (
+            "model.vision_tower.vit_merger.linear_1", # minicpmv4_6
+        ),
+
+        MODEL_TENSOR.V_MERGER_DS_DOWN: (
+            "model.vision_tower.vit_merger.linear_2", # minicpmv4_6
+        ),
+
         MODEL_TENSOR.V_SAM_POS_EMBD: (
             "model.sam_model.pos_embed",
         ),
@@ -1828,11 +1872,13 @@ class TensorNameMap:
         MODEL_TENSOR.V_MM_UP: (
             "model.vision.linear_proj.dense_h_to_4h", # cogvlm
             "visual.merger.up_proj", # glm4v
+            "model.merger.mlp.0.linear_1", # minicpmv4_6
         ),

         MODEL_TENSOR.V_MM_DOWN: (
             "model.vision.linear_proj.dense_4h_to_h", # cogvlm
             "visual.merger.down_proj", # glm4v
+            "model.merger.mlp.0.linear_2", # minicpmv4_6
         ),

         MODEL_TENSOR.V_MM_GATE: (
```
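These mappings are consumed through `TensorNameMap` during conversion. A hedged lookup sketch: `MODEL_ARCH.MMPROJ` carrying the vision tensor set and the encoder depth of 27 are assumptions for illustration, not values taken from this commit:

```python
from gguf.constants import MODEL_ARCH
from gguf.tensor_mapping import get_tensor_name_map

# n_blocks only affects {bid}-templated names; 27 is an assumed encoder depth.
tmap = get_tensor_name_map(MODEL_ARCH.MMPROJ, 27)

for hf_name in ("model.vision_tower.vit_merger.self_attn.q_proj.weight",
                "model.merger.mlp.0.linear_1.weight"):
    print(hf_name, "->", tmap.get_name(hf_name, try_suffixes=(".weight", ".bias")))
# With the entries above registered, these should resolve to
# "v.vit_merger.attn_q.weight" and "mm.up.weight" respectively.
```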

tools/mtmd/README.md

Lines changed: 4 additions & 0 deletions
```diff
@@ -49,6 +49,7 @@ For the following models, you can use `convert_hf_to_gguf.py` with `--mmproj` fl
 - Qwen 2 VL and Qwen 2.5 VL (from [Qwen](https://huggingface.co/Qwen))
 - [Mistral Small 3.1 24B](https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503)
 - InternVL 2.5 and InternVL 3 from [OpenGVLab](https://huggingface.co/OpenGVLab) (note: we don't support conversion of `InternVL3-*-hf` model, only non-HF version is supported ; `InternLM2Model` **text** model is not supported)
+- [MiniCPM-V 4.6](https://huggingface.co/openbmb/MiniCPM-V-4_6) ; See the guide [here](../../docs/multimodal/minicpmv4.6.md) - requires the standard `transformers` v5.7.0+ checkpoint

 For older models, please refer to the relevant guide for instructions on how to obtain or create them:

@@ -60,4 +61,7 @@ NOTE: conversion scripts are located under `tools/mtmd/legacy-models`
 - [MiniCPM-V 2.5](../../docs/multimodal/minicpmv2.5.md)
 - [MiniCPM-V 2.6](../../docs/multimodal/minicpmv2.6.md)
 - [MiniCPM-o 2.6](../../docs/multimodal/minicpmo2.6.md)
+- [MiniCPM-V 4.0](../../docs/multimodal/minicpmv4.0.md)
+- [MiniCPM-o 4.0](../../docs/multimodal/minicpmo4.0.md)
+- [MiniCPM-V 4.5](../../docs/multimodal/minicpmv4.5.md)
 - [IBM Granite Vision](../../docs/multimodal/granitevision.md)
```
