
@yashaswikarnati commented Nov 3, 2025

Qwen3VL Verification

Dense Model (8B)

Model: Qwen/Qwen3-VL-8B-Instruct

HF Logits Matching

Megatron Top 5:

[('\n\n', 25.625), ('?\n\n', 24.125), ('?', 22.0), ('?\n', 21.5), ('\n', 21.25)]

HF Top 5:

[('\n\n', 26.5), ('?\n\n', 25.25), ('?', 23.125), ('?\n', 22.75), ('\n', 22.25)]
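
For reference, the top-5 rows above are just the five highest last-position logits decoded back to strings. A minimal sketch of that readout is shown below; it assumes you already have a logits tensor and a tokenizer from whichever implementation is being checked, and it is an illustration rather than the compare script itself.

import torch

def top5_next_tokens(logits: torch.Tensor, tokenizer):
    """Return [(decoded_token, logit), ...] for the 5 highest last-position logits."""
    last = logits[0, -1] if logits.dim() == 3 else logits  # [vocab]
    values, indices = torch.topk(last.float(), k=5)
    return [(tokenizer.decode([int(i)]), round(float(v), 3)) for i, v in zip(indices, values)]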

Fine-tuning on cord-v2 Dataset

Train vs validation loss curves
WandB Link: Qwen3VL 8B Fine-tune Run

[image: train vs validation loss curves]

Inference on Sample of cord-v2 Dataset

Command:

uv run python -m torch.distributed.run --nproc_per_node=8 \
examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path="Qwen/Qwen3-VL-8B-Instruct" \
--image_path=./examples/recipes/qwen_vl/image.png \
--prompt="Describe this items and process in this image." \
--megatron_model_path ./logs/checkpoints/qwen3vl8b/ \
--max_new_tokens 150

Before Fine-tune:

<|im_start|>assistant
This is a receipt from a receipt. The receipt.
<|im_end|>

After Fine-tune:

<s_total><s_total_price>302,016</s_total_price></s_total>
<s_sub_total>
  <s_tax_price>52,416</s_tax_price>
  <s_discount_price>19,000</s_discount_price>
  <s_subtotal_price>259,000</s_subtotal_price>
</s_sub_total>
<s_menu>
  <s_price>59,000</s_price>
  <s_nm>Bintang Bremer</s_nm>
  <s_cnt>1</s_cnt>
  <sep/>
  <s_price>190,000</s_price>
  <s_nm>Chicken H-H</s_nm>
  <s_cnt>1</s_cnt>
</s_menu>
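
The tag-style target above follows the Donut-style linearization commonly used for cord-v2, where the dataset's JSON ground truth is flattened into <s_key>...</s_key> tags. The sketch below shows one way such a target string can be built; the field names assume the naver-clova-ix/cord-v2 layout on Hugging Face, and this is an illustration, not necessarily the PR's data pipeline.

import json
from datasets import load_dataset

def json2token(obj):
    """Linearize a nested gt_parse dict/list into <s_key>...</s_key> tags."""
    if isinstance(obj, dict):
        return "".join(f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        return "<sep/>".join(json2token(item) for item in obj)
    return str(obj)

# Assumed dataset layout: "ground_truth" is a JSON string with a "gt_parse" key.
sample = load_dataset("naver-clova-ix/cord-v2", split="train[:1]")[0]
target = json2token(json.loads(sample["ground_truth"])["gt_parse"])
print(target[:200])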

MoE Model (30B-A3B)

Model: Qwen/Qwen3-VL-30B-A3B-Instruct
WandB Link: Qwen3VL MoE Fine-tune Run

HF Logits Matching

Command:

uv run python -m torch.distributed.run --nproc_per_node=8 \
examples/conversion/compare_hf_and_megatron/compare.py \
--model_class Qwen3VLMoeForConditionalGeneration \
--hf_model_path="Qwen/Qwen3-VL-30B-A3B-Instruct" \
--prompt="What is the capital of California" \
--ep 8

Megatron Top 5:

[('\n\n', 23.375), ('?\n\n', 21.75), ('\n', 21.375), ('?', 20.75), (' The', 20.625)]

HF Top 5:

[('?\n\n', 22.875), ('\n\n', 22.5), ('?\n', 22.375), ('?', 21.875), ('\n', 21.75)]

Cosine Similarity: 0.973547
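
The cosine similarity is a single scalar summarizing how closely the Megatron and HF logits agree. A hedged illustration of the metric is below; how compare.py actually aggregates across positions may differ, and megatron_logits / hf_logits are assumed to be same-shaped tensors.

import torch.nn.functional as F

def logits_cosine_similarity(megatron_logits, hf_logits):
    """Cosine similarity between two same-shaped logits tensors, flattened."""
    return F.cosine_similarity(
        megatron_logits.float().flatten(), hf_logits.float().flatten(), dim=0
    ).item()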

Fine-tuning on cord-v2 Dataset

Train vs validation loss curves
[image: train vs validation loss curves]

Inference on Sample of cord-v2 Dataset

Command:

uv run python -m torch.distributed.run --nproc_per_node=8 \
examples/conversion/hf_to_megatron_generate_vlm.py \
--hf_model_path="Qwen/Qwen3-VL-30B-A3B-Instruct" \
--image_path=/path/to/image \
--prompt="Describe this image." \
--ep 8 \
--megatron_model_path=/path/to/ckpt

Before Fine-tune:

<|im_start|>assistant
Of course
<|im_end|>

After Fine-tune:

<|im_start|>assistant
<s_total><s_total_price>302,016</s_total_price></s_total>
<s_sub_total>
  <s_tax_price>52,416</s_tax_price>
  <s_subtotal_price>259,000</s_subtotal_price>
  <s_service_price>9,600</s_service_price>
  <s_discount_price>19,000</s_discount_price>
</s_sub_total>
<s_menu>
  <s_price>59,000</s_price>
  <s_nm>Bintang Bremer</s_nm>
  <s_cnt>1</s_cnt>
  <sep/>
  <s_price>190,000</s_price>
  <s_nm>Chicken H-H</s_nm>
  <s_cnt>1</s_cnt>
  <sep/>
  <s_price>10,000</s_price>
  <s_nm>Ades</s_nm>
  <s_cnt>1</s_cnt>
</s_menu>

@yashaswikarnati (Author)

/ok to test c061222


def process_inputs(tokenizer, processor, image_path: Optional[str], prompt: str, is_vl_model: bool):
def pad_input_ids_to_tp_multiple(input_ids, tp_size: int, pad_token_id: int = 0):
"""Pad input_ids so sequence length is divisible by tp_size.
Contributor:

Add a note that this is required when sequence parallelism is on.

Author:

done
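
For context, here is a minimal sketch of what pad_input_ids_to_tp_multiple can look like, based only on the signature and docstring shown above (the PR's actual implementation may differ): it right-pads the sequence so its length divides evenly by the tensor-parallel size, which matters when sequence parallelism splits the sequence across TP ranks.

import torch

def pad_input_ids_to_tp_multiple(input_ids: torch.Tensor, tp_size: int, pad_token_id: int = 0):
    """Pad input_ids so sequence length is divisible by tp_size."""
    seq_len = input_ids.shape[-1]
    remainder = seq_len % tp_size
    if remainder == 0:
        return input_ids
    pad_len = tp_size - remainder
    padding = torch.full(
        (*input_ids.shape[:-1], pad_len), pad_token_id,
        dtype=input_ids.dtype, device=input_ids.device,
    )
    return torch.cat([input_ids, padding], dim=-1)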

Loading pretrained weights (recommended for finetune):
1) Import HF checkpoint to Megatron format:
$ python examples/conversion/convert_checkpoints.py import \
$ torchrun --nproc_per_node=1 examples/conversion/convert_checkpoints.py import \
Contributor:

no need to change this

Author:

The process hangs if I don't use torchrun, so I updated it.

@@ -0,0 +1,213 @@
#!/usr/bin/env python3
Contributor:

Can we merge the two fine-tune VL scripts? You can rename one of them and remove the other.

Author:

Done, there is just one common script now.

@@ -0,0 +1,181 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor:

duplicated w/ qwen3_vl bridge?

Author:

removed duplicate

)

# rebuild the transformer block
self.decoder = Qwen3VLTransformerBlock(
Contributor:

There shouldn't be a rebuild; you should just update the layer spec.

Author:

https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/models/gpt/gpt_model.py#L202

I think the layer spec override is for TransformerLayer, not for TransformerBlock. We might still need to override?

)

# rebuild the transformer block
self.decoder = Qwen3VLTransformerBlock(
Contributor:

Shouldn't need to rebuild the blocks. It's extra overhead; just update the layer spec to use Qwen3VLTransformerBlock.

Author:

same comment as above

visual_pos_masks: Optional[torch.Tensor] = None,
deepstack_visual_embeds: Optional[list[torch.Tensor]] = None,
) -> Tensor:
"""Forward function of the GPT Model This function passes the input tensors
Contributor:

Add a comment on why this needs to be overridden (deepstack_visual_embeds).

Author:

added
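
To make the override's purpose concrete: Qwen3-VL's deepstack path feeds vision features captured at several ViT depths into the hidden states of the first few decoder layers at visual-token positions, which is why the forward signature carries visual_pos_masks and deepstack_visual_embeds. The snippet below is a conceptual plain-PyTorch sketch of that merge, not the PR's actual model code.

import torch

def apply_deepstack(hidden_states, visual_pos_masks, deepstack_visual_embeds, layer_idx):
    """Add the layer_idx-th deepstack visual embedding at visual-token positions.

    Conceptual sketch: hidden_states is [batch, seq, hidden], visual_pos_masks is a
    [batch, seq] bool mask, deepstack_visual_embeds is a list of [num_visual_tokens, hidden]
    tensors (one per early decoder layer). Shapes/layout in Megatron may differ.
    """
    if deepstack_visual_embeds is None or layer_idx >= len(deepstack_visual_embeds):
        return hidden_states
    embeds = deepstack_visual_embeds[layer_idx].to(hidden_states.dtype)
    hidden_states[visual_pos_masks] = hidden_states[visual_pos_masks] + embeds
    return hidden_states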

@@ -0,0 +1,179 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor:

Can you move this file into model.py as well? A single file seems easier to understand.

Author:

I prefer shorter, self-contained files for easier maintenance rather than one long file; I can change it if you have a strong preference.

@@ -0,0 +1,154 @@
# Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved.
Contributor:

duplicated with qwen3vl_provider?

Author:

removed the duplicate

pass


def extract_expert_number_from_param(param_name: str) -> int:
Contributor:

import this from src/megatron/bridge/utils/common_utils.py

Author (Nov 5, 2025):

done
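
For readers unfamiliar with the helper being discussed, a hypothetical sketch is below. The review resolves this by importing the existing implementation from src/megatron/bridge/utils/common_utils.py instead, and the parameter-naming pattern assumed here may not match the repo exactly.

import re

def extract_expert_number_from_param(param_name: str) -> int:
    """Return the expert index embedded in an MoE parameter name.

    Assumes names like "...mlp.experts.local_experts.3.linear_fc1.weight" (assumption).
    """
    match = re.search(r"experts\.(?:local_experts\.)?(\d+)\.", param_name)
    if match is None:
        raise ValueError(f"No expert index found in parameter name: {param_name}")
    return int(match.group(1))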

@yashaswikarnati (Author)

/ok to test 3e7f918

@yashaswikarnati (Author)

@yaoyu-33 addressed all the comments, PTAL when you get a chance.

Signed-off-by: ykarnati <[email protected]>
@yashaswikarnati (Author)

/ok to test 69233a5
