[bug] nemotron_omni SFT loss mask never supervises assistant <|im_end|> terminator, so free-form fine-tunes fail to learn when to stop #4265

@ShiChenAI

Description

Problem

In the nemotron_omni SFT collate, the assistant turn terminator <|im_end|> — the eos token, id 11 — is never a training target.

This means fine-tuned models are not taught to emit the assistant turn terminator. In my runs, free-form SFT models often produce a clean <think>...</think> block, then continue generating/repeating the final answer until max_new_tokens instead of emitting <|im_end|>.

This is reproducible at the label level: across inspected samples, there are 0 supervised positions whose target label is <|im_end|>.

The asymmetry is especially clear:

  • </think> is learned reliably, because it is part of the assistant content text and therefore included in the loss mask.
  • <|im_end|> is not learned, because it is added by the chat template and then dropped by the collate/loss-mask logic.

This appears to happen for two reasons.

First, create_multiturn_loss_mask_by_search unmasks only the tokenized assistant content text. However, <|im_end|> is not part of the assistant content. It is added by the chat template:

{content}<|im_end|>\n

After the next-token shift in nemotron_omni_collate_fn:

labels[i] = input_ids[i + 1]
loss_mask[i] = loss_mask[i + 1]

the position that should predict <|im_end|> — the last assistant content token — remains masked.
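The effect of the shift can be reproduced on a toy sequence. This is a minimal sketch using plain Python lists rather than the real collate: every token id except 11 is made up, `IGNORE_INDEX = -100` is assumed (the usual PyTorch default), and `shift_for_next_token` is a hypothetical helper that only mirrors the two assignments quoted above. How the real collate fills the final position is not shown in this issue, so the sketch simply pads it.

```python
IGNORE_INDEX = -100  # assumption: the usual PyTorch ignore index
EOS_ID = 11          # <|im_end|>, as reported in this issue

# Toy ids (hypothetical): 101, 102 = prompt; 201..203 = assistant content;
# 11 = <|im_end|>; 999 = the trailing "\n" added by the chat template.
input_ids = [101, 102, 201, 202, 203, EOS_ID, 999]
# Pre-shift mask: only the assistant content tokens are unmasked.
loss_mask = [0, 0, 1, 1, 1, 0, 0]


def shift_for_next_token(input_ids, loss_mask, ignore_index=IGNORE_INDEX):
    """Mirror labels[i] = input_ids[i + 1] and loss_mask[i] = loss_mask[i + 1]."""
    targets = input_ids[1:] + [ignore_index]  # last position has no next token
    shifted_mask = loss_mask[1:] + [0]
    labels = [t if m else ignore_index for t, m in zip(targets, shifted_mask)]
    return labels, shifted_mask


labels, shifted_mask = shift_for_next_token(input_ids, loss_mask)
# Position 4 holds the last content token (203); its target would be
# <|im_end|>, but the shifted mask is 0 there, so it is ignored.
print(labels)                # [-100, 201, 202, 203, -100, -100, -100]
print(labels.count(EOS_ID))  # 0
```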

Second, <|im_end|> is also included in skipped_tokens. extract_skipped_token_ids(processor) returns:

[18, 10, 11]

where 11 is <|im_end|>, which also doubles as the pad token. The collate then applies:

labels[torch.isin(labels, skipped_tokens)] = IGNORE_INDEX

So even if the loss mask were set, the <|im_end|> label would still be forced to IGNORE_INDEX.
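The second mechanism can be shown in isolation. The sketch below is a list-based stand-in for the `torch.isin` line above, not the collate itself: `apply_skipped_tokens` is a hypothetical helper, the content ids are made up, and `IGNORE_INDEX = -100` is assumed. It starts from labels in which the loss mask has hypothetically already been fixed, so that <|im_end|> is a target.

```python
IGNORE_INDEX = -100            # assumption: the usual PyTorch ignore index
EOS_ID = 11                    # <|im_end|>
SKIPPED_TOKENS = [18, 10, 11]  # value reported for extract_skipped_token_ids


def apply_skipped_tokens(labels, skipped, ignore_index=IGNORE_INDEX):
    """List equivalent of labels[torch.isin(labels, skipped)] = IGNORE_INDEX."""
    skipped = set(skipped)
    return [ignore_index if t in skipped else t for t in labels]


# Hypothetical labels where the mask fix is already in place:
# position 4 (last content token) now targets <|im_end|>.
labels_if_mask_fixed = [-100, 201, 202, 203, EOS_ID, -100, -100]

final_labels = apply_skipped_tokens(labels_if_mask_fixed, SKIPPED_TOKENS)
# The <|im_end|> target is wiped again because 11 is in the skipped list.
print(final_labels)                # [-100, 201, 202, 203, -100, -100, -100]
print(final_labels.count(EOS_ID))  # 0
```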

Minimal repro

Using `nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16` with `nemotron_omni_collate_fn`:

1. Prepare a normal 2-turn image-to-text SFT sample.

   Assistant content format:

   
   <think>{reasoning}</think>{answer}
   

   The full chat template adds the assistant turn terminator:

   
   {content}<|im_end|>\n
   

2. Tokenize with the processor and run the sample through:

   
   nemotron_omni_collate_fn
   

3. Inspect labels after the next-token shift and skipped-token masking.

   Example diagnostic result from a real sample:

   
   eos_token = '<|im_end|>'  id = 11
   trained positions whose TARGET (label) == <|im_end|>(11):  0

   last trained pos:  input ' ]\n' → target '}'
     next pos: input '}'           → target IGNORE  loss_mask 0
     next pos: input '<|im_end|>'  → target IGNORE  loss_mask 0
   

4. Fine-tune with the base checkpoint correctly loaded, `micro_batch_size=1`, and packing off.

5. Run inference on a free-form VQA sample.
   The model repeats until `max_new_tokens` and does not emit `<|im_end|>`.
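The label inspection in step 3 can be done with a small helper along these lines. This is a sketch, not the script that produced the diagnostic above: `summarize_eos_supervision` is a hypothetical name, it works on plain lists (call `.tolist()` on a single sample's tensors first), it assumes an ignore index of -100, and the sample ids are made up apart from 11.

```python
def summarize_eos_supervision(input_ids, labels, eos_id, ignore_index=-100):
    """Count supervised eos targets and show the transitions around the
    last supervised position of one sample."""
    supervised = [i for i, t in enumerate(labels) if t != ignore_index]
    eos_targets = sum(1 for i in supervised if labels[i] == eos_id)
    tail = []
    if supervised:
        last = supervised[-1]
        # Last supervised position plus the two positions after it,
        # as (position, input id, target id) triples.
        for i in range(last, min(last + 3, len(labels))):
            tail.append((i, input_ids[i], labels[i]))
    return {"eos_targets": eos_targets, "tail": tail}


# Toy sample (hypothetical ids, 11 = <|im_end|>), labels already shifted.
input_ids = [101, 102, 201, 202, 203, 11, 999]
labels = [-100, 201, 202, 203, -100, -100, -100]

report = summarize_eos_supervision(input_ids, labels, eos_id=11)
print(report["eos_targets"])  # 0
for pos, inp, tgt in report["tail"]:
    print(pos, inp, "->", "IGNORE" if tgt == -100 else tgt)
```

On the toy sample the tail shows the same shape as the real diagnostic: the last content token and the <|im_end|> token both have an ignored target.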

Expected behavior

For each assistant turn whose terminator is present and has not been cut off by sequence truncation, there should be exactly one supervised target equal to eos_token_id.

That target should correspond to the transition:

last assistant content token → <|im_end|>

A possible invariant / unit test:

# excluding assistant turns whose <|im_end|> was truncated off the sequence
assert (labels == eos_id).sum().item() == num_non_truncated_assistant_turns

Currently, this count is 0, not num_non_truncated_assistant_turns.
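One possible direction for a fix, sketched on plain lists so the invariant can be checked on a toy 2-turn sample. This is not the repository's code or a proposed patch: `extend_mask_to_terminator` and `build_labels` are hypothetical helpers, all ids except 11 are made up, and an ignore index of -100 is assumed. The idea is to unmask an <|im_end|> that directly follows unmasked assistant content, and to exempt eos from the skipped-token filter so that the loss mask alone decides whether it is supervised. Padding (also id 11) stays ignored because its mask is 0. One known limitation: a pad token that directly follows truncated content would be mistaken for a terminator, so a real fix would likely need the template span or attention mask.

```python
IGNORE_INDEX = -100  # assumption: the usual PyTorch ignore index
EOS_ID = 11          # <|im_end|>, also the pad token
SKIPPED_TOKENS = [18, 10, 11]


def extend_mask_to_terminator(input_ids, loss_mask, eos_id):
    """Also unmask an eos token that directly follows unmasked content."""
    out = list(loss_mask)
    for i in range(1, len(input_ids)):
        if input_ids[i] == eos_id and loss_mask[i - 1] == 1 and loss_mask[i] == 0:
            out[i] = 1
    return out


def build_labels(input_ids, loss_mask, skipped, eos_id, ignore_index=IGNORE_INDEX):
    """Next-token shift, then skipped-token filtering that leaves eos alone."""
    targets = input_ids[1:] + [ignore_index]
    shifted_mask = loss_mask[1:] + [0]
    drop = set(skipped) - {eos_id}  # eos is governed by the loss mask only
    return [t if m and t not in drop else ignore_index
            for t, m in zip(targets, shifted_mask)]


# Two assistant turns, then two pad tokens (id 11) at the end.
input_ids = [101, 201, 202, 11, 999, 102, 203, 204, 11, 999, 11, 11]
loss_mask = [0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]

fixed_mask = extend_mask_to_terminator(input_ids, loss_mask, EOS_ID)
labels = build_labels(input_ids, fixed_mask, SKIPPED_TOKENS, EOS_ID)

num_non_truncated_assistant_turns = 2
assert labels.count(EOS_ID) == num_non_truncated_assistant_turns
```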

The model should be able to learn both:

<think>...</think>

and the assistant turn boundary:

... answer <|im_end|>

Affected area

area:training

Regression?

Not sure

Environment

Repo: NVIDIA-NeMo/Megatron-Bridge
Branch: nemotron_3_omni
Commit: 648756c

Model:
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Collate:
src/megatron/bridge/data/vlm_datasets/collate.py
nemotron_omni_collate_fn
create_multiturn_loss_mask_by_search

Training setup:
single node, 8× B300
base checkpoint correctly loaded
micro_batch_size = 1
packing off
2-turn image→text SFT data
assistant content = <think>{reasoning}</think>{answer}
answer = short free-form text

Logs
