Problem
In the nemotron_omni SFT collate, the assistant turn terminator <|im_end|> — the eos token, id 11 — is never a training target.
This means fine-tuned models are not taught to emit the assistant turn terminator. In my runs, free-form SFT models often produce a clean <think>...</think> block, then continue generating/repeating the final answer until max_new_tokens instead of emitting <|im_end|>.
This is reproducible at the label level: across inspected samples, there are 0 supervised positions whose target label is <|im_end|>.
The asymmetry is especially clear:
</think> is learned reliably, because it is part of the assistant content text and therefore included in the loss mask.
<|im_end|> is not learned, because it is added by the chat template and then dropped by the collate/loss-mask logic.
This appears to happen for two reasons.
First, create_multiturn_loss_mask_by_search unmasks only the tokenized assistant content text. However, <|im_end|> is not part of the assistant content. It is added by the chat template:
{content}<|im_end|>\n
After the next-token shift in nemotron_omni_collate_fn:
labels[i] = input_ids[i + 1]
loss_mask[i] = loss_mask[i + 1]
the position that should predict <|im_end|> — the last assistant content token — remains masked.
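The effect of the shift can be seen on a toy sequence. This is a minimal plain-Python sketch, not the actual collate code: the token ids other than 11 are made up, the mask is what I assume a content-only search produces, and I assume masked positions end up as IGNORE_INDEX in the labels.

```python
IGNORE_INDEX = -100
EOS_ID = 11  # <|im_end|>, per the report

# Toy sequence (ids other than 11 are hypothetical):
# [user, user, content, content, content, <|im_end|>, newline]
input_ids = [101, 102, 201, 202, 203, EOS_ID, 5]
# Content-only mask: 1 on the assistant content tokens, 0 on the terminator.
loss_mask = [0, 0, 1, 1, 1, 0, 0]


def shift_for_next_token(input_ids, loss_mask):
    """Apply labels[i] = input_ids[i + 1] and loss_mask[i] = loss_mask[i + 1]."""
    labels = input_ids[1:] + [IGNORE_INDEX]
    shifted_mask = loss_mask[1:] + [0]
    # Assumption: positions with mask 0 are not supervised.
    labels = [lab if m else IGNORE_INDEX for lab, m in zip(labels, shifted_mask)]
    return labels, shifted_mask


labels, shifted_mask = shift_for_next_token(input_ids, loss_mask)
# Position 4 holds the last content token (203). Its target would be
# <|im_end|>, but the shifted mask is 0 there, so the label is ignored.
print(labels)
```

In this toy case the last content token is the only position that could ever predict the terminator, and it is exactly the one that drops out.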
Second, <|im_end|> is also included in skipped_tokens. The list returned by extract_skipped_token_ids(processor) contains 11, which is <|im_end|> and also doubles as the pad token. The collate then applies:
labels[torch.isin(labels, skipped_tokens)] = IGNORE_INDEX
So even if the loss mask were set, the <|im_end|> label would still be forced to IGNORE_INDEX.
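Because id 11 is both the terminator and the pad token, simply removing it from skipped_tokens would probably start supervising padding. One possible direction, sketched here in plain Python under my own assumptions, is to keep the skipped-token masking but exempt an eos label at positions where the loss mask is already set. The function names and the id 999 are hypothetical and not part of the repo.

```python
IGNORE_INDEX = -100
EOS_ID = 11                  # <|im_end|>, also the pad token per the report
SKIPPED_IDS = {EOS_ID, 999}  # 999 is a hypothetical placeholder id


def mask_skipped(labels, skipped_ids):
    """Current behavior as described: every skipped id becomes IGNORE_INDEX."""
    return [IGNORE_INDEX if lab in skipped_ids else lab for lab in labels]


def mask_skipped_keep_terminator(labels, loss_mask, skipped_ids, eos_id):
    """Hypothetical variant: keep an eos label only where the mask is 1."""
    out = []
    for lab, m in zip(labels, loss_mask):
        is_supervised_eos = lab == eos_id and m == 1
        if lab in skipped_ids and not is_supervised_eos:
            out.append(IGNORE_INDEX)
        else:
            out.append(lab)
    return out


# Toy labels after the shift, assuming the mask already covers the terminator.
# The two trailing 11s stand for padding targets (mask 0).
labels = [IGNORE_INDEX, 201, 202, 203, EOS_ID, EOS_ID, EOS_ID]
loss_mask = [0, 1, 1, 1, 1, 0, 0]

current = mask_skipped(labels, SKIPPED_IDS)
proposed = mask_skipped_keep_terminator(labels, loss_mask, SKIPPED_IDS, EOS_ID)
print(current)
print(proposed)
```

With the variant, the terminator target survives while the padding targets stay ignored. It only helps if the loss mask itself covers the terminator position.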
Minimal repro
Using `nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16` with `nemotron_omni_collate_fn`:
1. Prepare a normal 2-turn image-to-text SFT sample.
Assistant content format:
<think>{reasoning}</think>{answer}
The full chat template adds the assistant turn terminator:
{content}<|im_end|>\n
2. Tokenize with the processor and run the sample through:
nemotron_omni_collate_fn
3. Inspect labels after the next-token shift and skipped-token masking.
Example diagnostic result from a real sample:
eos_token = '<|im_end|>' id = 11
trained positions whose TARGET (label) == <|im_end|>(11): 0
last trained pos: input ' ]\n' → target '}'
next pos: input '}' → target IGNORE loss_mask 0
next pos: input '<|im_end|>' → target IGNORE loss_mask 0
4. Fine-tune with the base checkpoint correctly loaded, `micro_batch_size=1`, and packing off.
5. Run inference on a free-form VQA sample.
The model repeats until `max_new_tokens` and does not emit `<|im_end|>`.
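The label-level check from step 3 can be scripted roughly as follows. This is a plain-Python sketch: `summarize_eos_supervision` is a hypothetical helper, not part of the repo, and it expects one sample's `input_ids` and `labels` as Python lists (for tensors, something like `batch["labels"][0].tolist()`, assuming the batch exposes those keys).

```python
def summarize_eos_supervision(input_ids, labels, eos_id, ignore_index=-100):
    """Report how many supervised positions have eos as their target."""
    supervised = [i for i, lab in enumerate(labels) if lab != ignore_index]
    eos_targets = [i for i in supervised if labels[i] == eos_id]
    return {
        "num_supervised": len(supervised),
        "num_eos_targets": len(eos_targets),
        "last_supervised_pos": supervised[-1] if supervised else None,
        "eos_input_positions": [i for i, t in enumerate(input_ids) if t == eos_id],
    }


# Toy sample mirroring the diagnostic above (ids other than 11 are made up).
input_ids = [101, 102, 201, 202, 203, 11, 5]
labels = [-100, 201, 202, 203, -100, -100, -100]

summary = summarize_eos_supervision(input_ids, labels, eos_id=11)
print(summary)
```

A sample that shows the problem has `<|im_end|>` among its inputs but `num_eos_targets == 0`.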
Expected behavior
For each assistant turn whose terminator is present in the sequence (that is, not cut off by truncation), there should be exactly one supervised target equal to eos_token_id.
That target should correspond to the transition:
last assistant content token → <|im_end|>
A possible invariant / unit test:
# excluding assistant turns whose <|im_end|> was truncated off the sequence
assert (labels == eos_id).sum().item() == num_non_truncated_assistant_turns
Currently, this count is 0, not num_non_truncated_assistant_turns.
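To make the invariant concrete, here is a plain-Python sketch of one way the loss mask could be extended so that it holds. `extend_mask_over_terminator` is a hypothetical helper, the token ids other than 11 are made up, and the skipped-token masking described earlier would still have to leave these labels intact.

```python
IGNORE_INDEX = -100
EOS_ID = 11  # <|im_end|>, also the pad token per the report


def extend_mask_over_terminator(input_ids, loss_mask, eos_id):
    """Also unmask an eos token that directly follows unmasked content."""
    extended = list(loss_mask)
    for i in range(1, len(input_ids)):
        # Read the ORIGINAL mask so the extension cannot cascade into padding.
        if input_ids[i] == eos_id and loss_mask[i - 1] == 1:
            extended[i] = 1
    return extended


def shifted_labels(input_ids, loss_mask):
    """labels[i] = input_ids[i + 1], kept only where loss_mask[i + 1] == 1."""
    labels = input_ids[1:] + [IGNORE_INDEX]
    mask = loss_mask[1:] + [0]
    return [lab if m else IGNORE_INDEX for lab, m in zip(labels, mask)]


# Two assistant turns, each ending in <|im_end|> + newline (5), then two pads.
input_ids = [101, 201, 202, EOS_ID, 5, 102, 301, EOS_ID, 5, EOS_ID, EOS_ID]
content_mask = [0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0]
num_assistant_turns = 2

before = shifted_labels(input_ids, content_mask).count(EOS_ID)
extended = extend_mask_over_terminator(input_ids, content_mask, EOS_ID)
after = shifted_labels(input_ids, extended).count(EOS_ID)
print(before, after)
```

On this toy input the count goes from 0 to one per assistant turn, and the trailing pad tokens remain unsupervised.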
The model should be able to learn both the end of the reasoning block (</think>) and the assistant turn boundary (<|im_end|>).
Affected area
area:training
Regression?
Not sure
Environment
Repo: NVIDIA-NeMo/Megatron-Bridge
Branch: nemotron_3_omni
Commit: 648756c
Model:
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
Collate:
src/megatron/bridge/data/vlm_datasets/collate.py
nemotron_omni_collate_fn
create_multiturn_loss_mask_by_search
Training setup:
single node, 8× B300
base checkpoint correctly loaded
micro_batch_size = 1
packing off
2-turn image→text SFT data
assistant content = <think>{reasoning}</think>{answer}
answer = short free-form text
Logs