[bug] `nemotron_omni` SFT loss mask never supervises assistant `<|im_end|>` terminator, so free-form fine-tunes fail to learn when to stop #4265

### Problem

In the `nemotron_omni` SFT collate, the assistant turn terminator `<|im_end|>` — the eos token, id `11` — is never a training target.

This means fine-tuned models are not taught to emit the assistant turn terminator. In my runs, free-form SFT models often produce a clean `<think>...</think>` block, then continue generating/repeating the final answer until `max_new_tokens` instead of emitting `<|im_end|>`.

This is reproducible at the label level: across inspected samples, there are **0 supervised positions whose target label is `<|im_end|>`**.

The asymmetry is especially clear:

- `</think>` is learned reliably, because it is part of the assistant content text and therefore included in the loss mask.
- `<|im_end|>` is not learned, because it is added by the chat template and then dropped by the collate/loss-mask logic.

This appears to happen for two reasons.

First, `create_multiturn_loss_mask_by_search` unmasks only the tokenized assistant content text. However, `<|im_end|>` is not part of the assistant content. It is added by the chat template:

```text
{content}<|im_end|>\n
```

After the next-token shift in `nemotron_omni_collate_fn`:

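```python
# Next-token shift, shown per position i: position i is trained to predict token i + 1,
# and it inherits the loss mask of that next token.
labels[i] = input_ids[i + 1]
loss_mask[i] = loss_mask[i + 1]
```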

the position that should predict `<|im_end|>` — the last assistant content token — remains masked.
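The effect of the shift can be reproduced on a toy sequence. The sketch below uses made-up token ids (only `11` for `<|im_end|>` comes from this report) and plain Python lists instead of tensors. `shift_for_next_token` is a hypothetical stand-in for the shift in `nemotron_omni_collate_fn`, not its actual code, and `-100` is an assumed ignore value.

```python
IGNORE_INDEX = -100  # assumed ignore value (the usual PyTorch convention)
EOS_ID = 11          # <|im_end|>, per this report


def shift_for_next_token(input_ids, loss_mask):
    """Hypothetical stand-in for the shift: position i predicts token i + 1."""
    labels = input_ids[1:] + [IGNORE_INDEX]
    shifted_mask = loss_mask[1:] + [0]
    return labels, shifted_mask


# Toy ids: 1, 2 = prompt; 50, 51, 52 = assistant content; 11 = <|im_end|>; 9 = "\n"
input_ids = [1, 2, 50, 51, 52, EOS_ID, 9]
# Content-only mask: only the assistant content tokens are unmasked.
loss_mask = [0, 0, 1, 1, 1, 0, 0]

labels, shifted_mask = shift_for_next_token(input_ids, loss_mask)
supervised_targets = [t for t, m in zip(labels, shifted_mask) if m == 1]

print(supervised_targets)            # [50, 51, 52]
print(EOS_ID in supervised_targets)  # False
```

Position 4 (input `52`, the last content token) has label `11` but a shifted mask of `0`, which is the transition that goes unsupervised.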

Second, `<|im_end|>` is also included in `skipped_tokens`. `extract_skipped_token_ids(processor)` returns:

```python
[18, 10, 11]
```

where `11` is `<|im_end|>`. It also doubles as the pad token. Then the collate applies:

```python
labels[torch.isin(labels, skipped_tokens)] = IGNORE_INDEX
```

So even if the loss mask were set, the `<|im_end|>` label would still be forced to `IGNORE_INDEX`.
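To illustrate why fixing the loss mask alone would not be enough, the sketch below applies the skipped-token filter to toy labels. The ids `[18, 10, 11]` are those reported above; `apply_skipped_tokens` is a hypothetical list-based equivalent of the `torch.isin` line, the content ids are made up, and `-100` is an assumed ignore value.

```python
IGNORE_INDEX = -100            # assumed ignore value
SKIPPED_TOKENS = {18, 10, 11}  # ids reported for extract_skipped_token_ids(processor)
EOS_ID = 11                    # <|im_end|>


def apply_skipped_tokens(labels, skipped):
    """List-based equivalent of labels[torch.isin(labels, skipped)] = IGNORE_INDEX."""
    return [IGNORE_INDEX if t in skipped else t for t in labels]


# Labels after the shift, with the loss mask hypothetically extended to cover EOS.
labels = [50, 51, 52, EOS_ID]

masked = apply_skipped_tokens(labels, SKIPPED_TOKENS)
print(masked)  # [50, 51, 52, -100]

# Exempting EOS from the filter keeps the target. Padding would then have to be
# excluded by the loss mask instead, because the pad token shares id 11.
kept = apply_skipped_tokens(labels, SKIPPED_TOKENS - {EOS_ID})
print(kept)    # [50, 51, 52, 11]
```

Under these assumptions, supervising the terminator requires both unmasking the last content position and keeping id `11` out of the skipped-token filter for that position.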

### Minimal repro

Using `nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16` with `nemotron_omni_collate_fn`:

1. Prepare a normal 2-turn image-to-text SFT sample.

   Assistant content format:

   ```text
   <think>{reasoning}</think>{answer}
   ```

   The full chat template adds the assistant turn terminator:

   ```text
   {content}<|im_end|>\n
   ```

2. Tokenize with the processor and run the sample through `nemotron_omni_collate_fn`.

3. Inspect labels after the next-token shift and skipped-token masking.

   Example diagnostic result from a real sample:

   ```text
   eos_token = '<|im_end|>'  id = 11
   trained positions whose TARGET (label) == <|im_end|>(11):  0

   last trained pos:  input ' ]\n' → target '}'
     next pos: input '}'           → target IGNORE  loss_mask 0
     next pos: input '<|im_end|>'  → target IGNORE  loss_mask 0
   ```

4. Fine-tune with the base checkpoint correctly loaded, `micro_batch_size=1`, and packing off.

5. Run inference on a free-form VQA sample.
   The model repeats until `max_new_tokens` and does not emit `<|im_end|>`.
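The step 3 inspection can be scripted along these lines. This is a sketch, not the script that produced the output above: `describe_tail` and its `decode` callable are hypothetical, it takes already-shifted `labels` and `loss_mask` as plain Python lists, the toy ids are made up apart from `11`, and `-100` is assumed as the ignore value.

```python
IGNORE_INDEX = -100  # assumed ignore value


def describe_tail(input_ids, labels, loss_mask, eos_id, decode=str, lookahead=2):
    """Count supervised EOS targets and describe what follows the last trained position."""
    trained = [i for i, (t, m) in enumerate(zip(labels, loss_mask))
               if m == 1 and t != IGNORE_INDEX]
    eos_targets = sum(1 for i in trained if labels[i] == eos_id)
    lines = [f"trained positions whose TARGET == eos({eos_id}): {eos_targets}"]
    if trained:
        last = trained[-1]
        for i in range(last, min(last + 1 + lookahead, len(labels))):
            target = "IGNORE" if labels[i] == IGNORE_INDEX else decode(labels[i])
            lines.append(
                f"pos {i}: input {decode(input_ids[i])!r} -> target {target!r} "
                f"loss_mask {loss_mask[i]}"
            )
    return eos_targets, lines


# Toy sample mirroring the reported pattern: content ids 50, 51, 52, then EOS (11).
input_ids = [1, 2, 50, 51, 52, 11, 9]
labels = [-100, 50, 51, 52, -100, -100, -100]
loss_mask = [0, 1, 1, 1, 0, 0, 0]

count, report = describe_tail(input_ids, labels, loss_mask, eos_id=11)
print("\n".join(report))
```

In a real run, `decode` could wrap the tokenizer's decode method so the report shows token text rather than ids.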

### Expected behavior

For each assistant turn whose terminator is present and not cut off by sequence truncation, there should be exactly one supervised target equal to `eos_token_id`.

That target should correspond to the transition:

```text
last assistant content token → <|im_end|>
```

A possible invariant / unit test:

```python
# excluding assistant turns whose <|im_end|> was truncated off the sequence
assert (labels == eos_id).sum().item() == num_non_truncated_assistant_turns
```

Currently, this count is `0`, not `num_non_truncated_assistant_turns`.
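A runnable version of this check might look like the following. It is a sketch on plain Python lists rather than tensors: `count_supervised_eos` and `check_eos_invariant` are hypothetical helpers, the toy ids other than `11` are made up, and the number of non-truncated assistant turns is passed in by the caller because how it is computed depends on the dataset.

```python
def count_supervised_eos(labels, loss_mask, eos_id):
    """Count positions that are both unmasked and labelled with eos_id."""
    return sum(1 for t, m in zip(labels, loss_mask) if m == 1 and t == eos_id)


def check_eos_invariant(labels, loss_mask, eos_id, num_non_truncated_assistant_turns):
    """True when each non-truncated assistant turn has one supervised EOS target."""
    return count_supervised_eos(labels, loss_mask, eos_id) == num_non_truncated_assistant_turns


EOS_ID = 11  # <|im_end|>, per this report

# Current behaviour on a toy one-turn sample: the EOS label is ignored.
current_labels = [-100, 50, 51, 52, -100, -100, -100]
current_mask = [0, 1, 1, 1, 0, 0, 0]
print(check_eos_invariant(current_labels, current_mask, EOS_ID, 1))    # False

# Expected behaviour: the last content position is supervised to predict EOS.
expected_labels = [-100, 50, 51, 52, EOS_ID, -100, -100]
expected_mask = [0, 1, 1, 1, 1, 0, 0]
print(check_eos_invariant(expected_labels, expected_mask, EOS_ID, 1))  # True
```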

The model should be able to learn both:

```text
<think>...</think>
```

and the assistant turn boundary:

```text
... answer <|im_end|>
```

### Affected area

area:training

### Regression?

Not sure

### Environment

```text
Repo: NVIDIA-NeMo/Megatron-Bridge
Branch: nemotron_3_omni
Commit: 648756c

Model:
nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16

Collate:
src/megatron/bridge/data/vlm_datasets/collate.py
nemotron_omni_collate_fn
create_multiturn_loss_mask_by_search

Training setup:
single node, 8× B300
base checkpoint correctly loaded
micro_batch_size = 1
packing off
2-turn image→text SFT data
assistant content = <think>{reasoning}</think>{answer}
answer = short free-form text
```

### Logs

```shell

```
