
Commit 9d4262c

ZhentingWang and Zhenting Wang authored
Swap K2V3 TITO tokenizer to IFM template; rename legacy to k2v3_oldbackup (#43)
* Swap TITO tokenizer's K2V3 to IFM template; rename legacy to k2v3_oldbackup

The K2V3 family is migrating to the IFM-style chat template introduced in bbq-0601 (used by bbq-8b-mid3_v3 and later checkpoints). The new template namespaces ChatML tokens as <|ifm|im_start|> / <|ifm|im_end|>, emits no whitespace between messages, and requires assistant messages to carry a thinking field. The legacy <|im_end|>\n template stays supported for older K2V3 checkpoints (bbq-8b-mid3-final and earlier) that haven't migrated yet.

Changes:
- K2V3TITOTokenizer now targets the IFM template. merge_tokens is pure concat: the buffer already matches the canonical render (model stops at <|ifm|im_end|> and no trailing whitespace follows in the template).
- Renamed the legacy K2V3TITOTokenizer to K2V3OldBackupTITOTokenizer. Its <|im_end|> + \n boundary-fix logic is preserved bit-for-bit.
- Added TITOTokenizerType.K2V3_OLDBACKUP enum value and registry entry. TITOTokenizerType.K2V3 now points at the new IFM class.
- Both classes hard-assert at __init__ that the loaded tokenizer's vocab matches their target template (refuses to load on a misconfigured checkpoint, with an error pointing at the right --tito-model value).
- test_tito_k2v3.py rewritten for IFM invariants (no boundary fix, BOS prepend, thinking required, hard-assert sanity).
- Renamed previous test file to test_tito_k2v3_oldbackup.py with K2V3OldBackup references.

Breaking change for downstream sbatch: --tito-model k2v3 now refers to the IFM template. Legacy checkpoint users must update to --tito-model k2v3_oldbackup. Misconfiguration raises at init rather than silently producing wrong TITO buffers.

Out of scope (required separately for IFM training):
- IFM-compatible SGLang reasoning_parser + tool_parser (see LLM360/sglang#33).

Verification:
- tests/fast/.../test_tito_k2v3.py: 43 passed, 12 skipped (skipped = SGLang IFM parsers not yet in this container build).
- tests/fast/.../test_tito_k2v3_oldbackup.py: 54 passed (legacy behavior unchanged).

* Use raw-string docstrings to display \n literally in tito K2V3 classes/tests

Docstrings on K2V3TITOTokenizer / K2V3OldBackupTITOTokenizer and the two K2V3 test files contain visual references to the literal `\n` escape sequence (the chat-template trailing newline). The previous \\n escaping renders correctly but reads awkwardly in source. Convert the affected docstrings to raw strings (r"""...""") so the source literally contains \n, which is easier to read and write. No code or test behavior changes.

Tested: 109 passed (55 IFM + 54 oldbackup) inside the agentic-rl container with sglang PR #33 shadowed for the parser tests.

---------

Co-authored-by: Zhenting Wang <zhenting.wang@mbzuai.ac.ae>
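The difference between the two merge behaviors described in the commit message can be illustrated with a minimal sketch. It is not the repository's `merge_tokens` implementation (whose signature is not shown in this diff); the helper names `merge_legacy` and `merge_ifm` and the token ids are hypothetical and chosen only for illustration.

```python
# Toy token ids, chosen only for illustration (not real vocab ids).
IM_END_ID = 100   # stands in for <|im_end|> or <|ifm|im_end|>
NEWLINE_ID = 10   # stands in for the single "\n" token


def merge_legacy(prev_ids: list[int], new_ids: list[int]) -> list[int]:
    """Legacy template sketch: the template renders <|im_end|>\\n, but the
    model stops at <|im_end|>, so the missing newline is re-inserted."""
    if prev_ids and prev_ids[-1] == IM_END_ID:
        return prev_ids + [NEWLINE_ID] + new_ids
    return prev_ids + new_ids


def merge_ifm(prev_ids: list[int], new_ids: list[int]) -> list[int]:
    """IFM template sketch: no whitespace follows the end token in the
    template, so plain concatenation already matches the render."""
    return prev_ids + new_ids


rollout = [1, 2, IM_END_ID]   # buffer ends where the model stopped
next_turn = [3, 4]            # tokens of the next rendered message

legacy_merged = merge_legacy(rollout, next_turn)
ifm_merged = merge_ifm(rollout, next_turn)
```

Under these assumptions the legacy merge yields one extra newline token at the message boundary, while the IFM merge leaves the buffer untouched.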
1 parent 2b9b705 commit 9d4262c

3 files changed

Lines changed: 1440 additions & 105 deletions


miles/utils/chat_template_utils/tito_tokenizer.py

Lines changed: 89 additions & 10 deletions
@@ -340,25 +340,90 @@ def merge_tokens(
 
 
 # ---------------------------------------------------------------------------
-# K2V3 family implementation
+# K2V3 family — current (IFM) chat template
 # ---------------------------------------------------------------------------
 
 
 class K2V3TITOTokenizer(TITOTokenizer):
-    """K2V3 family.
+    r"""K2V3 family with the IFM-style chat template (introduced 2026-06-01).
 
-    The chat template emits ``<|im_end|>\\n`` after every message (jinja
-    block whitespace between ``{{- '<|im_end|>' }}`` and the next block
-    is preserved by default ``trim_blocks``), but the model
-    autoregressively stops at ``<|im_end|>`` without generating the
-    trailing ``\\n``. ``merge_tokens`` inserts the missing newline so the
-    pretokenized buffer matches the canonical template output.
+    The current K2V3 chat template (``bbq-0601`` / ``bbq-8b-mid3_v3`` and
+    later) namespaces ChatML tokens as ``<|ifm|im_start|>`` /
+    ``<|ifm|im_end|>`` and emits NO whitespace between
+    ``<|ifm|im_end|>`` and the next ``<|ifm|im_start|>``. The model
+    autoregressively stops at ``<|ifm|im_end|>`` with no trailing byte;
+    the rollout buffer already matches the canonical template render
+    exactly. ``merge_tokens`` therefore needs no boundary fix — it
+    inherits the base ``TITOTokenizer`` concat behavior.
+
+    Empirical sanity check::
+
+        apply_chat_template([user, asst, user], tokenize=False)
+        → '...A1<|ifm|im_end|><|ifm|im_start|>user\n...'
+                             ^^ no \n between messages
+
+    For legacy K2V3 checkpoints (``bbq-8b-mid3-final`` and earlier) whose
+    chat template uses ``<|im_end|>\n`` between messages, use
+    :class:`K2V3OldBackupTITOTokenizer` (``--tito-model k2v3_oldbackup``)
+    instead.
+    """
+
+    _default_assistant_start_str: str = "<|ifm|im_start|>assistant"
+
+    def __init__(
+        self,
+        tokenizer: Any,
+        chat_template_kwargs: dict[str, Any] | None = None,
+        assistant_start_str: str | None = None,
+        allowed_append_roles: list[str] | None = None,
+    ):
+        super().__init__(
+            tokenizer,
+            chat_template_kwargs,
+            assistant_start_str or self._default_assistant_start_str,
+            allowed_append_roles=allowed_append_roles,
+        )
+        # Hard assert against misconfiguration: refuse to load on a legacy
+        # K2V3 checkpoint whose vocab does not have <|ifm|im_end|>.
+        ifm_end_id = tokenizer.convert_tokens_to_ids("<|ifm|im_end|>")
+        unk_id = getattr(tokenizer, "unk_token_id", None)
+        if ifm_end_id is None or ifm_end_id == unk_id:
+            raise ValueError(
+                "K2V3TITOTokenizer (current/IFM chat template) requires "
+                "<|ifm|im_end|> in the tokenizer vocab. The loaded "
+                "tokenizer does not have this token, suggesting you are "
+                "on a legacy K2V3 checkpoint. Use --tito-model "
+                "k2v3_oldbackup for those."
+            )
+        self._im_end_id: int = ifm_end_id
+        self.trailing_token_ids = frozenset({ifm_end_id})
+
+
+# ---------------------------------------------------------------------------
+# K2V3 family — legacy (<|im_end|>\n) chat template
+# ---------------------------------------------------------------------------
+
+
+class K2V3OldBackupTITOTokenizer(TITOTokenizer):
+    r"""K2V3 family with the LEGACY chat template (``<|im_end|>\n``).
+
+    Use this with legacy K2V3 checkpoints (``bbq-8b-mid3-final`` and
+    earlier) whose chat template emits ``<|im_end|>\n`` after every
+    message (jinja block whitespace between ``{{- '<|im_end|>' }}`` and
+    the next block is preserved by default ``trim_blocks``), but where
+    the model autoregressively stops at ``<|im_end|>`` without producing
+    the trailing ``\n``. ``merge_tokens`` inserts the missing newline so
+    the pretokenized buffer matches the canonical template output.
 
     Empirical sanity check::
 
         apply_chat_template([user, assistant, user], tokenize=False)
-        → '...hello<|im_end|>\\n<|im_start|>user\\n...'
+        → '...hello<|im_end|>\n<|im_start|>user\n...'
                              ^^
+
+    For current K2V3 checkpoints (``bbq-8b-mid3_v3`` and later) whose
+    template uses ``<|ifm|im_end|>`` with no trailing ``\n``, use
+    :class:`K2V3TITOTokenizer` (``--tito-model k2v3``) instead.
     """
 
     _default_assistant_start_str: str = "<|im_start|>assistant"
@@ -376,10 +441,22 @@ def __init__(
             assistant_start_str or self._default_assistant_start_str,
             allowed_append_roles=allowed_append_roles,
         )
+        # Hard assert against misconfiguration: refuse to load on a current
+        # K2V3 checkpoint whose vocab does not have <|im_end|>.
+        im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
+        unk_id = getattr(tokenizer, "unk_token_id", None)
+        if im_end_id is None or im_end_id == unk_id:
+            raise ValueError(
+                "K2V3OldBackupTITOTokenizer (legacy chat template) "
+                "requires <|im_end|> in the tokenizer vocab. The loaded "
+                "tokenizer does not have this token, suggesting you are "
+                "on a current K2V3 checkpoint that uses the IFM template. "
+                "Use --tito-model k2v3 for those."
+            )
         nl_ids = tokenizer.encode("\n", add_special_tokens=False)
         assert len(nl_ids) == 1, f"Expected single newline token, got {nl_ids}"
         self._newline_id: int = nl_ids[0]
-        self._im_end_id: int = tokenizer.convert_tokens_to_ids("<|im_end|>")
+        self._im_end_id: int = im_end_id
         self.trailing_token_ids = frozenset({self._newline_id})
 
     def merge_tokens(
@@ -406,13 +483,15 @@ class TITOTokenizerType(str, Enum):
     QWEN3 = "qwen3"
     GLM47 = "glm47"
     K2V3 = "k2v3"
+    K2V3_OLDBACKUP = "k2v3_oldbackup"
 
 
 _TOKENIZER_REGISTRY: dict[TITOTokenizerType, type[TITOTokenizer]] = {
     TITOTokenizerType.DEFAULT: TITOTokenizer,
     TITOTokenizerType.QWEN3: Qwen3TITOTokenizer,
     TITOTokenizerType.GLM47: GLM47TITOTokenizer,
     TITOTokenizerType.K2V3: K2V3TITOTokenizer,
+    TITOTokenizerType.K2V3_OLDBACKUP: K2V3OldBackupTITOTokenizer,
 }
 
 
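The vocab check that both K2V3 classes perform in `__init__` can be exercised in isolation with a stub. The sketch below mirrors the condition used in the diff (`token_id is None or token_id == unk_id`); `StubTokenizer` and `require_token` are hypothetical names introduced for illustration and are not part of the repository. The stub is assumed to return the unknown-token id for tokens missing from its vocab.

```python
from typing import Any


class StubTokenizer:
    """Hypothetical stand-in for a tokenizer: only the two members the
    check touches are provided."""

    unk_token_id = 0

    def __init__(self, vocab: dict[str, int]):
        self._vocab = vocab

    def convert_tokens_to_ids(self, token: str) -> int:
        # Assumption: unknown tokens map to the unk id.
        return self._vocab.get(token, self.unk_token_id)


def require_token(tokenizer: Any, token: str, hint: str) -> int:
    """Return the id of `token`, or raise if the vocab lacks it."""
    token_id = tokenizer.convert_tokens_to_ids(token)
    unk_id = getattr(tokenizer, "unk_token_id", None)
    if token_id is None or token_id == unk_id:
        raise ValueError(f"{token} missing from tokenizer vocab. {hint}")
    return token_id


ifm_tok = StubTokenizer({"<|ifm|im_end|>": 7})    # current-style vocab
legacy_tok = StubTokenizer({"<|im_end|>": 5})     # legacy-style vocab

hint = "Use --tito-model k2v3_oldbackup for legacy checkpoints."
ifm_end_id = require_token(ifm_tok, "<|ifm|im_end|>", hint)

try:
    require_token(legacy_tok, "<|ifm|im_end|>", hint)
    rejected = False
except ValueError:
    rejected = True
```

Failing at construction time, as the commit does, surfaces a wrong `--tito-model` value before any rollout buffers are built rather than after they have silently diverged from the template render.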
