Commit 9d4262c
Swap K2V3 TITO tokenizer to IFM template; rename legacy to k2v3_oldbackup (#43)
* Swap TITO tokenizer's K2V3 to IFM template; rename legacy to k2v3_oldbackup
The K2V3 family is migrating to the IFM-style chat template introduced
in bbq-0601 (used by bbq-8b-mid3_v3 and later checkpoints). The new
template namespaces ChatML tokens as <|ifm|im_start|> / <|ifm|im_end|>,
emits no whitespace between messages, and requires assistant messages
to carry a thinking field. The legacy <|im_end|>\n template stays
supported for older K2V3 checkpoints (bbq-8b-mid3-final and earlier)
that haven't migrated yet.
Changes:
- K2V3TITOTokenizer now targets the IFM template. merge_tokens is a
pure concatenation: the buffer already matches the canonical render,
because the model stops at <|ifm|im_end|> and the template emits no
trailing whitespace after it.
- Renamed the legacy K2V3TITOTokenizer to K2V3OldBackupTITOTokenizer.
Its <|im_end|> + \n boundary-fix logic is preserved bit-for-bit.
- Added TITOTokenizerType.K2V3_OLDBACKUP enum value and registry
entry. TITOTokenizerType.K2V3 now points at the new IFM class.
- Both classes hard-assert at __init__ that the loaded tokenizer's
vocab matches their target template. On a misconfigured checkpoint
they refuse to load and raise an error pointing at the right
--tito-model value.
- test_tito_k2v3.py rewritten for IFM invariants (no boundary fix,
BOS prepend, thinking required, hard-assert sanity).
- Renamed previous test file to test_tito_k2v3_oldbackup.py with
K2V3OldBackup references.
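
The difference between the two merge_tokens behaviors described above can be sketched as follows. This is a minimal illustration, not the code in this commit: the token ids, the function names, and the exact shape of the legacy boundary fix (appending the template's trailing newline after <|im_end|>) are assumptions made for the example.

```python
from typing import List

# Hypothetical token ids, chosen only for illustration.
IM_END_ID = 2    # stands in for the legacy <|im_end|> token
NEWLINE_ID = 3   # stands in for the "\n" the legacy template emits after it


def merge_tokens_ifm(buffer: List[int], new_tokens: List[int]) -> List[int]:
    """IFM-style merge: the buffer already matches the canonical render,
    so the next turn's tokens are simply appended."""
    return list(buffer) + list(new_tokens)


def merge_tokens_legacy(buffer: List[int], new_tokens: List[int]) -> List[int]:
    """Legacy-style merge (assumed behavior): the model stops at <|im_end|>,
    but the template renders <|im_end|> followed by a newline, so the missing
    newline is restored before appending the next turn."""
    merged = list(buffer)
    if merged and merged[-1] == IM_END_ID:
        merged.append(NEWLINE_ID)
    return merged + list(new_tokens)
```

With the IFM template there is nothing to repair at the message boundary, which is why the new class can drop the fix entirely.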
Breaking change for downstream sbatch:
--tito-model k2v3 now refers to the IFM template. Legacy checkpoint
users must update to --tito-model k2v3_oldbackup. Misconfiguration
raises at init rather than silently producing wrong TITO buffers.
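
A rough sketch of how the registry switch and the init-time vocab check fit together is shown below. Only the class names, the enum members, and the special-token strings come from this commit message; the constructor signature, the `_check_vocab` and `build_tito_tokenizer` helpers, the use of ValueError instead of a bare assert, and the assumption that each vocab contains only its own ChatML tokens are hypothetical.

```python
from enum import Enum
from typing import Dict, Sequence


class TITOTokenizerType(str, Enum):
    K2V3 = "k2v3"                      # now the IFM template
    K2V3_OLDBACKUP = "k2v3_oldbackup"  # legacy <|im_end|>\n template


IFM_TOKENS = ("<|ifm|im_start|>", "<|ifm|im_end|>")
LEGACY_TOKENS = ("<|im_start|>", "<|im_end|>")


def _check_vocab(vocab: Dict[str, int], required: Sequence[str], suggestion: str) -> None:
    # Hypothetical helper: fail fast when the checkpoint's vocab does not
    # match the template this tokenizer class targets.
    missing = [tok for tok in required if tok not in vocab]
    if missing:
        raise ValueError(
            f"tokenizer vocab is missing {missing}; "
            f"did you mean --tito-model {suggestion}?"
        )


class K2V3TITOTokenizer:
    def __init__(self, vocab: Dict[str, int]) -> None:
        _check_vocab(vocab, IFM_TOKENS, TITOTokenizerType.K2V3_OLDBACKUP.value)
        self.vocab = vocab


class K2V3OldBackupTITOTokenizer:
    def __init__(self, vocab: Dict[str, int]) -> None:
        _check_vocab(vocab, LEGACY_TOKENS, TITOTokenizerType.K2V3.value)
        self.vocab = vocab


REGISTRY = {
    TITOTokenizerType.K2V3: K2V3TITOTokenizer,
    TITOTokenizerType.K2V3_OLDBACKUP: K2V3OldBackupTITOTokenizer,
}


def build_tito_tokenizer(name: str, vocab: Dict[str, int]):
    """Resolve a --tito-model value to a tokenizer class and construct it."""
    return REGISTRY[TITOTokenizerType(name)](vocab)
```

Under these assumptions, passing a legacy vocab with `--tito-model k2v3` fails at construction with a message naming `k2v3_oldbackup`, instead of producing a mismatched TITO buffer later.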
Out of scope (required separately for IFM training):
- IFM-compatible SGLang reasoning_parser + tool_parser (see
LLM360/sglang#33).
Verification:
- tests/fast/.../test_tito_k2v3.py: 43 passed, 12 skipped (skipped =
SGLang IFM parsers not yet in this container build).
- tests/fast/.../test_tito_k2v3_oldbackup.py: 54 passed (legacy
behavior unchanged).
* Use raw-string docstrings to display \n literally in tito K2V3 classes/tests
Docstrings on K2V3TITOTokenizer / K2V3OldBackupTITOTokenizer and the
two K2V3 test files contain visual references to the literal `\n`
escape sequence (the chat-template trailing newline). The previous
\\n escaping renders correctly but reads awkwardly in source. Convert
the affected docstrings to raw strings (r"""...""") so the source
literally contains \n, which is easier to read and write.
No code or test behavior changes.
Tested: 109 passed (55 IFM + 54 oldbackup) inside the agentic-rl
container with sglang PR #33 shadowed for the parser tests.
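
For reference, the docstring change is purely cosmetic at runtime. The small example below uses two hypothetical functions to show that an escaped docstring and a raw-string docstring hold the same text:

```python
def escaped_docstring():
    """The legacy template appends \\n after <|im_end|>."""


def raw_docstring():
    r"""The legacy template appends \n after <|im_end|>."""


# Both docstrings contain a literal backslash followed by "n",
# not an actual newline character.
assert escaped_docstring.__doc__ == raw_docstring.__doc__
assert "\n" not in raw_docstring.__doc__
```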
---------
Co-authored-by: Zhenting Wang <zhenting.wang@mbzuai.ac.ae>
1 parent 2b9b705 commit 9d4262c
3 files changed
Lines changed: 1440 additions & 105 deletions
File tree
- miles/utils/chat_template_utils
- tests/fast/utils/chat_template_utils