fix(tokenizer): match HuggingFace tojson formatting #1478

Merged: CatherineSue merged 4 commits into main from codex/fix-chat-template-tojson-formatting on May 13, 2026

@CatherineSue CatherineSue commented May 12, 2026

Member

Description

Fixes: https://github.com/ai-jz/serve-qa/issues/137

Problem

HuggingFace chat templates call the custom tojson filter with the same keyword arguments as Python's json.dumps(..., ensure_ascii=..., indent=..., separators=..., sort_keys=...). Our implementation accepted those kwargs but still serialized through serde_json::to_string / PrettyFormatter. As a result, default output was compact ({"name":"get_weather"}), explicit separators were ignored, ensure_ascii=True was ignored, and object keys could be sorted before the template rendered. That created tokenization drift versus Transformers/vLLM, including the serve-qa#137 tool-call spacing mismatch.
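For reference, the spacing difference described above can be reproduced with Python's standard json module. This is only a sketch of the reference behavior; the tool payload is a made-up example and is not taken from the PR or the linked issue.

```python
import json

# Hypothetical tool-call payload, used only to illustrate the spacing difference.
tool = {"name": "get_weather", "arguments": {"city": "Paris"}}

# With indent=None, json.dumps defaults to the separators (", ", ": ").
python_default = json.dumps(tool)

# Compact output, which is what a serde_json::to_string-style serializer emits.
compact = json.dumps(tool, separators=(",", ":"))

print(python_default)  # {"name": "get_weather", "arguments": {"city": "Paris"}}
print(compact)         # {"name":"get_weather","arguments":{"city":"Paris"}}
```

The two strings decode to the same object but differ byte for byte, which is enough to change how a tokenizer splits the rendered prompt.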

There is also a preserve-order issue: matching HuggingFace/Python default behavior requires preserving object insertion order when sort_keys is not requested. Without order preservation, request/tool JSON can lose source order before the tojson filter ever serializes it.

Solution

Implement a Python-compatible JSON formatter for the chat-template tojson filter. It preserves insertion order by enabling preserve_order in both serde_json and MiniJinja, uses Python default separators (", ", ": " without indent and ",", ": " with indent), honors explicit separators, supports ensure_ascii, and only sorts recursively when templates explicitly pass sort_keys=True.
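The separator rules listed above are those of Python's json.dumps, so they can be checked directly against the standard library. The snippet below is a reference sketch with an arbitrary sample value; it does not show the Rust formatter itself.

```python
import json

data = {"a": 1, "b": [1, 2]}

# Without indent, Python defaults to (", ", ": ").
flat = json.dumps(data)

# With indent, the item separator drops its trailing space: (",", ": ").
indented = json.dumps(data, indent=2)

# Explicit separators override the defaults in either mode.
compact_indented = json.dumps(data, indent=2, separators=(",", ":"))

print(flat)  # {"a": 1, "b": [1, 2]}
print(indented)
print(compact_indented)
```

With indent set, no line of the output ends in a space, because the newline follows the bare comma.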

The reason both order features are needed is that preserving order only at serialization is too late: request/tool JSON can first enter as serde_json::Value, then move through MiniJinja values before tojson runs. Both layers need order-preserving maps for default HF parity.
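The ordering behavior being matched can be illustrated in Python, where parsed objects keep source order and sorting happens only on request. The sample JSON below is an assumption made for illustration.

```python
import json

# JSON text as it might arrive in a request; the key order is not alphabetical.
raw = '{"type": "function", "name": "get_weather"}'

# Python dicts keep insertion order, so parsing preserves the source order.
parsed = json.loads(raw)
default_out = json.dumps(parsed)

# sort_keys=True sorts recursively, including nested objects.
sorted_out = json.dumps({"b": {"z": 1, "a": 2}, "a": 0}, sort_keys=True)

print(default_out)  # {"type": "function", "name": "get_weather"}
print(sorted_out)   # {"a": 0, "b": {"a": 2, "z": 1}}
```

If any layer between parsing and serialization reorders keys, the default output can no longer round-trip to the source text, which is why the order has to survive every intermediate representation.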

Preserve-order impact

This PR enables serde_json/preserve_order at the workspace level. That means serde_json::Value::Object now serializes in insertion/source order instead of sorted-key order, which can affect byte-level JSON output in model_gateway paths that serialize serde_json::Value objects with serde_json::to_string or serde_json::to_vec.

This does not change JSON whitespace, typed Rust struct field order, BTreeMap ordering, or HashMap iteration behavior. The spacing fix is scoped to the chat-template tojson formatter; the broader workspace impact is object key ordering for serde_json::Value.

Changes

  • Add a PythonJsonFormatter for HuggingFace/Python json.dumps separator and ASCII escaping behavior.
  • Parse and validate tojson(separators=...), including indented output.
  • Enable serde_json and minijinja order preservation for chat-template input objects.
  • Expand tokenizer integration coverage for default order, compact separators, indentation, ensure_ascii=True, and sort_keys=True.

Test Plan

  • cargo +nightly fmt --check
  • cargo test -p llm-tokenizer --test chat_template_integration --profile dev-opt
  • Commit hook: cargo +nightly fmt --all --
  • Commit hook: cargo clippy --workspace --all-targets --all-features -- -D warnings

Checklist

  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

Summary by CodeRabbit

  • New Features

    • Added Python-compatible JSON formatting options for template filters: ensure_ascii, indent, separators, sort_keys, with a unified serializer that preserves insertion order by default.
  • Tests

    • Expanded integration tests for formatting variants and invalid-argument validation; added end-to-end assertions to accept any declared tool and validate parsed function-call arguments and schemas.
  • Chores

    • Updated dependency metadata to enable order-preserving JSON behavior.

@CatherineSue CatherineSue requested a review from slin1237 as a code owner May 12, 2026 22:37
@github-actions github-actions Bot added the tokenizer, dependencies, and tests labels May 12, 2026

coderabbitai Bot commented May 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 67747835-c1d9-4803-90b9-8f0722f79928

📥 Commits

Reviewing files that changed from the base of the PR and between 002ab67 and f1255dc.

📒 Files selected for processing (1)
  • e2e_test/chat_completions/test_function_calling.py

Walkthrough

Replace the template tojson path with a Python-compatible serializer (supporting ensure_ascii, indent, separators, and optional sort_keys), enable ordered-key preservation in dependencies, and expand integration and e2e tests to validate formatting and kwarg validation.

Changes

Python-Compatible JSON Serialization for tojson Filter

Layer: Dependencies and import setup
File(s): Cargo.toml, crates/tokenizer/Cargo.toml, crates/tokenizer/src/chat_template.rs
Summary: Enable preserve_order for serde_json and minijinja; add std::io import and switch to serde_json::Formatter usage to support the custom Python-compatible serializer.

Layer: Python-compatible JSON formatter and utilities
File(s): crates/tokenizer/src/chat_template.rs
Summary: Add JsonSeparators, PythonJsonFormatter, separators parsing/validation, Python-style ASCII escaping, and serialize_with_python_json to implement json.dumps-compatible formatting for the tojson filter.

Layer: tojson filter refactoring
File(s): crates/tokenizer/src/chat_template.rs
Summary: tojson_filter now consumes and validates separators, indent, and ensure_ascii kwargs (and supports sort_keys), delegating serialization to serialize_with_python_json; previous indent-branch and helper removed.

Layer: Integration tests for tojson behavior
File(s): crates/tokenizer/tests/chat_template_integration.rs
Summary: test_tojson_with_all_huggingface_kwargs updated to provide data via template kwargs and expanded to assert outputs for sort_keys, separators (compact/default), indent variants, and ensure_ascii; new test_tojson_invalid_kwargs_rejected asserts specific validation errors for bad kwargs.

Layer: E2E function-calling test update
File(s): e2e_test/chat_completions/test_function_calling.py
Summary: test_function_call_required generalized to assert the returned function name is one of the declared tools, that arguments parse to a JSON object, and perform function-specific argument schema checks.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • lightseekorg/smg#439: Both PRs modify tokenizer's chat_template implementation — this PR changes tojson filter/serialization behavior while the related PR restructures chat template APIs/state.

Suggested labels

tokenizer, dependencies, tests

Suggested reviewers

  • slin1237
  • key4ng
  • whybeyoung

Poem

🐰 A rabbit formats JSON with care,
Python-style commas hop in line,
ensure_ascii twinkles like a stare,
indent and sort make keys align,
tests applaud — the serializer's fine.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name: Docstring Coverage
Status: ⚠️ Warning
Explanation: Docstring coverage is 69.57% which is insufficient. The required threshold is 80.00%.
Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name: Description Check
Status: ✅ Passed
Explanation: Check skipped - CodeRabbit’s high-level summary is enabled.

Check name: Title check
Status: ✅ Passed
Explanation: The title directly and accurately describes the main change: implementing Python-compatible JSON formatting for the tojson filter to match HuggingFace behavior, which is the primary objective of this PR.

Check name: Linked Issues check
Status: ✅ Passed
Explanation: Check skipped because no linked issues were found for this pull request.

Check name: Out of Scope Changes check
Status: ✅ Passed
Explanation: Check skipped because no linked issues were found for this pull request.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@Cargo.toml`:
- Line 45: The ETag computation in build_resource_etag and build_list_etag is
fragile because enabling serde_json's preserve_order makes serde_json::to_vec(…)
produce different key orders; fix by canonicalizing the JSON before hashing:
convert the Value produced for response bodies into a deterministically ordered
form (e.g., sort object keys recursively or serialize via a stable-sorted map)
prior to calling Sha256::digest(serde_json::to_vec(…)); alternatively ensure all
response structs always create fields in the exact same insertion order across
code paths, but preferred approach is to perform an explicit recursive key-sort
on the serde_json::Value used by build_resource_etag and build_list_etag so ETag
is stable regardless of insertion order or the preserve_order setting.

In `@crates/tokenizer/tests/chat_template_integration.rs`:
- Around line 303-366: The test test_tojson_with_all_huggingface_kwargs
currently only asserts success paths; add explicit failure-path assertions that
invalid tojson kwargs are rejected by
ChatTemplateProcessor::apply_chat_template. Create small templates invoking
tojson with bad values (e.g., separators with wrong shape/type, negative indent,
non-boolean ensure_ascii) and call apply_chat_template with those templates and
the same template_kwargs, asserting the call returns an Err (or panics
accordingly) and that the returned error/message mentions the invalid kwarg;
reference ChatTemplateProcessor::new and
ChatTemplateProcessor::apply_chat_template when adding these negative-test cases
so the validation logic stays covered.
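The canonicalization suggested in the Cargo.toml comment above can be sketched in Python as follows. The function name stable_etag and the sample bodies are hypothetical; the real build_resource_etag and build_list_etag are Rust functions that are not shown in this conversation, so this only illustrates the idea of hashing a recursively key-sorted encoding.

```python
import hashlib
import json


def stable_etag(body: dict) -> str:
    """Hypothetical helper: hash a canonical JSON encoding of the body.

    sort_keys=True sorts object keys recursively, and fixed separators
    remove whitespace variation, so insertion order cannot affect the hash.
    """
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, different insertion order at both nesting levels.
a = {"id": "r1", "meta": {"x": 1, "y": 2}}
b = {"meta": {"y": 2, "x": 1}, "id": "r1"}

print(stable_etag(a) == stable_etag(b))  # True
```

Hashing the insertion-ordered encoding instead would give different digests for a and b, which is the fragility the review comment describes.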

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 77961018-b929-4874-862e-05b55e094822

📥 Commits

Reviewing files that changed from the base of the PR and between 76d9c18 and 5e41409.

📒 Files selected for processing (4)
  • Cargo.toml
  • crates/tokenizer/Cargo.toml
  • crates/tokenizer/src/chat_template.rs
  • crates/tokenizer/tests/chat_template_integration.rs

Comment thread Cargo.toml
Comment thread crates/tokenizer/tests/chat_template_integration.rs
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
@CatherineSue CatherineSue force-pushed the codex/fix-chat-template-tojson-formatting branch from 5e41409 to 5691c33 May 12, 2026 22:42

@claude claude Bot left a comment

Thorough review of the PythonJsonFormatter implementation and related changes. No issues found.

What was reviewed:

  • PythonJsonFormatter: correctly matches Python's json.dumps separator/indent behavior. The has_value single-boolean tracking works correctly for nested structures due to serde_json's guaranteed call ordering (end_*_value always follows nested structure completion).
  • ensure_ascii: correct UTF-16 surrogate pair encoding for non-BMP characters, default false matches HuggingFace's policy ("ensure_ascii": False).
  • parse_separators: proper handling of the minijinja Value to serde_json::Value conversion with good error messages.
  • sort_json_keys: works correctly with preserve_order since keys are inserted in sorted order into a fresh IndexMap.
  • Workspace-wide preserve_order on serde_json: necessary due to Cargo feature unification; changes Map from BTreeMap to IndexMap everywhere, but this is the correct behavior for HuggingFace compatibility (Python dicts preserve insertion order).
  • Tests cover default separators, custom separators, indented + compact combos, ASCII escaping, and sort_keys.
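The surrogate-pair escaping mentioned in the ensure_ascii item can be checked against Python's json.dumps. The helper utf16_escape below is a hypothetical re-derivation written for illustration; it is not the Rust code under review.

```python
import json


def utf16_escape(ch: str) -> str:
    """Hypothetical helper: escape one character the way ensure_ascii=True does."""
    cp = ord(ch)
    if cp < 0x10000:
        # Characters in the Basic Multilingual Plane need a single \uXXXX escape.
        return f"\\u{cp:04x}"
    # Non-BMP characters are split into a UTF-16 high/low surrogate pair.
    cp -= 0x10000
    high = 0xD800 | (cp >> 10)
    low = 0xDC00 | (cp & 0x3FF)
    return f"\\u{high:04x}\\u{low:04x}"


bmp = json.dumps("日本語", ensure_ascii=True)
non_bmp = json.dumps("😀", ensure_ascii=True)

print(bmp)      # "\u65e5\u672c\u8a9e"
print(non_bmp)  # "\ud83d\ude00"
print('"' + utf16_escape("😀") + '"' == non_bmp)  # True
```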

0 🔴 Important · 0 🟡 Nit · 0 🟣 Pre-existing

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request enhances the tojson filter within the tokenizer crate to align with HuggingFace's implementation by mimicking Python's json.dumps behavior. It introduces a custom PythonJsonFormatter to support ensure_ascii, indent, and custom separators, while also enabling preserve_order for JSON serialization. Feedback indicates that ensure_ascii should default to true to achieve full parity with Python's standard behavior.
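The two defaults under discussion can be compared side by side. The tojson wrapper below is a hypothetical sketch built from the kwargs named in this PR, with ensure_ascii defaulting to False as the Claude review above describes for HuggingFace; bare json.dumps defaults to ensure_ascii=True.

```python
import json


def tojson(value, ensure_ascii=False, indent=None, separators=None, sort_keys=False):
    """Hypothetical template filter: json.dumps with ensure_ascii off by default."""
    return json.dumps(
        value,
        ensure_ascii=ensure_ascii,
        indent=indent,
        separators=separators,
        sort_keys=sort_keys,
    )


text = {"greeting": "こんにちは"}

plain_python = json.dumps(text)         # escapes non-ASCII as \uXXXX
filter_default = tojson(text)           # keeps the original characters
filter_ascii = tojson(text, ensure_ascii=True)

print(plain_python == filter_ascii)     # True
print("こんにちは" in filter_default)   # True
```

Which default is right depends on whether parity is measured against bare json.dumps or against the template filter that wraps it.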

Comment thread crates/tokenizer/src/chat_template.rs

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
crates/tokenizer/tests/chat_template_integration.rs (1)

303-366: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick win

Add explicit invalid-kwarg tests for tojson validation paths.

This block verifies success cases well, but it still doesn’t lock in rejection behavior for malformed kwargs (e.g., bad separators, negative indent).

Proposed test addition
 #[test]
 fn test_tojson_with_all_huggingface_kwargs() {
@@
     assert!(
         result.contains(r#"Ascii: "\u65e5\u672c\u8a9e""#),
         "ensure_ascii=True should escape non-ASCII text: {result}"
     );
 }
+
+#[test]
+fn test_tojson_invalid_kwargs_rejected() {
+    let messages: Vec<serde_json::Value> = vec![];
+    let mut template_kwargs = std::collections::HashMap::new();
+    template_kwargs.insert("data".to_string(), serde_json::json!({"k": 1}));
+
+    let bad_separators = ChatTemplateProcessor::new(
+        r#"{{ data|tojson(separators=',') }}"#.to_string(),
+    )
+    .unwrap();
+    let err = bad_separators
+        .apply_chat_template(
+            &messages,
+            ChatTemplateParams {
+                template_kwargs: Some(&template_kwargs),
+                ..Default::default()
+            },
+        )
+        .unwrap_err()
+        .to_string();
+    assert!(err.contains("separators must be a two-item sequence"));
+
+    let negative_indent = ChatTemplateProcessor::new(
+        r#"{{ data|tojson(indent=-1) }}"#.to_string(),
+    )
+    .unwrap();
+    let err = negative_indent
+        .apply_chat_template(
+            &messages,
+            ChatTemplateParams {
+                template_kwargs: Some(&template_kwargs),
+                ..Default::default()
+            },
+        )
+        .unwrap_err()
+        .to_string();
+    assert!(err.contains("indent cannot be negative"));
+}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/tokenizer/tests/chat_template_integration.rs` around lines 303 - 366,
Add negative tests alongside test_tojson_with_all_huggingface_kwargs to assert
that malformed kwargs are rejected: call
ChatTemplateProcessor::apply_chat_template (same call site used in
test_tojson_with_all_huggingface_kwargs) with template_kwargs containing invalid
values like separators set to a non-tuple/string, indent set to a negative
integer, and ensure_ascii set to a non-boolean; for each case assert that
apply_chat_template returns an Err (or unwrap_err) and the error message
mentions the specific invalid kwarg (e.g., "separators", "indent",
"ensure_ascii"); reuse the existing template string or create small templates
that invoke tojson with the bad kwargs and pass the params via
ChatTemplateParams::template_kwargs to exercise the same validation paths.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2b117356-6731-4e7b-9db6-51df5ce5d4c2

📥 Commits

Reviewing files that changed from the base of the PR and between 5e41409 and 5691c33.

📒 Files selected for processing (4)
  • Cargo.toml
  • crates/tokenizer/Cargo.toml
  • crates/tokenizer/src/chat_template.rs
  • crates/tokenizer/tests/chat_template_integration.rs

Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@e2e_test/chat_completions/test_function_calling.py`:
- Around line 449-455: The test currently only validates arguments for the
get_weather branch, so add equivalent required-arg checks for the "sub" branch:
when function_name == "sub" assert that args_obj contains "int_a" and "int_b"
and that both values are integers (use isinstance checks and clear assertion
messages); update the same function_name/args_obj logic in
test_function_calling.py to validate missing or non-int payloads for "sub" to
ensure tool_choice="required" fails on malformed inputs.
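A minimal sketch of the requested "sub" argument check might look like the following. The helper name check_sub_arguments is hypothetical, and the real test's surrounding structure is not shown in this conversation; only the int_a and int_b argument names come from the comment above.

```python
import json


def check_sub_arguments(arguments_json: str) -> dict:
    """Hypothetical helper: validate the arguments of a "sub" tool call."""
    args_obj = json.loads(arguments_json)
    assert isinstance(args_obj, dict), "arguments must parse to a JSON object"
    for key in ("int_a", "int_b"):
        assert key in args_obj, f"missing required argument: {key}"
        value = args_obj[key]
        # bool is a subclass of int in Python, so it is excluded explicitly.
        assert isinstance(value, int) and not isinstance(value, bool), (
            f"{key} must be an integer, got {value!r}"
        )
    return args_obj


print(check_sub_arguments('{"int_a": 5, "int_b": 3}'))  # {'int_a': 5, 'int_b': 3}
```

Excluding bool is a design choice of this sketch: JSON true would otherwise pass an isinstance(value, int) check.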

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a97a4e4e-0a4a-402e-8014-f6bde876a2da

📥 Commits

Reviewing files that changed from the base of the PR and between 5542bb6 and 002ab67.

📒 Files selected for processing (1)
  • e2e_test/chat_completions/test_function_calling.py

Comment thread e2e_test/chat_completions/test_function_calling.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 002ab67868

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread e2e_test/chat_completions/test_function_calling.py
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f1255dc81e

Comment thread e2e_test/chat_completions/test_function_calling.py
@CatherineSue CatherineSue merged commit 401e666 into main May 13, 2026
55 checks passed
@CatherineSue CatherineSue deleted the codex/fix-chat-template-tojson-formatting branch May 13, 2026 02:09
zach-li-sudo pushed a commit to zach-li-sudo/smg that referenced this pull request May 13, 2026
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>