fix(tokenizer): match HuggingFace tojson formatting#1478
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: ASSERTIVE Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughReplace the template tojson path with a Python-compatible serializer (supporting ensure_ascii, indent, separators, and optional sort_keys), enable ordered-key preservation in dependencies, and expand integration and e2e tests to validate formatting and kwarg validation. ChangesPython-Compatible JSON Serialization for tojson Filter
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@Cargo.toml`:
- Line 45: The ETag computation in build_resource_etag and build_list_etag is
fragile because enabling serde_json's preserve_order makes serde_json::to_vec(…)
produce different key orders; fix by canonicalizing the JSON before hashing:
convert the Value produced for response bodies into a deterministically ordered
form (e.g., sort object keys recursively or serialize via a stable-sorted map)
prior to calling Sha256::digest(serde_json::to_vec(…)); alternatively ensure all
response structs always create fields in the exact same insertion order across
code paths, but preferred approach is to perform an explicit recursive key-sort
on the serde_json::Value used by build_resource_etag and build_list_etag so ETag
is stable regardless of insertion order or the preserve_order setting.
In `@crates/tokenizer/tests/chat_template_integration.rs`:
- Around line 303-366: The test test_tojson_with_all_huggingface_kwargs
currently only asserts success paths; add explicit failure-path assertions that
invalid tojson kwargs are rejected by
ChatTemplateProcessor::apply_chat_template. Create small templates invoking
tojson with bad values (e.g., separators with wrong shape/type, negative indent,
non-boolean ensure_ascii) and call apply_chat_template with those templates and
the same template_kwargs, asserting the call returns an Err (or panics
accordingly) and that the returned error/message mentions the invalid kwarg;
reference ChatTemplateProcessor::new and
ChatTemplateProcessor::apply_chat_template when adding these negative-test cases
so the validation logic stays covered.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 77961018-b929-4874-862e-05b55e094822
📒 Files selected for processing (4)
Cargo.tomlcrates/tokenizer/Cargo.tomlcrates/tokenizer/src/chat_template.rscrates/tokenizer/tests/chat_template_integration.rs
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
5e41409 to
5691c33
Compare
There was a problem hiding this comment.
Thorough review of the PythonJsonFormatter implementation and related changes. No issues found.
What was reviewed:
PythonJsonFormatter— correctly matches Python'sjson.dumpsseparator/indent behavior. Thehas_valuesingle-boolean tracking works correctly for nested structures due to serde_json's guaranteed call ordering (end_*_valuealways follows nested structure completion).ensure_ascii— correct UTF-16 surrogate pair encoding for non-BMP characters, defaultfalsematches HuggingFace's policy ("ensure_ascii": False).parse_separators— proper handling of minijinjaValue→serde_json::Valueconversion with good error messages.sort_json_keys— works correctly withpreserve_ordersince keys are inserted in sorted order into a freshIndexMap.- Workspace-wide
preserve_orderonserde_json— necessary due to Cargo feature unification; changesMapfromBTreeMaptoIndexMapeverywhere, but this is the correct behavior for HuggingFace compatibility (Python dicts preserve insertion order). - Tests cover default separators, custom separators, indented + compact combos, ASCII escaping, and sort_keys.
0 🔴 Important · 0 🟡 Nit · 0 🟣 Pre-existing
There was a problem hiding this comment.
Code Review
This pull request enhances the tojson filter within the tokenizer crate to align with HuggingFace's implementation by mimicking Python's json.dumps behavior. It introduces a custom PythonJsonFormatter to support ensure_ascii, indent, and custom separators, while also enabling preserve_order for JSON serialization. Feedback indicates that ensure_ascii should default to true to achieve full parity with Python's standard behavior.
There was a problem hiding this comment.
♻️ Duplicate comments (1)
crates/tokenizer/tests/chat_template_integration.rs (1)
303-366: 🧹 Nitpick | 🔵 Trivial | ⚡ Quick winAdd explicit invalid-kwarg tests for
tojsonvalidation paths.This block verifies success cases well, but it still doesn’t lock in rejection behavior for malformed kwargs (e.g., bad
separators, negativeindent).Proposed test addition
#[test] fn test_tojson_with_all_huggingface_kwargs() { @@ assert!( result.contains(r#"Ascii: "\u65e5\u672c\u8a9e""#), "ensure_ascii=True should escape non-ASCII text: {result}" ); } + +#[test] +fn test_tojson_invalid_kwargs_rejected() { + let messages: Vec<serde_json::Value> = vec![]; + let mut template_kwargs = std::collections::HashMap::new(); + template_kwargs.insert("data".to_string(), serde_json::json!({"k": 1})); + + let bad_separators = ChatTemplateProcessor::new( + r#"{{ data|tojson(separators=',') }}"#.to_string(), + ) + .unwrap(); + let err = bad_separators + .apply_chat_template( + &messages, + ChatTemplateParams { + template_kwargs: Some(&template_kwargs), + ..Default::default() + }, + ) + .unwrap_err() + .to_string(); + assert!(err.contains("separators must be a two-item sequence")); + + let negative_indent = ChatTemplateProcessor::new( + r#"{{ data|tojson(indent=-1) }}"#.to_string(), + ) + .unwrap(); + let err = negative_indent + .apply_chat_template( + &messages, + ChatTemplateParams { + template_kwargs: Some(&template_kwargs), + ..Default::default() + }, + ) + .unwrap_err() + .to_string(); + assert!(err.contains("indent cannot be negative")); +}🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/tokenizer/tests/chat_template_integration.rs` around lines 303 - 366, Add negative tests alongside test_tojson_with_all_huggingface_kwargs to assert that malformed kwargs are rejected: call ChatTemplateProcessor::apply_chat_template (same call site used in test_tojson_with_all_huggingface_kwargs) with template_kwargs containing invalid values like separators set to a non-tuple/string, indent set to a negative integer, and ensure_ascii set to a non-boolean; for each case assert that apply_chat_template returns an Err (or unwrap_err) and the error message mentions the specific invalid kwarg (e.g., "separators", "indent", "ensure_ascii"); reuse the existing template string or create small templates that invoke tojson with the bad kwargs and pass the params via ChatTemplateParams::template_kwargs to exercise the same validation paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Duplicate comments:
In `@crates/tokenizer/tests/chat_template_integration.rs`:
- Around line 303-366: Add negative tests alongside
test_tojson_with_all_huggingface_kwargs to assert that malformed kwargs are
rejected: call ChatTemplateProcessor::apply_chat_template (same call site used
in test_tojson_with_all_huggingface_kwargs) with template_kwargs containing
invalid values like separators set to a non-tuple/string, indent set to a
negative integer, and ensure_ascii set to a non-boolean; for each case assert
that apply_chat_template returns an Err (or unwrap_err) and the error message
mentions the specific invalid kwarg (e.g., "separators", "indent",
"ensure_ascii"); reuse the existing template string or create small templates
that invoke tojson with the bad kwargs and pass the params via
ChatTemplateParams::template_kwargs to exercise the same validation paths.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2b117356-6731-4e7b-9db6-51df5ce5d4c2
📒 Files selected for processing (4)
Cargo.tomlcrates/tokenizer/Cargo.tomlcrates/tokenizer/src/chat_template.rscrates/tokenizer/tests/chat_template_integration.rs
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@e2e_test/chat_completions/test_function_calling.py`:
- Around line 449-455: The test currently only validates arguments for the
get_weather branch, so add equivalent required-arg checks for the "sub" branch:
when function_name == "sub" assert that args_obj contains "int_a" and "int_b"
and that both values are integers (use isinstance checks and clear assertion
messages); update the same function_name/args_obj logic in
test_function_calling.py to validate missing or non-int payloads for "sub" to
ensure tool_choice="required" fails on malformed inputs.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: a97a4e4e-0a4a-402e-8014-f6bde876a2da
📒 Files selected for processing (1)
e2e_test/chat_completions/test_function_calling.py
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 002ab67868
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f1255dc81e
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
Signed-off-by: Chang Su <8605658+CatherineSue@users.noreply.github.com>
Description
Fixes: https://github.com/ai-jz/serve-qa/issues/137
Problem
HuggingFace chat templates call the custom
tojsonfilter as Pythonjson.dumps(..., ensure_ascii=..., indent=..., separators=..., sort_keys=...). Our implementation accepted those kwargs but still serialized throughserde_json::to_string/PrettyFormatter, so default output was compact ({"name":"get_weather"}), explicitseparatorswas ignored,ensure_ascii=Truewas ignored, and object order could be sorted before the template rendered. That created tokenization drift versus Transformers/vLLM, including the serve-qa#137 tool-call spacing mismatch.There is also a preserve-order issue: matching HuggingFace/Python default behavior requires preserving object insertion order when
sort_keysis not requested. Without order preservation, request/tool JSON can lose source order before thetojsonfilter ever serializes it.Solution
Implement a Python-compatible JSON formatter for the chat-template
tojsonfilter. It preserves insertion order by enablingpreserve_orderin bothserde_jsonand MiniJinja, uses Python default separators (", ", ": "without indent and",", ": "with indent), honors explicitseparators, supportsensure_ascii, and only sorts recursively when templates explicitly passsort_keys=True.The reason both order features are needed is that preserving order only at serialization is too late: request/tool JSON can first enter as
serde_json::Value, then move through MiniJinja values beforetojsonruns. Both layers need order-preserving maps for default HF parity.Preserve-order impact
This PR enables
serde_json/preserve_orderat the workspace level. That meansserde_json::Value::Objectnow serializes in insertion/source order instead of sorted-key order, which can affect byte-level JSON output inmodel_gatewaypaths that serializeserde_json::Valueobjects withserde_json::to_stringorserde_json::to_vec.This does not change JSON whitespace, typed Rust struct field order,
BTreeMapordering, orHashMapiteration behavior. The spacing fix is scoped to the chat-templatetojsonformatter; the broader workspace impact is object key ordering forserde_json::Value.Changes
PythonJsonFormatterfor HuggingFace/Pythonjson.dumpsseparator and ASCII escaping behavior.tojson(separators=...), including indented output.serde_jsonandminijinjaorder preservation for chat-template input objects.ensure_ascii=True, andsort_keys=True.Test Plan
cargo +nightly fmt --checkcargo test -p llm-tokenizer --test chat_template_integration --profile dev-optcargo +nightly fmt --all --cargo clippy --workspace --all-targets --all-features -- -D warningsChecklist
cargo +nightly fmtpassescargo clippy --all-targets --all-features -- -D warningspassesSummary by CodeRabbit
New Features
Tests
Chores