fix(tokenizer): serialize DeepSeek V3.2/V4 prompt JSON with json.dumps spacing#1839
Conversation
The DeepSeek V3.2/V4 prompt encoders embed tool schemas and non-string argument values as JSON. They serialized with compact `serde_json::to_string`, but the reference encoding (encoding_dsv4.py, which vLLM serves from) uses `json.dumps(ensure_ascii=False)` with spaced `", "` / `": "` separators. Every tool-enabled prompt therefore diverged byte-for-byte from vLLM in the tool-schema block (every function-calling turn) and in fed-back tool arguments (every multi-turn step), shifting the model off its training distribution. This compounds across turns and explains the DeepSeek-V4 multi_turn collapse on the SMG arm of the BFCL A/B. Add a shared `python_json::to_python_json_string` matching json.dumps default separators and point both encoders' `to_json` at it. Vendor DeepSeek's 4 reference encoding fixtures and assert byte-for-byte parity (cases 1 and 3, the tool cases, failed before this fix). Signed-off-by: key4ng <rukeyang@gmail.com>
📝 WalkthroughWalkthroughAdds a new crate-internal ChangesPython JSON Formatter and Encoder Wiring
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a custom JSON serializer (python_json) to match Python's json.dumps(value, ensure_ascii=False) formatting (specifically, using spaced separators like ", " and ": "). This serializer is integrated into the DeepSeek V3.2 and V4 prompt encoders to prevent prompt drift and maintain parity with the reference Python implementation. Additionally, parity tests and golden fixtures for DeepSeek-V4 have been added, and pre-commit hooks have been updated to exclude these test fixtures. There are no review comments, so I have no feedback to provide.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Clean, focused fix with strong test coverage. The PythonDefaultFormatter correctly matches Python's json.dumps default separators, the shared helper eliminates duplication across V3.2/V4 encoders, and the 4 byte-for-byte parity tests against DeepSeek's reference vectors give high confidence in correctness.
Keep the PR to the encoder fix only; remove the vendored DeepSeek reference fixtures, the byte-parity integration test, and the pre-commit excludes that only existed to protect those fixtures. The python_json helper retains its self-contained unit tests. Signed-off-by: key4ng <rukeyang@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7e25ce4a99
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pub(crate) fn to_python_json_string(value: &Value) -> String { | ||
| let mut buf = Vec::new(); | ||
| let mut ser = Serializer::with_formatter(&mut buf, PythonDefaultFormatter); | ||
| if value.serialize(&mut ser).is_err() { |
There was a problem hiding this comment.
Match Python formatting for JSON numbers
When a tool schema or non-string tool argument contains floats rendered in exponent form (for example 1e20 or 1e-6), this still serializes through serde_json's number formatter rather than Python's json.dumps formatter; Python emits forms like 1e+20 / 1e-06, so those prompts remain byte-different from the DeepSeek V3.2/V4 reference and vLLM despite the new helper. This path is used for embedded tool schemas, response formats, and numeric tool arguments, so numeric bounds/defaults or float arguments can still shift the prompt off the intended distribution.
Useful? React with 👍 / 👎.
…n_dumps serde_json ships only compact and pretty formatters, neither matching Python's single-line `json.dumps` spacing, so a small custom formatter is needed. Keep it as a focused 3-override formatter in a dedicated `json_dumps` module (`json_dumps::to_string`) rather than relocating chat_template's configurable HF-tojson formatter — the two have different needs (fixed default vs configurable), and a move would be pure churn. chat_template is left untouched. Signed-off-by: key4ng <rukeyang@gmail.com>
…nt untouched Guards against spacing `,` / `:` that appear inside string keys/values (the failure mode of a naive compact-then-replace approach); each case matches json.dumps(ensure_ascii=False) byte-for-byte. Signed-off-by: key4ng <rukeyang@gmail.com>
Description
Problem
On the BFCL nightly A/B (pure vLLM vs SMG→vLLM, same model/engine/checkpoint/sampling — only the frontend differs), DeepSeek-V4
multi_turnscores collapsed on the SMG arm (empty-{}-argsforce_terminatedloops), while non-tool categories were far less affected.SMG does have and use a DeepSeek-V4 prompt encoder (
crates/tokenizer/src/encoders/deepseek_v4.rs, a port of the model'sencoding_dsv4.py, selected viaconfig.jsonarchDeepseekV4ForCausalLM). The bug was in how it serialized embedded JSON. Itsto_jsonused compactserde_json::to_string, but the reference encoder (which vLLM serves from) uses Pythonjson.dumps(ensure_ascii=False):So every tool-enabled prompt differed byte-for-byte between the two arms in:
This shifts the model off its training distribution; the error compounds across turns and is amplified by
multi_turn's binary pass/fail cascade — hence the disproportionatemulti_turndrop. The V3.2 encoder had the identical bug.Solution
Add a focused
json_dumps::to_stringhelper — aserde_jsonFormatterthat overrides just the three structural separators serde_json lacks (": ",", ") — and point both DeepSeek encoders'to_jsonat it. Because it overrides only structural callbacks (not string content),,/:inside keys/values are left untouched.chat_template.rsalready has a separate, configurable Python-JSON formatter for the HuggingFacetojsonfilter (indent / separators /ensure_ascii); it is intentionally left untouched — the two have different needs (fixed default vs configurable), so consolidating would be churn, not simplification.Changes
crates/tokenizer/src/json_dumps.rs— newto_stringhelper (3-override formatter) + unit tests.crates/tokenizer/src/encoders/deepseek_v4.rs,deepseek_v32.rs—to_jsonnow uses it.crates/tokenizer/src/lib.rs— module declaration.(4 files, ~+113/−5.)
Test Plan
Verified against DeepSeek's 4 official golden encoding vectors (
encoding/tests/in the model distribution — what vLLM renders from), run end-to-end through the encoder:The two failing cases were exactly the tool cases; after the fix all 4 are byte-identical (incl. case 1's full multi-turn body). The committed unit tests also assert parity with
json.dumps(ensure_ascii=False)and that separators inside string content are not spaced.cargo +nightly fmt,cargo clippy -p llm-tokenizer --all-targets -- -D warnings, and 140 tokenizer lib tests all pass.Checklist
cargo +nightly fmtpassescargo clippy --all-targets --all-features -- -D warningspasses (tokenizer crate)