fix(tokenizer): serialize DeepSeek V3.2/V4 prompt JSON with json.dumps spacing by key4ng · Pull Request #1839 · lightseekorg/smg

key4ng · 2026-06-24T02:24:46Z

Description

Problem

On the BFCL nightly A/B (pure vLLM vs SMG→vLLM, same model/engine/checkpoint/sampling — only the frontend differs), DeepSeek-V4 multi_turn scores collapsed on the SMG arm (empty-{}-args force_terminated loops), while non-tool categories were far less affected.

SMG does have and use a DeepSeek-V4 prompt encoder (crates/tokenizer/src/encoders/deepseek_v4.rs, a port of the model's encoding_dsv4.py, selected via config.json arch DeepseekV4ForCausalLM). The bug was in how it serialized embedded JSON. Its to_json used compact serde_json::to_string, but the reference encoder (which vLLM serves from) uses Python json.dumps(ensure_ascii=False):

SMG:  {"name":"get_weather","description":"Get the wea...    ← compact
vLLM: {"name": "get_weather", "description": "Get the ...    ← json.dumps (spaced)

So every tool-enabled prompt differed byte-for-byte between the two arms in:

the tool-schema block (present on every function-calling turn), and
non-string argument values of fed-back tool calls (present on every multi-turn step).

This shifts the model off its training distribution; the error compounds across turns and is amplified by multi_turn's binary pass/fail cascade — hence the disproportionate multi_turn drop. The V3.2 encoder had the identical bug.

serde_json ships only CompactFormatter (no spaces) and PrettyFormatter (multi-line); neither matches Python's single-line-with-spaces json.dumps, so a small custom Formatter is required.

Distinct from the closed #1836, which fixed JSON spacing in the tool parser (the arguments string in the API response). This PR fixes the prompt encoder (the tool-schema block + fed-back args rendered into the prompt) — the opposite end of the model.

Solution

Add a focused json_dumps::to_string helper — a serde_json Formatter that overrides just the three structural separators serde_json lacks (": ", ", ") — and point both DeepSeek encoders' to_json at it. Because it overrides only structural callbacks (not string content), , / : inside keys/values are left untouched.

chat_template.rs already has a separate, configurable Python-JSON formatter for the HuggingFace tojson filter (indent / separators / ensure_ascii); it is intentionally left untouched — the two have different needs (fixed default vs configurable), so consolidating would be churn, not simplification.

Changes

crates/tokenizer/src/json_dumps.rs — new to_string helper (3-override formatter) + unit tests.
crates/tokenizer/src/encoders/deepseek_v4.rs, deepseek_v32.rs — to_json now uses it.
crates/tokenizer/src/lib.rs — module declaration.

(4 files, ~+113/−5.)

Test Plan

Verified against DeepSeek's 4 official golden encoding vectors (encoding/tests/ in the model distribution — what vLLM renders from), run end-to-end through the encoder:

case	covers	before	after
1	thinking + multi-turn tool calls	❌	✅
2	thinking, no tools	✅	✅
3	developer + search + tools	❌	✅
4	quick-instruction (chat)	✅	✅

The two failing cases were exactly the tool cases; after the fix all 4 are byte-identical (incl. case 1's full multi-turn body). The committed unit tests also assert parity with json.dumps(ensure_ascii=False) and that separators inside string content are not spaced. cargo +nightly fmt, cargo clippy -p llm-tokenizer --all-targets -- -D warnings, and 140 tokenizer lib tests all pass.

The golden vectors are DeepSeek model-distribution files, so they are not vendored into the repo; the parity above was re-verified locally before finalizing.

Final score confirmation needs a nightly re-run. Separately worth verifying there (not an encoder bug, so not fixed here): SMG defaults V4 to chat mode (ThinkingToggle::DefaultOff) — confirm that matches vLLM's --tokenizer-mode deepseek_v4 default.

Checklist

cargo +nightly fmt passes
cargo clippy --all-targets --all-features -- -D warnings passes (tokenizer crate)
(Optional) Documentation updated
(Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

The DeepSeek V3.2/V4 prompt encoders embed tool schemas and non-string argument values as JSON. They serialized with compact `serde_json::to_string`, but the reference encoding (encoding_dsv4.py, which vLLM serves from) uses `json.dumps(ensure_ascii=False)` with spaced `", "` / `": "` separators. Every tool-enabled prompt therefore diverged byte-for-byte from vLLM in the tool-schema block (every function-calling turn) and in fed-back tool arguments (every multi-turn step), shifting the model off its training distribution. This compounds across turns and explains the DeepSeek-V4 multi_turn collapse on the SMG arm of the BFCL A/B. Add a shared `python_json::to_python_json_string` matching json.dumps default separators and point both encoders' `to_json` at it. Vendor DeepSeek's 4 reference encoding fixtures and assert byte-for-byte parity (cases 1 and 3, the tool cases, failed before this fix). Signed-off-by: key4ng <rukeyang@gmail.com>

coderabbitai · 2026-06-24T02:25:01Z

📝 Walkthrough

Walkthrough

Adds a new crate-internal json_dumps module implementing a custom serde_json::ser::Formatter that emits Python json.dumps-style separators (": ", ", ") with raw UTF-8 output. The module is declared in lib.rs and the DeepSeek V3.2 and V4 encoders' to_json helpers are updated to delegate to it.

Changes

Python JSON Formatter and Encoder Wiring

Layer / File(s)	Summary
`json_dumps` module: formatter, function, and tests `crates/tokenizer/src/lib.rs`, `crates/tokenizer/src/json_dumps.rs`	Adds `PythonDefaultFormatter` implementing `serde_json::ser::Formatter` with `": "`/`", "` separators, exposes `to_string` (returns `"null"` on any failure), includes unit tests for spacing, Unicode, and string delimiter preservation, and declares the module in `lib.rs`.
DeepSeek encoders wired to `json_dumps` `crates/tokenizer/src/encoders/deepseek_v32.rs`, `crates/tokenizer/src/encoders/deepseek_v4.rs`	Replaces direct `serde_json::to_string` calls in the `to_json` helpers of both encoders with delegation to `crate::json_dumps::to_python_json_string` / `crate::json_dumps::to_string`.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

lightseekorg/smg#1373: Modifies the same to_json helper in deepseek_v32.rs and deepseek_v4.rs to control prompt byte formatting.

Suggested labels

tests

Suggested reviewers

slin1237
claude

Poem

🐇 A rabbit hops through JSON land,
Where colons and commas space as planned,
": " here, ", " there,
Python formatting fills the air—
No compact bytes shall fool the model's hand! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 76.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Linked Issues check	✅ Passed	PR correctly implements json.dumps-style JSON formatting for DeepSeek encoders to fix prompt divergence. Changes align with `#1836` (tool_parser fix) and address the complementary encoder-side issue.
Out of Scope Changes check	✅ Passed	All changes are scoped to fixing JSON formatting in DeepSeek encoders and supporting helper module. No unrelated modifications detected.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main change: fixing JSON serialization in DeepSeek V3.2/V4 encoders to use json.dumps-style spacing instead of compact format.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/dsv4-encoder-json-spacing

_{Comment @coderabbitai help to get the list of available commands.}

gemini-code-assist

Code Review

This pull request introduces a custom JSON serializer (python_json) to match Python's json.dumps(value, ensure_ascii=False) formatting (specifically, using spaced separators like ", " and ": "). This serializer is integrated into the DeepSeek V3.2 and V4 prompt encoders to prevent prompt drift and maintain parity with the reference Python implementation. Additionally, parity tests and golden fixtures for DeepSeek-V4 have been added, and pre-commit hooks have been updated to exclude these test fixtures. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

claude

Clean, focused fix with strong test coverage. The PythonDefaultFormatter correctly matches Python's json.dumps default separators, the shared helper eliminates duplication across V3.2/V4 encoders, and the 4 byte-for-byte parity tests against DeepSeek's reference vectors give high confidence in correctness.

Keep the PR to the encoder fix only; remove the vendored DeepSeek reference fixtures, the byte-parity integration test, and the pre-commit excludes that only existed to protect those fixtures. The python_json helper retains its self-contained unit tests. Signed-off-by: key4ng <rukeyang@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e25ce4a99

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-24T02:31:13Z

+pub(crate) fn to_python_json_string(value: &Value) -> String {
+    let mut buf = Vec::new();
+    let mut ser = Serializer::with_formatter(&mut buf, PythonDefaultFormatter);
+    if value.serialize(&mut ser).is_err() {


Match Python formatting for JSON numbers

When a tool schema or non-string tool argument contains floats rendered in exponent form (for example 1e20 or 1e-6), this still serializes through serde_json's number formatter rather than Python's json.dumps formatter; Python emits forms like 1e+20 / 1e-06, so those prompts remain byte-different from the DeepSeek V3.2/V4 reference and vLLM despite the new helper. This path is used for embedded tool schemas, response formats, and numeric tool arguments, so numeric bounds/defaults or float arguments can still shift the prompt off the intended distribution.

Useful? React with 👍 / 👎.

…n_dumps serde_json ships only compact and pretty formatters, neither matching Python's single-line `json.dumps` spacing, so a small custom formatter is needed. Keep it as a focused 3-override formatter in a dedicated `json_dumps` module (`json_dumps::to_string`) rather than relocating chat_template's configurable HF-tojson formatter — the two have different needs (fixed default vs configurable), and a move would be pure churn. chat_template is left untouched. Signed-off-by: key4ng <rukeyang@gmail.com>

…nt untouched Guards against spacing `,` / `:` that appear inside string keys/values (the failure mode of a naive compact-then-replace approach); each case matches json.dumps(ensure_ascii=False) byte-for-byte. Signed-off-by: key4ng <rukeyang@gmail.com>

key4ng requested review from CatherineSue and slin1237 as code owners June 24, 2026 02:24

github-actions Bot added documentation Improvements or additions to documentation tokenizer Tokenizer related changes tests Test changes labels Jun 24, 2026

gemini-code-assist Bot reviewed Jun 24, 2026

View reviewed changes

claude Bot approved these changes Jun 24, 2026

View reviewed changes

chatgpt-codex-connector Bot reviewed Jun 24, 2026

View reviewed changes

coderabbitai Bot approved these changes Jun 24, 2026

View reviewed changes

github-actions Bot removed documentation Improvements or additions to documentation tests Test changes labels Jun 24, 2026

key4ng added 2 commits June 23, 2026 20:34

key4ng changed the title ~~fix(tokenizer): render DeepSeek V3.2/V4 prompt JSON like json.dumps~~ fix(tokenizer): serialize DeepSeek V3.2/V4 prompt JSON with json.dumps spacing Jun 24, 2026

slin1237 approved these changes Jun 24, 2026

View reviewed changes

key4ng merged commit 78389d5 into main Jun 25, 2026
51 checks passed

key4ng deleted the fix/dsv4-encoder-json-spacing branch June 25, 2026 18:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(tokenizer): serialize DeepSeek V3.2/V4 prompt JSON with json.dumps spacing#1839

fix(tokenizer): serialize DeepSeek V3.2/V4 prompt JSON with json.dumps spacing#1839
key4ng merged 4 commits into
mainfrom
fix/dsv4-encoder-json-spacing

key4ng commented Jun 24, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

claude Bot left a comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

key4ng commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Problem

Solution

Changes

Test Plan

Uh oh!

coderabbitai Bot commented Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

key4ng commented Jun 24, 2026 •

edited

Loading

coderabbitai Bot commented Jun 24, 2026 •

edited

Loading