Skip to content

fix(tokenizer): serialize DeepSeek V3.2/V4 prompt JSON with json.dumps spacing#1839

Merged
key4ng merged 4 commits into
mainfrom
fix/dsv4-encoder-json-spacing
Jun 25, 2026
Merged

fix(tokenizer): serialize DeepSeek V3.2/V4 prompt JSON with json.dumps spacing#1839
key4ng merged 4 commits into
mainfrom
fix/dsv4-encoder-json-spacing

Conversation

@key4ng

@key4ng key4ng commented Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Description

Problem

On the BFCL nightly A/B (pure vLLM vs SMG→vLLM, same model/engine/checkpoint/sampling — only the frontend differs), DeepSeek-V4 multi_turn scores collapsed on the SMG arm (empty-{}-args force_terminated loops), while non-tool categories were far less affected.

SMG does have and use a DeepSeek-V4 prompt encoder (crates/tokenizer/src/encoders/deepseek_v4.rs, a port of the model's encoding_dsv4.py, selected via config.json arch DeepseekV4ForCausalLM). The bug was in how it serialized embedded JSON. Its to_json used compact serde_json::to_string, but the reference encoder (which vLLM serves from) uses Python json.dumps(ensure_ascii=False):

SMG:  {"name":"get_weather","description":"Get the wea...    ← compact
vLLM: {"name": "get_weather", "description": "Get the ...    ← json.dumps (spaced)

So every tool-enabled prompt differed byte-for-byte between the two arms in:

  • the tool-schema block (present on every function-calling turn), and
  • non-string argument values of fed-back tool calls (present on every multi-turn step).

This shifts the model off its training distribution; the error compounds across turns and is amplified by multi_turn's binary pass/fail cascade — hence the disproportionate multi_turn drop. The V3.2 encoder had the identical bug.

serde_json ships only CompactFormatter (no spaces) and PrettyFormatter (multi-line); neither matches Python's single-line-with-spaces json.dumps, so a small custom Formatter is required.

Distinct from the closed #1836, which fixed JSON spacing in the tool parser (the arguments string in the API response). This PR fixes the prompt encoder (the tool-schema block + fed-back args rendered into the prompt) — the opposite end of the model.

Solution

Add a focused json_dumps::to_string helper — a serde_json Formatter that overrides just the three structural separators serde_json lacks (": ", ", ") — and point both DeepSeek encoders' to_json at it. Because it overrides only structural callbacks (not string content), , / : inside keys/values are left untouched.

chat_template.rs already has a separate, configurable Python-JSON formatter for the HuggingFace tojson filter (indent / separators / ensure_ascii); it is intentionally left untouched — the two have different needs (fixed default vs configurable), so consolidating would be churn, not simplification.

Changes

  • crates/tokenizer/src/json_dumps.rs — new to_string helper (3-override formatter) + unit tests.
  • crates/tokenizer/src/encoders/deepseek_v4.rs, deepseek_v32.rsto_json now uses it.
  • crates/tokenizer/src/lib.rs — module declaration.

(4 files, ~+113/−5.)

Test Plan

Verified against DeepSeek's 4 official golden encoding vectors (encoding/tests/ in the model distribution — what vLLM renders from), run end-to-end through the encoder:

case covers before after
1 thinking + multi-turn tool calls
2 thinking, no tools
3 developer + search + tools
4 quick-instruction (chat)

The two failing cases were exactly the tool cases; after the fix all 4 are byte-identical (incl. case 1's full multi-turn body). The committed unit tests also assert parity with json.dumps(ensure_ascii=False) and that separators inside string content are not spaced. cargo +nightly fmt, cargo clippy -p llm-tokenizer --all-targets -- -D warnings, and 140 tokenizer lib tests all pass.

The golden vectors are DeepSeek model-distribution files, so they are not vendored into the repo; the parity above was re-verified locally before finalizing.

Final score confirmation needs a nightly re-run. Separately worth verifying there (not an encoder bug, so not fixed here): SMG defaults V4 to chat mode (ThinkingToggle::DefaultOff) — confirm that matches vLLM's --tokenizer-mode deepseek_v4 default.

Checklist
  • cargo +nightly fmt passes
  • cargo clippy --all-targets --all-features -- -D warnings passes (tokenizer crate)
  • (Optional) Documentation updated
  • (Optional) Please join us on Slack #sig-smg to discuss, review, and merge PRs

The DeepSeek V3.2/V4 prompt encoders embed tool schemas and non-string
argument values as JSON. They serialized with compact
`serde_json::to_string`, but the reference encoding (encoding_dsv4.py,
which vLLM serves from) uses `json.dumps(ensure_ascii=False)` with spaced
`", "` / `": "` separators.

Every tool-enabled prompt therefore diverged byte-for-byte from vLLM in
the tool-schema block (every function-calling turn) and in fed-back tool
arguments (every multi-turn step), shifting the model off its training
distribution. This compounds across turns and explains the DeepSeek-V4
multi_turn collapse on the SMG arm of the BFCL A/B.

Add a shared `python_json::to_python_json_string` matching json.dumps
default separators and point both encoders' `to_json` at it. Vendor
DeepSeek's 4 reference encoding fixtures and assert byte-for-byte parity
(cases 1 and 3, the tool cases, failed before this fix).

Signed-off-by: key4ng <rukeyang@gmail.com>
@github-actions github-actions Bot added documentation Improvements or additions to documentation tokenizer Tokenizer related changes tests Test changes labels Jun 24, 2026
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds a new crate-internal json_dumps module implementing a custom serde_json::ser::Formatter that emits Python json.dumps-style separators (": ", ", ") with raw UTF-8 output. The module is declared in lib.rs and the DeepSeek V3.2 and V4 encoders' to_json helpers are updated to delegate to it.

Changes

Python JSON Formatter and Encoder Wiring

Layer / File(s) Summary
json_dumps module: formatter, function, and tests
crates/tokenizer/src/lib.rs, crates/tokenizer/src/json_dumps.rs
Adds PythonDefaultFormatter implementing serde_json::ser::Formatter with ": "/", " separators, exposes to_string (returns "null" on any failure), includes unit tests for spacing, Unicode, and string delimiter preservation, and declares the module in lib.rs.
DeepSeek encoders wired to json_dumps
crates/tokenizer/src/encoders/deepseek_v32.rs, crates/tokenizer/src/encoders/deepseek_v4.rs
Replaces direct serde_json::to_string calls in the to_json helpers of both encoders with delegation to crate::json_dumps::to_python_json_string / crate::json_dumps::to_string.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • lightseekorg/smg#1373: Modifies the same to_json helper in deepseek_v32.rs and deepseek_v4.rs to control prompt byte formatting.

Suggested labels

tests

Suggested reviewers

  • slin1237
  • claude

Poem

🐇 A rabbit hops through JSON land,
Where colons and commas space as planned,
": " here, ", " there,
Python formatting fills the air—
No compact bytes shall fool the model's hand! ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 76.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Linked Issues check ✅ Passed PR correctly implements json.dumps-style JSON formatting for DeepSeek encoders to fix prompt divergence. Changes align with #1836 (tool_parser fix) and address the complementary encoder-side issue.
Out of Scope Changes check ✅ Passed All changes are scoped to fixing JSON formatting in DeepSeek encoders and supporting helper module. No unrelated modifications detected.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: fixing JSON serialization in DeepSeek V3.2/V4 encoders to use json.dumps-style spacing instead of compact format.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/dsv4-encoder-json-spacing

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a custom JSON serializer (python_json) to match Python's json.dumps(value, ensure_ascii=False) formatting (specifically, using spaced separators like ", " and ": "). This serializer is integrated into the DeepSeek V3.2 and V4 prompt encoders to prevent prompt drift and maintain parity with the reference Python implementation. Additionally, parity tests and golden fixtures for DeepSeek-V4 have been added, and pre-commit hooks have been updated to exclude these test fixtures. There are no review comments, so I have no feedback to provide.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean, focused fix with strong test coverage. The PythonDefaultFormatter correctly matches Python's json.dumps default separators, the shared helper eliminates duplication across V3.2/V4 encoders, and the 4 byte-for-byte parity tests against DeepSeek's reference vectors give high confidence in correctness.

Keep the PR to the encoder fix only; remove the vendored DeepSeek
reference fixtures, the byte-parity integration test, and the
pre-commit excludes that only existed to protect those fixtures.
The python_json helper retains its self-contained unit tests.

Signed-off-by: key4ng <rukeyang@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7e25ce4a99

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/tokenizer/src/python_json.rs Outdated
pub(crate) fn to_python_json_string(value: &Value) -> String {
let mut buf = Vec::new();
let mut ser = Serializer::with_formatter(&mut buf, PythonDefaultFormatter);
if value.serialize(&mut ser).is_err() {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Match Python formatting for JSON numbers

When a tool schema or non-string tool argument contains floats rendered in exponent form (for example 1e20 or 1e-6), this still serializes through serde_json's number formatter rather than Python's json.dumps formatter; Python emits forms like 1e+20 / 1e-06, so those prompts remain byte-different from the DeepSeek V3.2/V4 reference and vLLM despite the new helper. This path is used for embedded tool schemas, response formats, and numeric tool arguments, so numeric bounds/defaults or float arguments can still shift the prompt off the intended distribution.

Useful? React with 👍 / 👎.

@github-actions github-actions Bot removed documentation Improvements or additions to documentation tests Test changes labels Jun 24, 2026
key4ng added 2 commits June 23, 2026 20:34
…n_dumps

serde_json ships only compact and pretty formatters, neither matching
Python's single-line `json.dumps` spacing, so a small custom formatter is
needed. Keep it as a focused 3-override formatter in a dedicated
`json_dumps` module (`json_dumps::to_string`) rather than relocating
chat_template's configurable HF-tojson formatter — the two have different
needs (fixed default vs configurable), and a move would be pure churn.
chat_template is left untouched.

Signed-off-by: key4ng <rukeyang@gmail.com>
…nt untouched

Guards against spacing `,` / `:` that appear inside string keys/values
(the failure mode of a naive compact-then-replace approach); each case
matches json.dumps(ensure_ascii=False) byte-for-byte.

Signed-off-by: key4ng <rukeyang@gmail.com>
@key4ng key4ng changed the title fix(tokenizer): render DeepSeek V3.2/V4 prompt JSON like json.dumps fix(tokenizer): serialize DeepSeek V3.2/V4 prompt JSON with json.dumps spacing Jun 24, 2026
@key4ng key4ng merged commit 78389d5 into main Jun 25, 2026
51 checks passed
@key4ng key4ng deleted the fix/dsv4-encoder-json-spacing branch June 25, 2026 18:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

tokenizer Tokenizer related changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants