fix: merge skip_one API and fix utf8_lossy surrogate backtrack by liuq19 · Pull Request #220 · cloudwego/sonic-rs

liuq19 · 2026-03-25T01:27:33Z

Summary

Merge skip_one() and skip_one_checked(checked) into a single skip_one(checked: bool) to reduce duplication
Add Parser::utf8_lossy() builder method mirroring Deserializer::utf8_lossy()
Fix surrogate backtrack in parse_escaped_utf8: when a high surrogate is followed by a non-\uXXXX sequence or an invalid low surrogate, backtrack the reader 6 bytes (only in utf8_lossy mode) so consumed bytes can be re-parsed correctly
- e.g. \uDA51\uD83D\uDE04 now correctly produces �😄 instead of ��
Non-lossy error path no longer changes reader index, preserving correct error positions

Test plan

New test test_utf8_lossy_surrogate_backtrack covers:
- High surrogate + non-\u content (\uD800abc → �abc)
- High + high + low (\uDA51\uD83D\uDE04 → �😄)
- High + non-surrogate \uXXXX (\uD800\u0041 → �A)
- Multiple consecutive lone surrogates (\uD800\uD801\uD802abc → ��abc)
- Non-lossy mode still errors on invalid surrogates
All 146 existing tests pass

- Merge `skip_one()` and `skip_one_checked(checked)` into a single `skip_one(checked: bool)` to reduce code duplication. - Add `Parser::utf8_lossy()` builder method mirroring `Deserializer::utf8_lossy()`. - Fix surrogate backtrack in `parse_escaped_utf8`: when a high surrogate is followed by a non-\uXXXX sequence or an invalid low surrogate, backtrack the reader 6 bytes so the consumed bytes can be re-parsed. This only applies in utf8_lossy mode; the non-lossy error path no longer changes the reader index, preserving correct error positions. - Add `test_utf8_lossy_surrogate_backtrack` covering: high+non-\u, high+high+low, high+non-surrogate \u, consecutive lone surrogates, and non-lossy error.

codecov · 2026-03-25T01:28:29Z

Codecov Report

❌ Patch coverage is 81.25000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.60%. Comparing base (f5aefd7) to head (1439d4c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
src/parser.rs	64.00%	9 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #220      +/-   ##
==========================================
+ Coverage   70.42%   70.60%   +0.18%     
==========================================
  Files          42       42              
  Lines        9524     9550      +26     
==========================================
+ Hits         6707     6743      +36     
+ Misses       2817     2807      -10

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

liuq19 · 2026-03-25T01:32:46Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Notes (below threshold, informational only):

de.rs:867 comment "we use faster skip, and will not validate the skipped parts" is inconsistent with skip_one(true) which IS the validated path. This is a pre-existing mismatch (the old skip_one() also called skip_one_checked(true)), not introduced by this PR.

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

Copilot

Pull request overview

This PR refactors the parser “skip one value” API, adds a Parser::utf8_lossy() builder for parity with Deserializer::utf8_lossy(), and adjusts UTF-16 surrogate handling in parse_escaped_utf8 to backtrack correctly in lossy mode so bytes can be re-parsed into the right output.

Changes:

Merge skip_one() / skip_one_checked(...) into skip_one(checked: bool) and update call sites.
Add Parser::utf8_lossy() builder method.
Fix lossy-mode surrogate handling by backtracking 6 bytes on invalid/non-\\uXXXX low-surrogate sequences; add regression test coverage.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
src/serde/mod.rs	Adds a regression test covering lossy surrogate backtracking and non-lossy error behavior.
src/serde/de.rs	Updates deserializer call sites to the new `skip_one(checked)` API.
src/parser.rs	Introduces `Parser::utf8_lossy()`, merges skip APIs, and implements lossy surrogate backtracking in `parse_escaped_utf8`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-03-25T01:32:56Z

@@ -727,9 +736,12 @@ where
            let low_bit = point2.wrapping_sub(0xdc00);
            if (low_bit >> 10) != 0 {
                if self.cfg.utf8_lossy {
+                    // point2 is not a valid low surrogate. Backtrack 6 bytes
+                    // so it can be re-parsed (e.g. \uDA51\uD83D\uDE04 → FFFD + 😄).
+                    let idx = self.read.index();
+                    self.read.set_index(idx - 6);
                    return Ok(0xFFFD);
                } else {
-                    // invalid surrogate
                    return perr!(self, InvalidSurrogateUnicodeCodePoint);
                }


In parse_escaped_utf8, the non-lossy invalid-surrogate branches return an error after next_n(6) has already advanced the reader by 6 bytes. This shifts the error position (and for PaddedSliceRead can push index past origin_input().len(), causing Parser::error() to rewrite the error into EofWhileParsing). Consider avoiding consumption on the probe: use peek_n(6) + eat(6) only when the \\uXXXX low-surrogate is confirmed valid, or explicitly backtrack before perr! so the reader index stays at the correct position on error.

Copilot · 2026-03-25T01:32:57Z

+    pub fn skip_one(&mut self, checked: bool) -> Result<(&'de [u8], ParseStatus)> {
        let ch = match self.skip_space() {
            Some(ch) => ch,
            None => return perr!(self, EofWhileParsing),


Parser::skip_one is a public method and this change removes the previous skip_one() / skip_one_checked(...) APIs in favor of skip_one(checked: bool), which is a breaking change for downstream users of sonic_rs::parser::Parser. If this isn’t intended to be a breaking public API change, consider keeping the old methods as deprecated wrappers (or making skip_one crate-private) and documenting the new entrypoint in the public API docs / release notes.

Copilot · 2026-03-25T01:32:57Z

@@ -865,7 +865,7 @@ impl<'de, 'a, R: Reader<'de>> de::Deserializer<'de> for &'a mut Deserializer<R>
        V: de::Visitor<'de>,
    {
        // NOTE: we use faster skip, and will not validate the skipped parts.


The comment above deserialize_ignored_any says the skipped parts are not validated, but the implementation calls self.parser.skip_one(true) which takes the checked/validating path. Either update the comment to match the actual behavior, or (if intentionally skipping validation here) pass checked = false and ensure that path is still safe/correct for Serde deserialization of untrusted input.

Suggested change

// NOTE: we use faster skip, and will not validate the skipped parts.

// NOTE: we use a fast skipping routine that still validates the JSON structure,

// but does not allocate or materialize the skipped value.

The comment said "will not validate the skipped parts" but skip_one(true) is the validated path. This mismatch predates this PR (the old skip_one() also called skip_one_checked(true)). Update comment to match behavior.

Copilot AI review requested due to automatic review settings March 25, 2026 01:27

Copilot started reviewing on behalf of liuq19 March 25, 2026 01:28 View session

Copilot AI reviewed Mar 25, 2026

View reviewed changes

fix: correct misleading comment in deserialize_ignored_any

1439d4c

The comment said "will not validate the skipped parts" but skip_one(true) is the validated path. This mismatch predates this PR (the old skip_one() also called skip_one_checked(true)). Update comment to match behavior.

liuq19 merged commit 7424308 into main Mar 25, 2026
24 checks passed

liuq19 deleted the fix/utf8-lossy-surrogate-backtrack branch March 25, 2026 01:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: merge skip_one API and fix utf8_lossy surrogate backtrack#220

fix: merge skip_one API and fix utf8_lossy surrogate backtrack#220
liuq19 merged 2 commits intomainfrom
fix/utf8-lossy-surrogate-backtrack

liuq19 commented Mar 25, 2026

Uh oh!

codecov Bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

liuq19 commented Mar 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Copilot AI Mar 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

	// NOTE: we use faster skip, and will not validate the skipped parts.
	// NOTE: we use a fast skipping routine that still validates the JSON structure,
	// but does not allocate or materialize the skipped value.

Conversation

liuq19 commented Mar 25, 2026

Summary

Test plan

Uh oh!

codecov Bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

liuq19 commented Mar 25, 2026

Code review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Milestone

Development

Uh oh!

2 participants

codecov Bot commented Mar 25, 2026 •

edited

Loading