Skip to content

fix: merge skip_one API and fix utf8_lossy surrogate backtrack#220

Merged
liuq19 merged 2 commits intomainfrom
fix/utf8-lossy-surrogate-backtrack
Mar 25, 2026
Merged

fix: merge skip_one API and fix utf8_lossy surrogate backtrack#220
liuq19 merged 2 commits intomainfrom
fix/utf8-lossy-surrogate-backtrack

Conversation

@liuq19
Copy link
Copy Markdown
Collaborator

@liuq19 liuq19 commented Mar 25, 2026

Summary

  • Merge skip_one() and skip_one_checked(checked) into a single skip_one(checked: bool) to reduce duplication
  • Add Parser::utf8_lossy() builder method mirroring Deserializer::utf8_lossy()
  • Fix surrogate backtrack in parse_escaped_utf8: when a high surrogate is followed by a non-\uXXXX sequence or an invalid low surrogate, backtrack the reader 6 bytes (only in utf8_lossy mode) so consumed bytes can be re-parsed correctly
    • e.g. \uDA51\uD83D\uDE04 now correctly produces �😄 instead of ���
  • Non-lossy error path no longer changes reader index, preserving correct error positions

Test plan

  • New test test_utf8_lossy_surrogate_backtrack covers:
    • High surrogate + non-\u content (\uD800abc�abc)
    • High + high + low (\uDA51\uD83D\uDE04�😄)
    • High + non-surrogate \uXXXX (\uD800\u0041�A)
    • Multiple consecutive lone surrogates (\uD800\uD801\uD802abc���abc)
    • Non-lossy mode still errors on invalid surrogates
  • All 146 existing tests pass

- Merge `skip_one()` and `skip_one_checked(checked)` into a single
  `skip_one(checked: bool)` to reduce code duplication.
- Add `Parser::utf8_lossy()` builder method mirroring
  `Deserializer::utf8_lossy()`.
- Fix surrogate backtrack in `parse_escaped_utf8`: when a high surrogate
  is followed by a non-\uXXXX sequence or an invalid low surrogate,
  backtrack the reader 6 bytes so the consumed bytes can be re-parsed.
  This only applies in utf8_lossy mode; the non-lossy error path no
  longer changes the reader index, preserving correct error positions.
- Add `test_utf8_lossy_surrogate_backtrack` covering: high+non-\u,
  high+high+low, high+non-surrogate \u, consecutive lone surrogates,
  and non-lossy error.
Copilot AI review requested due to automatic review settings March 25, 2026 01:27
@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 81.25000% with 9 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.60%. Comparing base (f5aefd7) to head (1439d4c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/parser.rs 64.00% 9 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #220      +/-   ##
==========================================
+ Coverage   70.42%   70.60%   +0.18%     
==========================================
  Files          42       42              
  Lines        9524     9550      +26     
==========================================
+ Hits         6707     6743      +36     
+ Misses       2817     2807      -10     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@liuq19
Copy link
Copy Markdown
Collaborator Author

liuq19 commented Mar 25, 2026

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

Notes (below threshold, informational only):

  • de.rs:867 comment "we use faster skip, and will not validate the skipped parts" is inconsistent with skip_one(true) which IS the validated path. This is a pre-existing mismatch (the old skip_one() also called skip_one_checked(true)), not introduced by this PR.

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the parser “skip one value” API, adds a Parser::utf8_lossy() builder for parity with Deserializer::utf8_lossy(), and adjusts UTF-16 surrogate handling in parse_escaped_utf8 to backtrack correctly in lossy mode so bytes can be re-parsed into the right output.

Changes:

  • Merge skip_one() / skip_one_checked(...) into skip_one(checked: bool) and update call sites.
  • Add Parser::utf8_lossy() builder method.
  • Fix lossy-mode surrogate handling by backtracking 6 bytes on invalid/non-\\uXXXX low-surrogate sequences; add regression test coverage.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/serde/mod.rs Adds a regression test covering lossy surrogate backtracking and non-lossy error behavior.
src/serde/de.rs Updates deserializer call sites to the new skip_one(checked) API.
src/parser.rs Introduces Parser::utf8_lossy(), merges skip APIs, and implements lossy surrogate backtracking in parse_escaped_utf8.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/parser.rs
Comment on lines 716 to 746
@@ -727,9 +736,12 @@ where
let low_bit = point2.wrapping_sub(0xdc00);
if (low_bit >> 10) != 0 {
if self.cfg.utf8_lossy {
// point2 is not a valid low surrogate. Backtrack 6 bytes
// so it can be re-parsed (e.g. \uDA51\uD83D\uDE04 → FFFD + 😄).
let idx = self.read.index();
self.read.set_index(idx - 6);
return Ok(0xFFFD);
} else {
// invalid surrogate
return perr!(self, InvalidSurrogateUnicodeCodePoint);
}
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In parse_escaped_utf8, the non-lossy invalid-surrogate branches return an error after next_n(6) has already advanced the reader by 6 bytes. This shifts the error position (and for PaddedSliceRead can push index past origin_input().len(), causing Parser::error() to rewrite the error into EofWhileParsing). Consider avoiding consumption on the probe: use peek_n(6) + eat(6) only when the \\uXXXX low-surrogate is confirmed valid, or explicitly backtrack before perr! so the reader index stays at the correct position on error.

Copilot uses AI. Check for mistakes.
Comment thread src/parser.rs
Comment on lines +1439 to 1442
pub fn skip_one(&mut self, checked: bool) -> Result<(&'de [u8], ParseStatus)> {
let ch = match self.skip_space() {
Some(ch) => ch,
None => return perr!(self, EofWhileParsing),
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parser::skip_one is a public method and this change removes the previous skip_one() / skip_one_checked(...) APIs in favor of skip_one(checked: bool), which is a breaking change for downstream users of sonic_rs::parser::Parser. If this isn’t intended to be a breaking public API change, consider keeping the old methods as deprecated wrappers (or making skip_one crate-private) and documenting the new entrypoint in the public API docs / release notes.

Copilot uses AI. Check for mistakes.
Comment thread src/serde/de.rs Outdated
@@ -865,7 +865,7 @@ impl<'de, 'a, R: Reader<'de>> de::Deserializer<'de> for &'a mut Deserializer<R>
V: de::Visitor<'de>,
{
// NOTE: we use faster skip, and will not validate the skipped parts.
Copy link

Copilot AI Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment above deserialize_ignored_any says the skipped parts are not validated, but the implementation calls self.parser.skip_one(true) which takes the checked/validating path. Either update the comment to match the actual behavior, or (if intentionally skipping validation here) pass checked = false and ensure that path is still safe/correct for Serde deserialization of untrusted input.

Suggested change
// NOTE: we use faster skip, and will not validate the skipped parts.
// NOTE: we use a fast skipping routine that still validates the JSON structure,
// but does not allocate or materialize the skipped value.

Copilot uses AI. Check for mistakes.
The comment said "will not validate the skipped parts" but skip_one(true)
is the validated path. This mismatch predates this PR (the old skip_one()
also called skip_one_checked(true)). Update comment to match behavior.
@liuq19 liuq19 merged commit 7424308 into main Mar 25, 2026
24 checks passed
@liuq19 liuq19 deleted the fix/utf8-lossy-surrogate-backtrack branch March 25, 2026 01:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants