fix: merge skip_one API and fix utf8_lossy surrogate backtrack#220
fix: merge skip_one API and fix utf8_lossy surrogate backtrack#220
Conversation
- Merge `skip_one()` and `skip_one_checked(checked)` into a single `skip_one(checked: bool)` to reduce code duplication. - Add `Parser::utf8_lossy()` builder method mirroring `Deserializer::utf8_lossy()`. - Fix surrogate backtrack in `parse_escaped_utf8`: when a high surrogate is followed by a non-\uXXXX sequence or an invalid low surrogate, backtrack the reader 6 bytes so the consumed bytes can be re-parsed. This only applies in utf8_lossy mode; the non-lossy error path no longer changes the reader index, preserving correct error positions. - Add `test_utf8_lossy_surrogate_backtrack` covering: high+non-\u, high+high+low, high+non-surrogate \u, consecutive lone surrogates, and non-lossy error.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #220 +/- ##
==========================================
+ Coverage 70.42% 70.60% +0.18%
==========================================
Files 42 42
Lines 9524 9550 +26
==========================================
+ Hits 6707 6743 +36
+ Misses 2817 2807 -10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. Notes (below threshold, informational only):
🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
There was a problem hiding this comment.
Pull request overview
This PR refactors the parser “skip one value” API, adds a Parser::utf8_lossy() builder for parity with Deserializer::utf8_lossy(), and adjusts UTF-16 surrogate handling in parse_escaped_utf8 to backtrack correctly in lossy mode so bytes can be re-parsed into the right output.
Changes:
- Merge
skip_one()/skip_one_checked(...)intoskip_one(checked: bool)and update call sites. - Add
Parser::utf8_lossy()builder method. - Fix lossy-mode surrogate handling by backtracking 6 bytes on invalid/non-
\\uXXXXlow-surrogate sequences; add regression test coverage.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/serde/mod.rs | Adds a regression test covering lossy surrogate backtracking and non-lossy error behavior. |
| src/serde/de.rs | Updates deserializer call sites to the new skip_one(checked) API. |
| src/parser.rs | Introduces Parser::utf8_lossy(), merges skip APIs, and implements lossy surrogate backtracking in parse_escaped_utf8. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @@ -727,9 +736,12 @@ where | |||
| let low_bit = point2.wrapping_sub(0xdc00); | |||
| if (low_bit >> 10) != 0 { | |||
| if self.cfg.utf8_lossy { | |||
| // point2 is not a valid low surrogate. Backtrack 6 bytes | |||
| // so it can be re-parsed (e.g. \uDA51\uD83D\uDE04 → FFFD + 😄). | |||
| let idx = self.read.index(); | |||
| self.read.set_index(idx - 6); | |||
| return Ok(0xFFFD); | |||
| } else { | |||
| // invalid surrogate | |||
| return perr!(self, InvalidSurrogateUnicodeCodePoint); | |||
| } | |||
There was a problem hiding this comment.
In parse_escaped_utf8, the non-lossy invalid-surrogate branches return an error after next_n(6) has already advanced the reader by 6 bytes. This shifts the error position (and for PaddedSliceRead can push index past origin_input().len(), causing Parser::error() to rewrite the error into EofWhileParsing). Consider avoiding consumption on the probe: use peek_n(6) + eat(6) only when the \\uXXXX low-surrogate is confirmed valid, or explicitly backtrack before perr! so the reader index stays at the correct position on error.
| pub fn skip_one(&mut self, checked: bool) -> Result<(&'de [u8], ParseStatus)> { | ||
| let ch = match self.skip_space() { | ||
| Some(ch) => ch, | ||
| None => return perr!(self, EofWhileParsing), |
There was a problem hiding this comment.
Parser::skip_one is a public method and this change removes the previous skip_one() / skip_one_checked(...) APIs in favor of skip_one(checked: bool), which is a breaking change for downstream users of sonic_rs::parser::Parser. If this isn’t intended to be a breaking public API change, consider keeping the old methods as deprecated wrappers (or making skip_one crate-private) and documenting the new entrypoint in the public API docs / release notes.
| @@ -865,7 +865,7 @@ impl<'de, 'a, R: Reader<'de>> de::Deserializer<'de> for &'a mut Deserializer<R> | |||
| V: de::Visitor<'de>, | |||
| { | |||
| // NOTE: we use faster skip, and will not validate the skipped parts. | |||
There was a problem hiding this comment.
The comment above deserialize_ignored_any says the skipped parts are not validated, but the implementation calls self.parser.skip_one(true) which takes the checked/validating path. Either update the comment to match the actual behavior, or (if intentionally skipping validation here) pass checked = false and ensure that path is still safe/correct for Serde deserialization of untrusted input.
| // NOTE: we use faster skip, and will not validate the skipped parts. | |
| // NOTE: we use a fast skipping routine that still validates the JSON structure, | |
| // but does not allocate or materialize the skipped value. |
The comment said "will not validate the skipped parts" but skip_one(true) is the validated path. This mismatch predates this PR (the old skip_one() also called skip_one_checked(true)). Update comment to match behavior.
Summary
skip_one()andskip_one_checked(checked)into a singleskip_one(checked: bool)to reduce duplicationParser::utf8_lossy()builder method mirroringDeserializer::utf8_lossy()parse_escaped_utf8: when a high surrogate is followed by a non-\uXXXXsequence or an invalid low surrogate, backtrack the reader 6 bytes (only inutf8_lossymode) so consumed bytes can be re-parsed correctly\uDA51\uD83D\uDE04now correctly produces�😄instead of���Test plan
test_utf8_lossy_surrogate_backtrackcovers:\ucontent (\uD800abc→�abc)\uDA51\uD83D\uDE04→�😄)\uXXXX(\uD800\u0041→�A)\uD800\uD801\uD802abc→���abc)