fix(regexp): align lastIndex with code point boundary under /u and /v #5303
mrhapile wants to merge 3 commits into boa-dev:main
Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
Pull request overview
This PR fixes RegExpBuiltinExec start-position handling for Unicode (/u) and Unicode sets (/v) so matching does not begin from the trailing half of a UTF-16 surrogate pair (i.e., not a valid Unicode code point boundary), and adds regression tests around exec coercion order and surrogate-boundary behavior.
Changes:
- Adjust matcher start index under /u or /v when lastIndex points at a trailing surrogate within a valid surrogate pair.
- Update matcher invocations to use the adjusted start index (without mutating the observable lastIndex value up-front).
- Add tests for exec coercion order and surrogate-boundary matching behavior under /u.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| core/engine/src/builtins/regexp/mod.rs | Adjusts the internal matcher start position under /u and /v to avoid starting inside surrogate pairs. |
| core/engine/src/builtins/regexp/tests.rs | Adds regression tests for exec coercion order and Unicode surrogate-boundary behavior. |
    && let Some(cu) = input.code_unit_at(start_index as usize)
    && (0xDC00..=0xDFFF).contains(&cu)
    && let Some(prev_cu) = input.code_unit_at(start_index as usize - 1)
start_index is derived from last_index (u64) and cast to usize in code_unit_at before validating last_index <= length. On 32-bit targets (or with very large lastIndex values), the as usize cast can truncate and accidentally make the index appear in-bounds, leading to incorrect surrogate-boundary adjustment (and potentially incorrect match results). Consider moving the surrogate-boundary adjustment block to after the if last_index > length { ... return } early-exit, and/or using a checked conversion (usize::try_from) before calling code_unit_at.
-   && let Some(cu) = input.code_unit_at(start_index as usize)
-   && (0xDC00..=0xDFFF).contains(&cu)
-   && let Some(prev_cu) = input.code_unit_at(start_index as usize - 1)
+   && start_index <= length
+   && let Ok(start_index_usize) = usize::try_from(start_index)
+   && let Some(cu) = input.code_unit_at(start_index_usize)
+   && (0xDC00..=0xDFFF).contains(&cu)
+   && let Some(prev_cu) = input.code_unit_at(start_index_usize - 1)
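The truncation concern raised in this comment can be illustrated with a small standalone sketch. It is not Boa's code: `checked_index` is a hypothetical helper, and `u32` stands in for a 32-bit `usize` so the effect is visible on any host.

```rust
/// Hypothetical helper (not part of Boa): convert a `lastIndex`-style u64
/// into a usable index, rejecting values that do not fit in `usize` or that
/// lie past the end of the input.
fn checked_index(last_index: u64, length: usize) -> Option<usize> {
    // `try_from` fails instead of silently dropping high bits.
    let index = usize::try_from(last_index).ok()?;
    (index <= length).then_some(index)
}

fn main() {
    // u32 stands in for a 32-bit usize so the truncation shows on any host.
    let huge: u64 = (1u64 << 32) + 1;
    let truncated = huge as u32; // `as` keeps only the low 32 bits
    assert_eq!(truncated, 1); // looks like a small, in-bounds index
    assert!(u32::try_from(huge).is_err()); // the checked conversion refuses

    assert_eq!(checked_index(2, 4), Some(2));
    assert_eq!(checked_index(4, 4), Some(4)); // index == length is allowed
    assert_eq!(checked_index(5, 4), None);
    assert_eq!(checked_index(huge, 4), None);
}
```

Doing the bounds check before any indexing also means a later `index - 1` is only reached for values already known to be valid.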
    // NOTE: We can use the faster ucs2 variant since there will never be two byte unicode.
-   matcher.find_from_ucs2(&input, last_index as usize).next()
+   matcher.find_from_ucs2(&input, start_index as usize).next()
    }
    (true, JsStrVariant::Utf16(input)) => {
-   matcher.find_from_utf16(input, last_index as usize).next()
+   matcher.find_from_utf16(input, start_index as usize).next()
    }
    (false, JsStrVariant::Utf16(input)) => {
-   matcher.find_from_ucs2(input, last_index as usize).next()
+   matcher.find_from_ucs2(input, start_index as usize).next()
The matcher is now invoked with start_index, but sticky emulation still checks match_value.start() != last_index. When /u or /v adjusts start_index (e.g. lastIndex points into the trailing surrogate), this will incorrectly reject valid sticky matches because match_value.start() will equal start_index (adjusted) rather than last_index (original). The sticky check should compare against the actual matcher start position (or a computed “effective lastIndex” used to start matching) to avoid a behavior regression for /yu and /yv.
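A minimal sketch of the mismatch described in this comment, using plain integers rather than Boa's matcher types. The function name `sticky_accepts` and the variable names are assumptions made for illustration; whether the PR should adopt this comparison is for the author and reviewers to decide.

```rust
/// Hypothetical sticky check: a match is accepted only if it begins exactly
/// at the position the matcher was asked to start from.
fn sticky_accepts(match_start: usize, expected_start: usize) -> bool {
    match_start == expected_start
}

fn main() {
    // Assumed scenario: lastIndex points at the trailing surrogate of a pair
    // that begins at index 0, so the matcher start was moved back to 0.
    let last_index = 1usize; // observable lastIndex, left unchanged
    let effective_start = 0usize; // adjusted start passed to the matcher
    let match_start = 0usize; // where the matcher reports the match

    // Comparing against the original lastIndex rejects the match...
    assert!(!sticky_accepts(match_start, last_index));
    // ...while comparing against the effective start accepts it.
    assert!(sticky_accepts(match_start, effective_start));
}
```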
Test262 conformance changes
Tested main commit: |
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@             Coverage Diff              @@
##             main    #5303       +/-   ##
===========================================
+ Coverage   47.24%   59.76%   +12.52%
===========================================
  Files         476      589      +113
  Lines       46892    63495    +16603
===========================================
+ Hits        22154    37950    +15796
- Misses      24738    25545      +807
Summary
Fix incorrect handling of lastIndex when the /u or /v flag is enabled and the index points to the trailing half of a surrogate pair.
Previously, Boa would begin matching from a UTF-16 code unit index that does not correspond to a valid Unicode code point boundary, leading to incorrect partial matches.
Details
When /u or /v is enabled, RegExp matching operates on Unicode code points (§22.2.2 Pattern Semantics), while lastIndex is defined in terms of UTF-16 code units.
If lastIndex points to the trailing half of a surrogate pair, it is not a valid code point boundary. This patch adjusts the matcher start position to the preceding lead surrogate so that matching begins at a valid boundary.
Importantly, this adjustment is applied only to the matcher input (start_index), leaving the original lastIndex unchanged. This ensures that:
- global (g) and sticky (y) semantics remain correct
- AdvanceStringIndex behavior is unaffected
Example
Before
After
Spec Reference
https://tc39.es/ecma262/#sec-pattern-semantics
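
The start-position adjustment described under Details can be sketched in isolation as follows. This is an illustrative model over a plain `&[u16]` slice, not the code in `core/engine/src/builtins/regexp/mod.rs`; the function name `align_to_code_point` and its parameters are assumptions.

```rust
/// Hypothetical model of the adjustment: if `last_index` points at a trailing
/// surrogate that directly follows a lead surrogate, start matching at the
/// lead surrogate instead. The caller's `last_index` itself is not modified.
fn align_to_code_point(units: &[u16], last_index: usize, full_unicode: bool) -> usize {
    // Only /u and /v match by code point; index 0 and out-of-range indices
    // have no preceding unit to inspect.
    if !full_unicode || last_index == 0 || last_index >= units.len() {
        return last_index;
    }
    let is_trail = (0xDC00..=0xDFFF).contains(&units[last_index]);
    let is_lead = (0xD800..=0xDBFF).contains(&units[last_index - 1]);
    if is_trail && is_lead {
        last_index - 1
    } else {
        last_index
    }
}

fn main() {
    // "a😀b" in UTF-16: 'a', lead surrogate, trail surrogate, 'b'.
    let units: Vec<u16> = "a\u{1F600}b".encode_utf16().collect();
    assert_eq!(units.len(), 4);

    assert_eq!(align_to_code_point(&units, 2, true), 1); // moved back to the lead
    assert_eq!(align_to_code_point(&units, 2, false), 2); // no /u or /v: unchanged
    assert_eq!(align_to_code_point(&units, 1, true), 1); // already on a boundary
    assert_eq!(align_to_code_point(&units, 3, true), 3); // ordinary code unit

    // A lone trailing surrogate is not part of a valid pair, so it stays put.
    let lone = [0x0061u16, 0xDE00];
    assert_eq!(align_to_code_point(&lone, 1, true), 1);
}
```

Returning a separate start position, rather than rewriting the index in place, mirrors the point made above that the observable lastIndex is left unchanged.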