Skip to content

fix(regexp): align lastIndex with code point boundary under /u and /v#5303

Open
mrhapile wants to merge 3 commits intoboa-dev:mainfrom
mrhapile:fix/regexp-unicode-lastindex-boundary
Open

fix(regexp): align lastIndex with code point boundary under /u and /v#5303
mrhapile wants to merge 3 commits intoboa-dev:mainfrom
mrhapile:fix/regexp-unicode-lastindex-boundary

Conversation

@mrhapile
Copy link
Copy Markdown
Contributor

@mrhapile mrhapile commented Apr 5, 2026

Summary

Fix incorrect handling of lastIndex when the /u or /v flag is enabled and the index points to the trailing half of a surrogate pair.

Previously, Boa would begin matching from a UTF-16 code unit index that does not correspond to a valid Unicode code point boundary, leading to incorrect partial matches.


Details

When /u or /v is enabled, RegExp matching operates on Unicode code points (§22.2.2 Pattern Semantics), while lastIndex is defined in terms of UTF-16 code units.

If lastIndex points to the trailing half of a surrogate pair, it is not a valid code point boundary. This patch adjusts the matcher start position to the preceding lead surrogate so that matching begins at a valid boundary.

Importantly, this adjustment is applied only to the matcher input (start_index), leaving the original lastIndex unchanged. This ensures that:

  • global (g) and sticky (y) semantics remain correct
  • AdvanceStringIndex behavior is unaffected
  • no regressions are introduced in iteration or index updates

Example

Before

let re = /./gu;
re.lastIndex = 1;
re.exec("💩")[0];
// → "\uDCA9"

After

"💩"

Spec Reference
https://tc39.es/ecma262/#sec-pattern-semantics


mrhapile added 3 commits April 5, 2026 17:19
Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
Signed-off-by: mrhapile <allinonegaming3456@gmail.com>
@mrhapile mrhapile requested a review from a team as a code owner April 5, 2026 11:58
Copilot AI review requested due to automatic review settings April 5, 2026 11:58
@github-actions github-actions bot added Waiting On Review Waiting on reviews from the maintainers C-Tests Issues and PRs related to the tests. C-Builtins PRs and Issues related to builtins/intrinsics and removed Waiting On Review Waiting on reviews from the maintainers labels Apr 5, 2026
@github-actions github-actions bot added this to the v1.0.0 milestone Apr 5, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes RegExpBuiltinExec start-position handling for Unicode (/u) and Unicode sets (/v) so matching does not begin from the trailing half of a UTF-16 surrogate pair (i.e., not a valid Unicode code point boundary), and adds regression tests around exec coercion order and surrogate-boundary behavior.

Changes:

  • Adjust matcher start index under /u or /v when lastIndex points at a trailing surrogate within a valid surrogate pair.
  • Update matcher invocations to use the adjusted start index (without mutating the observable lastIndex value up-front).
  • Add tests for exec coercion order and surrogate-boundary matching behavior under /u.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
core/engine/src/builtins/regexp/mod.rs Adjusts the internal matcher start position under /u and /v to avoid starting inside surrogate pairs.
core/engine/src/builtins/regexp/tests.rs Adds regression tests for exec coercion order and Unicode surrogate-boundary behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +1149 to +1151
&& let Some(cu) = input.code_unit_at(start_index as usize)
&& (0xDC00..=0xDFFF).contains(&cu)
&& let Some(prev_cu) = input.code_unit_at(start_index as usize - 1)
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

start_index is derived from last_index (u64) and cast to usize in code_unit_at before validating last_index <= length. On 32-bit targets (or with very large lastIndex values), the as usize cast can truncate and accidentally make the index appear in-bounds, leading to incorrect surrogate-boundary adjustment (and potentially incorrect match results). Consider moving the surrogate-boundary adjustment block to after the if last_index > length { ... return } early-exit, and/or using a checked conversion (usize::try_from) before calling code_unit_at.

Suggested change
&& let Some(cu) = input.code_unit_at(start_index as usize)
&& (0xDC00..=0xDFFF).contains(&cu)
&& let Some(prev_cu) = input.code_unit_at(start_index as usize - 1)
&& start_index <= length
&& let Ok(start_index_usize) = usize::try_from(start_index)
&& let Some(cu) = input.code_unit_at(start_index_usize)
&& (0xDC00..=0xDFFF).contains(&cu)
&& let Some(prev_cu) = input.code_unit_at(start_index_usize - 1)

Copilot uses AI. Check for mistakes.
Comment on lines 1183 to +1190
// NOTE: We can use the faster ucs2 variant since there will never be two byte unicode.
matcher.find_from_ucs2(&input, last_index as usize).next()
matcher.find_from_ucs2(&input, start_index as usize).next()
}
(true, JsStrVariant::Utf16(input)) => {
matcher.find_from_utf16(input, last_index as usize).next()
matcher.find_from_utf16(input, start_index as usize).next()
}
(false, JsStrVariant::Utf16(input)) => {
matcher.find_from_ucs2(input, last_index as usize).next()
matcher.find_from_ucs2(input, start_index as usize).next()
Copy link

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The matcher is now invoked with start_index, but sticky emulation still checks match_value.start() != last_index. When /u or /v adjusts start_index (e.g. lastIndex points into the trailing surrogate), this will incorrectly reject valid sticky matches because match_value.start() will equal start_index (adjusted) rather than last_index (original). The sticky check should compare against the actual matcher start position (or a computed “effective lastIndex” used to start matching) to avoid a behavior regression for /yu and /yv.

Copilot uses AI. Check for mistakes.
@github-actions
Copy link
Copy Markdown

github-actions bot commented Apr 5, 2026

Test262 conformance changes

Test result main count PR count difference
Total 53,125 53,125 0
Passed 50,827 50,827 0
Ignored 1,482 1,482 0
Failed 816 816 0
Panics 0 0 0
Conformance 95.67% 95.67% 0.00%

Tested main commit: f6021b3d3dd1249c9569be77f0ca9ccf00ea318a
Tested PR commit: da29940ba354f9f771305484ac14a14f8e111a42
Compare commits: f6021b3...da29940

@codecov
Copy link
Copy Markdown

codecov bot commented Apr 5, 2026

Codecov Report

❌ Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 59.76%. Comparing base (6ddc2b4) to head (da29940).
⚠️ Report is 929 commits behind head on main.

Files with missing lines Patch % Lines
core/engine/src/builtins/regexp/mod.rs 90.90% 1 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #5303       +/-   ##
===========================================
+ Coverage   47.24%   59.76%   +12.52%     
===========================================
  Files         476      589      +113     
  Lines       46892    63495    +16603     
===========================================
+ Hits        22154    37950    +15796     
- Misses      24738    25545      +807     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

C-Builtins PRs and Issues related to builtins/intrinsics C-Tests Issues and PRs related to the tests.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants