fix(regexp): align lastIndex with code point boundary under /u and /v by mrhapile · Pull Request #5303 · boa-dev/boa

mrhapile · 2026-04-05T11:58:38Z

Summary

Fix incorrect handling of lastIndex when the /u or /v flag is enabled and the index points to the trailing half of a surrogate pair.

Previously, Boa would begin matching from a UTF-16 code unit index that does not correspond to a valid Unicode code point boundary, leading to incorrect partial matches.

Details

When /u or /v is enabled, RegExp matching operates on Unicode code points (§22.2.2 Pattern Semantics), while lastIndex is defined in terms of UTF-16 code units.

If lastIndex points to the trailing half of a surrogate pair, it is not a valid code point boundary. This patch adjusts the matcher start position to the preceding lead surrogate so that matching begins at a valid boundary.

Importantly, this adjustment is applied only to the matcher input (start_index), leaving the original lastIndex unchanged. This ensures that:

global (g) and sticky (y) semantics remain correct
AdvanceStringIndex behavior is unaffected
no regressions are introduced in iteration or index updates

Example

Before

let re = /./gu;
re.lastIndex = 1;
re.exec("💩")[0];
// → "\uDCA9"

After

"💩"

Spec Reference
https://tc39.es/ecma262/#sec-pattern-semantics

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>

Copilot

Pull request overview

This PR fixes RegExpBuiltinExec start-position handling for Unicode (/u) and Unicode sets (/v) so matching does not begin from the trailing half of a UTF-16 surrogate pair (i.e., not a valid Unicode code point boundary), and adds regression tests around exec coercion order and surrogate-boundary behavior.

Changes:

Adjust matcher start index under /u or /v when lastIndex points at a trailing surrogate within a valid surrogate pair.
Update matcher invocations to use the adjusted start index (without mutating the observable lastIndex value up-front).
Add tests for exec coercion order and surrogate-boundary matching behavior under /u.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File	Description
core/engine/src/builtins/regexp/mod.rs	Adjusts the internal matcher start position under `/u` and `/v` to avoid starting inside surrogate pairs.
core/engine/src/builtins/regexp/tests.rs	Adds regression tests for `exec` coercion order and Unicode surrogate-boundary behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-05T12:02:59Z

core/engine/src/builtins/regexp/mod.rs

+            && let Some(cu) = input.code_unit_at(start_index as usize)
+            && (0xDC00..=0xDFFF).contains(&cu)
+            && let Some(prev_cu) = input.code_unit_at(start_index as usize - 1)


start_index is derived from last_index (u64) and cast to usize in code_unit_at before validating last_index <= length. On 32-bit targets (or with very large lastIndex values), the as usize cast can truncate and accidentally make the index appear in-bounds, leading to incorrect surrogate-boundary adjustment (and potentially incorrect match results). Consider moving the surrogate-boundary adjustment block to after the if last_index > length { ... return } early-exit, and/or using a checked conversion (usize::try_from) before calling code_unit_at.

Suggested change

&& let Some(cu) = input.code_unit_at(start_index as usize)

&& (0xDC00..=0xDFFF).contains(&cu)

&& let Some(prev_cu) = input.code_unit_at(start_index as usize - 1)

&& start_index <= length

&& let Ok(start_index_usize) = usize::try_from(start_index)

&& let Some(cu) = input.code_unit_at(start_index_usize)

&& (0xDC00..=0xDFFF).contains(&cu)

&& let Some(prev_cu) = input.code_unit_at(start_index_usize - 1)

Copilot · 2026-04-05T12:03:00Z

core/engine/src/builtins/regexp/mod.rs

                // NOTE: We can use the faster ucs2 variant since there will never be two byte unicode.
-                matcher.find_from_ucs2(&input, last_index as usize).next()
+                matcher.find_from_ucs2(&input, start_index as usize).next()
            }
            (true, JsStrVariant::Utf16(input)) => {
-                matcher.find_from_utf16(input, last_index as usize).next()
+                matcher.find_from_utf16(input, start_index as usize).next()
            }
            (false, JsStrVariant::Utf16(input)) => {
-                matcher.find_from_ucs2(input, last_index as usize).next()
+                matcher.find_from_ucs2(input, start_index as usize).next()


The matcher is now invoked with start_index, but sticky emulation still checks match_value.start() != last_index. When /u or /v adjusts start_index (e.g. lastIndex points into the trailing surrogate), this will incorrectly reject valid sticky matches because match_value.start() will equal start_index (adjusted) rather than last_index (original). The sticky check should compare against the actual matcher start position (or a computed “effective lastIndex” used to start matching) to avoid a behavior regression for /yu and /yv.

github-actions · 2026-04-05T12:07:48Z

Test262 conformance changes

Test result	main count	PR count	difference
Total	53,125	53,125	0
Passed	50,827	50,827	0
Ignored	1,482	1,482	0
Failed	816	816	0
Panics	0	0	0
Conformance	95.67%	95.67%	0.00%

Tested main commit: f6021b3d3dd1249c9569be77f0ca9ccf00ea318a
Tested PR commit: da29940ba354f9f771305484ac14a14f8e111a42
Compare commits: f6021b3...da29940

codecov · 2026-04-05T12:19:28Z

Codecov Report

❌ Patch coverage is 90.90909% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 59.76%. Comparing base (6ddc2b4) to head (da29940).
⚠️ Report is 929 commits behind head on main.

Files with missing lines	Patch %	Lines
core/engine/src/builtins/regexp/mod.rs	90.90%	1 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##             main    #5303       +/-   ##
===========================================
+ Coverage   47.24%   59.76%   +12.52%     
===========================================
  Files         476      589      +113     
  Lines       46892    63495    +16603     
===========================================
+ Hits        22154    37950    +15796     
- Misses      24738    25545      +807

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

mrhapile added 3 commits April 5, 2026 17:19

fix(regexp): align lastIndex with code point boundary under /u

e7f5610

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>

style: apply rustfmt

bc1bb77

Signed-off-by: mrhapile <allinonegaming3456@gmail.com>

fix: resolve clippy collapsible-if lint

da29940

mrhapile requested a review from a team as a code owner April 5, 2026 11:58

Copilot AI review requested due to automatic review settings April 5, 2026 11:58

github-actions bot added Waiting On Review Waiting on reviews from the maintainers C-Tests Issues and PRs related to the tests. C-Builtins PRs and Issues related to builtins/intrinsics and removed Waiting On Review Waiting on reviews from the maintainers labels Apr 5, 2026

github-actions bot added this to the v1.0.0 milestone Apr 5, 2026

Copilot started reviewing on behalf of mrhapile April 5, 2026 11:59 View session

Copilot AI reviewed Apr 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(regexp): align lastIndex with code point boundary under /u and /v#5303

fix(regexp): align lastIndex with code point boundary under /u and /v#5303
mrhapile wants to merge 3 commits intoboa-dev:mainfrom
mrhapile:fix/regexp-unicode-lastindex-boundary

mrhapile commented Apr 5, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 5, 2026

Uh oh!

Copilot AI Apr 5, 2026

Uh oh!

github-actions bot commented Apr 5, 2026

Uh oh!

codecov bot commented Apr 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mrhapile commented Apr 5, 2026

Summary

Details

Example

Before

After

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 5, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Apr 5, 2026

Test262 conformance changes

Uh oh!

codecov bot commented Apr 5, 2026

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants