Skip to content

markers: replace regexp with scanning in StripMarkers and EscapeMarkers #37

Merged
dhartunian merged 2 commits intomasterfrom
davidh/strip-markers-perf
Mar 13, 2026
Merged

markers: replace regexp with scanning in StripMarkers and EscapeMarkers #37
dhartunian merged 2 commits intomasterfrom
davidh/strip-markers-perf

Conversation

@dhartunian
Copy link
Contributor

@dhartunian dhartunian commented Mar 13, 2026

Replace the regexp-based implementation of StripMarkers() and
EscapeMarkers() with manual byte scanning using a shared
stripMarkersBytes() function.

The new implementation exploits the fact that all three marker
characters (‹ U+2039, › U+203A, † U+2020) are 3-byte UTF-8 sequences
sharing the prefix 0xE2 0x80. It scans for 0xE2 bytes and checks the
subsequent two bytes to identify markers, then either removes them or
replaces them with the escape character.

Also removes the now-unused ReStripMarkers regexp and its regexp import.

Benchmark results (Apple M1 Pro):

name                           old time/op    new time/op    delta
StripMarkers_String-10           738ns ± 2%     105ns ± 1%  -85.83%  (p=0.000 n=8)
StripMarkers_Bytes-10            733ns ± 4%      58ns ± 1%  -92.10%  (p=0.000 n=8)
StripMarkers_NoMarkers-10        660ns ± 1%      23ns ± 0%  -96.53%  (p=0.000 n=8)
EscapeMarkers-10                 377ns ± 2%      45ns ±18%  -87.99%  (p=0.000 n=8)

name                           old alloc/op   new alloc/op   delta
StripMarkers_String-10           185B ± 1%     144B ± 0%  -22.16%  (p=0.000 n=8)
StripMarkers_Bytes-10            136B ± 0%      48B ± 0%  -64.71%  (p=0.000 n=8)
StripMarkers_NoMarkers-10        144B ± 0%       0B ± 0%  -100.00% (p=0.000 n=8)
EscapeMarkers-10                40.0B ± 0%    24.0B ± 0%  -40.00%  (p=0.000 n=8)

name                           old allocs/op  new allocs/op  delta
StripMarkers_String-10           6.00 ± 0%     3.00 ± 0%  -50.00%  (p=0.000 n=8)
StripMarkers_Bytes-10            5.00 ± 0%     1.00 ± 0%  -80.00%  (p=0.000 n=8)
StripMarkers_NoMarkers-10        3.00 ± 0%     0.00 ± 0%  -100.00% (p=0.000 n=8)
EscapeMarkers-10                 3.00 ± 0%     1.00 ± 0%  -66.67%  (p=0.000 n=8)

Co-Authored-By: roachdev-claude roachdev-claude-bot@cockroachlabs.com


This change is Reviewable

Add exhaustive table-driven tests for StripMarkers (both string and byte
variants) and EscapeMarkers covering edge cases including empty inputs,
unicode content, consecutive markers, hash prefix markers, and boundary
conditions. Add benchmarks to establish a performance baseline before
optimizing these functions.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@visheshbardia
Copy link
Contributor

The performance difference is great.
LGTM! just a small nit.
Thanks.

Replace the regexp-based implementation of StripMarkers() and
EscapeMarkers() with manual byte scanning using a shared
stripMarkersBytes() function.

The new implementation exploits the fact that all three marker
characters (‹ U+2039, › U+203A, † U+2020) are 3-byte UTF-8 sequences
sharing the prefix 0xE2 0x80. It scans for 0xE2 bytes and checks the
subsequent two bytes to identify markers, then either removes them or
replaces them with the escape character.

Also removes the now-unused ReStripMarkers regexp and its regexp import.

Benchmark results (Apple M1 Pro):

name                           old time/op    new time/op    delta
StripMarkers_String-10           738ns ± 2%     105ns ± 1%  -85.83%  (p=0.000 n=8)
StripMarkers_Bytes-10            733ns ± 4%      58ns ± 1%  -92.10%  (p=0.000 n=8)
StripMarkers_NoMarkers-10        660ns ± 1%      23ns ± 0%  -96.53%  (p=0.000 n=8)
EscapeMarkers-10                 377ns ± 2%      45ns ±18%  -87.99%  (p=0.000 n=8)

name                           old alloc/op   new alloc/op   delta
StripMarkers_String-10           185B ± 1%     144B ± 0%  -22.16%  (p=0.000 n=8)
StripMarkers_Bytes-10            136B ± 0%      48B ± 0%  -64.71%  (p=0.000 n=8)
StripMarkers_NoMarkers-10        144B ± 0%       0B ± 0%  -100.00% (p=0.000 n=8)
EscapeMarkers-10                40.0B ± 0%    24.0B ± 0%  -40.00%  (p=0.000 n=8)

name                           old allocs/op  new allocs/op  delta
StripMarkers_String-10           6.00 ± 0%     3.00 ± 0%  -50.00%  (p=0.000 n=8)
StripMarkers_Bytes-10            5.00 ± 0%     1.00 ± 0%  -80.00%  (p=0.000 n=8)
StripMarkers_NoMarkers-10        3.00 ± 0%     0.00 ± 0%  -100.00% (p=0.000 n=8)
EscapeMarkers-10                 3.00 ± 0%     1.00 ± 0%  -66.67%  (p=0.000 n=8)

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@dhartunian dhartunian force-pushed the davidh/strip-markers-perf branch from cfc9aed to ccab926 Compare March 13, 2026 16:45
Copy link
Contributor

@visheshbardia visheshbardia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

@visheshbardia made 1 comment.
Reviewable status: 0 of 3 files reviewed, 1 unresolved discussion.

@dhartunian dhartunian merged commit d01e850 into master Mar 13, 2026
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants