Change .Multiline regex anchors to match common regex implementations #5027

jist99 · 2025-04-10T20:12:40Z

I initially raised issue #5017 thinking this was only a problem with the iterator. After some investigation, I found that in .Multiline mode the ^ and $ anchors were including newline characters in the capture group, which is not the behaviour of any of regex101's engines (see images).

So this PR changes the behaviour of ^ and $ in multiline mode to ensure that newlines are not included in the capture.

To do this I've changed the regex compiler so that Save operations always occur after an opening Multiline_Start_Open and before an Assert_Multiline_End. I've added Assert_Multiline_End because $ needs special behaviour here: it must not advance the string pointer, so that one newline can serve two matches (e.g. ^foo$ on the text "foo\nfoo").
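The target behaviour can be checked against a reference engine. The sketch below uses Python's `re` module with `re.MULTILINE` as a stand-in for the engines on regex101. That choice is an assumption made for illustration and says nothing about how Odin implements the anchors.

```python
import re

# Python's re.MULTILINE stands in for the reference engines here; this is
# not the Odin implementation.
pattern = re.compile(r"^foo$", re.MULTILINE)
text = "foo\nfoo"

# Each span is (start, end) in the input. The newline at index 3 ends the
# first line and starts the second, but belongs to neither match.
spans = [m.span() for m in pattern.finditer(text)]
matched = [m.group(0) for m in pattern.finditer(text)]
print(spans, matched)
```

Both matches are exactly `foo`, at offsets 0 to 3 and 4 to 7, so the single newline is shared by the `$` of the first match and the `^` of the second without being captured by either.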

@Feoramund your review would be appreciated since you wrote this :)
Closes #5017

Feoramund · 2025-04-10T22:48:40Z

I should have some time to review and test this tomorrow.

Feoramund

I've spent a few hours testing this with the Multiline flag on and regex.match_iterator, and I've found some issues.

Lines 72 and 441 of regex/regex.odin: strike comments about not handling .Multiline.
^$ matches a\nb\n\r with two capture groups at 4,4 and 5,5.
^$ matches a\nb\n with one capture group at 4,4.
^$ matches a\nb with no capture group. I would expect all three of these to return no capture, the same behavior as without Multiline.
a$ sends the iterator into an infinite loop on a\nb\n. So does a blank pattern string, but that looks like a problem introduced by Kelimion's original implementation of the regex iterator.
$ itself alone causes an infinite loop.
a matches three times on aaa, but as a followed by two blank captures. This issue exists in the original iterator implementation.
(^a$|^b$) matches a\nb\na\nb as a blank string then three newlines. Very strange.
^a(b|$) matches a\nb\na\nb as two blank strings.
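Iterator loops like the ones reported above often come from a match that does not move the search position forward. Whether that is the cause in the Odin iterator is not established in this thread. The sketch below shows the usual guard, using Python's `re` as a stand-in engine. The helper name `iter_spans` is hypothetical.

```python
import re

def iter_spans(pattern, text):
    """Yield (start, end) for successive matches, always moving forward."""
    pos = 0
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        yield m.span()
        # A zero-width match ends where it starts; step past it so the
        # next search cannot return the same position again.
        pos = m.end() + 1 if m.end() == m.start() else m.end()

dollar = re.compile(r"$", re.MULTILINE)
a_dollar = re.compile(r"a$", re.MULTILINE)

empty_spans = list(iter_spans(dollar, "a\nb\n"))
a_spans = list(iter_spans(a_dollar, "a\nb\n"))
print(empty_spans, a_spans)
```

This guard is deliberately simple: stepping one position past an empty match also skips any non-empty match that starts at the same offset, so a real iterator may want a more careful rule.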

The rewinding of string_pointer is the most questionable thing here. All threads are supposed to execute in lock step per the original design by Ken Thompson. There may be other issues stemming from this, but I will need more time to review.

Feoramund · 2025-04-11T18:18:16Z

core/text/regex/virtual_machine/doc.odin

@@ -171,5 +171,17 @@ For more information, see: https://swtch.com/~rsc/regexp/regexp2.html
 	Be aware, this opcode is not compiled in if the `Multiline` flag is on, as
 	the meaning of `$` changes with that flag.

+	(0x15) Assert_Multiline_end


Suggested change

(0x15) Assert_Multiline_end

(0x15) Assert_Multiline_End

Nice catch, fixed

Feoramund · 2025-04-11T18:21:25Z

core/text/regex/virtual_machine/util.odin

+	case .Wait_For_Rune_Class:               iter.pc += size_of(Opcode) + size_of(u8)
+	case .Wait_For_Rune_Class_Negated:       iter.pc += size_of(Opcode) + size_of(u8)
+	case .Match_All_And_Escape:              iter.pc += size_of(Opcode)
+	case .Assert_Multiline_End:              iter.pc += size_of(Opcode)


There's no need to align all of these. It creates more noise in the diffs. See other areas involving opcode enums where this was the case. This can be rebased out of the initial commit.

I've rebased it out. My bad.

The GitHub diffs still seem to be showing this, but I did rebase it out. Maybe it's because I had to force-push my rebase? Not sure.

Feoramund · 2025-04-11T19:56:45Z

core/text/regex/virtual_machine/virtual_machine.odin

+				// Special case where we don't want to progress the string pointer
+				// Because we want to leave a potential `\r` or `\n` to be consumed
+				// by a potential `^` in potential future iterations.
+				vm.string_pointer -= vm.current_rune_size
+				continue


This is suspicious. The VM is not designed to be able to rewind the string_pointer. All threads must operate on the same rune simultaneously and independently of any decisions made by any other thread.

Yes, this seems to cause issues.

Do you have any alternative ideas? We need to ensure that the string pointer does not advance past the newline character, so that we can match things like ^foo$ against foo\nfoo, where the newline is used by both the $ in the first iteration and the ^ in the second.

Forcibly rewinding the string_pointer after the match has concluded might be possible, but it feels very fragile as there are cases such as foo$|bar\n where it's not exactly clear if we should rewind or not.

I've entirely changed the behaviour here, we no longer do any rewinding.

Feoramund · 2025-04-11T20:03:08Z

core/text/regex/virtual_machine/virtual_machine.odin

+	when common.ODIN_DEBUG_REGEX {
+		io.write_string(common.debug_stream, "Whole program::\n")
+		for op in vm.code {
+			io.write_string(common.debug_stream, opcode_to_name(op))
+			io.write_byte(common.debug_stream, '\n')
+		}
+	}
+


This is unneeded. Just call trace found in core:text/regex/compiler on your code after compiling. It shows the full opcode name with arguments along with jump targets and arrows like you'd see in a disassembler output.

Thanks! trace is much easier to work with 😄

jist99 · 2025-04-12T12:11:02Z

@Feoramund I've made some changes. We no longer rewind the string pointer; instead, ^ looks back at the previous character to check whether it was a newline, so cases like foo1\nfoo2 still work.
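The look-back idea can be illustrated with a very small lock-step matcher. This is a sketch under stated assumptions: the instruction names (`CHAR`, `SPLIT`, `JMP`, `BOL`, `EOL`, `MATCH`) and the helpers are hypothetical, it is not the Odin virtual machine, and it treats only `\n` as a line break. Every thread inspects the same position, and that position only ever moves forward.

```python
# Hypothetical mini instruction set for illustration; these are not the
# opcodes of Odin's regex virtual machine.
CHAR, SPLIT, JMP, BOL, EOL, MATCH = range(6)

def add_thread(prog, threads, seen, pc, text, sp):
    """Follow zero-width instructions from pc and queue what remains."""
    if pc in seen:
        return
    seen.add(pc)
    op = prog[pc]
    if op[0] == JMP:
        add_thread(prog, threads, seen, op[1], text, sp)
    elif op[0] == SPLIT:
        add_thread(prog, threads, seen, op[1], text, sp)
        add_thread(prog, threads, seen, op[2], text, sp)
    elif op[0] == BOL:
        # `^` looks back one character; the position never moves backwards.
        if sp == 0 or text[sp - 1] == "\n":
            add_thread(prog, threads, seen, pc + 1, text, sp)
    elif op[0] == EOL:
        # `$` peeks at the current character without consuming it.
        if sp == len(text) or text[sp] == "\n":
            add_thread(prog, threads, seen, pc + 1, text, sp)
    else:
        threads.append(pc)  # CHAR or MATCH: handled by the main loop

def match_at(prog, text, start):
    """Return the end index of the shortest match beginning at start, or None."""
    threads = []
    add_thread(prog, threads, set(), 0, text, start)
    sp = start
    while threads:
        next_threads, seen = [], set()
        for pc in threads:  # every thread inspects the same position sp
            op = prog[pc]
            if op[0] == MATCH:
                return sp
            if op[0] == CHAR and sp < len(text) and text[sp] == op[1]:
                add_thread(prog, next_threads, seen, pc + 1, text, sp + 1)
        threads = next_threads
        sp += 1  # the only place sp changes, and only forwards
    return None

def all_spans(prog, text):
    """Try every start position; overlapping matches are not filtered."""
    spans = []
    for start in range(len(text) + 1):
        end = match_at(prog, text, start)
        if end is not None:
            spans.append((start, end))
    return spans

# ^foo$ compiled by hand.
PROG = [(BOL,), (CHAR, "f"), (CHAR, "o"), (CHAR, "o"), (EOL,), (MATCH,)]
spans = all_spans(PROG, "foo\nfoo")
print(spans)
```

Because both anchors are zero-width checks against neighbouring characters, the newline between the two lines is never consumed, and neither reported span includes it.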

I've had to remove the slight change I made to the compiler that moved Save instructions to avoid newlines. Cases such as ^foo$|^bar$ on "bar\nfoo" were causing issues, as the split introduced more complexity. Most of the time \n is still not included, though in some cases it does tag along. I'm not sure how to fix this while ensuring that splits still work correctly.

As you pointed out, the infinite loops were introduced in the original iterator patch, including the one with just $ alone. I haven't fixed these.

With regard to the ^$ regex:

> I should expect these three to return no capture, same behavior as if it was not Multiline.

On main, a regular (non-iterator) match with .Multiline on does return a capture:

    exp, _ := regex.create(`^$`, {.Multiline})
    capture := regex.match_and_allocate_capture(exp, "a\nb\n\r") or_else panic("No match")
    fmt.println(capture)

With my changes this still happens (though only when .Global is enabled). Although it's not ideal, the regex engines of JavaScript, Go, C#, and Rust all show behaviour similar to my patch, so I don't believe this is a major issue.
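As one more data point, the three inputs from the review can be run through Python's `re` module. Python is not among the engines named above and is used here only as an assumption-labelled stand-in. It recognises only `\n` as a line boundary for `^` and `$`, so results for input containing `\r` can differ from engines that also treat `\r` as a line terminator.

```python
import re

empty_line = re.compile(r"^$", re.MULTILINE)

def spans(text):
    """Return the (start, end) span of every empty-line match in text."""
    return [m.span() for m in empty_line.finditer(text)]

# Python treats only "\n" as a line boundary for ^ and $, so a trailing
# "\r" is an ordinary character here; other engines may differ.
results = {text: spans(text) for text in ("a\nb", "a\nb\n", "a\nb\n\r")}
for text, found in results.items():
    print(repr(text), found)
```

In Python, only the input ending in a newline produces a match, an empty one at offset 4, which lines up with the 4,4 capture reported in the review for `a\nb\n`.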

Feoramund · 2025-04-14T19:19:13Z

I have a partial fix in the works for some of the core iterator problems. I'll need to see how it interacts with these changes. I'd prefer the output to be deterministic with regard to splitting newlines. If there are going to be exceptions, then we need to be able to state accurately what they are, because people will be depending on this API.

jist99 · 2025-04-14T20:00:20Z

Sorry, I should have been clearer in my comment. The behaviour for including \n is fully deterministic, and I know exactly when it happens, I just can't think of a good way to prevent it.

^ will include the newline in the match group, and $ will not. So with the regex ^a$, the input "a\n" gives Capture{pos = [[0, 1]], groups = ["a"]}, and "\na" gives Capture{pos = [[0, 2]], groups = ["\na"]}.
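For comparison with the two captures quoted above, the same inputs can be checked against a reference engine. The sketch uses Python's `re` with `re.MULTILINE` as that reference. This is an assumption for illustration, not a description of Odin's output.

```python
import re

anchored_a = re.compile(r"^a$", re.MULTILINE)

trailing = anchored_a.search("a\n")  # newline after the match
leading = anchored_a.search("\na")   # newline before the match

print(trailing.span(), repr(trailing.group(0)))
print(leading.span(), repr(leading.group(0)))
```

Here the leading newline is excluded as well: the second match starts at offset 1 and the matched text is `a` in both cases.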

Kelimion · 2025-05-26T19:24:44Z

Superseded by @Feoramund's #5220 PR.

Feoramund reviewed Apr 11, 2025

jist99 added 5 commits April 11, 2025 23:08

Change .Multiline regex anchors to match common regex implementations

a86c5f1

Appease the style checks

640c5de

Allow the iterator to take the .Multiline flag

c64a019

Implement review suggestions

4dff946

Alter multiline approach

80654ef

jist99 force-pushed the regex-anchor-fix branch from 2e0ef9d to 80654ef Compare April 12, 2025 11:59

This was referenced May 24, 2025

Fix RegEx iterator, remove .Global, make patterns unanchored by default (breaking change) #5209

Merged

Fix multiline RegEx iteration (breaking change for .Multiline usage) #5220

Merged

Kelimion closed this May 26, 2025
