Skip to content

Change .Multiline regex anchors to match common regex implementations #5027

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

jist99
Copy link

@jist99 jist99 commented Apr 10, 2025

Initially, I raised the issue #5017 thinking this was just a problem with the iterator, however after some initial investigation I found that the ^ and $ anchors when in .Multiline mode were including the newline characters in the capture group, which is not the behaviour seen on any of regex101's engines (see images).

So this PR changes the behaviour of ^ and $ in multiline mode to ensure that newlines are not included in the capture.

To do this I've changed the regex compiler so that Save operations always occur after an opening Multiline_Start_Open, and before a Assert_Multiline_End. I've added Assert_Multiline_End as this requires special behaviour to not progress the string pointer so that one newline can be used by two matches (e.g. ^foo$ on the text "foo\nfoo").

image
image

@Feoramund your review would be appreciated since you wrote this :)
Closes #5017

@Feoramund
Copy link
Contributor

I should have some time to review and test this tomorrow.

Copy link
Contributor

@Feoramund Feoramund left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've spent a few hours trying it, Multiline flag on with regex.match_iterator. I've seen some issues.

  • Lines 72 and 441 of regex/regex.odin: strike comments about not handling .Multiline.
  • ^$ matches a\nb\n\r with two captures groups of 4,4 and 5,5.
  • ^$ matches a\nb\n with one captures group of 4,4.
  • ^$ matches a\nb with no capture group. I should expect these three to return no capture, same behavior as if it was not Multiline.
  • a$ sends the iterator into an infinite loop on a\nb\n. So does a blank pattern string, but that looks like a problem introduced by Kelimion's original implementation of the regex iterator.
  • $ itself alone causes an infinite loop.
  • a matches three times on aaa but as a then two blank captures. Issue exists in original iterator implementation.
  • (^a$|^b$) matches a\nb\na\nb as a blank string then three newlines. Very strange.
  • ^a(b|$) matches a\nb\na\nb as two blank strings.

The rewinding of string_pointer is the most questionable thing here. All threads are supposed to execute in lock step per the original design by Ken Thompson. There may be other issues stemming from this, but I will need more time to review.

@@ -171,5 +171,17 @@ For more information, see: https://swtch.com/~rsc/regexp/regexp2.html
Be aware, this opcode is not compiled in if the `Multiline` flag is on, as
the meaning of `$` changes with that flag.

(0x15) Assert_Multiline_end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
(0x15) Assert_Multiline_end
(0x15) Assert_Multiline_End

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch, fixed

case .Wait_For_Rune_Class: iter.pc += size_of(Opcode) + size_of(u8)
case .Wait_For_Rune_Class_Negated: iter.pc += size_of(Opcode) + size_of(u8)
case .Match_All_And_Escape: iter.pc += size_of(Opcode)
case .Assert_Multiline_End: iter.pc += size_of(Opcode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not needed to align all of these out. It creates more noise in the diffs. See other areas involving opcode enums where this was the case. This can be rebased out of the initial commit.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've rebased it out. My bad.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The github diffs still seem to be showing this, but I did rebase it out. Maybe it's because I had to force push my rebase? Not sure.

Comment on lines 212 to 216
// Special case where we don't want to progress the string pointer
// Because we want to leave a potential `\r` or `\n` to be consumed
// by a potential `^` in potential future iterations.
vm.string_pointer -= vm.current_rune_size
continue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is suspicious. The VM is not designed to be able to rewind the string_pointer. All threads must operate on the same rune simultaneously and independently of any decisions made by any other thread.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this seems to cause issues.

Do you have any alternate ideas? We need to ensure that the string pointer is not progressed past the newline character to be able to match things like ^foo$ foo\nfoo, where the newline is used by both the $ in the first iteration and ^ in the second.

Forcibly rewinding the string_pointer after the match has concluded might be possible, but it feels very fragile as there are cases such as foo$|bar\n where it's not exactly clear if we should rewind or not.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've entirely changed the behaviour here, we no longer do any rewinding.

Comment on lines 342 to 349
when common.ODIN_DEBUG_REGEX {
io.write_string(common.debug_stream, "Whole program::\n")
for op in vm.code {
io.write_string(common.debug_stream, opcode_to_name(op))
io.write_byte(common.debug_stream, '\n')
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is unneeded. Just call trace found in core:text/regex/compiler on your code after compiling. It shows the full opcode name with arguments along with jump targets and arrows like you'd see in a disassembler output.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! trace is much easier to work with 😄

@jist99 jist99 force-pushed the regex-anchor-fix branch from 2e0ef9d to 80654ef Compare April 12, 2025 11:59
@jist99
Copy link
Author

jist99 commented Apr 12, 2025

@Feoramund I've made some changes. We no longer rewind the string pointer, instead ^ can look back to the previous character to check if it was a newline, meaning the cases like foo1\nfoo2 still work.

I've had to remove the slight change I made to the compiler where it was moving Save instructions to avoid newlines as cases such as "bar\nfoo" ^foo$|^bar$ were causing issues as the split introduced more complexity. Most of the time \n is still not included, though in some cases it does tag along.. I'm not sure how to fix this while ensuring that splits still work correctly.

As you pointed out the infinite loops were introduced in the original iterator patch, including the one with just $ alone. I haven't fixed this.

With regards to the ^$ regex

I should expect these three to return no capture, same behavior as if it was not Multiline.
On main now with .Multiline on a regular match (non-iterato) does return things:

    exp, _ := regex.create(`^$`, {.Multiline})
    capture := regex.match_and_allocate_capture(exp, "a\nb\n\r") or_else panic("No match")
    fmt.println(capture)

With my changes this still happens (though only when .Global is enabled). Although it's not ideal the regex engines of js, go, C#, and rust all show similar behaviour to what is shown with my patch — so I don't believe that this is a major issue.

@Feoramund
Copy link
Contributor

I have a partial fix in the works for some of the core iterator problems. I'll need to see how it interacts with these changes. I'd prefer if the output's deterministic with regard to splitting newlines. If there's going to be exceptions, then we need to be able to accurately state what they are, because people will be depending on this API.

@jist99
Copy link
Author

jist99 commented Apr 14, 2025

Sorry, I should have been clearer in my comment. The behaviour for including \n is fully deterministic, and I know exactly when it happens, I just can't think of a good way to prevent it.

^ will include the newline in the match group, and $ will not. So with the regex ^a$ "a\n" will give Capture{pos = [[0, 1]], groups = ["a"]}. And "\na" will give Capture{pos = [[0, 2]], groups = ["\na"]}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Regex .Multiline consuming \n with $
2 participants