-
-
Notifications
You must be signed in to change notification settings - Fork 733
Change .Multiline regex anchors to match common regex implementations #5027
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
I should have some time to review and test this tomorrow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've spent a few hours trying it, Multiline
flag on with regex.match_iterator
. I've seen some issues.
- Lines 72 and 441 of
regex/regex.odin
: strike comments about not handling.Multiline
. ^$
matchesa\nb\n\r
with two captures groups of4,4
and5,5
.^$
matchesa\nb\n
with one captures group of4,4
.^$
matchesa\nb
with no capture group. I should expect these three to return no capture, same behavior as if it was notMultiline
.a$
sends the iterator into an infinite loop ona\nb\n
. So does a blank pattern string, but that looks like a problem introduced by Kelimion's original implementation of the regex iterator.$
itself alone causes an infinite loop.a
matches three times onaaa
but asa
then two blank captures. Issue exists in original iterator implementation.(^a$|^b$)
matchesa\nb\na\nb
as a blank string then three newlines. Very strange.^a(b|$)
matchesa\nb\na\nb
as two blank strings.
The rewinding of string_pointer
is the most questionable thing here. All threads are supposed to execute in lock step per the original design by Ken Thompson. There may be other issues stemming from this, but I will need more time to review.
@@ -171,5 +171,17 @@ For more information, see: https://swtch.com/~rsc/regexp/regexp2.html | |||
Be aware, this opcode is not compiled in if the `Multiline` flag is on, as | |||
the meaning of `$` changes with that flag. | |||
|
|||
(0x15) Assert_Multiline_end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(0x15) Assert_Multiline_end | |
(0x15) Assert_Multiline_End |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice catch, fixed
case .Wait_For_Rune_Class: iter.pc += size_of(Opcode) + size_of(u8) | ||
case .Wait_For_Rune_Class_Negated: iter.pc += size_of(Opcode) + size_of(u8) | ||
case .Match_All_And_Escape: iter.pc += size_of(Opcode) | ||
case .Assert_Multiline_End: iter.pc += size_of(Opcode) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not needed to align all of these out. It creates more noise in the diffs. See other areas involving opcode enums where this was the case. This can be rebased out of the initial commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've rebased it out. My bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The github diffs still seem to be showing this, but I did rebase it out. Maybe it's because I had to force push my rebase? Not sure.
// Special case where we don't want to progress the string pointer | ||
// Because we want to leave a potential `\r` or `\n` to be consumed | ||
// by a potential `^` in potential future iterations. | ||
vm.string_pointer -= vm.current_rune_size | ||
continue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is suspicious. The VM is not designed to be able to rewind the string_pointer
. All threads must operate on the same rune simultaneously and independently of any decisions made by any other thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this seems to cause issues.
Do you have any alternate ideas? We need to ensure that the string pointer is not progressed past the newline character to be able to match things like ^foo$
foo\nfoo
, where the newline is used by both the $
in the first iteration and ^
in the second.
Forcibly rewinding the string_pointer
after the match has concluded might be possible, but it feels very fragile as there are cases such as foo$|bar\n
where it's not exactly clear if we should rewind or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've entirely changed the behaviour here, we no longer do any rewinding.
when common.ODIN_DEBUG_REGEX { | ||
io.write_string(common.debug_stream, "Whole program::\n") | ||
for op in vm.code { | ||
io.write_string(common.debug_stream, opcode_to_name(op)) | ||
io.write_byte(common.debug_stream, '\n') | ||
} | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unneeded. Just call trace
found in core:text/regex/compiler
on your code after compiling. It shows the full opcode name with arguments along with jump targets and arrows like you'd see in a disassembler output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! trace
is much easier to work with 😄
2e0ef9d
to
80654ef
Compare
@Feoramund I've made some changes. We no longer rewind the string pointer, instead I've had to remove the slight change I made to the compiler where it was moving Save instructions to avoid newlines as cases such as As you pointed out the infinite loops were introduced in the original iterator patch, including the one with just With regards to the
exp, _ := regex.create(`^$`, {.Multiline})
capture := regex.match_and_allocate_capture(exp, "a\nb\n\r") or_else panic("No match")
fmt.println(capture) With my changes this still happens (though only when .Global is enabled). Although it's not ideal the regex engines of js, go, C#, and rust all show similar behaviour to what is shown with my patch — so I don't believe that this is a major issue. |
I have a partial fix in the works for some of the core iterator problems. I'll need to see how it interacts with these changes. I'd prefer if the output's deterministic with regard to splitting newlines. If there's going to be exceptions, then we need to be able to accurately state what they are, because people will be depending on this API. |
Sorry, I should have been clearer in my comment. The behaviour for including
|
Initially, I raised the issue #5017 thinking this was just a problem with the iterator, however after some initial investigation I found that the
^
and$
anchors when in.Multiline
mode were including the newline characters in the capture group, which is not the behaviour seen on any of regex101's engines (see images).So this PR changes the behaviour of
^
and$
in multiline mode to ensure that newlines are not included in the capture.To do this I've changed the regex compiler so that Save operations always occur after an opening
Multiline_Start_Open
, and before aAssert_Multiline_End
. I've addedAssert_Multiline_End
as this requires special behaviour to not progress the string pointer so that one newline can be used by two matches (e.g.^foo$
on the text"foo\nfoo"
).@Feoramund your review would be appreciated since you wrote this :)
Closes #5017