-
Notifications
You must be signed in to change notification settings - Fork 242
Add folding and simplication for OP_ECLASS #586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add folding and simplication for OP_ECLASS #586
Conversation
| ECL_AND AND; no additional data | ||
| ECL_OR OR; no additional data | ||
| ECL_XOR XOR; no additional data | ||
| ECL_NOT NOT; no additional data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you mentioned that not is removed. Is there a special type (sequence) for all/nothing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hadn't mentioned them in the HACKING document, because they are not part of a fully-formed OP_ECLASS, but yes, there are two values ECL_ANY and ECL_NONE.
When one of these appears as the LHS/RHS of a binary operator, it can be constant-folded away. Therefore, in the final return value from compile_class_nested(), the result either has no instances of ECL_ANY or ECL_NONE, or the result is a single ECL_ANY or ECL_NONE item. In this case, we don't need to wrap it in an OP_ECLASS, so at match-time, an OP_ECLASS will only ever contain ECL_XCLASS items plus operators.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, very unfortunately... I had to back the NOT opcode. There's one case I discovered that I couldn't fold, and I reworked the code a few times, but in the end, it was better with the NOT.
This is the case: [[\p{Thai} & \p{Digits}] ~~ [^z]]. If you have a compound expression, on the LHS of an XOR (~~), then we need to be able to fold the RHS into the LHS. However, because the RHS is all ones, we need to flip (invert) the LHS. At this point, it's too late to go back and do that - I really want the whole optimisation pass to be a single, left-to-right pass over the META codes, without needing backtracking or fixups.
By far the easiest solution is just to place a NOT at the end. In theory... it could be folded away, with more effort, but the code just ends up so ugly I decided not to do that.
zherczeg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice progress! I will check the rest later.
| #endif | ||
| { | ||
| /* Rewind back over the OP_ECLASS. */ | ||
| code = previous; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we compute this at length computation time, store the bit in the META code, and use this knowledge when the byte code is generated. We might do a buffer overrun otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There shouldn't be a buffer overrun?
This is the rule I'm following: "don't do any conditional logic / branches inside "if (lengthptr == NULL)" blocks". It's totally OK to check whether lengthptr == NULL, but you need to absolutely sure when you do that, that the code will behave the same with and without lengthptr.
What would be terrible, would be to do "better" optimisation when lengthptr == NULL, or to do anything different at all (apart from actually writing the bytecode itself). We have to make exactly the same optimisation decisions in each case. The if (lengthptr == NULL) blocks need to extremely short, and careful, so that they don't affect control flow.
That way, we can be sure we'll have enough buffer space - because we followed all the same code paths each time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean it is a revert. If the remaining length of the buffer is 2 characters, and you write 5, then revert, then write 2, then there no buffer overrun at the end, only at the middle. This is not possible, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whenever I rewind the code pointer, I increment the lengthptr by the same amount.
- Say, we write two bytes to
code, and request another 5 vialengthptr. (A total of 7 bytes requested) - Then we rewind
code(by 2 bytes), and incrementlengthptrto compensate. (Still 7 bytes requested) - Now we write more bytes to
code- can initially decrementlengthptr(if writing 7 bytes or less), but then if we are requesting more bytes, we'll increase the size of the allocation
|
I know the tests are failing. I'll sort that out! I'm also going to do additional testing. I plan to write a grammar-driven fuzzing script, to generate millions of random extended classes which have valid syntax, and deep, complex expressions. Then I'll run pcre2test using the OLD code, to generate a reference answer for each. If pcre2test produces the same output with the NEW code, then we can be almost certain that the optimisations don't have any effect on behaviour. I also need to re-check the code coverage to make sure all the cases in optimiser are actually covered by the existing tests. |
046d4ff to
163f220
Compare
|
Actually, don't start reviewing this quite yet. A big chunk of the core logic is all present & correct, but there are some bits of work-in-progress. The code is close, but there are some very fiddly details around XCLASS handling I'm still wrestling with. |
163f220 to
7056cdf
Compare
|
Just a question: if multiple XCLASS-es uses property accesses, they each read the ucd data structure separately. On the long run, it would be good to do it only once, and share the ucd data pointer. A flag bit could be useful in ECLASS, which tells, that at east one XCLASS needs ucd data. Then ECLASS reads the data, and shares with XCLASS-es. Adding such flag is easy or difficult in your current implementation? |
Probably easy? There are 7 spare unused flag bits in ECLASS currently, and it would be possible to add the UCD pointer to the arguments for I'm not entirely convinced you'd want to do that though. It's a two-level lookup table for UCD data, which isn't terrible, given that the relevant chunks of the table will be in L1 cache. Random character access will be very very painful if the CPU has to hit RAM to load the table chunks; but loading the same character repeatedly (in my guesswork) would be relatively painless? I guess it could be measured. |
|
Great! There are some minor bugs still to fix (CI tells me that the no-Unicode build has some failing tests). However, is working nearly perfectly, and I don't expect any substantial changes from here on. The code which builds XCLASSes (using char lists etc) is incredibly complex, and it would be rather nice to try hard to simplify it at some point. There are two very separate codepaths:
Fortunately... this PR really doesn't go into any of that stuff, even though it builds on top of it, so I've simply left it alone for now. However, I am very tempted to do a follow-up refactor of |
22d9183 to
b62c07c
Compare
|
Let us know when this patch is ready for review. |
|
I have some additional testing to do (eg add some tests where we don't have 100% coverage of all the optimisation paths). So it's not ready to merge. But the code is stable and ready for review. |
src/pcre2_compile.c
Outdated
| // XXX revisit why these flags here are different to when we process a | ||
| // non-empty class - is this a problem? You can have '[^\d\D]' is basically | ||
| // the same as '[]' so they should surely set the same flags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a question for reviewers.
The handling for META_CLASS_EMPTY_NOT emits OP_ALLANY, and sets the flags:
if (firstcuflags == REQ_UNSET) firstcuflags = REQ_NONE;
zerofirstcu = firstcu;
zerofirstcuflags = firstcuflags;
And, the handling for [\d\D] also emits OP_ALLANY, but sets the different flags:
if (firstcuflags == REQ_UNSET) firstcuflags = REQ_NONE;
zerofirstcu = firstcu;
zerofirstcuflags = firstcuflags;
zeroreqcu = reqcu;
zeroreqcuflags = reqcuflags;
Is this a problem? Is there any impact?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't spent time to understand these constructs, so this is also magic to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps @PhilipHazel could comment on this one specific question.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I regret that I cannot remember the details of how all that works. It must be over 20 years ago... I can remember that it is is quite complicated and I probably should have documented it better in comments, though I see I did at least document why the zeroxxx variables exist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I have done my best to study how they are used.
I believe that this is not a bug. The code here does not update zeroreqcu[flags], because it doesn't update reqcu[flags]. There are examples elsewhere in the file which confused me, because they appear to be doing unnecessary updates to the zero* variables.
I believe the rule is, "if you write to reqcu[flags] or firstcu[flags], you should also write to the corresponding zero* variables".
I have withdrawn my XXX comment.
Thank you @PhilipHazel for hepling!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always happy to hepple. :-)
6350c1f to
936e8fd
Compare
|
I have added additional tests, until the code-coverage report reached 100% branch coverage for all the (reachable) branches in the ECLASS code. I will do further testing by fuzz-testing the new (optimised) ECLASS behaviour against the old reference implementation. |
|
I have completed my testing. I wrote a script, which generates random extended character classes, UTS#18 and Perl-style, in order to fuzz-test the optimisations in this PR. I generated millions of test cases (GB of testinput), and ran the test cases using pcre2test, comparing master to this branch. The output files were identical. Then I ran my fuzzed inputs using the Clang's coverage check... and found I'd forgotten to generate cases for one of the operators! Oops. So I re-ran... all good. The coverage verifies that the randomised test cases are covering all the branches of the optimiser, and probably covering far more paths through the code than the manually-written test cases. |
zherczeg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice progress!
| - name: Test (main test script) | ||
| run: | | ||
| ulimit -S -s 32768 # Raise stack limit; ASAN with -O0 is very stack-hungry | ||
| ulimit -S -s 49152 # Raise stack limit; ASAN with -O0 is very stack-hungry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. The fuzz tests generate very deep stacks. The stack usage for ordinary builds is quite sensible (15 * 32 bytes), but the stack frames and stack allocations have absurdly-large amounts of guard space in the ASAN builds, so I had to bump this higher.
The eclass code (with a 15-deep parenthesis limit) uses similar or less stack than the main regex parser, which is also recursive-descent with a 255-deep limit.
| { | ||
| *code++ = OP_CLASS; | ||
| memset(code, 0, 32 * sizeof(uint8_t)); | ||
| memset(code, 0, 32); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I just noticed a load of sizeof(uint8_t) and thought it was simpler to just remove (personal style).
| #endif | ||
| { | ||
| /* Rewind back over the OP_ECLASS. */ | ||
| code = previous; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean it is a revert. If the remaining length of the buffer is 2 characters, and you write 5, then revert, then write 2, then there no buffer overrun at the end, only at the middle. This is not possible, isn't it?
| cls:[ ] | ||
| op: || | ||
| ] | ||
| [ ^ab] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there tests with /x and comments / spaces?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are previously-committed tests for /x.
I haven't touched the frontend parser in this PR, so interpretation of input characters won't have changed.
| cls:[\p{Lu}] | ||
| cls:[\p{Nd}] | ||
| op: || | ||
| bitmap: [0-9A-Z\xc0-\xd6\xd8-\xde] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is exactly the case where no bitmap is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, sadly. Because ECLASS combines bitmaps in ways that are harder to reason about, I made the ECLASS bitmap behaviour a bit different to XCLASS (see the changes in HACKING). For ECLASS, I made it so that there is a bitmap if and only if the ECLASS matches any characters < 256. So an ECLASS will have at most one bitmap, but it will be preserved for cases like \pL & \pN.
(This example is even worse, because additional optimisation would actually mean it could be simplified to a simple XCLASS.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. My idea was to detect that bitmap is needed in the lengthptr != NULL phase, and use this info later. I will try to do it after this patch is landed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added an "always_map" flag in this PR (to compile_class_not_nested()). That flag pokes XCLASSes so that when they are inside an ECLASS they always become bitmapped (only if they match any ASCII characters, eg. \pL).
I was planning on cleaning that up in a future PR, but I won't do it straight away - so go ahead if you have ideas.
zherczeg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please squash the commits, and fix the minor things. Good progress.
src/pcre2_compile.c
Outdated
| // XXX revisit why these flags here are different to when we process a | ||
| // non-empty class - is this a problem? You can have '[^\d\D]' is basically | ||
| // the same as '[]' so they should surely set the same flags. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't spent time to understand these constructs, so this is also magic to me.
src/pcre2_compile.c
Outdated
| /* Do some useful counting of what's in the bitmap. */ | ||
| for (int i = 0; i < 8; i++) | ||
| if (op_info.bits.classwords[i] != 0xffffffff) | ||
| { allbitsone = FALSE; break; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The syntax for these is not changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did change the indentation. Is it wrong to put it all on one line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have split the code onto multiple lines.
|
Is there anybody who plans to review it later? If not, I plan to land it after a few changes. |
c211e1d to
09e04a4
Compare
|
Squashed. This should be good to go! |
zherczeg
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
I tried merging this ... after a while the merge button came back so (stupidly) I clicked it again. It now says merge already in progress... has it done it, I wonder? I hope I haven't broken anything. |
|
I still see the "squash and merge" button, but I don't click on it to avoid even more chaos. |
|
I now see that too, but likewise, I'm holding back. @NWilson what do you think? Is it safe to try merging again? |
|
I can't see what buttons you're seeing... but why not? It can't hurt to keep clicking until it merges. |
|
Hooray! It seems to have merged. |
Fixes #537
Whew! This was a bit of a brain-bending PR, for me at least.
OP_ECLASS now only contains ECS_XCLASS, and ECL_AND...ECL_XOR codes. No more CLASS, NCLASS nested inside; we fully-fold those, so anything like
(?[ [a-z] - [aeiou] ])will no longer generate an OP_ECLASS. You have to actually put two XCLASSes inside the extended character class, in order to something that we can't fully fold.In addition, for the cases that do result in an OP_ECLASS (mainly, logical operations on Unicode classes), we hoist up the bitmap so that ASCII characters can be matched without needing to consult any of the unicode properties.