Add folding and simplication for OP_ECLASS #586

NWilson · 2024-11-27T21:53:15Z

Fixes #537

Whew! This was a bit of a brain-bending PR, for me at least.

OP_ECLASS now only contains ECS_XCLASS, and ECL_AND...ECL_XOR codes. No more CLASS, NCLASS nested inside; we fully-fold those, so anything like (?[ [a-z] - [aeiou] ]) will no longer generate an OP_ECLASS. You have to actually put two XCLASSes inside the extended character class, in order to something that we can't fully fold.

In addition, for the cases that do result in an OP_ECLASS (mainly, logical operations on Unicode classes), we hoist up the bitmap so that ASCII characters can be matched without needing to consult any of the unicode properties.

zherczeg · 2024-11-27T22:04:43Z

HACKING

+  ECL_AND       AND; no additional data
+  ECL_OR        OR; no additional data
+  ECL_XOR       XOR; no additional data
+  ECL_NOT       NOT; no additional data


I think you mentioned that not is removed. Is there a special type (sequence) for all/nothing?

I hadn't mentioned them in the HACKING document, because they are not part of a fully-formed OP_ECLASS, but yes, there are two values ECL_ANY and ECL_NONE.

When one of these appears as the LHS/RHS of a binary operator, it can be constant-folded away. Therefore, in the final return value from compile_class_nested(), the result either has no instances of ECL_ANY or ECL_NONE, or the result is a single ECL_ANY or ECL_NONE item. In this case, we don't need to wrap it in an OP_ECLASS, so at match-time, an OP_ECLASS will only ever contain ECL_XCLASS items plus operators.

Also, very unfortunately... I had to back the NOT opcode. There's one case I discovered that I couldn't fold, and I reworked the code a few times, but in the end, it was better with the NOT.

This is the case: [[\p{Thai} & \p{Digits}] ~~ [^z]]. If you have a compound expression, on the LHS of an XOR (~~), then we need to be able to fold the RHS into the LHS. However, because the RHS is all ones, we need to flip (invert) the LHS. At this point, it's too late to go back and do that - I really want the whole optimisation pass to be a single, left-to-right pass over the META codes, without needing backtracking or fixups.

By far the easiest solution is just to place a NOT at the end. In theory... it could be folded away, with more effort, but the code just ends up so ugly I decided not to do that.

src/pcre2_compile.c

zherczeg

Nice progress! I will check the rest later.

src/pcre2_compile.c

zherczeg · 2024-11-27T22:10:36Z

src/pcre2_compile.c

+#endif
+        {
+        /* Rewind back over the OP_ECLASS. */
+        code = previous;


Can we compute this at length computation time, store the bit in the META code, and use this knowledge when the byte code is generated. We might do a buffer overrun otherwise.

There shouldn't be a buffer overrun?

This is the rule I'm following: "don't do any conditional logic / branches inside "if (lengthptr == NULL)" blocks". It's totally OK to check whether lengthptr == NULL, but you need to absolutely sure when you do that, that the code will behave the same with and without lengthptr.

What would be terrible, would be to do "better" optimisation when lengthptr == NULL, or to do anything different at all (apart from actually writing the bytecode itself). We have to make exactly the same optimisation decisions in each case. The if (lengthptr == NULL) blocks need to extremely short, and careful, so that they don't affect control flow.

That way, we can be sure we'll have enough buffer space - because we followed all the same code paths each time.

I mean it is a revert. If the remaining length of the buffer is 2 characters, and you write 5, then revert, then write 2, then there no buffer overrun at the end, only at the middle. This is not possible, isn't it?

Whenever I rewind the code pointer, I increment the lengthptr by the same amount.

Say, we write two bytes to code, and request another 5 via lengthptr. (A total of 7 bytes requested)

Then we rewind code (by 2 bytes), and increment lengthptr to compensate. (Still 7 bytes requested)

Now we write more bytes to code - can initially decrement lengthptr (if writing 7 bytes or less), but then if we are requesting more bytes, we'll increase the size of the allocation

src/pcre2_compile.c

src/pcre2_compile.h

src/pcre2_compile_class.c

NWilson · 2024-11-28T10:49:07Z

I know the tests are failing. I'll sort that out!

I'm also going to do additional testing.

I plan to write a grammar-driven fuzzing script, to generate millions of random extended classes which have valid syntax, and deep, complex expressions. Then I'll run pcre2test using the OLD code, to generate a reference answer for each.

If pcre2test produces the same output with the NEW code, then we can be almost certain that the optimisations don't have any effect on behaviour.

I also need to re-check the code coverage to make sure all the cases in optimiser are actually covered by the existing tests.

NWilson · 2024-11-29T16:21:39Z

Actually, don't start reviewing this quite yet.

A big chunk of the core logic is all present & correct, but there are some bits of work-in-progress. The code is close, but there are some very fiddly details around XCLASS handling I'm still wrestling with.

zherczeg · 2024-11-30T06:38:04Z

Just a question: if multiple XCLASS-es uses property accesses, they each read the ucd data structure separately. On the long run, it would be good to do it only once, and share the ucd data pointer. A flag bit could be useful in ECLASS, which tells, that at east one XCLASS needs ucd data. Then ECLASS reads the data, and shares with XCLASS-es. Adding such flag is easy or difficult in your current implementation?

NWilson · 2024-11-30T21:28:42Z

Just a question: if multiple XCLASS-es uses property accesses, they each read the ucd data structure separately. On the long run, it would be good to do it only once, and share the ucd data pointer. A flag bit could be useful in ECLASS, which tells, that at east one XCLASS needs ucd data. Then ECLASS reads the data, and shares with XCLASS-es. Adding such flag is easy or difficult in your current implementation?

Probably easy? There are 7 spare unused flag bits in ECLASS currently, and it would be possible to add the UCD pointer to the arguments for PRIV(xclass)().

I'm not entirely convinced you'd want to do that though. It's a two-level lookup table for UCD data, which isn't terrible, given that the relevant chunks of the table will be in L1 cache. Random character access will be very very painful if the CPU has to hit RAM to load the table chunks; but loading the same character repeatedly (in my guesswork) would be relatively painless? I guess it could be measured.

NWilson · 2024-11-30T21:36:42Z

Great! There are some minor bugs still to fix (CI tells me that the no-Unicode build has some failing tests).

However, is working nearly perfectly, and I don't expect any substantial changes from here on.

The code which builds XCLASSes (using char lists etc) is incredibly complex, and it would be rather nice to try hard to simplify it at some point. There are two very separate codepaths:

The case for 16-bit/32-bit, or 8-bit with UTF set. These go via cranges, but also share code with the pure-8-bit-chars codepath, in a bit of a messy arrangement
The pure-8-bit codepath, with no cranges, for 8-bit library without UTF set. This basically has its own special-cased code, but it's mingled in with other cases.

Fortunately... this PR really doesn't go into any of that stuff, even though it builds on top of it, so I've simply left it alone for now. However, I am very tempted to do a follow-up refactor of PRIV(compile_class_not_nested)().

zherczeg · 2024-12-02T05:24:20Z

Let us know when this patch is ready for review.

NWilson · 2024-12-02T09:06:25Z

I have some additional testing to do (eg add some tests where we don't have 100% coverage of all the optimisation paths). So it's not ready to merge.

But the code is stable and ready for review.

NWilson · 2024-12-02T11:29:18Z

src/pcre2_compile.c

+    // XXX revisit why these flags here are different to when we process a
+    // non-empty class - is this a problem? You can have '[^\d\D]' is basically
+    // the same as '[]' so they should surely set the same flags.


This is a question for reviewers.

The handling for META_CLASS_EMPTY_NOT emits OP_ALLANY, and sets the flags:

if (firstcuflags == REQ_UNSET) firstcuflags = REQ_NONE;
zerofirstcu = firstcu;
zerofirstcuflags = firstcuflags;

And, the handling for [\d\D] also emits OP_ALLANY, but sets the different flags:

if (firstcuflags == REQ_UNSET) firstcuflags = REQ_NONE;
zerofirstcu = firstcu;
zerofirstcuflags = firstcuflags;
zeroreqcu = reqcu;
zeroreqcuflags = reqcuflags;

Is this a problem? Is there any impact?

I haven't spent time to understand these constructs, so this is also magic to me.

Perhaps @PhilipHazel could comment on this one specific question.

I regret that I cannot remember the details of how all that works. It must be over 20 years ago... I can remember that it is is quite complicated and I probably should have documented it better in comments, though I see I did at least document why the zeroxxx variables exist.

OK, I have done my best to study how they are used.

I believe that this is not a bug. The code here does not update zeroreqcu[flags], because it doesn't update reqcu[flags]. There are examples elsewhere in the file which confused me, because they appear to be doing unnecessary updates to the zero* variables.

I believe the rule is, "if you write to reqcu[flags] or firstcu[flags], you should also write to the corresponding zero* variables".

I have withdrawn my XXX comment.

Thank you @PhilipHazel for hepling!

Always happy to hepple. :-)

HACKING

NWilson · 2024-12-02T16:06:34Z

I have added additional tests, until the code-coverage report reached 100% branch coverage for all the (reachable) branches in the ECLASS code.

I will do further testing by fuzz-testing the new (optimised) ECLASS behaviour against the old reference implementation.

NWilson · 2024-12-02T22:28:39Z

I have completed my testing.

I wrote a script, which generates random extended character classes, UTS#18 and Perl-style, in order to fuzz-test the optimisations in this PR. I generated millions of test cases (GB of testinput), and ran the test cases using pcre2test, comparing master to this branch. The output files were identical.

Then I ran my fuzzed inputs using the Clang's coverage check... and found I'd forgotten to generate cases for one of the operators! Oops. So I re-ran... all good. The coverage verifies that the randomised test cases are covering all the branches of the optimiser, and probably covering far more paths through the code than the manually-written test cases.

zherczeg

Nice progress!

zherczeg · 2024-12-03T10:48:56Z

.github/workflows/dev.yml

      - name: Test (main test script)
        run: |
-          ulimit -S -s 32768 # Raise stack limit; ASAN with -O0 is very stack-hungry
+          ulimit -S -s 49152 # Raise stack limit; ASAN with -O0 is very stack-hungry


Do we know why?

Yes. The fuzz tests generate very deep stacks. The stack usage for ordinary builds is quite sensible (15 * 32 bytes), but the stack frames and stack allocations have absurdly-large amounts of guard space in the ASAN builds, so I had to bump this higher.

The eclass code (with a 15-deep parenthesis limit) uses similar or less stack than the main regex parser, which is also recursive-descent with a 255-deep limit.

zherczeg · 2024-12-03T10:54:45Z

src/pcre2_compile.c

      {
      *code++ = OP_CLASS;
-      memset(code, 0, 32 * sizeof(uint8_t));
+      memset(code, 0, 32);


Consistency?

Yes. I just noticed a load of sizeof(uint8_t) and thought it was simpler to just remove (personal style).

src/pcre2_compile.c

zherczeg · 2024-12-03T10:59:14Z

src/pcre2_compile.c

+#endif
+        {
+        /* Rewind back over the OP_ECLASS. */
+        code = previous;


I mean it is a revert. If the remaining length of the buffer is 2 characters, and you write 5, then revert, then write 2, then there no buffer overrun at the end, only at the middle. This is not possible, isn't it?

src/pcre2_compile.c

src/pcre2_compile_class.c

testdata/testoutput2

zherczeg · 2024-12-03T11:20:04Z

testdata/testoutput2

-          cls:[ ]
-          op: ||
-        ]
+        [ ^ab]


Is there tests with /x and comments / spaces?

Yes, there are previously-committed tests for /x.

I haven't touched the frontend parser in this PR, so interpretation of input characters won't have changed.

zherczeg · 2024-12-03T11:22:38Z

testdata/testoutput5

-          cls:[\p{Lu}]
-          cls:[\p{Nd}]
-          op: ||
+          bitmap: [0-9A-Z\xc0-\xd6\xd8-\xde]


I think this is exactly the case where no bitmap is needed.

Yes, sadly. Because ECLASS combines bitmaps in ways that are harder to reason about, I made the ECLASS bitmap behaviour a bit different to XCLASS (see the changes in HACKING). For ECLASS, I made it so that there is a bitmap if and only if the ECLASS matches any characters < 256. So an ECLASS will have at most one bitmap, but it will be preserved for cases like \pL & \pN.

(This example is even worse, because additional optimisation would actually mean it could be simplified to a simple XCLASS.)

Ok. My idea was to detect that bitmap is needed in the lengthptr != NULL phase, and use this info later. I will try to do it after this patch is landed.

I added an "always_map" flag in this PR (to compile_class_not_nested()). That flag pokes XCLASSes so that when they are inside an ECLASS they always become bitmapped (only if they match any ASCII characters, eg. \pL).

I was planning on cleaning that up in a future PR, but I won't do it straight away - so go ahead if you have ideas.

zherczeg

Please squash the commits, and fix the minor things. Good progress.

zherczeg · 2024-12-04T08:42:39Z

src/pcre2_compile.c

+    // XXX revisit why these flags here are different to when we process a
+    // non-empty class - is this a problem? You can have '[^\d\D]' is basically
+    // the same as '[]' so they should surely set the same flags.


I haven't spent time to understand these constructs, so this is also magic to me.

zherczeg · 2024-12-04T08:43:25Z

src/pcre2_compile.c

+      /* Do some useful counting of what's in the bitmap. */
+      for (int i = 0; i < 8; i++)
+        if (op_info.bits.classwords[i] != 0xffffffff)
+          { allbitsone = FALSE; break; }


The syntax for these is not changed.

I did change the indentation. Is it wrong to put it all on one line?

I have split the code onto multiple lines.

zherczeg · 2024-12-04T08:49:34Z

Is there anybody who plans to review it later? If not, I plan to land it after a few changes.

Fixes PCRE2Project#537

NWilson · 2024-12-04T12:53:06Z

Squashed. This should be good to go!

zherczeg

LGTM

PhilipHazel · 2024-12-04T13:36:44Z

I tried merging this ... after a while the merge button came back so (stupidly) I clicked it again. It now says merge already in progress... has it done it, I wonder? I hope I haven't broken anything.

zherczeg · 2024-12-04T13:41:18Z

I still see the "squash and merge" button, but I don't click on it to avoid even more chaos.

PhilipHazel · 2024-12-04T14:43:05Z

I now see that too, but likewise, I'm holding back. @NWilson what do you think? Is it safe to try merging again?

NWilson · 2024-12-04T14:45:06Z

I can't see what buttons you're seeing... but why not? It can't hurt to keep clicking until it merges.

PhilipHazel · 2024-12-04T15:03:50Z

Hooray! It seems to have merged.

zherczeg reviewed Nov 27, 2024

View reviewed changes

src/pcre2_compile.c Outdated Show resolved Hide resolved

zherczeg reviewed Nov 27, 2024

View reviewed changes

NWilson force-pushed the user/niwilson/eclass-opt branch from 046d4ff to 163f220 Compare November 29, 2024 15:09

NWilson force-pushed the user/niwilson/eclass-opt branch from 163f220 to 7056cdf Compare November 29, 2024 17:23

NWilson marked this pull request as ready for review November 30, 2024 21:31

NWilson force-pushed the user/niwilson/eclass-opt branch from 22d9183 to b62c07c Compare December 1, 2024 21:16

NWilson commented Dec 2, 2024

View reviewed changes

NWilson force-pushed the user/niwilson/eclass-opt branch from 6350c1f to 936e8fd Compare December 2, 2024 15:42

NWilson commented Dec 2, 2024

View reviewed changes

HACKING Show resolved Hide resolved

zherczeg requested changes Dec 3, 2024

View reviewed changes

zherczeg approved these changes Dec 4, 2024

View reviewed changes

Add folding and simplication for OP_ECLASS

09e04a4

Fixes PCRE2Project#537

NWilson force-pushed the user/niwilson/eclass-opt branch from c211e1d to 09e04a4 Compare December 4, 2024 12:52

zherczeg approved these changes Dec 4, 2024

View reviewed changes

PhilipHazel merged commit e8a5cd7 into PCRE2Project:master Dec 4, 2024
21 checks passed

NWilson deleted the user/niwilson/eclass-opt branch December 17, 2024 09:47

Add folding and simplication for OP_ECLASS #586

Add folding and simplication for OP_ECLASS #586

Uh oh!

Conversation

NWilson commented Nov 27, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

zherczeg left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NWilson commented Nov 28, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

NWilson commented Nov 29, 2024

Uh oh!

zherczeg commented Nov 30, 2024

Uh oh!

NWilson commented Nov 30, 2024

Uh oh!

NWilson commented Nov 30, 2024

Uh oh!

zherczeg commented Dec 2, 2024

Uh oh!

NWilson commented Dec 2, 2024

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

NWilson Dec 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

NWilson commented Dec 2, 2024

Uh oh!

NWilson commented Dec 2, 2024

Uh oh!

zherczeg left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

NWilson commented Nov 27, 2024 •

edited

Loading

NWilson commented Nov 28, 2024 •

edited

Loading

NWilson Dec 4, 2024 •

edited

Loading