Skip to content

Conversation

@NWilson
Copy link
Member

@NWilson NWilson commented Nov 27, 2024

Fixes #537

Whew! This was a bit of a brain-bending PR, for me at least.

OP_ECLASS now only contains ECS_XCLASS, and ECL_AND...ECL_XOR codes. No more CLASS, NCLASS nested inside; we fully-fold those, so anything like (?[ [a-z] - [aeiou] ]) will no longer generate an OP_ECLASS. You have to actually put two XCLASSes inside the extended character class, in order to something that we can't fully fold.

In addition, for the cases that do result in an OP_ECLASS (mainly, logical operations on Unicode classes), we hoist up the bitmap so that ASCII characters can be matched without needing to consult any of the unicode properties.

ECL_AND AND; no additional data
ECL_OR OR; no additional data
ECL_XOR XOR; no additional data
ECL_NOT NOT; no additional data
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mentioned that not is removed. Is there a special type (sequence) for all/nothing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't mentioned them in the HACKING document, because they are not part of a fully-formed OP_ECLASS, but yes, there are two values ECL_ANY and ECL_NONE.

When one of these appears as the LHS/RHS of a binary operator, it can be constant-folded away. Therefore, in the final return value from compile_class_nested(), the result either has no instances of ECL_ANY or ECL_NONE, or the result is a single ECL_ANY or ECL_NONE item. In this case, we don't need to wrap it in an OP_ECLASS, so at match-time, an OP_ECLASS will only ever contain ECL_XCLASS items plus operators.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, very unfortunately... I had to back the NOT opcode. There's one case I discovered that I couldn't fold, and I reworked the code a few times, but in the end, it was better with the NOT.

This is the case: [[\p{Thai} & \p{Digits}] ~~ [^z]]. If you have a compound expression, on the LHS of an XOR (~~), then we need to be able to fold the RHS into the LHS. However, because the RHS is all ones, we need to flip (invert) the LHS. At this point, it's too late to go back and do that - I really want the whole optimisation pass to be a single, left-to-right pass over the META codes, without needing backtracking or fixups.

By far the easiest solution is just to place a NOT at the end. In theory... it could be folded away, with more effort, but the code just ends up so ugly I decided not to do that.

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice progress! I will check the rest later.

#endif
{
/* Rewind back over the OP_ECLASS. */
code = previous;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we compute this at length computation time, store the bit in the META code, and use this knowledge when the byte code is generated. We might do a buffer overrun otherwise.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There shouldn't be a buffer overrun?

This is the rule I'm following: "don't do any conditional logic / branches inside "if (lengthptr == NULL)" blocks". It's totally OK to check whether lengthptr == NULL, but you need to absolutely sure when you do that, that the code will behave the same with and without lengthptr.

What would be terrible, would be to do "better" optimisation when lengthptr == NULL, or to do anything different at all (apart from actually writing the bytecode itself). We have to make exactly the same optimisation decisions in each case. The if (lengthptr == NULL) blocks need to extremely short, and careful, so that they don't affect control flow.

That way, we can be sure we'll have enough buffer space - because we followed all the same code paths each time.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean it is a revert. If the remaining length of the buffer is 2 characters, and you write 5, then revert, then write 2, then there no buffer overrun at the end, only at the middle. This is not possible, isn't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whenever I rewind the code pointer, I increment the lengthptr by the same amount.

  • Say, we write two bytes to code, and request another 5 via lengthptr. (A total of 7 bytes requested)
  • Then we rewind code (by 2 bytes), and increment lengthptr to compensate. (Still 7 bytes requested)
  • Now we write more bytes to code - can initially decrement lengthptr (if writing 7 bytes or less), but then if we are requesting more bytes, we'll increase the size of the allocation

@NWilson
Copy link
Member Author

NWilson commented Nov 28, 2024

I know the tests are failing. I'll sort that out!

I'm also going to do additional testing.

I plan to write a grammar-driven fuzzing script, to generate millions of random extended classes which have valid syntax, and deep, complex expressions. Then I'll run pcre2test using the OLD code, to generate a reference answer for each.

If pcre2test produces the same output with the NEW code, then we can be almost certain that the optimisations don't have any effect on behaviour.

I also need to re-check the code coverage to make sure all the cases in optimiser are actually covered by the existing tests.

@NWilson NWilson force-pushed the user/niwilson/eclass-opt branch from 046d4ff to 163f220 Compare November 29, 2024 15:09
@NWilson
Copy link
Member Author

NWilson commented Nov 29, 2024

Actually, don't start reviewing this quite yet.

A big chunk of the core logic is all present & correct, but there are some bits of work-in-progress. The code is close, but there are some very fiddly details around XCLASS handling I'm still wrestling with.

@NWilson NWilson force-pushed the user/niwilson/eclass-opt branch from 163f220 to 7056cdf Compare November 29, 2024 17:23
@zherczeg
Copy link
Collaborator

Just a question: if multiple XCLASS-es uses property accesses, they each read the ucd data structure separately. On the long run, it would be good to do it only once, and share the ucd data pointer. A flag bit could be useful in ECLASS, which tells, that at east one XCLASS needs ucd data. Then ECLASS reads the data, and shares with XCLASS-es. Adding such flag is easy or difficult in your current implementation?

@NWilson
Copy link
Member Author

NWilson commented Nov 30, 2024

Just a question: if multiple XCLASS-es uses property accesses, they each read the ucd data structure separately. On the long run, it would be good to do it only once, and share the ucd data pointer. A flag bit could be useful in ECLASS, which tells, that at east one XCLASS needs ucd data. Then ECLASS reads the data, and shares with XCLASS-es. Adding such flag is easy or difficult in your current implementation?

Probably easy? There are 7 spare unused flag bits in ECLASS currently, and it would be possible to add the UCD pointer to the arguments for PRIV(xclass)().

I'm not entirely convinced you'd want to do that though. It's a two-level lookup table for UCD data, which isn't terrible, given that the relevant chunks of the table will be in L1 cache. Random character access will be very very painful if the CPU has to hit RAM to load the table chunks; but loading the same character repeatedly (in my guesswork) would be relatively painless? I guess it could be measured.

@NWilson NWilson marked this pull request as ready for review November 30, 2024 21:31
@NWilson
Copy link
Member Author

NWilson commented Nov 30, 2024

Great! There are some minor bugs still to fix (CI tells me that the no-Unicode build has some failing tests).

However, is working nearly perfectly, and I don't expect any substantial changes from here on.

The code which builds XCLASSes (using char lists etc) is incredibly complex, and it would be rather nice to try hard to simplify it at some point. There are two very separate codepaths:

  • The case for 16-bit/32-bit, or 8-bit with UTF set. These go via cranges, but also share code with the pure-8-bit-chars codepath, in a bit of a messy arrangement
  • The pure-8-bit codepath, with no cranges, for 8-bit library without UTF set. This basically has its own special-cased code, but it's mingled in with other cases.

Fortunately... this PR really doesn't go into any of that stuff, even though it builds on top of it, so I've simply left it alone for now. However, I am very tempted to do a follow-up refactor of PRIV(compile_class_not_nested)().

@NWilson NWilson force-pushed the user/niwilson/eclass-opt branch from 22d9183 to b62c07c Compare December 1, 2024 21:16
@zherczeg
Copy link
Collaborator

zherczeg commented Dec 2, 2024

Let us know when this patch is ready for review.

@NWilson
Copy link
Member Author

NWilson commented Dec 2, 2024

I have some additional testing to do (eg add some tests where we don't have 100% coverage of all the optimisation paths). So it's not ready to merge.

But the code is stable and ready for review.

Comment on lines 6152 to 6154
// XXX revisit why these flags here are different to when we process a
// non-empty class - is this a problem? You can have '[^\d\D]' is basically
// the same as '[]' so they should surely set the same flags.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a question for reviewers.

The handling for META_CLASS_EMPTY_NOT emits OP_ALLANY, and sets the flags:

if (firstcuflags == REQ_UNSET) firstcuflags = REQ_NONE;
zerofirstcu = firstcu;
zerofirstcuflags = firstcuflags;

And, the handling for [\d\D] also emits OP_ALLANY, but sets the different flags:

if (firstcuflags == REQ_UNSET) firstcuflags = REQ_NONE;
zerofirstcu = firstcu;
zerofirstcuflags = firstcuflags;
zeroreqcu = reqcu;
zeroreqcuflags = reqcuflags;

Is this a problem? Is there any impact?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't spent time to understand these constructs, so this is also magic to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps @PhilipHazel could comment on this one specific question.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I regret that I cannot remember the details of how all that works. It must be over 20 years ago... I can remember that it is is quite complicated and I probably should have documented it better in comments, though I see I did at least document why the zeroxxx variables exist.

Copy link
Member Author

@NWilson NWilson Dec 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I have done my best to study how they are used.

I believe that this is not a bug. The code here does not update zeroreqcu[flags], because it doesn't update reqcu[flags]. There are examples elsewhere in the file which confused me, because they appear to be doing unnecessary updates to the zero* variables.

I believe the rule is, "if you write to reqcu[flags] or firstcu[flags], you should also write to the corresponding zero* variables".

I have withdrawn my XXX comment.

Thank you @PhilipHazel for hepling!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always happy to hepple. :-)

@NWilson NWilson force-pushed the user/niwilson/eclass-opt branch from 6350c1f to 936e8fd Compare December 2, 2024 15:42
@NWilson
Copy link
Member Author

NWilson commented Dec 2, 2024

I have added additional tests, until the code-coverage report reached 100% branch coverage for all the (reachable) branches in the ECLASS code.

I will do further testing by fuzz-testing the new (optimised) ECLASS behaviour against the old reference implementation.

@NWilson
Copy link
Member Author

NWilson commented Dec 2, 2024

I have completed my testing.

I wrote a script, which generates random extended character classes, UTS#18 and Perl-style, in order to fuzz-test the optimisations in this PR. I generated millions of test cases (GB of testinput), and ran the test cases using pcre2test, comparing master to this branch. The output files were identical.

Then I ran my fuzzed inputs using the Clang's coverage check... and found I'd forgotten to generate cases for one of the operators! Oops. So I re-ran... all good. The coverage verifies that the randomised test cases are covering all the branches of the optimiser, and probably covering far more paths through the code than the manually-written test cases.

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice progress!

- name: Test (main test script)
run: |
ulimit -S -s 32768 # Raise stack limit; ASAN with -O0 is very stack-hungry
ulimit -S -s 49152 # Raise stack limit; ASAN with -O0 is very stack-hungry
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know why?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The fuzz tests generate very deep stacks. The stack usage for ordinary builds is quite sensible (15 * 32 bytes), but the stack frames and stack allocations have absurdly-large amounts of guard space in the ASAN builds, so I had to bump this higher.

The eclass code (with a 15-deep parenthesis limit) uses similar or less stack than the main regex parser, which is also recursive-descent with a 255-deep limit.

{
*code++ = OP_CLASS;
memset(code, 0, 32 * sizeof(uint8_t));
memset(code, 0, 32);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistency?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I just noticed a load of sizeof(uint8_t) and thought it was simpler to just remove (personal style).

#endif
{
/* Rewind back over the OP_ECLASS. */
code = previous;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean it is a revert. If the remaining length of the buffer is 2 characters, and you write 5, then revert, then write 2, then there no buffer overrun at the end, only at the middle. This is not possible, isn't it?

cls:[ ]
op: ||
]
[ ^ab]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there tests with /x and comments / spaces?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are previously-committed tests for /x.

I haven't touched the frontend parser in this PR, so interpretation of input characters won't have changed.

cls:[\p{Lu}]
cls:[\p{Nd}]
op: ||
bitmap: [0-9A-Z\xc0-\xd6\xd8-\xde]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is exactly the case where no bitmap is needed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, sadly. Because ECLASS combines bitmaps in ways that are harder to reason about, I made the ECLASS bitmap behaviour a bit different to XCLASS (see the changes in HACKING). For ECLASS, I made it so that there is a bitmap if and only if the ECLASS matches any characters < 256. So an ECLASS will have at most one bitmap, but it will be preserved for cases like \pL & \pN.

(This example is even worse, because additional optimisation would actually mean it could be simplified to a simple XCLASS.)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. My idea was to detect that bitmap is needed in the lengthptr != NULL phase, and use this info later. I will try to do it after this patch is landed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added an "always_map" flag in this PR (to compile_class_not_nested()). That flag pokes XCLASSes so that when they are inside an ECLASS they always become bitmapped (only if they match any ASCII characters, eg. \pL).

I was planning on cleaning that up in a future PR, but I won't do it straight away - so go ahead if you have ideas.

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash the commits, and fix the minor things. Good progress.

Comment on lines 6152 to 6154
// XXX revisit why these flags here are different to when we process a
// non-empty class - is this a problem? You can have '[^\d\D]' is basically
// the same as '[]' so they should surely set the same flags.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't spent time to understand these constructs, so this is also magic to me.

/* Do some useful counting of what's in the bitmap. */
for (int i = 0; i < 8; i++)
if (op_info.bits.classwords[i] != 0xffffffff)
{ allbitsone = FALSE; break; }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The syntax for these is not changed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did change the indentation. Is it wrong to put it all on one line?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have split the code onto multiple lines.

@zherczeg
Copy link
Collaborator

zherczeg commented Dec 4, 2024

Is there anybody who plans to review it later? If not, I plan to land it after a few changes.

@NWilson NWilson force-pushed the user/niwilson/eclass-opt branch from c211e1d to 09e04a4 Compare December 4, 2024 12:52
@NWilson
Copy link
Member Author

NWilson commented Dec 4, 2024

Squashed. This should be good to go!

Copy link
Collaborator

@zherczeg zherczeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@PhilipHazel
Copy link
Collaborator

I tried merging this ... after a while the merge button came back so (stupidly) I clicked it again. It now says merge already in progress... has it done it, I wonder? I hope I haven't broken anything.

@zherczeg
Copy link
Collaborator

zherczeg commented Dec 4, 2024

I still see the "squash and merge" button, but I don't click on it to avoid even more chaos.

@PhilipHazel
Copy link
Collaborator

I now see that too, but likewise, I'm holding back. @NWilson what do you think? Is it safe to try merging again?

@NWilson
Copy link
Member Author

NWilson commented Dec 4, 2024

I can't see what buttons you're seeing... but why not? It can't hurt to keep clicking until it merges.

@PhilipHazel PhilipHazel merged commit e8a5cd7 into PCRE2Project:master Dec 4, 2024
21 checks passed
@PhilipHazel
Copy link
Collaborator

Hooray! It seems to have merged.

@NWilson NWilson deleted the user/niwilson/eclass-opt branch December 17, 2024 09:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extended character classes: optimisation

3 participants