-
Notifications
You must be signed in to change notification settings - Fork 242
Various code improvements for eclasses #546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| fprintf(stderr, "** Unrecognized parsed pattern item 0x%.8x " | ||
| "in character class\n", meta); | ||
| #endif | ||
| *errorcodeptr = ERR89; /* Internal error - unrecognized. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think assertions are better for this, so I removed this functionality. If it is desired, it can be added to the call sites of this function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. All the code in this function here (compile_class_not_nested) is stuff I just moved out of pcre2_compile, I didn't write any of it, so it's fine by me to change!
NWilson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, I like the changes overall.
It's going to conflict with my WIP branch where I'm doing the optimizations you requested. Don't duplicate the work I've started on that - unless you want to let me know what you're tackling.
I'm on holiday with friends for a few days now, so I won't open another PR before next Tues or Weds.
| /* Macros for the definitions below, to prevent name collisions. */ | ||
|
|
||
| #define _pcre2_posix_class_maps PCRE2_SUFFIX(_pcre2_posix_class_maps) | ||
| #define _pcre2_optimize_class PCRE2_SUFFIX(_pcre2_optimize_class_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you want to remove the optimise_class prototype below in the header as well.
| fprintf(stderr, "** Unrecognized parsed pattern item 0x%.8x " | ||
| "in character class\n", meta); | ||
| #endif | ||
| *errorcodeptr = ERR89; /* Internal error - unrecognized. */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. All the code in this function here (compile_class_not_nested) is stuff I just moved out of pcre2_compile, I didn't write any of it, so it's fine by me to change!
| DONE: | ||
| *pcode = code; | ||
| return TRUE; | ||
| return pptr - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I see what you've done. The code was previously very tightly coupled between the parse_class pre-processing helper, and the code that runs after. Both of them need to agree on "read META codes from ptr onwards, but stop when reaching something that doesn't match a specific pattern".
I changed that by introducing end_ptr, so they will definitely read the same META items. You've gone back to the implicit coupling of the loops in parse_class, optimize_class, and compile_class_not_nested. OK. I believe that they match, it's not terrible.
Perhaps parse_class could output its end_ptr, and the other loops could use the endptr so they stop at the same place, without needing to duplicate the assertion CHECK_POSSIBLE_CLASS_END - instead it could be an assertion "check class end identical to passed-in endptr".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usually setting a tmp value used only in an assertion is hard to do nicely. The alternative is using a while (ptr < end) to enforce they run to the same point. If it is ok for you, I will do that.
Background: the meta codes for character sets and operators/brackets are different, so it is safe to process them and return with the last non character set meta code. Later we could pass a list of starting locations, and a single range could be constructed from them.
Example:
[A[B&&C]D] could be reorganized to [AD[B&&C]] internally.
We could just use an operator list and starting meta code pointers.
From the ECLASS above, we construct the following small list:
ptr-to-A
ptr-to-B
ptr-to-C
OP_AND
OP_OR
ptr-to-D
OP_OR
Then an optimizer converts it to:
ptr-to-A and ptr-to-D -> the range is constructed from the combination of these two items.
ptr-to-B
ptr-to-C
OP_AND
OP_OR
|
I don't want to work on any class optimizations at this point. The task is yours as we discussed. I will share my ideas on the topic, but it is your decision how to use them. This patch tries to improve existing code, nothing more. |
|
Thank you Zoltan! I'm very grateful for your contributions, you are welcome to work on the code of course. I only wanted to make sure we don't duplicate work by mistake. |
|
I did the requested changes except one. I could not find a nice way to pass around the class end without many defines. Maybe we can do that in a follow up patch if we find some nice solution. |
|
I found a very different way to do the requested debug check. I have added a macro, |
| #ifdef PCRE2_DEBUG | ||
| #define CLASS_END_CASES(meta) \ | ||
| default: \ | ||
| PCRE2_ASSERT((meta) <= META_END); \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised the compiler lets you write a statement that falls through to another case, with the "Falls through" comment that's enforced elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I forgot it. Thank you for noticing it. Compiler can complain about it with the suitable compiler options, probably something is missing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the "multi-config" CI build doesn't include a debug build, with those options.
| DONE: | ||
| *pcode = code; | ||
| return TRUE; | ||
| return pptr - 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a tiny bit of code smell to returning ptr - 1 here? The last META code we saw might be a two-item one (such as POSIX + code, or BIGVALUE + codepoint). So, the -1 doesn't point to anything meaningful.
It appears you've done it, because the existing callers take the return pointer and always add a fixed +1 to it, and it's simpler in some ways to leave the callers as-is.
Just a suggestion. If the code works out neater this way, that's fine. Maybe the function should have a note in its doc at the top that you must not take the return value and treat it the start of a META code.
OR... perhaps my mental interpreter has completely misunderstood what the pptr - 1 is doing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reverts the ++ here.
https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile_class.c#L1141
We read the token which terminates the loop. We don't process the token, just read its type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Argh! Of course. Thank you for explaining.
NWilson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I like the changes. There's one thing I don't understand (pptr - 1) but it might be fine, I just can't follow what's happening from looking at the diff.
Those macros look really effective for aligning the for loops, thank you!
carenas
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
less code is always better
I think the new code is quite nice and works nicely. I did some code reworks to improve the operations. There is no byte code optimizations. @NWilson please check.