Skip to content

Conversation

@zherczeg
Copy link
Collaborator

I think the new code is quite nice and works nicely. I did some code reworks to improve the operations. There is no byte code optimizations. @NWilson please check.

fprintf(stderr, "** Unrecognized parsed pattern item 0x%.8x "
"in character class\n", meta);
#endif
*errorcodeptr = ERR89; /* Internal error - unrecognized. */
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think assertions are better for this, so I removed this functionality. If it is desired, it can be added to the call sites of this function.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. All the code in this function here (compile_class_not_nested) is stuff I just moved out of pcre2_compile, I didn't write any of it, so it's fine by me to change!

Copy link
Member

@NWilson NWilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I like the changes overall.

It's going to conflict with my WIP branch where I'm doing the optimizations you requested. Don't duplicate the work I've started on that - unless you want to let me know what you're tackling.

I'm on holiday with friends for a few days now, so I won't open another PR before next Tues or Weds.

/* Macros for the definitions below, to prevent name collisions. */

#define _pcre2_posix_class_maps PCRE2_SUFFIX(_pcre2_posix_class_maps)
#define _pcre2_optimize_class PCRE2_SUFFIX(_pcre2_optimize_class_)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you want to remove the optimise_class prototype below in the header as well.

fprintf(stderr, "** Unrecognized parsed pattern item 0x%.8x "
"in character class\n", meta);
#endif
*errorcodeptr = ERR89; /* Internal error - unrecognized. */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. All the code in this function here (compile_class_not_nested) is stuff I just moved out of pcre2_compile, I didn't write any of it, so it's fine by me to change!

DONE:
*pcode = code;
return TRUE;
return pptr - 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I see what you've done. The code was previously very tightly coupled between the parse_class pre-processing helper, and the code that runs after. Both of them need to agree on "read META codes from ptr onwards, but stop when reaching something that doesn't match a specific pattern".

I changed that by introducing end_ptr, so they will definitely read the same META items. You've gone back to the implicit coupling of the loops in parse_class, optimize_class, and compile_class_not_nested. OK. I believe that they match, it's not terrible.

Perhaps parse_class could output its end_ptr, and the other loops could use the endptr so they stop at the same place, without needing to duplicate the assertion CHECK_POSSIBLE_CLASS_END - instead it could be an assertion "check class end identical to passed-in endptr".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually setting a tmp value used only in an assertion is hard to do nicely. The alternative is using a while (ptr < end) to enforce they run to the same point. If it is ok for you, I will do that.

Background: the meta codes for character sets and operators/brackets are different, so it is safe to process them and return with the last non character set meta code. Later we could pass a list of starting locations, and a single range could be constructed from them.

Example:
[A[B&&C]D] could be reorganized to [AD[B&&C]] internally.

We could just use an operator list and starting meta code pointers.

From the ECLASS above, we construct the following small list:

ptr-to-A
ptr-to-B
ptr-to-C
OP_AND
OP_OR
ptr-to-D
OP_OR

Then an optimizer converts it to:

ptr-to-A and ptr-to-D -> the range is constructed from the combination of these two items.
ptr-to-B
ptr-to-C
OP_AND
OP_OR

@zherczeg
Copy link
Collaborator Author

I don't want to work on any class optimizations at this point. The task is yours as we discussed. I will share my ideas on the topic, but it is your decision how to use them. This patch tries to improve existing code, nothing more.

@NWilson
Copy link
Member

NWilson commented Oct 31, 2024

Thank you Zoltan! I'm very grateful for your contributions, you are welcome to work on the code of course. I only wanted to make sure we don't duplicate work by mistake.

@zherczeg
Copy link
Collaborator Author

zherczeg commented Nov 2, 2024

I did the requested changes except one. I could not find a nice way to pass around the class end without many defines. Maybe we can do that in a follow up patch if we find some nice solution.

@zherczeg
Copy link
Collaborator Author

zherczeg commented Nov 3, 2024

I found a very different way to do the requested debug check. I have added a macro, CLASS_END_CASES, which has a list of the allowed class token endings. With this macro, it is not possible that one loop handles a case, and the other does not:

/* Checks the allowed tokens at the end of a class structure in debug mode.
When a new token is processed by one loop, and not the other, and the token is
a) equals to one of the cases here: the compiler will complain
   about a duplicated case value.
b) equals to none of the cases here: the loop without the handler will stop
   with an assertion failure. */

#ifdef PCRE2_DEBUG
#define CLASS_END_CASES(meta) \
default: \
PCRE2_ASSERT((meta) <= META_END); \
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised the compiler lets you write a statement that falls through to another case, with the "Falls through" comment that's enforced elsewhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I forgot it. Thank you for noticing it. Compiler can complain about it with the suitable compiler options, probably something is missing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess the "multi-config" CI build doesn't include a debug build, with those options.

DONE:
*pcode = code;
return TRUE;
return pptr - 1;
Copy link
Member

@NWilson NWilson Nov 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a tiny bit of code smell to returning ptr - 1 here? The last META code we saw might be a two-item one (such as POSIX + code, or BIGVALUE + codepoint). So, the -1 doesn't point to anything meaningful.

It appears you've done it, because the existing callers take the return pointer and always add a fixed +1 to it, and it's simpler in some ways to leave the callers as-is.

Just a suggestion. If the code works out neater this way, that's fine. Maybe the function should have a note in its doc at the top that you must not take the return value and treat it the start of a META code.

OR... perhaps my mental interpreter has completely misunderstood what the pptr - 1 is doing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It reverts the ++ here.

https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_compile_class.c#L1141

We read the token which terminates the loop. We don't process the token, just read its type.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh! Of course. Thank you for explaining.

Copy link
Member

@NWilson NWilson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I like the changes. There's one thing I don't understand (pptr - 1) but it might be fine, I just can't follow what's happening from looking at the diff.

Those macros look really effective for aligning the for loops, thank you!

Copy link
Contributor

@carenas carenas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

less code is always better

@zherczeg zherczeg merged commit c192b8c into PCRE2Project:master Nov 5, 2024
15 checks passed
@zherczeg zherczeg deleted the code_simpl branch November 5, 2024 05:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants