
Optimize AppendTokens() by adding Tokenizer::prefixUntil() #2003


Open · eduard-bagdasaryan wants to merge 11 commits into master
Conversation

@eduard-bagdasaryan (Contributor) commented Feb 25, 2025

This change removes expensive CharacterSet negation and copying on every
AppendTokens() call.

Also simplified the complex "empty haystack" and "empty needle" conditions
by naming the buf_.substr() return value. Those conditions are mutually
exclusive (in npos cases), but the earlier code did not convey that fact well.

No functionality changes expected outside of level-8 debugging messages.
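
For illustration, here is a minimal sketch of the caller-side change, assuming simplified signatures: the complement() line matches the Notes.cc deletion quoted later in this thread, while the AppendTokensOld/AppendTokensNew helper names, the loop shape, and the exact prefixUntil() signature are assumptions rather than verbatim Squid code.

```cpp
// Sketch only: simplified AppendTokens() shapes; not verbatim Squid code.
#include "base/CharacterSet.h"
#include "parser/Tokenizer.h"
#include "sbuf/SBuf.h"

#include <vector>

// Old pattern: every call negates the delimiter set. complement() builds
// (and copies) a brand new CharacterSet just so that prefix() can be used.
static void
AppendTokensOld(std::vector<SBuf> &tokens, Parser::Tokenizer &tok, const CharacterSet &delimiters)
{
    const auto tokenCharacters = delimiters.complement("non-delimiters");
    SBuf token;
    while (tok.prefix(token, tokenCharacters)) {
        tokens.push_back(token);
        (void)tok.skipAll(delimiters); // consume separators between tokens
    }
}

// New pattern: prefixUntil() stops at the first delimiter directly, so the
// per-call CharacterSet negation and copying disappear from the hot path.
static void
AppendTokensNew(std::vector<SBuf> &tokens, Parser::Tokenizer &tok, const CharacterSet &delimiters)
{
    SBuf token;
    while (tok.prefixUntil(token, delimiters)) {
        tokens.push_back(token);
        (void)tok.skipAll(delimiters); // same delimiter skipping as before
    }
}
```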

eduard-bagdasaryan and others added 10 commits February 20, 2025 01:43
The conditions are mutually exclusive, but that fact was not clear in
the official code (because that code lacked `str`).

Also adjusted the "no prefix" debugs() wording to clarify that use case's
description and to improve symmetry with the "empty haystack" use case.

This change also avoids "insufficient input" phrasing, which should be
reserved for methods throwing InsufficientInput.
XXX: This code compiles, but I am concerned that callers may specify a
wrong/third SBuf method (with the same profile as SearchMethod).
Keep the new prefix_() parameter order, as it is a lot more readable.
... across Tokenizer methods (at least).
... and slightly fewer official code changes.
@eduard-bagdasaryan (Contributor, Author) commented:
This PR implements a PR1896 suggestion.

@rousskov (Contributor) previously approved these changes Feb 25, 2025, commenting:

Thank you.

@@ -104,7 +111,7 @@ Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChar

     SBuf result;

-    if (!prefix(result, tokenChars, limit))
+    if (!prefix_(result, limit, findFirstNotOf, tokenChars))
Contributor:

I have tried to replace the new { findFirstOf, findFirstNotOf } enum with SBuf method pointers, but the result was no more readable and was more error-prone than the current PR code, because a prefix_() caller could supply a pointer to a method that compiles fine but is not compatible with prefix_() logic. The current/proposed "findFirstNotOf, tokenChars" and "findFirstOf, delimiters" calls are readable and safer.

Moving findFirstOf() and findFirstNotOf() calls outside prefix_() does not work well either, for several reasons.

This comment does not request any changes.
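
To make the comparison concrete, here is a rough sketch of the adopted enum dispatch next to the rejected method-pointer alternative. The findFirstOf/findFirstNotOf names and the prefix_() parameter order come from the diff above; the function body, the isEmpty()/consume() details, and the SearchMethodPtr alias are illustrative assumptions, not the PR's actual code.

```cpp
// Sketch of a plausible prefix_() in src/parser/Tokenizer.cc; illustrative only.
enum SearchMethod { findFirstOf, findFirstNotOf };

bool
Parser::Tokenizer::prefix_(SBuf &returnedToken, const SBuf::size_type limit,
                           const SearchMethod method, const CharacterSet &chars)
{
    const auto str = buf_.substr(0, limit); // the searched haystack
    if (str.isEmpty())
        return false; // empty haystack

    // The enum restricts callers to the two searches that prefix_() knows
    // how to interpret; an unrelated SBuf method cannot be supplied by mistake.
    const auto prefixLen = (method == findFirstNotOf) ?
                           str.findFirstNotOf(chars) : str.findFirstOf(chars);
    if (prefixLen == 0)
        return false; // no prefix

    returnedToken = str.substr(0, prefixLen); // npos: the whole str matches
    buf_.consume(returnedToken.length());
    return true;
}

// The rejected alternative: any SBuf method with a matching profile compiles,
// including methods whose results prefix_() would misinterpret (the
// "wrong/third SBuf method" concern flagged in the commit messages above).
using SearchMethodPtr = SBuf::size_type (SBuf::*)(const CharacterSet &, SBuf::size_type) const;
```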

Contributor:

FYI, that trouble is a "red flag" for this alteration being a bad design change.

IMHO, the use case I assume you are trying to achieve would be better served by fixing the problems we have with the token() method design, which are currently forcing some weird uses of prefix().

Contributor:

> FYI, that trouble is a "red flag" for this alteration being a bad design change.

I do not see "bad design change" signs here. My comment was describing troubles with rejected solutions, not with the proposed one, so "that trouble" does not really apply to the proposed code changes. Moreover, I do not think it is accurate to describe the proposed code changes as a "design change". The overall Tokenizer design remains the same. We are just adding another useful method.

> Overall I do not see any need for this bloat to the Tokenizer API. The example user code in Notes.cc is fine as-was.

>     -    const auto tokenCharacters = delimiters.complement("non-delimiters");

Existing AppendTokens() always performs expensive CharacterSet negation and copying. It is definitely not "fine"! I have now added an explicit statement to the PR description to detail the problem solved by this PR.

> IMHO, the use case I assume you are trying to achieve would be better served by fixing the problems we have with the token() method design, which are currently forcing some weird uses of prefix().

This PR avoids expensive CharacterSet negation and copying on every AppendTokens() call. This optimization was planned earlier and does not preclude any Tokenizer::token() method redesign. If you would like to propose Tokenizer::token() design improvements, please do so, but please do not block this PR even if you think those future improvements will make this optimization unnecessary.

Contributor:

Oh, okay, I see where you are coming from now regarding the need for change.

I still think fixing Tokenizer::token() would be better overall. What you are calling prefixUntil() in this PR is what I have recently been thinking token() should be doing. Would you mind making this PR do that, along with the related update of the existing token() caller(s)?

Contributor:

> > FYI, that trouble is a "red flag" for this alteration being a bad design change.
>
> I do not see "bad design change" signs here. My comment was describing troubles with rejected solutions, not with the proposed one, so "that trouble" does not really apply to the proposed code changes.

As I am sure you are aware, passing a function pointer and calling it should be a far more efficient (and more easily coded) solution than enumerating all cases individually with hard-coded calls to those same functions/methods. That is the red flag to me: something has gone wrong with the attempts to implement that should-be-better logic.

Not requesting a change due to this point, but FTR it smells bad to "optimize" code by replacing it with a known sub-optimal implementation.

Contributor:

> I still think fixing Tokenizer::token() would be better overall. What you are calling prefixUntil() in this PR is what I have recently been thinking token() should be doing. Would you mind making this PR do that, along with the related update of the existing token() caller(s)?

I do not think we should change token(), especially in this PR. That old method extracts and forgets leading and trailing delimiters; it should continue to do that. The new prefixUntil() method is similar to the existing prefix() method with regard to delimiter handling; prefix*() methods should continue to treat delimiters differently than token() does because their use cases are different.

> As I am sure you are aware, passing a function pointer and calling it should be a far more efficient (and more easily coded) solution than enumerating all cases individually with hard-coded calls to those same functions/methods. That is the red flag to me: something has gone wrong with the attempts to implement that should-be-better logic.

Any speed difference between the two prefix_() designs is negligible/immaterial in this context; this PR optimizes a completely different (and actually significant!) expense. This design decision should be based on other factors. Since I have actually implemented a function-pointer design (before rejecting it for reasons not covered in your analysis), I still believe I made the right call. Thank you for not insisting on changing that.
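
To illustrate that distinction, here is a hedged sketch contrasting the two methods on the same input. The exact prefixUntil() behavior shown (refusing an empty prefix and leaving delimiters in the buffer) is an assumption inferred from the "no prefix" debugs() wording in the commit messages above, not confirmed API documentation.

```cpp
// Illustrative sketch: token() versus the proposed prefixUntil().
#include "base/CharacterSet.h"
#include "parser/Tokenizer.h"
#include "sbuf/SBuf.h"

void contrastTokenAndPrefixUntil()
{
    const CharacterSet comma("comma", ",");
    SBuf found;

    // token(): skips leading delimiters, extracts "a", and consumes the
    // trailing delimiter; the delimiters are extracted and forgotten.
    Parser::Tokenizer t1(SBuf(",a,b"));
    t1.token(found, comma);       // found == "a"; t1 buffer is now "b"

    // prefixUntil(): never consumes delimiters on its own (like prefix()).
    Parser::Tokenizer t2(SBuf(",a,b"));
    t2.prefixUntil(found, comma); // fails: the buffer starts with a delimiter
    t2.skipAll(comma);            // the caller decides when to skip delimiters
    t2.prefixUntil(found, comma); // found == "a"; t2 buffer is still ",b"
}
```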

@rousskov added M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels S-could-use-an-approval An approval may speed this PR merger (but is not required) labels Feb 25, 2025
@yadij (Contributor) left a comment:

Overall I do not see any need for this bloat to the Tokenizer API. The example user code in Notes.cc is fine as-was.

@yadij added S-waiting-for-author author action is expected (and usually required) and removed M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels S-could-use-an-approval An approval may speed this PR merger (but is not required) labels Feb 26, 2025

@rousskov requested a review from yadij February 26, 2025 15:33
@rousskov added S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box and removed S-waiting-for-author author action is expected (and usually required) labels Feb 26, 2025


@yadij added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels Mar 24, 2025
@rousskov requested a review from yadij March 24, 2025 14:45
@rousskov removed the S-waiting-for-author author action is expected (and usually required) label Mar 24, 2025
@rousskov added the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label Mar 24, 2025
@squid-anubis added M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels and removed M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels labels May 8, 2025