
Optimize AppendTokens() by adding Tokenizer::prefixUntil() #2003


Open · eduard-bagdasaryan wants to merge 11 commits into master
Conversation

@eduard-bagdasaryan (Contributor) commented Feb 25, 2025

This change removes expensive CharacterSet negation and copying on every
AppendTokens() call.

Also simplified the complex "empty haystack" and "empty needle" conditions
by naming the buf_.substr() return value. Those conditions are mutually
exclusive (in npos cases), but the earlier code did not convey that fact well.

No functionality changes expected outside of level-8 debugging messages.
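
For illustration, here is a minimal sketch of the caller-side change, assuming simplified signatures: the complement() line matches the Notes.cc deletion quoted later in this thread, while the AppendTokensOld/AppendTokensNew helper names, the loop shape, and the exact prefixUntil() signature are assumptions rather than verbatim Squid code.

```cpp
// Sketch only: simplified AppendTokens() shapes; not verbatim Squid code.
#include "base/CharacterSet.h"
#include "parser/Tokenizer.h"
#include "sbuf/SBuf.h"

#include <vector>

// Old pattern: every call negates the delimiter set. complement() builds
// (and copies) a brand new CharacterSet just so that prefix() can be used.
static void
AppendTokensOld(std::vector<SBuf> &tokens, Parser::Tokenizer &tok, const CharacterSet &delimiters)
{
    const auto tokenCharacters = delimiters.complement("non-delimiters");
    SBuf token;
    while (tok.prefix(token, tokenCharacters)) {
        tokens.push_back(token);
        (void)tok.skipAll(delimiters); // consume separators between tokens
    }
}

// New pattern: prefixUntil() stops at the first delimiter directly, so the
// per-call CharacterSet negation and copying disappear from the hot path.
static void
AppendTokensNew(std::vector<SBuf> &tokens, Parser::Tokenizer &tok, const CharacterSet &delimiters)
{
    SBuf token;
    while (tok.prefixUntil(token, delimiters)) {
        tokens.push_back(token);
        (void)tok.skipAll(delimiters); // same delimiter skipping as before
    }
}
```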

eduard-bagdasaryan and others added 10 commits February 20, 2025 01:43
The conditions are mutually exclusive, but that fact was not clear in
the official code (because that code lacked `str`).

Also adjusted the "no prefix" debugs() wording to clarify that use case's
description and to improve symmetry with the "empty haystack" use case.

This change also avoids "insufficient input" phrasing, which should be
reserved for methods throwing InsufficientInput.
XXX: This code compiles, but I am concerned that callers may specify a
wrong/third SBuf method (with the same profile as SearchMethod).
Keep the new prefix_() parameter order, as it is a lot more readable.
... across Tokenizer methods (at least).
... and slightly fewer official code changes.
@eduard-bagdasaryan (Contributor, Author) commented:
This PR implements a PR1896 suggestion.

@rousskov (Contributor) previously approved these changes Feb 25, 2025, commenting:

Thank you.

@@ -104,7 +111,7 @@ Parser::Tokenizer::prefix(const char *description, const CharacterSet &tokenChar

     SBuf result;

-    if (!prefix(result, tokenChars, limit))
+    if (!prefix_(result, limit, findFirstNotOf, tokenChars))
Contributor:

I have tried to replace the new { findFirstOf, findFirstNotOf } enum with SBuf method pointers, but the result was no more readable and was more error-prone than the current PR code, because a prefix_() caller could supply a pointer to a method that compiles fine but is not compatible with prefix_() logic. The current/proposed "findFirstNotOf, tokenChars" and "findFirstOf, delimiters" calls are readable and safer.

Moving findFirstOf() and findFirstNotOf() calls outside prefix_() does not work well either, for several reasons.

This comment does not request any changes.
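
To make the comparison concrete, here is a rough sketch of the adopted enum dispatch next to the rejected method-pointer alternative. The findFirstOf/findFirstNotOf names and the prefix_() parameter order come from the diff above; the function body, the isEmpty()/consume() details, and the SearchMethodPtr alias are illustrative assumptions, not the PR's actual code.

```cpp
// Sketch of a plausible prefix_() in src/parser/Tokenizer.cc; illustrative only.
enum SearchMethod { findFirstOf, findFirstNotOf };

bool
Parser::Tokenizer::prefix_(SBuf &returnedToken, const SBuf::size_type limit,
                           const SearchMethod method, const CharacterSet &chars)
{
    const auto str = buf_.substr(0, limit); // the searched haystack
    if (str.isEmpty())
        return false; // empty haystack

    // The enum restricts callers to the two searches that prefix_() knows
    // how to interpret; an unrelated SBuf method cannot be supplied by mistake.
    const auto prefixLen = (method == findFirstNotOf) ?
                           str.findFirstNotOf(chars) : str.findFirstOf(chars);
    if (prefixLen == 0)
        return false; // no prefix

    returnedToken = str.substr(0, prefixLen); // npos: the whole str matches
    buf_.consume(returnedToken.length());
    return true;
}

// The rejected alternative: any SBuf method with a matching profile compiles,
// including methods whose results prefix_() would misinterpret (the
// "wrong/third SBuf method" concern flagged in the commit messages above).
using SearchMethodPtr = SBuf::size_type (SBuf::*)(const CharacterSet &, SBuf::size_type) const;
```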

Contributor:

FYI, that trouble is a "red flag" for this alteration being a bad design change.

IMHO, the use case I assume you are trying to achieve would be better served by fixing the problems we have with the token() method design, which are currently forcing some weird uses of prefix().

Contributor:

> FYI, that trouble is a "red flag" for this alteration being a bad design change.

I do not see "bad design change" signs here. My comment was describing troubles with rejected solutions, not with the proposed one, so "that trouble" does not really apply to the proposed code changes. Moreover, I do not think it is accurate to describe the proposed code changes as a "design change". The overall Tokenizer design remains the same. We are just adding another useful method.

> Overall I do not see any need for this bloat to the Tokenizer API. The example user code in Notes.cc is fine as-was.

>     -    const auto tokenCharacters = delimiters.complement("non-delimiters");

Existing AppendTokens() always performs expensive CharacterSet negation and copying. It is definitely not "fine"! I have now added an explicit statement to the PR description to detail the problem solved by this PR.

> IMHO, the use case I assume you are trying to achieve would be better served by fixing the problems we have with the token() method design, which are currently forcing some weird uses of prefix().

This PR avoids expensive CharacterSet negation and copying on every AppendTokens() call. This optimization was planned earlier and does not preclude any Tokenizer::token() method redesign. If you would like to propose Tokenizer::token() design improvements, please do so, but please do not block this PR even if you think those future improvements will make this optimization unnecessary.

Contributor:

Oh, okay, I see where you are coming from now regarding the need for change.

I still think fixing Tokenizer::token() would be better overall. What you are calling prefixUntil() in this PR is what I have recently been thinking token() should be doing. Would you mind making this PR do that, along with the related update of the existing token() caller(s)?

Contributor:

> > FYI, that trouble is a "red flag" for this alteration being a bad design change.
>
> I do not see "bad design change" signs here. My comment was describing troubles with rejected solutions, not with the proposed one, so "that trouble" does not really apply to the proposed code changes.

As I am sure you are aware, passing a function pointer and calling it should be a far more efficient (and more easily coded) solution than enumerating all cases individually with hard-coded calls to those same functions/methods. That is the red flag to me: something has gone wrong with the attempts to implement that should-be-better logic.

Not requesting a change due to this point, but FTR it smells bad to "optimize" code by replacing it with a known sub-optimal implementation.

Contributor:

> I still think fixing Tokenizer::token() would be better overall. What you are calling prefixUntil() in this PR is what I have recently been thinking token() should be doing. Would you mind making this PR do that, along with the related update of the existing token() caller(s)?

I do not think we should change token(), especially in this PR. That old method extracts and forgets leading and trailing delimiters; it should continue to do that. The new prefixUntil() method is similar to the existing prefix() method with regard to delimiter handling; prefix*() methods should continue to treat delimiters differently than token() does because their use cases are different.

> As I am sure you are aware, passing a function pointer and calling it should be a far more efficient (and more easily coded) solution than enumerating all cases individually with hard-coded calls to those same functions/methods. That is the red flag to me: something has gone wrong with the attempts to implement that should-be-better logic.

Any speed difference between the two prefix_() designs is negligible/immaterial in this context; this PR optimizes a completely different (and actually significant!) expense. This design decision should be based on other factors. Since I have actually implemented a function-pointer design (before rejecting it for reasons not covered in your analysis), I still believe I made the right call. Thank you for not insisting on changing that.
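
To illustrate that distinction, here is a hedged sketch contrasting the two methods on the same input. The exact prefixUntil() behavior shown (refusing an empty prefix and leaving delimiters in the buffer) is an assumption inferred from the "no prefix" debugs() wording in the commit messages above, not confirmed API documentation.

```cpp
// Illustrative sketch: token() versus the proposed prefixUntil().
#include "base/CharacterSet.h"
#include "parser/Tokenizer.h"
#include "sbuf/SBuf.h"

void contrastTokenAndPrefixUntil()
{
    const CharacterSet comma("comma", ",");
    SBuf found;

    // token(): skips leading delimiters, extracts "a", and consumes the
    // trailing delimiter; the delimiters are extracted and forgotten.
    Parser::Tokenizer t1(SBuf(",a,b"));
    t1.token(found, comma);       // found == "a"; t1 buffer is now "b"

    // prefixUntil(): never consumes delimiters on its own (like prefix()).
    Parser::Tokenizer t2(SBuf(",a,b"));
    t2.prefixUntil(found, comma); // fails: the buffer starts with a delimiter
    t2.skipAll(comma);            // the caller decides when to skip delimiters
    t2.prefixUntil(found, comma); // found == "a"; t2 buffer is still ",b"
}
```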

@rousskov added M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels S-could-use-an-approval An approval may speed this PR merger (but is not required) labels Feb 25, 2025
@yadij (Contributor) left a comment:

Overall I do not see any need for this bloat to the Tokenizer API. The example user code in Notes.cc is fine as-was.

@yadij added S-waiting-for-author author action is expected (and usually required) and removed M-cleared-for-merge https://github.com/measurement-factory/anubis#pull-request-labels S-could-use-an-approval An approval may speed this PR merger (but is not required) labels Feb 26, 2025

@rousskov requested a review from yadij February 26, 2025 15:33
@rousskov added S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box and removed S-waiting-for-author author action is expected (and usually required) labels Feb 26, 2025


@yadij added S-waiting-for-author author action is expected (and usually required) and removed S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box labels Mar 24, 2025
@rousskov requested a review from yadij March 24, 2025 14:45
@rousskov removed the S-waiting-for-author author action is expected (and usually required) label Mar 24, 2025
@rousskov added the S-waiting-for-reviewer ready for review: Set this when requesting a (re)review using GitHub PR Reviewers box label Mar 24, 2025
@squid-anubis added M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels and removed M-failed-other https://github.com/measurement-factory/anubis#pull-request-labels labels May 8, 2025