Skip to content

make x3 interger parser less dependent on fundamental character types #762

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -112,17 +112,19 @@ namespace boost { namespace spirit { namespace x3 { namespace detail
template <typename Char>
inline static bool is_valid(Char ch)
{
return (ch >= '0' && ch <= (Radix > 10 ? '9' : static_cast<Char>('0' + Radix -1)))
|| (Radix > 10 && ch >= 'a' && ch <= static_cast<Char>('a' + Radix -10 -1))
|| (Radix > 10 && ch >= 'A' && ch <= static_cast<Char>('A' + Radix -10 -1));
return (ch >= '0' && ch <= (Radix > 10 ? '9' : static_cast<char>('0' + Radix -1)))
|| (Radix > 10 && ch >= 'a' && ch <= static_cast<char>('a' + Radix -10 -1))
|| (Radix > 10 && ch >= 'A' && ch <= static_cast<char>('A' + Radix -10 -1));
Comment on lines +115 to +117
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that at least for consistency it is better, though the original code is relying without a static assert or a comment on a fact that all these constants always are positive numbers and radix addition to a constant would never overflow or will be caught by the compiler.

}

template <typename Char>
inline static unsigned digit(Char ch)
{
return (Radix <= 10 || (ch >= '0' && ch <= '9'))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a compile time branch, an if constexpr that doesn't produce a range of compiler warnings 26356b5 think-cell@7771d20. You basically replaced a compile time branch with a runtime one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this minor optimization is better than simplicity. But If it is really important that the commit won't change the performance in any branch, we could add this compile time optimization back in. Then we could keep this optimization and don't rely on char_encoding::ascii::tolower.

     return (Radix <= 10 || ch < 'A')
        ? ch - '0'
        : ch < 'a'
            ? ch - 'A' + 10
            : ch - 'a' + 10;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a char_ascii type that checks ascii range during construction. It is implicitly convertible and comparable to fundamental char types but not integer types.

IMO the two changes (with Radix constexpr optimization for the 2nd one?) are better implementation than the original code. They provide better semantics and less dependency. If you @Kojoley agree with both changes I could add a simplified char_ascii type and test cases to cover this change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does your 'user defined char type' support bitwise operators? Will replacing char_encoding::ascii::tolowe with (ch | 0x20) - 'a' + 10 work for you?

Copy link
Contributor Author

@wanghan02 wanghan02 May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fundamental char types, integral promotion is performed before doing bitwise operations. Basically it is an integer operation and relies on implicit conversion to integer type. For our char_ascii type, we focus more on character side of the type and don't support any integer operations other than getting the distance between 2 char_asciis.

But I do see there are some performance improvement with (ch | 0x20) - 'a' + 10 than ternary operator in case of fundamental character types. Can we take std::is_convertible<Char const&, int>::value into the path to make user-defined char types similar to our char_ascii work without downgrade the performance of fundamental char types?

     return (Radix <= 10 || ch < 'A')
        ? ch - '0'
        : std::is_convertible<Char const&, int>::value
            ? (ch | 0x20) - 'a' + 10
            : ch < 'a'
                ? ch - 'A' + 10
                : ch - 'a' + 10;

The performance is the same as below with fundamental char types.

    return (Radix <= 10 || ch < 'A')
        ? ch - '0'
        : (ch | 0x20) - 'a' + 10;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think your char type design is just broken. Is there tolower associated with it?

Can we take std::is_convertible<Char const&, int>::value into the path to make user-defined char types similar to our char_ascii work without downgrade the performance of fundamental char types?

I don't want to add a code path that literally nobody else is using and also is not tested by the test suite.

Copy link
Contributor Author

@wanghan02 wanghan02 May 10, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We simply don't use tolower. It doesn't matter how we design our character type. The user has freedom to define any character type. It's just if some day the community wants to add constraints to every spirit x3 parser, what is the minimum request on the Char type for this parser? I don't see depending on tolower or implicit cast to int is necessary in this function performance-wise and semantics-wise. We can split the function into overloads for better semantics. The Radix<=10 compile time optimization mixed in the code path is already weird.

        template <typename Char, unsigned Radix_=Radix, std::enable_if_t<Radix_<=10>* = nullptr>
        inline static unsigned digit(Char ch)
        {
            return ch - '0';
        }
        
        template <typename Char, unsigned Radix_=Radix, std::enable_if_t<10<Radix_ && std::is_convertible<Char, int>::value>* = nullptr>
        inline static unsigned digit(Char ch)
        {
            return ch < 'A'
                ? ch - '0'
                : (ch | 0x20) - 'a' + 10;
        }
        
        template <typename Char, unsigned Radix_=Radix, std::enable_if_t<10<Radix_ && !std::is_convertible<Char, int>::value>* = nullptr>
        inline static unsigned digit(Char ch)
        {
            return ch < 'A'
                ? ch - '0'
                : ch < 'a'
                    ? ch - 'A' + 10
                    : ch - 'a' + 10;
        }

You can take this pull request as a refactoring suggestion. The first change static_cast<Char> -> static_cast<char> for me definitely makes the code cleaner. The second change makes digit more aligned with the implementation of is_valid which doesn't use tolower either. Since digit is only called if is_valid is true, the implementation of these 2 functions should be aligned with each other. But it's only a refactoring suggestion. As I said if you agree, I could add test cases to cover the change. Otherwise we have to specialize this struct to workaround the dependency.

return ch < 'A'
? ch - '0'
: char_encoding::ascii::tolower(ch) - 'a' + 10;
: ch < 'a'
? ch - 'A' + 10
: ch - 'a' + 10;
}
};

Expand Down