Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Proposal] Scanning Tokenizer with Improved String support #1174

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

shopmike
Copy link
Contributor

@shopmike shopmike commented Sep 20, 2019

The commit currently isn't the final version but is showing the working version. The proposal is to shift the Tokenizer to be a scanner that identifies the following tokens.

    {% - Tag Open
    %} - Tag Close
    {{ - Variable Open 
    }} - Variable Close
    }  - Variable Incomplete Close
    '  - Single Quote
    "  - Double Quote

Because of this the scanner only ever needs to look at 2 positions and can operate in a single pass.

It then has the following states to control flow

    0 - Outside Any Tag or Variable
    1 - Inside a Tag
    2 - Inside a Variable
    3 - Inside a Tag and Single Quotes
    4 - Inside a Tag and Double Quotes
    5 - Inside a Variable and Single Quotes
    6 - Inside a Variable and Double Quotes
0,1,2 - Not Inside Quotes

The pseudo-code is as follows

    if 'Found Tag Open' && 'Not Inside Quotes'
        State is now 'Inside Tag'
        End Token
        Start Token
        Append to Token

    else if 'Found Variable Open' && 'Not Inside Quotes'
        State is now 'Inside Variable'
        End Token
        Start Token
        Append to Token

    else if 'Found Single Quote' && 'Inside Tag'
        State is now 'Inside Tag and Single Quotes'
        Append to Token

    else if 'Found Single Quote' && 'Inside Tag and Single Quotes'
        State is now 'Inside Tag'
        Append to Token

    else if 'Found Double Quote' && 'Inside Tag'
        State is now 'Inside Tag and Double Quotes'
        Append to Token

    else if 'Found Double Quote' && 'Inside Tag and Double Quotes'
        State is now 'Inside Tag'
        Append to Token

    else if 'Found Single Quote' && 'Inside Variable'
        State is now 'Inside Variable and Single Quotes'
        Append to Token

    else if 'Found Single Quote' && 'Inside Variable and Single Quotes'
        State is now 'Inside Variable'
        Append to Token

    else if 'Found Double Quote' && 'Inside Variable'
        State is now 'Inside Variable and Double Quotes'
        Append to Token

    else if 'Found Double Quote' && 'Inside Variable and Double Quotes'
        State is now 'Inside Variable'
        Append to Token

    else if 'Found Tag Close' && 'Inside Tag'
        State is now 'Outside Any Tag or Variable'
        Append to Token
        End Token
        Start Token

    else if 'Found Variable Close' && 'Inside Variable'
        State is now 'Outside Any Tag or Variable'
        Append to Token
        End Token
        Start Token

    else
        Append to Token
    end

This resolves and has the tests from the following PRs and Issues

Closes #701
Closes #779
Closes #624
Closes #623
Closes #344
Closes #213

Will need matching PR for liquid-c and improvements to this ruby version but the concept is easily implemented in both.

@Shopify/guardians-of-the-liquid @Shopify/liquid

@ashmaroli
Copy link
Contributor

This approach seems to have increased memory usage for the parse phase significantly:

  +-----------------+-------------------------+--------------------------+
  | Phase           | Parse                   | Render                   |
  +-----------------+-------------------------+--------------------------+
- | Total allocated | 4.53 MB (53197 objects) | 979.68 kB (8827 objects) |
+ | Total allocated | 7.23 MB (88561 objects) | 979.68 kB (8827 objects) |
  | Total retained  | 0 B (0 objects)         | 49.70 kB (276 objects)   |
  +-----------------+-------------------------+--------------------------+

@shopmike
Copy link
Contributor Author

Oh yeah, performance for this is likely, not great at the moment. I was just focusing on getting it to pass. This has the ability to be highly efficient though so just needs to be optimised.

@ashmaroli
Copy link
Contributor

Great to know!
I wasn't complaining. Just thought I'd post it so that you'd have it in the back of your head while you develop the concept further.

@shopmike
Copy link
Contributor Author

This regex wouldn't exist either, needs to be replaced by a scanner. Just a quick hack to get this moving

source.split(/({%|{{|"|'|}}|%}|})/om).each

@shopmike
Copy link
Contributor Author

shopmike commented Sep 20, 2019

And the conditional statements are double checking things which should be optimised

@shopmike
Copy link
Contributor Author

shopmike commented Sep 20, 2019

Plus the point of all this is to achieve that the following code is valid (basically curly brackets inside strings that are inside tags)

{{ variable | prepend: '{' | append: '}' }}

and

{{ 'blah {{ yy }}' | replace: '{{', 'xx' }}

@shopmike
Copy link
Contributor Author

shopmike commented Sep 23, 2019

I'm also starting to wonder if we should solve the few following issues as these can be fixed with a few more tokens to scan for

Allow escaping in " so that it is possible to use both a single quote and double quote in a string, which is not currently possible

{% assign = "we can handle \"double quotes\" and 'single quotes'" %}

This does not break backwards support for the tokenizer as double quotes are not currently possible inside double quotes. However it will break string detection inside tags

The new liquid tag splits on newlines, this means strings will no longer be able to express new lines in the new liquid tag

{% liquid 
    assign = "we can handle\n new lines" 
%}

This can possibly break templates that use "\n" in a string today. The liquid.format() command could be a way to migrate templates without issues.

Carriage returns and tabs as these are also common, not as much with the web but windows and Tab Seperated Values

{% assign = "we can handle returns\r and tabs \t" %}

This can possibly break templates that use "\r" or "\t" in a string today. The liquid.format() command could be a way to migrate templates without issues.

Finally because we support the above we have to handle escaping escapes

{% assign = "we can handle a real backslash next to the character n \\n we are not on a newline" %}

This can possibly break templates that use "\" before a control character in a string today. The liquid.format() command could be a way to migrate templates without issues.

@shopmike
Copy link
Contributor Author

shopmike commented Sep 23, 2019

An alternative proposal for the changes above is to bring in a third quoting method using backticks

Allow escaping in ` so that it is possible to use both a single quote and double quote, and back ticks in a string, which is not currently possible

{% assign = `we can handle "double quotes" and 'single quotes' and \`back ticks\`` %}

This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated

The new liquid tag splits on newlines, this means strings will no longer be able to express new lines in the new liquid tag

{% liquid 
    assign = `we can handle\n new lines`
%}

This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated

Carriage returns and tabs as these are also common, not as much with the web but windows and Tab Separated Values

{% assign = `we can handle returns\r and tabs \t` %}

This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated

Finally because we support the above we have to handle escaping escapes

{% assign = `we can handle a real backslash next to the character n \\n we are not on a newline` %}

This does not break backwards support for the tokenizer as backticks are new functionality. However string detection in other areas will need to be updated

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants