[Proposal] Scanning Tokenizer with Improved String support #1174

shopmike · 2019-09-20T06:07:20Z

The commit is not the final version yet, but it shows a working one. The proposal is to turn the Tokenizer into a scanner that identifies the following tokens:

    {% - Tag Open
    %} - Tag Close
    {{ - Variable Open 
    }} - Variable Close
    }  - Variable Incomplete Close
    '  - Single Quote
    "  - Double Quote

Because every token is at most two characters long, the scanner only ever needs to look at two positions and can operate in a single pass.
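As a rough illustration of the two-position lookup, the sketch below classifies whatever delimiter starts at a given index. The helper name `delimiter_at` and the symbol names are assumptions made for this example, not part of the PR.

```ruby
# Hypothetical helper: report which delimiter (if any) starts at index i.
# It never inspects more than source[i] and source[i + 1].
# Returns [token_name, length] or nil when no delimiter starts there.
def delimiter_at(source, i)
  # Two-character tokens are checked first so "}}" wins over a lone "}".
  case source[i, 2]
  when "{%" then return [:tag_open, 2]
  when "%}" then return [:tag_close, 2]
  when "{{" then return [:variable_open, 2]
  when "}}" then return [:variable_close, 2]
  end

  case source[i]
  when "}"  then [:variable_incomplete_close, 1]
  when "'"  then [:single_quote, 1]
  when "\"" then [:double_quote, 1]
  end
end

delimiter_at("{{ x }}", 0) # => [:variable_open, 2]
delimiter_at("{{ x }}", 3) # => nil
```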

It then has the following states to control flow

    0 - Outside Any Tag or Variable
    1 - Inside a Tag
    2 - Inside a Variable
    3 - Inside a Tag and Single Quotes
    4 - Inside a Tag and Double Quotes
    5 - Inside a Variable and Single Quotes
    6 - Inside a Variable and Double Quotes

States 0, 1 and 2 together count as 'Not Inside Quotes'.

The pseudo-code is as follows

    if 'Found Tag Open' && 'Not Inside Quotes'
        State is now 'Inside Tag'
        End Token
        Start Token
        Append to Token

    else if 'Found Variable Open' && 'Not Inside Quotes'
        State is now 'Inside Variable'
        End Token
        Start Token
        Append to Token

    else if 'Found Single Quote' && 'Inside Tag'
        State is now 'Inside Tag and Single Quotes'
        Append to Token

    else if 'Found Single Quote' && 'Inside Tag and Single Quotes'
        State is now 'Inside Tag'
        Append to Token

    else if 'Found Double Quote' && 'Inside Tag'
        State is now 'Inside Tag and Double Quotes'
        Append to Token

    else if 'Found Double Quote' && 'Inside Tag and Double Quotes'
        State is now 'Inside Tag'
        Append to Token

    else if 'Found Single Quote' && 'Inside Variable'
        State is now 'Inside Variable and Single Quotes'
        Append to Token

    else if 'Found Single Quote' && 'Inside Variable and Single Quotes'
        State is now 'Inside Variable'
        Append to Token

    else if 'Found Double Quote' && 'Inside Variable'
        State is now 'Inside Variable and Double Quotes'
        Append to Token

    else if 'Found Double Quote' && 'Inside Variable and Double Quotes'
        State is now 'Inside Variable'
        Append to Token

    else if 'Found Tag Close' && 'Inside Tag'
        State is now 'Outside Any Tag or Variable'
        Append to Token
        End Token
        Start Token

    else if 'Found Variable Close' && 'Inside Variable'
        State is now 'Outside Any Tag or Variable'
        Append to Token
        End Token
        Start Token

    else
        Append to Token
    end
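
One way the pseudo-code above could be written in Ruby is sketched below. It is an illustration under assumptions, not the code in this PR: the method name `scan_tokens` and the symbolic state names are made up here, and a lone `}` is treated as ordinary text because the pseudo-code has no branch for it.

```ruby
def scan_tokens(source)
  # Quote transitions: [current state, quote character] => next state.
  quote_transitions = {
    [:tag, "'"] => :tag_single,           [:tag_single, "'"] => :tag,
    [:tag, '"'] => :tag_double,           [:tag_double, '"'] => :tag,
    [:variable, "'"] => :variable_single, [:variable_single, "'"] => :variable,
    [:variable, '"'] => :variable_double, [:variable_double, '"'] => :variable,
  }
  unquoted = %i[outside tag variable] # states 0, 1 and 2

  tokens = []
  token = +""
  state = :outside
  i = 0

  # "End Token" then "Start Token": store the current token, begin a new one.
  end_token = lambda do
    tokens << token unless token.empty?
    token = +""
  end

  while i < source.length
    two = source[i, 2]
    ch = source[i]

    if (two == "{%" || two == "{{") && unquoted.include?(state)
      state = two == "{%" ? :tag : :variable
      end_token.call
      token << two
      i += 2
    elsif (next_state = quote_transitions[[state, ch]])
      state = next_state
      token << ch
      i += 1
    elsif (two == "%}" && state == :tag) || (two == "}}" && state == :variable)
      state = :outside
      token << two
      end_token.call
      i += 2
    else
      token << ch
      i += 1
    end
  end

  end_token.call
  tokens
end

scan_tokens("a {{ variable | prepend: '{' | append: '}' }} b")
# => ["a ", "{{ variable | prepend: '{' | append: '}' }}", " b"]
```

Keeping the quote transitions in a lookup table collapses the eight quote branches of the pseudo-code into one, which is one possible way to reduce repeated condition checks.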

This resolves the following PRs and issues, and includes their tests:

Closes #701
Closes #779
Closes #624
Closes #623
Closes #344
Closes #213

This will need a matching PR for liquid-c, and this Ruby version still needs improvements, but the concept is easily implemented in both.

@Shopify/guardians-of-the-liquid @Shopify/liquid

ashmaroli · 2019-09-20T07:56:27Z

This approach seems to have increased memory usage for the parse phase significantly:

  +-----------------+-------------------------+--------------------------+
  | Phase           | Parse                   | Render                   |
  +-----------------+-------------------------+--------------------------+
- | Total allocated | 4.53 MB (53197 objects) | 979.68 kB (8827 objects) |
+ | Total allocated | 7.23 MB (88561 objects) | 979.68 kB (8827 objects) |
  | Total retained  | 0 B (0 objects)         | 49.70 kB (276 objects)   |
  +-----------------+-------------------------+--------------------------+

shopmike · 2019-09-20T08:54:57Z

Oh yeah, performance is likely not great at the moment. I was just focusing on getting it to pass. This approach can be highly efficient, though, so it just needs to be optimised.

ashmaroli · 2019-09-20T08:59:36Z

Great to know!
I wasn't complaining. Just thought I'd post it so that you'd have it in the back of your head while you develop the concept further.

shopmike · 2019-09-20T09:04:19Z

This regex wouldn't stay either; it needs to be replaced by a scanner. It's just a quick hack to get this moving:

source.split(/({%|{{|"|'|}}|%}|})/om).each
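
To make the intermediate result of that split concrete, the sketch below applies the same pattern to a small input. The sample string is made up for illustration. Because the pattern is wrapped in a capture group, `String#split` keeps the delimiters in the result, and the state logic then has to stitch the pieces back into tokens.

```ruby
source = "a {{ 'x' }} b"

# Same pattern as in the quick hack: the capture group makes split
# return the delimiters alongside the text between them.
pieces = source.split(/({%|{{|"|'|}}|%}|})/om)
# => ["a ", "{{", " ", "'", "x", "'", " ", "}}", " b"]
```

Each piece is a separate String allocation, which is one plausible contributor to the higher object count reported for the parse phase.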

shopmike · 2019-09-20T09:04:53Z

And the conditional statements double-check things, which should be optimised.

shopmike · 2019-09-20T10:17:54Z

Plus, the point of all this is to make the following code valid (basically, curly brackets inside strings that are inside tags):

{{ variable | prepend: '{' | append: '}' }}

and

{{ 'blah {{ yy }}' | replace: '{{', 'xx' }}
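
To see why quote-unaware matching struggles with the second example, consider a deliberately simplified non-greedy pattern. This pattern is an assumption for illustration only and is not Liquid's actual tokenizer regex. It stops at the first `}}` it meets, even though that `}}` sits inside a string.

```ruby
# Simplified, quote-unaware pattern (illustrative only, not Liquid's regex).
naive_variable = /\{\{.*?\}\}/m

source = "{{ 'blah {{ yy }}' | replace: '{{', 'xx' }}"
first_match = source[naive_variable]
# => "{{ 'blah {{ yy }}"
```

The match ends in the middle of the string literal, so the rest of the variable is cut off. Tracking quote state while scanning avoids this.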

shopmike · 2019-09-23T11:23:00Z

I'm also starting to wonder whether we should solve the following issues, since they can be fixed with a few more tokens to scan for.

Allow escaping inside double-quoted strings, so that both a single quote and a double quote can be used in one string, which is not currently possible:

{% assign = "we can handle \"double quotes\" and 'single quotes'" %}

This does not break backwards compatibility for the tokenizer, as double quotes are not currently possible inside double quotes. However, it will break string detection inside tags.

The new liquid tag splits on newlines, so strings inside it cannot contain a literal newline. An escape would let them express one:

{% liquid 
    assign = "we can handle\n new lines" 
%}

This can possibly break templates that use "\n" in a string today. The liquid.format() command could be a way to migrate templates without issues.

Carriage returns and tabs should be supported too, as these are also common, less so on the web than on Windows and in Tab Separated Values:

{% assign = "we can handle returns\r and tabs \t" %}

This can possibly break templates that use "\r" or "\t" in a string today. The liquid.format() command could be a way to migrate templates without issues.

Finally, because we support the above, we have to handle escaping the escape character itself:

{% assign = "we can handle a real backslash next to the character n \\n we are not on a newline" %}

This can possibly break templates that use "\" before a control character in a string today. The liquid.format() command could be a way to migrate templates without issues.
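
The escapes proposed above could be decoded along the following lines. This is only a sketch: the method name `unescape_double_quoted` is hypothetical, it expects the text between the quotes, and leaving unknown escapes such as `\q` untouched is an assumption rather than something the proposal specifies.

```ruby
# Hypothetical decoder for the body of a double-quoted string literal.
def unescape_double_quoted(body)
  escapes = {
    "n"  => "\n", # newline
    "r"  => "\r", # carriage return
    "t"  => "\t", # tab
    "\"" => "\"", # escaped double quote
    "\\" => "\\", # escaped backslash
  }

  # Each match is a backslash plus the character after it. The pair is
  # consumed in one step, so "\\n" becomes a backslash followed by "n"
  # rather than a newline. Unknown escapes are returned unchanged.
  body.gsub(/\\(.)/m) { |pair| escapes.fetch(pair[1], pair) }
end
```

Consuming the backslash and the following character as a pair is what makes the escaped backslash case work without a second pass.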

shopmike · 2019-09-23T11:37:57Z

An alternative proposal for the changes above is to bring in a third quoting method using backticks.

Allow escaping inside backticks, so that single quotes, double quotes and backticks can all be used in one string, which is not currently possible:

{% assign = `we can handle "double quotes" and 'single quotes' and \`back ticks\`` %}

This does not break backwards compatibility for the tokenizer, as backticks are new functionality. However, string detection in other areas will need to be updated.

The new liquid tag splits on newlines, so strings inside it cannot contain a literal newline. An escape would let them express one:

{% liquid 
    assign = `we can handle\n new lines`
%}

This does not break backwards compatibility for the tokenizer, as backticks are new functionality. However, string detection in other areas will need to be updated.

Carriage returns and tabs should be supported too, as these are also common, less so on the web than on Windows and in Tab Separated Values:

{% assign = `we can handle returns\r and tabs \t` %}

This does not break backwards compatibility for the tokenizer, as backticks are new functionality. However, string detection in other areas will need to be updated.

Finally, because we support the above, we have to handle escaping the escape character itself:

{% assign = `we can handle a real backslash next to the character n \\n we are not on a newline` %}

This does not break backwards compatibility for the tokenizer, as backticks are new functionality. However, string detection in other areas will need to be updated.
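
Updating string detection for backticks mostly means finding where such a literal ends without being fooled by an escaped backtick. A minimal sketch follows, assuming a hypothetical helper `backtick_literal_body` that is not part of the PR. It returns the raw text between the first pair of unescaped backticks and leaves decoding of the escapes to a later step.

```ruby
# Hypothetical helper: raw body of the first backtick-quoted literal, or nil.
def backtick_literal_body(source)
  # Between the backticks, accept either a backslash plus any character
  # (an escape pair) or any character that is neither backtick nor backslash.
  match = source.match(/`((?:\\.|[^`\\])*)`/m)
  match && match[1]
end

backtick_literal_body('x = `we "can" \`tick\`` end')
# => "we \"can\" \\`tick\\`"
```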

Initial concept

3ffae9a

Remove redunant logic

331b2a2

hc0503 approved these changes Nov 9, 2020

This was referenced Jun 16, 2021

Escape charactors in string literals osteele/liquid#45

Open

Allow escaping in string literals smartoperation/liquid#1

Merged
