Tokenizer doesn't include new line chars in "length" #195

Open
@jrfnl

Repost from squizlabs/PHP_CodeSniffer#3601:

The following code sample:

<?php

    // comment.
    function foo() {}

... will tokenize as follows:

Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [  5]: <?php

  1 | L2 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

  2 | L3 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  4]: ⸱⸱⸱⸱
  3 | L3 | C  5 | CC 0 | ( 0) | T_COMMENT                  | [ 11]: // comment.

  4 | L4 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  4]: ⸱⸱⸱⸱
  5 | L4 | C  5 | CC 0 | ( 0) | T_FUNCTION                 | [  8]: function
  6 | L4 | C 13 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  7 | L4 | C 14 | CC 0 | ( 0) | T_STRING                   | [  3]: foo
  8 | L4 | C 17 | CC 0 | ( 0) | T_OPEN_PARENTHESIS         | [  1]: (
  9 | L4 | C 18 | CC 0 | ( 0) | T_CLOSE_PARENTHESIS        | [  1]: )
 10 | L4 | C 19 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
 11 | L4 | C 20 | CC 0 | ( 0) | T_OPEN_CURLY_BRACKET       | [  1]: {
 12 | L4 | C 21 | CC 0 | ( 0) | T_CLOSE_CURLY_BRACKET      | [  1]: }
 13 | L4 | C 22 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

Looking at the above raised a question for me regarding the length provided in the token array, as it does not seem to include new line characters. Is this intentional?


I never did get an answer to this question.

Some background behind the question:

  • The length key is typically an efficiency feature, which can prevent lots of calls to the strlen() function.
  • The length key is often used in place of a separate indent calculation. However, for tokens which can include a new line character and may include leading indentation (like, amongst others, T_INLINE_HTML and T_COMMENT), an extra calculation is needed: length - strlen(ltrim(content)). For those tokens the length is unreliable, as the actual length may be longer due to the new line character not being included. This means that if these types of tokens could be the target of the length determination, you will still always need to call strlen() instead of being able to use the pre-calculated length.
    This is counter-intuitive, inefficient and means that contributors need to have that detailed token knowledge.
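
To illustrate the mismatch, here is a minimal, standalone sketch which does not use PHP_CodeSniffer itself. The makeToken() helper and both indent functions are hypothetical names for this example only. The helper merely imitates the behaviour observed in the token dump above, on the assumption that length equals the content length without the trailing new line.

```php
<?php
// Hypothetical helper: builds an array shaped like a PHPCS token, with
// 'length' excluding the trailing new line (assumption based on the
// token dump above, not on the actual tokenizer code).
function makeToken(string $content): array
{
    return [
        'content' => $content,
        'length'  => strlen(rtrim($content, "\r\n")),
    ];
}

// Indent calculation relying on the pre-calculated length.
function indentFromLength(array $token): int
{
    return $token['length'] - strlen(ltrim($token['content']));
}

// Indent calculation relying on an explicit strlen() call.
function indentFromStrlen(array $token): int
{
    return strlen($token['content']) - strlen(ltrim($token['content']));
}

// A comment line with four spaces of leading indentation inside the token.
$token = makeToken("    * second line\n");

echo indentFromLength($token), PHP_EOL; // 3: off by one, the new line is missing from 'length'
echo indentFromStrlen($token), PHP_EOL; // 4: the actual indentation
```

The off-by-one arises because ltrim() leaves the trailing new line in place, so it is counted on one side of the subtraction but not on the other.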
