Tokenizer doesn't include new line chars in "length"

Repost from https://github.com/squizlabs/PHP_CodeSniffer/issues/3601:

> The following code sample:
> ```php
> <?php
> 
>     // comment.
>     function foo() {}
> 
> ```
> 
> ... will tokenize as follows:
> ```
> Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
> -------------------------------------------------------------------------
>   0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [  5]: <?php
> 
>   1 | L2 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:
> 
>   2 | L3 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  4]: ⸱⸱⸱⸱
>   3 | L3 | C  5 | CC 0 | ( 0) | T_COMMENT                  | [ 11]: // comment.
> 
>   4 | L4 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  4]: ⸱⸱⸱⸱
>   5 | L4 | C  5 | CC 0 | ( 0) | T_FUNCTION                 | [  8]: function
>   6 | L4 | C 13 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
>   7 | L4 | C 14 | CC 0 | ( 0) | T_STRING                   | [  3]: foo
>   8 | L4 | C 17 | CC 0 | ( 0) | T_OPEN_PARENTHESIS         | [  1]: (
>   9 | L4 | C 18 | CC 0 | ( 0) | T_CLOSE_PARENTHESIS        | [  1]: )
>  10 | L4 | C 19 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
>  11 | L4 | C 20 | CC 0 | ( 0) | T_OPEN_CURLY_BRACKET       | [  1]: {
>  12 | L4 | C 21 | CC 0 | ( 0) | T_CLOSE_CURLY_BRACKET      | [  1]: }
>  13 | L4 | C 22 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:
> ```
> 
> 
> Looking at the above raised some questions for me regarding the `length` provided in the token array, as it does not seem to include new line characters. Is this intentional?

---

I never did get an answer to this question.

Some background behind the question:
* The `length` key is primarily an efficiency feature: it can prevent lots of calls to the `strlen()` function.
* The `length` key is often used in place of an indent calculation. For tokens which can include a new line character and may include leading indentation (like, amongst others, `T_INLINE_HTML` and `T_COMMENT`), an extra calculation is needed: `length - strlen(ltrim(content))`. However, the `length` is unreliable here, as the _actual_ length may be longer because the new line character is not included. This means that if these types of tokens _could_ be the target of the length determination, you will still always need to call `strlen()` instead of being able to use the pre-calculated `length`.
    This is counter-intuitive, inefficient, and means that contributors need to have that detailed token knowledge.
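
To illustrate the mismatch, here is a minimal standalone sketch. The token array is hand-written and hypothetical (it only mimics the `content` and `length` keys, on the assumption that a `T_INLINE_HTML` line with leading indentation follows the same pattern as the dump above), so it runs without PHP_CodeSniffer:

```php
<?php

// Hypothetical token, shaped like an entry in the token array. As in the
// dump above, 'length' is assumed not to count the trailing new line.
$token = [
    'content' => "    <div>\n",
    'length'  => 9,
];

// Indent derived from the pre-calculated length: the new line is still
// part of the left-trimmed content, so the result is one too small.
$indentFromLength = $token['length'] - strlen(ltrim($token['content']));

// Indent derived from a fresh strlen() call: correct, but it bypasses
// the pre-calculated 'length' entirely.
$indentFromStrlen = strlen($token['content']) - strlen(ltrim($token['content']));

echo $indentFromLength, PHP_EOL; // 3
echo $indentFromStrlen, PHP_EOL; // 4
```

The real indent is four spaces, so only the `strlen()`-based calculation gives the right answer for this kind of token.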
