Immediate token handling and "next token" are incompatible #4910

Open
@jordanbtucker

The spec makes the following two statements:

https://html.spec.whatwg.org/multipage/parsing.html#tokenization

When a token is emitted, it must immediately be handled by the tree construction stage. The tree construction stage can affect the state of the tokenization stage, and can insert additional characters into the stream. (For example, the script element can result in scripts executing and using the dynamic markup insertion APIs to insert characters into the stream being tokenized.)

https://html.spec.whatwg.org/multipage/parsing.html#next-token

The next token is the token that is about to be processed by the tree construction dispatcher (even if the token is subsequently just ignored).

These two statements seem to be incompatible with each other. How can the tree constructor know what the "next token" is if the tokenizer is supposed to wait for the tree constructor to finish its steps?
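To make the tension concrete, here is a minimal sketch of the "emit, then immediately handle" model as I read the first statement. The names `tokenize` and `tree_construction_dispatcher` and the `(kind, data)` token shape are hypothetical, chosen for illustration, and not taken from the spec. The point is only that the handler sees the current token and has nothing to peek at.

```python
from typing import Callable, List, Tuple

# Hypothetical token shape for this sketch: (kind, data).
Token = Tuple[str, str]


def tokenize(text: str, emit: Callable[[Token], None]) -> None:
    """Toy tokenizer: emits one token at a time and waits for emit() to return."""
    for ch in text:
        emit(("character", ch))
    emit(("eof", ""))


handled: List[Token] = []


def tree_construction_dispatcher(token: Token) -> None:
    # Only the current token is visible here. The tokenizer has not
    # produced the following token yet, so "the next token" cannot be
    # inspected from inside this call.
    handled.append(token)


tokenize("a\n", tree_construction_dispatcher)
```

In this model the tokenizer does not advance until the dispatcher returns, which is why a step that inspects the next token looks impossible to run in place.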

For example, take the following steps from the "in body" insertion mode:

When the user agent is to apply the rules for the "in body" insertion mode, the user agent must handle the token as follows:
...
A start tag whose tag name is "textarea"
Run these steps:

  1. Insert an HTML element for the token.
  2. If the next token is a U+000A LINE FEED (LF) character token, then ignore that token and move on to the next one. (Newlines at the start of textarea elements are ignored as an authoring convenience.)
  3. Switch the tokenizer to the RCDATA state.
  4. Let the original insertion mode be the current insertion mode.
  5. Set the frameset-ok flag to "not ok".
  6. Switch the insertion mode to "text".

So, if the tokenizer has just emitted the start tag token, then it is supposed to wait for the tree constructor to run these steps before parsing the next token. How does the tree constructor know whether the next token is a \n character token if the tokenizer hasn't parsed it yet? When "next token" appears, does that mean the tree constructor is giving the tokenizer permission to parse another token?
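One possible reading, which is my assumption and not something the spec states, is that the "next token" check is deferred: the tree constructor records that a leading LF should be dropped, finishes the remaining steps, and applies the check when the following token actually arrives. A minimal sketch of that reading follows. The `Token` and `TreeBuilder` classes, the `skip_next_lf` flag, and `run` are all hypothetical names, and element insertion (step 1) is omitted.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    kind: str        # "start_tag", "character", or "eof" (hypothetical shape)
    data: str = ""


@dataclass
class TreeBuilder:
    skip_next_lf: bool = False          # hypothetical flag, not a spec term
    tokenizer_state: str = "data"
    insertion_mode: str = "in body"
    original_insertion_mode: str = ""
    frameset_ok: bool = True
    text: List[str] = field(default_factory=list)

    def handle(self, token: Token) -> None:
        # The deferred "next token" check runs when that token arrives.
        if self.skip_next_lf:
            self.skip_next_lf = False
            if token.kind == "character" and token.data == "\n":
                return  # ignore the newline at the start of the textarea
        if token.kind == "start_tag" and token.data == "textarea":
            # Step 1 (insert an HTML element) is omitted in this sketch.
            self.skip_next_lf = True                            # step 2, deferred
            self.tokenizer_state = "RCDATA"                     # step 3
            self.original_insertion_mode = self.insertion_mode  # step 4
            self.frameset_ok = False                            # step 5
            self.insertion_mode = "text"                        # step 6
        elif token.kind == "character":
            self.text.append(token.data)


def run(tokens: List[Token]) -> TreeBuilder:
    """Feed tokens one at a time, as the tokenizer would emit them."""
    builder = TreeBuilder()
    for token in tokens:
        builder.handle(token)
    return builder
```

Under this reading the tokenizer never runs ahead of the tree constructor, and step 3 still takes effect before the LF is tokenized. Whether that is what the spec intends is exactly what I am asking.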

Labels: clarification (Standard could be clearer), good first issue (Ideal for someone new to a WHATWG standard or software project), topic: parser
