Modifying the token stream #4914

The-Futurist · 2025-12-31T14:26:02Z

I work in C#.

I've created a solid proof of concept but need some confidence this carries no hidden risks.

As part of language preprocessing, I want to operate on the token stream (list) and remove/insert tokens, based on preprocessing metadata obtained after initial parsing. This is the phase 1 parse (preprocessing directives are just part of the grammar).

This works fine using AddRange/RemoveRange etc. Once I've updated the token list after all preprocessing, I wrap the updated list in a token stream class, then reparse that stream to get my actual CST (macros expanded, preprocessor directives removed, etc.). This is the phase 2 parse.

This performs nested include file expansion, for example: I read the header file, tokenize it, remove the original include directive tokens, and insert the tokens generated when the header file was tokenized.

This works well, the code is easy to understand and as a strategy it seems solid. I understand this could be done textually but operating directly with the tokens seems much cleaner for preprocessing.
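The include expansion described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: `Tok` is a hypothetical stand-in for a real token type (with the ANTLR runtime the list would hold `IToken` instances), and `ReplaceDirective` is an invented helper name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a lexer token; real ANTLR tokens carry more fields.
public sealed class Tok
{
    public string Text { get; }
    public int TokenIndex { get; set; }

    public Tok(string text, int tokenIndex)
    {
        Text = text;
        TokenIndex = tokenIndex;
    }
}

public static class IncludeSplice
{
    // Removes `count` directive tokens starting at list position `start`
    // and inserts the header's tokens in their place.
    public static void ReplaceDirective(List<Tok> tokens, int start, int count, IEnumerable<Tok> headerTokens)
    {
        tokens.RemoveRange(start, count);
        tokens.InsertRange(start, headerTokens);
    }

    public static void Main()
    {
        var tokens = new List<Tok>
        {
            new Tok("#include", 0), new Tok("\"a.h\"", 1),
            new Tok("int", 2), new Tok("x", 3), new Tok(";", 4),
        };
        // Tokens produced by lexing the header separately, so their indexes start at 0 again.
        var header = new List<Tok>
        {
            new Tok("typedef", 0), new Tok("int", 1), new Tok("T", 2), new Tok(";", 3),
        };

        ReplaceDirective(tokens, 0, 2, header);

        Console.WriteLine(string.Join(" ", tokens.Select(t => t.Text)));
        Console.WriteLine(string.Join(",", tokens.Select(t => t.TokenIndex)));
    }
}
```

After the splice the stored indexes read 0,1,2,3,2,3,4: the header's tokens keep the indexes from their own lexing pass, so the combined list is no longer numbered consecutively. For nested includes, the caller would presumably rescan the inserted range for further directives.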

But the TokenIndex values become discontinuous, which suggests I should walk the list and reset the index on every token once preprocessing is complete, before the second parse. Is that important? Does the parser care about the TokenIndex property? Is this approach reasonable? Incidentally, I opted months ago not to use listeners/visitors; I manually walk the CST to create my AST, and this has been very successful.
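Resetting the index on every token is a single pass over the list. The sketch below uses a hypothetical `Tok` class so that it runs on its own. With the ANTLR C# runtime the settable `TokenIndex` lives on `IWritableToken` (which `CommonToken` implements), so real code would need to cast or filter for that interface. As far as I know, the runtime's buffered token streams also assign an index to each writable token as they fetch it from the token source, which would make a manual pass redundant when the list is re-wrapped in a fresh stream; that is worth checking against the runtime version in use.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a token with a settable index.
public sealed class Tok
{
    public string Text { get; }
    public int TokenIndex { get; set; }

    public Tok(string text, int tokenIndex)
    {
        Text = text;
        TokenIndex = tokenIndex;
    }
}

public static class Reindexer
{
    // Makes each token's index equal to its position in the list.
    public static void Reindex(IList<Tok> tokens)
    {
        for (int i = 0; i < tokens.Count; i++)
        {
            tokens[i].TokenIndex = i;
        }
    }

    public static void Main()
    {
        // Indexes left over from splicing separately lexed tokens together.
        var tokens = new List<Tok>
        {
            new Tok("typedef", 0), new Tok("int", 1), new Tok("T", 2), new Tok(";", 3),
            new Tok("int", 2), new Tok("x", 3), new Tok(";", 4),
        };

        Reindex(tokens);

        Console.WriteLine(string.Join(",", tokens.Select(t => t.TokenIndex)));
    }
}
```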

Answered by kaby76

kaby76 · 2025-12-31T15:33:52Z

You can certainly reindex each token in the token stream after an edit. However, I do know tokens are referenced in at least the start and stop pointers of a tree node. If you remove a token in the token stream that a parse tree node referenced, you will need to update the start/stop pointers, too. But it depends on what you are planning to do with the tree editing. The data structure is not designed to support fast, extensive, independent tree-to-tree edits. You'll need a tree representation if you cannot reconstruct the tree from a serialized representation, or if you can't do so quickly.

1 reply

The-Futurist Dec 31, 2025
Author

Thanks, OK, that reassures me somewhat. The processing is working well: once the token list has been through preprocessing, it gets parsed again and it seems fine. I reindexed all the tokens, but even if I don't, it seems OK.

The-Futurist · 2025-12-31T18:01:02Z

@kaby76

By the way, would it be sensible then to convert the List<IToken> back to a stream (if that's possible) and just retokenize that? That would clear up all the token index and line/column metadata.

1 reply

kaby76 Jan 1, 2026

You could serialize a list of tokens (or token stream) by just printing the Text for each token. That would give you a text file, which you could retokenize or even reparse with the grammar or a different grammar. None of the tree-editing rules I've written have ever needed the line and column attributes of the DOM tree (computed from the Antlr4 parse tree) for rule matching in tree editing. In fact, for Trash, trparse only adds line and column information if it's requested, and most of the time, I never use that data.
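Serializing by printing each token's text might look like the sketch below. `Tok` is again a hypothetical stand-in, and using type -1 as the end-of-file marker is an assumption made for this example. Two caveats: the result only lexes the same way if whitespace and comment tokens are still in the list (for instance routed to a hidden channel rather than skipped), because otherwise adjacent tokens such as `int` and `x` would run together; and the end-of-file token should be left out, since its text is a placeholder rather than source text.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a token; Type -1 marks end of file in this sketch.
public sealed class Tok
{
    public const int EofType = -1;

    public string Text { get; }
    public int Type { get; }

    public Tok(string text, int type)
    {
        Text = text;
        Type = type;
    }
}

public static class TokenSerializer
{
    // Concatenates token text in list order, skipping the end-of-file marker.
    public static string ToSource(IEnumerable<Tok> tokens)
    {
        var sb = new StringBuilder();
        foreach (var t in tokens)
        {
            if (t.Type == Tok.EofType)
            {
                continue;
            }
            sb.Append(t.Text);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var tokens = new List<Tok>
        {
            new Tok("int", 1), new Tok(" ", 2), new Tok("x", 3), new Tok(";", 4),
            new Tok("<EOF>", Tok.EofType),
        };

        // The resulting string can be fed to a fresh lexer for the second parse.
        Console.WriteLine(TokenSerializer.ToSource(tokens));
    }
}
```

Retokenizing this text yields fresh index, line, and column values, though they then refer to the expanded text rather than to the original source files.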
