`modalex`

Warning

This is an old project! I don't believe in many of these statements anymore and on this approach to handling context sensitivity.

Consider this a failed experiment.

`modalex`

A modal lexer for context-sensitive grammars, written in Rust.

PSA: If your needs are simpler, you should probably use Logos

Rust has S-Tier LangDev tooling, but very few projects adequately support context-sensitive grammars, leaving you to deal with things like whitespace significance on your own. modalex wants to be a simple and portable way to tokenize such grammars — maximizing performance is not the main goal, having a simple design that lowers the cost-barrier to supporting languages that are usually left out of the party is what we're aiming for.

When would I need this?
Is this safe to use against adversarial inputs?
The grammar's contract
Advice for designing new lexers

Goals and non-goals

A simple design reduces the surface where bugs can hide helping to ensure correct behaviour. We don't use unsafe, but we frequently unwrap under several assumptions who's responsibility to uphold lies with the grammar's authors. We also encourage authors to unwrap and to use the unreachable! macro to help the compiler optimize paths that are guaranteed to not be traversed given the grammar's design — doing this also forces you to rely heavily on fuzzing, and punishes sloppy grammar authors in a way that forces them to adequately rework their grammars into correctness;
Minimal dependencies help the project be as portable and adaptable as it can be, widening the amount of supported targets, and making it a no brainer to adopt;
No magic. The grammar you write is very close to what gets executed, leaving no place where modalex will introduce a bug into a correct grammar. We acknowlege that giving this much control to grammar authors might make modalex more masochistic to use than it could have been, and thus we welcome other people who might want to bolt some magic on top of modalex. Feedback regarding ergonomics is very much appreciated;

Future goals

ECO: Integrate with tree-sitter;
ERG: Macros to help authoring grammars;
PWR: Support hybrid grammars that intermingle UTF-8 and binary data;
SEC: Add lexer CPU budget for progression as measured by the strides;

When would I need this? What language features make this necessary?

Whitespace significant grammars: languages like Python and Haskell need this to parse indent blocks;
Open a block with multiple delimiters, and only close it when you see a matching number of delimiters: this is what powers Rust-like r###""### raw string blocks, and Lua-like --[[[ comments. It's a lovely language feature that makes it much easier to write complex strings without having to worry about escaping characters, but it unfortunately (and necessarily) makes the grammar context-sensitive. This is also equivalent to unrestricted XML's <arbitrary-string></arbitrary-string> problem, which is part of why XML is so notoriously annoying to parse;
Read a length field, then proceed to read the lenght's number of characters: admittedly more common in binary formats when you have length fields;
Things that should be lexed differently based on an out-of-parse definition. C does this in a way that can be worked around, but Scala leans pretty heavy into this for its DSL features. Only way of tackling this is providing a way to feed symbol metadata into the lexer (which is what we do);

Any comments on the security aspects of this project?

Yes, several. Given the nature of the context-sensitive grammars you're implementing, LangSec is already out the window: you hitting pathological performance scenarios either by accident or because of adversarial input is very possible, and it depends far more on the grammar and its implementation than on modalex internals: 0. Internals use no unsafe code;

We use Rust's regex library which is regular and has excellent track record;
Our design is focused on simplicity, the surface area is around 300LOC (tiny);

On our end, we also try to chaperone the grammar as much as we can: 0. By forcing all regexes to be regular you can't accidentally Re-DoS yourself in the traditional way, but you can still implement those dangerous bits yourself on top of the mode stack and shoot yourself in the foot that way;

We also detect certain cases where the lexer stops progressing in non-release builds to help you during grammar development, but still can't detect all of them, such as when the lexer is still emitting zero-length productions but not striding and stuck in a mode cycle;
Several other design decisions also incentivize the grammar to be written in a way that is very tolerant of input and always tries to progress the lexer:
- You can't write rules that don't emit a production;
- Our strategy for dealing with authors who disrespect the Grammar's contract is to panic, and in release builds we disable the ability to catch them. This makes it very painful to ship broken grammars, forces authors to setup an extensive valiadtion harness and fuzzing, and encourages production usage to sandbox the execution by either forking into an underprivileged process or virtualizing the runtime with something like WASM;

This is still all just a long way of saying that modalex is NOT RESILIENT against adversarial grammars, and that it MIGHT BE resilient to adversarial inputs, but its resilience to adversarial inputs is ENTIRELY DEPENDENT on the grammar. The failure modes for sloppy grammars include:

Non-termination
Panics

The grammar's contract

The grammar MUST process any valid UTF-8 string, and the tokenization will never stop progressing. It MUST deal with syntax errors by leveraging error productions;
You may not skip over input without emitting a token, Skipping is a cop-out for sloppy grammar authoring. You must deal with the input string in it's totality, maintaing your tokenized representation bijective with regards to the original input;
The implementation of Token::original() for your Token type will reproduce the token as a string exactly as it was in the original input. Mapping .original() over the tokens should be completely bijective, outputting the original input verbatim;
The grammar MUST panic when an invariant is violated. This is easier than trying to recover from errors, and forces you to deal with the full complexity of the possible inputs at the grammar level (where it belongs). Following this rule yields higher quality grammars by forcing you to adopt best practices, rather than making them optional;
All your regexes MUST be anchored to the beginning of the input string. Yes, you'll be sprinkling a lot of ^ characters around. We could make this smarter but it would add overhead to Lexer instantiation. We could also at least add a check that gets compiled out for release builds, but for now we don't have that (PRs welcome);
A matching Rule MUST emit a production, it can't be aborted from within the closure to try the next rule;
Tokens that were parsed in different modes WILL NOT be coalesced: this would screw with the positional tracking of state changes, which would break incremental reparsing;
You will resist the temptation to parse in your lexer. A lexer and a parser are only different in practice — in theory, they are the same thing. In fact, modalex being able to handle context-sensitive grammars probably makes it a more powerful parser than your parser. That is not a good thing, it mainly just means it is extremely unwieldy to work with. You want to restrict the scope of your lexer to "taming the grammar into something your less powerful parser can handle", so that you can write your parser leveraging mechanisms that are simpler and easier to tame, allowing for greater resilience, caching, and incrementality;

Advice for designing new lexers

This project is my 5th try at writing a modal lexer that is maximally useful and has a reasonably ergonomic API. May these battle scars serve as a lesson to others: 0. A three-concept design of modes, rules, and tokens feels very close to the complexity sweet spot. If you simplify any further things will fall off the plate, while adding more complexity feels extremely poorly justified;

I knew this going into the project but it costs me nothing to state out loud: Given the nature of context-sensitive grammars, you will never pull off a fully AOT compiled lexer (no_std is also impossible for the same reasons). You might be able to AOT subsets of it, but you'll still need to be able dynamically create contexts based on information that is only available at parse time (that is, after all, pretty much the definition of a context-sensitive grammar);
A 1:1 relationship between rules that match tokens and the tokens that exist in the grammar is harmful: decouple them. Allow grammar authors to match a token from multiple different rules, and a single rule to emit different types of tokens. Keep these two concepts orthogonal to each other in your design;
Forcing rules to be strictly owned by a mode is harmful. This leads to a significant amount of duplication, allowing rules to be reused across different modes is very useful, even if sometimes it is also a footgun;
Restricted lookahead is extremely useful. TODO
- Coalescing adjacent tokens is a great workaround to supporting advanced lookahead, and an excellent middle ground between having to support non-regular regexes or not allowing any lookaheads at all;
A one-production-per-match restriction is harmful. Yes, there are cases where issuing multiple tokens per match can be abused because a mode should have been used instead, but keeping mode proliferation in check is very helpful for performance. More importantly, it's one of the more decent ways I found to allow authors to emit zero-length sigils into the token stream to bubble context information up for the parser, and it also makes it much easier to deal with trivia tokens;
- Forcing every match to issue a production, however, is VERY USEFUL. It forces authors to use error productions and consider invalid inputs more carefully, making grammars more far more resilient than they would otherwise be;
Forcing users to treat the root node of a stack or tree as being different than the rest of the nodes is an error that too many libraries of too many languages make. It forces users to litter ifs everywhere, and duplicate things that are nearly identical. For modalex, allowing the grammar to POP the root mode by converting it to a NOP makes it MUCH easier to encode self-similar contexts into the grammar, but this problem appears in multiple places — keep it in mind, always;
Incrementally re-parsing the input is also a problem that is often not tackled by most libraries. Doing this well pretty much boils down to having a lightweight way of serializing modes and lexer state, and also restricting what the mutator has access to and can do:
- TODO

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
src		src
testing		testing
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

`modalex`

Goals and non-goals

Future goals

When would I need this? What language features make this necessary?

Any comments on the security aspects of this project?

The grammar's contract

Advice for designing new lexers

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

modalex

Goals and non-goals

Future goals

When would I need this? What language features make this necessary?

Any comments on the security aspects of this project?

The grammar's contract

Advice for designing new lexers

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`modalex`

Packages