A simple, robust, and blazingly fast lexical analyzer generator in Luau.
Important
v1.0.0 is a complete rewrite of the previous version. It provides a higher degree of flexibility while offering better performance than its predecessor, v0.3.3. It does this by effectively utilizing pre-order tree traversal and Lua pattern matching.
Compared to the previous version, which weighed in at ~650 lines of code spread across 5 files, it is now ~270 lines of code in a single file, all while being fully strictly typed, something the previous version just couldn't achieve due to weird structural decisions that I couldn't undo.
Note
Please note that throughout this README, whenever the phrase "previous version" or anything similar is mentioned, we are referring to v0.3.3.
The Token produced by the lexers generated by Moonslice will always be formatted as follows:
```luau
type Token = {
	Lexeme: string,
	Type: string
}
```

A Lexeme refers to the actual character sequence in the text we're lexing that matches the pattern for a token, whereas the Type identifies which kind of token it is.
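For instance, assuming a LOCAL static token is defined (as in the example further below), lexing the keyword `local` would yield a token like this; the values are illustrative:

```luau
-- Illustrative token value for the keyword "local",
-- assuming a LOCAL static token has been defined:
local token: Token = {
	Lexeme = "local",
	Type = "LOCAL"
}
```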
Requiring Moonslice returns a function that accepts two tables:
- a table of pattern tokens: tokens whose lexemes are loosely defined using Lua string patterns.
- a table of static tokens: tokens whose lexemes are defined by one or more static strings.
Pattern-based tokens can use capture groups () to specify which portion of the match should be treated as the token’s lexeme.
Warning
Only one capture group is allowed. While this might seem restrictive, it's an intentional design decision. Supporting multiple captures would require the Lexeme field to sometimes be a table, adding unnecessary typechecking complexity and complicating operations on the Token struct.
Tip
If you still need to extract multiple pieces of data from a pattern, consider capturing them as a whole cluster. Then, whenever such a token is encountered, apply a post-processing function that further dissects its contents, as sketched below.
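For instance, here is a minimal sketch of that approach; the KEY_VALUE token and the dissection helper are hypothetical, not part of Moonslice:

```luau
-- Hypothetical pattern token that captures a whole "key=value" cluster:
-- KEY_VALUE = "(%w+=%w+)"

-- Post-processing helper that dissects the clustered lexeme afterwards.
local function dissectKeyValue(token)
	-- token.Lexeme is e.g. "answer=42"
	return string.match(token.Lexeme, "^(%w+)=(%w+)$")
end
```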
Consider the basic example below:
```luau
local Moonslice = require(game.ReplicatedStorage:WaitForChild("Moonslice"))

local Lexer = Moonslice({
	STRING = "\"([^\n\r]*)\"",
	BIN_NUMBER = "0[bB]([01]+)",
	HEX_NUMBER = "0[xX](%x+)",
	SCI_NUMBER = "%d+e%d+",
	DEC_NUMBER = "(%d+)"
}, {
	LOCAL = "local",
	FUNCTION = "function",
	IF = "if",
	THEN = "then",
	ELSE = "else",
	END = "end"
})
```

Tokens can also be defined using multiple lexeme strings. This applies to both static and pattern-based tokens.
```luau
-- ...
TRUE = { "true", "TRUE" },
-- ...
```

You can also specify whether a token should always be dropped (i.e., ignored by the lexer). This is optional, and will default to false when unspecified.
```luau
-- ...
WHITESPACE = { "%s+", Drop = true },
-- ...
```

The precedence of matching for identifiers over static "keyword" tokens (e.g., "function", "const") is automatically handled. The default pattern for identifiers is `[%a_][%a%d_]*`, but you can override it by assigning a NAME field in the pattern token table:
```luau
-- ...
NAME = "[%a_][%a%d_]*",
-- ...
```

Note
Tokens whose static lexemes match the identifier pattern are given special precedence over other tokens so that match conflicts can be immediately resolved. This design is deliberate to make "keyword"-like token recognition reliable and unambiguous.
Warning
While you can override the identifier matching pattern, you unfortunately cannot rename its token type (NAME) without modifying the source code.
Moreover, other static tokens (besides the ones described above) can also take precedence over pattern tokens if they consist only of punctuation characters (i.e., they match "%p+"). This resolves not only ambiguity but also improves performance, as many language tokens are represented by punctuation and can be matched quickly (see the sketch after the warning below).
Warning
While these behaviors are designed to resolve conflicts, they may limit flexibility in how you define your lexer, depending on your use case.
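As an illustration of the punctuation rule above, static tokens such as these hypothetical ones consist purely of punctuation and would therefore be matched ahead of pattern tokens:

```luau
-- Hypothetical static tokens made entirely of punctuation characters
-- (they match "%p+"), which Moonslice prioritizes over pattern tokens:
local STATIC_TOKENS = {
	PLUS = "+",
	MINUS = "-",
	ASSIGN = "=",
	EQUALS = "==",
}
```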
Once a lexer is constructed, you have six methods and three primary properties of interest:
`:Reset(Input, Source)` resets the internal state of the lexer in preparation to lexically analyze the given `Input` text.
- `Input`: The raw source code (or string) to tokenize.
- `Source` (optional): A label or name for the input (e.g., a filename). If omitted, defaults to `"unknown"`. Used only for debugging and error messages.
- This method also clears `TokenNow` and `TokensAhead`.
`:Next()` advances the lexer to the next token and returns it.
- Updates the `TokenNow` property to the returned token.
- If `TokensAhead` has peeked tokens, `:Next()` will consume the first one; otherwise, it parses a new token from the input.
`:Peek(Amount)` returns the token `Amount` positions ahead of the current one without advancing the lexer.
- If `Amount` is not specified, it defaults to `1` (i.e., peeks at the next token).
- The returned token is also cached in `TokensAhead`, so repeated peeks do not reparse the input.
There is also a method that returns the current line number the lexer is reading from.
- Useful for error reporting and diagnostics.
- Starts at 1 and increments as the lexer encounters newline characters.
`:LexicalError(Message, Token)` raises a lexical error, typically during the tokenization phase.
- `Message`: A description of what went wrong (e.g., "Unexpected character").
- `Token` (optional): The specific token or position to attach to the error. If omitted, the lexer may fall back to the current position.
- This function usually includes the line number and `Source` label (if set), making debugging easier.
A companion method raises a syntax error, typically during parsing.
- Functionally equivalent to `:LexicalError(Message, TokenNow)`, using the current token.
- Intended for situations where a valid token was produced, but it doesn't fit the expected grammar (e.g., "Expected `)` after expression").
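As an illustration, a parser built on top of a Moonslice lexer might wrap this in a small helper. The sketch below is hypothetical: it assumes the method is exposed as `:SyntaxError` and that `:Next()` returns nil once the input is exhausted.

```luau
-- Hypothetical parser helper; not part of Moonslice's API.
local function expect(lexer, expectedType: string)
	local token = lexer:Next()
	if token == nil or token.Type ~= expectedType then
		-- Raises a syntax error attached to the current token (TokenNow).
		lexer:SyntaxError(`Expected {expectedType}`)
	end
	return token
end
```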
`TokenNow`: The current token being viewed or processed.
- Updated each time `:Next()` is called.
- Represents the most recently consumed token.
`TokensAhead`: A list of lookahead tokens that have been generated by `:Peek()`.
- Each call to `:Peek()` extends this array if needed.
- `:Next()` will consume tokens from this array before lexing new ones.
`Source`: The name or label of the current input being analyzed.
- Defaults to `"unknown"` if not explicitly provided via `:Reset()`.
- Used only for contextual debugging (e.g., in error messages or tracebacks).
Note
These are not all of the properties, but the ones you might find of use.
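Putting it all together, here is a minimal usage sketch. One assumption to flag: the loop below assumes `:Next()` returns nil once the input is exhausted; verify the exact end-of-input sentinel against the source.

```luau
local Moonslice = require(game.ReplicatedStorage:WaitForChild("Moonslice"))

local Lexer = Moonslice({
	NUMBER = "(%d+)",
	WHITESPACE = { "%s+", Drop = true },
}, {
	LOCAL = "local",
})

Lexer:Reset("local answer 42", "example.luau")

print(Lexer:Peek().Type) -- peek at the first token without consuming it

while true do
	local token = Lexer:Next()
	if token == nil then
		break -- assumed end-of-input sentinel
	end
	print(`{token.Type} "{token.Lexeme}"`)
end
```

Assuming the default NAME pattern picks up `answer`, this should produce a LOCAL, a NAME, and a NUMBER token in order, with the whitespace dropped.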
No constraints are known other than those already mentioned in the lexer construction section.
Moreover — unlike the previous version — token lexemes are no longer automatically coerced into values like numbers when possible. Moonslice is now a “what you define is what you get” lexer generator — meaning any transformation of a token’s lexeme is delegated entirely to the recipient.
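For example, a consumer that wants actual numbers can coerce lexemes itself; a minimal sketch, reusing the NUMBER token from the usage sketch above:

```luau
-- Lexeme coercion is now the recipient's job:
local token = Lexer:Next()
if token and token.Type == "NUMBER" then
	local value = tonumber(token.Lexeme) -- e.g. "42" -> 42
	print(value)
end
```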
After running the test on my laptop, equipped with an 11th Gen 3.0 GHz Intel Core i3 processor and 8 GB of 3.2 GHz RAM, I observed that producing a token takes between 0.6 and 25 μs (and yes, we're talking microseconds). On average, a token takes about 2 μs to produce. This translates to an average throughput of approximately 500,000 tokens per second.
Note
If you wish to run the tests yourself, make sure to comment out the warn or print calls inside the while loop, as they add roughly 3-4 microseconds to the average token production time.
When compared to the performance of the previous version, which took between 0.3 and 40 μs per token, this version is about 1.5× faster on average (depending on the old average measurement, which was around 2.5 μs).
Installation can be easily done by either downloading the latest release and importing it directly into Studio, or by using Wally.
If you're using Wally, simply add `Moonslice = "anothersubatomo/moonslice@1.0.0"` to the dependencies in your `wally.toml` file and run `wally install`.
This module was actually a side product of another project of mine called Eclipse, which you'll hear more about once it's out. To be brief, you can think of it as a less ambitious version of LLVM.
This rewrite was hugely influenced by @CosmicToast's patok, a tokenizer (or more accurately, a tokenizer generator) written in pure Lua that works by continuously trying to match a series of user-defined patterns. Upon discovering it, I was immediately struck by how simple yet effective it was.
Along with the feeling of “Why didn’t I think of that?”.
At the time, I thought patok might completely eclipse Moonslice; it felt that good. I even wanted it to. So, I eagerly put it to the test and ran some benchmarks against Moonslice, expecting to be humbled. While it was reasonably efficient, patok v1.x-experimental* surprisingly underperformed against Moonslice v0.3.3. Token production with patok ranged from 4 to 50 μs (averaging 16 μs, or ~62,500 TPS), whereas Moonslice consistently clocked in at 0.3 to 40 μs (averaging 2.5 μs, or ~400,000 TPS).
Note
"v1.x-experimental" refers to the state of patok’s repository up to commit d540551, not an official release. Benchmarks of its average token production time we're highly inconsistent — even under identical conditions — with averages fluctuating between 7.4 μs, 12 μs, 24 μs, and sometimes spiking to 32 μs. While I'd love to do a better job, I believe these inconsistencies are out of my control.
I also discovered a small bug in this version: its recursive next() call (used for an experimental token-dropping feature) was written as self.next() instead of self:next(), resulting in an "attempt to index nil with 'tokens'" error upon entering the next function at the instance.
Fixing this didn't (and really shouldn't) noticeably impact performance, as the stack frame overhead of calling the drop function should be near negligible.
These weren't the results I expected, and they caught me off guard. After some thought, I realized patok's strength, its dynamic pattern matching, was also its weakness. Its approach incurs a time complexity closer to O(p * n) (with p being the number of patterns and n the input size). Moonslice, in contrast, uses a DFA-inspired tree structure that builds lexemes from character streams, making it more predictable and closer to O(n) in practice.
Note
While Moonslice does use Lua string patterns, it only used them to categorize the tokens, which, in hindsight, wasn't the brightest decision.
That realization was a turning point. Instead of seeing patok as competition, I saw an opportunity: why not combine the best of both approaches? So I scrapped months of work and rewrote everything from the ground up with a clearer vision. The result was Moonslice v1.0.0.
I couldn’t be happier with that decision. 😊
