A simple, robust, and blazingly fast lexical analyzer generator in Luau.
Important
v1.0.0 is a complete rewrite of the previous version. It provides a higher degree of flexibility while offering better performance than its predecessor, v0.3.3. It does this by effectively utilizing pre-order tree traversal and Lua pattern matching.
Compared to the previous version, which weighed in at ~650 lines of code spread across 5 files, it is now ~270 lines of code in a single file, all while being fully strictly typed, something the previous version just couldn't achieve due to weird structural decisions that I couldn't undo.
Note
Please note that throughout this README, whenever the phrase "previous version" or anything similar is mentioned, we are referring to v0.3.3.
The Token produced by the lexers generated by Moonslice will always be formatted as follows:
```luau
type Token = {
	Lexeme: string,
	Type: string
}
```

A Lexeme refers to the actual character sequence in the text we're lexing that matches the pattern for a token, whereas the Type identifies which kind of token it is.
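For instance, assuming a LOCAL static token is defined (as in the example further below), lexing the keyword `local` would yield a token like this; the values are illustrative:

```luau
-- Illustrative token value for the keyword "local",
-- assuming a LOCAL static token has been defined:
local token: Token = {
	Lexeme = "local",
	Type = "LOCAL"
}
```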
Requiring Moonslice returns a function that accepts two tables:
- a table of pattern tokens: tokens whose lexemes are loosely defined using Lua string patterns.
- a table of static tokens: tokens whose lexemes are defined by one or more static strings.
Pattern-based tokens can use capture groups () to specify which portion of the match should be treated as the token’s lexeme.
Warning
Only one capture group is allowed. While this might seem restrictive, it's an intentional design decision. Supporting multiple captures would require the Lexeme field to sometimes be a table, adding unnecessary typechecking complexity and complicating operations on the Token struct.
Tip
If you still need to extract multiple pieces of data from a pattern, consider capturing them as a whole cluster. Then, whenever such a token is encountered, apply a post-processing function that further dissects its contents, as sketched below.
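For instance, here is a minimal sketch of that approach; the KEY_VALUE token and the dissection helper are hypothetical, not part of Moonslice:

```luau
-- Hypothetical pattern token that captures a whole "key=value" cluster:
-- KEY_VALUE = "(%w+=%w+)"

-- Post-processing helper that dissects the clustered lexeme afterwards.
local function dissectKeyValue(token)
	-- token.Lexeme is e.g. "answer=42"
	return string.match(token.Lexeme, "^(%w+)=(%w+)$")
end
```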
Consider the basic example below:
```luau
local Moonslice = require(game.ReplicatedStorage:WaitForChild("Moonslice"))

local Lexer = Moonslice({
	STRING = "\"([^\n\r]*)\"",
	BIN_NUMBER = "0[bB]([01]+)",
	HEX_NUMBER = "0[xX](%x+)",
	SCI_NUMBER = "%d+e%d+",
	DEC_NUMBER = "(%d+)"
}, {
	LOCAL = "local",
	FUNCTION = "function",
	IF = "if",
	THEN = "then",
	ELSE = "else",
	END = "end"
})
```

Tokens can also be defined using multiple lexeme strings. This applies to both static and pattern-based tokens.
```luau
-- ...
TRUE = { "true", "TRUE" },
-- ...
```

You can also specify whether a token should always be dropped (i.e., ignored by the lexer). This is optional, and will default to false when unspecified.
```luau
-- ...
WHITESPACE = { "%s+", Drop = true },
-- ...
```

The precedence of matching for identifiers over static "keyword" tokens (e.g., "function", "const") is automatically handled. The default pattern for identifiers is `[%a_][%a%d_]*`, but you can override it by assigning a NAME field in the pattern token table:
```luau
-- ...
NAME = "[%a_][%a%d_]*",
-- ...
```

Note
Tokens whose static lexemes match the identifier pattern are given special precedence over other tokens so that match conflicts can be immediately resolved. This design is deliberate to make "keyword"-like token recognition reliable and unambiguous.
Warning
While you can override the identifier matching pattern, you unfortunately cannot rename its token type (NAME) without modifying the source code.
Moreover, other static tokens (besides the ones described above) can also take precedence over pattern tokens if they consist only of punctuation characters (i.e., they match "%p+"). This resolves not only ambiguity but also improves performance, as many language tokens are represented by punctuation and can be matched quickly (see the sketch after the warning below).
Warning
While these behaviors are designed to resolve conflicts, they may limit flexibility in how you define your lexer, depending on your use case.
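As an illustration of the punctuation rule above, static tokens such as these hypothetical ones consist purely of punctuation and would therefore be matched ahead of pattern tokens:

```luau
-- Hypothetical static tokens made entirely of punctuation characters
-- (they match "%p+"), which Moonslice prioritizes over pattern tokens:
local STATIC_TOKENS = {
	PLUS = "+",
	MINUS = "-",
	ASSIGN = "=",
	EQUALS = "==",
}
```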
Once a lexer is constructed, you have six methods and three primary properties of interest:
`:Reset(Input, Source)` resets the internal state of the lexer in preparation to lexically analyze the given `Input` text.
- `Input`: The raw source code (or string) to tokenize.
- `Source` (optional): A label or name for the input (e.g., a filename). If omitted, defaults to `"unknown"`. Used only for debugging and error messages.
- This method also clears `TokenNow` and `TokensAhead`.
`:Next()` advances the lexer to the next token and returns it.
- Updates the `TokenNow` property to the returned token.
- If `TokensAhead` has peeked tokens, `:Next()` will consume the first one; otherwise, it parses a new token from the input.
`:Peek(Amount)` returns the token `Amount` positions ahead of the current one without advancing the lexer.
- If `Amount` is not specified, it defaults to `1` (i.e., peeks at the next token).
- The returned token is also cached in `TokensAhead`, so repeated peeks do not reparse the input.
There is also a method that returns the current line number the lexer is reading from.
- Useful for error reporting and diagnostics.
- Starts at 1 and increments as the lexer encounters newline characters.
`:LexicalError(Message, Token)` raises a lexical error, typically during the tokenization phase.
- `Message`: A description of what went wrong (e.g., "Unexpected character").
- `Token` (optional): The specific token or position to attach to the error. If omitted, the lexer may fall back to the current position.
- This function usually includes the line number and `Source` label (if set), making debugging easier.
A companion method raises a syntax error, typically during parsing.
- Functionally equivalent to `:LexicalError(Message, TokenNow)`, using the current token.
- Intended for situations where a valid token was produced, but it doesn't fit the expected grammar (e.g., "Expected `)` after expression").
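As an illustration, a parser built on top of a Moonslice lexer might wrap this in a small helper. The sketch below is hypothetical: it assumes the method is exposed as `:SyntaxError` and that `:Next()` returns nil once the input is exhausted.

```luau
-- Hypothetical parser helper; not part of Moonslice's API.
local function expect(lexer, expectedType: string)
	local token = lexer:Next()
	if token == nil or token.Type ~= expectedType then
		-- Raises a syntax error attached to the current token (TokenNow).
		lexer:SyntaxError(`Expected {expectedType}`)
	end
	return token
end
```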
`TokenNow`: The current token being viewed or processed.
- Updated each time `:Next()` is called.
- Represents the most recently consumed token.
`TokensAhead`: A list of lookahead tokens that have been generated by `:Peek()`.
- Each call to `:Peek()` extends this array if needed.
- `:Next()` will consume tokens from this array before lexing new ones.
`Source`: The name or label of the current input being analyzed.
- Defaults to `"unknown"` if not explicitly provided via `:Reset()`.
- Used only for contextual debugging (e.g., in error messages or tracebacks).
Note
These are not all of the properties, but the ones you might find of use.
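Putting it all together, here is a minimal usage sketch. One assumption to flag: the loop below assumes `:Next()` returns nil once the input is exhausted; verify the exact end-of-input sentinel against the source.

```luau
local Moonslice = require(game.ReplicatedStorage:WaitForChild("Moonslice"))

local Lexer = Moonslice({
	NUMBER = "(%d+)",
	WHITESPACE = { "%s+", Drop = true },
}, {
	LOCAL = "local",
})

Lexer:Reset("local answer 42", "example.luau")

print(Lexer:Peek().Type) -- peek at the first token without consuming it

while true do
	local token = Lexer:Next()
	if token == nil then
		break -- assumed end-of-input sentinel
	end
	print(`{token.Type} "{token.Lexeme}"`)
end
```

Assuming the default NAME pattern picks up `answer`, this should produce a LOCAL, a NAME, and a NUMBER token in order, with the whitespace dropped.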
No constraints are known other than those already mentioned in the lexer construction section.
Moreover — unlike the previous version — token lexemes are no longer automatically coerced into values like numbers when possible. Moonslice is now a “what you define is what you get” lexer generator — meaning any transformation of a token’s lexeme is delegated entirely to the recipient.
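For example, a consumer that wants actual numbers can coerce lexemes itself; a minimal sketch, reusing the NUMBER token from the usage sketch above:

```luau
-- Lexeme coercion is now the recipient's job:
local token = Lexer:Next()
if token and token.Type == "NUMBER" then
	local value = tonumber(token.Lexeme) -- e.g. "42" -> 42
	print(value)
end
```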
After running the test on my laptop, equipped with an 11th Gen 3.0 GHz Intel Core i3 processor and 8 GB of 3.2 GHz RAM, I observed that producing a token takes between 0.6 and 25 μs (and yes, we're talking microseconds). On average, a token takes about 2 μs to produce. This translates to an average throughput of approximately 500,000 tokens per second.
Note
If you wish to run the tests yourself, make sure to comment out the warn or print calls inside the while loop, as they add roughly 3-4 microseconds to the average token production time.
When compared to the performance of the previous version, which took between 0.3 and 40 μs per token, this version is about 1.5× faster on average (depending on the old average measurement, which was around 2.5 μs).
Installation can be easily done by either downloading the latest release and importing it directly into Studio, or by using Wally.
If you're using Wally, simply add `Moonslice = "anothersubatomo/moonslice@1.0.0"` to the dependencies in your `wally.toml` file and run `wally install`.
This module was actually a side product of another project of mine called Eclipse, which you'll hear more about once it's out. To be brief, you can think of it as a less ambitious version of LLVM.
This rewrite was hugely influenced by @CosmicToast's patok, a tokenizer (or more accurately, a tokenizer generator) written in pure Lua that works by continuously trying to match a series of user-defined patterns. Upon discovering it, I was immediately struck by how simple yet effective it was.
Along with the feeling of “Why didn’t I think of that?”.
At the time, I thought patok might completely eclipse Moonslice; it felt that good. I even wanted it to. So, I eagerly put it to the test and ran some benchmarks against Moonslice, expecting to be humbled. While it was reasonably efficient, patok v1.x-experimental* surprisingly underperformed against Moonslice v0.3.3. Token production with patok ranged from 4 to 50 μs (averaging 16 μs, or ~62,500 TPS), whereas Moonslice consistently clocked in at 0.3 to 40 μs (averaging 2.5 μs, or ~400,000 TPS).
Note
"v1.x-experimental" refers to the state of patok’s repository up to commit d540551, not an official release. Benchmarks of its average token production time we're highly inconsistent — even under identical conditions — with averages fluctuating between 7.4 μs, 12 μs, 24 μs, and sometimes spiking to 32 μs. While I'd love to do a better job, I believe these inconsistencies are out of my control.
I also discovered a small bug in this version: its recursive next() call (used for an experimental token-dropping feature) was written as self.next() instead of self:next(), resulting in an "attempt to index nil with 'tokens'" error upon entering the next function at the instance.
Fixing this didn't (and really shouldn't) noticeably impact performance, as the stack frame overhead of calling the drop function should be near negligible.
These weren't the results I expected, and they caught me off guard. After some thought, I realized patok's strength, its dynamic pattern matching, was also its weakness. Its approach incurs a time complexity closer to O(p * n) (with p being the number of patterns and n the input size). Moonslice, in contrast, uses a DFA-inspired tree structure that builds lexemes from character streams, making it more predictable and closer to O(n) in practice.
Note
While Moonslice does use Lua string patterns, it only used them to categorize the tokens, which, in hindsight, wasn't the brightest decision.
That realization was a turning point. Instead of seeing patok as competition, I saw an opportunity: why not combine the best of both approaches? So I scrapped months of work and rewrote everything from the ground up with a clearer vision. The result was Moonslice v1.0.0.
I couldn’t be happier with that decision. 😊
