[v2] LALRPOP based parser#1498

Open
teofr wants to merge 16 commits into main from teofr/lalrpop-parser-squashed

Contributor

@teofr commented Dec 31, 2025

The PR is huge, but most of the changes are generated or snapshot tests. The actual changes can be summarised as:

  • We create a ParserModel, where each item from the language is transformed into a model of an LALRPOP rule
    • Rules are structured as Rust ADTs; with some small changes, they should also work for other parser generators in the future.
  • From the ParserModel we generate an actual LALRPOP grammar (grammar.lalrpop.jinja2)
  • We add a ParserOptions construct to the language definition, allowing for some grammar specific options
    • Some complex language items have their rules written by hand within the language definition; these are copied verbatim. I tried to be very didactic in the comments explaining these manual rules. If something is not clear, please let me know.
  • The pragma and assembly blocks require context switching in the lexer; until that is in place, they're basically being ignored
  • I tried to add a bunch of new tests, especially for rules written by hand
  • These are the tests that generate different ASTs between V1 and V2, please review them carefully:
    • solidity_cargo_tests cst::cst_output::generated::source_unit::braces_inside_assembly: This needs assembly blocks to be parsed correctly.
    • solidity_cargo_tests cst::cst_output::generated::source_unit::state_variable_function: V1 parses function (uint a) internal internal foo; as a function type with two internal attributes, but the second one should be an attribute of the state variable. This works as expected on V2.
    • solidity_cargo_tests cst::cst_output::generated::source_unit::unreserved_keywords: V1 can't parse uint transient;, whereas V2 correctly parses it as a uint variable named transient (an identifier, not a storage attribute).
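
As a rough illustration of the "rules as Rust ADTs" idea described above, here is a minimal sketch. All type and field names (`Rule`, `Alternative`, `render`) are hypothetical and are not taken from the PR; the PR renders the grammar through a jinja2 template, whereas this sketch builds the text by hand just to show the shape of the data.

```rust
/// Hypothetical model of one alternative of a grammar rule.
#[derive(Clone, Debug)]
struct Alternative {
    /// Symbols matched in order, e.g. ["OverrideSpecifier"].
    symbols: Vec<String>,
    /// Rust expression used as the semantic action.
    action: String,
}

/// Hypothetical model of a rule: either derived from the language
/// definition, or written by hand and copied verbatim.
#[derive(Clone, Debug)]
enum Rule {
    Derived {
        name: String,
        producing_type: String,
        alternatives: Vec<Alternative>,
    },
    Verbatim(String),
}

impl Rule {
    /// Renders the rule as LALRPOP-style grammar text.
    fn render(&self) -> String {
        match self {
            // Hand-written rules are emitted unchanged.
            Rule::Verbatim(text) => text.clone(),
            Rule::Derived { name, producing_type, alternatives } => {
                let mut out = format!("{name}: {producing_type} = {{\n");
                for alt in alternatives {
                    // Each symbol is wrapped in <...> so its value is captured.
                    let symbols = alt
                        .symbols
                        .iter()
                        .map(|s| format!("<{s}>"))
                        .collect::<Vec<_>>()
                        .join(" ");
                    out.push_str(&format!("    {symbols} => {},\n", alt.action));
                }
                out.push_str("};\n");
                out
            }
        }
    }
}
```

Because the model is plain data, a different backend could in principle walk the same `Rule` values and emit another generator's syntax, which is the portability the description hints at.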

@changeset-bot

changeset-bot bot commented Dec 31, 2025

⚠️ No Changeset found

Latest commit: d9aa154

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch 2 times, most recently from 445da70 to a06179b Compare January 7, 2026 13:08
@teofr teofr changed the base branch from teofr/v2-ast-2 to teofr/node_checker January 7, 2026 13:18
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch from 69af5d6 to 3a7f5ae Compare January 7, 2026 20:34
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch from 99bcd05 to 9f5d166 Compare January 19, 2026 12:13
@teofr teofr force-pushed the teofr/node_checker branch from 2f7d455 to acfa26c Compare January 19, 2026 12:36
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch from 9f5d166 to 8150fcf Compare January 19, 2026 12:38
@teofr teofr force-pushed the teofr/node_checker branch from acfa26c to 4cc4786 Compare January 22, 2026 14:35
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch 2 times, most recently from 8d49a31 to 6711af3 Compare January 22, 2026 16:12
@teofr teofr force-pushed the teofr/node_checker branch from 4cc4786 to 072144d Compare January 23, 2026 10:14
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch from 6711af3 to 4b6bc06 Compare January 23, 2026 10:15
@teofr teofr force-pushed the teofr/node_checker branch from 072144d to d95ed0d Compare January 28, 2026 09:05
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch 3 times, most recently from 18a9e63 to 3bd60df Compare January 28, 2026 10:07
@teofr teofr changed the base branch from teofr/node_checker to teofr/v2-definition-changes January 28, 2026 10:07
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch from 3bd60df to d16de5a Compare January 28, 2026 13:45
@teofr teofr force-pushed the teofr/v2-definition-changes branch from fb9d550 to c14dcbf Compare February 2, 2026 16:52
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch 2 times, most recently from 88717aa to d9ea068 Compare February 3, 2026 15:09
@teofr teofr changed the title LALRPOP based parser [v2] LALRPOP based parser Feb 4, 2026
@teofr teofr force-pushed the teofr/v2-definition-changes branch 2 times, most recently from 45918e8 to 7abaedc Compare February 5, 2026 11:51
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch 2 times, most recently from b33f4f1 to 31cb39c Compare February 5, 2026 12:03
@teofr teofr force-pushed the teofr/v2-definition-changes branch from 7abaedc to 0e123dd Compare February 5, 2026 12:14
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch from 31cb39c to c86cbe9 Compare February 5, 2026 12:31
@teofr teofr marked this pull request as ready for review February 5, 2026 12:55
@teofr teofr force-pushed the teofr/v2-definition-changes branch from c5d1842 to 71d73b7 Compare February 6, 2026 13:15
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch from 836f1d5 to 711c3d6 Compare February 6, 2026 14:35
@teofr teofr force-pushed the teofr/v2-definition-changes branch from 2cd6afb to f89ae1d Compare February 12, 2026 15:37
Base automatically changed from teofr/v2-definition-changes to main February 12, 2026 16:15
@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch 2 times, most recently from 59f53c8 to b6d5922 Compare February 17, 2026 16:40
Contributor

@OmarTawfik left a comment

Left a few questions/suggestions. Thanks!

producing_type: Identifier,
options: Vec<LALRPOPOption>,
pub inline: bool,
pub pubb: bool,
Contributor

WDYT of using public for clarity?

/// };
/// ```
#[derive(Clone, Debug, Serialize)]
pub(crate) struct LALRPOPItemInner {
Contributor

WDYT of using LALRPOPDerivedItem instead of "inner" here, as opposed to the verbatim ones?

Contributor Author

Much better name, thanks!

/// <_OverrideSpecifier: OverrideSpecifier> => new_modifier_attribute_override_specifier(<>),
/// ```
#[derive(Clone, Debug, Serialize)]
struct LALRPOPOption {
Contributor

I initially misunderstood these options as LALRPOP configuration/settings.
WDYT of using LALRPOPDefinition instead? similar to how LALRPOP docs call them.


/// Checks if a given version specifier enables the supported version
fn is_enabled(enabled: Option<&VersionSpecifier>) -> bool {
enabled.as_ref().is_none_or(|v| v.contains(&VERSION))
Contributor

nit: in an earlier PR you introduced Always and auto-derived Default for it, along with its contains() semantics.
I suggest reusing it here: enabled.unwrap_or_default().contains(....)

Contributor Author

You're right, this function is older than that and I didn't remember to change it. Thank you!

Contributor Author

Changed it to enabled.unwrap_or(&VersionSpecifier::default()).contains(&VERSION), since &VersionSpecifier doesn't implement Default
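
To make the pattern above concrete, here is a minimal self-contained sketch. The `VersionSpecifier` enum below is a hypothetical stand-in with only three variants, and `VERSION` is an assumed integer encoding; the real types in the PR are richer. Only the `unwrap_or(&VersionSpecifier::default())` call mirrors the change described.

```rust
/// Hypothetical stand-in for the PR's `VersionSpecifier`.
#[derive(Clone, Debug, Default, PartialEq)]
enum VersionSpecifier {
    /// Enabled in every version; used as the default.
    #[default]
    Always,
    /// Enabled from the given version onwards (inclusive).
    From(u32),
    /// Never enabled.
    Never,
}

impl VersionSpecifier {
    /// Returns true if `version` falls inside this specifier.
    fn contains(&self, version: &u32) -> bool {
        match self {
            VersionSpecifier::Always => true,
            VersionSpecifier::From(from) => version >= from,
            VersionSpecifier::Never => false,
        }
    }
}

/// Assumed encoding of the single supported version (0.8.30 as 830).
const VERSION: u32 = 830;

/// A missing specifier falls back to the default (`Always`).
/// `Option<&T>::unwrap_or_default()` is unavailable because `&T` has no
/// `Default` impl, so a reference to a temporary default value is used.
fn is_enabled(enabled: Option<&VersionSpecifier>) -> bool {
    enabled
        .unwrap_or(&VersionSpecifier::default())
        .contains(&VERSION)
}
```

With these stand-ins, `is_enabled(None)` and `is_enabled(Some(&VersionSpecifier::From(800)))` are both true, while `From(900)` and `Never` are false.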

/// In the case multiple operators are defined within a precedence expression, we wrap them in their own rule,
/// basically converting this into syntactic sugar for using an enum item.
///
/// TODO(v2): We're assuming that Precedence Items follow a strict shape, in particular that Binary Operators
Contributor

We can definitely improve this structure. Let's discuss options f2f if you like.
Not blocking of course.

Contributor Author

Cool, I think it's out of the scope of this PR


grammar<'source>(source: &'source str);

Sep<S, T>: Vec<T> = {
Contributor

rather than the shortnames, WDYT of using the full names we already use in the language definition?
Separated, SeparatedAllowEmpty, Repeated, RepeatedAllowEmpty

Contributor Author

Makes sense!

pub(crate) mod nodes;

lalrpop_mod!(
#[allow(clippy::all)]
Contributor

Is this temporary? Should we add a TODO(v2) to remove it?
Otherwise, if possible, I suggest allowing specific rules only on the parts that need them, for clarity.

Contributor Author

Added a TODO, will clean it in another PR since I may need to change the rules a bit.

use crate::lexer::contexts::ContextKind;
use crate::lexer::definition::Lexer;

pub(crate) mod nodes;
Contributor

//! This module contains certain nodes and functions used internally by the parser.
//!
//! They shouldn't be used outside of the parser, and should be transformed into AST nodes.

should this be private to the parser?

Contributor Author

You're right, changed it!

FunctionTypeAttribute::PublicKeyword(terminal) => {
StateVariableAttribute::PublicKeyword(terminal)
}
_ => panic!("This is wrong, I don't really know what to do for now, but it should fail gracefully (like a parser error)")
Contributor

Generally, we try to skip/recover from such errors, rather than panicking, even if we don't validate for it now.
I wonder if panicking is needed here? are there other places that can panic from user input?

Contributor Author

I added comments on this and another similar case, for now skipping it here, and returning an invalid Identifier (with range 0..0) in the other case. Once we start working on validation we can improve here.
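
The "skip instead of panic" approach discussed in this thread could look roughly like the sketch below. The enums are simplified, hypothetical versions of the PR's types (the real variants carry a terminal payload, and `PureKeyword` is included here only as an example of an attribute with no state-variable counterpart). The conversion returns `Option` so that unconvertible attributes are dropped rather than aborting the parse.

```rust
/// Simplified, hypothetical attribute of a function type.
#[derive(Clone, Debug, PartialEq)]
enum FunctionTypeAttribute {
    InternalKeyword,
    PublicKeyword,
    /// Example of an attribute that is not valid on a state variable.
    PureKeyword,
}

/// Simplified, hypothetical attribute of a state variable.
#[derive(Clone, Debug, PartialEq)]
enum StateVariableAttribute {
    InternalKeyword,
    PublicKeyword,
}

/// Converts one attribute, returning `None` when there is no valid
/// state-variable counterpart instead of panicking.
fn to_state_variable_attribute(
    attr: FunctionTypeAttribute,
) -> Option<StateVariableAttribute> {
    match attr {
        FunctionTypeAttribute::InternalKeyword => {
            Some(StateVariableAttribute::InternalKeyword)
        }
        FunctionTypeAttribute::PublicKeyword => {
            Some(StateVariableAttribute::PublicKeyword)
        }
        FunctionTypeAttribute::PureKeyword => None,
    }
}

/// Converts a list of attributes, silently skipping invalid ones.
/// A later validation pass could report them as errors instead.
fn convert_trailing_attributes(
    attrs: Vec<FunctionTypeAttribute>,
) -> Vec<StateVariableAttribute> {
    attrs
        .into_iter()
        .filter_map(to_state_variable_attribute)
        .collect()
}
```

Returning `Option` keeps the decision about what to do with an invalid attribute at the call site, which leaves room to swap the silent skip for a recorded diagnostic once validation work starts.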

Cargo.toml Outdated
Inflector = { version = "0.11.4" }
itertools = { version = "0.13.0" }
lalrpop = { version = "0.22.2" }
lalrpop-util = { version = "0.22.2", features = ["lexer", "unicode"] }
Contributor

given that we are using logos for lexing, why are these additional features needed here?

Contributor Author

You're right, I used the built-in lexer when doing early experiments and forgot about it.

I also upgraded to 0.23.0 since it was easy enough.

// TODO(v2): Errors should be something other than `String`
fn parse(input: &str, version: LanguageVersion) -> Result<Self::NonTerminal, String>;

fn check_version(version: LanguageVersion) -> Result<(), String> {
Contributor

IIUC, we will restrict LanguageVersion itself to the set of supported languages. Would this check_version() be needed here?

Contributor Author

Added a comment, we should remove it in the future.

/// The type of the non-terminal that this parser produces
type NonTerminal;

// TODO(v2): Errors should be something other than `String`
Contributor

If not a big change, I suggest prioritizing this sooner rather than later, as currently the "unexpected" tokens are being categorized as "unrecognized" in the error message, which conflicts with the UNRECOGNIZED lexeme.

Contributor Author

I'll prioritize this as a follow up PR. Thanks!

}
None
}
}
Contributor

discussed f2f: splitting the snapshot tests for v1/v2, for a few reasons:

  • v2 is not versioned, so we don't need diffing over versions
  • v2 only allows source unit/statement/expression inputs, so we need to rewrite/refactor the tests to fit one of these three
  • v2 produces the AST directly, so we can serialize this to the snapshot directly, rather than comparing with the legacy CST nodes. Maybe we can generate a similar YAML serializer for the new L0 types.

We should probably do this in a separate PR that doesn't contain product changes (only test changes) to make sure we don't introduce any bugs. That PR will have a ton of newly added snapshots, and it would make it harder to spot any issues with manual reviews.

@@ -0,0 +1,229 @@
//! This module contains certain nodes and functions used internally by the parser.
Contributor

IIUC, these helpers will eventually move to the ParserConsumer structure you proposed earlier. Is that correct? should we add a TODO here?

Contributor Author

I think this depends on whether we go down that path or not. But even then I'm not sure we would: the ParserConsumer should strictly follow the language definition, whereas most of the code here is only there to resolve conflicts in the parser.

Contributor Author

@teofr commented Feb 20, 2026

Thanks for the review, I think I addressed/replied to everything.

These are the tasks that I should look at next:

  • Discussing the Precedence item structure
  • Split the snapshot tests for V2, testing all of them at the same non-terminal level (SourceUnit).
  • Remove the parsers for non-terminals we don't need.
  • Simplify the testing structure, both the comparison for nodes (AST comparison should be easier) and the multiplexer for different non-terminals
  • Once the V2 language definition has the restricted set of versions, we should remove all the ad-hoc checks that compare against 0.8.30
  • Remove the allow(clippy::all) in the parser, and only add individual ones that may actually be needed
  • Improve the parser errors to be something other than just String, probably including location and a short reason.
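
For the last task in the list, a structured error carrying a location and a short reason might look like the following sketch. Every name here (`ParseError`, `ParseErrorKind`, the variants and fields) is hypothetical and not part of the PR. The split between `UnrecognizedInput` and `UnexpectedToken` is meant to reflect the distinction raised in the review between the lexer's UNRECOGNIZED lexeme and tokens the grammar did not expect.

```rust
use std::fmt;
use std::ops::Range;

/// Hypothetical short reason for a parse failure.
#[derive(Clone, Debug, PartialEq)]
enum ParseErrorKind {
    /// The lexer could not classify the input at all.
    UnrecognizedInput,
    /// A valid token appeared where the grammar did not allow it.
    UnexpectedToken { found: String, expected: Vec<String> },
    /// The input ended while more tokens were required.
    UnexpectedEndOfInput,
}

/// Hypothetical parser error: a reason plus the byte range it covers.
#[derive(Clone, Debug, PartialEq)]
struct ParseError {
    kind: ParseErrorKind,
    range: Range<usize>,
}

impl fmt::Display for ParseError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Prefix every message with the location.
        write!(f, "{}..{}: ", self.range.start, self.range.end)?;
        match &self.kind {
            ParseErrorKind::UnrecognizedInput => write!(f, "unrecognized input"),
            ParseErrorKind::UnexpectedToken { found, expected } => write!(
                f,
                "unexpected token '{found}', expected one of: {}",
                expected.join(", ")
            ),
            ParseErrorKind::UnexpectedEndOfInput => {
                write!(f, "unexpected end of input")
            }
        }
    }
}

impl std::error::Error for ParseError {}
```

With a type like this, the trait method could return `Result<Self::NonTerminal, ParseError>` instead of `Result<Self::NonTerminal, String>`, and callers could match on the kind rather than inspecting message text.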

@teofr teofr force-pushed the teofr/lalrpop-parser-squashed branch from 77c89ff to d9aa154 Compare February 20, 2026 16:44
@socket-security

Review the following changes in direct dependencies.

| Diff | Package | Supply Chain Security | Vulnerability | Quality | Maintenance | License |
|---|---|---|---|---|---|---|
| Added | cargo/lalrpop@0.23.0 | 80 | 100 | 93 | 100 | 100 |
| Added | cargo/lalrpop-util@0.23.0 | 100 | 100 | 93 | 100 | 100 |
