Skip to content

Feat: Verified and easy-to-use parser combinators #6143

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 21 commits into
base: master
Choose a base branch
from

Conversation

MikaelMayer
Copy link
Member

@MikaelMayer MikaelMayer commented Mar 7, 2025

Fixes #5418

This PR adds to the standard library support for functional parsers and parsers combinators.
A parser is simply a Input -> ParseResult<T> where ParseResult is either a success or a failure that contains the remaining unparsed input. A parsing failure is "committed" if the remaining unparsed input is strictly smaller than the input. Disjunctions can happen only if a failure is not committed. Combinators can help modify the consumption of the input, which makes it possible to create a rich variety of parsers.

The Parsers module is abstract as parsers operate on an Input that can be converted at any time to a seq<C>. In StringParsers, C is a char and Input is a datatype simulating lazy slicing on the input string. That lazy slicing resulted in a +16% speedup on a real benchmark compared to just having seq<char> for Input.

You can browse the example to understand how parsers combinators can be used.

The easiest way to use parsers is to use the parsers builders. Parser builds are a datatype wrapper around parsers to make it possible to have suffix and infix combinators. This allows chaining and avoid nested parentheses. See the following examples:

Notable features include:

  • Combinators for repetition do not concatenate sequences with one value at a time. Instead, they build a linked list and when done convert the whole linked list to a sequence. This resulted in a +430% speedup on a benchmark that was using single-character repetition when using the JavaScript version of the parsers.
  • Debugging utilities and tutorial to explain how to use them in the README.md
  • It is not possible to create parsers that don't terminate, thanks Dafny for proving it. At worst, a non-terminating parser will always return a "fatal" parsing result that can be still be analyzed at run-time.
  • It is possible to prove that parsers won't return fatal errors and perform progress. Some parser combinators have proofs of progress preservation and absence of fatal errors in ParserTheorems.dfy
  • Dual of parsers combinators (Displayers) defined and proved to be the exact inverse in the Concat case. (to discuss -- should we just remove them as they are immature?)
  • Abstract tests currently covering a good portion of the code, including full proofs that IntToString and StringToInt are reverse of each other in both ways.
  • Support for multiple parse error messages
  • For string parsers, a utility to display nice error messages that includes the line number, the line, a hat to indicate the column where parsing fails, and a detailed error message.
Error:
1; 1+*   
     ^   
expected '(', or
expected a digit, or
expected 'x'
  • Largely inspired from (Meijer 1996) parsers

How has this been tested?

  • Main examples in the README.md are checked
  • Several example illustrate the parsers functionality
  • I have used this parsers combinators library on a big project.

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

@MikaelMayer MikaelMayer requested a review from robin-aws March 7, 2025 18:52
A.AboutDrop(input, A.Length(input)-A.Length(remaining1), A.Length(remaining1)-A.Length(remaining2));
}

// Cannot express this predicate if Input is allocated. Add once we accept quantification over unallocated objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be interested in the "proof trait" pattern I used in the Actions library: https://github.com/dafny-lang/dafny/pull/6074/files#diff-cb5145d4b584ea7656360f05c946523c48c5651815cea269b469a855844c3ed2R105

Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some initial thoughts based on the offline demo, will return for a more thorough review.

FailureData("expected a "+name, input, Option.None))
}

opaque function IntToString(n: int): string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the existing utilities like Std.Strings.OfInt()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember why I did not do that. In the project that was using these parsers, I was only import Std.Wrappers and these parsers.
By using Std.Strings.OfInt(), I now need to include all these files

  "Std/Strings.dfy",
  "Std/Arithmetic/Power.dfy",
  "Std/Arithmetic/Mul.dfy",
  "Std/Arithmetic/DivMod.dfy",
  "Std/Arithmetic/Logarithm.dfy",
  "Std/Arithmetic/Internal/DivInternals.dfy",
  "Std/Arithmetic/Internal/GeneralInternals.dfy",
  "Std/Arithmetic/Internal/ModInternals.dfy",
  "Std/Arithmetic/Internal/MulInternalsNonlinear.dfy",
  "Std/Arithmetic/Internal/ModInternalsNonlinear.dfy",
  "Std/Arithmetic/Internal/DivInternalsNonlinear.dfy",
  "Std/Arithmetic/Internal/MulInternals.dfy",
  "Std/Arithmetic/LittleEndianNat.dfy",

much of which is useless to my application. Looking forward being able to just enable standard libraries soon :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sure, but as you move this code in here it will become part of the project/pre-built doo file, so there's no need for includes anywhere!

a.e_I(b.e_I(c))

will parse a, b, c but only return the result of `c`. */
function e_I<U>(other: B<U>): (p: B<U>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just _I and I_, since the "e" can be inferred by the lack of "I"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_I is not a valid user-defined Dafny identifier because it starts with underscore. I also wanted that initially. That's also why I made the e lowercase. Also, the e lowercase makes me think of the interdiction sign on the road, a bit bended :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, damn. I suppose you could also do '_I and I_'? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could but then that means the resulting generated code will look like _q__I and I___q in most backends, and even __q__I and _I___q in Rust, which does not look as symmetrical. Plus, quotes make it hard to select the entire identifier in any regular IDE. So no, I don't think I want to add quotes in identifiers in a builder pattern.

expect r.ParseFailure?;
var failureMessage := P.FailureToString(input, r);
expect failureMessage
== "Error:" + "\n" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use @"..." verbatim strings for multi-line constants in a few places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with @"..." is that indentation is treated as part of the constant, the first line requires two more characters than the remaining lines (because of the @" initial) and, because of the way Windows checks out lines, the expect would automatically fail if it expects "\r\n" instead of "\n".
The format I used above makes it possible to preserve the indentation while still having a visual about what the output should look like, and having control over all the characters.
Alternatively, I could use @" and still have the right indentation, by writing something like:

    StripIdent(@"
       FirstLine()
       SecondLine()");

where strip ident would

  • Be a method in the standard library, somewhere?
  • Remove the first newline
  • Detect the indentation of the first line
  • Erase the indentation on all subsequent lines
  • Erase Windows' carriage return \r.

I would prefer the current approach because it does not require any special handling of the string.
I'm personally using @"" mostly for regular expressions, to avoid escaping backslashes, not to avoid newlines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair that this is well out of scope for this PR at least. But those stripping utilities definitely sound like great things to add to Std.Strings in the future, so all Dafny users can get more value out of verbatim strings.

@@ -5,7 +5,7 @@ standard-libraries = true
# but consumers don't have to have this option set
# to use the standard libraries in general.
reads-clauses-on-methods = true
enforce-determinism = true
#enforce-determinism = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should change this as it means we won't catch cases of libraries that accidentally can't be used in this mode. Is the Parsers library fundamentally non-deterministic or is this something fixable?

The good news is my PR creates a separate examples project with this off since I hit this as well: https://github.com/dafny-lang/dafny/pull/6074/files#diff-8ae9e889ff28306ca6c425716f35f7ef20201f4c15d7128363b49411e724bb3b :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was accidentally committed. Thanks for pointing it out.

@@ -0,0 +1,756 @@
abstract module Std.Parsers.AbstractInput {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure you add the license header to each source file.

Comment on lines 179 to 180
It's expecting a digit where it should have just accepted the end of the string. To figure out what went wrong,
let's first define two debug functions like below - they can't be defined in the library because they have print effects. You could change the type of the output of string if you were to customize them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a decent solution given the current constraints, but I definitely wish it was easier to flip on for users somehow, and it's not great that it just plain doesn't work if users have --track-print-effects on.

Would it be possible to have an alternate evaluation mode that also carries a separate debugging information type in parallel, so you could do something like:

var r, log := ApplyWithLog(p, input);
print log;

(I'll know better myself how feasible that is once I review this more deeply)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way we could do it would be to define a new module ParserBuildersLog that would keep track of everything like this

datatype B<P> = B(p: Parser<(P, string)>) = {
  ... All operators but adding a log to it
}
... All operators but taking into account the log

Then, to import these, one could simply replace import Std.Parsers.StringParsersBuilders by adding Log at the end of the last one.
However, it would have the extra hassle that it has to be maintained separately.
To avoid the separation, one would have to create another abstract module

abstract module StringBuildersAbstract refines ParserBuilders {
  type Returning<P>
  function Extract(r: Returning<P>): P
  function Combine(a: Returning<P>, r: Returning<Q>): Returning<P>
}
module StringBuilders StringBuildersAbstract {
  type Returning<P> = P
}
module StringBuildersLog StringBuildersAbstract {
  type Returning<P> = (P, string)
}

Let me know if it's important for you that we do it this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that solution would have a lot of overhead. I'll keep this in mind when I review the implementation to see if I can think of a simpler solution.

Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed README and examples as requested, will start reviewing the rest next week.

@@ -0,0 +1,336 @@
# Verified Parser Combinators

Parser combinators in Dafny, inspired from the model (Meijer&Hutton 1996).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hyperlink to paper perhaps?


This library offers two styles of functional parser combinators.

- The first parsers style is a synonym for `seq<character> -> ParseResult<Result>` that supports monadic styles, is straightforward to use, but results in lots of closing parentheses.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be literally char, or is character more like a type parameter here?


/** Tic tac toe using string parsers */
method Main() {
var x := OrSeq([
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider var cell instead of var x

Comment on lines 6 to 7
// A parser is valid iff for any input, it never returns a fatal error
// and always returns a suffix of its input
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many parsers are NOT valid? (Begs the question again of what a fatal error is)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reformulated to:


    // A parser is NeverFatal iff for any input, it never returns a fatal error
    // Since fatal errors occur only when a parser returns more characters than it was given,
    // NeverFatal means the remaining unparsed must be smaller or equal to the input

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like that you're showing this example with both styles of constructing parsers, but since the rest is identical, I'd say combine them into one file and just run the tests on both copies.

expect r.ParseFailure?;
var failureMessage := P.FailureToString(input, r);
expect failureMessage
== "Error:" + "\n" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair that this is well out of scope for this PR at least. But those stripping utilities definitely sound like great things to add to Std.Strings in the future, so all Dafny users can get more value out of verbatim strings.

FailureData("expected a "+name, input, Option.None))
}

opaque function IntToString(n: int): string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sure, but as you move this code in here it will become part of the project/pre-built doo file, so there's no need for includes anywhere!

a.e_I(b.e_I(c))

will parse a, b, c but only return the result of `c`. */
function e_I<U>(other: B<U>): (p: B<U>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, damn. I suppose you could also do '_I and I_'? :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Standard Library: Parser combinators
2 participants