Skip to content

Feat: Verified and easy-to-use parser combinators #6143

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 58 commits into from
May 16, 2025

Conversation

MikaelMayer
Copy link
Member

@MikaelMayer MikaelMayer commented Mar 7, 2025

Fixes #5418

This PR adds to the standard library support for functional parsers and parsers combinators.
A parser is simply a Input -> ParseResult<T> where ParseResult is either a success or a failure that contains the remaining unparsed input. A parsing failure is "committed" if the remaining unparsed input is strictly smaller than the input. Disjunctions can happen only if a failure is not committed. Combinators can help modify the consumption of the input, which makes it possible to create a rich variety of parsers.

The Parsers module is abstract as parsers operate on an Input that can be converted at any time to a seq<C>. In StringParsers, C is a char and Input is a datatype simulating lazy slicing on the input string. That lazy slicing resulted in a +16% speedup on a real benchmark compared to just having seq<char> for Input.

You can browse the example to understand how parsers combinators can be used.

The easiest way to use parsers is to use the parsers builders. Parser builds are a datatype wrapper around parsers to make it possible to have suffix and infix combinators. This allows chaining and avoid nested parentheses. See the following examples:

Notable features include:

  • Combinators for repetition do not concatenate sequences with one value at a time. Instead, they build a linked list and when done convert the whole linked list to a sequence. This resulted in a +430% speedup on a benchmark that was using single-character repetition when using the JavaScript version of the parsers.
  • Debugging utilities and tutorial to explain how to use them in the README.md
  • It is not possible to create parsers that don't terminate, thanks Dafny for proving it. At worst, a non-terminating parser will always return a "fatal" parsing result that can be still be analyzed at run-time.
  • It is possible to prove that parsers won't return fatal errors and perform progress. Some parser combinators have proofs of progress preservation and absence of fatal errors in ParserTheorems.dfy
  • Support for multiple parse error messages
  • For string parsers, a utility to display nice error messages that includes the line number, the line, a hat to indicate the column where parsing fails, and a detailed error message.
Error:
1; 1+*   
     ^   
expected '(', or
expected a digit, or
expected 'x'
  • Largely inspired from (Meijer 1996) parsers

How has this been tested?

  • Main examples in the README.md are checked
  • Several example illustrate the parsers functionality
  • I have used this parsers combinators library on a big project.

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

@MikaelMayer MikaelMayer requested a review from robin-aws March 7, 2025 18:52
A.AboutDrop(input, A.Length(input)-A.Length(remaining1), A.Length(remaining1)-A.Length(remaining2));
}

// Cannot express this predicate if Input is allocated. Add once we accept quantification over unallocated objects
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might be interested in the "proof trait" pattern I used in the Actions library: https://github.com/dafny-lang/dafny/pull/6074/files#diff-cb5145d4b584ea7656360f05c946523c48c5651815cea269b469a855844c3ed2R105

Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some initial thoughts based on the offline demo, will return for a more thorough review.

FailureData("expected a "+name, input, Option.None))
}

opaque function IntToString(n: int): string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use the existing utilities like Std.Strings.OfInt()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent suggestion.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember why I did not do that. In the project that was using these parsers, I was only import Std.Wrappers and these parsers.
By using Std.Strings.OfInt(), I now need to include all these files

  "Std/Strings.dfy",
  "Std/Arithmetic/Power.dfy",
  "Std/Arithmetic/Mul.dfy",
  "Std/Arithmetic/DivMod.dfy",
  "Std/Arithmetic/Logarithm.dfy",
  "Std/Arithmetic/Internal/DivInternals.dfy",
  "Std/Arithmetic/Internal/GeneralInternals.dfy",
  "Std/Arithmetic/Internal/ModInternals.dfy",
  "Std/Arithmetic/Internal/MulInternalsNonlinear.dfy",
  "Std/Arithmetic/Internal/ModInternalsNonlinear.dfy",
  "Std/Arithmetic/Internal/DivInternalsNonlinear.dfy",
  "Std/Arithmetic/Internal/MulInternals.dfy",
  "Std/Arithmetic/LittleEndianNat.dfy",

much of which is useless to my application. Looking forward being able to just enable standard libraries soon :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sure, but as you move this code in here it will become part of the project/pre-built doo file, so there's no need for includes anywhere!

a.e_I(b.e_I(c))

will parse a, b, c but only return the result of `c`. */
function e_I<U>(other: B<U>): (p: B<U>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just _I and I_, since the "e" can be inferred by the lack of "I"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_I is not a valid user-defined Dafny identifier because it starts with underscore. I also wanted that initially. That's also why I made the e lowercase. Also, the e lowercase makes me think of the interdiction sign on the road, a bit bended :-)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, damn. I suppose you could also do '_I and I_'? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could but then that means the resulting generated code will look like _q__I and I___q in most backends, and even __q__I and _I___q in Rust, which does not look as symmetrical. Plus, quotes make it hard to select the entire identifier in any regular IDE. So no, I don't think I want to add quotes in identifiers in a builder pattern.

expect r.ParseFailure?;
var failureMessage := P.FailureToString(input, r);
expect failureMessage
== "Error:" + "\n" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could use @"..." verbatim strings for multi-line constants in a few places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with @"..." is that indentation is treated as part of the constant, the first line requires two more characters than the remaining lines (because of the @" initial) and, because of the way Windows checks out lines, the expect would automatically fail if it expects "\r\n" instead of "\n".
The format I used above makes it possible to preserve the indentation while still having a visual about what the output should look like, and having control over all the characters.
Alternatively, I could use @" and still have the right indentation, by writing something like:

    StripIdent(@"
       FirstLine()
       SecondLine()");

where strip ident would

  • Be a method in the standard library, somewhere?
  • Remove the first newline
  • Detect the indentation of the first line
  • Erase the indentation on all subsequent lines
  • Erase Windows' carriage return \r.

I would prefer the current approach because it does not require any special handling of the string.
I'm personally using @"" mostly for regular expressions, to avoid escaping backslashes, not to avoid newlines.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair that this is well out of scope for this PR at least. But those stripping utilities definitely sound like great things to add to Std.Strings in the future, so all Dafny users can get more value out of verbatim strings.

@@ -5,7 +5,7 @@ standard-libraries = true
# but consumers don't have to have this option set
# to use the standard libraries in general.
reads-clauses-on-methods = true
enforce-determinism = true
#enforce-determinism = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should change this as it means we won't catch cases of libraries that accidentally can't be used in this mode. Is the Parsers library fundamentally non-deterministic or is this something fixable?

The good news is my PR creates a separate examples project with this off since I hit this as well: https://github.com/dafny-lang/dafny/pull/6074/files#diff-8ae9e889ff28306ca6c425716f35f7ef20201f4c15d7128363b49411e724bb3b :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this was accidentally committed. Thanks for pointing it out.

Comment on lines 179 to 180
It's expecting a digit where it should have just accepted the end of the string. To figure out what went wrong,
let's first define two debug functions like below - they can't be defined in the library because they have print effects. You could change the type of the output of string if you were to customize them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a decent solution given the current constraints, but I definitely wish it was easier to flip on for users somehow, and it's not great that it just plain doesn't work if users have --track-print-effects on.

Would it be possible to have an alternate evaluation mode that also carries a separate debugging information type in parallel, so you could do something like:

var r, log := ApplyWithLog(p, input);
print log;

(I'll know better myself how feasible that is once I review this more deeply)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way we could do it would be to define a new module ParserBuildersLog that would keep track of everything like this

datatype B<P> = B(p: Parser<(P, string)>) = {
  ... All operators but adding a log to it
}
... All operators but taking into account the log

Then, to import these, one could simply replace import Std.Parsers.StringParsersBuilders by adding Log at the end of the last one.
However, it would have the extra hassle that it has to be maintained separately.
To avoid the separation, one would have to create another abstract module

abstract module StringBuildersAbstract refines ParserBuilders {
  type Returning<P>
  function Extract(r: Returning<P>): P
  function Combine(a: Returning<P>, r: Returning<Q>): Returning<P>
}
module StringBuilders StringBuildersAbstract {
  type Returning<P> = P
}
module StringBuildersLog StringBuildersAbstract {
  type Returning<P> = (P, string)
}

Let me know if it's important for you that we do it this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that solution would have a lot of overhead. I'll keep this in mind when I review the implementation to see if I can think of a simpler solution.

Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed README and examples as requested, will start reviewing the rest next week.

@@ -0,0 +1,336 @@
# Verified Parser Combinators

Parser combinators in Dafny, inspired from the model (Meijer&Hutton 1996).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hyperlink to paper perhaps?


This library offers two styles of functional parser combinators.

- The first parsers style is a synonym for `seq<character> -> ParseResult<Result>` that supports monadic styles, is straightforward to use, but results in lots of closing parentheses.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should that be literally char, or is character more like a type parameter here?


/** Tic tac toe using string parsers */
method Main() {
var x := OrSeq([
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider var cell instead of var x

Comment on lines 6 to 7
// A parser is valid iff for any input, it never returns a fatal error
// and always returns a suffix of its input
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many parsers are NOT valid? (Begs the question again of what a fatal error is)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reformulated to:


    // A parser is NeverFatal iff for any input, it never returns a fatal error
    // Since fatal errors occur only when a parser returns more characters than it was given,
    // NeverFatal means the remaining unparsed must be smaller or equal to the input

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like that you're showing this example with both styles of constructing parsers, but since the rest is identical, I'd say combine them into one file and just run the tests on both copies.

expect r.ParseFailure?;
var failureMessage := P.FailureToString(input, r);
expect failureMessage
== "Error:" + "\n" +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair that this is well out of scope for this PR at least. But those stripping utilities definitely sound like great things to add to Std.Strings in the future, so all Dafny users can get more value out of verbatim strings.

FailureData("expected a "+name, input, Option.None))
}

opaque function IntToString(n: int): string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah sure, but as you move this code in here it will become part of the project/pre-built doo file, so there's no need for includes anywhere!

a.e_I(b.e_I(c))

will parse a, b, c but only return the result of `c`. */
function e_I<U>(other: B<U>): (p: B<U>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right, damn. I suppose you could also do '_I and I_'? :)

Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really exemplary work all around, and excellent depth of documentation and examples. Mainly quibbling over too-short names in the DSL and some potential reuse of existing utilities.

@@ -1,6 +1,7 @@
using System;
using System.Collections.Generic;
using System.Diagnostics.Contracts;
using System.IO;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused?

module ExampleParsers.SmtParser {
import opened Std.Parsers.StringBuilders

function SeqToString<T>(s: seq<T>, f: T --> string, separator: string := ""): string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be Seq.Join(Seq.Map(f, s), separator)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The map from the library takes a total function, and I need a partial one to ensure all the things it's called with are in the list for termination purposes.
Perhaps we could create another Map for that purpose but I don't plan to for now.

Copy link
Member Author

@MikaelMayer MikaelMayer May 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just found out that I had a wrong assumption, that the standard library Map function actually a partial function. Nevermind, will update, I prefer to reuse existing code too and Seq.Join(Seq.MapPartialFunction(f, s), separator)

// B1.I_e(B2) will parse both but return the result of B1
// B.M(f) will map the result of the parser builder by f if succeeded
// B1.O(B2) will either parse B1, or B2 if B1 fails with Recoverable
// O([B1, B2, B3])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, misread this as taking three arguments instead of a sequence of arguments at first. Might be enough to tweak the documentation to say "Chains multiple parsers in sequence. For example, will parse with B1..."


/** `a.M(f)` evaluates `a` on the input, and then if it returns a success,
transforms the value using the function `f`. */
function M<U>(mappingFunc: R -> U): (p: B<U>) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's worth the extra two letters to call this Map, and similarly for the other arity variants.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I thought of that too. I deliberated and here is why I'll stick to M only.
I designed the builders so that variants used the most can be as short as possible while remaining readable. That way constants can use S(), include/exclude take only three letters, and Map only one letter.

Since parser builders already require a learning curve, I'm piggy backing on it to avoid long identifiers, and only abbreviations. That way there is a relatively clear separation between the raw parsers and these builders.

If one wants to use the longer forms like Map, I'd recommend to use the parser combinators without the DSL, which use explicit names everywhere.
In very little time I was used to just M and I am glad I don't need to write 2 extra chars everywhere.

/** `a.RepFold(init, combine) will consider init as an accumulator, and every time `a` succeeds and parses some input, it will
combine(init, result) for the new accumulator value.
If `a` fails once while consuming input, it returns the failure. */
function RepFold<A>(init: A, combine: (A, R) -> A): (p: B<A>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Rep" isn't quite unambiguous enough for me to read it as "Repeat", too close to "Representation" or others.

Perhaps "Star" as the prefix for repeating instead of "Rep", since that's so often the symbol used in grammers/regexps?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm considering it but then instead of Rep and Rep1 you'd have the two heterogeneous Star and Plus which are not descriptive enough for me. So I'd prefer to stick with the 3 letter version of Rep.

// Error handling utilities
// ########################

function repeat_(str: string, n: nat): (r: string)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Collections.Seq.Repeat!

if vars.i < |vars.input| && vars.i != vars.pos then
var colNumber := vars.colNumber + 1;
if vars.input[vars.i] == '\r' && vars.i + 1 < |vars.input| && vars.input[vars.i+1] == '\n' then
//var startLinePos := i + 1; // ORIGINAL BUG: vars.i + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recording a time Dafny caught a bug? :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, but the line has been changed so much that this is no longer relevant. Removing.

@@ -0,0 +1 @@
Parser combinators now available from the standard library
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you've earned a few more words on this than just that. :) Mention some of the cool features at least too!

* You can prove the equivalence between various combinations of parser combinators (e.g. Bind/Succeed == Map)
* You can use these parser combinators as a specification for optimized methods that perform the same task but perhaps with a different representation, e.g. array and position instead of sequences of characters.

### What are fatal errors?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks

output
}

// A datatype that avoid a quadratic complexity in concatenating long sequences.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary given all the backends already do this for sequence concatenation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this resulted in a +430% speedup on a benchmark that was using single-character repetition when using the JavaScript version of the parsers.
I think that the problem is when we do this.

"a" + "b" + "c" + "d" + ...

it does perform the concatenation in a lazy way, but when the time comes when we want to realize the final strings, two things happen

  • The recursion to get the string blows the stack in some cases
  • When it does not blow the stack, intermediate strings seem to be concretized even though they are not necessary.

Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very much appreciate the more verbose aliases we decided on, thanks for putting in the extra effort to make more users happy. :)

@robin-aws robin-aws enabled auto-merge (squash) May 16, 2025 03:54
@robin-aws robin-aws merged commit f3b6788 into master May 16, 2025
34 of 35 checks passed
@robin-aws robin-aws deleted the feat-parsers-in-standard-library branch May 16, 2025 04:19
fabiomadge added a commit to fabiomadge/dafny that referenced this pull request May 16, 2025
fabiomadge added a commit that referenced this pull request May 16, 2025
MikaelMayer added a commit that referenced this pull request May 16, 2025
MikaelMayer added a commit that referenced this pull request May 20, 2025
…#6243)

This reinstates commit f3b6788.
This time deep tests will be run to ensure CI failure is fixed.

<small>By submitting this pull request, I confirm that my contribution
is made under the terms of the [MIT
license](https://github.com/dafny-lang/dafny/blob/master/LICENSE.txt).</small>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Standard Library: Parser combinators
2 participants