-
Notifications
You must be signed in to change notification settings - Fork 280
Feat: Verified and easy-to-use parser combinators #6143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
A.AboutDrop(input, A.Length(input)-A.Length(remaining1), A.Length(remaining1)-A.Length(remaining2)); | ||
} | ||
|
||
// Cannot express this predicate if Input is allocated. Add once we accept quantification over unallocated objects |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might be interested in the "proof trait" pattern I used in the Actions library: https://github.com/dafny-lang/dafny/pull/6074/files#diff-cb5145d4b584ea7656360f05c946523c48c5651815cea269b469a855844c3ed2R105
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some initial thoughts based on the offline demo, will return for a more thorough review.
FailureData("expected a "+name, input, Option.None)) | ||
} | ||
|
||
opaque function IntToString(n: int): string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not use the existing utilities like Std.Strings.OfInt()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember why I did not do that. In the project that was using these parsers, I was only import Std.Wrappers and these parsers.
By using Std.Strings.OfInt(), I now need to include all these files
"Std/Strings.dfy",
"Std/Arithmetic/Power.dfy",
"Std/Arithmetic/Mul.dfy",
"Std/Arithmetic/DivMod.dfy",
"Std/Arithmetic/Logarithm.dfy",
"Std/Arithmetic/Internal/DivInternals.dfy",
"Std/Arithmetic/Internal/GeneralInternals.dfy",
"Std/Arithmetic/Internal/ModInternals.dfy",
"Std/Arithmetic/Internal/MulInternalsNonlinear.dfy",
"Std/Arithmetic/Internal/ModInternalsNonlinear.dfy",
"Std/Arithmetic/Internal/DivInternalsNonlinear.dfy",
"Std/Arithmetic/Internal/MulInternals.dfy",
"Std/Arithmetic/LittleEndianNat.dfy",
much of which is useless to my application. Looking forward being able to just enable standard libraries soon :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sure, but as you move this code in here it will become part of the project/pre-built doo file, so there's no need for includes anywhere!
a.e_I(b.e_I(c)) | ||
|
||
will parse a, b, c but only return the result of `c`. */ | ||
function e_I<U>(other: B<U>): (p: B<U>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about just _I
and I_
, since the "e" can be inferred by the lack of "I"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_I
is not a valid user-defined Dafny identifier because it starts with underscore. I also wanted that initially. That's also why I made the e
lowercase. Also, the e
lowercase makes me think of the interdiction sign on the road, a bit bended :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, damn. I suppose you could also do '_I
and I_'
? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could but then that means the resulting generated code will look like _q__I
and I___q
in most backends, and even __q__I
and _I___q
in Rust, which does not look as symmetrical. Plus, quotes make it hard to select the entire identifier in any regular IDE. So no, I don't think I want to add quotes in identifiers in a builder pattern.
expect r.ParseFailure?; | ||
var failureMessage := P.FailureToString(input, r); | ||
expect failureMessage | ||
== "Error:" + "\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could use @"..."
verbatim strings for multi-line constants in a few places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with @"..."
is that indentation is treated as part of the constant, the first line requires two more characters than the remaining lines (because of the @"
initial) and, because of the way Windows checks out lines, the expect would automatically fail if it expects "\r\n"
instead of "\n"
.
The format I used above makes it possible to preserve the indentation while still having a visual about what the output should look like, and having control over all the characters.
Alternatively, I could use @"
and still have the right indentation, by writing something like:
StripIdent(@"
FirstLine()
SecondLine()");
where strip ident would
- Be a method in the standard library, somewhere?
- Remove the first newline
- Detect the indentation of the first line
- Erase the indentation on all subsequent lines
- Erase Windows' carriage return
\r
.
I would prefer the current approach because it does not require any special handling of the string.
I'm personally using @""
mostly for regular expressions, to avoid escaping backslashes, not to avoid newlines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair that this is well out of scope for this PR at least. But those stripping utilities definitely sound like great things to add to Std.Strings
in the future, so all Dafny users can get more value out of verbatim strings.
@@ -5,7 +5,7 @@ standard-libraries = true | |||
# but consumers don't have to have this option set | |||
# to use the standard libraries in general. | |||
reads-clauses-on-methods = true | |||
enforce-determinism = true | |||
#enforce-determinism = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should change this as it means we won't catch cases of libraries that accidentally can't be used in this mode. Is the Parsers library fundamentally non-deterministic or is this something fixable?
The good news is my PR creates a separate examples project with this off since I hit this as well: https://github.com/dafny-lang/dafny/pull/6074/files#diff-8ae9e889ff28306ca6c425716f35f7ef20201f4c15d7128363b49411e724bb3b :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this was accidentally committed. Thanks for pointing it out.
It's expecting a digit where it should have just accepted the end of the string. To figure out what went wrong, | ||
let's first define two debug functions like below - they can't be defined in the library because they have print effects. You could change the type of the output of string if you were to customize them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a decent solution given the current constraints, but I definitely wish it was easier to flip on for users somehow, and it's not great that it just plain doesn't work if users have --track-print-effects
on.
Would it be possible to have an alternate evaluation mode that also carries a separate debugging information type in parallel, so you could do something like:
var r, log := ApplyWithLog(p, input);
print log;
(I'll know better myself how feasible that is once I review this more deeply)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way we could do it would be to define a new module ParserBuildersLog
that would keep track of everything like this
datatype B<P> = B(p: Parser<(P, string)>) = {
... All operators but adding a log to it
}
... All operators but taking into account the log
Then, to import these, one could simply replace import Std.Parsers.StringParsersBuilders
by adding Log
at the end of the last one.
However, it would have the extra hassle that it has to be maintained separately.
To avoid the separation, one would have to create another abstract module
abstract module StringBuildersAbstract refines ParserBuilders {
type Returning<P>
function Extract(r: Returning<P>): P
function Combine(a: Returning<P>, r: Returning<Q>): Returning<P>
}
module StringBuilders StringBuildersAbstract {
type Returning<P> = P
}
module StringBuildersLog StringBuildersAbstract {
type Returning<P> = (P, string)
}
Let me know if it's important for you that we do it this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that solution would have a lot of overhead. I'll keep this in mind when I review the implementation to see if I can think of a simpler solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed README and examples as requested, will start reviewing the rest next week.
@@ -0,0 +1,336 @@ | |||
# Verified Parser Combinators | |||
|
|||
Parser combinators in Dafny, inspired from the model (Meijer&Hutton 1996). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hyperlink to paper perhaps?
|
||
This library offers two styles of functional parser combinators. | ||
|
||
- The first parsers style is a synonym for `seq<character> -> ParseResult<Result>` that supports monadic styles, is straightforward to use, but results in lots of closing parentheses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should that be literally char
, or is character
more like a type parameter here?
|
||
/** Tic tac toe using string parsers */ | ||
method Main() { | ||
var x := OrSeq([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider var cell
instead of var x
// A parser is valid iff for any input, it never returns a fatal error | ||
// and always returns a suffix of its input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How many parsers are NOT valid? (Begs the question again of what a fatal error is)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reformulated to:
// A parser is NeverFatal iff for any input, it never returns a fatal error
// Since fatal errors occur only when a parser returns more characters than it was given,
// NeverFatal means the remaining unparsed must be smaller or equal to the input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really like that you're showing this example with both styles of constructing parsers, but since the rest is identical, I'd say combine them into one file and just run the tests on both copies.
expect r.ParseFailure?; | ||
var failureMessage := P.FailureToString(input, r); | ||
expect failureMessage | ||
== "Error:" + "\n" + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair that this is well out of scope for this PR at least. But those stripping utilities definitely sound like great things to add to Std.Strings
in the future, so all Dafny users can get more value out of verbatim strings.
FailureData("expected a "+name, input, Option.None)) | ||
} | ||
|
||
opaque function IntToString(n: int): string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah sure, but as you move this code in here it will become part of the project/pre-built doo file, so there's no need for includes anywhere!
a.e_I(b.e_I(c)) | ||
|
||
will parse a, b, c but only return the result of `c`. */ | ||
function e_I<U>(other: B<U>): (p: B<U>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah right, damn. I suppose you could also do '_I
and I_'
? :)
Co-authored-by: Robin Salkeld <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really exemplary work all around, and excellent depth of documentation and examples. Mainly quibbling over too-short names in the DSL and some potential reuse of existing utilities.
@@ -1,6 +1,7 @@ | |||
using System; | |||
using System.Collections.Generic; | |||
using System.Diagnostics.Contracts; | |||
using System.IO; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unused?
module ExampleParsers.SmtParser { | ||
import opened Std.Parsers.StringBuilders | ||
|
||
function SeqToString<T>(s: seq<T>, f: T --> string, separator: string := ""): string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be Seq.Join(Seq.Map(f, s), separator)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The map from the library takes a total function, and I need a partial one to ensure all the things it's called with are in the list for termination purposes.
Perhaps we could create another Map for that purpose but I don't plan to for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I just found out that I had a wrong assumption, that the standard library Map function actually a partial function. Nevermind, will update, I prefer to reuse existing code too and Seq.Join(Seq.MapPartialFunction(f, s), separator)
// B1.I_e(B2) will parse both but return the result of B1 | ||
// B.M(f) will map the result of the parser builder by f if succeeded | ||
// B1.O(B2) will either parse B1, or B2 if B1 fails with Recoverable | ||
// O([B1, B2, B3]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit, misread this as taking three arguments instead of a sequence of arguments at first. Might be enough to tweak the documentation to say "Chains multiple parsers in sequence. For example, will parse with B1..."
|
||
/** `a.M(f)` evaluates `a` on the input, and then if it returns a success, | ||
transforms the value using the function `f`. */ | ||
function M<U>(mappingFunc: R -> U): (p: B<U>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth the extra two letters to call this Map
, and similarly for the other arity variants.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I thought of that too. I deliberated and here is why I'll stick to M
only.
I designed the builders so that variants used the most can be as short as possible while remaining readable. That way constants can use S()
, include/exclude take only three letters, and Map only one letter.
Since parser builders already require a learning curve, I'm piggy backing on it to avoid long identifiers, and only abbreviations. That way there is a relatively clear separation between the raw parsers and these builders.
If one wants to use the longer forms like Map
, I'd recommend to use the parser combinators without the DSL, which use explicit names everywhere.
In very little time I was used to just M
and I am glad I don't need to write 2 extra chars everywhere.
/** `a.RepFold(init, combine) will consider init as an accumulator, and every time `a` succeeds and parses some input, it will | ||
combine(init, result) for the new accumulator value. | ||
If `a` fails once while consuming input, it returns the failure. */ | ||
function RepFold<A>(init: A, combine: (A, R) -> A): (p: B<A>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Rep" isn't quite unambiguous enough for me to read it as "Repeat", too close to "Representation" or others.
Perhaps "Star" as the prefix for repeating instead of "Rep", since that's so often the symbol used in grammers/regexps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm considering it but then instead of Rep
and Rep1
you'd have the two heterogeneous Star
and Plus
which are not descriptive enough for me. So I'd prefer to stick with the 3 letter version of Rep
.
// Error handling utilities | ||
// ######################## | ||
|
||
function repeat_(str: string, n: nat): (r: string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Collections.Seq.Repeat
!
if vars.i < |vars.input| && vars.i != vars.pos then | ||
var colNumber := vars.colNumber + 1; | ||
if vars.input[vars.i] == '\r' && vars.i + 1 < |vars.input| && vars.input[vars.i+1] == '\n' then | ||
//var startLinePos := i + 1; // ORIGINAL BUG: vars.i + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recording a time Dafny caught a bug? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, but the line has been changed so much that this is no longer relevant. Removing.
docs/dev/news/parsers.feat
Outdated
@@ -0,0 +1 @@ | |||
Parser combinators now available from the standard library |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you've earned a few more words on this than just that. :) Mention some of the cool features at least too!
* You can prove the equivalence between various combinations of parser combinators (e.g. Bind/Succeed == Map) | ||
* You can use these parser combinators as a specification for optimized methods that perform the same task but perhaps with a different representation, e.g. array and position instead of sequences of characters. | ||
|
||
### What are fatal errors? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome, thanks
output | ||
} | ||
|
||
// A datatype that avoid a quadratic complexity in concatenating long sequences. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this necessary given all the backends already do this for sequence concatenation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing this resulted in a +430% speedup on a benchmark that was using single-character repetition when using the JavaScript version of the parsers.
I think that the problem is when we do this.
"a" + "b" + "c" + "d" + ...
it does perform the concatenation in a lazy way, but when the time comes when we want to realize the final strings, two things happen
- The recursion to get the string blows the stack in some cases
- When it does not blow the stack, intermediate strings seem to be concretized even though they are not necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very much appreciate the more verbose aliases we decided on, thanks for putting in the extra effort to make more users happy. :)
)" This reverts revers commit f3b6788.
…#6243) This reinstates commit f3b6788. This time deep tests will be run to ensure CI failure is fixed. <small>By submitting this pull request, I confirm that my contribution is made under the terms of the [MIT license](https://github.com/dafny-lang/dafny/blob/master/LICENSE.txt).</small>
Fixes #5418
This PR adds to the standard library support for functional parsers and parsers combinators.
A parser is simply a
Input -> ParseResult<T>
whereParseResult
is either a success or a failure that contains the remaining unparsed input. A parsing failure is "committed" if the remaining unparsed input is strictly smaller than the input. Disjunctions can happen only if a failure is not committed. Combinators can help modify the consumption of the input, which makes it possible to create a rich variety of parsers.The Parsers module is abstract as parsers operate on an
Input
that can be converted at any time to aseq<C>
. InStringParsers
,C
is a char andInput
is a datatype simulating lazy slicing on the input string. That lazy slicing resulted in a +16% speedup on a real benchmark compared to just havingseq<char>
forInput
.You can browse the example to understand how parsers combinators can be used.
4*(1+3)*x^3+x
The easiest way to use parsers is to use the parsers builders. Parser builds are a datatype wrapper around parsers to make it possible to have suffix and infix combinators. This allows chaining and avoid nested parentheses. See the following examples:
4*(1+3)*x^3+x
but the syntax seems better to me.abc((de|f((g))*))ml"
, create another parser out of that, and apply this parser to an input string like"abcdeml"
.Notable features include:
How has this been tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.