GoPAPAGENO was originally developed as a Go port of PAPAGENO, a Python 2.x tool that generates C parallel parsers based on Floyd's Operator Precedence Grammars. While PAPAGENO required the produced parsers to be used in conjunction with lexers generated by other programs, such as Flex, GoPAPAGENO also implements an efficient way to generate parallel lexical scanners.
This fork extends the original codebase with a new parsing algorithm based on Cyclic Operator Precedence Grammars, which can significantly outperform the original OPP algorithm while also generating much flatter parse trees.
> **Note**
> This project was developed as part of my Master's Thesis work at Politecnico di Milano. It is not at a production-grade level of functionality and optimization, and could be greatly improved to achieve even better results.
GoPAPAGENO is able to automatically generate parallel scanners and parsers starting from two specification files which resemble those required by programs like Flex and Bison. This section explains how to write these files and how to use the tool and the programs that it produces.
GoPAPAGENO can be installed by using the Go toolchain and running the following command:

```
go install github.com/giornetta/gopapageno/cmd/gopapageno@latest
```
Alternatively, you can run a Docker container:

```
docker run -it giornetta/gopapageno
```
The `gopapageno` program accepts the following command-line flags:

- `-l lexfile.l`: the lexicon specification file (mandatory);
- `-g grammarfile.g`: the grammar specification file (mandatory);
- `-o outdir`: the output directory of the generated code (mandatory);
- `-s strat`: the parsing strategy to use, among `opp`, `aopp`, and `copp`;
- `-types-only`: if set, the program will not generate a main file, leaving integration into existing programs up to the user;
- `-benchmark`: if set, the program will also generate a file containing some benchmarking utilities;
- `-log`: if set, it enables verbose logging during code generation.
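For example, the following invocation generates a COPP parser from a pair of specification files (the file names are illustrative):

```
gopapageno -l expr.l -g expr.g -o ./generated -s copp
```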
Unless gopapageno is run with `-types-only`, the tool will generate a self-contained program, which can be executed with the following flags:

- `-f source`: the source file to be parsed;
- `-c concurrency`: the maximum number of goroutines to spawn during parallel lexing and parsing;
- `-s strat`: the reduction strategy to use, among `sweep`, `mixed`, and `parallel`;
- `-avg num`: the average length of tokens in the source file; setting this properly can result in better performance;
- `-pf factor`: an estimate of how much parallelism can be exploited in the provided input file, in the range `[0, 1]`; setting this properly can result in better performance;
- `-log`: if set, it enables verbose logging during parsing;
- `-cpuprof outfile`: if set, it enables CPU profiling and writes its results to `outfile`;
- `-memprof outfile`: if set, it enables memory profiling and writes its results to `outfile`.
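For instance, a generated parser could be run from within the output directory like this (the input file and flag values are illustrative):

```
go run . -f input.json -c 8 -s sweep
```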
Otherwise, when `-types-only` is set, GoPAPAGENO will only generate `lexer.pg.go` and `parser.pg.go` files, which declare, respectively, the functions `NewLexer()` and `NewGrammar()`, whose results can then be passed to `gopapageno.NewRunner` alongside any desired options, using Go's functional options pattern, as shown in the example below.
```go
r := gopapageno.NewRunner(
	NewLexer(),
	NewGrammar(),
	gopapageno.WithConcurrency(4),
	gopapageno.WithMemoryProfiling(memProfileWriter),
	gopapageno.WithReductionStrategy(gopapageno.ReductionSweep),
	gopapageno.WithAverageTokenLength(7),
)
```
The obtained runner can be used to execute the whole parallel lexing and parsing process by simply calling `r.Run(ctx, source)`, where `source` is a `[]byte`.
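A complete driver might look like the following sketch. The input path is illustrative, and the return values of `Run` are an assumption; check the generated code for the exact signature.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/giornetta/gopapageno"
)

func main() {
	// NewLexer and NewGrammar are declared in the generated
	// lexer.pg.go and parser.pg.go files.
	r := gopapageno.NewRunner(NewLexer(), NewGrammar())

	src, err := os.ReadFile("input.txt") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}

	// Assumption: Run returns the root of the parse tree and an error.
	root, err := r.Run(context.Background(), src)
	if err != nil {
		log.Fatalf("parsing failed: %v", err)
	}
	fmt.Println(root)
}
```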
The lexicon specification file must follow this format:

```
Options
%%
Definitions
%%
Rules
%%
Code
```
The `Options` section lets you specify parameters that affect how the parallel scanner is generated. Currently, the only supported options are:

- `%cut Pattern`: specifies a regular expression used to identify cut points in the source string. If omitted, the source will be cut arbitrarily, potentially causing errors.
- `%preamble Func`: specifies the name of a function that will be executed before the lexing phase begins. The function specified here must adhere to `type PreambleFunc func(sourceLen, concurrency int)`.
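For example, an options section might look like this (both the cut pattern and the preamble function name are illustrative):

```
%cut \n
%preamble LexerPreallocMem
```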
The `Definitions` section lets you define, on separate lines, regular expressions associated with an identifier, following the format `Identifier Pattern`. The identifiers defined here can then be used in the `Rules` section by writing `{Identifier}`.
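For instance, the following (illustrative) definitions declare character classes that rules can reference as `{DIGIT}` and `{LETTER}`:

```
DIGIT [0-9]
LETTER [a-zA-Z]
```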
The `Rules` section contains the rules that specify how tokens will be produced by the generated scanner. Every rule must have the following format:

```
Pattern {
	Action
}
```
Here, `Pattern` is the actual regular expression to recognize, and `Action` is the Go code that will be executed when the associated regular expression is matched. Inside an `Action`, the following variables are available:

- `rule`: the index of the lex rule being matched;
- `text`: the actual string being matched;
- `start` and `end`: the starting and ending indices of the current token in the source text;
- `thread`: the index of the concurrent worker scanning the current token;
- `token`: a reference to the token being produced. By accessing it, its `Type` and `Value` can be set from here.
Moreover, each `Action` block should return one of the following `LexResult`s:

- `LexOK` when a new token has been correctly recognized (this is the default behavior, so returning this value can be omitted);
- `LexSkip` when the characters matched by the rule should be skipped;
- `LexErr` when some error has occurred.
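As an illustration, the hypothetical rules below skip whitespace and produce a `NUMBER` token; the token type, the way the value is attached, and the `gopapageno.` qualifiers are assumptions to adapt to your own grammar:

```
[ \t\r\n]+ {
	return gopapageno.LexSkip
}
{DIGIT}+ {
	// Hypothetical action: tag the token and attach the matched text.
	token.Type = NUMBER
	token.Value = &text
	return gopapageno.LexOK
}
```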
Lastly, the `Code` section allows any valid Go code, which will be pasted into the generated parser as is. This can be useful to declare global variables or functions to use within the rule actions.
The patterns used to describe cut points, definitions, and rules accept the following operators, listed in decreasing precedence order:

- `x`, where `x` is a generic non-special character, indicates an occurrence of `x`;
- `\x`, where `x` is a generic special character, indicates an occurrence of `x`;
- `[a-dz]` indicates an occurrence of a character between `a` and `d`, or `z`;
- `^[a-dz]` indicates an occurrence of a character not between `a` and `d`, nor `z`;
- `.` indicates an occurrence of any character except `\n`;
- `(reg)` indicates the regular expression `reg`;
- `reg*` indicates zero or more occurrences of `reg`;
- `reg+` indicates one or more occurrences of `reg`;
- `reg1reg2` indicates the concatenation of `reg1` and `reg2`;
- `reg1|reg2` indicates the union of `reg1` and `reg2`.
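For example, `0|[1-9][0-9]*` matches a decimal integer without leading zeros, combining union, character classes, concatenation, and repetition.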
The grammar specification file must follow this format:

```
Options
%%
Rules
%%
Code
```
The `Options` section has the same purpose as its counterpart in the lexicon specification, but it affects how the parallel parser is generated. The supported options are:

- `%axiom Token`: specifies the axiom token of the grammar (this option is mandatory);
- `%preamble Func`: specifies the name of a function that will be executed before the parsing phase begins. The function specified here must adhere to `type PreambleFunc func(sourceLen, concurrency int)`.
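For example (both names are illustrative):

```
%axiom EXPR
%preamble ParserPreallocMem
```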
The `Rules` section contains the rules of the grammar. Every rule must have the following format:

```
Lhs : Rhs {
	Action
};
```
Or, for multiple rules that share the same lhs:

```
Lhs : Rhs1 {
	Action1
} | Rhs2 {
	Action2
};
```
Here, `Lhs` is the name of a nonterminal token, and `Rhs` is a list of tokens separated by spaces, like `Token1 Token2 ... TokenN`. When using COPP, the Kleene-+ operator can be used by wrapping its operands in parentheses, like this: `(Token1 ... TokenI)+ TokenN`.
The `Action` is the Go code that will be executed when the rule is reduced. Inside an `Action`, the following variables are available:

- `rule`: the index of the grammar rule being matched;
- `ruleFlags`: additional metadata about the kind of rule being matched (relevant only when working with C-OPGs);
- `thread`: the index of the concurrent worker performing the reduction.
Moreover, the special symbols `$$` and `$k` can be used to access, respectively, the lhs token and the k-th rhs token. GoPAPAGENO will replace them with the actual references to the tokens when generating code.
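For example, a rule that sums two arithmetic expressions might look like the sketch below; the token names and the representation of token values as `*int64` are assumptions:

```
EXPR : EXPR PLUS EXPR {
	// Hypothetical semantic action: add the values of the two
	// operands ($1 and $3) and store the result in the lhs ($$).
	sum := *$1.Value.(*int64) + *$3.Value.(*int64)
	$$.Value = &sum
};
```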
The `ruleFlags` variable is a bit set that provides additional information about the kind of rule being matched, and its use is fundamental when defining semantic actions for cyclic rules. The `RuleFlags` type provides a `Has` method to easily check whether flags are set, the most relevant ones being:

- `RuleAppend`: set whenever the current rule is the second (or later) occurrence of a cyclic rule. The lhs could possibly already have a value associated with it, `Token1` contains the first token of the sequence, and `TokenN` contains the newest nonterminal to consider. If this flag isn't set, the sequence is being recognized for the very first time, and the lhs' value should be appropriately initialized, if necessary.
- `RuleCombine`: set whenever the rightmost token of the matched prefix has already been matched as part of a sequence in a previous run by another worker. This means that a temporary parent node for it (and the rest of the sequence) has already been created, and possibly contains a partial result. In this particular case, `TokenN` is swapped with its parent, so that its value can be accessed and properly combined into the lhs.
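Putting this together, the action of a cyclic rule might be structured like the sketch below; the rule itself is illustrative, and the `gopapageno.` qualifiers on the flag constants are an assumption:

```
Lhs : (Token1 Token2)+ TokenN {
	if ruleFlags.Has(gopapageno.RuleAppend) {
		// Second or later occurrence: the lhs may already hold a
		// partial value, so only combine the newest tokens into it.
	} else {
		// First recognition of the sequence: initialize the lhs' value.
	}
	if ruleFlags.Has(gopapageno.RuleCombine) {
		// TokenN was swapped with its temporary parent, whose partial
		// result should be combined into the lhs as well.
	}
};
```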
As for the lexicon specification, the `Code` section allows any valid Go code. Beware of conflicts, though, as both the code declared here and the code in the `Code` section of the lexicon specification will be placed in the same package.
The `examples` folder contains examples of generated parsers for different languages, such as simple arithmetic expressions and JSON. Each example folder contains parsers that leverage different algorithms: `OPP`, `AOPP`, and `COPP`.
To run a parser, execute the generated `main.pg.go` files with the command-line flags detailed in the Installing and running the tool section.
Example folders also contain a `benchmark_test.go` file that can be executed to run benchmarks specific to that parsing algorithm and grammar. Two benchmarks named `BenchmarkParse` and `BenchmarkParseOnly` are defined: the first covers the entire lexing and parsing process, while the second only covers the parsing stage.
To run a benchmark, execute a command with the following format:

```
go test github.com/giornetta/gopapageno/examples/{{ .EXAMPLE }}/{{ .STRAT }} -bench=BenchmarkParseOnly -count={{ .COUNT }}
```
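For instance, the following command (the package path is assumed from the layout of the `examples` folder) would run the parsing-only benchmark of the JSON example with the COPP strategy ten times:

```
go test github.com/giornetta/gopapageno/examples/json/copp -bench=BenchmarkParseOnly -count=10
```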
To learn more about benchmarking details and how to replicate the process, please read the `EVALUATING.md` file.
- Michele Giornetta [email protected] [email protected] (Refactor, AOPP and C-OPP Extensions)
- Luca Marzi [email protected] (XPath Extension)
- Simone Guidi [email protected] (Original Version)