GoPAPAGENO was originally developed as a Go port of PAPAGENO, a Python 2.x tool that generates C parallel parsers based on Floyd's Operator Precedence Grammars. While PAPAGENO required the produced parsers to be used in conjunction with lexers generated by other programs, such as Flex, GoPAPAGENO also implements an efficient way to generate parallel lexical scanners.
This fork extends the original codebase with a new parsing algorithm based on Cyclic Operator Precedence Grammars, which can significantly outperform the original OPP algorithm while also generating much flatter parse trees.
> **Note**
> This project was developed as part of my Master's Thesis work at Politecnico di Milano. It is not at a production-grade level of functionality and optimization, and could be greatly improved to achieve even better results.
GoPAPAGENO is able to automatically generate parallel scanners and parsers starting from two specification files which resemble those required by programs like Flex and Bison. This section explains how to write these files and how to use the tool and the programs that it produces.
GoPAPAGENO can be installed by using the Go toolchain and running the following command:

```
go install github.com/giornetta/gopapageno/cmd/gopapageno@latest
```
Alternatively, you can run a Docker container:

```
docker run -it giornetta/gopapageno
```
The `gopapageno` program accepts the following command-line flags:

- `-l lexfile.l`: the lexicon specification file (mandatory);
- `-g grammarfile.g`: the grammar specification file (mandatory);
- `-o outdir`: the output directory of the generated code (mandatory);
- `-s strat`: the parsing strategy to use, among `opp`, `aopp`, and `copp`;
- `-types-only`: if set, the program will not generate a main file, leaving integration into existing programs up to the user;
- `-benchmark`: if set, the program will also generate a file containing some benchmarking utilities;
- `-log`: if set, it enables verbose logging during code generation.
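For example, the following invocation generates a COPP parser from a pair of specification files (the file names are illustrative):

```
gopapageno -l expr.l -g expr.g -o ./generated -s copp
```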
Unless gopapageno is run with `-types-only`, the tool will generate a self-contained program, which can be executed with the following flags:

- `-f source`: the source file to be parsed;
- `-c concurrency`: the maximum number of goroutines to spawn during parallel lexing and parsing;
- `-s strat`: the reduction strategy to use, among `sweep`, `mixed`, and `parallel`;
- `-avg num`: the average length of tokens in the source file; setting this properly can result in better performance;
- `-pf factor`: an estimate of how much parallelism can be exploited in the provided input file, in the range `[0, 1]`; setting this properly can result in better performance;
- `-log`: if set, it enables verbose logging during parsing;
- `-cpuprof outfile`: if set, it enables CPU profiling and writes its results to `outfile`;
- `-memprof outfile`: if set, it enables memory profiling and writes its results to `outfile`.
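For instance, a generated parser could be run from within the output directory like this (the input file and flag values are illustrative):

```
go run . -f input.json -c 8 -s sweep
```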
Otherwise, when `-types-only` is set, GoPAPAGENO will only generate `lexer.pg.go` and `parser.pg.go` files, which declare, respectively, the functions `NewLexer()` and `NewGrammar()`, whose results can then be passed to `gopapageno.NewRunner` alongside any desired options, using Go's functional options pattern, as shown in the example below.
```go
r := gopapageno.NewRunner(
	NewLexer(),
	NewGrammar(),
	gopapageno.WithConcurrency(4),
	gopapageno.WithMemoryProfiling(memProfileWriter),
	gopapageno.WithReductionStrategy(gopapageno.ReductionSweep),
	gopapageno.WithAverageTokenLength(7),
)
```
The obtained runner can be used to execute the whole parallel lexing and parsing process by simply calling `r.Run(ctx, source)`, where `source` is a `[]byte`.
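A complete driver might look like the following sketch. The input path is illustrative, and the return values of `Run` are an assumption; check the generated code for the exact signature.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"

	"github.com/giornetta/gopapageno"
)

func main() {
	// NewLexer and NewGrammar are declared in the generated
	// lexer.pg.go and parser.pg.go files.
	r := gopapageno.NewRunner(NewLexer(), NewGrammar())

	src, err := os.ReadFile("input.txt") // hypothetical input file
	if err != nil {
		log.Fatal(err)
	}

	// Assumption: Run returns the root of the parse tree and an error.
	root, err := r.Run(context.Background(), src)
	if err != nil {
		log.Fatalf("parsing failed: %v", err)
	}
	fmt.Println(root)
}
```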
The lexicon specification file must follow this format:

```
Options
%%
Definitions
%%
Rules
%%
Code
```
The `Options` section lets you specify parameters that affect how the parallel scanner is generated. Currently, the only supported options are:

- `%cut Pattern`: specifies a regular expression used to identify cut points in the source string. If omitted, the source will be cut arbitrarily, potentially causing errors.
- `%preamble Func`: specifies the name of a function that will be executed before the lexing phase begins. The function specified here must adhere to `type PreambleFunc func(sourceLen, concurrency int)`.
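For example, an options section might look like this (both the cut pattern and the preamble function name are illustrative):

```
%cut \n
%preamble LexerPreallocMem
```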
The `Definitions` section lets you define, on separate lines, regular expressions associated with an identifier, following the format `Identifier Pattern`. The identifiers defined here can then be used in the `Rules` section by writing `{Identifier}`.
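For instance, the following (illustrative) definitions declare character classes that rules can reference as `{DIGIT}` and `{LETTER}`:

```
DIGIT [0-9]
LETTER [a-zA-Z]
```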
The `Rules` section contains the rules that specify how tokens will be produced by the generated scanner. Every rule must have the following format:

```
Pattern {
	Action
}
```
Here, `Pattern` is the actual regular expression to recognize, and `Action` is the Go code that will be executed when the associated regular expression is matched. Inside an `Action`, the following variables are available:

- `rule`: the index of the lex rule being matched;
- `text`: the actual string being matched;
- `start` and `end`: the starting and ending indices of the current token in the source text;
- `thread`: the index of the concurrent worker scanning the current token;
- `token`: a reference to the token being produced. By accessing it, its `Type` and `Value` can be set from here.
Moreover, each `Action` block should return one of the following `LexResult`s:

- `LexOK` when a new token has been correctly recognized (this is the default behavior, so returning this value can be omitted);
- `LexSkip` when the characters matched by the rule should be skipped;
- `LexErr` when some error has occurred.
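As an illustration, the hypothetical rules below skip whitespace and produce a `NUMBER` token; the token type, the way the value is attached, and the `gopapageno.` qualifiers are assumptions to adapt to your own grammar:

```
[ \t\r\n]+ {
	return gopapageno.LexSkip
}
{DIGIT}+ {
	// Hypothetical action: tag the token and attach the matched text.
	token.Type = NUMBER
	token.Value = &text
	return gopapageno.LexOK
}
```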
Lastly, the `Code` section allows any valid Go code, which will be pasted into the generated parser as is. This can be useful to declare global variables or functions to use within the rule actions.
The patterns used to describe cut points, definitions, and rules accept the following operators, listed in decreasing precedence order:

- `x`, where `x` is a generic non-special character, indicates an occurrence of `x`;
- `\x`, where `x` is a generic special character, indicates an occurrence of `x`;
- `[a-dz]` indicates an occurrence of a character between `a` and `d`, or `z`;
- `^[a-dz]` indicates an occurrence of a character not between `a` and `d`, nor `z`;
- `.` indicates an occurrence of any character except `\n`;
- `(reg)` indicates the regular expression `reg`;
- `reg*` indicates zero or more occurrences of `reg`;
- `reg+` indicates one or more occurrences of `reg`;
- `reg1reg2` indicates the concatenation of `reg1` and `reg2`;
- `reg1|reg2` indicates the union of `reg1` and `reg2`.
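For example, `0|[1-9][0-9]*` matches a decimal integer without leading zeros, combining union, character classes, concatenation, and repetition.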
The grammar specification file must follow this format:

```
Options
%%
Rules
%%
Code
```
The `Options` section has the same purpose as its counterpart in the lexicon specification, but it affects how the parallel parser is generated. The supported options are:

- `%axiom Token`: specifies the axiom token of the grammar (this option is mandatory);
- `%preamble Func`: specifies the name of a function that will be executed before the parsing phase begins. The function specified here must adhere to `type PreambleFunc func(sourceLen, concurrency int)`.
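For example (both names are illustrative):

```
%axiom EXPR
%preamble ParserPreallocMem
```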
The `Rules` section contains the rules of the grammar. Every rule must have the following format:

```
Lhs : Rhs {
	Action
};
```
Or, for multiple rules that share the same lhs:

```
Lhs : Rhs1 {
	Action1
} | Rhs2 {
	Action2
};
```
Here, `Lhs` is the name of a nonterminal token, and `Rhs` is a list of tokens separated by spaces, like `Token1 Token2 ... TokenN`. When using COPP, the Kleene-+ operator can be used by wrapping its operands in parentheses, like this: `(Token1 ... TokenI)+ TokenN`.
The `Action` is the Go code that will be executed when the rule is reduced. Inside an `Action`, the following variables are available:

- `rule`: the index of the grammar rule being matched;
- `ruleFlags`: additional metadata about the kind of rule being matched (relevant only when working with C-OPGs);
- `thread`: the index of the concurrent worker performing the reduction.
Moreover, the special symbols `$$` and `$k` can be used to access, respectively, the lhs token and the k-th rhs token. GoPAPAGENO will replace them with the actual references to the tokens when generating code.
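For example, a rule that sums two arithmetic expressions might look like the sketch below; the token names and the representation of token values as `*int64` are assumptions:

```
EXPR : EXPR PLUS EXPR {
	// Hypothetical semantic action: add the values of the two
	// operands ($1 and $3) and store the result in the lhs ($$).
	sum := *$1.Value.(*int64) + *$3.Value.(*int64)
	$$.Value = &sum
};
```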
The `ruleFlags` variable is a bit set that provides additional information about the kind of rule being matched, and its use is fundamental when defining semantic actions for cyclic rules. The `RuleFlags` type provides a `Has` method to easily check whether flags are set, the most relevant ones being:

- `RuleAppend`: set whenever the current rule is the second (or later) occurrence of a cyclic rule. The lhs could possibly already have a value associated with it, `Token1` contains the first token of the sequence, and `TokenN` contains the newest nonterminal to consider. If this flag isn't set, the sequence is being recognized for the very first time, and the lhs' value should be appropriately initialized, if necessary.
- `RuleCombine`: set whenever the rightmost token of the matched prefix has already been matched as part of a sequence in a previous run by another worker. This means that a temporary parent node for it (and the rest of the sequence) has already been created, and possibly contains a partial result. In this particular case, `TokenN` is swapped with its parent, so that its value can be accessed and properly combined into the lhs.
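Putting this together, the action of a cyclic rule might be structured like the sketch below; the rule itself is illustrative, and the `gopapageno.` qualifiers on the flag constants are an assumption:

```
Lhs : (Token1 Token2)+ TokenN {
	if ruleFlags.Has(gopapageno.RuleAppend) {
		// Second or later occurrence: the lhs may already hold a
		// partial value, so only combine the newest tokens into it.
	} else {
		// First recognition of the sequence: initialize the lhs' value.
	}
	if ruleFlags.Has(gopapageno.RuleCombine) {
		// TokenN was swapped with its temporary parent, whose partial
		// result should be combined into the lhs as well.
	}
};
```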
As for the lexicon specification, the `Code` section allows any valid Go code. Beware of conflicts, though, as both the code declared here and the code in the `Code` section of the lexicon specification will be placed in the same package.
The `examples` folder contains examples of generated parsers for different languages, such as simple arithmetic expressions and JSON. Each example folder contains parsers that leverage different algorithms: `OPP`, `AOPP`, and `COPP`.
To run a parser, execute the generated `main.pg.go` files with the command-line flags detailed in the Installing and running the tool section.
Example folders also contain a `benchmark_test.go` file that can be executed to run benchmarks specific to that parsing algorithm and grammar. Two benchmarks named `BenchmarkParse` and `BenchmarkParseOnly` are defined: the first covers the entire lexing and parsing process, while the second only covers the parsing stage.
To run a benchmark, execute a command with the following format:

```
go test github.com/giornetta/gopapageno/examples/{{ .EXAMPLE }}/{{ .STRAT }} -bench=BenchmarkParseOnly -count={{ .COUNT }}
```
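For instance, the following command (the package path is assumed from the layout of the `examples` folder) would run the parsing-only benchmark of the JSON example with the COPP strategy ten times:

```
go test github.com/giornetta/gopapageno/examples/json/copp -bench=BenchmarkParseOnly -count=10
```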
To learn more about benchmarking details and how to replicate the process, please read the `EVALUATING.md` file.
- Michele Giornetta [email protected] [email protected] (Refactor, AOPP and C-OPP Extensions)
- Luca Marzi [email protected] (XPath Extension)
- Simone Guidi [email protected] (Original Version)