Skip to content

simple PEG grammar not working as expected #36

@wiktortomczak

Description

@wiktortomczak

I'd expect Pika to parse both of the following

a[]()
a()[]

using the following grammar:

program
  <- primary;

primary
  <- call
  /  subscript
  /  attribute
  /  name;

call
  <- call:(primary '(' ')');
  
subscript
  <- subscript:(primary '[' ']');

attribute
  <- attribute:(primary '.' name);

name <- name:[A-Za-z0-9_]+;

However, only a[]() works:

Match tree for rule program:
                                                      ┌─────────┐  
 10 : primary <- call / subscript / attribute / name  │a [ ] ( )│  
                                                      ├─────────┤  
                  6 : call <- call:(primary '(' ')')  │a [ ] ( )│  
                                                      ├─────┐   │  
 10 : primary <- call / subscript / attribute / name  │a [ ]│   │  
                                                      ├─────┤   │  
        7 : subscript <- subscript:(primary '[' ']')  │a [ ]│   │  
                                                      ├─┐   │   │  
 10 : primary <- call / subscript / attribute / name  │a│   │   │  
                                                      ├─┤   │   │  
                      8 : name <- name:[0-9A-Z_a-z]+  │a│   │   │  
                                                      ├─┤   │   │  
                         5 : [terminal] [0-9A-Z_a-z]  │a│   │   │  
                                                      │ │ ┌─┤   │  
                                  3 : [terminal] ']'  │ │ │]│   │  
                                                      │ ├─┤ │   │  
                                  2 : [terminal] '['  │ │[│ │   │  
                                                      │ │ │ │ ┌─┤  
                                  1 : [terminal] ')'  │ │ │ │ │)│  
                                                      │ │ │ ├─┤ │  
                                  0 : [terminal] '('  │ │ │ │(│ │  
                                                       0 1 2 3 4 
                                                       a [ ] ( ) 
AST for rule "program":

└─program : 0+5 : "a[]()"
  └─call : 0+5 : "a[]()"
    └─subscript : 0+3 : "a[]"
      └─name : 0+1 : "a"

while a()[] produces a partial match (the trailing [] is not matched):

Match tree for rule program:
                                                      ┌─────┐░░░   
 10 : primary <- call / subscript / attribute / name  │a ( )│░░░   
                                                      ├─────┤░░░   
                  6 : call <- call:(primary '(' ')')  │a ( )│░░░   
                                                      ├─┐   │░░░   
 10 : primary <- call / subscript / attribute / name  │a│   │░░░   
                                                      ├─┤   │░░░   
                      8 : name <- name:[0-9A-Z_a-z]+  │a│   │░░░   
                                                      ├─┤   │░░░   
                         5 : [terminal] [0-9A-Z_a-z]  │a│   │░░░   
                                                      │ │   │░┌─┐  
                                  3 : [terminal] ']'  │ │   │░│]│  
                                                      │ │   ├─┤ │  
                                  2 : [terminal] '['  │ │   │[│ │  
                                                      │ │ ┌─┤ │ │  
                                  1 : [terminal] ')'  │ │ │)│ │ │  
                                                      │ ├─┤ │ │ │  
                                  0 : [terminal] '('  │ │(│ │ │ │  
                                                       0 1 2 3 4 
                                                       a ( ) [ ] 
AST for rule "program":

└─program : 0+3 : "a()"
  └─call : 0+3 : "a()"
    └─name : 0+1 : "a"

The Pika library is used as follows:

import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

import pikaparser.grammar.Grammar;
import pikaparser.grammar.MetaGrammar;
import pikaparser.parser.utils.ParserInfo;

public class Parser {

  static final String TOP_RULE_NAME = "program";

    public static void main(String[] args) throws IOException {
        final String grammarStr = Files.readString(Path.of(args[0]), Charset.defaultCharset());
        final String text = Files.readString(Path.of(args[1]), Charset.defaultCharset());

        final var grammar = MetaGrammar.parse(grammarStr);
        final var memoTable = grammar.parse(text);
        final var topClause = grammar.allClauses.get(grammar.allClauses.size()-1);
        
        System.out.println("Match tree for rule " + TOP_RULE_NAME + ":");
        ParserInfo.printParseTreeInMemoTableForm(memoTable);
        System.out.println("AST for rule \"" + TOP_RULE_NAME + "\":\n");
        ParserInfo.printAST(TOP_RULE_NAME, topClause, memoTable);
    }
}

I suppose this has to do with the relative order of call and subscript choices in primary.

Is this as intended? If yes, how should the grammar be rewritten to handle both a[]() and a()[] ?

For comparison, pegen, CPython's PEG parser with support for left-recursion parses both. I'm including a minimal dockerfile that reproduces pegen results:

# syntax=docker/dockerfile:1

# save this to pegen.Dockerfile then run 
# $ docker buildx build -f pegen.Dockerfile .

FROM python:3.12-slim


RUN pip install pegen


COPY <<GRAMMAR grammar.pegen

start: 
  | p=primary NEWLINE? ENDMARKER { ['start', p] } 

primary:
  | call
  | subscript
  | attribute
  | name_

call:
  | p=primary '(' ')' { ['call', p] } 

subscript:
  | p=primary '[' ']' { ['subscript', p] } 

attribute:
  | p=primary '.' n=name_ { ['attribute', p, n] } 

name_:
  | n=NAME { ['name', n.string] } 

GRAMMAR


RUN python3 <<PEGEN

import pegen.utils
parser = pegen.utils.make_parser(open('grammar.pegen').read())
def parse(text):
  tree = pegen.utils.parse_string(text, parser)
  print(f'{text=} {tree=}')
parse('a[]()')
parse('a()[]')
exit(1)  # make this script fail so that docker shows its output

PEGEN

This outputs:

text='a[]()' tree=['start', ['call', ['subscript', ['name', 'a']]]]
text='a()[]' tree=['start', ['subscript', ['call', ['name', 'a']]]]

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions