contextual lexer accept set seems wrong #1581

@MartyMcFlyInTheSky

Trying to parse this AIREP:

UAAA01 EGRR 031514
VPFBQ 6600S 06834W 1320 F200 MS32 290/36
VPFBQ 6400S 06903W 1356 F200 MS30 280/33
VPFBQ 6200S 06928W 1433 F200 MS30 310/18
VPFBQ 6000S 06950W 1514 F200 MS30 00/38

with this grammar:

%import common.WS_INLINE
%import common.NEWLINE
%ignore WS_INLINE
%ignore NEWLINE


?start: airep_tac

// significant newline has higher prio than NEWLINE
_NL.2: /\n/

airep_tac: header_line designator_line? (airep_block | airep_block_snl)


// -------------------- WMO header --------------------
header_line: message_type issuing_office issue_time correction* _NL

message_type: TTAAII
issuing_office: CCCC
issue_time: YYGGGG
correction: BBB

TTAAII: /U[A-Z]{3}[0-9]{2}(?![A-Z0-9])/
CCCC: /[A-Z]{4}(?![A-Z])/ 
YYGGGG: /[0-9]{6}(?![0-9])/ 
BBB: /[A-Z]{3}/

// -------------------- designator line --------------------
designator_line: AIREP date? _NL

AIREP.2: "AIREP"

date: DDHH
DDHH: /\d{4}/

// -------------------- airep_blocks --------------------

// airep blocks using ARP/ARS as record separator
airep_block: airep_line+
airep_line: msg_type_designator airplane_id loc_ref REST+

REST: /(?!ARP|ARS)[^\s]+/

msg_type_designator: ARP | ARS
ARP: "ARP"
ARS: "ARS"

airplane_id: /[A-Z0-9]{4,7}/

loc_ref: latlon_ddmm

latlon_ddmm: LAT_DD LAT_MM LAT_HEM LON_DDD LON_MM LON_HEM
LAT_DD.5: /\d{2}(?=\d{2}[NS]\s*\d{5}[EW])/
LAT_MM.5: /\d{2}(?=[NS]\s*\d{5}[EW])/
LAT_HEM.5: /[NS]/
LON_DDD.5: /\d{3}(?=\d{2}[EW])/
LON_MM.5: /\d{2}(?=[EW])/
LON_HEM.5: /[EW]/

// airep blocks using significant newlines as record separator
airep_block_snl: airep_line_snl+
airep_line_snl: airplane_id loc_ref REST_SNL+ _NL

REST_SNL: /[^\n]+/

With this grammar, the lexer still has the REST terminal in its accept set, which I don't understand. Here is the concrete error:

E               lark.exceptions.UnexpectedToken: Unexpected token Token('REST', '1320') at line 2, column 20.
E               Expected one of:
E                       * REST_SNL

This surprises me: if I understand the contextual lexer correctly, the parse table should not even consider the REST token at this point. REST should only be a valid choice in airep_line derivations, not in airep_line_snl derivations. I assumed the parse table would be built that way, but some logging with the interactive parser shows:

Parser choices:
        - REST_SNL -> (Reduce, Rule(NonTerminal(Token('RULE', 'latlon_ddmm')), [Terminal('LAT_DD'), Terminal('LAT_MM'), Terminal('LAT_HEM'), Terminal('LON_DDD'), Terminal('LON_MM'), Terminal('LON_HEM')], None, RuleOptions(False, False, None, None)))
        - REST -> (Reduce, Rule(NonTerminal(Token('RULE', 'latlon_ddmm')), [Terminal('LAT_DD'), Terminal('LAT_MM'), Terminal('LAT_HEM'), Terminal('LON_DDD'), Terminal('LON_MM'), Terminal('LON_HEM')], None, RuleOptions(False, False, None, None)))
stack size: 9
EXPECTED: ['REST_SNL']
NEXT: 1320
LAST_OK: W
F

After parsing LON_HEM, the parser still has both terminals up for grabs. Why?
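
To make the consequence concrete, here is a minimal sketch of a state-scoped lexer, assuming the lexer for a given parser state tries only the terminals in that state's accept set and takes the first one that matches. The helper next_token and the order of the accept lists are hypothetical and only model the idea; this is not Lark's actual implementation. The two patterns are copied from the grammar above.

```python
import re

# Patterns copied from the grammar above.
TERMINALS = {
    "REST": r"(?!ARP|ARS)[^\s]+",
    "REST_SNL": r"[^\n]+",
}


def next_token(text, pos, accept_order):
    """Return (terminal_name, lexeme) for the first accepted terminal matching at pos.

    accept_order is a hypothetical, ordered accept set for one parser state.
    How the real lexer orders competing terminals is not modeled here.
    """
    for name in accept_order:
        match = re.compile(TERMINALS[name]).match(text, pos)
        if match:
            return name, match.group()
    return None


line = "VPFBQ 6600S 06834W 1320 F200 MS32 290/36\n"
pos = line.index("1320")  # 0-based offset of the token named in the error

# Accept set as shown in the log: both terminals are candidates.
both = next_token(line, pos, ["REST", "REST_SNL"])
# Accept set I expected for airep_line_snl: only REST_SNL.
only_snl = next_token(line, pos, ["REST_SNL"])

print(both)      # ('REST', '1320')
print(only_snl)  # ('REST_SNL', '1320 F200 MS32 290/36')
```

In this model, as soon as REST is in the accept set and gets tried first, "1320" comes back as a REST token, which is the token named in the error. With only REST_SNL accepted, the remainder of the line is consumed as a single token.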
