Lexing a grammar with string interpolation #891
I'm trying to lex a grammar that includes Ruby-style string interpolation. If you're not familiar, it has this syntax: "This part of the string is literal, but #{expr} will evaluate `expr` as Ruby code."

I've tried to approach this in a few ways, and the closest I've come is by using two recursive parsers, one for strings and one for all tokens, that each refer to the other. What I'm finding is that having tokens for curly braces in my grammar (for use outside of string interpolations) causes lexing of string interpolation to fail.

This is the best minimal, reproducible example I could come up with. Hopefully there isn't too much incidental complexity here.

#![allow(dead_code)]
use chumsky::prelude::*;

#[derive(Clone, Debug)]
enum Token<'a> {
    Add,
    Ident(&'a str),
    String(Vec<StringPart<'a>>),
    LBrace,
    RBrace,
}

#[derive(Clone, Debug)]
enum StringPart<'a> {
    Literal(&'a str),
    Interpolation(Vec<Token<'a>>),
}

fn lexer<'a>() -> impl Parser<'a, &'a str, Vec<Token<'a>>, extra::Err<Rich<'a, char, SimpleSpan>>> {
    let mut token = Recursive::declare();
    let mut string = Recursive::declare();

    let add = just('+').to(Token::Add);
    let ident = text::ascii::ident().map(Token::Ident);
    let lbrace = just('{').to(Token::LBrace);
    let rbrace = just('}').to(Token::RBrace);
    // Replacing the above two lines with the following two will cause all the examples in `main`
    // to work as expected.
    //
    // let lbrace = just('`').to(Token::LBrace);
    // let rbrace = just('~').to(Token::RBrace);

    let string_part = choice((
        just("#{")
            .ignore_then(token.clone().padded().repeated().collect())
            .then_ignore(just('}'))
            .map(StringPart::Interpolation),
        any()
            .and_is(just('"').not())
            .and_is(just("#{").not())
            .repeated()
            .at_least(1)
            .to_slice()
            .map(StringPart::Literal),
    ));

    string.define(
        just('"')
            .ignore_then(string_part.repeated().collect())
            .then_ignore(just('"'))
            .map(Token::String),
    );

    token.define(choice((add, ident, string, lbrace, rbrace)));

    token
        .padded()
        .recover_with(skip_then_retry_until(any().ignored(), end()))
        .repeated()
        .collect()
}

fn lex(source: &str) {
    let source = source.trim();
    match lexer().parse(source).into_result() {
        Ok(tokens) => println!("{source}\n{tokens:?}\n"),
        Err(errors) => println!("{source}\n{errors:?}\n"),
    }
}

fn main() {
    lex(r#" "#); // empty input
    lex(r#" "" "#); // empty string
    lex(r#" "x" "#); // literal x
    lex(r#" x "#); // ident x
    lex(r##" "#{}" "##); // empty interpolation
    lex(r##" "#{}" x "##); // empty interpolation + ident x
    lex(r##" "#{x}" "##); // interpolation of ident x
    lex(r##" "x + #{x} + x" "##); // literal + interpolation of ident x + literal
    lex(r##" "x + #{x + x} + x" "##); // literal + interpolation of (ident x + ident x) + literal
    lex(r##" "#{x + x} + x + #{x + x}" "##); // interpolation of (ident x + ident x) + literal + interpolation of (ident x + ident x)
}

The output of the program as written shows lexing errors for the examples that contain interpolations. As noted in the code comment, replacing the `{` and `}` tokens with other characters causes all the examples to work as expected.

I would appreciate any guidance on what I am missing here. If there's a better way to approach string interpolation in a lexer with Chumsky, that would be helpful to know, too! Assuming this overall approach is solid, I'm also wondering if there's a way my … Thanks in advance!
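To make the failure mode concrete, here is a std-only sketch (not chumsky; the names are hypothetical) of what a greedy token loop does inside an interpolation when `}` is itself a valid token: the loop consumes the very `}` that was supposed to close the interpolation, and then the closing `"` of the string as well.

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    RBrace,
    Quote,
}

// Naive token scan inside `#{ ... }`: any recognised token is consumed greedily.
fn naive_interp_tokens(mut s: &str) -> (Vec<Tok>, &str) {
    let mut toks = Vec::new();
    loop {
        if let Some(rest) = s.strip_prefix('}') {
            toks.push(Tok::RBrace); // `}` is a valid token, so it gets eaten!
            s = rest;
        } else if let Some(rest) = s.strip_prefix('"') {
            toks.push(Tok::Quote);
            s = rest;
        } else {
            break;
        }
    }
    (toks, s)
}

fn main() {
    // Lexing the interpolation body of `"#{}"`: the remaining input is `}"`.
    // The scanner consumes the `}` (and then the closing `"`), leaving nothing
    // to terminate either the interpolation or the string.
    let (toks, rest) = naive_interp_tokens("}\"");
    assert_eq!(toks, vec![Tok::RBrace, Tok::Quote]);
    assert_eq!(rest, "");
    println!("consumed: {toks:?}");
}
```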
The problem here is with the way you've designed the token parser. Let's take the simplest problematic example: `"#{}"`.

In theory, this should be parsed as `String([Literal(""), Interpolation([]), Literal("")])`, if I understand correctly.

The problem occurs when we hit the `#{`, which starts the interpolated expression. Here, we start looking for more tokens (since an interpolation just contains a list of tokens). What's the next character? Well, it's `}`. But because you've got your `rbrace` parser enabled, this successfully parses as a token, even though we shouldn't be finding any tokens in the interpolation!

This continues: the next character is a `"`, which means the start of a string token. Your parser therefore sees the `}` as an `RBrace` token inside the interpolation, and the final `"` as opening a new string that is never closed. Notice how this parse is incomplete. Your grammar contains an ambiguity, which causes the parser to get the wrong end of the stick.

There is a solution: make sure that a single `}` can never be lexed as a token on its own.

#![allow(dead_code)]
use chumsky::prelude::*;

#[derive(Clone, Debug)]
enum Token<'a> {
    Add,
    Ident(&'a str),
    String(Vec<StringPart<'a>>),
    // { ... }
    Block(Vec<Self>),
}

#[derive(Clone, Debug)]
enum StringPart<'a> {
    Literal(&'a str),
    Interpolation(Vec<Token<'a>>),
}

fn lexer<'a>() -> impl Parser<'a, &'a str, Vec<Token<'a>>, extra::Err<Rich<'a, char, SimpleSpan>>> {
    let mut token = Recursive::declare();
    let mut string = Recursive::declare();

    let add = just('+').to(Token::Add);
    let ident = text::ascii::ident().map(Token::Ident);
    let block = token.clone().padded()
        .repeated().collect()
        .delimited_by(just('{'), just('}'))
        .map(Token::Block);

    let string_part = choice((
        just("#{")
            .ignore_then(token.clone().padded().repeated().collect())
            .then_ignore(just('}'))
            .map(StringPart::Interpolation),
        any()
            .and_is(just('"').not())
            .and_is(just("#{").not())
            .repeated()
            .at_least(1)
            .to_slice()
            .map(StringPart::Literal),
    ));

    string.define(
        just('"')
            .ignore_then(string_part.repeated().collect())
            .then_ignore(just('"'))
            .map(Token::String),
    );

    token.define(choice((add, ident, string, block)));

    token
        .padded()
        .recover_with(skip_then_retry_until(any().ignored(), end()))
        .repeated()
        .collect()
}

fn lex(source: &str) {
    let source = source.trim();
    match lexer().parse(source).into_result() {
        Ok(tokens) => println!("{source}\n{tokens:?}\n"),
        Err(errors) => println!("{source}\n{errors:?}\n"),
    }
}

fn main() {
    lex(r#" "#); // empty input
    lex(r#" "" "#); // empty string
    lex(r#" "x" "#); // literal x
    lex(r#" x "#); // ident x
    lex(r##" "#{}" "##); // empty interpolation
    lex(r##" "#{}" x "##); // empty interpolation + ident x
    lex(r##" "#{x}" "##); // interpolation of ident x
    lex(r##" "x + #{x} + x" "##); // literal + interpolation of ident x + literal
    lex(r##" "x + #{x + x} + x" "##); // literal + interpolation of (ident x + ident x) + literal
    lex(r##" "#{x + x} + x + #{x + x}" "##); // interpolation of (ident x + ident x) + literal + interpolation of (ident x + ident x)
}

Notice how we've replaced the `lbrace` and `rbrace` parsers with a single `block` parser, so a `}` can never be lexed as a token on its own. I think this solution is spiritually compatible with your existing tree-driven approach.
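The principle behind the `block` fix can be sketched without chumsky: because `}` only ever appears as the closer of a matched pair, a scanner for an interpolation can track brace depth, so a `}` belonging to an inner block never terminates the interpolation prematurely. A minimal std-only sketch (the helper name is hypothetical):

```rust
// Splits an input beginning with "#{" into (expression, rest-of-input),
// tracking brace depth so nested `{ ... }` blocks inside the expression
// don't end the interpolation early.
fn split_interpolation(src: &str) -> Option<(&str, &str)> {
    let body = src.strip_prefix("#{")?;
    let mut depth = 0usize;
    for (i, c) in body.char_indices() {
        match c {
            '{' => depth += 1,
            '}' if depth == 0 => return Some((&body[..i], &body[i + 1..])),
            '}' => depth -= 1,
            _ => {}
        }
    }
    None // unterminated interpolation
}

fn main() {
    // The first `}` closes the inner block, not the interpolation itself.
    assert_eq!(
        split_interpolation("#{x + {y}}\" tail"),
        Some(("x + {y}", "\" tail"))
    );
    // An empty interpolation terminates at the very first `}`.
    assert_eq!(split_interpolation("#{}rest"), Some(("", "rest")));
    println!("ok");
}
```

This is exactly the structure that the `Block` token encodes declaratively: `delimited_by(just('{'), just('}'))` forces every `{` to pair with its own `}`, leaving the interpolation's `then_ignore(just('}'))` as the only consumer of the outer closing brace.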