Lexing a grammar with string interpolation #891
I'm trying to lex a grammar that includes Ruby-style string interpolation. If you're not familiar, it has this syntax: "This part of the string is literal, but #{expr} will evaluate `expr` as Ruby code."

I've tried to approach this in a few ways, and the closest I've come is by using two recursive parsers, one for strings and one for all tokens, that each refer to the other. What I'm finding is that having tokens for curly braces in my grammar (for use outside of string interpolations) causes lexing of string interpolation to fail.

This is the best minimal, reproducible example I could come up with. Hopefully there isn't too much incidental complexity here.

#![allow(dead_code)]
use chumsky::prelude::*;

#[derive(Clone, Debug)]
enum Token<'a> {
    Add,
    Ident(&'a str),
    String(Vec<StringPart<'a>>),
    LBrace,
    RBrace,
}

#[derive(Clone, Debug)]
enum StringPart<'a> {
    Literal(&'a str),
    Interpolation(Vec<Token<'a>>),
}

fn lexer<'a>() -> impl Parser<'a, &'a str, Vec<Token<'a>>, extra::Err<Rich<'a, char, SimpleSpan>>> {
    let mut token = Recursive::declare();
    let mut string = Recursive::declare();

    let add = just('+').to(Token::Add);
    let ident = text::ascii::ident().map(Token::Ident);
    let lbrace = just('{').to(Token::LBrace);
    let rbrace = just('}').to(Token::RBrace);
    // Replacing the above two lines with the following two will cause all the examples in `main`
    // to work as expected.
    //
    // let lbrace = just('`').to(Token::LBrace);
    // let rbrace = just('~').to(Token::RBrace);

    let string_part = choice((
        just("#{")
            .ignore_then(token.clone().padded().repeated().collect())
            .then_ignore(just('}'))
            .map(StringPart::Interpolation),
        any()
            .and_is(just('"').not())
            .and_is(just("#{").not())
            .repeated()
            .at_least(1)
            .to_slice()
            .map(StringPart::Literal),
    ));

    string.define(
        just('"')
            .ignore_then(string_part.repeated().collect())
            .then_ignore(just('"'))
            .map(Token::String),
    );

    token.define(choice((add, ident, string, lbrace, rbrace)));

    token
        .padded()
        .recover_with(skip_then_retry_until(any().ignored(), end()))
        .repeated()
        .collect()
}

fn lex(source: &str) {
    let source = source.trim();
    match lexer().parse(source).into_result() {
        Ok(tokens) => println!("{source}\n{tokens:?}\n"),
        Err(errors) => println!("{source}\n{errors:?}\n"),
    }
}

fn main() {
    lex(r#" "#); // empty input
    lex(r#" "" "#); // empty string
    lex(r#" "x" "#); // literal x
    lex(r#" x "#); // ident x
    lex(r##" "#{}" "##); // empty interpolation
    lex(r##" "#{}" x "##); // empty interpolation + ident x
    lex(r##" "#{x}" "##); // interpolation of ident x
    lex(r##" "x + #{x} + x" "##); // literal + interpolation of ident x + literal
    lex(r##" "x + #{x + x} + x" "##); // literal + interpolation of (ident x + ident x) + literal
    lex(r##" "#{x + x} + x + #{x + x}" "##); // interpolation of (ident x + ident x) + literal + interpolation of (ident x + ident x)
}

The output of the program as written shows lexing errors for the examples that contain interpolations. As noted in the code comment, replacing the `{` and `}` tokens with other characters causes all the examples to work as expected.

I would appreciate any guidance on what I am missing here. If there's a better way to approach string interpolation in a lexer with Chumsky, that would be helpful to know, too! Assuming this overall approach is solid, I'm also wondering if there's a way my … Thanks in advance!
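To make the failure mode concrete, here is a std-only sketch (not chumsky; the names are hypothetical) of what a greedy token loop does inside an interpolation when `}` is itself a valid token: the loop consumes the very `}` that was supposed to close the interpolation, and then the closing `"` of the string as well.

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    RBrace,
    Quote,
}

// Naive token scan inside `#{ ... }`: any recognised token is consumed greedily.
fn naive_interp_tokens(mut s: &str) -> (Vec<Tok>, &str) {
    let mut toks = Vec::new();
    loop {
        if let Some(rest) = s.strip_prefix('}') {
            toks.push(Tok::RBrace); // `}` is a valid token, so it gets eaten!
            s = rest;
        } else if let Some(rest) = s.strip_prefix('"') {
            toks.push(Tok::Quote);
            s = rest;
        } else {
            break;
        }
    }
    (toks, s)
}

fn main() {
    // Lexing the interpolation body of `"#{}"`: the remaining input is `}"`.
    // The scanner consumes the `}` (and then the closing `"`), leaving nothing
    // to terminate either the interpolation or the string.
    let (toks, rest) = naive_interp_tokens("}\"");
    assert_eq!(toks, vec![Tok::RBrace, Tok::Quote]);
    assert_eq!(rest, "");
    println!("consumed: {toks:?}");
}
```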
The problem here is with the way you've designed the token parser. Let's take the simplest problematic example: `"#{}"`.

In theory, this should be parsed as `String([Literal(""), Interpolation([]), Literal("")])`, if I understand correctly.

The problem occurs when we hit the `#{`, which starts the interpolated expression. Here, we start looking for more tokens (since an interpolation just contains a list of tokens). What's the next character? Well, it's `}`. But because you've got your `rbrace` parser enabled, this successfully parses as a token, even though we shouldn't be finding any tokens in the interpolation!

This continues: the next character is a `"`, which means the start of a string token. Your parser therefore sees the `}` as an `RBrace` token inside the interpolation, and the final `"` as opening a new string that is never closed. Notice how this parse is incomplete. Your grammar contains an ambiguity, which causes the parser to get the wrong end of the stick.

There is a solution: make sure that a single `}` can never be lexed as a token on its own.

#![allow(dead_code)]
use chumsky::prelude::*;

#[derive(Clone, Debug)]
enum Token<'a> {
    Add,
    Ident(&'a str),
    String(Vec<StringPart<'a>>),
    // { ... }
    Block(Vec<Self>),
}

#[derive(Clone, Debug)]
enum StringPart<'a> {
    Literal(&'a str),
    Interpolation(Vec<Token<'a>>),
}

fn lexer<'a>() -> impl Parser<'a, &'a str, Vec<Token<'a>>, extra::Err<Rich<'a, char, SimpleSpan>>> {
    let mut token = Recursive::declare();
    let mut string = Recursive::declare();

    let add = just('+').to(Token::Add);
    let ident = text::ascii::ident().map(Token::Ident);
    let block = token.clone().padded()
        .repeated().collect()
        .delimited_by(just('{'), just('}'))
        .map(Token::Block);

    let string_part = choice((
        just("#{")
            .ignore_then(token.clone().padded().repeated().collect())
            .then_ignore(just('}'))
            .map(StringPart::Interpolation),
        any()
            .and_is(just('"').not())
            .and_is(just("#{").not())
            .repeated()
            .at_least(1)
            .to_slice()
            .map(StringPart::Literal),
    ));

    string.define(
        just('"')
            .ignore_then(string_part.repeated().collect())
            .then_ignore(just('"'))
            .map(Token::String),
    );

    token.define(choice((add, ident, string, block)));

    token
        .padded()
        .recover_with(skip_then_retry_until(any().ignored(), end()))
        .repeated()
        .collect()
}

fn lex(source: &str) {
    let source = source.trim();
    match lexer().parse(source).into_result() {
        Ok(tokens) => println!("{source}\n{tokens:?}\n"),
        Err(errors) => println!("{source}\n{errors:?}\n"),
    }
}

fn main() {
    lex(r#" "#); // empty input
    lex(r#" "" "#); // empty string
    lex(r#" "x" "#); // literal x
    lex(r#" x "#); // ident x
    lex(r##" "#{}" "##); // empty interpolation
    lex(r##" "#{}" x "##); // empty interpolation + ident x
    lex(r##" "#{x}" "##); // interpolation of ident x
    lex(r##" "x + #{x} + x" "##); // literal + interpolation of ident x + literal
    lex(r##" "x + #{x + x} + x" "##); // literal + interpolation of (ident x + ident x) + literal
    lex(r##" "#{x + x} + x + #{x + x}" "##); // interpolation of (ident x + ident x) + literal + interpolation of (ident x + ident x)
}

Notice how we've replaced the `lbrace` and `rbrace` parsers with a single `block` parser, so a `}` can never be lexed as a token on its own. I think this solution is spiritually compatible with your existing tree-driven approach.
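The principle behind the `block` fix can be sketched without chumsky: because `}` only ever appears as the closer of a matched pair, a scanner for an interpolation can track brace depth, so a `}` belonging to an inner block never terminates the interpolation prematurely. A minimal std-only sketch (the helper name is hypothetical):

```rust
// Splits an input beginning with "#{" into (expression, rest-of-input),
// tracking brace depth so nested `{ ... }` blocks inside the expression
// don't end the interpolation early.
fn split_interpolation(src: &str) -> Option<(&str, &str)> {
    let body = src.strip_prefix("#{")?;
    let mut depth = 0usize;
    for (i, c) in body.char_indices() {
        match c {
            '{' => depth += 1,
            '}' if depth == 0 => return Some((&body[..i], &body[i + 1..])),
            '}' => depth -= 1,
            _ => {}
        }
    }
    None // unterminated interpolation
}

fn main() {
    // The first `}` closes the inner block, not the interpolation itself.
    assert_eq!(
        split_interpolation("#{x + {y}}\" tail"),
        Some(("x + {y}", "\" tail"))
    );
    // An empty interpolation terminates at the very first `}`.
    assert_eq!(split_interpolation("#{}rest"), Some(("", "rest")));
    println!("ok");
}
```

This is exactly the structure that the `Block` token encodes declaratively: `delimited_by(just('{'), just('}'))` forces every `{` to pair with its own `}`, leaving the interpolation's `then_ignore(just('}'))` as the only consumer of the outer closing brace.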