
Conversation

@mefyl

@mefyl mefyl commented Mar 18, 2022

Feature

backtrack n is a parser that rewinds the input by n bytes, or fails if fewer than n client-uncommitted bytes are available. It is grouped with the expert, undocumented parsers, since it requires a good understanding of the input to be used safely.
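To illustrate, here is a minimal sketch of the combinator in use, assuming the backtrack : int -> unit t signature proposed in this PR (take and parse_string are existing Angstrom functions):

```ocaml
(* Sketch of the proposed combinator, assuming
   backtrack : int -> unit t from this PR. *)
open Angstrom

let p =
  take 4 >>= fun first ->   (* consume the first four bytes *)
  backtrack 2 >>= fun () -> (* rewind two uncommitted bytes *)
  take 2 >>= fun again ->   (* re-read the last two of them *)
  return (first, again)

(* parse_string ~consume:Prefix p "abcdef" would yield
   Ok ("abcd", "cd"): after reading "abcd", backtrack 2 rewinds to
   position 2, so the next take 2 re-reads "cd". *)
```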

I do understand that this feature may seem dangerous, yet I think it is legitimate and useful in some contexts (see rationale below). I would definitely understand if it's deemed too tricky to integrate upstream, I'm fine using our pinned version.

Rationale

We use Angstrom.scan to consume UTF-8 input. Thus, we must sometimes read multiple bytes to get an actual Unicode character. If we then try to reject that character by returning None, only the last byte is pushed back to the input, losing some information and probably leaving the input as malformed UTF-8. backtrack makes it possible to roll back the few additional bytes to the start of the actual UTF-8 character. It is safe to use in such a case, since we know there are at least that many uncommitted bytes to roll back.
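The failure mode can be sketched as follows (requires the angstrom library; scan_utf8 and its keep predicate are our illustration, not code from this PR). scan feeds the callback one byte at a time, so a multi-byte codepoint is accepted byte by byte before it can be judged as a whole:

```ocaml
open Angstrom

(* State: (bytes still expected for the current codepoint,
   scalar value accumulated so far). *)
let scan_utf8 keep =
  scan (0, 0) (fun (pending, acc) byte ->
      let b = Char.code byte in
      if pending = 0 then
        if b < 0x80 then (if keep b then Some (0, 0) else None)
        else if b land 0xE0 = 0xC0 then Some (1, b land 0x1F)
        else if b land 0xF0 = 0xE0 then Some (2, b land 0x0F)
        else if b land 0xF8 = 0xF0 then Some (3, b land 0x07)
        else None (* malformed lead byte *)
      else
        let acc = (acc lsl 6) lor (b land 0x3F) in
        if pending > 1 then Some (pending - 1, acc)
        else if keep acc then Some (0, 0)
        else
          (* Rejecting the completed codepoint here pushes back only
             this final byte; its lead bytes stay consumed. *)
          None)

let () =
  (* Keep codepoints below U+0100: "é" (U+00E9) passes, but "€"
     (U+20AC, encoded "\xE2\x82\xAC") is rejected only at its last
     byte, so the two stale lead bytes "\xE2\x82" remain consumed and
     the remaining input is malformed UTF-8.  The proposed backtrack
     could rewind those two bytes. *)
  match
    parse_string ~consume:Prefix
      (scan_utf8 (fun cp -> cp < 0x100) >>| fst)
      "ab\xC3\xA9\xE2\x82\xAC!"
  with
  | Ok consumed -> Printf.printf "consumed %d bytes\n" (String.length consumed)
  | Error e -> prerr_endline e
```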

@thedufer
Collaborator

I don't really understand the rationale for why you would want this. It sounds like Angstrom.scan simply isn't the function you should be using - wouldn't it make more sense to define a parser for a single codepoint and then use Angstrom.many? That way the backtracking just does the right thing for you.
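The suggested alternative might look like this (requires the angstrom library; codepoint and codepoint_if are our illustration). Since each codepoint gets its own parser, a rejection backtracks over the whole multi-byte sequence, not just its last byte:

```ocaml
open Angstrom

(* Parse one UTF-8 codepoint, returning its scalar value. *)
let codepoint =
  let continuation acc =
    any_char >>= fun c ->
    let c = Char.code c in
    if c land 0xC0 = 0x80 then return ((acc lsl 6) lor (c land 0x3F))
    else fail "invalid continuation byte"
  in
  any_char >>= fun lead ->
  let b = Char.code lead in
  if b < 0x80 then return b
  else if b land 0xE0 = 0xC0 then continuation (b land 0x1F)
  else if b land 0xF0 = 0xE0 then continuation (b land 0x0F) >>= continuation
  else if b land 0xF8 = 0xF0 then
    continuation (b land 0x07) >>= continuation >>= continuation
  else fail "invalid lead byte"

(* Accept a codepoint only if the predicate holds. *)
let codepoint_if p =
  codepoint >>= fun cp -> if p cp then return cp else fail "rejected"

(* many (codepoint_if (fun cp -> cp < 0x100)) applied to
   "ab\xC3\xA9\xE2\x82\xAC" stops cleanly before the three bytes of
   "€", yielding [0x61; 0x62; 0xE9]: the rejected character is unread
   in full. *)
```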

@mefyl
Author

mefyl commented May 30, 2022

Indeed, a codepoint parser combined with many could parse arbitrary UTF-8 input, but then we'd lose the nice features we use scan for: we have a list of possible entries to complete (e.g. emails), and with scan we consume the input while filtering the potentially matching completions, which is both more performant and more readable.
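The kind of use described above might be sketched like this (requires the angstrom library; the candidate list and byte-indexed filtering are our illustration, not code from this PR). scan threads the shrinking candidate set through the parse, narrowing it as each byte is consumed:

```ocaml
open Angstrom

(* State: (index of the next byte to match, candidates still viable). *)
let filter_completions candidates =
  scan (0, candidates) (fun (i, remaining) byte ->
      match
        List.filter (fun c -> String.length c > i && c.[i] = byte) remaining
      with
      | [] -> None (* no candidate matches this byte: stop scanning *)
      | remaining -> Some (i + 1, remaining))

(* parse_string ~consume:Prefix
     (filter_completions
        ["alice@example.org"; "alan@example.org"; "bob@example.org"])
     "al"
   consumes "al" and leaves the two candidates that still match in the
   final state. *)
```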

Following that argument, one could claim that scan is never useful at all: you could always combine Angstrom.char and Angstrom.many to achieve the same result. My point is that if scan is sometimes the right tool to parse Latin-1 input, it is probably the right tool to parse multibyte Unicode input too, although one could certainly do without it in both cases, using a sequence of character parsers and some state that carries over (which is exactly what scan provides).

In the end this parser is, I think, well defined, and having it together with the other "expert" parsers can't hurt. But I can understand if it's considered too tricky.

Note: the most correct answer might be for Angstrom to be able to interpret its input over different alphabets, be they plain bytes or Unicode characters, but that's a project-wide change.
