
Conversation

@mefyl

@mefyl mefyl commented Mar 18, 2022

Feature

backtrack n is a parser that rewinds the input by n bytes, or fails if fewer than n client-uncommitted bytes are available. It is grouped with the expert, undocumented parsers, since it requires a good understanding of the input to be used safely.
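To illustrate, here is a minimal sketch of the combinator in use, assuming the backtrack : int -> unit t signature proposed in this PR (take and parse_string are existing Angstrom functions):

```ocaml
(* Sketch of the proposed combinator, assuming
   backtrack : int -> unit t from this PR. *)
open Angstrom

let p =
  take 4 >>= fun first ->   (* consume the first four bytes *)
  backtrack 2 >>= fun () -> (* rewind two uncommitted bytes *)
  take 2 >>= fun again ->   (* re-read the last two of them *)
  return (first, again)

(* parse_string ~consume:Prefix p "abcdef" would yield
   Ok ("abcd", "cd"): after reading "abcd", backtrack 2 rewinds to
   position 2, so the next take 2 re-reads "cd". *)
```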

I do understand that this feature may seem dangerous, yet I think it is legitimate and useful in some contexts (see rationale below). I would definitely understand if it's deemed too tricky to integrate upstream, I'm fine using our pinned version.

Rationale

We use Angstrom.scan to consume UTF-8 input. Thus, we must sometimes read multiple bytes to get an actual Unicode character. If we then try to reject that character by returning None, only the last byte is pushed back to the input, losing some information and probably leaving the input as malformed UTF-8. backtrack makes it possible to roll back the few additional bytes to the start of the actual UTF-8 character. It is safe to use in such a case, since we know there are at least that many uncommitted bytes to roll back.
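The failure mode can be sketched as follows (requires the angstrom library; scan_utf8 and its keep predicate are our illustration, not code from this PR). scan feeds the callback one byte at a time, so a multi-byte codepoint is accepted byte by byte before it can be judged as a whole:

```ocaml
open Angstrom

(* State: (bytes still expected for the current codepoint,
   scalar value accumulated so far). *)
let scan_utf8 keep =
  scan (0, 0) (fun (pending, acc) byte ->
      let b = Char.code byte in
      if pending = 0 then
        if b < 0x80 then (if keep b then Some (0, 0) else None)
        else if b land 0xE0 = 0xC0 then Some (1, b land 0x1F)
        else if b land 0xF0 = 0xE0 then Some (2, b land 0x0F)
        else if b land 0xF8 = 0xF0 then Some (3, b land 0x07)
        else None (* malformed lead byte *)
      else
        let acc = (acc lsl 6) lor (b land 0x3F) in
        if pending > 1 then Some (pending - 1, acc)
        else if keep acc then Some (0, 0)
        else
          (* Rejecting the completed codepoint here pushes back only
             this final byte; its lead bytes stay consumed. *)
          None)

let () =
  (* Keep codepoints below U+0100: "é" (U+00E9) passes, but "€"
     (U+20AC, encoded "\xE2\x82\xAC") is rejected only at its last
     byte, so the two stale lead bytes "\xE2\x82" remain consumed and
     the remaining input is malformed UTF-8.  The proposed backtrack
     could rewind those two bytes. *)
  match
    parse_string ~consume:Prefix
      (scan_utf8 (fun cp -> cp < 0x100) >>| fst)
      "ab\xC3\xA9\xE2\x82\xAC!"
  with
  | Ok consumed -> Printf.printf "consumed %d bytes\n" (String.length consumed)
  | Error e -> prerr_endline e
```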

@thedufer
Collaborator

I don't really understand the rationale for why you would want this. It sounds like Angstrom.scan simply isn't the function you should be using - wouldn't it make more sense to define a parser for a single codepoint and then use Angstrom.many? That way the backtracking just does the right thing for you.
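The suggested alternative might look like this (requires the angstrom library; codepoint and codepoint_if are our illustration). Since each codepoint gets its own parser, a rejection backtracks over the whole multi-byte sequence, not just its last byte:

```ocaml
open Angstrom

(* Parse one UTF-8 codepoint, returning its scalar value. *)
let codepoint =
  let continuation acc =
    any_char >>= fun c ->
    let c = Char.code c in
    if c land 0xC0 = 0x80 then return ((acc lsl 6) lor (c land 0x3F))
    else fail "invalid continuation byte"
  in
  any_char >>= fun lead ->
  let b = Char.code lead in
  if b < 0x80 then return b
  else if b land 0xE0 = 0xC0 then continuation (b land 0x1F)
  else if b land 0xF0 = 0xE0 then continuation (b land 0x0F) >>= continuation
  else if b land 0xF8 = 0xF0 then
    continuation (b land 0x07) >>= continuation >>= continuation
  else fail "invalid lead byte"

(* Accept a codepoint only if the predicate holds. *)
let codepoint_if p =
  codepoint >>= fun cp -> if p cp then return cp else fail "rejected"

(* many (codepoint_if (fun cp -> cp < 0x100)) applied to
   "ab\xC3\xA9\xE2\x82\xAC" stops cleanly before the three bytes of
   "€", yielding [0x61; 0x62; 0xE9]: the rejected character is unread
   in full. *)
```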

@mefyl
Author

mefyl commented May 30, 2022

Indeed, a codepoint parser combined with many could parse arbitrary UTF-8 input, but then we'd lose the nice features we use scan for: we have a list of possible entries to complete (e.g. emails), and with scan we consume the input while filtering the potentially matching completions, which is both more performant and more readable.
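The kind of use described above might be sketched like this (requires the angstrom library; the candidate list and byte-indexed filtering are our illustration, not code from this PR). scan threads the shrinking candidate set through the parse, narrowing it as each byte is consumed:

```ocaml
open Angstrom

(* State: (index of the next byte to match, candidates still viable). *)
let filter_completions candidates =
  scan (0, candidates) (fun (i, remaining) byte ->
      match
        List.filter (fun c -> String.length c > i && c.[i] = byte) remaining
      with
      | [] -> None (* no candidate matches this byte: stop scanning *)
      | remaining -> Some (i + 1, remaining))

(* parse_string ~consume:Prefix
     (filter_completions
        ["alice@example.org"; "alan@example.org"; "bob@example.org"])
     "al"
   consumes "al" and leaves the two candidates that still match in the
   final state. *)
```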

Following that argument, one could claim that scan is never useful at all: you could always combine Angstrom.char and Angstrom.many to achieve the same result. My point is that if scan is sometimes the right tool to parse Latin-1 input, it is probably the right tool to parse multibyte Unicode input too, although one could certainly do without it in both cases, using a sequence of character parsers and some state that carries over (which is exactly what scan provides).

In the end this parser is, I think, well defined, and having it together with the other "expert" parsers can't hurt. But I can understand if it's considered too tricky.

Note: the most correct answer might be for Angstrom to be able to interpret its input over different alphabets, be they plain bytes or Unicode characters, but that's a project-wide change.
