Add backtrack parser. #220

Open
wants to merge 1 commit into
base: master

@mefyl mefyl commented Mar 18, 2022

Feature

backtrack n is a parser that rewinds the input by n bytes, or fails if fewer than n client-uncommitted bytes are available. It is grouped with the other expert, undocumented parsers, since using it safely requires a good understanding of the input.

I understand that this feature may seem dangerous, yet I think it is legitimate and useful in some contexts (see the rationale below). I would definitely understand if it's deemed too tricky to integrate upstream; I'm fine using our pinned version.
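To make the intended semantics concrete, here is a minimal sketch in plain OCaml. It does not use Angstrom or reflect its internals: the `cursor` type and this standalone `backtrack` function are hypothetical, and rejecting a negative count is an assumption the description does not state.

```ocaml
(* Toy cursor over an input buffer: bytes before [committed] can no longer
   be revisited, and [pos] is the current read position. Hypothetical type,
   for illustration only. *)
type cursor = { committed : int; pos : int }

(* Rewind [cur] by [n] bytes, failing if that would cross the committed
   boundary. Rejecting a negative [n] is an assumption of this sketch. *)
let backtrack n cur =
  if n < 0 then Error "backtrack: negative count"
  else if cur.pos - n < cur.committed then
    Error "backtrack: not enough uncommitted bytes"
  else Ok { cur with pos = cur.pos - n }
```

With `{ committed = 3; pos = 6 }`, rewinding by 2 succeeds, while rewinding by 4 fails because only 3 uncommitted bytes are available.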

Rationale

We use Angstrom.scan to consume UTF-8 input, so we must sometimes read multiple bytes to get an actual Unicode character. If we then try to reject that character by returning None, only the last byte is pushed back to the input, which loses information and probably leaves the remaining input as malformed UTF-8. backtrack makes it possible to roll back the few additional bytes to the start of the actual UTF-8 character. It is safe to use in such a case, since we know there are at least that many uncommitted bytes to roll back.
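The problem can be reproduced with a small model in plain OCaml. This is a sketch under stated assumptions, not Angstrom's implementation: `scan_model` is a hypothetical stand-in that only mimics the behaviour described above (bytes are fed to a step function until it returns None, and the byte that triggered None is left unconsumed). The `state` type and `step` function are also illustrative names.

```ocaml
(* Length of a UTF-8 sequence judged from its lead byte. Invalid lead
   bytes are treated as single bytes to keep the sketch short. *)
let utf8_length c =
  let b = Char.code c in
  if b land 0xE0 = 0xC0 then 2
  else if b land 0xF0 = 0xE0 then 3
  else if b land 0xF8 = 0xF0 then 4
  else 1

(* Bytes of the character being assembled, and its expected length. *)
type state = { pending : string; expected : int }

let initial = { pending = ""; expected = 0 }

(* Byte-wise step function: accumulate one character, then reject it
   (return None) if [reject] says so. *)
let step reject st c =
  let st =
    if st.pending = "" then
      { pending = String.make 1 c; expected = utf8_length c }
    else { st with pending = st.pending ^ String.make 1 c }
  in
  if String.length st.pending < st.expected then Some st
  else if reject st.pending then None
  else Some initial

(* Hypothetical model of a byte-wise scan: returns the position of the
   first unconsumed byte and the last state before None. *)
let scan_model init f input =
  let rec go pos st =
    if pos >= String.length input then (pos, st)
    else
      match f st input.[pos] with
      | Some st' -> go (pos + 1) st'
      | None -> (pos, st)
  in
  go 0 init
```

Scanning `"ab\xc3\xa9c"` while rejecting `"\xc3\xa9"` (é) stops at position 3, which is the continuation byte 0xA9 in the middle of the character. The last state still holds one pending byte, so rewinding by `String.length st.pending` lands on position 2, the start of the character. That extra rewind is what backtrack is meant to provide.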

@thedufer
Collaborator

I don't really understand the rationale for why you would want this. It sounds like Angstrom.scan simply isn't the function you should be using - wouldn't it make more sense to define a parser for a single codepoint and then use Angstrom.many? That way the backtracking just does the right thing for you.
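The suggested alternative might look roughly like the sketch below. It assumes the Angstrom library is available; `utf8_length`, `codepoint`, `accepted_char`, and `accepted_run` are illustrative names, not part of Angstrom, and continuation bytes are not validated here.

```ocaml
open Angstrom

(* Length of a UTF-8 sequence judged from its lead byte, or None if the
   byte cannot start a sequence. *)
let utf8_length c =
  let b = Char.code c in
  if b < 0x80 then Some 1
  else if b land 0xE0 = 0xC0 then Some 2
  else if b land 0xF0 = 0xE0 then Some 3
  else if b land 0xF8 = 0xF0 then Some 4
  else None

(* Parse the raw bytes of one UTF-8 encoded character. *)
let codepoint : string t =
  any_char >>= fun lead ->
  match utf8_length lead with
  | None -> fail "invalid UTF-8 lead byte"
  | Some n -> take (n - 1) >>| fun rest -> String.make 1 lead ^ rest

(* One character that satisfies [accept]; fails otherwise. *)
let accepted_char accept =
  codepoint >>= fun c -> if accept c then return c else fail "rejected"

(* A run of accepted characters. When [accepted_char] fails, [many]
   resumes from the position before that character. *)
let accepted_run accept = many (accepted_char accept) >>| String.concat ""
```

Because the rejection happens at the granularity of a whole character, the input after `accepted_run` starts at a character boundary without any explicit rewind.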

@mefyl
Author

mefyl commented May 30, 2022

Indeed, a codepoint parser combined with many could parse arbitrary UTF-8 input, but then we'd lose the nice features we use scan for: we have a list of possible entries to complete (e.g. emails), and with scan we consume the input while filtering the potentially matching completions in a much more performant and readable manner.
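As a rough illustration of that pattern (not our actual code), the sketch below filters a candidate list one byte at a time. The `step` function has the `'state -> char -> 'state option` shape that Angstrom.scan expects for its callback; `run` is a hypothetical stand-in driver so the example works without Angstrom, and it is byte-wise only, with no UTF-8 handling.

```ocaml
(* State carried across bytes: how many bytes have matched so far, and
   which candidates are still compatible with them. *)
type state = { index : int; candidates : string list }

(* Keep only candidates whose byte at [index] equals [c]; stop (None)
   as soon as no candidate remains. *)
let step st c =
  let matches s = st.index < String.length s && s.[st.index] = c in
  match List.filter matches st.candidates with
  | [] -> None
  | remaining -> Some { index = st.index + 1; candidates = remaining }

(* Hypothetical driver standing in for the parser: feed bytes to [step]
   until it returns None or the input ends. *)
let run candidates input =
  let rec go st pos =
    if pos >= String.length input then st
    else
      match step st input.[pos] with
      | Some st' -> go st' (pos + 1)
      | None -> st
  in
  go { index = 0; candidates } 0
```

For the candidates `alice@example.com`, `alan@example.com`, and `bob@example.com`, the input `al` leaves the first two candidates, and `ali` leaves only the first.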

By that argument, one could say that scan is never actually useful: you could always use Angstrom.char and Angstrom.many to achieve the same result. My point is that if scan is sometimes the right tool to parse Latin-1 input, it is probably also the right tool to parse multibyte Unicode input, although one could definitely do without it in both cases, using a character-sequence parser and some state that carries over (which is exactly what scan provides).

In the end this parser is, I think, well defined, and having it together with the other "expert" parsers can't hurt. But I can understand if it's considered too tricky.

Note: In the end, the most correct answer might be that Angstrom should be able to interpret the input using different alphabets, be it plain bytes or Unicode characters, but that's a project-wide change.
