The space character is not properly parsed #633
I am confused about what the correct behavior should be here. The single quote character has no special meaning inside the s-expression parser, and spaces delimit unquoted symbols (i.e., symbols not quoted with double quotes). So the question is: what behavior do you want? Specifically, what are you trying to achieve? I.e., what do you want to do with the space?
Nil introduced
I would say we could allow using escaping.
I see.
Currently, escape evaluation only happens as part of string parsing (between double quotes). IMO, we can go one of two ways to fix this issue in general:
If we go this way,
I have a strong preference for 1 over 2, but I don't have a preference between 1A and 1B. What do you think?
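For context on "escape evaluation only happens as part of string parsing," here is a minimal sketch of how that usually works. This is a hypothetical illustration, not hyperon's actual parser code: escapes are interpreted only between the double quotes.

```python
def parse_string_literal(src, i):
    """Parse a double-quoted string starting at src[i] == '"'.

    Escape sequences are interpreted only here, between the quotes,
    mirroring the current behavior described above. Hypothetical helper,
    not the actual hyperon implementation.
    """
    assert src[i] == '"'
    i += 1
    out = []
    escapes = {'n': '\n', 't': '\t', '"': '"', '\\': '\\'}
    while i < len(src):
        c = src[i]
        if c == '\\':
            i += 1                              # consume the backslash
            out.append(escapes.get(src[i], src[i]))
        elif c == '"':
            return ''.join(out), i + 1          # value, index past closing quote
        else:
            out.append(c)
        i += 1
    raise SyntaxError("unterminated string literal")
```

Outside of this function, a backslash in the input stream has no special meaning, which is why `' '` and escaped spaces currently fall through to the plain symbol-splitting path.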
Trying to brainstorm: supporting common ways of quoting is a possible solution. On the other hand, program authors can invent unusual quoting. We could also say that we support only double quoting; if the author of a token needs spaces inside the token, then the token should be wrapped in double quotes. For instance, in this case the possible solution is to say that a character literal has the form
One possibility is to use some rare character to introduce universal quoting (we should probably support double quotes for strings anyway, as strings are very common), for instance a backtick.
We could allow escaping only for space characters between lexemes, like
This is almost equivalent to full escaping support in the code, but allows using backslashes without duplication (which doesn't have much value, I believe). @luketpeterson, I don't quite understand why you prefer introducing another kind of quoting over escaping inside the parser's input character stream. Could you please elaborate?
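The "escaping only between lexemes" idea above can be sketched as a small lexer in which a backslash makes the following character (typically a space) part of the current token. This is a hypothetical illustration of the proposal, not hyperon's lexer:

```python
def lex(src):
    """Split src into lexemes, honoring backslash-escaped characters.

    Sketch of the 'allow escaping only for spaces between lexemes' idea;
    hypothetical, not the actual hyperon implementation.
    """
    tokens, cur, i = [], [], 0
    while i < len(src):
        c = src[i]
        if c == '\\' and i + 1 < len(src):
            cur.append(src[i + 1])   # escaped char joins the current token
            i += 2
            continue
        if c.isspace():
            if cur:                  # unescaped whitespace ends a token
                tokens.append(''.join(cur))
                cur = []
        else:
            cur.append(c)
        i += 1
    if cur:
        tokens.append(''.join(cur))
    return tokens
```

Under this scheme `a\ b c` lexes as two tokens (`a b` and `c`) rather than three, which is the behavior being proposed for spaces inside tokens.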
They can... but only if they build their inputs from already decomposed tokens, which means working within the lexemes provided by the parser.
Yes, although I think this might get annoying for the program author. My thinking was that we could support two different quotes, double and single, and then the program author could choose which tokens would match each one. So the meaning is imposed by the Tokenizer entries, but there are two choices of quote type, so two different tokenizer entries could convert them to different atom types, say grounded chars and strings. If we want an even more general syntax at the parser level, we could support any character followed by a quote, which would give the parser even more flexibility. For example
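The two-quote idea above could be sketched as a dispatch table mapping each quote character to a different atom constructor, so the meaning of each quote type is imposed by registered entries rather than hard-coded in the parser. All names here are hypothetical illustrations, not hyperon's API:

```python
# Hypothetical mapping from quote character to atom constructor;
# in the proposal this would come from Tokenizer entries.
QUOTE_HANDLERS = {
    '"': lambda s: ('String', s),
    "'": lambda s: ('Char', s),
}

def read_quoted(src, i):
    """Read a quoted literal starting at src[i] and dispatch by quote char.

    Naive sketch: no escape handling, just find the matching close quote.
    """
    q = src[i]
    if q not in QUOTE_HANDLERS:
        raise SyntaxError(f"unknown quote character {q!r}")
    j = src.index(q, i + 1)
    return QUOTE_HANDLERS[q](src[i + 1:j]), j + 1
```

With such a scheme, `' '` would parse as a single char literal containing a space instead of two bare `'` symbols, which is the behavior the issue asks for.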
My feeling is that the single-quote and double-quote characters are pretty well understood, so deviating from them adds burden for users with a pretty small payoff.
I feel like allowing non-standard characters in symbols invites a lot of ambiguity and bug surface area. We don't define rules for legal symbol names, and maybe we should. But right now, the parser effectively limits how badly users can shoot themselves in the foot with crazy characters inside their symbols, and parsing escape sequences makes it harder to keep those problems out. Some of the issues at the top of my mind:
I guess my final question is the reverse: what is the use case for symbols that are allowed to contain all characters?
Thanks Luke, I didn't realize you were concerned about symbols. I agree with your opinion that there is no need to allow characters like space or parentheses inside a symbol name. On the other hand, I would allow such symbols inside tokens. Also, I think we could let users add additional quotation notation if they need it. Taking the last point into account, I think we could add processing of single quotes as a special case in the parser. But I think it is also worth trying to unify single quoting with the double quoting mechanism, allowing other kinds of quotes to be added in the future.
The trouble is, the current behavior is that a parsed
One possibility is a filter, where we say only a subset of characters is allowed in symbol atoms. The parser would give the tokenizer a chance to match the token, but if the tokenizer didn't match and the token contained illegal characters, that would lead to an error.
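The filter idea above can be sketched as a fallback check after the tokenizer entries have had their chance at a token. The character whitelist and entry format here are hypothetical illustrations, not hyperon's actual rules:

```python
import re

# Hypothetical whitelist of characters allowed in a plain symbol atom.
LEGAL_SYMBOL = re.compile(r"^[A-Za-z0-9_\-+*/<>=!?.:$&%]+$")

def token_to_atom(token, tokenizer_entries):
    """Convert a token to an atom.

    Registered tokenizer entries (regex, constructor) get first chance;
    we fall back to a symbol atom only if every character is legal.
    Sketch of the filter proposal, not the actual implementation.
    """
    for pattern, constructor in tokenizer_entries:
        if re.fullmatch(pattern, token):
            return constructor(token)
    if LEGAL_SYMBOL.match(token):
        return ('Symbol', token)
    raise SyntaxError(f"illegal characters in symbol: {token!r}")
```

This keeps the tokenizer fully in charge of exotic tokens while making "a symbol containing a space" an explicit error rather than a silently accepted atom.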
Agreed. I think the same code path can be used for multiple types of quotes, giving a lot of flexibility to the tokenizer on what to do with them, possibly even making it a parser parameter. I will see if this can be easily added without a lot of code change.
Using
That said, I think I am somewhat agnostic about which is the best way to go.
The issue is that the parser has no concept of literals except for strings. The parser constructs the AST without any knowledge of the tokenizer, and then certain syntax nodes are converted to atoms using the tokenizer. So matching something more complicated like that requires introducing custom tokens at the parser level. This might be a reasonable design in the abstract, but at the very least it would require a tokenizer redesign (which should probably happen regardless), and that's a fair bit of work, especially given that the tokenizer is effectively the mechanism for accessing operations and runtime state. These are all reasonable changes and would bring the design of MeTTa more in line with a traditional language compiler/interpreter, but it's more work than I think we should take on right now.
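The two-phase design described above (parse first with no tokenizer knowledge, then convert syntax nodes to atoms) can be sketched as follows. This is a toy illustration of the architecture, not hyperon's parser:

```python
def parse_sexpr(src):
    """Phase 1: build a raw AST of nested lists and word strings,
    with no knowledge of the tokenizer. Toy sketch."""
    tokens = src.replace('(', ' ( ').replace(')', ' ) ').split()

    def read(pos):
        if tokens[pos] == '(':
            node, pos = [], pos + 1
            while tokens[pos] != ')':
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1
        return tokens[pos], pos + 1

    return read(0)[0]

def to_atoms(node, tokenizer):
    """Phase 2: convert leaf syntax nodes to atoms via the tokenizer."""
    if isinstance(node, list):
        return [to_atoms(n, tokenizer) for n in node]
    return tokenizer(node)
```

Because phase 1 never consults the tokenizer, any syntax that must be recognized before whitespace splitting (such as quoted character literals) has to be handled at the parser level, which is the difficulty described above.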
This scares me a bit. Just about every other language is restrictive about what characters are allowed in symbol names, for good reasons. If MeTTa is just a framework accessed programmatically, then the bug surface area is minimal, but as a language with a syntax, the chances for unfortunate interactions are much higher.
I would say the collection of tokens is a collection of literals. Thus the parser does have a concept of literals, but the set of literals can be extended by the user.
Haskell (and I believe Scheme) allow using
What I mean is that the parser makes no distinction between a literal and something that becomes a symbol atom, except in the special case of string literals (which currently also become symbols). This conversation has me thinking I should fully fold the Tokenizer into the parser, and tackle #409 as part of this issue. To keep some semblance of sanity, I'll define a default tokenizer that more or less preserves the current MeTTa syntax. Extending the syntax will always be an "at your own risk" feature; the fact that the syntax is fluid may create integration issues for other tooling that expects a stable syntax, e.g. editors, debuggers, source-control and merge tools, etc.
What is your problem?
The space character, represented as `' '` in MeTTa, is not properly parsed.
How to reproduce your problem?
Run the following
What would you normally expect?
What do you get instead?
What do you have to say?
I think the problem is that `' '` is understood as two separate `'`. This is consistent with the fact that outputs `[2]`. I had a look at the definition of `type_tokens` in `stdlib.py`, but I cannot see what could be wrong; the problem might be buried inside the Rust parser.