From 3d83120a2c00a1fbf879904f4fb21333cb0e112d Mon Sep 17 00:00:00 2001
From: Irit Katriel
Date: Tue, 8 Oct 2024 16:44:26 +0100
Subject: [PATCH 01/12] gh-119786: Move parser doc from devguide to InternalDocs

---
 InternalDocs/README.md |   2 +
 InternalDocs/parser.md | 993 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 995 insertions(+)
 create mode 100644 InternalDocs/parser.md

diff --git a/InternalDocs/README.md b/InternalDocs/README.md
index 95181a420f1dfb..7c98bf7648c2d6 100644
--- a/InternalDocs/README.md
+++ b/InternalDocs/README.md
@@ -12,6 +12,8 @@ it is not, please report that through the
[issue tracker](https://github.com/python/cpython/issues).

+[Python's Parser](parser.md)
+
[Compiler Design](compiler.md)

[Frames](frames.md)
diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md
new file mode 100644
index 00000000000000..c63ebf5bac0476
--- /dev/null
+++ b/InternalDocs/parser.md
@@ -0,0 +1,993 @@
+
Guide to the parser
===================

Abstract
--------

Python's Parser is currently a
[``PEG`` (Parsing Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
parser. It was introduced in
[PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace
the original [``LL(1)``](https://en.wikipedia.org/wiki/LL_parser) parser.

The code implementing the parser is generated from a grammar definition by a
[parser generator](https://en.wikipedia.org/wiki/Compiler-compiler).
Therefore, changes to the Python language are made by modifying the
[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram).
Developers rarely need to modify the generator itself.

See Also:

- [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar)
  for a detailed description of the grammar.

How PEG parsers work
====================
#how-peg-parsers-work

A PEG (Parsing Expression Grammar) grammar differs from a context-free grammar
in that the way it is written more closely reflects how the parser will operate
when parsing. The fundamental technical difference is that the choice operator
is ordered. This means that when writing::

> rule: A | B | C

a parser that implements a context-free-grammar (such as an ``LL(1)`` parser) will
generate constructions that, given an input string, *deduce* which alternative
(``A``, ``B`` or ``C``) must be expanded. On the other hand, a PEG parser will
check each alternative, in the order in which they are specified, and select
the first one that succeeds.

This means that in a PEG grammar, the choice operator is not commutative.
Furthermore, unlike context-free grammars, the derivation according to a
PEG grammar cannot be ambiguous: if a string parses, it has exactly one
valid parse tree.

A PEG parser is usually constructed as a recursive descent parser in which every
rule in the grammar corresponds to a function in the program implementing the
parser, and the parsing expression (the "expansion" or "definition" of the rule)
represents the "code" in said function. Each parsing function conceptually takes
an input string as its argument, and yields one of the following results:

* A "success" result. This result indicates that the expression can be parsed by
  that rule and the function may optionally move forward or consume one or more
  characters of the input string supplied to it.
* A "failure" result, in which case no input is consumed.
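To make this concrete, the sketch below shows roughly what such a rule function
can look like. This is a hand-written, hypothetical illustration in Python
(pegen's generated code looks different); the rule ``greeting: 'hi' | 'hello'``
and the function name are invented for the example:

```
# Hypothetical rule function for:  greeting: 'hi' | 'hello'
# Success returns the parsed node and the new position; failure returns
# None and consumes no input.

def parse_greeting(text: str, pos: int):
    for alternative in ("hi", "hello"):       # ordered choice: 'hi' is tried first
        if text.startswith(alternative, pos):
            return alternative, pos + len(alternative)   # success: input consumed
    return None                                          # failure: nothing consumed

print(parse_greeting("hello world", 0))   # ('hello', 5)
print(parse_greeting("goodbye", 0))       # None
```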
Note that "failure" results do not imply that the program is incorrect, nor do
they necessarily mean that the parsing has failed. Since the choice operator is
ordered, a failure very often merely indicates "try the following option". A
direct implementation of a PEG parser as a recursive descent parser will exhibit
exponential time performance in the worst case, because PEG parsers have
infinite lookahead (this means that they can consider an arbitrary number of
tokens before deciding on a rule). Usually, PEG parsers avoid this exponential
time complexity with a technique called
["packrat parsing"](https://pdos.csail.mit.edu/~baford/packrat/thesis/)
which not only loads the entire program in memory before parsing it but also
allows the parser to backtrack arbitrarily. This is made efficient by memoizing
the rules already matched for each position. The cost of the memoization cache
is that the parser will naturally use more memory than a simple ``LL(1)`` parser,
which is normally table-based.


Key ideas
---------

---
**Important**

Don't try to reason about a PEG grammar in the same way you would with an EBNF
or context-free grammar. PEG is optimized to describe **how** input strings will
be parsed, while context-free grammars are optimized to generate strings of the
language they describe (in EBNF, to know whether a given string is in the
language, you need to do work to find out as it is not immediately obvious from
the grammar).

---

- Alternatives are ordered ( ``A | B`` is not the same as ``B | A`` ).
- If a rule returns a failure, it doesn't mean that the parsing has failed,
  it just means "try something else".
- By default PEG parsers run in exponential time, which can be optimized to linear by
  using memoization.
- If parsing fails completely (no rule succeeds in parsing all the input text), the
  PEG parser doesn't have a concept of "where the
  [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".


Consequences of the ordered choice operator
-------------------------------------------
#order-consequences

Although PEG may look like EBNF, its meaning is quite different. The fact
that the alternatives are ordered in a PEG grammar (which is at the core of
how PEG parsers work) has deep consequences beyond removing ambiguity.

If a rule has two alternatives and the first of them succeeds, the second one is
**not** attempted even if the caller rule fails to parse the rest of the input.
Thus the parser is said to be "eager". To illustrate this, consider
the following two rules (in these examples, a token is an individual character):

> first_rule: ( 'a' | 'aa' ) 'a'
> second_rule: ('aa' | 'a' ) 'a'

In a regular EBNF grammar, both rules specify the language ``{aa, aaa}`` but
in PEG, one of these two rules accepts the string ``aaa`` but not the string
``aa``. The other does the opposite -- it accepts the string ``aa``
but not the string ``aaa``. The rule ``('a'|'aa')'a'`` does
not accept ``aaa`` because ``'a'|'aa'`` consumes the first ``a``, letting the
final ``a`` in the rule consume the second, and leaving the third ``a``
unconsumed. As the rule has succeeded, no attempt is ever made to go back and
let ``'a'|'aa'`` try the second alternative. The expression ``('aa'|'a')'a'`` does
not accept ``aa`` because ``'aa'|'a'`` accepts all of ``aa``, leaving nothing
for the final ``a``. Again, the second alternative of ``'aa'|'a'`` is not
tried.
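The eagerness is easy to reproduce by hand-coding the two rules above in the
recursive descent style sketched earlier. The following Python snippet is an
illustration only (the helper ``alt`` and both functions are invented for this
example):

```
def alt(text, pos, *alternatives):
    """Ordered choice: return the new position for the first alternative
    that matches at pos, or None if none of them match."""
    for a in alternatives:
        if text.startswith(a, pos):
            return pos + len(a)
    return None

def first_rule(text):            # ( 'a' | 'aa' ) 'a'
    pos = alt(text, 0, "a", "aa")
    if pos is None:
        return False
    pos = alt(text, pos, "a")
    return pos == len(text)      # accept only if all input was consumed

def second_rule(text):           # ( 'aa' | 'a' ) 'a'
    pos = alt(text, 0, "aa", "a")
    if pos is None:
        return False
    pos = alt(text, pos, "a")
    return pos == len(text)

print(first_rule("aa"), first_rule("aaa"))     # True False
print(second_rule("aa"), second_rule("aaa"))   # False True
```

Because ``alt`` commits to the first alternative that matches and never
reconsiders, ``first_rule`` can never reach the third ``a`` of ``aaa``,
exactly as described above.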
+ +--- +**Caution** + +The effects of ordered choice, such as the ones illustrated above, may be +hidden by many levels of rules. + +--- + +For this reason, writing rules where an alternative is contained in the next +one is in almost all cases a mistake, for example: + +> my_rule: +> | 'if' expression 'then' block +> | 'if' expression 'then' block 'else' block + +In this example, the second alternative will never be tried because the first one will +succeed first (even if the input string has an ``'else' block`` that follows). To correctly +write this rule you can simply alter the order: + +> my_rule: +> | 'if' expression 'then' block 'else' block +> | 'if' expression 'then' block + +In this case, if the input string doesn't have an ``'else' block``, the first alternative +will fail and the second will be attempted. + +Grammar Syntax +============== + +The grammar consists of a sequence of rules of the form: + +> rule_name: expression + +Optionally, a type can be included right after the rule name, which +specifies the return type of the C or Python function corresponding to +the rule: + +> rule_name[return_type]: expression + +If the return type is omitted, then a ``void *`` is returned in C and an +``Any`` in Python. + +Grammar expressions +------------------- + +``# comment`` +^^^^^^^^^^^^^ + +Python-style comments. + +``e1 e2`` +^^^^^^^^^ + +Match ``e1``, then match ``e2``. + +> rule_name: first_rule second_rule + +``e1 | e2`` +^^^^^^^^^^^ + +Match ``e1`` or ``e2``. + +The first alternative can also appear on the line after the rule name +for formatting purposes. In that case, a \| must be used before the +first alternative, like so: + +> rule_name[return_type]: +> | first_alt +> | second_alt + +``( e )`` +^^^^^^^^^ + +Match ``e``. This is the grouping operator. + +> rule_name: (e) + +A slightly more complex and useful example includes using the grouping +operator together with the repeat operator: + +> rule_name: (e1 e2)* + +``[ e ] or e?`` +^^^^^^^^^^^^^^^ + +Optionally match ``e``. + +> rule_name: [e] + +A more useful example includes defining that a trailing comma is +optional: + +> rule_name: e (',' e)* [','] + +``e*`` +^^^^^^ + +Match zero or more occurrences of ``e``. + +> rule_name: (e1 e2)* + +``e+`` +^^^^^^ + +Match one or more occurrences of ``e``. + +> rule_name: (e1 e2)+ + +``s.e+`` +^^^^^^^^ + +Match one or more occurrences of ``e``, separated by ``s``. The generated parse +tree does not include the separator. This is otherwise identical to +``(e (s e)*)``. + +> rule_name: ','.e+ + +``&e`` +^^^^^^ +#peg-positive-lookahead + +Succeed if ``e`` can be parsed, without consuming any input. + +``!e`` +^^^^^^ +#peg-negative-lookahead + +Fail if ``e`` can be parsed, without consuming any input. + +An example taken from the Python grammar specifies that a primary +consists of an atom, which is not followed by a ``.`` or a ``(`` or a +``[``: + +> primary: atom !'.' !'(' !'[' + +``~`` +^^^^^ + +Commit to the current alternative, even if it fails to parse (this is called +the "cut"). + +> rule_name: '(' ~ some_rule ')' | some_alt + +In this example, if a left parenthesis is parsed, then the other +alternative won’t be considered, even if some_rule or ``)`` fail to be +parsed. + +Left recursion +-------------- + +PEG parsers normally do not support left recursion, but CPython's parser +generator implements a technique similar to the one described in +[Medeiros et al.](https://arxiv.org/pdf/1207.0443) but using the memoization +cache instead of static variables. 
This approach is closer to the one described
in [Warth et al.](http://web.cs.ucla.edu/~todd/research/pepm08.pdf). This
allows us to write not only simple left-recursive rules but also more
complicated rules that involve indirect left-recursion like:

> rule1: rule2 | 'a'
> rule2: rule3 | 'b'
> rule3: rule1 | 'c'

and "hidden left-recursion" like:

> rule: 'optional'? rule '@' some_other_rule

Variables in the grammar
------------------------

A sub-expression can be named by preceding it with an identifier and an
``=`` sign. The name can then be used in the action (see below), like this:

> rule_name[return_type]: '(' a=some_other_rule ')' { a }

Grammar actions
---------------
#peg-grammar-actions

To avoid the intermediate steps that obscure the relationship between the
grammar and the AST generation, the PEG parser allows directly generating AST
nodes for a rule via grammar actions. Grammar actions are language-specific
expressions that are evaluated when a grammar rule is successfully parsed. These
expressions can be written in Python or C depending on the desired output of the
parser generator. This means that if one wants to generate a parser in
Python and another in C, two grammar files must be written, each with a
different set of actions, keeping everything else identical in both files
apart from said actions. As an example of a grammar with Python actions, the
piece of the parser generator that parses grammar files is bootstrapped from a
meta-grammar file with Python actions that generate the grammar tree as a result
of the parsing.

In the specific case of the PEG grammar for Python, having actions allows
directly describing how the AST is composed in the grammar itself, making it
clearer and more maintainable. This AST generation process is supported by the use
of some helper functions that factor out common AST object manipulations and
some other required operations that are not directly related to the grammar.

To indicate these actions each alternative can be followed by the action code
inside curly-braces, which specifies the return value of the alternative:

> rule_name[return_type]:
> | first_alt1 first_alt2 { first_alt1 }
> | second_alt1 second_alt2 { second_alt1 }

If the action is omitted, a default action is generated:

- If there is a single name in the rule, it gets returned.
- If there are multiple names in the rule, a collection with all parsed
  expressions gets returned (the type of the collection will be different
  in C and Python).

This default behaviour is intended primarily for very simple situations and for
debugging purposes.

---
**Warning**

It's important that the actions don't mutate any AST nodes that are passed
into them via variables referring to other rules. The reason mutation is
not allowed is that the AST nodes are cached by memoization and could
potentially be reused in a different context, where the mutation would be
invalid. If an action needs to change an AST node, it should instead make a
new copy of the node and change that.
+ +--- + +The full meta-grammar for the grammars supported by the PEG generator is: + +``` +{ + start[Grammar]: grammar ENDMARKER { grammar } + + grammar[Grammar]: + | metas rules { Grammar(rules, metas) } + | rules { Grammar(rules, []) } + + metas[MetaList]: + | meta metas { [meta] + metas } + | meta { [meta] } + + meta[MetaTuple]: + | "@" NAME NEWLINE { (name.string, None) } + | "@" a=NAME b=NAME NEWLINE { (a.string, b.string) } + | "@" NAME STRING NEWLINE { (name.string, literal_eval(string.string)) } + + rules[RuleList]: + | rule rules { [rule] + rules } + | rule { [rule] } + + rule[Rule]: + | rulename ":" alts NEWLINE INDENT more_alts DEDENT { + Rule(rulename[0], rulename[1], Rhs(alts.alts + more_alts.alts)) } + | rulename ":" NEWLINE INDENT more_alts DEDENT { Rule(rulename[0], rulename[1], more_alts) } + | rulename ":" alts NEWLINE { Rule(rulename[0], rulename[1], alts) } + + rulename[RuleName]: + | NAME '[' type=NAME '*' ']' {(name.string, type.string+"*")} + | NAME '[' type=NAME ']' {(name.string, type.string)} + | NAME {(name.string, None)} + + alts[Rhs]: + | alt "|" alts { Rhs([alt] + alts.alts)} + | alt { Rhs([alt]) } + + more_alts[Rhs]: + | "|" alts NEWLINE more_alts { Rhs(alts.alts + more_alts.alts) } + | "|" alts NEWLINE { Rhs(alts.alts) } + + alt[Alt]: + | items '$' action { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=action) } + | items '$' { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=None) } + | items action { Alt(items, action=action) } + | items { Alt(items, action=None) } + + items[NamedItemList]: + | named_item items { [named_item] + items } + | named_item { [named_item] } + + named_item[NamedItem]: + | NAME '=' ~ item {NamedItem(name.string, item)} + | item {NamedItem(None, item)} + | it=lookahead {NamedItem(None, it)} + + lookahead[LookaheadOrCut]: + | '&' ~ atom {PositiveLookahead(atom)} + | '!' ~ atom {NegativeLookahead(atom)} + | '~' {Cut()} + + item[Item]: + | '[' ~ alts ']' {Opt(alts)} + | atom '?' {Opt(atom)} + | atom '*' {Repeat0(atom)} + | atom '+' {Repeat1(atom)} + | sep=atom '.' node=atom '+' {Gather(sep, node)} + | atom {atom} + + atom[Plain]: + | '(' ~ alts ')' {Group(alts)} + | NAME {NameLeaf(name.string) } + | STRING {StringLeaf(string.string)} + + # Mini-grammar for the actions + + action[str]: "{" ~ target_atoms "}" { target_atoms } + + target_atoms[str]: + | target_atom target_atoms { target_atom + " " + target_atoms } + | target_atom { target_atom } + + target_atom[str]: + | "{" ~ target_atoms "}" { "{" + target_atoms + "}" } + | NAME { name.string } + | NUMBER { number.string } + | STRING { string.string } + | "?" { "?" 
}
    | ":" { ":" }
}
```

As an illustrative example, this simple grammar file allows directly
generating a full parser that can parse simple arithmetic expressions and that
returns a valid C-based Python AST:

```
{
  start[mod_ty]: a=expr_stmt* ENDMARKER { _PyAST_Module(a, NULL, p->arena) }
  expr_stmt[stmt_ty]: a=expr NEWLINE { _PyAST_Expr(a, EXTRA) }

  expr[expr_ty]:
    | l=expr '+' r=term { _PyAST_BinOp(l, Add, r, EXTRA) }
    | l=expr '-' r=term { _PyAST_BinOp(l, Sub, r, EXTRA) }
    | term

  term[expr_ty]:
    | l=term '*' r=factor { _PyAST_BinOp(l, Mult, r, EXTRA) }
    | l=term '/' r=factor { _PyAST_BinOp(l, Div, r, EXTRA) }
    | factor

  factor[expr_ty]:
    | '(' e=expr ')' { e }
    | atom

  atom[expr_ty]:
    | NAME
    | NUMBER
}
```

Here ``EXTRA`` is a macro that expands to ``start_lineno, start_col_offset,
end_lineno, end_col_offset, p->arena``, those being variables automatically
injected by the parser; ``p`` points to an object that holds on to all state
for the parser.

A similar grammar written to target Python AST objects:

```
{
  start[ast.Module]: a=expr_stmt* ENDMARKER { ast.Module(body=a or []) }
  expr_stmt: a=expr NEWLINE { ast.Expr(value=a, EXTRA) }

  expr:
    | l=expr '+' r=term { ast.BinOp(left=l, op=ast.Add(), right=r, EXTRA) }
    | l=expr '-' r=term { ast.BinOp(left=l, op=ast.Sub(), right=r, EXTRA) }
    | term

  term:
    | l=term '*' r=factor { ast.BinOp(left=l, op=ast.Mult(), right=r, EXTRA) }
    | l=term '/' r=factor { ast.BinOp(left=l, op=ast.Div(), right=r, EXTRA) }
    | factor

  factor:
    | '(' e=expr ')' { e }
    | atom

  atom:
    | NAME
    | NUMBER
}
```

Pegen
=====

Pegen is the parser generator used in CPython to produce the final PEG parser
used by the interpreter. It is the program that can be used to read the Python
grammar located in
[`Grammar/python.gram`](https://github.com/python/cpython/blob/main/Grammar/python.gram)
and produce the final C parser. It contains the following pieces:

- A parser generator that can read a grammar file and produce a PEG parser
  written in Python or C that can parse said grammar. The generator is located at
  [`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen).
- A PEG meta-grammar that automatically generates a Python parser which is used
  for the parser generator itself (this means that there are no manually-written
  parsers). The meta-grammar is located at
  [`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram).
- A generated parser (using the parser generator) that can directly produce C and Python AST objects.

The source code for Pegen lives at
[`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen)
but all typical commands to interact with the parser generator are normally
executed from the main ``Makefile``.

How to regenerate the parser
----------------------------

Once you have made the changes to the grammar files, to regenerate the ``C``
parser (the one used by the interpreter) just execute:

> make regen-pegen

using the ``Makefile`` in the main directory. If you are on Windows you can
use the Visual Studio project files to regenerate the parser or to execute:

> ./PCbuild/build.bat --regen

The generated parser file is located at
[`Parser/parser.c`](https://github.com/python/cpython/blob/main/Parser/parser.c).
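Once the interpreter has been rebuilt, the regenerated parser is the one behind
``compile()`` and the standard library
[`ast`](https://docs.python.org/3/library/ast.html) module, so a quick smoke
test needs nothing beyond the fresh binary. A minimal check (illustrative only,
and intentionally mirroring the toy arithmetic grammars shown above) could be:

```
# Run with the freshly built ./python: parse an expression and dump its AST.
import ast

tree = ast.parse("1 + 2 * 3")
print(ast.dump(tree.body[0].value, indent=2))
# Prints a BinOp(op=Add()) whose right side is a nested BinOp(op=Mult()),
# the same shape the BinOp actions in the example grammars above construct.
```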
How to regenerate the meta-parser
---------------------------------

The meta-grammar (the grammar that describes the grammar for the grammar files
themselves) is located at
[`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram).
Although it is very unlikely that you will ever need to modify it, if you make
any modifications to this file (in order to implement new Pegen features) you will
need to regenerate the meta-parser (the parser that parses the grammar files).
To do so just execute:

> make regen-pegen-metaparser

If you are on Windows you can use the Visual Studio project files
to regenerate the parser or to execute:

> ./PCbuild/build.bat --regen


Grammatical elements and rules
------------------------------

Pegen has some special grammatical elements and rules:

- Strings with single quotes (') (for example, ``'class'``) denote KEYWORDS.
- Strings with double quotes (") (for example, ``"match"``) denote SOFT KEYWORDS.
- Uppercase names (for example, ``NAME``) denote tokens in the
  [`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens) file.
- Rule names starting with ``invalid_`` are used for specialized syntax errors.

  - These rules are NOT used in the first pass of the parser.
  - A second pass, which includes the invalid rules, is executed only if the
    first pass fails to parse.
  - If the parser fails in the second phase with a generic syntax error, the
    location of the generic failure of the first pass will be used (this avoids
    reporting incorrect locations due to the invalid rules).
  - The order of the alternatives involving invalid rules matters
    (like any rule in PEG).

Tokenization
------------

It is common among PEG parser frameworks that the parser does both the parsing
and the tokenization, but this does not happen in Pegen. The reason is that the
Python language needs a custom tokenizer to handle things like indentation
boundaries, some special keywords like ``ASYNC`` and ``AWAIT`` (for
compatibility purposes), backtracking errors (such as unclosed parentheses),
dealing with encoding, interactive mode and much more. Some of these reasons
are historical, while others remain useful even today.

The list of tokens (all uppercase names in the grammar) that you can use can
be found in the
[`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens)
file. If you change this file to add new tokens, make sure to regenerate the
files by executing:

> make regen-token

If you are on Windows you can use the Visual Studio project files to regenerate
the tokens or to execute:

> ./PCbuild/build.bat --regen

How tokens are generated and the rules governing this are completely up to the tokenizer
([`Parser/lexer`](https://github.com/python/cpython/blob/main/Parser/lexer)
and
[`Parser/tokenizer`](https://github.com/python/cpython/blob/main/Parser/tokenizer));
the parser just receives tokens from it.

Memoization
-----------

As described previously, to avoid exponential time complexity in the parser,
memoization is used.

The C parser used by Python is highly optimized and memoization can be expensive
both in memory and time. Although the memory cost is obvious (the parser needs
memory for storing previous results in the cache) the execution time cost comes
from continuously checking if the given rule has a cache hit or not.
In many
situations, just parsing it again can be faster. Pegen **disables memoization
by default** except for rules with the special marker ``memo`` after the rule
name (and type, if present):

> rule_name[type] (memo):
>    ...

By selectively turning on memoization for a handful of rules, the parser becomes
faster and uses less memory.

---
**Note**

 Left-recursive rules always use memoization, since the implementation of
 left-recursion depends on it.

---

To determine whether a new rule needs memoization or not, benchmarking is required
(comparing execution times and memory usage of some considerably large files with
and without memoization). There is a very simple instrumentation API available
in the generated C parser code that makes it possible to measure how much each
rule uses memoization (check the
[`Parser/pegen.c`](https://github.com/python/cpython/blob/main/Parser/pegen.c)
file for more information) but it needs to be manually activated.

Automatic variables
-------------------

To make writing actions easier, Pegen injects some automatic variables in the
namespace available when writing actions. In the C parser, some of these
automatic variable names are:

- ``p``: The parser structure.
- ``EXTRA``: This is a macro that expands to
  ``(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)``,
  which is normally used to create AST nodes as almost all constructors need these
  attributes to be provided. All of the location variables are taken from the
  location information of the current token.

Hard and soft keywords
----------------------

---
**Note**

In the grammar files, keywords are defined using **single quotes** (for example,
``'class'``) while soft keywords are defined using **double quotes** (for example,
``"match"``).

---

There are two kinds of keywords allowed in pegen grammars: *hard* and *soft*
keywords. The difference between hard and soft keywords is that hard keywords
are always reserved words, even in positions where they make no sense
(for example, ``x = class + 1``), while soft keywords only get a special
meaning in context. Trying to use a hard keyword as a variable will always
fail:

```
{
  >>> class = 3
  File "<stdin>", line 1
    class = 3
        ^
  SyntaxError: invalid syntax
  >>> foo(class=3)
  File "<stdin>", line 1
    foo(class=3)
        ^^^^^
  SyntaxError: invalid syntax
}
```

While soft keywords don't have this limitation if used in a context other
than the one where they are defined as keywords:

```
{
  >>> match = 45
  >>> foo(match="Yeah!")
}
```

The ``match`` and ``case`` keywords are soft keywords, so that they are
recognized as keywords at the beginning of a match statement or case block
respectively, but are allowed to be used in other places as variable or
argument names.
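For example, the following illustrative session shows the same name acting as
an ordinary variable and as a keyword, depending only on context:

```
>>> match = print          # fine: "match" is only a soft keyword
>>> match("hello!")        # an ordinary call; "match" is just a name here
hello!
>>> match 42:              # but at the start of a statement it opens a match block
...     case int():
...         print("an int")
...
an int
```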
You can get a list of all keywords defined in the grammar from Python:

```
{
  >>> import keyword
  >>> keyword.kwlist
  ['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break',
  'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for',
  'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or',
  'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
}
```

as well as soft keywords:

```
{
  >>> import keyword
  >>> keyword.softkwlist
  ['_', 'case', 'match']
}
```

---
**Caution**

Soft keywords can be a bit challenging to manage as they can be accepted in
places you don't intend, given how the order alternatives behave in PEG
parsers (see [consequences of ordered choice section](#order-consequences)
for some background on this). In general, try to define them in places where
there are not many alternatives.

---

Error handling
--------------

When a pegen-generated parser detects that an exception is raised, it will
**automatically stop parsing**, no matter what the current state of the parser
is, and it will unwind the stack and report the exception. This means that if a
[rule action](#peg-grammar-actions) raises an exception, all parsing will
stop at that exact point. This is done to correctly propagate any
exception set by calling Python's C API functions. This also includes
[``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError)
exceptions and it is the main mechanism the parser uses to report custom syntax
error messages.

---
**Note**

Tokenizer errors are normally reported by raising exceptions but some special
tokenizer errors such as unclosed parentheses will be reported only after the
parser finishes without returning anything.

---

How syntax errors are reported
------------------------------

As described previously in the [how PEG parsers work](#how-peg-parsers-work)
section, PEG parsers don't have a defined concept of where errors happened
in the grammar, because a rule failure doesn't imply a parsing failure as it
does in context-free grammars. This means that a heuristic has to be used to report
generic errors unless something is explicitly declared as an error in the
grammar.

To report generic syntax errors, pegen uses a common heuristic in PEG parsers:
the location of *generic* syntax errors is reported to be the furthest token that
was attempted to be matched but failed. This is only done if parsing has failed
(the parser returns ``NULL`` in C or ``None`` in Python) but no exception has
been raised.

As the Python grammar was originally written as an ``LL(1)`` grammar, this heuristic
has an extremely high success rate, but some PEG features can have small effects,
such as [positive lookaheads](#peg-positive-lookahead) and
[negative lookaheads](#peg-negative-lookahead).

---
**Caution**

Positive and negative lookaheads will try to match a token so they will affect
the location of generic syntax errors. Use them carefully at boundaries
between rules.

---

To generate more precise syntax errors, custom rules are used. This is a common practice
also in context-free grammars: the parser will try to accept some construct that is known
to be incorrect just to report a specific syntax error for that construct. In pegen grammars,
these rules start with the ``invalid_`` prefix.
This is because trying to match these rules
normally has a performance impact on parsing (and can also affect the 'correct' grammar itself
in some tricky cases, depending on the ordering of the rules) so the generated parser acts in
two phases:

1. The first phase will try to parse the input stream without taking into account rules that
   start with the ``invalid_`` prefix. If the parsing succeeds it will return the generated AST
   and the second phase will be skipped.

2. If the first phase failed, a second parsing attempt is done including the rules that start
   with an ``invalid_`` prefix. By design this attempt **cannot succeed** and is only executed
   to give the invalid rules a chance to detect specific situations where custom, more precise,
   syntax errors can be raised. This also allows trading a bit of performance for precision in
   reporting errors: given that we know that the input text is invalid, there is typically no
   need to be fast because execution is going to stop anyway.

---
**Important**

When defining invalid rules:

- Make sure all custom invalid rules raise
  [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError)
  exceptions (or a subclass of it).
- Make sure **all** invalid rules start with the ``invalid_`` prefix to not
  impact performance of parsing correct Python code.
- Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules
  (see the [how PEG parsers work](#how-peg-parsers-work) section for more information).

---

You can find a collection of macros to raise specialized syntax errors in the
[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
header file. These macros also allow reporting ranges for
the custom errors, which will be highlighted in the tracebacks displayed
when the error is reported.


---
**Tip**

A good way to test whether an invalid rule will be triggered when you expect it to be is to test
if introducing a syntax error **after** valid code triggers the rule or not. For example:

> $ 42

should trigger the syntax error at the ``$`` character. If your rule is not correctly defined this
won't happen. As another example, suppose that you try to define a rule to match Python 2 style
``print`` statements in order to create a better error message and you define it as:

> invalid_print: "print" expression

This will **seem** to work because the parser will correctly parse ``print(something)``, as it is valid
code, and the second phase will never execute. But if you try to parse ``print(something) $ 3``, the first pass
of the parser will fail (because of the ``$``) and in the second phase, the rule will match
``print(something)`` as ``print`` followed by the variable ``something`` between parentheses, and the error
will be reported there instead of at the ``$`` character.

Generating AST objects
----------------------

The output of the C parser used by CPython, which is generated from the
[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram),
is a Python AST object (using C structures). This means that the actions in the
grammar file generate AST objects when they succeed. Constructing these objects
can be quite cumbersome (see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
for more information on how these objects are constructed and how they are used
by the compiler), so special helper functions are used.
These functions are
declared in the
[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
header file and defined in the
[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c)
file. The helpers include functions that join AST sequences, get specific elements
from them or perform extra processing on the generated tree.


---
**Caution**

Actions must **never** be used to accept or reject rules. It may be tempting
in some situations to write a very generic rule and then check the generated
AST to decide whether it is valid or not, but this will render the
[official grammar](https://docs.python.org/3/reference/grammar.html) partially
incorrect (because it does not include actions) and will make it more difficult
for other Python implementations to adapt the grammar to their own needs.

---

As a general rule, if an action spans multiple lines or requires something more
complicated than a single expression of C code, it is normally better to create a
custom helper in
[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c)
and expose it in the
[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
header file so that it can be used from the grammar.

When parsing succeeds, the parser **must** return a **valid** AST object.

Testing
=======

There are three files that contain tests for the grammar and the parser:

- [test_grammar.py](https://github.com/python/cpython/blob/main/Lib/test/test_grammar.py)
- [test_syntax.py](https://github.com/python/cpython/blob/main/Lib/test/test_syntax.py)
- [test_exceptions.py](https://github.com/python/cpython/blob/main/Lib/test/test_exceptions.py)

Check the contents of these files to know which is the best place for new tests, depending
on the nature of the new feature you are adding.

Tests for the parser generator itself can be found in the
[test_peg_generator](https://github.com/python/cpython/blob/main/Lib/test/test_peg_generator)
directory.


Debugging generated parsers
===========================

Making experiments
------------------

As the generated C parser is the one used by Python, this means that if
something goes wrong when adding some new rules to the grammar, you cannot
correctly compile and execute Python anymore. This makes it a bit challenging
to debug when something goes wrong, especially when experimenting.

For this reason it is a good idea to experiment first by generating a Python
parser. To do this, you can go to the
[Tools/peg_generator](https://github.com/python/cpython/blob/main/Tools/peg_generator)
directory in the CPython repository and manually call the parser generator by executing:

> $ python -m pegen python

This will generate a file called ``parse.py`` in the same directory that you
can use to parse some input:

> $ python parse.py file_with_source_code_to_test.py

As the generated ``parse.py`` file is just Python code, you can modify it
and add breakpoints to debug or better understand some complex situations.


Verbose mode
------------

When Python is compiled in debug mode (by adding ``--with-pydebug`` when
running the configure step on Linux or by adding ``-d`` when calling the
[PCbuild/build.bat](https://github.com/python/cpython/blob/main/PCbuild/build.bat)),
it is possible to activate a **very** verbose mode in the generated parser.
This
is very useful to debug the generated parser and to understand how it works, but it
can be a bit hard to understand at first.

---
**Note**

When activating verbose mode in the Python parser, it is better to not use interactive
mode as it can be much harder to understand, because interactive mode involves some
special steps compared to regular parsing.

---

To activate verbose mode you can add the ``-d`` flag when executing Python:

> $ python -d file_to_test.py

This will print **a lot** of output to ``stderr`` so it is probably better to dump
it to a file for further analysis. The output consists of trace lines with the
following structure:

> <indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ...

Every line is indented by a different amount (``<indentation>``) depending on how
deep the call stack is. The next character marks the type of the trace:

- ``>`` indicates that a rule is going to be attempted to be parsed.
- ``-`` indicates that a rule has failed to be parsed.
- ``+`` indicates that a rule has been parsed correctly.
- ``!`` indicates that an exception or an error has been detected and the parser is unwinding.

The ``<token_location>`` part indicates the current index in the token array,
the ``<rule_name>`` part indicates what rule is being parsed and
the ``<alternative>`` part indicates what alternative within that rule
is being attempted.


---
**Document history**

 Pablo Galindo Salgado - Original author

--
From 6af1ea14154426b7087a09ffa1ead36eae52e5a0 Mon Sep 17 00:00:00 2001
From: Irit Katriel
Date: Tue, 8 Oct 2024 18:13:50 +0100
Subject: [PATCH 02/12] fixups

---
 InternalDocs/parser.md | 103 ++++++++++++++++++++---------------------
 1 file changed, 49 insertions(+), 54 deletions(-)

diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md
index c63ebf5bac0476..d2d9abb5448e80 100644
--- a/InternalDocs/parser.md
+++ b/InternalDocs/parser.md
@@ -17,19 +17,17 @@ Therefore, changes to the Python language are made by modifying the
[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram).
Developers rarely need to modify the generator itself.

-See Also:
-
-- [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar)
-  for a detailed description of the grammar.
+See [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar)
+for a detailed description of the grammar and the process for changing it.

How PEG parsers work
====================
-#how-peg-parsers-work

-A PEG (Parsing Expression Grammar) grammar differs from a context-free grammar
+A PEG (Parsing Expression Grammar) grammar differs from a
+[context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar)
in that the way it is written more closely reflects how the parser will operate
when parsing. The fundamental technical difference is that the choice operator
-is ordered. This means that when writing::
+is ordered. This means that when writing:

> rule: A | B | C

@@ -74,10 +72,21 @@ which is normally table-based.

Key ideas
---------

+- Alternatives are ordered ( ``A | B`` is not the same as ``B | A`` ).
+- If a rule returns a failure, it doesn't mean that the parsing has failed,
+  it just means "try something else".
+- By default PEG parsers run in exponential time, which can be optimized to linear by
+  using memoization.
+- If parsing fails completely (no rule succeeds in parsing all the input text), the
+  PEG parser doesn't have a concept of "where the
+  [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".
+
+
---
**Important**

-Don't try to reason about a PEG grammar in the same way you would with an EBNF
+Don't try to reason about a PEG grammar in the same way you would with an
+[EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form)
or context-free grammar. PEG is optimized to describe **how** input strings will
be parsed, while context-free grammars are optimized to generate strings of the
language they describe (in EBNF, to know whether a given string is in the
@@ -86,19 +95,8 @@ the grammar).

---

-- Alternatives are ordered ( ``A | B`` is not the same as ``B | A`` ).
-- If a rule returns a failure, it doesn't mean that the parsing has failed,
-  it just means "try something else".
-- By default PEG parsers run in exponential time, which can be optimized to linear by
-  using memoization.
-- If parsing fails completely (no rule succeeds in parsing all the input text), the
-  PEG parser doesn't have a concept of "where the
-  [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".
-
-
Consequences of the ordered choice operator
-------------------------------------------
-#order-consequences

Although PEG may look like EBNF, its meaning is quite different. The fact
that the alternatives are ordered in a PEG grammar (which is at the core of
how PEG parsers work) has deep consequences beyond removing ambiguity.
@@ -110,6 +108,7 @@ Thus the parser is said to be "eager". To illustrate this, consider
the following two rules (in these examples, a token is an individual character):

> first_rule: ( 'a' | 'aa' ) 'a'
+
> second_rule: ('aa' | 'a' ) 'a'

In a regular EBNF grammar, both rules specify the language ``{aa, aaa}`` but
@@ -136,7 +135,9 @@ For this reason, writing rules where an alternative is contained in the next
one is in almost all cases a mistake, for example:

> my_rule:
+
> | 'if' expression 'then' block
+
> | 'if' expression 'then' block 'else' block

In this example, the second alternative will never be tried because the first one will
@@ -144,7 +145,9 @@ succeed first (even if the input string has an ``'else' block`` that follows). T
write this rule you can simply alter the order:

> my_rule:
+
> | 'if' expression 'then' block 'else' block
+
> | 'if' expression 'then' block

In this case, if the input string doesn't have an ``'else' block``, the first alternative
@@ -169,20 +172,17 @@ If the return type is omitted, then a ``void *`` is returned in C and an
``Any`` in Python.

Grammar expressions
-------------------

-``# comment``
-^^^^^^^^^^^^^
+**``# comment``**

Python-style comments.

-``e1 e2``
-^^^^^^^^^
+**``e1 e2``**

Match ``e1``, then match ``e2``.

> rule_name: first_rule second_rule

-``e1 | e2``
-^^^^^^^^^^^
+**``e1 | e2``**

Match ``e1`` or ``e2``.

@@ -191,13 +191,14 @@ for formatting purposes. In that case, a \| must be used before the
first alternative, like so:

> rule_name[return_type]:
+
> | first_alt
+
> | second_alt

-``( e )``
-^^^^^^^^^
+**``( e )`` (grouping operator)**

-Match ``e``. This is the grouping operator.
+Match ``e``.

> rule_name: (e)

@@ -206,8 +207,7 @@ A slightly more complex and useful example includes using the grouping
operator together with the repeat operator:

> rule_name: (e1 e2)*

-``[ e ] or e?``
-^^^^^^^^^^^^^^^
+**``[ e ] or e?``**

Optionally match ``e``.

@@ -218,38 +218,31 @@ A more useful example includes defining that a trailing comma is
optional:

> rule_name: e (',' e)* [',']

-``e*``
-^^^^^^
+**``e*``**

Match zero or more occurrences of ``e``.
> rule_name: (e1 e2)* -``e+`` -^^^^^^ +**``e+``** Match one or more occurrences of ``e``. > rule_name: (e1 e2)+ -``s.e+`` -^^^^^^^^ +**``s.e+``** -Match one or more occurrences of ``e``, separated by ``s``. The generated parse -tree does not include the separator. This is otherwise identical to +Match one or more occurrences of ``e``, separated by ``s``. The generated +parse tree does not include the separator. This is otherwise identical to ``(e (s e)*)``. > rule_name: ','.e+ -``&e`` -^^^^^^ -#peg-positive-lookahead +**``&e`` (positive lookahead)** Succeed if ``e`` can be parsed, without consuming any input. -``!e`` -^^^^^^ -#peg-negative-lookahead +**``!e`` (negative lookahead)** Fail if ``e`` can be parsed, without consuming any input. @@ -259,8 +252,7 @@ consists of an atom, which is not followed by a ``.`` or a ``(`` or a > primary: atom !'.' !'(' !'[' -``~`` -^^^^^ +**``~``** Commit to the current alternative, even if it fails to parse (this is called the "cut"). @@ -283,7 +275,9 @@ allows us to write not only simple left-recursive rules but also more complicated rules that involve indirect left-recursion like: > rule1: rule2 | 'a' + > rule2: rule3 | 'b' + > rule3: rule1 | 'c' and "hidden left-recursion" like: @@ -300,7 +294,6 @@ A sub-expression can be named by preceding it with an identifier and an Grammar actions --------------- -#peg-grammar-actions To avoid the intermediate steps that obscure the relationship between the grammar and the AST generation, the PEG parser allows directly generating AST @@ -325,7 +318,9 @@ To indicate these actions each alternative can be followed by the action code inside curly-braces, which specifies the return value of the alternative: > rule_name[return_type]: + > | first_alt1 first_alt2 { first_alt1 } + > | second_alt1 second_alt2 { second_alt1 } If the action is omitted, a default action is generated: @@ -739,9 +734,10 @@ as well as soft keywords: Soft keywords can be a bit challenging to manage as they can be accepted in places you don't intend, given how the order alternatives behave in PEG -parsers (see [consequences of ordered choice section](#order-consequences) -for some background on this). In general, try to define them in places where -there are not many alternatives. +parsers (see the +[consequences of ordered choice](#consequences-of-the-ordered-choice-operator) +section for some background on this). In general, try to define them in places +where there are not many alternatives. --- @@ -751,7 +747,7 @@ Error handling When a pegen-generated parser detects that an exception is raised, it will **automatically stop parsing**, no matter what the current state of the parser is, and it will unwind the stack and report the exception. This means that if a -[rule action](#peg-grammar-actions) raises an exception, all parsing will +[rule action](#grammar-actions) raises an exception, all parsing will stop at that exact point. This is done to allow to correctly propagate any exception set by calling Python's C API functions. This also includes [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) @@ -784,9 +780,8 @@ was attempted to be matched but failed. This is only done if parsing has failed been raised. As the Python grammar was primordially written as an ``LL(1)`` grammar, this heuristic -has an extremely high success rate, but some PEG features can have small effects, -such as [positive lookaheads](#peg-positive-lookahead) and -[negative lookaheads)(#peg-negative-lookahead). 
+has an extremely high success rate, but some PEG features, such as lookaheads, +can impact this. --- **Caution** From 91c1067fe4a6a48468f79729768f87279a09dd2b Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Tue, 8 Oct 2024 19:09:29 +0100 Subject: [PATCH 03/12] formatting --- InternalDocs/parser.md | 279 ++++++++++++++++++++++++++++++----------- 1 file changed, 204 insertions(+), 75 deletions(-) diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index d2d9abb5448e80..53e0ba5c6d9676 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -29,7 +29,11 @@ in that the way it is written more closely reflects how the parser will operate when parsing. The fundamental technical difference is that the choice operator is ordered. This means that when writing: -> rule: A | B | C +``` +{ + rule: A | B | C +} +``` a parser that implements a context-free-grammar (such as an ``LL(1)`` parser) will generate constructions that, given an input string, *deduce* which alternative @@ -107,9 +111,12 @@ If a rule has two alternatives and the first of them succeeds, the second one is Thus the parser is said to be "eager". To illustrate this, consider the following two rules (in these examples, a token is an individual character): -> first_rule: ( 'a' | 'aa' ) 'a' - -> second_rule: ('aa' | 'a' ) 'a' +``` +{ + first_rule: ( 'a' | 'aa' ) 'a' + second_rule: ('aa' | 'a' ) 'a' +} +``` In a regular EBNF grammar, both rules specify the language ``{aa, aaa}`` but in PEG, one of these two rules accepts the string ``aaa`` but not the string @@ -134,21 +141,25 @@ hidden by many levels of rules. For this reason, writing rules where an alternative is contained in the next one is in almost all cases a mistake, for example: -> my_rule: - -> | 'if' expression 'then' block - -> | 'if' expression 'then' block 'else' block +``` +{ + my_rule: + | 'if' expression 'then' block + | 'if' expression 'then' block 'else' block +} +``` In this example, the second alternative will never be tried because the first one will succeed first (even if the input string has an ``'else' block`` that follows). To correctly write this rule you can simply alter the order: -> my_rule: - -> | 'if' expression 'then' block 'else' block - -> | 'if' expression 'then' block +``` +{ + my_rule: + | 'if' expression 'then' block 'else' block + | 'if' expression 'then' block +} +``` In this case, if the input string doesn't have an ``'else' block``, the first alternative will fail and the second will be attempted. @@ -158,13 +169,21 @@ Grammar Syntax The grammar consists of a sequence of rules of the form: -> rule_name: expression +``` +{ + rule_name: expression +} +``` Optionally, a type can be included right after the rule name, which specifies the return type of the C or Python function corresponding to the rule: -> rule_name[return_type]: expression +``` +{ + rule_name[return_type]: expression +} +``` If the return type is omitted, then a ``void *`` is returned in C and an ``Any`` in Python. @@ -180,7 +199,11 @@ Python-style comments. Match ``e1``, then match ``e2``. -> rule_name: first_rule second_rule +``` +{ + rule_name: first_rule second_rule +} +``` **``e1 | e2``** @@ -190,45 +213,71 @@ The first alternative can also appear on the line after the rule name for formatting purposes. In that case, a \| must be used before the first alternative, like so: -> rule_name[return_type]: - -> | first_alt - -> | second_alt +``` +{ + rule_name[return_type]: + | first_alt + | second_alt +} +``` **``( e )`` (grouping operator)** Match ``e``. 
-> rule_name: (e) +``` +{ + rule_name: (e) +} +``` A slightly more complex and useful example includes using the grouping operator together with the repeat operator: -> rule_name: (e1 e2)* +``` +{ + rule_name: (e1 e2)* +} +``` **``[ e ] or e?``** Optionally match ``e``. -> rule_name: [e] +``` +{ + rule_name: [e] +} +``` A more useful example includes defining that a trailing comma is optional: -> rule_name: e (',' e)* [','] +``` +{ + rule_name: e (',' e)* [','] +} +``` **``e*``** Match zero or more occurrences of ``e``. -> rule_name: (e1 e2)* +``` +{ + rule_name: (e1 e2)* +} +``` **``e+``** Match one or more occurrences of ``e``. -> rule_name: (e1 e2)+ +``` +{ + rule_name: (e1 e2)+ +} +``` **``s.e+``** @@ -236,7 +285,11 @@ Match one or more occurrences of ``e``, separated by ``s``. The generated parse tree does not include the separator. This is otherwise identical to ``(e (s e)*)``. -> rule_name: ','.e+ +``` +{ + rule_name: ','.e+ +} +``` **``&e`` (positive lookahead)** @@ -250,14 +303,22 @@ An example taken from the Python grammar specifies that a primary consists of an atom, which is not followed by a ``.`` or a ``(`` or a ``[``: -> primary: atom !'.' !'(' !'[' +``` +{ + primary: atom !'.' !'(' !'[' +} +``` **``~``** Commit to the current alternative, even if it fails to parse (this is called the "cut"). -> rule_name: '(' ~ some_rule ')' | some_alt +``` +{ + rule_name: '(' ~ some_rule ')' | some_alt +} +``` In this example, if a left parenthesis is parsed, then the other alternative won’t be considered, even if some_rule or ``)`` fail to be @@ -274,15 +335,21 @@ in [Warth et al.](http://web.cs.ucla.edu/~todd/research/pepm08.pdf). This allows us to write not only simple left-recursive rules but also more complicated rules that involve indirect left-recursion like: -> rule1: rule2 | 'a' - -> rule2: rule3 | 'b' - -> rule3: rule1 | 'c' +``` +{ + rule1: rule2 | 'a' + rule2: rule3 | 'b' + rule3: rule1 | 'c' +} +``` and "hidden left-recursion" like: -> rule: 'optional'? rule '@' some_other_rule +``` +{ + rule: 'optional'? rule '@' some_other_rule +} +``` Variables in the grammar ------------------------ @@ -290,7 +357,11 @@ Variables in the grammar A sub-expression can be named by preceding it with an identifier and an ``=`` sign. The name can then be used in the action (see below), like this: -> rule_name[return_type]: '(' a=some_other_rule ')' { a } +``` +{ + rule_name[return_type]: '(' a=some_other_rule ')' { a } +} +``` Grammar actions --------------- @@ -317,11 +388,13 @@ some other required operations that are not directly related to the grammar. To indicate these actions each alternative can be followed by the action code inside curly-braces, which specifies the return value of the alternative: -> rule_name[return_type]: - -> | first_alt1 first_alt2 { first_alt1 } - -> | second_alt1 second_alt2 { second_alt1 } +``` +{ + rule_name[return_type]: + | first_alt1 first_alt2 { first_alt1 } + | second_alt1 second_alt2 { second_alt1 } +} +``` If the action is omitted, a default action is generated: @@ -528,12 +601,20 @@ How to regenerate the parser Once you have made the changes to the grammar files, to regenerate the ``C`` parser (the one used by the interpreter) just execute: -> make regen-pegen +``` +{ + make regen-pegen +} +``` using the ``Makefile`` in the main directory. 
If you are on Windows you can use the Visual Studio project files to regenerate the parser or to execute: -> ./PCbuild/build.bat --regen +``` +{ + ./PCbuild/build.bat --regen +} +``` The generated parser file is located at [`Parser/parser.c`](https://github.com/python/cpython/blob/main/Parser/parser.c). @@ -549,12 +630,20 @@ any modifications to this file (in order to implement new Pegen features) you wi need to regenerate the meta-parser (the parser that parses the grammar files). To do so just execute: -> make regen-pegen-metaparser +``` +{ + make regen-pegen-metaparser +} +``` If you are on Windows you can use the Visual Studio project files to regenerate the parser or to execute: -> ./PCbuild/build.bat --regen +``` +{ + ./PCbuild/build.bat --regen +} +``` Grammatical elements and rules @@ -594,12 +683,20 @@ be found in thei file. If you change this file to add new tokens, make sure to regenerate the files by executing: -> make regen-token +``` +{ + make regen-token +} +``` If you are on Windows you can use the Visual Studio project files to regenerate the tokens or to execute: -> ./PCbuild/build.bat --regen +``` +{ + ./PCbuild/build.bat --regen +} +``` How tokens are generated and the rules governing this are completely up to the tokenizer ([`Parser/lexer`](https://github.com/python/cpython/blob/main/Parser/lexer) @@ -621,8 +718,12 @@ situations, just parsing it again can be faster. Pegen **disables memoization by default** except for rules with the special marker ``memo`` after the rule name (and type, if present): -> rule_name[typr] (memo): -> ... +``` +{ + rule_name[typr] (memo): + ... +} +``` By selectively turning on memoization for a handful of rules, the parser becomes faster and uses less memory. @@ -792,24 +893,28 @@ between rules. --- -To generate more precise syntax errors, custom rules are used. This is a common practice -also in context free grammars: the parser will try to accept some construct that is known -to be incorrect just to report a specific syntax error for that construct. In pegen grammars, -these rules start with the ``invalid_`` prefix. This is because trying to match these rules -normally has a performance impact on parsing (and can also affect the 'correct' grammar itself -in some tricky cases, depending on the ordering of the rules) so the generated parser acts in -two phases: - -1. The first phase will try to parse the input stream without taking into account rules that - start with the ``invalid_`` prefix. If the parsing succeeds it will return the generated AST - and the second phase will be skipped. - -2. If the first phase failed, a second parsing attempt is done including the rules that start - with an ``invalid_`` prefix. By design this attempt **cannot succeed** and is only executed - to give to the invalid rules a chance to detect specific situations where custom, more precise, - syntax errors can be raised. This also allows to trade a bit of performance for precision reporting - errors: given that we know that the input text is invalid, there is typically no need to be fast - because execution is going to stop anyway. +To generate more precise syntax errors, custom rules are used. This is a common +practice also in context free grammars: the parser will try to accept some +construct that is known to be incorrect just to report a specific syntax error +for that construct. In pegen grammars, these rules start with the ``invalid_`` +prefix. 
This is because trying to match these rules normally has a performance +impact on parsing (and can also affect the 'correct' grammar itself in some +tricky cases, depending on the ordering of the rules) so the generated parser +acts in two phases: + +1. The first phase will try to parse the input stream without taking into + account rules that start with the ``invalid_`` prefix. If the parsing + succeeds it will return the generated AST and the second phase will be + skipped. + +2. If the first phase failed, a second parsing attempt is done including the + rules that start with an ``invalid_`` prefix. By design this attempt + **cannot succeed** and is only executed to give to the invalid rules a + chance to detect specific situations where custom, more precise, syntax + errors can be raised. This also allows to trade a bit of performance for + precision reporting errors: given that we know that the input text is + invalid, there is typically no need to be fast because execution is going + to stop anyway. --- **Important** @@ -839,13 +944,21 @@ displayed when the error is reported. A good way to test whether an invalid rule will be triggered when you expect is to test if introducing a syntax error **after** valid code triggers the rule or not. For example: -> $ 42 +``` +{ + $ 42 +} +``` should trigger the syntax error in the ``$`` character. If your rule is not correctly defined this won't happen. As another example, suppose that you try to define a rule to match Python 2 style ``print`` statements in order to create a better error message and you define it as: -> invalid_print: "print" expression +``` +{ + invalid_print: "print" expression +} +``` This will **seem** to work because the parser will correctly parse ``print(something)`` because it is valid code and the second phase will never execute but if you try to parse ``print(something) $ 3`` the first pass @@ -926,12 +1039,20 @@ parser. To do this, you can go to the [Tools/peg_generator](https://github.com/python/cpython/blob/main/Tools/peg_generator) directory on the CPython repository and manually call the parser generator by executing: -> $ python -m pegen python +``` +{ + $ python -m pegen python +} +``` This will generate a file called ``parse.py`` in the same directory that you can use to parse some input: -> $ python parse.py file_with_source_code_to_test.py +``` +{ + $ python parse.py file_with_source_code_to_test.py +} +``` As the generated ``parse.py`` file is just Python code, you can modify it and add breakpoints to debug or better understand some complex situations. @@ -958,13 +1079,21 @@ special steps compared to regular parsing. To activate verbose mode you can add the ``-d`` flag when executing Python: -> $ python -d file_to_test.py +``` +{ + $ python -d file_to_test.py +} +``` This will print **a lot** of output to ``stderr`` so it is probably better to dump it to a file for further analysis. The output consists of trace lines with the following structure:: -> ('>'|'-'|'+'|'!') []: ... +``` +{ + ('>'|'-'|'+'|'!') []: ... +} +``` Every line is indented by a different amount (````) depending on how deep the call stack is. 
The next character marks the type of the trace: From 606d30ce8594815504edfd04a115610fb3a514fb Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Tue, 8 Oct 2024 19:34:18 +0100 Subject: [PATCH 04/12] formatting --- InternalDocs/parser.md | 82 ------------------------------------------ 1 file changed, 82 deletions(-) diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index 53e0ba5c6d9676..c19c175fa8bfc9 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -30,9 +30,7 @@ when parsing. The fundamental technical difference is that the choice operator is ordered. This means that when writing: ``` -{ rule: A | B | C -} ``` a parser that implements a context-free-grammar (such as an ``LL(1)`` parser) will @@ -112,10 +110,8 @@ Thus the parser is said to be "eager". To illustrate this, consider the following two rules (in these examples, a token is an individual character): ``` -{ first_rule: ( 'a' | 'aa' ) 'a' second_rule: ('aa' | 'a' ) 'a' -} ``` In a regular EBNF grammar, both rules specify the language ``{aa, aaa}`` but @@ -142,11 +138,9 @@ For this reason, writing rules where an alternative is contained in the next one is in almost all cases a mistake, for example: ``` -{ my_rule: | 'if' expression 'then' block | 'if' expression 'then' block 'else' block -} ``` In this example, the second alternative will never be tried because the first one will @@ -154,11 +148,9 @@ succeed first (even if the input string has an ``'else' block`` that follows). T write this rule you can simply alter the order: ``` -{ my_rule: | 'if' expression 'then' block 'else' block | 'if' expression 'then' block -} ``` In this case, if the input string doesn't have an ``'else' block``, the first alternative @@ -170,9 +162,7 @@ Grammar Syntax The grammar consists of a sequence of rules of the form: ``` -{ rule_name: expression -} ``` Optionally, a type can be included right after the rule name, which @@ -180,9 +170,7 @@ specifies the return type of the C or Python function corresponding to the rule: ``` -{ rule_name[return_type]: expression -} ``` If the return type is omitted, then a ``void *`` is returned in C and an @@ -200,9 +188,7 @@ Python-style comments. Match ``e1``, then match ``e2``. ``` -{ rule_name: first_rule second_rule -} ``` **``e1 | e2``** @@ -214,11 +200,9 @@ for formatting purposes. In that case, a \| must be used before the first alternative, like so: ``` -{ rule_name[return_type]: | first_alt | second_alt -} ``` **``( e )`` (grouping operator)** @@ -226,18 +210,14 @@ first alternative, like so: Match ``e``. ``` -{ rule_name: (e) -} ``` A slightly more complex and useful example includes using the grouping operator together with the repeat operator: ``` -{ rule_name: (e1 e2)* -} ``` **``[ e ] or e?``** @@ -245,18 +225,14 @@ operator together with the repeat operator: Optionally match ``e``. ``` -{ rule_name: [e] -} ``` A more useful example includes defining that a trailing comma is optional: ``` -{ rule_name: e (',' e)* [','] -} ``` **``e*``** @@ -264,9 +240,7 @@ optional: Match zero or more occurrences of ``e``. ``` -{ rule_name: (e1 e2)* -} ``` **``e+``** @@ -274,9 +248,7 @@ Match zero or more occurrences of ``e``. Match one or more occurrences of ``e``. ``` -{ rule_name: (e1 e2)+ -} ``` **``s.e+``** @@ -286,9 +258,7 @@ parse tree does not include the separator. This is otherwise identical to ``(e (s e)*)``. 
``` -{ rule_name: ','.e+ -} ``` **``&e`` (positive lookahead)** @@ -304,9 +274,7 @@ consists of an atom, which is not followed by a ``.`` or a ``(`` or a ``[``: ``` -{ primary: atom !'.' !'(' !'[' -} ``` **``~``** @@ -315,9 +283,7 @@ Commit to the current alternative, even if it fails to parse (this is called the "cut"). ``` -{ rule_name: '(' ~ some_rule ')' | some_alt -} ``` In this example, if a left parenthesis is parsed, then the other @@ -336,19 +302,15 @@ allows us to write not only simple left-recursive rules but also more complicated rules that involve indirect left-recursion like: ``` -{ rule1: rule2 | 'a' rule2: rule3 | 'b' rule3: rule1 | 'c' -} ``` and "hidden left-recursion" like: ``` -{ rule: 'optional'? rule '@' some_other_rule -} ``` Variables in the grammar @@ -358,9 +320,7 @@ A sub-expression can be named by preceding it with an identifier and an ``=`` sign. The name can then be used in the action (see below), like this: ``` -{ rule_name[return_type]: '(' a=some_other_rule ')' { a } -} ``` Grammar actions @@ -389,11 +349,9 @@ To indicate these actions each alternative can be followed by the action code inside curly-braces, which specifies the return value of the alternative: ``` -{ rule_name[return_type]: | first_alt1 first_alt2 { first_alt1 } | second_alt1 second_alt2 { second_alt1 } -} ``` If the action is omitted, a default action is generated: @@ -421,7 +379,6 @@ new copy of the node and change that. The full meta-grammar for the grammars supported by the PEG generator is: ``` -{ start[Grammar]: grammar ENDMARKER { grammar } grammar[Grammar]: @@ -508,7 +465,6 @@ The full meta-grammar for the grammars supported by the PEG generator is: | STRING { string.string } | "?" { "?" } | ":" { ":" } -} ``` As an illustrative example this simple grammar file allows directly @@ -516,7 +472,6 @@ generating a full parser that can parse simple arithmetic expressions and that returns a valid C-based Python AST: ``` -{ start[mod_ty]: a=expr_stmt* ENDMARKER { _PyAST_Module(a, NULL, p->arena) } expr_stmt[stmt_ty]: a=expr NEWLINE { _PyAST_Expr(a, EXTRA) } @@ -537,7 +492,6 @@ returns a valid C-based Python AST: atom[expr_ty]: | NAME | NUMBER -} ``` Here ``EXTRA`` is a macro that expands to ``start_lineno, start_col_offset, @@ -548,7 +502,6 @@ for the parser. A similar grammar written to target Python AST objects: ``` -{ start[ast.Module]: a=expr_stmt* ENDMARKER { ast.Module(body=a or [] } expr_stmt: a=expr NEWLINE { ast.Expr(value=a, EXTRA) } @@ -569,7 +522,6 @@ A similar grammar written to target Python AST objects: atom: | NAME | NUMBER -} ``` Pegen @@ -602,18 +554,14 @@ Once you have made the changes to the grammar files, to regenerate the ``C`` parser (the one used by the interpreter) just execute: ``` -{ make regen-pegen -} ``` using the ``Makefile`` in the main directory. If you are on Windows you can use the Visual Studio project files to regenerate the parser or to execute: ``` -{ ./PCbuild/build.bat --regen -} ``` The generated parser file is located at @@ -631,18 +579,14 @@ need to regenerate the meta-parser (the parser that parses the grammar files). To do so just execute: ``` -{ make regen-pegen-metaparser -} ``` If you are on Windows you can use the Visual Studio project files to regenerate the parser or to execute: ``` -{ ./PCbuild/build.bat --regen -} ``` @@ -684,18 +628,14 @@ file. 
If you change this file to add new tokens, make sure to regenerate the files by executing: ``` -{ make regen-token -} ``` If you are on Windows you can use the Visual Studio project files to regenerate the tokens or to execute: ``` -{ ./PCbuild/build.bat --regen -} ``` How tokens are generated and the rules governing this are completely up to the tokenizer @@ -719,10 +659,8 @@ by default** except for rules with the special marker ``memo`` after the rule name (and type, if present): ``` -{ rule_name[typr] (memo): ... -} ``` By selectively turning on memoization for a handful of rules, the parser becomes @@ -778,7 +716,6 @@ meaning in context. Trying to use a hard keyword as a variable will always fail: ``` -{ >>> class = 3 File "", line 1 class = 3 @@ -789,17 +726,14 @@ fail: foo(class=3) ^^^^^ SyntaxError: invalid syntax -} ``` While soft keywords don't have this limitation if used in a context other the one where they are defined as keywords: ``` -{ >>> match = 45 >>> foo(match="Yeah!") -} ``` The ``match`` and ``case`` keywords are soft keywords, so that they are @@ -810,24 +744,20 @@ argument names. You can get a list of all keywords defined in the grammar from Python: ``` -{ >>> import keyword >>> keyword.kwlist ['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield'] -} ``` as well as soft keywords: ``` -{ >>> import keyword >>> keyword.softkwlist ['_', 'case', 'match'] -} ``` --- @@ -945,9 +875,7 @@ A good way to test whether an invalid rule will be triggered when you expect is a syntax error **after** valid code triggers the rule or not. For example: ``` -{ $ 42 -} ``` should trigger the syntax error in the ``$`` character. If your rule is not correctly defined this @@ -955,9 +883,7 @@ won't happen. As another example, suppose that you try to define a rule to match ``print`` statements in order to create a better error message and you define it as: ``` -{ invalid_print: "print" expression -} ``` This will **seem** to work because the parser will correctly parse ``print(something)`` because it is valid @@ -1040,18 +966,14 @@ parser. To do this, you can go to the directory on the CPython repository and manually call the parser generator by executing: ``` -{ $ python -m pegen python -} ``` This will generate a file called ``parse.py`` in the same directory that you can use to parse some input: ``` -{ $ python parse.py file_with_source_code_to_test.py -} ``` As the generated ``parse.py`` file is just Python code, you can modify it @@ -1080,9 +1002,7 @@ special steps compared to regular parsing. To activate verbose mode you can add the ``-d`` flag when executing Python: ``` -{ $ python -d file_to_test.py -} ``` This will print **a lot** of output to ``stderr`` so it is probably better to dump @@ -1090,9 +1010,7 @@ it to a file for further analysis. The output consists of trace lines with the following structure:: ``` -{ ('>'|'-'|'+'|'!') []: ... 
-} ``` Every line is indented by a different amount (````) depending on how From 6b9e52b91ff5ce2327e9372d41e69f253cdb309b Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Tue, 8 Oct 2024 19:42:37 +0100 Subject: [PATCH 05/12] formatting --- InternalDocs/parser.md | 183 +++++++++++++++-------------------------- 1 file changed, 68 insertions(+), 115 deletions(-) diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index c19c175fa8bfc9..7c64ff41c57d35 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -84,18 +84,15 @@ Key ideas [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is". ---- -**Important** +> **Important** +> Don't try to reason about a PEG grammar in the same way you would to with an +> [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) +> or context free grammar. PEG is optimized to describe **how** input strings will +> be parsed, while context-free grammars are optimized to generate strings of the +> language they describe (in EBNF, to know whether a given string is in the +> language, you need to do work to find out as it is not immediately obvious from +> the grammar). -Don't try to reason about a PEG grammar in the same way you would to with an -[EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) -or context free grammar. PEG is optimized to describe **how** input strings will -be parsed, while context-free grammars are optimized to generate strings of the -language they describe (in EBNF, to know whether a given string is in the -language, you need to do work to find out as it is not immediately obvious from -the grammar). - ---- Consequences of the ordered choice operator ------------------------------------------- @@ -126,13 +123,9 @@ not accept ``aa`` because ``'aa'|'a'`` accepts all of ``aa``, leaving nothing for the final ``a``. Again, the second alternative of ``'aa'|'a'`` is not tried. ---- -**Caution** - -The effects of ordered choice, such as the ones illustrated above, may be -hidden by many levels of rules. - ---- +> **Caution** +> The effects of ordered choice, such as the ones illustrated above, may be +> hidden by many levels of rules. For this reason, writing rules where an alternative is contained in the next one is in almost all cases a mistake, for example: @@ -364,17 +357,13 @@ If the action is omitted, a default action is generated: This default behaviour is primarily made for very simple situations and for debugging purposes. ---- -**Warning** - -It's important that the actions don't mutate any AST nodes that are passed -into them via variables referring to other rules. The reason for mutation -being not allowed is that the AST nodes are cached by memoization and could -potentially be reused in a different context, where the mutation would be -invalid. If an action needs to change an AST node, it should instead make a -new copy of the node and change that. - ---- +> **Warning** +> It's important that the actions don't mutate any AST nodes that are passed +> into them via variables referring to other rules. The reason for mutation +> being not allowed is that the AST nodes are cached by memoization and could +> potentially be reused in a different context, where the mutation would be +> invalid. If an action needs to change an AST node, it should instead make a +> new copy of the node and change that. 
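To make the rule above concrete, here is a minimal Python sketch of the two styles of action. The function names and the node transformation are hypothetical, chosen only to illustrate the copy-instead-of-mutate discipline for a grammar that targets Python AST objects:

```python
import ast

# Hypothetical action that wants to turn a parsed name into a store target.
# WRONG: mutating the node corrupts the memoization cache, because other
# rules may be handed this exact object for the same input position.
def action_mutating(node: ast.Name) -> ast.Name:
    node.ctx = ast.Store()
    return node

# RIGHT: build a fresh node and copy the source location over, leaving
# the cached node untouched.
def action_copying(node: ast.Name) -> ast.Name:
    new_node = ast.Name(id=node.id, ctx=ast.Store())
    return ast.copy_location(new_node, node)
```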
The full meta-grammar for the grammars supported by the PEG generator is: @@ -666,13 +655,9 @@ name (and type, if present): By selectively turning on memoization for a handful of rules, the parser becomes faster and uses less memory. ---- -**Note** - - Left-recursive rules always use memoization, since the implementation of - left-recursion depends on it. - ---- +> **Note** +> Left-recursive rules always use memoization, since the implementation of +> left-recursion depends on it. To determine whether a new rule needs memoization or not, benchmarking is required (comparing execution times and memory usage of some considerably large files with @@ -699,14 +684,10 @@ automatic variable names are: Hard and soft keywords ---------------------- ---- -**Note** - -In the grammar files, keywords are defined using **single quotes** (for example, -``'class'``) while soft keywords are defined using **double quotes** (for example, -``"match"``). - ---- +> **Note** +> In the grammar files, keywords are defined using **single quotes** (for example, +> ``'class'``) while soft keywords are defined using **double quotes** (for example, +> ``"match"``). There are two kinds of keywords allowed in pegen grammars: *hard* and *soft* keywords. The difference between hard and soft keywords is that hard keywords @@ -760,17 +741,13 @@ as well as soft keywords: ['_', 'case', 'match'] ``` ---- -**Caution** - -Soft keywords can be a bit challenging to manage as they can be accepted in -places you don't intend, given how the order alternatives behave in PEG -parsers (see the -[consequences of ordered choice](#consequences-of-the-ordered-choice-operator) -section for some background on this). In general, try to define them in places -where there are not many alternatives. - ---- +> **Caution** +> Soft keywords can be a bit challenging to manage as they can be accepted in +> places you don't intend, given how the order alternatives behave in PEG +> parsers (see the +> [consequences of ordered choice](#consequences-of-the-ordered-choice-operator) +> section for some background on this). In general, try to define them in places +> where there are not many alternatives. Error handling -------------- @@ -785,14 +762,10 @@ exception set by calling Python's C API functions. This also includes exceptions and it is the main mechanism the parser uses to report custom syntax error messages. ---- -**Note** - -Tokenizer errors are normally reported by raising exceptions but some special -tokenizer errors such as unclosed parenthesis will be reported only after the -parser finishes without returning anything. - ---- +> **Note** +> Tokenizer errors are normally reported by raising exceptions but some special +> tokenizer errors such as unclosed parenthesis will be reported only after the +> parser finishes without returning anything. How syntax errors are reported ------------------------------ @@ -814,14 +787,10 @@ As the Python grammar was primordially written as an ``LL(1)`` grammar, this heu has an extremely high success rate, but some PEG features, such as lookaheads, can impact this. ---- -**Caution** - -Positive and negative lookaheads will try to match a token so they will affect -the location of generic syntax errors. Use them carefully at boundaries -between rules. - ---- +> **Caution** +> Positive and negative lookaheads will try to match a token so they will affect +> the location of generic syntax errors. Use them carefully at boundaries +> between rules. To generate more precise syntax errors, custom rules are used. 
This is a common practice also in context free grammars: the parser will try to accept some @@ -846,20 +815,16 @@ acts in two phases: invalid, there is typically no need to be fast because execution is going to stop anyway. ---- -**Important** - -When defining invalid rules: - -- Make sure all custom invalid rules raise - [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) - exceptions (or a subclass of it). -- Make sure **all** invalid rules start with the ``invalid_`` prefix to not - impact performance of parsing correct Python code. -- Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules - (see the [how PEG parsers work](#how-peg-parsers-work) section for more information). - ---- +> **Important** +> When defining invalid rules: +> +> - Make sure all custom invalid rules raise +> [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) +> exceptions (or a subclass of it). +> - Make sure **all** invalid rules start with the ``invalid_`` prefix to not +> impact performance of parsing correct Python code. +> - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules +> (see the [how PEG parsers work](#how-peg-parsers-work) section for more information). You can find a collection of macros to raise specialized syntax errors in the [`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h) @@ -868,11 +833,10 @@ the custom errors, which will be highlighted in the tracebacks that will be displayed when the error is reported. ---- -**Tip** - -A good way to test whether an invalid rule will be triggered when you expect is to test if introducing -a syntax error **after** valid code triggers the rule or not. For example: +> **Tip** +> A good way to test whether an invalid rule will be triggered when you expect +> is to test if introducing a syntax error **after** valid code triggers the +> rule or not. For example: ``` $ 42 @@ -910,17 +874,13 @@ file. The helpers include functions that join AST sequences, get specific elemen from them or to perform extra processing on the generated tree. ---- -**Caution** - -Actions must **never** be used to accept or reject rules. It may be tempting -in some situations to write a very generic rule and then check the generated -AST to decide whether it is valid or not, but this will render the -(official grammar)[https://docs.python.org/3/reference/grammar.html] partially -incorrect (because it does not include actions) and will make it more difficult -for other Python implementations to adapt the grammar to their own needs. - ---- +> **Caution** +> Actions must **never** be used to accept or reject rules. It may be tempting +> in some situations to write a very generic rule and then check the generated +> AST to decide whether it is valid or not, but this will render the +> (official grammar)[https://docs.python.org/3/reference/grammar.html] partially +> incorrect (because it does not include actions) and will make it more difficult +> for other Python implementations to adapt the grammar to their own needs. As a general rule, if an action spawns multiple lines or requires something more complicated than a single expression of C code, is normally better to create a @@ -990,14 +950,10 @@ it is possible to activate a **very** verbose mode in the generated parser. This is very useful to debug the generated parser and to understand how it works, but it can be a bit hard to understand at first. 
---- -**Note:** - -When activating verbose mode in the Python parser, it is better to not use interactive -mode as it can be much harder to understand, because interactive mode involves some -special steps compared to regular parsing. - ---- +> **Note:** +> When activating verbose mode in the Python parser, it is better to not use +> interactive mode as it can be much harder to understand, because interactive +> mode involves some special steps compared to regular parsing. To activate verbose mode you can add the ``-d`` flag when executing Python: @@ -1027,9 +983,6 @@ the ```` part indicates what alternative within that rule is being attempted. ---- -**Admonition: Document history** - - Pablo Galindo Salgado - Original author - --- +> **Admonition: Document history** +> +> Pablo Galindo Salgado - Original author From 5cc2417ebf8795b279f3bc3dc52364e9c1e7f6b3 Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Tue, 8 Oct 2024 23:36:17 +0100 Subject: [PATCH 06/12] use nice admonitions --- InternalDocs/parser.md | 28 ++++++++++++++-------------- 1 file changed, 14 insertions(+), 14 deletions(-) diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index 7c64ff41c57d35..25631158989e33 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -84,7 +84,7 @@ Key ideas [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is". -> **Important** +> [!IMPORTANT] > Don't try to reason about a PEG grammar in the same way you would to with an > [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) > or context free grammar. PEG is optimized to describe **how** input strings will @@ -123,7 +123,7 @@ not accept ``aa`` because ``'aa'|'a'`` accepts all of ``aa``, leaving nothing for the final ``a``. Again, the second alternative of ``'aa'|'a'`` is not tried. -> **Caution** +> [!CAUTION] > The effects of ordered choice, such as the ones illustrated above, may be > hidden by many levels of rules. @@ -357,7 +357,7 @@ If the action is omitted, a default action is generated: This default behaviour is primarily made for very simple situations and for debugging purposes. -> **Warning** +> [!WARNING] > It's important that the actions don't mutate any AST nodes that are passed > into them via variables referring to other rules. The reason for mutation > being not allowed is that the AST nodes are cached by memoization and could @@ -655,7 +655,7 @@ name (and type, if present): By selectively turning on memoization for a handful of rules, the parser becomes faster and uses less memory. -> **Note** +> [!NOTE] > Left-recursive rules always use memoization, since the implementation of > left-recursion depends on it. @@ -684,7 +684,7 @@ automatic variable names are: Hard and soft keywords ---------------------- -> **Note** +> [!NOTE] > In the grammar files, keywords are defined using **single quotes** (for example, > ``'class'``) while soft keywords are defined using **double quotes** (for example, > ``"match"``). @@ -741,7 +741,7 @@ as well as soft keywords: ['_', 'case', 'match'] ``` -> **Caution** +> [!CAUTION] > Soft keywords can be a bit challenging to manage as they can be accepted in > places you don't intend, given how the order alternatives behave in PEG > parsers (see the @@ -762,7 +762,7 @@ exception set by calling Python's C API functions. This also includes exceptions and it is the main mechanism the parser uses to report custom syntax error messages. 
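For intuition, in pure Python this mechanism reduces to raising an exception that carries the error location. The following is only a hedged sketch of the idea (the helper name is invented here for illustration), not the parser's actual C code:

```python
def raise_custom_syntax_error(message, filename, lineno, offset, text):
    # SyntaxError's second argument is (filename, lineno, offset, text);
    # the traceback machinery uses it to point the caret at the error.
    raise SyntaxError(message, (filename, lineno, offset, text))

# For example:
# raise_custom_syntax_error("cannot assign to literal", "<stdin>", 1, 1, "1 = x")
```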
-> **Note** +> [!NOTE] > Tokenizer errors are normally reported by raising exceptions but some special > tokenizer errors such as unclosed parenthesis will be reported only after the > parser finishes without returning anything. @@ -787,7 +787,7 @@ As the Python grammar was primordially written as an ``LL(1)`` grammar, this heu has an extremely high success rate, but some PEG features, such as lookaheads, can impact this. -> **Caution** +> [!CAUTION] > Positive and negative lookaheads will try to match a token so they will affect > the location of generic syntax errors. Use them carefully at boundaries > between rules. @@ -815,9 +815,9 @@ acts in two phases: invalid, there is typically no need to be fast because execution is going to stop anyway. -> **Important** +> [!IMPORTANT] > When defining invalid rules: -> +> > - Make sure all custom invalid rules raise > [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) > exceptions (or a subclass of it). @@ -833,7 +833,7 @@ the custom errors, which will be highlighted in the tracebacks that will be displayed when the error is reported. -> **Tip** +> [!TIP] > A good way to test whether an invalid rule will be triggered when you expect > is to test if introducing a syntax error **after** valid code triggers the > rule or not. For example: @@ -874,7 +874,7 @@ file. The helpers include functions that join AST sequences, get specific elemen from them or to perform extra processing on the generated tree. -> **Caution** +> [!CAUTION] > Actions must **never** be used to accept or reject rules. It may be tempting > in some situations to write a very generic rule and then check the generated > AST to decide whether it is valid or not, but this will render the @@ -950,7 +950,7 @@ it is possible to activate a **very** verbose mode in the generated parser. This is very useful to debug the generated parser and to understand how it works, but it can be a bit hard to understand at first. -> **Note:** +> [!NOTE] > When activating verbose mode in the Python parser, it is better to not use > interactive mode as it can be much harder to understand, because interactive > mode involves some special steps compared to regular parsing. @@ -983,6 +983,6 @@ the ```` part indicates what alternative within that rule is being attempted. -> **Admonition: Document history** +> [!IMPORTANT]: **Document history** > > Pablo Galindo Salgado - Original author From 4acd05f4798dd77be288ea4d1100c23cc5e3f76a Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Tue, 8 Oct 2024 23:43:33 +0100 Subject: [PATCH 07/12] fix history --- InternalDocs/parser.md | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index 25631158989e33..aaf40dc03ec5c7 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -983,6 +983,8 @@ the ```` part indicates what alternative within that rule is being attempted. 
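Because the trace can easily run to thousands of lines, it helps to dump it to a file and filter it afterwards. A small hedged helper (the file name and rule name are placeholders, not part of CPython):

```python
import sys

def show_rule_trace(trace_path, rule_name):
    """Print only the trace lines mentioning ``rule_name``, preserving
    indentation so the call-stack depth stays visible."""
    with open(trace_path) as f:
        for line in f:
            if rule_name in line:
                sys.stdout.write(line)

# Usage sketch:
#   python -d file_to_test.py 2> trace.txt
#   then call show_rule_trace("trace.txt", "import_stmt")
```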
-> [!IMPORTANT]: **Document history** +> [!NOTE] +> **Document history** > > Pablo Galindo Salgado - Original author +> Irit Katriel - Convert to Markdown From 6ef9c1180e9a2c8cda5f68790ebd69faa1102f7c Mon Sep 17 00:00:00 2001 From: Irit Katriel <1055913+iritkatriel@users.noreply.github.com> Date: Wed, 9 Oct 2024 00:48:12 +0100 Subject: [PATCH 08/12] Apply Carrol's suggestions from code review Co-authored-by: Carol Willing --- InternalDocs/README.md | 2 +- InternalDocs/parser.md | 26 +++++++++++++------------- 2 files changed, 14 insertions(+), 14 deletions(-) diff --git a/InternalDocs/README.md b/InternalDocs/README.md index 7c98bf7648c2d6..8956ecafed2039 100644 --- a/InternalDocs/README.md +++ b/InternalDocs/README.md @@ -12,7 +12,7 @@ it is not, please report that through the [issue tracker](https://github.com/python/cpython/issues). -[Python's Parser](parser.md) +[Guide to the parser](parser.md) [Compiler Design](compiler.md) diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index aaf40dc03ec5c7..e3e1dbac151cb2 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -17,7 +17,7 @@ Therefore, changes to the Python language are made by modifying the [grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram). Developers rarely need to modify the generator itself. -See [Changing CPython’s grammar](https://devguide.python.org/developer-workflow/grammar/#grammar) +See the devguide's [Changing CPython’s grammar](https://devguide.python.org/developer-workflow/grammar/#grammar) for a detailed description of the grammar and the process for changing it. How PEG parsers work @@ -39,7 +39,7 @@ generate constructions that, given an input string, *deduce* which alternative check each alternative, in the order in which they are specified, and select that first one that succeeds. -This means that in a PEG grammer, the choice operator is not commutative. +This means that in a PEG grammar, the choice operator is not commutative. Furthermore, unlike context-free grammars, the derivation according to a PEG grammar cannot be ambiguous: if a string parses, it has exactly one valid parse tree. @@ -172,11 +172,11 @@ If the return type is omitted, then a ``void *`` is returned in C and an Grammar expressions ------------------- -**``# comment``** +### ``# comment`` Python-style comments. -**``e1 e2``** +### ``e1 e2`` Match ``e1``, then match ``e2``. @@ -184,7 +184,7 @@ Match ``e1``, then match ``e2``. rule_name: first_rule second_rule ``` -**``e1 | e2``** +### ``e1 | e2`` Match ``e1`` or ``e2``. @@ -198,7 +198,7 @@ first alternative, like so: | second_alt ``` -**``( e )`` (grouping operator)** +### ``( e )`` (grouping operator) Match ``e``. @@ -213,7 +213,7 @@ operator together with the repeat operator: rule_name: (e1 e2)* ``` -**``[ e ] or e?``** +### ``[ e ] or e?`` Optionally match ``e``. @@ -228,7 +228,7 @@ optional: rule_name: e (',' e)* [','] ``` -**``e*``** +### ``e*`` Match zero or more occurrences of ``e``. @@ -236,7 +236,7 @@ Match zero or more occurrences of ``e``. rule_name: (e1 e2)* ``` -**``e+``** +### ``e+`` Match one or more occurrences of ``e``. @@ -244,7 +244,7 @@ Match one or more occurrences of ``e``. rule_name: (e1 e2)+ ``` -**``s.e+``** +### ``s.e+`` Match one or more occurrences of ``e``, separated by ``s``. The generated parse tree does not include the separator. This is otherwise identical to @@ -254,11 +254,11 @@ parse tree does not include the separator. 
This is otherwise identical to rule_name: ','.e+ ``` -**``&e`` (positive lookahead)** +### ``&e`` (positive lookahead) Succeed if ``e`` can be parsed, without consuming any input. -**``!e`` (negative lookahead)** +### ``!e`` (negative lookahead) Fail if ``e`` can be parsed, without consuming any input. @@ -270,7 +270,7 @@ consists of an atom, which is not followed by a ``.`` or a ``(`` or a primary: atom !'.' !'(' !'[' ``` -**``~``** +### ``~`` Commit to the current alternative, even if it fails to parse (this is called the "cut"). From 25fd1b20bd7e64d05504d4f8c1210fcbca6be9ff Mon Sep 17 00:00:00 2001 From: Irit Katriel Date: Wed, 9 Oct 2024 00:54:35 +0100 Subject: [PATCH 09/12] indent --- InternalDocs/parser.md | 54 +++++++++++++++++++++--------------------- 1 file changed, 27 insertions(+), 27 deletions(-) diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index e3e1dbac151cb2..db9857d6c5e5eb 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -174,11 +174,11 @@ Grammar expressions ### ``# comment`` -Python-style comments. + Python-style comments. ### ``e1 e2`` -Match ``e1``, then match ``e2``. + Match ``e1``, then match ``e2``. ``` rule_name: first_rule second_rule @@ -186,11 +186,11 @@ Match ``e1``, then match ``e2``. ### ``e1 | e2`` -Match ``e1`` or ``e2``. + Match ``e1`` or ``e2``. -The first alternative can also appear on the line after the rule name -for formatting purposes. In that case, a \| must be used before the -first alternative, like so: + The first alternative can also appear on the line after the rule name + for formatting purposes. In that case, a \| must be used before the + first alternative, like so: ``` rule_name[return_type]: @@ -200,14 +200,14 @@ first alternative, like so: ### ``( e )`` (grouping operator) -Match ``e``. + Match ``e``. ``` rule_name: (e) ``` -A slightly more complex and useful example includes using the grouping -operator together with the repeat operator: + A slightly more complex and useful example includes using the grouping + operator together with the repeat operator: ``` rule_name: (e1 e2)* @@ -215,14 +215,14 @@ operator together with the repeat operator: ### ``[ e ] or e?`` -Optionally match ``e``. + Optionally match ``e``. ``` rule_name: [e] ``` -A more useful example includes defining that a trailing comma is -optional: + A more useful example includes defining that a trailing comma is + optional: ``` rule_name: e (',' e)* [','] @@ -230,7 +230,7 @@ optional: ### ``e*`` -Match zero or more occurrences of ``e``. + Match zero or more occurrences of ``e``. ``` rule_name: (e1 e2)* @@ -238,7 +238,7 @@ Match zero or more occurrences of ``e``. ### ``e+`` -Match one or more occurrences of ``e``. + Match one or more occurrences of ``e``. ``` rule_name: (e1 e2)+ @@ -246,9 +246,9 @@ Match one or more occurrences of ``e``. ### ``s.e+`` -Match one or more occurrences of ``e``, separated by ``s``. The generated -parse tree does not include the separator. This is otherwise identical to -``(e (s e)*)``. + Match one or more occurrences of ``e``, separated by ``s``. The generated + parse tree does not include the separator. This is otherwise identical to + ``(e (s e)*)``. ``` rule_name: ','.e+ @@ -256,15 +256,15 @@ parse tree does not include the separator. This is otherwise identical to ### ``&e`` (positive lookahead) -Succeed if ``e`` can be parsed, without consuming any input. + Succeed if ``e`` can be parsed, without consuming any input. ### ``!e`` (negative lookahead) -Fail if ``e`` can be parsed, without consuming any input. 
+ Fail if ``e`` can be parsed, without consuming any input. -An example taken from the Python grammar specifies that a primary -consists of an atom, which is not followed by a ``.`` or a ``(`` or a -``[``: + An example taken from the Python grammar specifies that a primary + consists of an atom, which is not followed by a ``.`` or a ``(`` or a + ``[``: ``` primary: atom !'.' !'(' !'[' @@ -272,16 +272,16 @@ consists of an atom, which is not followed by a ``.`` or a ``(`` or a ### ``~`` -Commit to the current alternative, even if it fails to parse (this is called -the "cut"). + Commit to the current alternative, even if it fails to parse (this is called + the "cut"). ``` rule_name: '(' ~ some_rule ')' | some_alt ``` -In this example, if a left parenthesis is parsed, then the other -alternative won’t be considered, even if some_rule or ``)`` fail to be -parsed. + In this example, if a left parenthesis is parsed, then the other + alternative won’t be considered, even if some_rule or ``)`` fail to be + parsed. Left recursion -------------- From 40d0a04a3b1c880a6b4b35a7a86d8da9f4b4770e Mon Sep 17 00:00:00 2001 From: Irit Katriel <1055913+iritkatriel@users.noreply.github.com> Date: Wed, 9 Oct 2024 12:48:38 +0100 Subject: [PATCH 10/12] Apply Ezio's suggestions from code review Co-authored-by: Ezio Melotti --- InternalDocs/parser.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index db9857d6c5e5eb..649457c6e1118a 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -6,7 +6,7 @@ Abstract -------- Python's Parser is currently a -[``PEG`` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar) +[`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar) parser. It was introduced in [PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace the original [``LL(1)``](https://en.wikipedia.org/wiki/LL_parser) parser. @@ -17,7 +17,7 @@ Therefore, changes to the Python language are made by modifying the [grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram). Developers rarely need to modify the generator itself. -See the devguide's [Changing CPython’s grammar](https://devguide.python.org/developer-workflow/grammar/#grammar) +See the devguide's [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar) for a detailed description of the grammar and the process for changing it. How PEG parsers work From 59f2c9523ac89071980c0b062a9aacfefdbd7417 Mon Sep 17 00:00:00 2001 From: Irit Katriel <1055913+iritkatriel@users.noreply.github.com> Date: Wed, 9 Oct 2024 17:34:01 +0100 Subject: [PATCH 11/12] convert to table Co-authored-by: Jacob Coffee --- InternalDocs/parser.md | 122 +++++------------------------------------ 1 file changed, 13 insertions(+), 109 deletions(-) diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index 649457c6e1118a..9d6b779a81010c 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -172,116 +172,20 @@ If the return type is omitted, then a ``void *`` is returned in C and an Grammar expressions ------------------- -### ``# comment`` +| Expression | Description and Example | +|-----------------|-----------------------------------------------------------------------------------------------------------------------| +| `# comment` | Python-style comments. | +| `e1 e2` | Match `e1`, then match `e2`.
`rule_name: first_rule second_rule` |
| `e1 \| e2` | Match `e1` or `e2`.<br>`rule_name[return_type]:`<br>` \| first_alt`<br>` \| second_alt` |
| `( e )` | Grouping operator: Match `e`.<br>`rule_name: (e)`<br>`rule_name: (e1 e2)*` |
| `[ e ]` or `e?` | Optionally match `e`.<br>`rule_name: [e]`<br>`rule_name: e (',' e)* [',']` |
| `e*` | Match zero or more occurrences of `e`.<br>`rule_name: (e1 e2)*` |
| `e+` | Match one or more occurrences of `e`.<br>`rule_name: (e1 e2)+` |
| `s.e+` | Match one or more occurrences of `e`, separated by `s`.<br>`rule_name: ','.e+` |
| `&e` | Positive lookahead: Succeed if `e` can be parsed, without consuming input. |
| `!e` | Negative lookahead: Fail if `e` can be parsed, without consuming input.<br>`primary: atom !'.' !'(' !'['` |
| `~` | Commit to the current alternative, even if it fails to parse (cut).<br>`rule_name: '(' ~ some_rule ')' \| some_alt` |


Left recursion
--------------

From 954d219ca92f4ff71f767a6de6f156cc8be91dd3 Mon Sep 17 00:00:00 2001
From: Irit Katriel
Date: Wed, 9 Oct 2024 18:00:33 +0100
Subject: [PATCH 12/12] add credit

---
 InternalDocs/parser.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md
index 9d6b779a81010c..11aaf11253646d 100644
--- a/InternalDocs/parser.md
+++ b/InternalDocs/parser.md
@@ -891,4 +891,4 @@ is being attempted.
 > **Document history**
 >
 > Pablo Galindo Salgado - Original author
-> Irit Katriel - Convert to Markdown
+> Irit Katriel and Jacob Coffee - Convert to Markdown