diff options
Diffstat (limited to 'InternalDocs/parser.md')
-rw-r--r-- | InternalDocs/parser.md | 894 |
1 files changed, 894 insertions, 0 deletions
diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md new file mode 100644 index 0000000..11aaf11 --- /dev/null +++ b/InternalDocs/parser.md @@ -0,0 +1,894 @@ + +Guide to the parser +=================== + +Abstract +-------- + +Python's Parser is currently a +[`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar) +parser. It was introduced in +[PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace +the original [``LL(1)``](https://en.wikipedia.org/wiki/LL_parser) parser. + +The code implementing the parser is generated from a grammar definition by a +[parser generator](https://en.wikipedia.org/wiki/Compiler-compiler). +Therefore, changes to the Python language are made by modifying the +[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram). +Developers rarely need to modify the generator itself. + +See the devguide's [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar) +for a detailed description of the grammar and the process for changing it. + +How PEG parsers work +==================== + +A PEG (Parsing Expression Grammar) grammar differs from a +[context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar) +in that the way it is written more closely reflects how the parser will operate +when parsing. The fundamental technical difference is that the choice operator +is ordered. This means that when writing: + +``` + rule: A | B | C +``` + +a parser that implements a context-free-grammar (such as an ``LL(1)`` parser) will +generate constructions that, given an input string, *deduce* which alternative +(``A``, ``B`` or ``C``) must be expanded. On the other hand, a PEG parser will +check each alternative, in the order in which they are specified, and select +that first one that succeeds. + +This means that in a PEG grammar, the choice operator is not commutative. +Furthermore, unlike context-free grammars, the derivation according to a +PEG grammar cannot be ambiguous: if a string parses, it has exactly one +valid parse tree. + +PEG parsers are usually constructed as a recursive descent parser in which every +rule in the grammar corresponds to a function in the program implementing the +parser, and the parsing expression (the "expansion" or "definition" of the rule) +represents the "code" in said function. Each parsing function conceptually takes +an input string as its argument, and yields one of the following results: + +* A "success" result. This result indicates that the expression can be parsed by + that rule and the function may optionally move forward or consume one or more + characters of the input string supplied to it. +* A "failure" result, in which case no input is consumed. + +Note that "failure" results do not imply that the program is incorrect, nor do +they necessarily mean that the parsing has failed. Since the choice operator is +ordered, a failure very often merely indicates "try the following option". A +direct implementation of a PEG parser as a recursive descent parser will present +exponential time performance in the worst case, because PEG parsers have +infinite lookahead (this means that they can consider an arbitrary number of +tokens before deciding for a rule). Usually, PEG parsers avoid this exponential +time complexity with a technique called +["packrat parsing"](https://pdos.csail.mit.edu/~baford/packrat/thesis/) +which not only loads the entire program in memory before parsing it but also +allows the parser to backtrack arbitrarily. This is made efficient by memoizing +the rules already matched for each position. The cost of the memoization cache +is that the parser will naturally use more memory than a simple ``LL(1)`` parser, +which normally are table-based. + + +Key ideas +--------- + +- Alternatives are ordered ( ``A | B`` is not the same as ``B | A`` ). +- If a rule returns a failure, it doesn't mean that the parsing has failed, + it just means "try something else". +- By default PEG parsers run in exponential time, which can be optimized to linear by + using memoization. +- If parsing fails completely (no rule succeeds in parsing all the input text), the + PEG parser doesn't have a concept of "where the + [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is". + + +> [!IMPORTANT] +> Don't try to reason about a PEG grammar in the same way you would to with an +> [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form) +> or context free grammar. PEG is optimized to describe **how** input strings will +> be parsed, while context-free grammars are optimized to generate strings of the +> language they describe (in EBNF, to know whether a given string is in the +> language, you need to do work to find out as it is not immediately obvious from +> the grammar). + + +Consequences of the ordered choice operator +------------------------------------------- + +Although PEG may look like EBNF, its meaning is quite different. The fact +that the alternatives are ordered in a PEG grammer (which is at the core of +how PEG parsers work) has deep consequences, other than removing ambiguity. + +If a rule has two alternatives and the first of them succeeds, the second one is +**not** attempted even if the caller rule fails to parse the rest of the input. +Thus the parser is said to be "eager". To illustrate this, consider +the following two rules (in these examples, a token is an individual character): + +``` + first_rule: ( 'a' | 'aa' ) 'a' + second_rule: ('aa' | 'a' ) 'a' +``` + +In a regular EBNF grammar, both rules specify the language ``{aa, aaa}`` but +in PEG, one of these two rules accepts the string ``aaa`` but not the string +``aa``. The other does the opposite -- it accepts the string ``aa`` +but not the string ``aaa``. The rule ``('a'|'aa')'a'`` does +not accept ``aaa`` because ``'a'|'aa'`` consumes the first ``a``, letting the +final ``a`` in the rule consume the second, and leaving out the third ``a``. +As the rule has succeeded, no attempt is ever made to go back and let +``'a'|'aa'`` try the second alternative. The expression ``('aa'|'a')'a'`` does +not accept ``aa`` because ``'aa'|'a'`` accepts all of ``aa``, leaving nothing +for the final ``a``. Again, the second alternative of ``'aa'|'a'`` is not +tried. + +> [!CAUTION] +> The effects of ordered choice, such as the ones illustrated above, may be +> hidden by many levels of rules. + +For this reason, writing rules where an alternative is contained in the next +one is in almost all cases a mistake, for example: + +``` + my_rule: + | 'if' expression 'then' block + | 'if' expression 'then' block 'else' block +``` + +In this example, the second alternative will never be tried because the first one will +succeed first (even if the input string has an ``'else' block`` that follows). To correctly +write this rule you can simply alter the order: + +``` + my_rule: + | 'if' expression 'then' block 'else' block + | 'if' expression 'then' block +``` + +In this case, if the input string doesn't have an ``'else' block``, the first alternative +will fail and the second will be attempted. + +Grammar Syntax +============== + +The grammar consists of a sequence of rules of the form: + +``` + rule_name: expression +``` + +Optionally, a type can be included right after the rule name, which +specifies the return type of the C or Python function corresponding to +the rule: + +``` + rule_name[return_type]: expression +``` + +If the return type is omitted, then a ``void *`` is returned in C and an +``Any`` in Python. + +Grammar expressions +------------------- + +| Expression | Description and Example | +|-----------------|-----------------------------------------------------------------------------------------------------------------------| +| `# comment` | Python-style comments. | +| `e1 e2` | Match `e1`, then match `e2`. <br> `rule_name: first_rule second_rule` | +| `e1 \| e2` | Match `e1` or `e2`. <br> `rule_name[return_type]:`<br>` \| first_alt`<br>` \| second_alt` | +| `( e )` | Grouping operator: Match `e`. <br> `rule_name: (e)`<br>`rule_name: (e1 e2)*` | +| `[ e ]` or `e?` | Optionally match `e`. <br> `rule_name: [e]`<br>`rule_name: e (',' e)* [',']` | +| `e*` | Match zero or more occurrences of `e`. <br> `rule_name: (e1 e2)*` | +| `e+` | Match one or more occurrences of `e`. <br> `rule_name: (e1 e2)+` | +| `s.e+` | Match one or more occurrences of `e`, separated by `s`. <br> `rule_name: ','.e+` | +| `&e` | Positive lookahead: Succeed if `e` can be parsed, without consuming input. | +| `!e` | Negative lookahead: Fail if `e` can be parsed, without consuming input. <br> `primary: atom !'.' !'(' !'['` | +| `~` | Commit to the current alternative, even if it fails to parse (cut). <br> `rule_name: '(' ~ some_rule ')' \| some_alt` | + + +Left recursion +-------------- + +PEG parsers normally do not support left recursion, but CPython's parser +generator implements a technique similar to the one described in +[Medeiros et al.](https://arxiv.org/pdf/1207.0443) but using the memoization +cache instead of static variables. This approach is closer to the one described +in [Warth et al.](http://web.cs.ucla.edu/~todd/research/pepm08.pdf). This +allows us to write not only simple left-recursive rules but also more +complicated rules that involve indirect left-recursion like: + +``` + rule1: rule2 | 'a' + rule2: rule3 | 'b' + rule3: rule1 | 'c' +``` + +and "hidden left-recursion" like: + +``` + rule: 'optional'? rule '@' some_other_rule +``` + +Variables in the grammar +------------------------ + +A sub-expression can be named by preceding it with an identifier and an +``=`` sign. The name can then be used in the action (see below), like this: + +``` + rule_name[return_type]: '(' a=some_other_rule ')' { a } +``` + +Grammar actions +--------------- + +To avoid the intermediate steps that obscure the relationship between the +grammar and the AST generation, the PEG parser allows directly generating AST +nodes for a rule via grammar actions. Grammar actions are language-specific +expressions that are evaluated when a grammar rule is successfully parsed. These +expressions can be written in Python or C depending on the desired output of the +parser generator. This means that if one would want to generate a parser in +Python and another in C, two grammar files should be written, each one with a +different set of actions, keeping everything else apart from said actions +identical in both files. As an example of a grammar with Python actions, the +piece of the parser generator that parses grammar files is bootstrapped from a +meta-grammar file with Python actions that generate the grammar tree as a result +of the parsing. + +In the specific case of the PEG grammar for Python, having actions allows +directly describing how the AST is composed in the grammar itself, making it +more clear and maintainable. This AST generation process is supported by the use +of some helper functions that factor out common AST object manipulations and +some other required operations that are not directly related to the grammar. + +To indicate these actions each alternative can be followed by the action code +inside curly-braces, which specifies the return value of the alternative: + +``` + rule_name[return_type]: + | first_alt1 first_alt2 { first_alt1 } + | second_alt1 second_alt2 { second_alt1 } +``` + +If the action is omitted, a default action is generated: + +- If there is a single name in the rule, it gets returned. +- If there multiple names in the rule, a collection with all parsed + expressions gets returned (the type of the collection will be different + in C and Python). + +This default behaviour is primarily made for very simple situations and for +debugging purposes. + +> [!WARNING] +> It's important that the actions don't mutate any AST nodes that are passed +> into them via variables referring to other rules. The reason for mutation +> being not allowed is that the AST nodes are cached by memoization and could +> potentially be reused in a different context, where the mutation would be +> invalid. If an action needs to change an AST node, it should instead make a +> new copy of the node and change that. + +The full meta-grammar for the grammars supported by the PEG generator is: + +``` + start[Grammar]: grammar ENDMARKER { grammar } + + grammar[Grammar]: + | metas rules { Grammar(rules, metas) } + | rules { Grammar(rules, []) } + + metas[MetaList]: + | meta metas { [meta] + metas } + | meta { [meta] } + + meta[MetaTuple]: + | "@" NAME NEWLINE { (name.string, None) } + | "@" a=NAME b=NAME NEWLINE { (a.string, b.string) } + | "@" NAME STRING NEWLINE { (name.string, literal_eval(string.string)) } + + rules[RuleList]: + | rule rules { [rule] + rules } + | rule { [rule] } + + rule[Rule]: + | rulename ":" alts NEWLINE INDENT more_alts DEDENT { + Rule(rulename[0], rulename[1], Rhs(alts.alts + more_alts.alts)) } + | rulename ":" NEWLINE INDENT more_alts DEDENT { Rule(rulename[0], rulename[1], more_alts) } + | rulename ":" alts NEWLINE { Rule(rulename[0], rulename[1], alts) } + + rulename[RuleName]: + | NAME '[' type=NAME '*' ']' {(name.string, type.string+"*")} + | NAME '[' type=NAME ']' {(name.string, type.string)} + | NAME {(name.string, None)} + + alts[Rhs]: + | alt "|" alts { Rhs([alt] + alts.alts)} + | alt { Rhs([alt]) } + + more_alts[Rhs]: + | "|" alts NEWLINE more_alts { Rhs(alts.alts + more_alts.alts) } + | "|" alts NEWLINE { Rhs(alts.alts) } + + alt[Alt]: + | items '$' action { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=action) } + | items '$' { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=None) } + | items action { Alt(items, action=action) } + | items { Alt(items, action=None) } + + items[NamedItemList]: + | named_item items { [named_item] + items } + | named_item { [named_item] } + + named_item[NamedItem]: + | NAME '=' ~ item {NamedItem(name.string, item)} + | item {NamedItem(None, item)} + | it=lookahead {NamedItem(None, it)} + + lookahead[LookaheadOrCut]: + | '&' ~ atom {PositiveLookahead(atom)} + | '!' ~ atom {NegativeLookahead(atom)} + | '~' {Cut()} + + item[Item]: + | '[' ~ alts ']' {Opt(alts)} + | atom '?' {Opt(atom)} + | atom '*' {Repeat0(atom)} + | atom '+' {Repeat1(atom)} + | sep=atom '.' node=atom '+' {Gather(sep, node)} + | atom {atom} + + atom[Plain]: + | '(' ~ alts ')' {Group(alts)} + | NAME {NameLeaf(name.string) } + | STRING {StringLeaf(string.string)} + + # Mini-grammar for the actions + + action[str]: "{" ~ target_atoms "}" { target_atoms } + + target_atoms[str]: + | target_atom target_atoms { target_atom + " " + target_atoms } + | target_atom { target_atom } + + target_atom[str]: + | "{" ~ target_atoms "}" { "{" + target_atoms + "}" } + | NAME { name.string } + | NUMBER { number.string } + | STRING { string.string } + | "?" { "?" } + | ":" { ":" } +``` + +As an illustrative example this simple grammar file allows directly +generating a full parser that can parse simple arithmetic expressions and that +returns a valid C-based Python AST: + +``` + start[mod_ty]: a=expr_stmt* ENDMARKER { _PyAST_Module(a, NULL, p->arena) } + expr_stmt[stmt_ty]: a=expr NEWLINE { _PyAST_Expr(a, EXTRA) } + + expr[expr_ty]: + | l=expr '+' r=term { _PyAST_BinOp(l, Add, r, EXTRA) } + | l=expr '-' r=term { _PyAST_BinOp(l, Sub, r, EXTRA) } + | term + + term[expr_ty]: + | l=term '*' r=factor { _PyAST_BinOp(l, Mult, r, EXTRA) } + | l=term '/' r=factor { _PyAST_BinOp(l, Div, r, EXTRA) } + | factor + + factor[expr_ty]: + | '(' e=expr ')' { e } + | atom + + atom[expr_ty]: + | NAME + | NUMBER +``` + +Here ``EXTRA`` is a macro that expands to ``start_lineno, start_col_offset, +end_lineno, end_col_offset, p->arena``, those being variables automatically +injected by the parser; ``p`` points to an object that holds on to all state +for the parser. + +A similar grammar written to target Python AST objects: + +``` + start[ast.Module]: a=expr_stmt* ENDMARKER { ast.Module(body=a or [] } + expr_stmt: a=expr NEWLINE { ast.Expr(value=a, EXTRA) } + + expr: + | l=expr '+' r=term { ast.BinOp(left=l, op=ast.Add(), right=r, EXTRA) } + | l=expr '-' r=term { ast.BinOp(left=l, op=ast.Sub(), right=r, EXTRA) } + | term + + term: + | l=term '*' r=factor { ast.BinOp(left=l, op=ast.Mult(), right=r, EXTRA) } + | l=term '/' r=factor { ast.BinOp(left=l, op=ast.Div(), right=r, EXTRA) } + | factor + + factor: + | '(' e=expr ')' { e } + | atom + + atom: + | NAME + | NUMBER +``` + +Pegen +===== + +Pegen is the parser generator used in CPython to produce the final PEG parser +used by the interpreter. It is the program that can be used to read the python +grammar located in +[`Grammar/python.gram`](https://github.com/python/cpython/blob/main/Grammar/python.gram) +and produce the final C parser. It contains the following pieces: + +- A parser generator that can read a grammar file and produce a PEG parser + written in Python or C that can parse said grammar. The generator is located at + [`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen). +- A PEG meta-grammar that automatically generates a Python parser which is used + for the parser generator itself (this means that there are no manually-written + parsers). The meta-grammar is located at + [`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram). +- A generated parser (using the parser generator) that can directly produce C and Python AST objects. + +The source code for Pegen lives at +[`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen) +but normally all typical commands to interact with the parser generator are executed from +the main makefile. + +How to regenerate the parser +---------------------------- + +Once you have made the changes to the grammar files, to regenerate the ``C`` +parser (the one used by the interpreter) just execute: + +``` + make regen-pegen +``` + +using the ``Makefile`` in the main directory. If you are on Windows you can +use the Visual Studio project files to regenerate the parser or to execute: + +``` + ./PCbuild/build.bat --regen +``` + +The generated parser file is located at +[`Parser/parser.c`](https://github.com/python/cpython/blob/main/Parser/parser.c). + +How to regenerate the meta-parser +--------------------------------- + +The meta-grammar (the grammar that describes the grammar for the grammar files +themselves) is located at +[`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram). +Although it is very unlikely that you will ever need to modify it, if you make +any modifications to this file (in order to implement new Pegen features) you will +need to regenerate the meta-parser (the parser that parses the grammar files). +To do so just execute: + +``` + make regen-pegen-metaparser +``` + +If you are on Windows you can use the Visual Studio project files +to regenerate the parser or to execute: + +``` + ./PCbuild/build.bat --regen +``` + + +Grammatical elements and rules +------------------------------ + +Pegen has some special grammatical elements and rules: + +- Strings with single quotes (') (for example, ``'class'``) denote KEYWORDS. +- Strings with double quotes (") (for example, ``"match"``) denote SOFT KEYWORDS. +- Uppercase names (for example, ``NAME``) denote tokens in the + [`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens) file. +- Rule names starting with ``invalid_`` are used for specialized syntax errors. + + - These rules are NOT used in the first pass of the parser. + - Only if the first pass fails to parse, a second pass including the invalid + rules will be executed. + - If the parser fails in the second phase with a generic syntax error, the + location of the generic failure of the first pass will be used (this avoids + reporting incorrect locations due to the invalid rules). + - The order of the alternatives involving invalid rules matter + (like any rule in PEG). + +Tokenization +------------ + +It is common among PEG parser frameworks that the parser does both the parsing +and the tokenization, but this does not happen in Pegen. The reason is that the +Python language needs a custom tokenizer to handle things like indentation +boundaries, some special keywords like ``ASYNC`` and ``AWAIT`` (for +compatibility purposes), backtracking errors (such as unclosed parenthesis), +dealing with encoding, interactive mode and much more. Some of these reasons +are also there for historical purposes, and some others are useful even today. + +The list of tokens (all uppercase names in the grammar) that you can use can +be found in thei +[`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens) +file. If you change this file to add new tokens, make sure to regenerate the +files by executing: + +``` + make regen-token +``` + +If you are on Windows you can use the Visual Studio project files to regenerate +the tokens or to execute: + +``` + ./PCbuild/build.bat --regen +``` + +How tokens are generated and the rules governing this are completely up to the tokenizer +([`Parser/lexer`](https://github.com/python/cpython/blob/main/Parser/lexer) +and +[`Parser/tokenizer`](https://github.com/python/cpython/blob/main/Parser/tokenizer)); +the parser just receives tokens from it. + +Memoization +----------- + +As described previously, to avoid exponential time complexity in the parser, +memoization is used. + +The C parser used by Python is highly optimized and memoization can be expensive +both in memory and time. Although the memory cost is obvious (the parser needs +memory for storing previous results in the cache) the execution time cost comes +for continuously checking if the given rule has a cache hit or not. In many +situations, just parsing it again can be faster. Pegen **disables memoization +by default** except for rules with the special marker ``memo`` after the rule +name (and type, if present): + +``` + rule_name[typr] (memo): + ... +``` + +By selectively turning on memoization for a handful of rules, the parser becomes +faster and uses less memory. + +> [!NOTE] +> Left-recursive rules always use memoization, since the implementation of +> left-recursion depends on it. + +To determine whether a new rule needs memoization or not, benchmarking is required +(comparing execution times and memory usage of some considerably large files with +and without memoization). There is a very simple instrumentation API available +in the generated C parse code that allows to measure how much each rule uses +memoization (check the +[`Parser/pegen.c`](https://github.com/python/cpython/blob/main/Parser/pegen.c) +file for more information) but it needs to be manually activated. + +Automatic variables +------------------- + +To make writing actions easier, Pegen injects some automatic variables in the +namespace available when writing actions. In the C parser, some of these +automatic variable names are: + +- ``p``: The parser structure. +- ``EXTRA``: This is a macro that expands to + ``(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)``, + which is normally used to create AST nodes as almost all constructors need these + attributes to be provided. All of the location variables are taken from the + location information of the current token. + +Hard and soft keywords +---------------------- + +> [!NOTE] +> In the grammar files, keywords are defined using **single quotes** (for example, +> ``'class'``) while soft keywords are defined using **double quotes** (for example, +> ``"match"``). + +There are two kinds of keywords allowed in pegen grammars: *hard* and *soft* +keywords. The difference between hard and soft keywords is that hard keywords +are always reserved words, even in positions where they make no sense +(for example, ``x = class + 1``), while soft keywords only get a special +meaning in context. Trying to use a hard keyword as a variable will always +fail: + +``` + >>> class = 3 + File "<stdin>", line 1 + class = 3 + ^ + SyntaxError: invalid syntax + >>> foo(class=3) + File "<stdin>", line 1 + foo(class=3) + ^^^^^ + SyntaxError: invalid syntax +``` + +While soft keywords don't have this limitation if used in a context other the +one where they are defined as keywords: + +``` + >>> match = 45 + >>> foo(match="Yeah!") +``` + +The ``match`` and ``case`` keywords are soft keywords, so that they are +recognized as keywords at the beginning of a match statement or case block +respectively, but are allowed to be used in other places as variable or +argument names. + +You can get a list of all keywords defined in the grammar from Python: + +``` + >>> import keyword + >>> keyword.kwlist + ['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break', + 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for', + 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or', + 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield'] +``` + +as well as soft keywords: + +``` + >>> import keyword + >>> keyword.softkwlist + ['_', 'case', 'match'] +``` + +> [!CAUTION] +> Soft keywords can be a bit challenging to manage as they can be accepted in +> places you don't intend, given how the order alternatives behave in PEG +> parsers (see the +> [consequences of ordered choice](#consequences-of-the-ordered-choice-operator) +> section for some background on this). In general, try to define them in places +> where there are not many alternatives. + +Error handling +-------------- + +When a pegen-generated parser detects that an exception is raised, it will +**automatically stop parsing**, no matter what the current state of the parser +is, and it will unwind the stack and report the exception. This means that if a +[rule action](#grammar-actions) raises an exception, all parsing will +stop at that exact point. This is done to allow to correctly propagate any +exception set by calling Python's C API functions. This also includes +[``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) +exceptions and it is the main mechanism the parser uses to report custom syntax +error messages. + +> [!NOTE] +> Tokenizer errors are normally reported by raising exceptions but some special +> tokenizer errors such as unclosed parenthesis will be reported only after the +> parser finishes without returning anything. + +How syntax errors are reported +------------------------------ + +As described previously in the [how PEG parsers work](#how-peg-parsers-work) +section, PEG parsers don't have a defined concept of where errors happened +in the grammar, because a rule failure doesn't imply a parsing failure like +in context free grammars. This means that a heuristic has to be used to report +generic errors unless something is explicitly declared as an error in the +grammar. + +To report generic syntax errors, pegen uses a common heuristic in PEG parsers: +the location of *generic* syntax errors is reported to be the furthest token that +was attempted to be matched but failed. This is only done if parsing has failed +(the parser returns ``NULL`` in C or ``None`` in Python) but no exception has +been raised. + +As the Python grammar was primordially written as an ``LL(1)`` grammar, this heuristic +has an extremely high success rate, but some PEG features, such as lookaheads, +can impact this. + +> [!CAUTION] +> Positive and negative lookaheads will try to match a token so they will affect +> the location of generic syntax errors. Use them carefully at boundaries +> between rules. + +To generate more precise syntax errors, custom rules are used. This is a common +practice also in context free grammars: the parser will try to accept some +construct that is known to be incorrect just to report a specific syntax error +for that construct. In pegen grammars, these rules start with the ``invalid_`` +prefix. This is because trying to match these rules normally has a performance +impact on parsing (and can also affect the 'correct' grammar itself in some +tricky cases, depending on the ordering of the rules) so the generated parser +acts in two phases: + +1. The first phase will try to parse the input stream without taking into + account rules that start with the ``invalid_`` prefix. If the parsing + succeeds it will return the generated AST and the second phase will be + skipped. + +2. If the first phase failed, a second parsing attempt is done including the + rules that start with an ``invalid_`` prefix. By design this attempt + **cannot succeed** and is only executed to give to the invalid rules a + chance to detect specific situations where custom, more precise, syntax + errors can be raised. This also allows to trade a bit of performance for + precision reporting errors: given that we know that the input text is + invalid, there is typically no need to be fast because execution is going + to stop anyway. + +> [!IMPORTANT] +> When defining invalid rules: +> +> - Make sure all custom invalid rules raise +> [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) +> exceptions (or a subclass of it). +> - Make sure **all** invalid rules start with the ``invalid_`` prefix to not +> impact performance of parsing correct Python code. +> - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules +> (see the [how PEG parsers work](#how-peg-parsers-work) section for more information). + +You can find a collection of macros to raise specialized syntax errors in the +[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h) +header file. These macros allow also to report ranges for +the custom errors, which will be highlighted in the tracebacks that will be +displayed when the error is reported. + + +> [!TIP] +> A good way to test whether an invalid rule will be triggered when you expect +> is to test if introducing a syntax error **after** valid code triggers the +> rule or not. For example: + +``` + <valid python code> $ 42 +``` + +should trigger the syntax error in the ``$`` character. If your rule is not correctly defined this +won't happen. As another example, suppose that you try to define a rule to match Python 2 style +``print`` statements in order to create a better error message and you define it as: + +``` + invalid_print: "print" expression +``` + +This will **seem** to work because the parser will correctly parse ``print(something)`` because it is valid +code and the second phase will never execute but if you try to parse ``print(something) $ 3`` the first pass +of the parser will fail (because of the ``$``) and in the second phase, the rule will match the +``print(something)`` as ``print`` followed by the variable ``something`` between parentheses and the error +will be reported there instead of the ``$`` character. + +Generating AST objects +---------------------- + +The output of the C parser used by CPython, which is generated from the +[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram), +is a Python AST object (using C structures). This means that the actions in the +grammar file generate AST objects when they succeed. Constructing these objects +can be quite cumbersome (see the [AST compiler section](compiler.md#abstract-syntax-trees-ast) +for more information on how these objects are constructed and how they are used +by the compiler), so special helper functions are used. These functions are +declared in the +[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h) +header file and defined in the +[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c) +file. The helpers include functions that join AST sequences, get specific elements +from them or to perform extra processing on the generated tree. + + +> [!CAUTION] +> Actions must **never** be used to accept or reject rules. It may be tempting +> in some situations to write a very generic rule and then check the generated +> AST to decide whether it is valid or not, but this will render the +> (official grammar)[https://docs.python.org/3/reference/grammar.html] partially +> incorrect (because it does not include actions) and will make it more difficult +> for other Python implementations to adapt the grammar to their own needs. + +As a general rule, if an action spawns multiple lines or requires something more +complicated than a single expression of C code, is normally better to create a +custom helper in +[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c) +and expose it in the +[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h) +header file so that it can be used from the grammar. + +When parsing succeeds, the parser **must** return a **valid** AST object. + +Testing +======= + +There are three files that contain tests for the grammar and the parser: + +- [test_grammar.py](https://github.com/python/cpython/blob/main/Lib/test/test_grammar.py) +- [test_syntax.py](https://github.com/python/cpython/blob/main/Lib/test/test_syntax.py) +- [test_exceptions.py](https://github.com/python/cpython/blob/main/Lib/test/test_exceptions.py) + +Check the contents of these files to know which is the best place for new tests, depending +on the nature of the new feature you are adding. + +Tests for the parser generator itself can be found in the +[test_peg_generator](https://github.com/python/cpython/blob/main/Lib/test_peg_generator) +directory. + + +Debugging generated parsers +=========================== + +Making experiments +------------------ + +As the generated C parser is the one used by Python, this means that if +something goes wrong when adding some new rules to the grammar, you cannot +correctly compile and execute Python anymore. This makes it a bit challenging +to debug when something goes wrong, especially when experimenting. + +For this reason it is a good idea to experiment first by generating a Python +parser. To do this, you can go to the +[Tools/peg_generator](https://github.com/python/cpython/blob/main/Tools/peg_generator) +directory on the CPython repository and manually call the parser generator by executing: + +``` + $ python -m pegen python <PATH TO YOUR GRAMMAR FILE> +``` + +This will generate a file called ``parse.py`` in the same directory that you +can use to parse some input: + +``` + $ python parse.py file_with_source_code_to_test.py +``` + +As the generated ``parse.py`` file is just Python code, you can modify it +and add breakpoints to debug or better understand some complex situations. + + +Verbose mode +------------ + +When Python is compiled in debug mode (by adding ``--with-pydebug`` when +running the configure step in Linux or by adding ``-d`` when calling the +[PCbuild/build.bat](https://github.com/python/cpython/blob/main/PCbuild/build.bat)), +it is possible to activate a **very** verbose mode in the generated parser. This +is very useful to debug the generated parser and to understand how it works, but it +can be a bit hard to understand at first. + +> [!NOTE] +> When activating verbose mode in the Python parser, it is better to not use +> interactive mode as it can be much harder to understand, because interactive +> mode involves some special steps compared to regular parsing. + +To activate verbose mode you can add the ``-d`` flag when executing Python: + +``` + $ python -d file_to_test.py +``` + +This will print **a lot** of output to ``stderr`` so it is probably better to dump +it to a file for further analysis. The output consists of trace lines with the +following structure:: + +``` + <indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ... +``` + +Every line is indented by a different amount (``<indentation>``) depending on how +deep the call stack is. The next character marks the type of the trace: + +- ``>`` indicates that a rule is going to be attempted to be parsed. +- ``-`` indicates that a rule has failed to be parsed. +- ``+`` indicates that a rule has been parsed correctly. +- ``!`` indicates that an exception or an error has been detected and the parser is unwinding. + +The ``<token_location>`` part indicates the current index in the token array, +the ``<rule_name>`` part indicates what rule is being parsed and +the ``<alternative>`` part indicates what alternative within that rule +is being attempted. + + +> [!NOTE] +> **Document history** +> +> Pablo Galindo Salgado - Original author +> Irit Katriel and Jacob Coffee - Convert to Markdown |