summaryrefslogtreecommitdiffstats
diff options
context:
space:
mode:
authorIrit Katriel <1055913+iritkatriel@users.noreply.github.com>2024-10-09 17:21:35 (GMT)
committerGitHub <noreply@github.com>2024-10-09 17:21:35 (GMT)
commitd501153aed6cd9c03b77836821ed8d47f0655a0b (patch)
tree186de5007e86f3661acc079b3fd1787777f47243
parent9bda7750c2af779d3431f5ea120db91c6c83ec49 (diff)
downloadcpython-d501153aed6cd9c03b77836821ed8d47f0655a0b.zip
cpython-d501153aed6cd9c03b77836821ed8d47f0655a0b.tar.gz
cpython-d501153aed6cd9c03b77836821ed8d47f0655a0b.tar.bz2
gh-119786: Move parser doc from devguide to InternalDocs (#125119)
Co-authored-by: Jacob Coffee <jacob@z7x.org> Co-authored-by: Carol Willing <carolcode@willingconsulting.com> Co-Authored-By: Adam Turner <9087854+aa-turner@users.noreply.github.com> Co-Authored-By: Carl Friedrich Bolz-Tereick <cfbolz@gmx.de> Co-Authored-By: Carol Willing <carolcode@willingconsulting.com> Co-Authored-By: Erlend E. Aasland <erlend@python.org> Co-Authored-By: Ezio Melotti <ezio.melotti@gmail.com> Co-Authored-By: Hugo van Kemenade <hugovk@users.noreply.github.com> Co-Authored-By: Irit Katriel <iritkatriel@yahoo.com> Co-Authored-By: Itamar Ostricher <itamarost@gmail.com> Co-Authored-By: Julien Palard <julien@palard.fr> Co-Authored-By: Mana <potpath@users.noreply.github.com> Co-Authored-By: Muhammad Mahad <mahadpy@gmail.com> Co-Authored-By: Ned Batchelder <ned@nedbatchelder.com> Co-Authored-By: Pablo Galindo Salgado <Pablogsal@gmail.com> Co-Authored-By: slateny <46876382+slateny@users.noreply.github.com> Co-Authored-By: wookie184 <wookie1840@gmail.com>
-rw-r--r--InternalDocs/README.md2
-rw-r--r--InternalDocs/parser.md894
2 files changed, 896 insertions, 0 deletions
diff --git a/InternalDocs/README.md b/InternalDocs/README.md
index 95181a4..8956eca 100644
--- a/InternalDocs/README.md
+++ b/InternalDocs/README.md
@@ -12,6 +12,8 @@ it is not, please report that through the
[issue tracker](https://github.com/python/cpython/issues).
+[Guide to the parser](parser.md)
+
[Compiler Design](compiler.md)
[Frames](frames.md)
diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md
new file mode 100644
index 0000000..11aaf11
--- /dev/null
+++ b/InternalDocs/parser.md
@@ -0,0 +1,894 @@
+
+Guide to the parser
+===================
+
+Abstract
+--------
+
+Python's Parser is currently a
+[`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
+parser. It was introduced in
+[PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace
+the original [``LL(1)``](https://en.wikipedia.org/wiki/LL_parser) parser.
+
+The code implementing the parser is generated from a grammar definition by a
+[parser generator](https://en.wikipedia.org/wiki/Compiler-compiler).
+Therefore, changes to the Python language are made by modifying the
+[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram).
+Developers rarely need to modify the generator itself.
+
+See the devguide's [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar)
+for a detailed description of the grammar and the process for changing it.
+
+How PEG parsers work
+====================
+
+A PEG (Parsing Expression Grammar) grammar differs from a
+[context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar)
+in that the way it is written more closely reflects how the parser will operate
+when parsing. The fundamental technical difference is that the choice operator
+is ordered. This means that when writing:
+
+```
+ rule: A | B | C
+```
+
+a parser that implements a context-free-grammar (such as an ``LL(1)`` parser) will
+generate constructions that, given an input string, *deduce* which alternative
+(``A``, ``B`` or ``C``) must be expanded. On the other hand, a PEG parser will
+check each alternative, in the order in which they are specified, and select
+that first one that succeeds.
+
+This means that in a PEG grammar, the choice operator is not commutative.
+Furthermore, unlike context-free grammars, the derivation according to a
+PEG grammar cannot be ambiguous: if a string parses, it has exactly one
+valid parse tree.
+
+PEG parsers are usually constructed as a recursive descent parser in which every
+rule in the grammar corresponds to a function in the program implementing the
+parser, and the parsing expression (the "expansion" or "definition" of the rule)
+represents the "code" in said function. Each parsing function conceptually takes
+an input string as its argument, and yields one of the following results:
+
+* A "success" result. This result indicates that the expression can be parsed by
+ that rule and the function may optionally move forward or consume one or more
+ characters of the input string supplied to it.
+* A "failure" result, in which case no input is consumed.
+
+Note that "failure" results do not imply that the program is incorrect, nor do
+they necessarily mean that the parsing has failed. Since the choice operator is
+ordered, a failure very often merely indicates "try the following option". A
+direct implementation of a PEG parser as a recursive descent parser will present
+exponential time performance in the worst case, because PEG parsers have
+infinite lookahead (this means that they can consider an arbitrary number of
+tokens before deciding for a rule). Usually, PEG parsers avoid this exponential
+time complexity with a technique called
+["packrat parsing"](https://pdos.csail.mit.edu/~baford/packrat/thesis/)
+which not only loads the entire program in memory before parsing it but also
+allows the parser to backtrack arbitrarily. This is made efficient by memoizing
+the rules already matched for each position. The cost of the memoization cache
+is that the parser will naturally use more memory than a simple ``LL(1)`` parser,
+which normally are table-based.
+
+
+Key ideas
+---------
+
+- Alternatives are ordered ( ``A | B`` is not the same as ``B | A`` ).
+- If a rule returns a failure, it doesn't mean that the parsing has failed,
+ it just means "try something else".
+- By default PEG parsers run in exponential time, which can be optimized to linear by
+ using memoization.
+- If parsing fails completely (no rule succeeds in parsing all the input text), the
+ PEG parser doesn't have a concept of "where the
+ [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".
+
+
+> [!IMPORTANT]
+> Don't try to reason about a PEG grammar in the same way you would to with an
+> [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form)
+> or context free grammar. PEG is optimized to describe **how** input strings will
+> be parsed, while context-free grammars are optimized to generate strings of the
+> language they describe (in EBNF, to know whether a given string is in the
+> language, you need to do work to find out as it is not immediately obvious from
+> the grammar).
+
+
+Consequences of the ordered choice operator
+-------------------------------------------
+
+Although PEG may look like EBNF, its meaning is quite different. The fact
+that the alternatives are ordered in a PEG grammer (which is at the core of
+how PEG parsers work) has deep consequences, other than removing ambiguity.
+
+If a rule has two alternatives and the first of them succeeds, the second one is
+**not** attempted even if the caller rule fails to parse the rest of the input.
+Thus the parser is said to be "eager". To illustrate this, consider
+the following two rules (in these examples, a token is an individual character):
+
+```
+ first_rule: ( 'a' | 'aa' ) 'a'
+ second_rule: ('aa' | 'a' ) 'a'
+```
+
+In a regular EBNF grammar, both rules specify the language ``{aa, aaa}`` but
+in PEG, one of these two rules accepts the string ``aaa`` but not the string
+``aa``. The other does the opposite -- it accepts the string ``aa``
+but not the string ``aaa``. The rule ``('a'|'aa')'a'`` does
+not accept ``aaa`` because ``'a'|'aa'`` consumes the first ``a``, letting the
+final ``a`` in the rule consume the second, and leaving out the third ``a``.
+As the rule has succeeded, no attempt is ever made to go back and let
+``'a'|'aa'`` try the second alternative. The expression ``('aa'|'a')'a'`` does
+not accept ``aa`` because ``'aa'|'a'`` accepts all of ``aa``, leaving nothing
+for the final ``a``. Again, the second alternative of ``'aa'|'a'`` is not
+tried.
+
+> [!CAUTION]
+> The effects of ordered choice, such as the ones illustrated above, may be
+> hidden by many levels of rules.
+
+For this reason, writing rules where an alternative is contained in the next
+one is in almost all cases a mistake, for example:
+
+```
+ my_rule:
+ | 'if' expression 'then' block
+ | 'if' expression 'then' block 'else' block
+```
+
+In this example, the second alternative will never be tried because the first one will
+succeed first (even if the input string has an ``'else' block`` that follows). To correctly
+write this rule you can simply alter the order:
+
+```
+ my_rule:
+ | 'if' expression 'then' block 'else' block
+ | 'if' expression 'then' block
+```
+
+In this case, if the input string doesn't have an ``'else' block``, the first alternative
+will fail and the second will be attempted.
+
+Grammar Syntax
+==============
+
+The grammar consists of a sequence of rules of the form:
+
+```
+ rule_name: expression
+```
+
+Optionally, a type can be included right after the rule name, which
+specifies the return type of the C or Python function corresponding to
+the rule:
+
+```
+ rule_name[return_type]: expression
+```
+
+If the return type is omitted, then a ``void *`` is returned in C and an
+``Any`` in Python.
+
+Grammar expressions
+-------------------
+
+| Expression | Description and Example |
+|-----------------|-----------------------------------------------------------------------------------------------------------------------|
+| `# comment` | Python-style comments. |
+| `e1 e2` | Match `e1`, then match `e2`. <br> `rule_name: first_rule second_rule` |
+| `e1 \| e2` | Match `e1` or `e2`. <br> `rule_name[return_type]:`<br>` \| first_alt`<br>` \| second_alt` |
+| `( e )` | Grouping operator: Match `e`. <br> `rule_name: (e)`<br>`rule_name: (e1 e2)*` |
+| `[ e ]` or `e?` | Optionally match `e`. <br> `rule_name: [e]`<br>`rule_name: e (',' e)* [',']` |
+| `e*` | Match zero or more occurrences of `e`. <br> `rule_name: (e1 e2)*` |
+| `e+` | Match one or more occurrences of `e`. <br> `rule_name: (e1 e2)+` |
+| `s.e+` | Match one or more occurrences of `e`, separated by `s`. <br> `rule_name: ','.e+` |
+| `&e` | Positive lookahead: Succeed if `e` can be parsed, without consuming input. |
+| `!e` | Negative lookahead: Fail if `e` can be parsed, without consuming input. <br> `primary: atom !'.' !'(' !'['` |
+| `~` | Commit to the current alternative, even if it fails to parse (cut). <br> `rule_name: '(' ~ some_rule ')' \| some_alt` |
+
+
+Left recursion
+--------------
+
+PEG parsers normally do not support left recursion, but CPython's parser
+generator implements a technique similar to the one described in
+[Medeiros et al.](https://arxiv.org/pdf/1207.0443) but using the memoization
+cache instead of static variables. This approach is closer to the one described
+in [Warth et al.](http://web.cs.ucla.edu/~todd/research/pepm08.pdf). This
+allows us to write not only simple left-recursive rules but also more
+complicated rules that involve indirect left-recursion like:
+
+```
+ rule1: rule2 | 'a'
+ rule2: rule3 | 'b'
+ rule3: rule1 | 'c'
+```
+
+and "hidden left-recursion" like:
+
+```
+ rule: 'optional'? rule '@' some_other_rule
+```
+
+Variables in the grammar
+------------------------
+
+A sub-expression can be named by preceding it with an identifier and an
+``=`` sign. The name can then be used in the action (see below), like this:
+
+```
+ rule_name[return_type]: '(' a=some_other_rule ')' { a }
+```
+
+Grammar actions
+---------------
+
+To avoid the intermediate steps that obscure the relationship between the
+grammar and the AST generation, the PEG parser allows directly generating AST
+nodes for a rule via grammar actions. Grammar actions are language-specific
+expressions that are evaluated when a grammar rule is successfully parsed. These
+expressions can be written in Python or C depending on the desired output of the
+parser generator. This means that if one would want to generate a parser in
+Python and another in C, two grammar files should be written, each one with a
+different set of actions, keeping everything else apart from said actions
+identical in both files. As an example of a grammar with Python actions, the
+piece of the parser generator that parses grammar files is bootstrapped from a
+meta-grammar file with Python actions that generate the grammar tree as a result
+of the parsing.
+
+In the specific case of the PEG grammar for Python, having actions allows
+directly describing how the AST is composed in the grammar itself, making it
+more clear and maintainable. This AST generation process is supported by the use
+of some helper functions that factor out common AST object manipulations and
+some other required operations that are not directly related to the grammar.
+
+To indicate these actions each alternative can be followed by the action code
+inside curly-braces, which specifies the return value of the alternative:
+
+```
+ rule_name[return_type]:
+ | first_alt1 first_alt2 { first_alt1 }
+ | second_alt1 second_alt2 { second_alt1 }
+```
+
+If the action is omitted, a default action is generated:
+
+- If there is a single name in the rule, it gets returned.
+- If there multiple names in the rule, a collection with all parsed
+ expressions gets returned (the type of the collection will be different
+ in C and Python).
+
+This default behaviour is primarily made for very simple situations and for
+debugging purposes.
+
+> [!WARNING]
+> It's important that the actions don't mutate any AST nodes that are passed
+> into them via variables referring to other rules. The reason for mutation
+> being not allowed is that the AST nodes are cached by memoization and could
+> potentially be reused in a different context, where the mutation would be
+> invalid. If an action needs to change an AST node, it should instead make a
+> new copy of the node and change that.
+
+The full meta-grammar for the grammars supported by the PEG generator is:
+
+```
+ start[Grammar]: grammar ENDMARKER { grammar }
+
+ grammar[Grammar]:
+ | metas rules { Grammar(rules, metas) }
+ | rules { Grammar(rules, []) }
+
+ metas[MetaList]:
+ | meta metas { [meta] + metas }
+ | meta { [meta] }
+
+ meta[MetaTuple]:
+ | "@" NAME NEWLINE { (name.string, None) }
+ | "@" a=NAME b=NAME NEWLINE { (a.string, b.string) }
+ | "@" NAME STRING NEWLINE { (name.string, literal_eval(string.string)) }
+
+ rules[RuleList]:
+ | rule rules { [rule] + rules }
+ | rule { [rule] }
+
+ rule[Rule]:
+ | rulename ":" alts NEWLINE INDENT more_alts DEDENT {
+ Rule(rulename[0], rulename[1], Rhs(alts.alts + more_alts.alts)) }
+ | rulename ":" NEWLINE INDENT more_alts DEDENT { Rule(rulename[0], rulename[1], more_alts) }
+ | rulename ":" alts NEWLINE { Rule(rulename[0], rulename[1], alts) }
+
+ rulename[RuleName]:
+ | NAME '[' type=NAME '*' ']' {(name.string, type.string+"*")}
+ | NAME '[' type=NAME ']' {(name.string, type.string)}
+ | NAME {(name.string, None)}
+
+ alts[Rhs]:
+ | alt "|" alts { Rhs([alt] + alts.alts)}
+ | alt { Rhs([alt]) }
+
+ more_alts[Rhs]:
+ | "|" alts NEWLINE more_alts { Rhs(alts.alts + more_alts.alts) }
+ | "|" alts NEWLINE { Rhs(alts.alts) }
+
+ alt[Alt]:
+ | items '$' action { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=action) }
+ | items '$' { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=None) }
+ | items action { Alt(items, action=action) }
+ | items { Alt(items, action=None) }
+
+ items[NamedItemList]:
+ | named_item items { [named_item] + items }
+ | named_item { [named_item] }
+
+ named_item[NamedItem]:
+ | NAME '=' ~ item {NamedItem(name.string, item)}
+ | item {NamedItem(None, item)}
+ | it=lookahead {NamedItem(None, it)}
+
+ lookahead[LookaheadOrCut]:
+ | '&' ~ atom {PositiveLookahead(atom)}
+ | '!' ~ atom {NegativeLookahead(atom)}
+ | '~' {Cut()}
+
+ item[Item]:
+ | '[' ~ alts ']' {Opt(alts)}
+ | atom '?' {Opt(atom)}
+ | atom '*' {Repeat0(atom)}
+ | atom '+' {Repeat1(atom)}
+ | sep=atom '.' node=atom '+' {Gather(sep, node)}
+ | atom {atom}
+
+ atom[Plain]:
+ | '(' ~ alts ')' {Group(alts)}
+ | NAME {NameLeaf(name.string) }
+ | STRING {StringLeaf(string.string)}
+
+ # Mini-grammar for the actions
+
+ action[str]: "{" ~ target_atoms "}" { target_atoms }
+
+ target_atoms[str]:
+ | target_atom target_atoms { target_atom + " " + target_atoms }
+ | target_atom { target_atom }
+
+ target_atom[str]:
+ | "{" ~ target_atoms "}" { "{" + target_atoms + "}" }
+ | NAME { name.string }
+ | NUMBER { number.string }
+ | STRING { string.string }
+ | "?" { "?" }
+ | ":" { ":" }
+```
+
+As an illustrative example this simple grammar file allows directly
+generating a full parser that can parse simple arithmetic expressions and that
+returns a valid C-based Python AST:
+
+```
+ start[mod_ty]: a=expr_stmt* ENDMARKER { _PyAST_Module(a, NULL, p->arena) }
+ expr_stmt[stmt_ty]: a=expr NEWLINE { _PyAST_Expr(a, EXTRA) }
+
+ expr[expr_ty]:
+ | l=expr '+' r=term { _PyAST_BinOp(l, Add, r, EXTRA) }
+ | l=expr '-' r=term { _PyAST_BinOp(l, Sub, r, EXTRA) }
+ | term
+
+ term[expr_ty]:
+ | l=term '*' r=factor { _PyAST_BinOp(l, Mult, r, EXTRA) }
+ | l=term '/' r=factor { _PyAST_BinOp(l, Div, r, EXTRA) }
+ | factor
+
+ factor[expr_ty]:
+ | '(' e=expr ')' { e }
+ | atom
+
+ atom[expr_ty]:
+ | NAME
+ | NUMBER
+```
+
+Here ``EXTRA`` is a macro that expands to ``start_lineno, start_col_offset,
+end_lineno, end_col_offset, p->arena``, those being variables automatically
+injected by the parser; ``p`` points to an object that holds on to all state
+for the parser.
+
+A similar grammar written to target Python AST objects:
+
+```
+ start[ast.Module]: a=expr_stmt* ENDMARKER { ast.Module(body=a or [] }
+ expr_stmt: a=expr NEWLINE { ast.Expr(value=a, EXTRA) }
+
+ expr:
+ | l=expr '+' r=term { ast.BinOp(left=l, op=ast.Add(), right=r, EXTRA) }
+ | l=expr '-' r=term { ast.BinOp(left=l, op=ast.Sub(), right=r, EXTRA) }
+ | term
+
+ term:
+ | l=term '*' r=factor { ast.BinOp(left=l, op=ast.Mult(), right=r, EXTRA) }
+ | l=term '/' r=factor { ast.BinOp(left=l, op=ast.Div(), right=r, EXTRA) }
+ | factor
+
+ factor:
+ | '(' e=expr ')' { e }
+ | atom
+
+ atom:
+ | NAME
+ | NUMBER
+```
+
+Pegen
+=====
+
+Pegen is the parser generator used in CPython to produce the final PEG parser
+used by the interpreter. It is the program that can be used to read the python
+grammar located in
+[`Grammar/python.gram`](https://github.com/python/cpython/blob/main/Grammar/python.gram)
+and produce the final C parser. It contains the following pieces:
+
+- A parser generator that can read a grammar file and produce a PEG parser
+ written in Python or C that can parse said grammar. The generator is located at
+ [`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen).
+- A PEG meta-grammar that automatically generates a Python parser which is used
+ for the parser generator itself (this means that there are no manually-written
+ parsers). The meta-grammar is located at
+ [`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram).
+- A generated parser (using the parser generator) that can directly produce C and Python AST objects.
+
+The source code for Pegen lives at
+[`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen)
+but normally all typical commands to interact with the parser generator are executed from
+the main makefile.
+
+How to regenerate the parser
+----------------------------
+
+Once you have made the changes to the grammar files, to regenerate the ``C``
+parser (the one used by the interpreter) just execute:
+
+```
+ make regen-pegen
+```
+
+using the ``Makefile`` in the main directory. If you are on Windows you can
+use the Visual Studio project files to regenerate the parser or to execute:
+
+```
+ ./PCbuild/build.bat --regen
+```
+
+The generated parser file is located at
+[`Parser/parser.c`](https://github.com/python/cpython/blob/main/Parser/parser.c).
+
+How to regenerate the meta-parser
+---------------------------------
+
+The meta-grammar (the grammar that describes the grammar for the grammar files
+themselves) is located at
+[`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram).
+Although it is very unlikely that you will ever need to modify it, if you make
+any modifications to this file (in order to implement new Pegen features) you will
+need to regenerate the meta-parser (the parser that parses the grammar files).
+To do so just execute:
+
+```
+ make regen-pegen-metaparser
+```
+
+If you are on Windows you can use the Visual Studio project files
+to regenerate the parser or to execute:
+
+```
+ ./PCbuild/build.bat --regen
+```
+
+
+Grammatical elements and rules
+------------------------------
+
+Pegen has some special grammatical elements and rules:
+
+- Strings with single quotes (') (for example, ``'class'``) denote KEYWORDS.
+- Strings with double quotes (") (for example, ``"match"``) denote SOFT KEYWORDS.
+- Uppercase names (for example, ``NAME``) denote tokens in the
+ [`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens) file.
+- Rule names starting with ``invalid_`` are used for specialized syntax errors.
+
+ - These rules are NOT used in the first pass of the parser.
+ - Only if the first pass fails to parse, a second pass including the invalid
+ rules will be executed.
+ - If the parser fails in the second phase with a generic syntax error, the
+ location of the generic failure of the first pass will be used (this avoids
+ reporting incorrect locations due to the invalid rules).
+ - The order of the alternatives involving invalid rules matter
+ (like any rule in PEG).
+
+Tokenization
+------------
+
+It is common among PEG parser frameworks that the parser does both the parsing
+and the tokenization, but this does not happen in Pegen. The reason is that the
+Python language needs a custom tokenizer to handle things like indentation
+boundaries, some special keywords like ``ASYNC`` and ``AWAIT`` (for
+compatibility purposes), backtracking errors (such as unclosed parenthesis),
+dealing with encoding, interactive mode and much more. Some of these reasons
+are also there for historical purposes, and some others are useful even today.
+
+The list of tokens (all uppercase names in the grammar) that you can use can
+be found in thei
+[`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens)
+file. If you change this file to add new tokens, make sure to regenerate the
+files by executing:
+
+```
+ make regen-token
+```
+
+If you are on Windows you can use the Visual Studio project files to regenerate
+the tokens or to execute:
+
+```
+ ./PCbuild/build.bat --regen
+```
+
+How tokens are generated and the rules governing this are completely up to the tokenizer
+([`Parser/lexer`](https://github.com/python/cpython/blob/main/Parser/lexer)
+and
+[`Parser/tokenizer`](https://github.com/python/cpython/blob/main/Parser/tokenizer));
+the parser just receives tokens from it.
+
+Memoization
+-----------
+
+As described previously, to avoid exponential time complexity in the parser,
+memoization is used.
+
+The C parser used by Python is highly optimized and memoization can be expensive
+both in memory and time. Although the memory cost is obvious (the parser needs
+memory for storing previous results in the cache) the execution time cost comes
+for continuously checking if the given rule has a cache hit or not. In many
+situations, just parsing it again can be faster. Pegen **disables memoization
+by default** except for rules with the special marker ``memo`` after the rule
+name (and type, if present):
+
+```
+ rule_name[typr] (memo):
+ ...
+```
+
+By selectively turning on memoization for a handful of rules, the parser becomes
+faster and uses less memory.
+
+> [!NOTE]
+> Left-recursive rules always use memoization, since the implementation of
+> left-recursion depends on it.
+
+To determine whether a new rule needs memoization or not, benchmarking is required
+(comparing execution times and memory usage of some considerably large files with
+and without memoization). There is a very simple instrumentation API available
+in the generated C parse code that allows to measure how much each rule uses
+memoization (check the
+[`Parser/pegen.c`](https://github.com/python/cpython/blob/main/Parser/pegen.c)
+file for more information) but it needs to be manually activated.
+
+Automatic variables
+-------------------
+
+To make writing actions easier, Pegen injects some automatic variables in the
+namespace available when writing actions. In the C parser, some of these
+automatic variable names are:
+
+- ``p``: The parser structure.
+- ``EXTRA``: This is a macro that expands to
+ ``(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)``,
+ which is normally used to create AST nodes as almost all constructors need these
+ attributes to be provided. All of the location variables are taken from the
+ location information of the current token.
+
+Hard and soft keywords
+----------------------
+
+> [!NOTE]
+> In the grammar files, keywords are defined using **single quotes** (for example,
+> ``'class'``) while soft keywords are defined using **double quotes** (for example,
+> ``"match"``).
+
+There are two kinds of keywords allowed in pegen grammars: *hard* and *soft*
+keywords. The difference between hard and soft keywords is that hard keywords
+are always reserved words, even in positions where they make no sense
+(for example, ``x = class + 1``), while soft keywords only get a special
+meaning in context. Trying to use a hard keyword as a variable will always
+fail:
+
+```
+ >>> class = 3
+ File "<stdin>", line 1
+ class = 3
+ ^
+ SyntaxError: invalid syntax
+ >>> foo(class=3)
+ File "<stdin>", line 1
+ foo(class=3)
+ ^^^^^
+ SyntaxError: invalid syntax
+```
+
+While soft keywords don't have this limitation if used in a context other the
+one where they are defined as keywords:
+
+```
+ >>> match = 45
+ >>> foo(match="Yeah!")
+```
+
+The ``match`` and ``case`` keywords are soft keywords, so that they are
+recognized as keywords at the beginning of a match statement or case block
+respectively, but are allowed to be used in other places as variable or
+argument names.
+
+You can get a list of all keywords defined in the grammar from Python:
+
+```
+ >>> import keyword
+ >>> keyword.kwlist
+ ['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break',
+ 'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for',
+ 'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or',
+ 'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
+```
+
+as well as soft keywords:
+
+```
+ >>> import keyword
+ >>> keyword.softkwlist
+ ['_', 'case', 'match']
+```
+
+> [!CAUTION]
+> Soft keywords can be a bit challenging to manage as they can be accepted in
+> places you don't intend, given how the order alternatives behave in PEG
+> parsers (see the
+> [consequences of ordered choice](#consequences-of-the-ordered-choice-operator)
+> section for some background on this). In general, try to define them in places
+> where there are not many alternatives.
+
+Error handling
+--------------
+
+When a pegen-generated parser detects that an exception is raised, it will
+**automatically stop parsing**, no matter what the current state of the parser
+is, and it will unwind the stack and report the exception. This means that if a
+[rule action](#grammar-actions) raises an exception, all parsing will
+stop at that exact point. This is done to allow to correctly propagate any
+exception set by calling Python's C API functions. This also includes
+[``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError)
+exceptions and it is the main mechanism the parser uses to report custom syntax
+error messages.
+
+> [!NOTE]
+> Tokenizer errors are normally reported by raising exceptions but some special
+> tokenizer errors such as unclosed parenthesis will be reported only after the
+> parser finishes without returning anything.
+
+How syntax errors are reported
+------------------------------
+
+As described previously in the [how PEG parsers work](#how-peg-parsers-work)
+section, PEG parsers don't have a defined concept of where errors happened
+in the grammar, because a rule failure doesn't imply a parsing failure like
+in context free grammars. This means that a heuristic has to be used to report
+generic errors unless something is explicitly declared as an error in the
+grammar.
+
+To report generic syntax errors, pegen uses a common heuristic in PEG parsers:
+the location of *generic* syntax errors is reported to be the furthest token that
+was attempted to be matched but failed. This is only done if parsing has failed
+(the parser returns ``NULL`` in C or ``None`` in Python) but no exception has
+been raised.
+
+As the Python grammar was primordially written as an ``LL(1)`` grammar, this heuristic
+has an extremely high success rate, but some PEG features, such as lookaheads,
+can impact this.
+
+> [!CAUTION]
+> Positive and negative lookaheads will try to match a token so they will affect
+> the location of generic syntax errors. Use them carefully at boundaries
+> between rules.
+
+To generate more precise syntax errors, custom rules are used. This is a common
+practice also in context free grammars: the parser will try to accept some
+construct that is known to be incorrect just to report a specific syntax error
+for that construct. In pegen grammars, these rules start with the ``invalid_``
+prefix. This is because trying to match these rules normally has a performance
+impact on parsing (and can also affect the 'correct' grammar itself in some
+tricky cases, depending on the ordering of the rules) so the generated parser
+acts in two phases:
+
+1. The first phase will try to parse the input stream without taking into
+ account rules that start with the ``invalid_`` prefix. If the parsing
+ succeeds it will return the generated AST and the second phase will be
+ skipped.
+
+2. If the first phase failed, a second parsing attempt is done including the
+ rules that start with an ``invalid_`` prefix. By design this attempt
+ **cannot succeed** and is only executed to give to the invalid rules a
+ chance to detect specific situations where custom, more precise, syntax
+ errors can be raised. This also allows to trade a bit of performance for
+ precision reporting errors: given that we know that the input text is
+ invalid, there is typically no need to be fast because execution is going
+ to stop anyway.
+
+> [!IMPORTANT]
+> When defining invalid rules:
+>
+> - Make sure all custom invalid rules raise
+> [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError)
+> exceptions (or a subclass of it).
+> - Make sure **all** invalid rules start with the ``invalid_`` prefix to not
+> impact performance of parsing correct Python code.
+> - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules
+> (see the [how PEG parsers work](#how-peg-parsers-work) section for more information).
+
+You can find a collection of macros to raise specialized syntax errors in the
+[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
+header file. These macros allow also to report ranges for
+the custom errors, which will be highlighted in the tracebacks that will be
+displayed when the error is reported.
+
+
+> [!TIP]
+> A good way to test whether an invalid rule will be triggered when you expect
+> is to test if introducing a syntax error **after** valid code triggers the
+> rule or not. For example:
+
+```
+ <valid python code> $ 42
+```
+
+should trigger the syntax error in the ``$`` character. If your rule is not correctly defined this
+won't happen. As another example, suppose that you try to define a rule to match Python 2 style
+``print`` statements in order to create a better error message and you define it as:
+
+```
+ invalid_print: "print" expression
+```
+
+This will **seem** to work because the parser will correctly parse ``print(something)`` because it is valid
+code and the second phase will never execute but if you try to parse ``print(something) $ 3`` the first pass
+of the parser will fail (because of the ``$``) and in the second phase, the rule will match the
+``print(something)`` as ``print`` followed by the variable ``something`` between parentheses and the error
+will be reported there instead of the ``$`` character.
+
+Generating AST objects
+----------------------
+
+The output of the C parser used by CPython, which is generated from the
+[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram),
+is a Python AST object (using C structures). This means that the actions in the
+grammar file generate AST objects when they succeed. Constructing these objects
+can be quite cumbersome (see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
+for more information on how these objects are constructed and how they are used
+by the compiler), so special helper functions are used. These functions are
+declared in the
+[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
+header file and defined in the
+[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c)
+file. The helpers include functions that join AST sequences, get specific elements
+from them or to perform extra processing on the generated tree.
+
+
+> [!CAUTION]
+> Actions must **never** be used to accept or reject rules. It may be tempting
+> in some situations to write a very generic rule and then check the generated
+> AST to decide whether it is valid or not, but this will render the
+> (official grammar)[https://docs.python.org/3/reference/grammar.html] partially
+> incorrect (because it does not include actions) and will make it more difficult
+> for other Python implementations to adapt the grammar to their own needs.
+
+As a general rule, if an action spawns multiple lines or requires something more
+complicated than a single expression of C code, is normally better to create a
+custom helper in
+[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c)
+and expose it in the
+[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h)
+header file so that it can be used from the grammar.
+
+When parsing succeeds, the parser **must** return a **valid** AST object.
+
+Testing
+=======
+
+There are three files that contain tests for the grammar and the parser:
+
+- [test_grammar.py](https://github.com/python/cpython/blob/main/Lib/test/test_grammar.py)
+- [test_syntax.py](https://github.com/python/cpython/blob/main/Lib/test/test_syntax.py)
+- [test_exceptions.py](https://github.com/python/cpython/blob/main/Lib/test/test_exceptions.py)
+
+Check the contents of these files to know which is the best place for new tests, depending
+on the nature of the new feature you are adding.
+
+Tests for the parser generator itself can be found in the
+[test_peg_generator](https://github.com/python/cpython/blob/main/Lib/test_peg_generator)
+directory.
+
+
+Debugging generated parsers
+===========================
+
+Making experiments
+------------------
+
+As the generated C parser is the one used by Python, this means that if
+something goes wrong when adding some new rules to the grammar, you cannot
+correctly compile and execute Python anymore. This makes it a bit challenging
+to debug when something goes wrong, especially when experimenting.
+
+For this reason it is a good idea to experiment first by generating a Python
+parser. To do this, you can go to the
+[Tools/peg_generator](https://github.com/python/cpython/blob/main/Tools/peg_generator)
+directory on the CPython repository and manually call the parser generator by executing:
+
+```
+ $ python -m pegen python <PATH TO YOUR GRAMMAR FILE>
+```
+
+This will generate a file called ``parse.py`` in the same directory that you
+can use to parse some input:
+
+```
+ $ python parse.py file_with_source_code_to_test.py
+```
+
+As the generated ``parse.py`` file is just Python code, you can modify it
+and add breakpoints to debug or better understand some complex situations.
+
+
+Verbose mode
+------------
+
+When Python is compiled in debug mode (by adding ``--with-pydebug`` when
+running the configure step in Linux or by adding ``-d`` when calling the
+[PCbuild/build.bat](https://github.com/python/cpython/blob/main/PCbuild/build.bat)),
+it is possible to activate a **very** verbose mode in the generated parser. This
+is very useful to debug the generated parser and to understand how it works, but it
+can be a bit hard to understand at first.
+
+> [!NOTE]
+> When activating verbose mode in the Python parser, it is better to not use
+> interactive mode as it can be much harder to understand, because interactive
+> mode involves some special steps compared to regular parsing.
+
+To activate verbose mode you can add the ``-d`` flag when executing Python:
+
+```
+ $ python -d file_to_test.py
+```
+
+This will print **a lot** of output to ``stderr`` so it is probably better to dump
+it to a file for further analysis. The output consists of trace lines with the
+following structure::
+
+```
+ <indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ...
+```
+
+Every line is indented by a different amount (``<indentation>``) depending on how
+deep the call stack is. The next character marks the type of the trace:
+
+- ``>`` indicates that a rule is going to be attempted to be parsed.
+- ``-`` indicates that a rule has failed to be parsed.
+- ``+`` indicates that a rule has been parsed correctly.
+- ``!`` indicates that an exception or an error has been detected and the parser is unwinding.
+
+The ``<token_location>`` part indicates the current index in the token array,
+the ``<rule_name>`` part indicates what rule is being parsed and
+the ``<alternative>`` part indicates what alternative within that rule
+is being attempted.
+
+
+> [!NOTE]
+> **Document history**
+>
+> Pablo Galindo Salgado - Original author
+> Irit Katriel and Jacob Coffee - Convert to Markdown