Diffstat (limited to 'InternalDocs/parser.md')
-rw-r--r-- | InternalDocs/parser.md | 214
1 file changed, 101 insertions, 113 deletions
diff --git a/InternalDocs/parser.md b/InternalDocs/parser.md index 11aaf11..6398ba6 100644 --- a/InternalDocs/parser.md +++ b/InternalDocs/parser.md @@ -9,12 +9,12 @@ Python's Parser is currently a [`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar) parser. It was introduced in [PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace -the original [``LL(1)``](https://en.wikipedia.org/wiki/LL_parser) parser. +the original [`LL(1)`](https://en.wikipedia.org/wiki/LL_parser) parser. The code implementing the parser is generated from a grammar definition by a [parser generator](https://en.wikipedia.org/wiki/Compiler-compiler). Therefore, changes to the Python language are made by modifying the -[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram). +[grammar file](../Grammar/python.gram). Developers rarely need to modify the generator itself. See the devguide's [Changing CPython's grammar](https://devguide.python.org/developer-workflow/grammar/#grammar) @@ -33,9 +33,9 @@ is ordered. This means that when writing: rule: A | B | C ``` -a parser that implements a context-free-grammar (such as an ``LL(1)`` parser) will +a parser that implements a context-free-grammar (such as an `LL(1)` parser) will generate constructions that, given an input string, *deduce* which alternative -(``A``, ``B`` or ``C``) must be expanded. On the other hand, a PEG parser will +(`A`, `B` or `C`) must be expanded. On the other hand, a PEG parser will check each alternative, in the order in which they are specified, and select that first one that succeeds. @@ -67,21 +67,21 @@ time complexity with a technique called which not only loads the entire program in memory before parsing it but also allows the parser to backtrack arbitrarily. This is made efficient by memoizing the rules already matched for each position. The cost of the memoization cache -is that the parser will naturally use more memory than a simple ``LL(1)`` parser, +is that the parser will naturally use more memory than a simple `LL(1)` parser, which normally are table-based. Key ideas --------- -- Alternatives are ordered ( ``A | B`` is not the same as ``B | A`` ). +- Alternatives are ordered ( `A | B` is not the same as `B | A` ). - If a rule returns a failure, it doesn't mean that the parsing has failed, it just means "try something else". - By default PEG parsers run in exponential time, which can be optimized to linear by using memoization. - If parsing fails completely (no rule succeeds in parsing all the input text), the PEG parser doesn't have a concept of "where the - [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) is". + [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError) is". > [!IMPORTANT] @@ -111,16 +111,16 @@ the following two rules (in these examples, a token is an individual character): second_rule: ('aa' | 'a' ) 'a' ``` -In a regular EBNF grammar, both rules specify the language ``{aa, aaa}`` but -in PEG, one of these two rules accepts the string ``aaa`` but not the string -``aa``. The other does the opposite -- it accepts the string ``aa`` -but not the string ``aaa``. The rule ``('a'|'aa')'a'`` does -not accept ``aaa`` because ``'a'|'aa'`` consumes the first ``a``, letting the -final ``a`` in the rule consume the second, and leaving out the third ``a``. 
+In a regular EBNF grammar, both rules specify the language `{aa, aaa}` but +in PEG, one of these two rules accepts the string `aaa` but not the string +`aa`. The other does the opposite -- it accepts the string `aa` +but not the string `aaa`. The rule `('a'|'aa')'a'` does +not accept `aaa` because `'a'|'aa'` consumes the first `a`, letting the +final `a` in the rule consume the second, and leaving out the third `a`. As the rule has succeeded, no attempt is ever made to go back and let -``'a'|'aa'`` try the second alternative. The expression ``('aa'|'a')'a'`` does -not accept ``aa`` because ``'aa'|'a'`` accepts all of ``aa``, leaving nothing -for the final ``a``. Again, the second alternative of ``'aa'|'a'`` is not +`'a'|'aa'` try the second alternative. The expression `('aa'|'a')'a'` does +not accept `aa` because `'aa'|'a'` accepts all of `aa`, leaving nothing +for the final `a`. Again, the second alternative of `'aa'|'a'` is not tried. > [!CAUTION] @@ -137,7 +137,7 @@ one is in almost all cases a mistake, for example: ``` In this example, the second alternative will never be tried because the first one will -succeed first (even if the input string has an ``'else' block`` that follows). To correctly +succeed first (even if the input string has an `'else' block` that follows). To correctly write this rule you can simply alter the order: ``` @@ -146,7 +146,7 @@ write this rule you can simply alter the order: | 'if' expression 'then' block ``` -In this case, if the input string doesn't have an ``'else' block``, the first alternative +In this case, if the input string doesn't have an `'else' block`, the first alternative will fail and the second will be attempted. Grammar Syntax @@ -166,8 +166,8 @@ the rule: rule_name[return_type]: expression ``` -If the return type is omitted, then a ``void *`` is returned in C and an -``Any`` in Python. +If the return type is omitted, then a `void *` is returned in C and an +`Any` in Python. Grammar expressions ------------------- @@ -214,7 +214,7 @@ Variables in the grammar ------------------------ A sub-expression can be named by preceding it with an identifier and an -``=`` sign. The name can then be used in the action (see below), like this: +`=` sign. The name can then be used in the action (see below), like this: ``` rule_name[return_type]: '(' a=some_other_rule ')' { a } @@ -387,9 +387,9 @@ returns a valid C-based Python AST: | NUMBER ``` -Here ``EXTRA`` is a macro that expands to ``start_lineno, start_col_offset, -end_lineno, end_col_offset, p->arena``, those being variables automatically -injected by the parser; ``p`` points to an object that holds on to all state +Here `EXTRA` is a macro that expands to `start_lineno, start_col_offset, +end_lineno, end_col_offset, p->arena`, those being variables automatically +injected by the parser; `p` points to an object that holds on to all state for the parser. A similar grammar written to target Python AST objects: @@ -422,50 +422,47 @@ Pegen Pegen is the parser generator used in CPython to produce the final PEG parser used by the interpreter. It is the program that can be used to read the python -grammar located in -[`Grammar/python.gram`](https://github.com/python/cpython/blob/main/Grammar/python.gram) -and produce the final C parser. It contains the following pieces: +grammar located in [`Grammar/python.gram`](../Grammar/python.gram) and produce +the final C parser. 
It contains the following pieces: - A parser generator that can read a grammar file and produce a PEG parser written in Python or C that can parse said grammar. The generator is located at - [`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen). + [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen). - A PEG meta-grammar that automatically generates a Python parser which is used for the parser generator itself (this means that there are no manually-written parsers). The meta-grammar is located at - [`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram). + [`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram). - A generated parser (using the parser generator) that can directly produce C and Python AST objects. -The source code for Pegen lives at -[`Tools/peg_generator/pegen`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen) +The source code for Pegen lives at [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen) but normally all typical commands to interact with the parser generator are executed from the main makefile. How to regenerate the parser ---------------------------- -Once you have made the changes to the grammar files, to regenerate the ``C`` +Once you have made the changes to the grammar files, to regenerate the `C` parser (the one used by the interpreter) just execute: ``` make regen-pegen ``` -using the ``Makefile`` in the main directory. If you are on Windows you can +using the `Makefile` in the main directory. If you are on Windows you can use the Visual Studio project files to regenerate the parser or to execute: ``` ./PCbuild/build.bat --regen ``` -The generated parser file is located at -[`Parser/parser.c`](https://github.com/python/cpython/blob/main/Parser/parser.c). +The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c). How to regenerate the meta-parser --------------------------------- The meta-grammar (the grammar that describes the grammar for the grammar files themselves) is located at -[`Tools/peg_generator/pegen/metagrammar.gram`](https://github.com/python/cpython/blob/main/Tools/peg_generator/pegen/metagrammar.gram). +[`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram). Although it is very unlikely that you will ever need to modify it, if you make any modifications to this file (in order to implement new Pegen features) you will need to regenerate the meta-parser (the parser that parses the grammar files). @@ -488,11 +485,11 @@ Grammatical elements and rules Pegen has some special grammatical elements and rules: -- Strings with single quotes (') (for example, ``'class'``) denote KEYWORDS. -- Strings with double quotes (") (for example, ``"match"``) denote SOFT KEYWORDS. -- Uppercase names (for example, ``NAME``) denote tokens in the - [`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens) file. -- Rule names starting with ``invalid_`` are used for specialized syntax errors. +- Strings with single quotes (') (for example, `'class'`) denote KEYWORDS. +- Strings with double quotes (") (for example, `"match"`) denote SOFT KEYWORDS. +- Uppercase names (for example, `NAME`) denote tokens in the + [`Grammar/Tokens`](../Grammar/Tokens) file. +- Rule names starting with `invalid_` are used for specialized syntax errors. - These rules are NOT used in the first pass of the parser. 
- Only if the first pass fails to parse, a second pass including the invalid @@ -509,14 +506,13 @@ Tokenization It is common among PEG parser frameworks that the parser does both the parsing and the tokenization, but this does not happen in Pegen. The reason is that the Python language needs a custom tokenizer to handle things like indentation -boundaries, some special keywords like ``ASYNC`` and ``AWAIT`` (for +boundaries, some special keywords like `ASYNC` and `AWAIT` (for compatibility purposes), backtracking errors (such as unclosed parenthesis), dealing with encoding, interactive mode and much more. Some of these reasons are also there for historical purposes, and some others are useful even today. The list of tokens (all uppercase names in the grammar) that you can use can -be found in thei -[`Grammar/Tokens`](https://github.com/python/cpython/blob/main/Grammar/Tokens) +be found in the [`Grammar/Tokens`](../Grammar/Tokens) file. If you change this file to add new tokens, make sure to regenerate the files by executing: @@ -532,9 +528,7 @@ the tokens or to execute: ``` How tokens are generated and the rules governing this are completely up to the tokenizer -([`Parser/lexer`](https://github.com/python/cpython/blob/main/Parser/lexer) -and -[`Parser/tokenizer`](https://github.com/python/cpython/blob/main/Parser/tokenizer)); +([`Parser/lexer`](../Parser/lexer) and [`Parser/tokenizer`](../Parser/tokenizer)); the parser just receives tokens from it. Memoization @@ -548,7 +542,7 @@ both in memory and time. Although the memory cost is obvious (the parser needs memory for storing previous results in the cache) the execution time cost comes for continuously checking if the given rule has a cache hit or not. In many situations, just parsing it again can be faster. Pegen **disables memoization -by default** except for rules with the special marker ``memo`` after the rule +by default** except for rules with the special marker `memo` after the rule name (and type, if present): ``` @@ -567,8 +561,7 @@ To determine whether a new rule needs memoization or not, benchmarking is requir (comparing execution times and memory usage of some considerably large files with and without memoization). There is a very simple instrumentation API available in the generated C parse code that allows to measure how much each rule uses -memoization (check the -[`Parser/pegen.c`](https://github.com/python/cpython/blob/main/Parser/pegen.c) +memoization (check the [`Parser/pegen.c`](../Parser/pegen.c) file for more information) but it needs to be manually activated. Automatic variables @@ -578,9 +571,9 @@ To make writing actions easier, Pegen injects some automatic variables in the namespace available when writing actions. In the C parser, some of these automatic variable names are: -- ``p``: The parser structure. -- ``EXTRA``: This is a macro that expands to - ``(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)``, +- `p`: The parser structure. +- `EXTRA`: This is a macro that expands to + `(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)`, which is normally used to create AST nodes as almost all constructors need these attributes to be provided. All of the location variables are taken from the location information of the current token. @@ -590,13 +583,13 @@ Hard and soft keywords > [!NOTE] > In the grammar files, keywords are defined using **single quotes** (for example, -> ``'class'``) while soft keywords are defined using **double quotes** (for example, -> ``"match"``). 
+> `'class'`) while soft keywords are defined using **double quotes** (for example, +> `"match"`). There are two kinds of keywords allowed in pegen grammars: *hard* and *soft* keywords. The difference between hard and soft keywords is that hard keywords are always reserved words, even in positions where they make no sense -(for example, ``x = class + 1``), while soft keywords only get a special +(for example, `x = class + 1`), while soft keywords only get a special meaning in context. Trying to use a hard keyword as a variable will always fail: @@ -621,7 +614,7 @@ one where they are defined as keywords: >>> foo(match="Yeah!") ``` -The ``match`` and ``case`` keywords are soft keywords, so that they are +The `match` and `case` keywords are soft keywords, so that they are recognized as keywords at the beginning of a match statement or case block respectively, but are allowed to be used in other places as variable or argument names. @@ -662,7 +655,7 @@ is, and it will unwind the stack and report the exception. This means that if a [rule action](#grammar-actions) raises an exception, all parsing will stop at that exact point. This is done to allow to correctly propagate any exception set by calling Python's C API functions. This also includes -[``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) +[`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError) exceptions and it is the main mechanism the parser uses to report custom syntax error messages. @@ -684,10 +677,10 @@ grammar. To report generic syntax errors, pegen uses a common heuristic in PEG parsers: the location of *generic* syntax errors is reported to be the furthest token that was attempted to be matched but failed. This is only done if parsing has failed -(the parser returns ``NULL`` in C or ``None`` in Python) but no exception has +(the parser returns `NULL` in C or `None` in Python) but no exception has been raised. -As the Python grammar was primordially written as an ``LL(1)`` grammar, this heuristic +As the Python grammar was primordially written as an `LL(1)` grammar, this heuristic has an extremely high success rate, but some PEG features, such as lookaheads, can impact this. @@ -699,19 +692,19 @@ can impact this. To generate more precise syntax errors, custom rules are used. This is a common practice also in context free grammars: the parser will try to accept some construct that is known to be incorrect just to report a specific syntax error -for that construct. In pegen grammars, these rules start with the ``invalid_`` +for that construct. In pegen grammars, these rules start with the `invalid_` prefix. This is because trying to match these rules normally has a performance impact on parsing (and can also affect the 'correct' grammar itself in some tricky cases, depending on the ordering of the rules) so the generated parser acts in two phases: 1. The first phase will try to parse the input stream without taking into - account rules that start with the ``invalid_`` prefix. If the parsing + account rules that start with the `invalid_` prefix. If the parsing succeeds it will return the generated AST and the second phase will be skipped. 2. If the first phase failed, a second parsing attempt is done including the - rules that start with an ``invalid_`` prefix. By design this attempt + rules that start with an `invalid_` prefix. 
By design this attempt **cannot succeed** and is only executed to give to the invalid rules a chance to detect specific situations where custom, more precise, syntax errors can be raised. This also allows to trade a bit of performance for @@ -723,15 +716,15 @@ acts in two phases: > When defining invalid rules: > > - Make sure all custom invalid rules raise -> [``SyntaxError``](https://docs.python.org/3/library/exceptions.html#SyntaxError) +> [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError) > exceptions (or a subclass of it). -> - Make sure **all** invalid rules start with the ``invalid_`` prefix to not +> - Make sure **all** invalid rules start with the `invalid_` prefix to not > impact performance of parsing correct Python code. > - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules > (see the [how PEG parsers work](#how-peg-parsers-work) section for more information). You can find a collection of macros to raise specialized syntax errors in the -[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h) +[`Parser/pegen.h`](../Parser/pegen.h) header file. These macros allow also to report ranges for the custom errors, which will be highlighted in the tracebacks that will be displayed when the error is reported. @@ -746,35 +739,33 @@ displayed when the error is reported. <valid python code> $ 42 ``` -should trigger the syntax error in the ``$`` character. If your rule is not correctly defined this +should trigger the syntax error in the `$` character. If your rule is not correctly defined this won't happen. As another example, suppose that you try to define a rule to match Python 2 style -``print`` statements in order to create a better error message and you define it as: +`print` statements in order to create a better error message and you define it as: ``` invalid_print: "print" expression ``` -This will **seem** to work because the parser will correctly parse ``print(something)`` because it is valid -code and the second phase will never execute but if you try to parse ``print(something) $ 3`` the first pass -of the parser will fail (because of the ``$``) and in the second phase, the rule will match the -``print(something)`` as ``print`` followed by the variable ``something`` between parentheses and the error -will be reported there instead of the ``$`` character. +This will **seem** to work because the parser will correctly parse `print(something)` because it is valid +code and the second phase will never execute but if you try to parse `print(something) $ 3` the first pass +of the parser will fail (because of the `$`) and in the second phase, the rule will match the +`print(something)` as `print` followed by the variable `something` between parentheses and the error +will be reported there instead of the `$` character. Generating AST objects ---------------------- The output of the C parser used by CPython, which is generated from the -[grammar file](https://github.com/python/cpython/blob/main/Grammar/python.gram), -is a Python AST object (using C structures). This means that the actions in the -grammar file generate AST objects when they succeed. Constructing these objects -can be quite cumbersome (see the [AST compiler section](compiler.md#abstract-syntax-trees-ast) +[grammar file](../Grammar/python.gram), is a Python AST object (using C +structures). This means that the actions in the grammar file generate AST +objects when they succeed. 
Constructing these objects can be quite cumbersome +(see the [AST compiler section](compiler.md#abstract-syntax-trees-ast) for more information on how these objects are constructed and how they are used by the compiler), so special helper functions are used. These functions are -declared in the -[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h) -header file and defined in the -[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c) -file. The helpers include functions that join AST sequences, get specific elements +declared in the [`Parser/pegen.h`](../Parser/pegen.h) header file and defined +in the [`Parser/action_helpers.c`](../Parser/action_helpers.c) file. The +helpers include functions that join AST sequences, get specific elements from them or to perform extra processing on the generated tree. @@ -788,11 +779,9 @@ from them or to perform extra processing on the generated tree. As a general rule, if an action spawns multiple lines or requires something more complicated than a single expression of C code, is normally better to create a -custom helper in -[`Parser/action_helpers.c`](https://github.com/python/cpython/blob/main/Parser/action_helpers.c) -and expose it in the -[`Parser/pegen.h`](https://github.com/python/cpython/blob/main/Parser/pegen.h) -header file so that it can be used from the grammar. +custom helper in [`Parser/action_helpers.c`](../Parser/action_helpers.c) +and expose it in the [`Parser/pegen.h`](../Parser/pegen.h) header file so that +it can be used from the grammar. When parsing succeeds, the parser **must** return a **valid** AST object. @@ -801,16 +790,15 @@ Testing There are three files that contain tests for the grammar and the parser: -- [test_grammar.py](https://github.com/python/cpython/blob/main/Lib/test/test_grammar.py) -- [test_syntax.py](https://github.com/python/cpython/blob/main/Lib/test/test_syntax.py) -- [test_exceptions.py](https://github.com/python/cpython/blob/main/Lib/test/test_exceptions.py) +- [test_grammar.py](../Lib/test/test_grammar.py) +- [test_syntax.py](../Lib/test/test_syntax.py) +- [test_exceptions.py](../Lib/test/test_exceptions.py) -Check the contents of these files to know which is the best place for new tests, depending -on the nature of the new feature you are adding. +Check the contents of these files to know which is the best place for new +tests, depending on the nature of the new feature you are adding. Tests for the parser generator itself can be found in the -[test_peg_generator](https://github.com/python/cpython/blob/main/Lib/test_peg_generator) -directory. +[test_peg_generator](../Lib/test_peg_generator) directory. Debugging generated parsers @@ -825,33 +813,32 @@ correctly compile and execute Python anymore. This makes it a bit challenging to debug when something goes wrong, especially when experimenting. For this reason it is a good idea to experiment first by generating a Python -parser. To do this, you can go to the -[Tools/peg_generator](https://github.com/python/cpython/blob/main/Tools/peg_generator) +parser. 
To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator) directory on the CPython repository and manually call the parser generator by executing: ``` $ python -m pegen python <PATH TO YOUR GRAMMAR FILE> ``` -This will generate a file called ``parse.py`` in the same directory that you +This will generate a file called `parse.py` in the same directory that you can use to parse some input: ``` $ python parse.py file_with_source_code_to_test.py ``` -As the generated ``parse.py`` file is just Python code, you can modify it +As the generated `parse.py` file is just Python code, you can modify it and add breakpoints to debug or better understand some complex situations. Verbose mode ------------ -When Python is compiled in debug mode (by adding ``--with-pydebug`` when -running the configure step in Linux or by adding ``-d`` when calling the -[PCbuild/build.bat](https://github.com/python/cpython/blob/main/PCbuild/build.bat)), -it is possible to activate a **very** verbose mode in the generated parser. This -is very useful to debug the generated parser and to understand how it works, but it +When Python is compiled in debug mode (by adding `--with-pydebug` when +running the configure step in Linux or by adding `-d` when calling the +[PCbuild/build.bat](../PCbuild/build.bat)), it is possible to activate a +**very** verbose mode in the generated parser. This is very useful to +debug the generated parser and to understand how it works, but it can be a bit hard to understand at first. > [!NOTE] @@ -859,13 +846,13 @@ can be a bit hard to understand at first. > interactive mode as it can be much harder to understand, because interactive > mode involves some special steps compared to regular parsing. -To activate verbose mode you can add the ``-d`` flag when executing Python: +To activate verbose mode you can add the `-d` flag when executing Python: ``` $ python -d file_to_test.py ``` -This will print **a lot** of output to ``stderr`` so it is probably better to dump +This will print **a lot** of output to `stderr` so it is probably better to dump it to a file for further analysis. The output consists of trace lines with the following structure:: @@ -873,17 +860,17 @@ following structure:: <indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ... ``` -Every line is indented by a different amount (``<indentation>``) depending on how +Every line is indented by a different amount (`<indentation>`) depending on how deep the call stack is. The next character marks the type of the trace: -- ``>`` indicates that a rule is going to be attempted to be parsed. -- ``-`` indicates that a rule has failed to be parsed. -- ``+`` indicates that a rule has been parsed correctly. -- ``!`` indicates that an exception or an error has been detected and the parser is unwinding. +- `>` indicates that a rule is going to be attempted to be parsed. +- `-` indicates that a rule has failed to be parsed. +- `+` indicates that a rule has been parsed correctly. +- `!` indicates that an exception or an error has been detected and the parser is unwinding. -The ``<token_location>`` part indicates the current index in the token array, -the ``<rule_name>`` part indicates what rule is being parsed and -the ``<alternative>`` part indicates what alternative within that rule +The `<token_location>` part indicates the current index in the token array, +the `<rule_name>` part indicates what rule is being parsed and +the `<alternative>` part indicates what alternative within that rule is being attempted. 
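As a rough illustration of how this verbose trace can be post-processed, the following sketch counts how often each rule is attempted and how often it succeeds. It is not part of CPython; it assumes the trace was redirected to a file named `parser_trace.txt` (a hypothetical name, for example via `python -d file_to_test.py 2> parser_trace.txt`) and that every trace line follows the structure shown above.

```python
# Minimal sketch: summarize a verbose-parser trace dump.
# Assumes each trace line follows the documented structure:
#   <indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative>
# "parser_trace.txt" is a hypothetical file name used for this example.
import re
from collections import Counter

TRACE_LINE = re.compile(r"^\s*([>+\-!])\s*(\w+)\[(\d+)\]")

attempted, succeeded = Counter(), Counter()
with open("parser_trace.txt") as f:
    for line in f:
        m = TRACE_LINE.match(line)
        if not m:
            continue  # skip anything that is not a trace line
        marker, rule, _token_index = m.groups()
        if marker == ">":
            attempted[rule] += 1   # rule is about to be tried at this position
        elif marker == "+":
            succeeded[rule] += 1   # rule parsed correctly

for rule, tries in attempted.most_common(10):
    print(f"{rule}: attempted {tries}, succeeded {succeeded[rule]}")
```

A summary like this can hint at which rules are attempted very often and might be candidates for the memoization benchmarking described earlier, but the actual measurements should come from the instrumentation in `Parser/pegen.c`.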
@@ -891,4 +878,5 @@ is being attempted.
 > **Document history**
 >
 > Pablo Galindo Salgado - Original author
+>
 > Irit Katriel and Jacob Coffee - Convert to Markdown
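To make the ordered-choice behaviour discussed in the diff concrete, here is a deliberately tiny, hand-written Python sketch. It is not Pegen output and not part of CPython; the names (`make_parser`, `rule_one`, `rule_two`) are invented for the illustration. It mimics the two rules `('a'|'aa') 'a'` and `('aa'|'a') 'a'` over a plain string and memoizes sub-results by position, in the packrat style the document describes.

```python
# Toy illustration of two PEG ideas from the document above:
# ordered choice and packrat-style memoization. Hand-written sketch only.
from functools import lru_cache

def make_parser(first, second):
    """Build a parser for the rule: (first | second) 'a'  -- ordered choice."""

    @lru_cache(maxsize=None)   # memoize results keyed by (text, position)
    def alt(text, pos):
        # Try the alternatives in order; commit to the first one that matches.
        for literal in (first, second):
            if text.startswith(literal, pos):
                return pos + len(literal)
        return None

    def parse(text):
        end = alt(text, 0)
        # The trailing 'a' must consume exactly what is left of the input.
        return end is not None and text[end:] == "a"

    return parse

rule_one = make_parser("a", "aa")    # ('a'  | 'aa') 'a'
rule_two = make_parser("aa", "a")    # ('aa' | 'a' ) 'a'

print(rule_one("aa"), rule_one("aaa"))   # True  False
print(rule_two("aa"), rule_two("aaa"))   # False True
```

Running it prints `True False` for the first rule and `False True` for the second, matching the behaviour described in the hunk on ordered alternatives: once an alternative succeeds, the parser never goes back to try the next one, so `A | B` is not interchangeable with `B | A`.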