summaryrefslogtreecommitdiffstats
path: root/InternalDocs/parser.md
blob: 348988b7c2f00391381850a6cf9c27ee1f7d2be0 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882

Guide to the parser
===================

Abstract
--------

Python's Parser is currently a
[`PEG` (Parser Expression Grammar)](https://en.wikipedia.org/wiki/Parsing_expression_grammar)
parser. It was introduced in
[PEP 617: New PEG parser for CPython](https://peps.python.org/pep-0617/) to replace
the original [`LL(1)`](https://en.wikipedia.org/wiki/LL_parser) parser.

The code implementing the parser is generated from a grammar definition by a
[parser generator](https://en.wikipedia.org/wiki/Compiler-compiler).
Therefore, changes to the Python language are made by modifying the
[grammar file](../Grammar/python.gram).
Developers rarely need to modify the generator itself.

See [Changing CPython's grammar](changing_grammar.md)
for a detailed description of the grammar and the process for changing it.

How PEG parsers work
====================

A PEG (Parsing Expression Grammar) grammar differs from a
[context-free grammar](https://en.wikipedia.org/wiki/Context-free_grammar)
in that the way it is written more closely reflects how the parser will operate
when parsing. The fundamental technical difference is that the choice operator
is ordered. This means that when writing:

```
    rule: A | B | C
```

a parser that implements a context-free-grammar (such as an `LL(1)` parser) will
generate constructions that, given an input string, *deduce* which alternative
(`A`, `B` or `C`) must be expanded. On the other hand, a PEG parser will
check each alternative, in the order in which they are specified, and select
that first one that succeeds.

This means that in a PEG grammar, the choice operator is not commutative.
Furthermore, unlike context-free grammars, the derivation according to a
PEG grammar cannot be ambiguous: if a string parses, it has exactly one
valid parse tree.

PEG parsers are usually constructed as a recursive descent parser in which every
rule in the grammar corresponds to a function in the program implementing the
parser, and the parsing expression (the "expansion" or "definition" of the rule)
represents the "code" in said function. Each parsing function conceptually takes
an input string as its argument, and yields one of the following results:

* A "success" result. This result indicates that the expression can be parsed by
  that rule and the function may optionally move forward or consume one or more
  characters of the input string supplied to it.
* A "failure" result, in which case no input is consumed.

Note that "failure" results do not imply that the program is incorrect, nor do
they necessarily mean that the parsing has failed. Since the choice operator is
ordered, a failure very often merely indicates "try the following option".  A
direct implementation of a PEG parser as a recursive descent parser will present
exponential time performance in the worst case, because PEG parsers have
infinite lookahead (this means that they can consider an arbitrary number of
tokens before deciding for a rule).  Usually, PEG parsers avoid this exponential
time complexity with a technique called
["packrat parsing"](https://pdos.csail.mit.edu/~baford/packrat/thesis/)
which not only loads the entire program in memory before parsing it but also
allows the parser to backtrack arbitrarily. This is made efficient by memoizing
the rules already matched for each position. The cost of the memoization cache
is that the parser will naturally use more memory than a simple `LL(1)` parser,
which normally are table-based.


Key ideas
---------

- Alternatives are ordered ( `A | B` is not the same as `B | A` ).
- If a rule returns a failure, it doesn't mean that the parsing has failed,
  it just means "try something else".
- By default PEG parsers run in exponential time, which can be optimized to linear by
  using memoization.
- If parsing fails completely (no rule succeeds in parsing all the input text), the
  PEG parser doesn't have a concept of "where the
  [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError) is".


> [!IMPORTANT]
> Don't try to reason about a PEG grammar in the same way you would to with an
> [EBNF](https://en.wikipedia.org/wiki/Extended_Backus–Naur_form)
> or context free grammar. PEG is optimized to describe **how** input strings will
> be parsed, while context-free grammars are optimized to generate strings of the
> language they describe (in EBNF, to know whether a given string is in the
> language, you need to do work to find out as it is not immediately obvious from
> the grammar).


Consequences of the ordered choice operator
-------------------------------------------

Although PEG may look like EBNF, its meaning is quite different. The fact
that the alternatives are ordered in a PEG grammer (which is at the core of
how PEG parsers work) has deep consequences, other than removing ambiguity.

If a rule has two alternatives and the first of them succeeds, the second one is
**not** attempted even if the caller rule fails to parse the rest of the input.
Thus the parser is said to be "eager". To illustrate this, consider
the following two rules (in these examples, a token is an individual character):

```
    first_rule:  ( 'a' | 'aa' ) 'a'
    second_rule: ('aa' | 'a'  ) 'a'
```

In a regular EBNF grammar, both rules specify the language `{aa, aaa}` but
in PEG, one of these two rules accepts the string `aaa` but not the string
`aa`. The other does the opposite -- it accepts the string `aa`
but not the string `aaa`. The rule `('a'|'aa')'a'` does
not accept `aaa` because `'a'|'aa'` consumes the first `a`, letting the
final `a` in the rule consume the second, and leaving out the third `a`.
As the rule has succeeded, no attempt is ever made to go back and let
`'a'|'aa'` try the second alternative. The expression `('aa'|'a')'a'` does
not accept `aa` because `'aa'|'a'` accepts all of `aa`, leaving nothing
for the final `a`. Again, the second alternative of `'aa'|'a'` is not
tried.

> [!CAUTION]
> The effects of ordered choice, such as the ones illustrated above, may be
> hidden by many levels of rules.

For this reason, writing rules where an alternative is contained in the next
one is in almost all cases a mistake, for example:

```
    my_rule:
      | 'if' expression 'then' block
      | 'if' expression 'then' block 'else' block
```

In this example, the second alternative will never be tried because the first one will
succeed first (even if the input string has an `'else' block` that follows). To correctly
write this rule you can simply alter the order:

```
    my_rule:
      | 'if' expression 'then' block 'else' block
      | 'if' expression 'then' block
```

In this case, if the input string doesn't have an `'else' block`, the first alternative
will fail and the second will be attempted.

Grammar Syntax
==============

The grammar consists of a sequence of rules of the form:

```
    rule_name: expression
```

Optionally, a type can be included right after the rule name, which
specifies the return type of the C or Python function corresponding to
the rule:

```
    rule_name[return_type]: expression
```

If the return type is omitted, then a `void *` is returned in C and an
`Any` in Python.

Grammar expressions
-------------------

| Expression      | Description and Example                                                                                               |
|-----------------|-----------------------------------------------------------------------------------------------------------------------|
| `# comment`     | Python-style comments.                                                                                                |
| `e1 e2`         | Match `e1`, then match `e2`. <br> `rule_name: first_rule second_rule`                                                 |
| `e1 \| e2`      | Match `e1` or `e2`. <br> `rule_name[return_type]:`<br>`    \| first_alt`<br>`    \| second_alt`                       |
| `( e )`         | Grouping operator: Match `e`. <br> `rule_name: (e)`<br>`rule_name: (e1 e2)*`                                          |
| `[ e ]` or `e?` | Optionally match `e`. <br> `rule_name: [e]`<br>`rule_name: e (',' e)* [',']`                                          |
| `e*`            | Match zero or more occurrences of `e`. <br> `rule_name: (e1 e2)*`                                                     |
| `e+`            | Match one or more occurrences of `e`. <br> `rule_name: (e1 e2)+`                                                      |
| `s.e+`          | Match one or more occurrences of `e`, separated by `s`. <br> `rule_name: ','.e+`                                      |
| `&e`            | Positive lookahead: Succeed if `e` can be parsed, without consuming input.                                            |
| `!e`            | Negative lookahead: Fail if `e` can be parsed, without consuming input. <br> `primary: atom !'.' !'(' !'['`           |
| `~`             | Commit to the current alternative, even if it fails to parse (cut). <br> `rule_name: '(' ~ some_rule ')' \| some_alt` |


Left recursion
--------------

PEG parsers normally do not support left recursion, but CPython's parser
generator implements a technique similar to the one described in
[Medeiros et al.](https://arxiv.org/pdf/1207.0443) but using the memoization
cache instead of static variables. This approach is closer to the one described
in [Warth et al.](http://web.cs.ucla.edu/~todd/research/pepm08.pdf). This
allows us to write not only simple left-recursive rules but also more
complicated rules that involve indirect left-recursion like:

```
    rule1: rule2 | 'a'
    rule2: rule3 | 'b'
    rule3: rule1 | 'c'
```

and "hidden left-recursion" like:

```
    rule: 'optional'? rule '@' some_other_rule
```

Variables in the grammar
------------------------

A sub-expression can be named by preceding it with an identifier and an
`=` sign. The name can then be used in the action (see below), like this:

```
    rule_name[return_type]: '(' a=some_other_rule ')' { a }
```

Grammar actions
---------------

To avoid the intermediate steps that obscure the relationship between the
grammar and the AST generation, the PEG parser allows directly generating AST
nodes for a rule via grammar actions. Grammar actions are language-specific
expressions that are evaluated when a grammar rule is successfully parsed. These
expressions can be written in Python or C depending on the desired output of the
parser generator. This means that if one would want to generate a parser in
Python and another in C, two grammar files should be written, each one with a
different set of actions, keeping everything else apart from said actions
identical in both files. As an example of a grammar with Python actions, the
piece of the parser generator that parses grammar files is bootstrapped from a
meta-grammar file with Python actions that generate the grammar tree as a result
of the parsing.

In the specific case of the PEG grammar for Python, having actions allows
directly describing how the AST is composed in the grammar itself, making it
more clear and maintainable. This AST generation process is supported by the use
of some helper functions that factor out common AST object manipulations and
some other required operations that are not directly related to the grammar.

To indicate these actions each alternative can be followed by the action code
inside curly-braces, which specifies the return value of the alternative:

```
    rule_name[return_type]:
       | first_alt1 first_alt2 { first_alt1 }
       | second_alt1 second_alt2 { second_alt1 }
```

If the action is omitted, a default action is generated:

- If there is a single name in the rule, it gets returned.
- If there multiple names in the rule, a collection with all parsed
  expressions gets returned (the type of the collection will be different
  in C and Python).

This default behaviour is primarily made for very simple situations and for
debugging purposes.

> [!WARNING]
> It's important that the actions don't mutate any AST nodes that are passed
> into them via variables referring to other rules. The reason for mutation
> being not allowed is that the AST nodes are cached by memoization and could
> potentially be reused in a different context, where the mutation would be
> invalid. If an action needs to change an AST node, it should instead make a
> new copy of the node and change that.

The full meta-grammar for the grammars supported by the PEG generator is:

```
    start[Grammar]: grammar ENDMARKER { grammar }

    grammar[Grammar]:
        | metas rules { Grammar(rules, metas) }
        | rules { Grammar(rules, []) }

    metas[MetaList]:
        | meta metas { [meta] + metas }
        | meta { [meta] }

    meta[MetaTuple]:
        | "@" NAME NEWLINE { (name.string, None) }
        | "@" a=NAME b=NAME NEWLINE { (a.string, b.string) }
        | "@" NAME STRING NEWLINE { (name.string, literal_eval(string.string)) }

    rules[RuleList]:
        | rule rules { [rule] + rules }
        | rule { [rule] }

    rule[Rule]:
        | rulename ":" alts NEWLINE INDENT more_alts DEDENT {
                Rule(rulename[0], rulename[1], Rhs(alts.alts + more_alts.alts)) }
        | rulename ":" NEWLINE INDENT more_alts DEDENT { Rule(rulename[0], rulename[1], more_alts) }
        | rulename ":" alts NEWLINE { Rule(rulename[0], rulename[1], alts) }

    rulename[RuleName]:
        | NAME '[' type=NAME '*' ']' {(name.string, type.string+"*")}
        | NAME '[' type=NAME ']' {(name.string, type.string)}
        | NAME {(name.string, None)}

    alts[Rhs]:
        | alt "|" alts { Rhs([alt] + alts.alts)}
        | alt { Rhs([alt]) }

    more_alts[Rhs]:
        | "|" alts NEWLINE more_alts { Rhs(alts.alts + more_alts.alts) }
        | "|" alts NEWLINE { Rhs(alts.alts) }

    alt[Alt]:
        | items '$' action { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=action) }
        | items '$' { Alt(items + [NamedItem(None, NameLeaf('ENDMARKER'))], action=None) }
        | items action { Alt(items, action=action) }
        | items { Alt(items, action=None) }

    items[NamedItemList]:
        | named_item items { [named_item] + items }
        | named_item { [named_item] }

    named_item[NamedItem]:
        | NAME '=' ~ item {NamedItem(name.string, item)}
        | item {NamedItem(None, item)}
        | it=lookahead {NamedItem(None, it)}

    lookahead[LookaheadOrCut]:
        | '&' ~ atom {PositiveLookahead(atom)}
        | '!' ~ atom {NegativeLookahead(atom)}
        | '~' {Cut()}

    item[Item]:
        | '[' ~ alts ']' {Opt(alts)}
        |  atom '?' {Opt(atom)}
        |  atom '*' {Repeat0(atom)}
        |  atom '+' {Repeat1(atom)}
        |  sep=atom '.' node=atom '+' {Gather(sep, node)}
        |  atom {atom}

    atom[Plain]:
        | '(' ~ alts ')' {Group(alts)}
        | NAME {NameLeaf(name.string) }
        | STRING {StringLeaf(string.string)}

    # Mini-grammar for the actions

    action[str]: "{" ~ target_atoms "}" { target_atoms }

    target_atoms[str]:
        | target_atom target_atoms { target_atom + " " + target_atoms }
        | target_atom { target_atom }

    target_atom[str]:
        | "{" ~ target_atoms "}" { "{" + target_atoms + "}" }
        | NAME { name.string }
        | NUMBER { number.string }
        | STRING { string.string }
        | "?" { "?" }
        | ":" { ":" }
```

As an illustrative example this simple grammar file allows directly
generating a full parser that can parse simple arithmetic expressions and that
returns a valid C-based Python AST:

```
    start[mod_ty]: a=expr_stmt* ENDMARKER { _PyAST_Module(a, NULL, p->arena) }
    expr_stmt[stmt_ty]: a=expr NEWLINE { _PyAST_Expr(a, EXTRA) }

    expr[expr_ty]:
        | l=expr '+' r=term { _PyAST_BinOp(l, Add, r, EXTRA) }
        | l=expr '-' r=term { _PyAST_BinOp(l, Sub, r, EXTRA) }
        | term

    term[expr_ty]:
        | l=term '*' r=factor { _PyAST_BinOp(l, Mult, r, EXTRA) }
        | l=term '/' r=factor { _PyAST_BinOp(l, Div, r, EXTRA) }
        | factor

    factor[expr_ty]:
        | '(' e=expr ')' { e }
        | atom

    atom[expr_ty]:
        | NAME
        | NUMBER
```

Here `EXTRA` is a macro that expands to `start_lineno, start_col_offset,
end_lineno, end_col_offset, p->arena`, those being variables automatically
injected by the parser; `p` points to an object that holds on to all state
for the parser.

A similar grammar written to target Python AST objects:

```
    start[ast.Module]: a=expr_stmt* ENDMARKER { ast.Module(body=a or [] }
    expr_stmt: a=expr NEWLINE { ast.Expr(value=a, EXTRA) }

    expr:
        | l=expr '+' r=term { ast.BinOp(left=l, op=ast.Add(), right=r, EXTRA) }
        | l=expr '-' r=term { ast.BinOp(left=l, op=ast.Sub(), right=r, EXTRA) }
        | term

    term:
        | l=term '*' r=factor { ast.BinOp(left=l, op=ast.Mult(), right=r, EXTRA) }
        | l=term '/' r=factor { ast.BinOp(left=l, op=ast.Div(), right=r, EXTRA) }
        | factor

    factor:
        | '(' e=expr ')' { e }
        | atom

    atom:
        | NAME
        | NUMBER
```

Pegen
=====

Pegen is the parser generator used in CPython to produce the final PEG parser
used by the interpreter. It is the program that can be used to read the python
grammar located in [`Grammar/python.gram`](../Grammar/python.gram) and produce
the final C parser. It contains the following pieces:

- A parser generator that can read a grammar file and produce a PEG parser
  written in Python or C that can parse said grammar. The generator is located at
  [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen).
- A PEG meta-grammar that automatically generates a Python parser which is used
  for the parser generator itself (this means that there are no manually-written
  parsers). The meta-grammar is located at
  [`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
- A generated parser (using the parser generator) that can directly produce C and Python AST objects.

The source code for Pegen lives at [`Tools/peg_generator/pegen`](../Tools/peg_generator/pegen)
but normally all typical commands to interact with the parser generator are executed from
the main makefile.

How to regenerate the parser
----------------------------

Once you have made the changes to the grammar files, to regenerate the `C`
parser (the one used by the interpreter) just execute:

```
    make regen-pegen
```

using the `Makefile` in the main directory.  If you are on Windows you can
use the Visual Studio project files to regenerate the parser or to execute:

```
    ./PCbuild/build.bat --regen
```

The generated parser file is located at [`Parser/parser.c`](../Parser/parser.c).

How to regenerate the meta-parser
---------------------------------

The meta-grammar (the grammar that describes the grammar for the grammar files
themselves) is located at
[`Tools/peg_generator/pegen/metagrammar.gram`](../Tools/peg_generator/pegen/metagrammar.gram).
Although it is very unlikely that you will ever need to modify it, if you make
any modifications to this file (in order to implement new Pegen features) you will
need to regenerate the meta-parser (the parser that parses the grammar files).
To do so just execute:

```
    make regen-pegen-metaparser
```

If you are on Windows you can use the Visual Studio project files
to regenerate the parser or to execute:

```
    ./PCbuild/build.bat --regen
```


Grammatical elements and rules
------------------------------

Pegen has some special grammatical elements and rules:

- Strings with single quotes (') (for example, `'class'`) denote KEYWORDS.
- Strings with double quotes (") (for example, `"match"`) denote SOFT KEYWORDS.
- Uppercase names (for example, `NAME`) denote tokens in the
  [`Grammar/Tokens`](../Grammar/Tokens) file.
- Rule names starting with `invalid_` are used for specialized syntax errors.

  - These rules are NOT used in the first pass of the parser.
  - Only if the first pass fails to parse, a second pass including the invalid
    rules will be executed.
  - If the parser fails in the second phase with a generic syntax error, the
    location of the generic failure of the first pass will be used (this avoids
    reporting incorrect locations due to the invalid rules).
  - The order of the alternatives involving invalid rules matter
    (like any rule in PEG).

Tokenization
------------

It is common among PEG parser frameworks that the parser does both the parsing
and the tokenization, but this does not happen in Pegen. The reason is that the
Python language needs a custom tokenizer to handle things like indentation
boundaries, some special keywords like `ASYNC` and `AWAIT` (for
compatibility purposes), backtracking errors (such as unclosed parenthesis),
dealing with encoding, interactive mode and much more. Some of these reasons
are also there for historical purposes, and some others are useful even today.

The list of tokens (all uppercase names in the grammar) that you can use can
be found in the [`Grammar/Tokens`](../Grammar/Tokens)
file. If you change this file to add new tokens, make sure to regenerate the
files by executing:

```
    make regen-token
```

If you are on Windows you can use the Visual Studio project files to regenerate
the tokens or to execute:

```
    ./PCbuild/build.bat --regen
```

How tokens are generated and the rules governing this are completely up to the tokenizer
([`Parser/lexer`](../Parser/lexer) and [`Parser/tokenizer`](../Parser/tokenizer));
the parser just receives tokens from it.

Memoization
-----------

As described previously, to avoid exponential time complexity in the parser,
memoization is used.

The C parser used by Python is highly optimized and memoization can be expensive
both in memory and time. Although the memory cost is obvious (the parser needs
memory for storing previous results in the cache) the execution time cost comes
for continuously checking if the given rule has a cache hit or not. In many
situations, just parsing it again can be faster. Pegen **disables memoization
by default** except for rules with the special marker `memo` after the rule
name (and type, if present):

```
    rule_name[typr] (memo):
      ...
```

By selectively turning on memoization for a handful of rules, the parser becomes
faster and uses less memory.

> [!NOTE]
> Left-recursive rules always use memoization, since the implementation of
> left-recursion depends on it.

To determine whether a new rule needs memoization or not, benchmarking is required
(comparing execution times and memory usage of some considerably large files with
and without memoization). There is a very simple instrumentation API available
in the generated C parse code that allows to measure how much each rule uses
memoization (check the [`Parser/pegen.c`](../Parser/pegen.c)
file for more information) but it needs to be manually activated.

Automatic variables
-------------------

To make writing actions easier, Pegen injects some automatic variables in the
namespace available when writing actions. In the C parser, some of these
automatic variable names are:

- `p`: The parser structure.
- `EXTRA`: This is a macro that expands to
  `(_start_lineno, _start_col_offset, _end_lineno, _end_col_offset, p->arena)`,
  which is normally used to create AST nodes as almost all constructors need these
  attributes to be provided. All of the location variables are taken from the
  location information of the current token.

Hard and soft keywords
----------------------

> [!NOTE]
> In the grammar files, keywords are defined using **single quotes** (for example,
> `'class'`) while soft keywords are defined using **double quotes** (for example,
> `"match"`).

There are two kinds of keywords allowed in pegen grammars: *hard* and *soft*
keywords. The difference between hard and soft keywords is that hard keywords
are always reserved words, even in positions where they make no sense
(for example, `x = class + 1`), while soft keywords only get a special
meaning in context. Trying to use a hard keyword as a variable will always
fail:

```
    >>> class = 3
    File "<stdin>", line 1
        class = 3
            ^
    SyntaxError: invalid syntax
    >>> foo(class=3)
    File "<stdin>", line 1
        foo(class=3)
            ^^^^^
    SyntaxError: invalid syntax
```

While soft keywords don't have this limitation if used in a context other the
one where they are defined as keywords:

```
    >>> match = 45
    >>> foo(match="Yeah!")
```

The `match` and `case` keywords are soft keywords, so that they are
recognized as keywords at the beginning of a match statement or case block
respectively, but are allowed to be used in other places as variable or
argument names.

You can get a list of all keywords defined in the grammar from Python:

```
    >>> import keyword
    >>> keyword.kwlist
    ['False', 'None', 'True', 'and', 'as', 'assert', 'async', 'await', 'break',
    'class', 'continue', 'def', 'del', 'elif', 'else', 'except', 'finally', 'for',
    'from', 'global', 'if', 'import', 'in', 'is', 'lambda', 'nonlocal', 'not', 'or',
    'pass', 'raise', 'return', 'try', 'while', 'with', 'yield']
```

as well as soft keywords:

```
    >>> import keyword
    >>> keyword.softkwlist
    ['_', 'case', 'match']
```

> [!CAUTION]
> Soft keywords can be a bit challenging to manage as they can be accepted in
> places you don't intend, given how the order alternatives behave in PEG
> parsers (see the
> [consequences of ordered choice](#consequences-of-the-ordered-choice-operator)
> section for some background on this). In general, try to define them in places
> where there are not many alternatives.

Error handling
--------------

When a pegen-generated parser detects that an exception is raised, it will
**automatically stop parsing**, no matter what the current state of the parser
is, and it will unwind the stack and report the exception. This means that if a
[rule action](#grammar-actions) raises an exception, all parsing will
stop at that exact point. This is done to allow to correctly propagate any
exception set by calling Python's C API functions. This also includes
[`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
exceptions and it is the main mechanism the parser uses to report custom syntax
error messages.

> [!NOTE]
> Tokenizer errors are normally reported by raising exceptions but some special
> tokenizer errors such as unclosed parenthesis will be reported only after the
> parser finishes without returning anything.

How syntax errors are reported
------------------------------

As described previously in the [how PEG parsers work](#how-peg-parsers-work)
section, PEG parsers don't have a defined concept of where errors happened
in the grammar, because a rule failure doesn't imply a parsing failure like
in context free grammars. This means that a heuristic has to be used to report
generic errors unless something is explicitly declared as an error in the
grammar.

To report generic syntax errors, pegen uses a common heuristic in PEG parsers:
the location of *generic* syntax errors is reported to be the furthest token that
was attempted to be matched but failed. This is only done if parsing has failed
(the parser returns `NULL` in C or `None` in Python) but no exception has
been raised.

As the Python grammar was primordially written as an `LL(1)` grammar, this heuristic
has an extremely high success rate, but some PEG features, such as lookaheads,
can impact this.

> [!CAUTION]
> Positive and negative lookaheads will try to match a token so they will affect
> the location of generic syntax errors. Use them carefully at boundaries
> between rules.

To generate more precise syntax errors, custom rules are used. This is a common
practice also in context free grammars: the parser will try to accept some
construct that is known to be incorrect just to report a specific syntax error
for that construct. In pegen grammars, these rules start with the `invalid_`
prefix. This is because trying to match these rules normally has a performance
impact on parsing (and can also affect the 'correct' grammar itself in some
tricky cases, depending on the ordering of the rules) so the generated parser
acts in two phases:

1. The first phase will try to parse the input stream without taking into
   account rules that start with the `invalid_` prefix. If the parsing
   succeeds it will return the generated AST and the second phase will be
   skipped.

2. If the first phase failed, a second parsing attempt is done including the
   rules that start with an `invalid_` prefix. By design this attempt
   **cannot succeed** and is only executed to give to the invalid rules a
   chance to detect specific situations where custom, more precise, syntax
   errors can be raised. This also allows to trade a bit of performance for
   precision reporting errors: given that we know that the input text is
   invalid, there is typically no need to be fast because execution is going
   to stop anyway.

> [!IMPORTANT]
> When defining invalid rules:
>
> - Make sure all custom invalid rules raise
>   [`SyntaxError`](https://docs.python.org/3/library/exceptions.html#SyntaxError)
>   exceptions (or a subclass of it).
> - Make sure **all** invalid rules start with the `invalid_` prefix to not
>   impact performance of parsing correct Python code.
> - Make sure the parser doesn't behave differently for regular rules when you introduce invalid rules
>   (see the [how PEG parsers work](#how-peg-parsers-work) section for more information).

You can find a collection of macros to raise specialized syntax errors in the
[`Parser/pegen.h`](../Parser/pegen.h)
header file. These macros allow also to report ranges for
the custom errors, which will be highlighted in the tracebacks that will be
displayed when the error is reported.


> [!TIP]
> A good way to test whether an invalid rule will be triggered when you expect
> is to test if introducing a syntax error **after** valid code triggers the
> rule or not. For example:

```
   <valid python code> $ 42
```

should trigger the syntax error in the `$` character. If your rule is not correctly defined this
won't happen. As another example, suppose that you try to define a rule to match Python 2 style
`print` statements in order to create a better error message and you define it as:

```
    invalid_print: "print" expression
```

This will **seem** to work because the parser will correctly parse `print(something)` because it is valid
code and the second phase will never execute but if you try to parse `print(something) $ 3` the first pass
of the parser will fail (because of the `$`) and in the second phase, the rule will match the
`print(something)` as `print` followed by the variable `something` between parentheses and the error
will be reported there instead of the `$` character.

Generating AST objects
----------------------

The output of the C parser used by CPython,  which is generated from the
[grammar file](../Grammar/python.gram), is a Python AST object (using C
structures). This means that the actions in the grammar file generate AST
objects when they succeed. Constructing these objects can be quite cumbersome
(see the [AST compiler section](compiler.md#abstract-syntax-trees-ast)
for more information on how these objects are constructed and how they are used
by the compiler), so special helper functions are used. These functions are
declared in the [`Parser/pegen.h`](../Parser/pegen.h) header file and defined
in the [`Parser/action_helpers.c`](../Parser/action_helpers.c) file. The
helpers include functions that join AST sequences, get specific elements
from them or to perform extra processing on the generated tree.


> [!CAUTION]
> Actions must **never** be used to accept or reject rules. It may be tempting
> in some situations to write a very generic rule and then check the generated
> AST to decide whether it is valid or not, but this will render the
> (official grammar)[https://docs.python.org/3/reference/grammar.html] partially
> incorrect (because it does not include actions) and will make it more difficult
> for other Python implementations to adapt the grammar to their own needs.

As a general rule, if an action spawns multiple lines or requires something more
complicated than a single expression of C code, is normally better to create a
custom helper in [`Parser/action_helpers.c`](../Parser/action_helpers.c)
and expose it in the [`Parser/pegen.h`](../Parser/pegen.h) header file so that
it can be used from the grammar.

When parsing succeeds, the parser **must** return a **valid** AST object.

Testing
=======

There are three files that contain tests for the grammar and the parser:

- [test_grammar.py](../Lib/test/test_grammar.py)
- [test_syntax.py](../Lib/test/test_syntax.py)
- [test_exceptions.py](../Lib/test/test_exceptions.py)

Check the contents of these files to know which is the best place for new
tests, depending on the nature of the new feature you are adding.

Tests for the parser generator itself can be found in the
[test_peg_generator](../Lib/test_peg_generator) directory.


Debugging generated parsers
===========================

Making experiments
------------------

As the generated C parser is the one used by Python, this means that if
something goes wrong when adding some new rules to the grammar, you cannot
correctly compile and execute Python anymore. This makes it a bit challenging
to debug when something goes wrong, especially when experimenting.

For this reason it is a good idea to experiment first by generating a Python
parser. To do this, you can go to the [Tools/peg_generator](../Tools/peg_generator)
directory on the CPython repository and manually call the parser generator by executing:

```
    $ python -m pegen python <PATH TO YOUR GRAMMAR FILE>
```

This will generate a file called `parse.py` in the same directory that you
can use to parse some input:

```
    $ python parse.py file_with_source_code_to_test.py
```

As the generated `parse.py` file is just Python code, you can modify it
and add breakpoints to debug or better understand some complex situations.


Verbose mode
------------

When Python is compiled in debug mode (by adding `--with-pydebug` when
running the configure step in Linux or by adding `-d` when calling the
[PCbuild/build.bat](../PCbuild/build.bat)), it is possible to activate a
**very** verbose mode in the generated parser. This is very useful to
debug the generated parser and to understand how it works, but it
can be a bit hard to understand at first.

> [!NOTE]
> When activating verbose mode in the Python parser, it is better to not use
> interactive mode as it can be much harder to understand, because interactive
> mode involves some special steps compared to regular parsing.

To activate verbose mode you can add the `-d` flag when executing Python:

```
    $ python -d file_to_test.py
```

This will print **a lot** of output to `stderr` so it is probably better to dump
it to a file for further analysis. The output consists of trace lines with the
following structure::

```
    <indentation> ('>'|'-'|'+'|'!') <rule_name>[<token_location>]: <alternative> ...
```

Every line is indented by a different amount (`<indentation>`) depending on how
deep the call stack is. The next character marks the type of the trace:

- `>` indicates that a rule is going to be attempted to be parsed.
- `-` indicates that a rule has failed to be parsed.
- `+` indicates that a rule has been parsed correctly.
- `!` indicates that an exception or an error has been detected and the parser is unwinding.

The `<token_location>` part indicates the current index in the token array,
the `<rule_name>` part indicates what rule is being parsed and
the `<alternative>` part indicates what alternative within that rule
is being attempted.


> [!NOTE]
> **Document history**
>
>   Pablo Galindo Salgado - Original author
>
>   Irit Katriel and Jacob Coffee - Convert to Markdown