diff options
Diffstat (limited to 'tcllib/modules/pt/include/serial')
-rw-r--r-- | tcllib/modules/pt/include/serial/ast.inc | 104 | ||||
-rw-r--r-- | tcllib/modules/pt/include/serial/pegrammar.inc | 114 | ||||
-rw-r--r-- | tcllib/modules/pt/include/serial/pexpression.inc | 245 |
3 files changed, 463 insertions, 0 deletions
diff --git a/tcllib/modules/pt/include/serial/ast.inc b/tcllib/modules/pt/include/serial/ast.inc new file mode 100644 index 0000000..090a8c2 --- /dev/null +++ b/tcllib/modules/pt/include/serial/ast.inc @@ -0,0 +1,104 @@ +[comment {-*- text -*-}] +[section {AST serialization format}] + +Here we specify the format used by the Parser Tools to serialize +Abstract Syntax Trees (ASTs) as immutable values for transport, +comparison, etc. + +[para] + +Each node in an AST represents a nonterminal symbol of a grammar, and +the range of tokens/characters in the input covered by it. ASTs do not +contain terminal symbols, i.e. tokens/characters. These can be +recovered from the input given a symbol's location. + +[para] + +We distinguish between [term regular] and [term canonical] +serializations. + +While a tree may have more than one regular serialization only exactly +one of them will be [term canonical]. + + +[list_begin definitions][comment {-- serializations --}] +[def {Regular serialization}] + +[list_begin enumerated][comment {-- regular points --}] + +[enum] +The serialization of any AST is the serialization of its root node. + +[enum] +The serialization of any node is a Tcl list containing at least three +elements. + +[list_begin enumerated][comment {-- node elements --}] +[enum] +The first element is the name of the nonterminal symbol stored in the +node. + +[enum] +The second and third element are the locations of the first and last +token in the token stream the node represents (covers). + +[list_begin enumerated][comment {--- location constraints}] +[enum] +Locations are provided as non-negative integer offsets from the +beginning of the token stream, with the first token found in the +stream located at offset 0 (zero). + +[enum] +The end location has to be equal to or larger than the start location. + +[list_end][comment {--- location constraints}] + +[enum] +All elements after the first three represent the children of the node, +which are themselves nodes. This means that the serializations of +nodes without children, i.e. leaf nodes, have exactly three elements. + +The children are stored in the list with the leftmost child first, and +the rightmost child last. + +[list_end][comment {-- node elements --}] +[list_end][comment {-- regular points --}] + +[def {Canonical serialization}] + +The canonical serialization of an abstract syntax tree has the format +as specified in the previous item, and then additionally satisfies the +constraints below, which make it unique among all the possible +serializations of this tree. + +[list_begin enumerated][comment {-- canonical points --}] +[enum] + +The string representation of the value is the canonical representation +of a pure Tcl list. I.e. it does not contain superfluous whitespace. + +[list_end][comment {-- canonical points --}] +[list_end][comment {-- serializations --}] +[para] + +[subsection Example] + +Assuming the parsing expression grammar below + +[para] +[include ../example/expr_peg.inc] +[para] + +and the input string + +[example { 120+5 }] + +then a parser should deliver the abstract syntax tree below (except for whitespace) + +[para] +[include ../example/expr_ast.inc] +[para] + +Or, more graphical + +[para][image expr_ast][para] diff --git a/tcllib/modules/pt/include/serial/pegrammar.inc b/tcllib/modules/pt/include/serial/pegrammar.inc new file mode 100644 index 0000000..4dbdb56 --- /dev/null +++ b/tcllib/modules/pt/include/serial/pegrammar.inc @@ -0,0 +1,114 @@ +[section {PEG serialization format}] + +Here we specify the format used by the Parser Tools to serialize +Parsing Expression Grammars as immutable values for transport, +comparison, etc. + +[para] + +We distinguish between [term regular] and [term canonical] +serializations. + +While a PEG may have more than one regular serialization only exactly +one of them will be [term canonical]. + + +[list_begin definitions][comment {-- serializations --}] +[def {regular serialization}] + +[list_begin enumerated][comment {-- regular points --}] +[enum] +The serialization of any PEG is a nested Tcl dictionary. + +[enum] +This dictionary holds a single key, [const pt::grammar::peg], and its +value. This value holds the contents of the grammar. + +[enum] +The contents of the grammar are a Tcl dictionary holding the set of +nonterminal symbols and the starting expression. The relevant keys and +their values are + +[list_begin definitions][comment {-- grammar keywords --}] +[def [const rules]] + +The value is a Tcl dictionary whose keys are the names of the +nonterminal symbols known to the grammar. + +[list_begin enumerated][comment {-- nonterminals --}] +[enum] +Each nonterminal symbol may occur only once. + +[enum] +The empty string is not a legal nonterminal symbol. + +[enum] +The value for each symbol is a Tcl dictionary itself. The relevant +keys and their values in this dictionary are + +[list_begin definitions][comment {-- nonterminal keywords --}] +[def [const is]] + +The value is the serialization of the parsing expression describing +the symbols sentennial structure, as specified in the section +[sectref {PE serialization format}]. + +[def [const mode]] + +The value can be one of three values specifying how a parser should +handle the semantic value produced by the symbol. + +[include ../modes.inc] +[list_end][comment {-- nonterminal keywords --}] +[list_end][comment {-- nonterminals --}] + +[def [const start]] + +The value is the serialization of the start parsing expression of the +grammar, as specified in the section [sectref {PE serialization format}]. + +[list_end][comment {-- grammar keywords --}] + +[enum] +The terminal symbols of the grammar are specified implicitly as the +set of all terminal symbols used in the start expression and on the +RHS of the grammar rules. + + +[list_end][comment {-- regular points --}] + +[def {canonical serialization}] + +The canonical serialization of a grammar has the format as specified +in the previous item, and then additionally satisfies the constraints +below, which make it unique among all the possible serializations of +this grammar. + +[list_begin enumerated][comment {-- canonical points --}] +[enum] + +The keys found in all the nested Tcl dictionaries are sorted in +ascending dictionary order, as generated by Tcl's builtin command +[cmd {lsort -increasing -dict}]. + +[enum] + +The string representation of the value is the canonical representation +of a Tcl dictionary. I.e. it does not contain superfluous whitespace. + +[list_end][comment {-- canonical points --}] +[list_end][comment {-- serializations --}] + +[subsection Example] + +Assuming the following PEG for simple mathematical expressions + +[para] +[include ../example/expr_peg.inc] +[para] + +then its canonical serialization (except for whitespace) is + +[para] +[include ../example/expr_serial.inc] +[para] diff --git a/tcllib/modules/pt/include/serial/pexpression.inc b/tcllib/modules/pt/include/serial/pexpression.inc new file mode 100644 index 0000000..c0b2255 --- /dev/null +++ b/tcllib/modules/pt/include/serial/pexpression.inc @@ -0,0 +1,245 @@ +[comment {-*- text -*-}] +[section {PE serialization format}] + +Here we specify the format used by the Parser Tools to serialize +Parsing Expressions as immutable values for transport, comparison, +etc. + +[para] + +We distinguish between [term regular] and [term canonical] +serializations. + +While a parsing expression may have more than one regular +serialization only exactly one of them will be [term canonical]. + +[list_begin definitions][comment {-- serializations --}] +[def {Regular serialization}] + +[list_begin definitions][comment {-- regular points --}] + +[def [const {Atomic Parsing Expressions}]] +[list_begin enumerated][comment {-- atomic points --}] + +[enum] +The string [const epsilon] is an atomic parsing expression. It matches +the empty string. + +[enum] +The string [const dot] is an atomic parsing expression. It matches +any character. + +[enum] +The string [const alnum] is an atomic parsing expression. It matches +any Unicode alphabet or digit character. This is a custom extension of +PEs based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const alpha] is an atomic parsing expression. It matches +any Unicode alphabet character. This is a custom extension of PEs +based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const ascii] is an atomic parsing expression. It matches +any Unicode character below U0080. This is a custom extension of PEs +based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const control] is an atomic parsing expression. It matches +any Unicode control character. This is a custom extension of PEs based +on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const digit] is an atomic parsing expression. It matches +any Unicode digit character. Note that this includes characters +outside of the [lb]0..9[rb] range. This is a custom extension of PEs +based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const graph] is an atomic parsing expression. It matches +any Unicode printing character, except for space. This is a custom +extension of PEs based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const lower] is an atomic parsing expression. It matches +any Unicode lower-case alphabet character. This is a custom extension +of PEs based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const print] is an atomic parsing expression. It matches +any Unicode printing character, including space. This is a custom +extension of PEs based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const punct] is an atomic parsing expression. It matches +any Unicode punctuation character. This is a custom extension of PEs +based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const space] is an atomic parsing expression. It matches +any Unicode space character. This is a custom extension of PEs based +on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const upper] is an atomic parsing expression. It matches +any Unicode upper-case alphabet character. This is a custom extension +of PEs based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const wordchar] is an atomic parsing expression. It +matches any Unicode word character. This is any alphanumeric character +(see alnum), and any connector punctuation characters (e.g. +underscore). This is a custom extension of PEs based on Tcl's builtin +command [cmd {string is}]. + +[enum] +The string [const xdigit] is an atomic parsing expression. It matches +any hexadecimal digit character. This is a custom extension of PEs +based on Tcl's builtin command [cmd {string is}]. + +[enum] +The string [const ddigit] is an atomic parsing expression. It matches +any decimal digit character. This is a custom extension of PEs based +on Tcl's builtin command [cmd regexp]. + +[enum] +The expression + [lb]list t [var x][rb] +is an atomic parsing expression. It matches the terminal string [var x]. + +[enum] +The expression + [lb]list n [var A][rb] +is an atomic parsing expression. It matches the nonterminal [var A]. + +[list_end][comment {-- atomic points --}] + +[def [const {Combined Parsing Expressions}]] +[list_begin enumerated][comment {-- combined points --}] + +[enum] +For parsing expressions [var e1], [var e2], ... the result of + + [lb]list / [var e1] [var e2] ... [rb] + +is a parsing expression as well. + +This is the [term {ordered choice}], aka [term {prioritized choice}]. + +[enum] +For parsing expressions [var e1], [var e2], ... the result of + + [lb]list x [var e1] [var e2] ... [rb] + +is a parsing expression as well. + +This is the [term {sequence}]. + +[enum] +For a parsing expression [var e] the result of + + [lb]list * [var e][rb] + +is a parsing expression as well. + +This is the [term {kleene closure}], describing zero or more +repetitions. + +[enum] +For a parsing expression [var e] the result of + + [lb]list + [var e][rb] + +is a parsing expression as well. + +This is the [term {positive kleene closure}], describing one or more +repetitions. + +[enum] +For a parsing expression [var e] the result of + + [lb]list & [var e][rb] + +is a parsing expression as well. + +This is the [term {and lookahead predicate}]. + +[enum] +For a parsing expression [var e] the result of + + [lb]list ! [var e][rb] + +is a parsing expression as well. + +This is the [term {not lookahead predicate}]. + + +[enum] +For a parsing expression [var e] the result of + + [lb]list ? [var e][rb] + +is a parsing expression as well. + +This is the [term {optional input}]. + + +[list_end][comment {-- combined points --}] +[list_end][comment {-- regular points --}] + +[def {Canonical serialization}] + +The canonical serialization of a parsing expression has the format as +specified in the previous item, and then additionally satisfies the +constraints below, which make it unique among all the possible +serializations of this parsing expression. + +[list_begin enumerated][comment {-- canonical points --}] +[enum] + +The string representation of the value is the canonical representation +of a pure Tcl list. I.e. it does not contain superfluous whitespace. + +[enum] + +Terminals are [emph not] encoded as ranges (where start and end of the +range are identical). + +[comment { + Thinking about this I am not sure if that was a good move. + There are a lot more equivalent encodings around that just + the one I used above. Examples + + {x {t a} {t b} {tc } {t d}} + {x {x {t a} {t b}} {x {tc } {t d}}} + {x {x {t a} {t b} {tc } {t d}}} + + etc. Having the t/.. equivalence added it can now be argued + that we should handle these as well. Which essentially + amounts to a whole-sale system to simplify parsing + expressions. This moves expression equality from intensional + to extensional, or as near as is possible. + + The only counter-argument I have is that the t/.. equivalence + is restricted to leaves of the tree, or alternatively, to + terminal symbol operators. +}] + +[list_end][comment {-- canonical points --}] +[list_end][comment {-- serializations --}] +[para] + +[subsection Example] + +Assuming the parsing expression shown on the right-hand side of the +rule + +[para] +[include ../example/expr_pe.inc] +[para] + +then its canonical serialization (except for whitespace) is + +[para] +[include ../example/expr_pe_serial.inc] +[para] |