summaryrefslogtreecommitdiffstats
path: root/tcllib/modules/pt/include/serial
diff options
context:
space:
mode:
Diffstat (limited to 'tcllib/modules/pt/include/serial')
-rw-r--r--tcllib/modules/pt/include/serial/ast.inc104
-rw-r--r--tcllib/modules/pt/include/serial/pegrammar.inc114
-rw-r--r--tcllib/modules/pt/include/serial/pexpression.inc245
3 files changed, 463 insertions, 0 deletions
diff --git a/tcllib/modules/pt/include/serial/ast.inc b/tcllib/modules/pt/include/serial/ast.inc
new file mode 100644
index 0000000..090a8c2
--- /dev/null
+++ b/tcllib/modules/pt/include/serial/ast.inc
@@ -0,0 +1,104 @@
+[comment {-*- text -*-}]
+[section {AST serialization format}]
+
+Here we specify the format used by the Parser Tools to serialize
+Abstract Syntax Trees (ASTs) as immutable values for transport,
+comparison, etc.
+
+[para]
+
+Each node in an AST represents a nonterminal symbol of a grammar, and
+the range of tokens/characters in the input covered by it. ASTs do not
+contain terminal symbols, i.e. tokens/characters. These can be
+recovered from the input given a symbol's location.
+
+[para]
+
+We distinguish between [term regular] and [term canonical]
+serializations.
+
+While a tree may have more than one regular serialization only exactly
+one of them will be [term canonical].
+
+
+[list_begin definitions][comment {-- serializations --}]
+[def {Regular serialization}]
+
+[list_begin enumerated][comment {-- regular points --}]
+
+[enum]
+The serialization of any AST is the serialization of its root node.
+
+[enum]
+The serialization of any node is a Tcl list containing at least three
+elements.
+
+[list_begin enumerated][comment {-- node elements --}]
+[enum]
+The first element is the name of the nonterminal symbol stored in the
+node.
+
+[enum]
+The second and third element are the locations of the first and last
+token in the token stream the node represents (covers).
+
+[list_begin enumerated][comment {--- location constraints}]
+[enum]
+Locations are provided as non-negative integer offsets from the
+beginning of the token stream, with the first token found in the
+stream located at offset 0 (zero).
+
+[enum]
+The end location has to be equal to or larger than the start location.
+
+[list_end][comment {--- location constraints}]
+
+[enum]
+All elements after the first three represent the children of the node,
+which are themselves nodes. This means that the serializations of
+nodes without children, i.e. leaf nodes, have exactly three elements.
+
+The children are stored in the list with the leftmost child first, and
+the rightmost child last.
+
+[list_end][comment {-- node elements --}]
+[list_end][comment {-- regular points --}]
+
+[def {Canonical serialization}]
+
+The canonical serialization of an abstract syntax tree has the format
+as specified in the previous item, and then additionally satisfies the
+constraints below, which make it unique among all the possible
+serializations of this tree.
+
+[list_begin enumerated][comment {-- canonical points --}]
+[enum]
+
+The string representation of the value is the canonical representation
+of a pure Tcl list. I.e. it does not contain superfluous whitespace.
+
+[list_end][comment {-- canonical points --}]
+[list_end][comment {-- serializations --}]
+[para]
+
+[subsection Example]
+
+Assuming the parsing expression grammar below
+
+[para]
+[include ../example/expr_peg.inc]
+[para]
+
+and the input string
+
+[example { 120+5 }]
+
+then a parser should deliver the abstract syntax tree below (except for whitespace)
+
+[para]
+[include ../example/expr_ast.inc]
+[para]
+
+Or, more graphical
+
+[para][image expr_ast][para]
diff --git a/tcllib/modules/pt/include/serial/pegrammar.inc b/tcllib/modules/pt/include/serial/pegrammar.inc
new file mode 100644
index 0000000..4dbdb56
--- /dev/null
+++ b/tcllib/modules/pt/include/serial/pegrammar.inc
@@ -0,0 +1,114 @@
+[section {PEG serialization format}]
+
+Here we specify the format used by the Parser Tools to serialize
+Parsing Expression Grammars as immutable values for transport,
+comparison, etc.
+
+[para]
+
+We distinguish between [term regular] and [term canonical]
+serializations.
+
+While a PEG may have more than one regular serialization only exactly
+one of them will be [term canonical].
+
+
+[list_begin definitions][comment {-- serializations --}]
+[def {regular serialization}]
+
+[list_begin enumerated][comment {-- regular points --}]
+[enum]
+The serialization of any PEG is a nested Tcl dictionary.
+
+[enum]
+This dictionary holds a single key, [const pt::grammar::peg], and its
+value. This value holds the contents of the grammar.
+
+[enum]
+The contents of the grammar are a Tcl dictionary holding the set of
+nonterminal symbols and the starting expression. The relevant keys and
+their values are
+
+[list_begin definitions][comment {-- grammar keywords --}]
+[def [const rules]]
+
+The value is a Tcl dictionary whose keys are the names of the
+nonterminal symbols known to the grammar.
+
+[list_begin enumerated][comment {-- nonterminals --}]
+[enum]
+Each nonterminal symbol may occur only once.
+
+[enum]
+The empty string is not a legal nonterminal symbol.
+
+[enum]
+The value for each symbol is a Tcl dictionary itself. The relevant
+keys and their values in this dictionary are
+
+[list_begin definitions][comment {-- nonterminal keywords --}]
+[def [const is]]
+
+The value is the serialization of the parsing expression describing
+the symbols sentennial structure, as specified in the section
+[sectref {PE serialization format}].
+
+[def [const mode]]
+
+The value can be one of three values specifying how a parser should
+handle the semantic value produced by the symbol.
+
+[include ../modes.inc]
+[list_end][comment {-- nonterminal keywords --}]
+[list_end][comment {-- nonterminals --}]
+
+[def [const start]]
+
+The value is the serialization of the start parsing expression of the
+grammar, as specified in the section [sectref {PE serialization format}].
+
+[list_end][comment {-- grammar keywords --}]
+
+[enum]
+The terminal symbols of the grammar are specified implicitly as the
+set of all terminal symbols used in the start expression and on the
+RHS of the grammar rules.
+
+
+[list_end][comment {-- regular points --}]
+
+[def {canonical serialization}]
+
+The canonical serialization of a grammar has the format as specified
+in the previous item, and then additionally satisfies the constraints
+below, which make it unique among all the possible serializations of
+this grammar.
+
+[list_begin enumerated][comment {-- canonical points --}]
+[enum]
+
+The keys found in all the nested Tcl dictionaries are sorted in
+ascending dictionary order, as generated by Tcl's builtin command
+[cmd {lsort -increasing -dict}].
+
+[enum]
+
+The string representation of the value is the canonical representation
+of a Tcl dictionary. I.e. it does not contain superfluous whitespace.
+
+[list_end][comment {-- canonical points --}]
+[list_end][comment {-- serializations --}]
+
+[subsection Example]
+
+Assuming the following PEG for simple mathematical expressions
+
+[para]
+[include ../example/expr_peg.inc]
+[para]
+
+then its canonical serialization (except for whitespace) is
+
+[para]
+[include ../example/expr_serial.inc]
+[para]
diff --git a/tcllib/modules/pt/include/serial/pexpression.inc b/tcllib/modules/pt/include/serial/pexpression.inc
new file mode 100644
index 0000000..c0b2255
--- /dev/null
+++ b/tcllib/modules/pt/include/serial/pexpression.inc
@@ -0,0 +1,245 @@
+[comment {-*- text -*-}]
+[section {PE serialization format}]
+
+Here we specify the format used by the Parser Tools to serialize
+Parsing Expressions as immutable values for transport, comparison,
+etc.
+
+[para]
+
+We distinguish between [term regular] and [term canonical]
+serializations.
+
+While a parsing expression may have more than one regular
+serialization only exactly one of them will be [term canonical].
+
+[list_begin definitions][comment {-- serializations --}]
+[def {Regular serialization}]
+
+[list_begin definitions][comment {-- regular points --}]
+
+[def [const {Atomic Parsing Expressions}]]
+[list_begin enumerated][comment {-- atomic points --}]
+
+[enum]
+The string [const epsilon] is an atomic parsing expression. It matches
+the empty string.
+
+[enum]
+The string [const dot] is an atomic parsing expression. It matches
+any character.
+
+[enum]
+The string [const alnum] is an atomic parsing expression. It matches
+any Unicode alphabet or digit character. This is a custom extension of
+PEs based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const alpha] is an atomic parsing expression. It matches
+any Unicode alphabet character. This is a custom extension of PEs
+based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const ascii] is an atomic parsing expression. It matches
+any Unicode character below U0080. This is a custom extension of PEs
+based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const control] is an atomic parsing expression. It matches
+any Unicode control character. This is a custom extension of PEs based
+on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const digit] is an atomic parsing expression. It matches
+any Unicode digit character. Note that this includes characters
+outside of the [lb]0..9[rb] range. This is a custom extension of PEs
+based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const graph] is an atomic parsing expression. It matches
+any Unicode printing character, except for space. This is a custom
+extension of PEs based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const lower] is an atomic parsing expression. It matches
+any Unicode lower-case alphabet character. This is a custom extension
+of PEs based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const print] is an atomic parsing expression. It matches
+any Unicode printing character, including space. This is a custom
+extension of PEs based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const punct] is an atomic parsing expression. It matches
+any Unicode punctuation character. This is a custom extension of PEs
+based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const space] is an atomic parsing expression. It matches
+any Unicode space character. This is a custom extension of PEs based
+on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const upper] is an atomic parsing expression. It matches
+any Unicode upper-case alphabet character. This is a custom extension
+of PEs based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const wordchar] is an atomic parsing expression. It
+matches any Unicode word character. This is any alphanumeric character
+(see alnum), and any connector punctuation characters (e.g.
+underscore). This is a custom extension of PEs based on Tcl's builtin
+command [cmd {string is}].
+
+[enum]
+The string [const xdigit] is an atomic parsing expression. It matches
+any hexadecimal digit character. This is a custom extension of PEs
+based on Tcl's builtin command [cmd {string is}].
+
+[enum]
+The string [const ddigit] is an atomic parsing expression. It matches
+any decimal digit character. This is a custom extension of PEs based
+on Tcl's builtin command [cmd regexp].
+
+[enum]
+The expression
+ [lb]list t [var x][rb]
+is an atomic parsing expression. It matches the terminal string [var x].
+
+[enum]
+The expression
+ [lb]list n [var A][rb]
+is an atomic parsing expression. It matches the nonterminal [var A].
+
+[list_end][comment {-- atomic points --}]
+
+[def [const {Combined Parsing Expressions}]]
+[list_begin enumerated][comment {-- combined points --}]
+
+[enum]
+For parsing expressions [var e1], [var e2], ... the result of
+
+ [lb]list / [var e1] [var e2] ... [rb]
+
+is a parsing expression as well.
+
+This is the [term {ordered choice}], aka [term {prioritized choice}].
+
+[enum]
+For parsing expressions [var e1], [var e2], ... the result of
+
+ [lb]list x [var e1] [var e2] ... [rb]
+
+is a parsing expression as well.
+
+This is the [term {sequence}].
+
+[enum]
+For a parsing expression [var e] the result of
+
+ [lb]list * [var e][rb]
+
+is a parsing expression as well.
+
+This is the [term {kleene closure}], describing zero or more
+repetitions.
+
+[enum]
+For a parsing expression [var e] the result of
+
+ [lb]list + [var e][rb]
+
+is a parsing expression as well.
+
+This is the [term {positive kleene closure}], describing one or more
+repetitions.
+
+[enum]
+For a parsing expression [var e] the result of
+
+ [lb]list & [var e][rb]
+
+is a parsing expression as well.
+
+This is the [term {and lookahead predicate}].
+
+[enum]
+For a parsing expression [var e] the result of
+
+ [lb]list ! [var e][rb]
+
+is a parsing expression as well.
+
+This is the [term {not lookahead predicate}].
+
+
+[enum]
+For a parsing expression [var e] the result of
+
+ [lb]list ? [var e][rb]
+
+is a parsing expression as well.
+
+This is the [term {optional input}].
+
+
+[list_end][comment {-- combined points --}]
+[list_end][comment {-- regular points --}]
+
+[def {Canonical serialization}]
+
+The canonical serialization of a parsing expression has the format as
+specified in the previous item, and then additionally satisfies the
+constraints below, which make it unique among all the possible
+serializations of this parsing expression.
+
+[list_begin enumerated][comment {-- canonical points --}]
+[enum]
+
+The string representation of the value is the canonical representation
+of a pure Tcl list. I.e. it does not contain superfluous whitespace.
+
+[enum]
+
+Terminals are [emph not] encoded as ranges (where start and end of the
+range are identical).
+
+[comment {
+ Thinking about this I am not sure if that was a good move.
+ There are a lot more equivalent encodings around that just
+ the one I used above. Examples
+
+ {x {t a} {t b} {tc } {t d}}
+ {x {x {t a} {t b}} {x {tc } {t d}}}
+ {x {x {t a} {t b} {tc } {t d}}}
+
+ etc. Having the t/.. equivalence added it can now be argued
+ that we should handle these as well. Which essentially
+ amounts to a whole-sale system to simplify parsing
+ expressions. This moves expression equality from intensional
+ to extensional, or as near as is possible.
+
+ The only counter-argument I have is that the t/.. equivalence
+ is restricted to leaves of the tree, or alternatively, to
+ terminal symbol operators.
+}]
+
+[list_end][comment {-- canonical points --}]
+[list_end][comment {-- serializations --}]
+[para]
+
+[subsection Example]
+
+Assuming the parsing expression shown on the right-hand side of the
+rule
+
+[para]
+[include ../example/expr_pe.inc]
+[para]
+
+then its canonical serialization (except for whitespace) is
+
+[para]
+[include ../example/expr_pe_serial.inc]
+[para]