diff options
Diffstat (limited to 'tcllib/modules/grammar_peg/peg.man')
-rw-r--r-- | tcllib/modules/grammar_peg/peg.man | 721 |
1 files changed, 721 insertions, 0 deletions
diff --git a/tcllib/modules/grammar_peg/peg.man b/tcllib/modules/grammar_peg/peg.man new file mode 100644 index 0000000..c5d7de8 --- /dev/null +++ b/tcllib/modules/grammar_peg/peg.man @@ -0,0 +1,721 @@ +[comment {-*- tcl -*- doctools manpage}] +[manpage_begin grammar::peg n 0.1] +[keywords {context-free languages}] +[keywords expression] +[keywords grammar] +[keywords LL(k)] +[keywords parsing] +[keywords {parsing expression}] +[keywords {parsing expression grammar}] +[keywords {push down automaton}] +[keywords {recursive descent}] +[keywords state] +[keywords TDPL] +[keywords {top-down parsing languages}] +[keywords transducer] +[copyright {2005 Andreas Kupries <andreas_kupries@users.sourceforge.net>}] +[moddesc {Grammar operations and usage}] +[titledesc {Create and manipulate parsing expression grammars}] +[category {Grammars and finite automata}] +[require Tcl 8.4] +[require snit] +[require grammar::peg [opt 0.1]] +[description] +[para] + +This package provides a container class for +[term {parsing expression grammars}] (Short: PEG). + +It allows the incremental definition of the grammar, its manipulation +and querying of the definition. + +The package neither provides complex operations on the grammar, nor has +it the ability to execute a grammar definition for a stream of symbols. + +Two packages related to this one are [package grammar::mengine] and +[package grammar::peg::interpreter]. The first of them defines a +general virtual machine for the matching of a character stream, and +the second implements an interpreter for parsing expression grammars +on top of that virtual machine. + +[subsection {TERMS & CONCEPTS}] + +PEGs are similar to context-free grammars, but not equivalent; in some +cases PEGs are strictly more powerful than context-free grammars (there +exist PEGs for some non-context-free languages). + +The formal mathematical definition of parsing expressions and parsing +expression grammars can be found in section +[sectref {PARSING EXPRESSION GRAMMARS}]. + +[para] + +In short, we have [term {terminal symbols}], which are the most basic +building blocks for [term sentences], and [term {nonterminal symbols}] +with associated [term {parsing expressions}], defining the grammatical +structure of the sentences. The two sets of symbols are distinctive, +and do not overlap. When speaking about symbols the word "symbol" is +often left out. The union of the sets of terminal and nonterminal +symbols is called the set of [term symbols]. + +[para] + +Here the set of [term {terminal symbols}] is not explicitly managed, +but implicitly defined as the set of all characters. Note that this +means that we inherit from Tcl the ability to handle all of Unicode. + +[para] + +A pair of [term nonterminal] and [term {parsing expression}] is also +called a [term {grammatical rule}], or [term rule] for short. In the +context of a rule the nonterminal is often called the left-hand-side +(LHS), and the parsing expression the right-hand-side (RHS). + +[para] + +The [term {start expression}] of a grammar is a parsing expression +from which all the sentences contained in the language specified by +the grammar are [term derived]. + +To make the understanding of this term easier let us assume for a +moment that the RHS of each rule, and the start expression, is either +a sequence of symbols, or a series of alternate parsing expressions. +In the latter case the rule can be seen as a set of rules, each +providing one alternative for the nonterminal. + +A parsing expression A' is now a derivation of a parsing expression A +if we pick one of the nonterminals N in the expression, and one of the +alternative rules R for N, and then replace the nonterminal in A with +the RHS of the chosen rule. Here we can see why the terminal symbols +are called such. They cannot be expanded any further, thus terminate +the process of deriving new expressions. + +An example + +[para][example { + Rules + (1) A <- a B c + (2a) B <- d B + (2b) B <- e + + Some derivations, using starting expression A. + + A -/1/-> a B c -/2a/-> a d B c -/2b/-> a d e c +}][para] + +A derived expression containing only terminal symbols is a +[term sentence]. The set of all sentences which can be derived from +the start expression is the [term language] of the grammar. + +[para] + +Some definitions for nonterminals and expressions: + +[list_begin enumerated] +[enum] +A nonterminal A is called [term reachable] if it is possible to derive +a parsing expression from the start expression which contains A. + +[enum] +A nonterminal A is called [term useful] if it is possible to derive a +sentence from it. + +[enum] +A nonterminal A is called [term recursive] if it is possible to derive +a parsing expression from it which contains A, again. + +[enum] +The [term {FIRST set}] of a nonterminal A contains all the symbols which +can occur of as the leftmost symbol in a parsing expression derived from +A. If the FIRST set contains A itself then that nonterminal is called +[term left-recursive]. + +[enum] +The [term {LAST set}] of a nonterminal A contains all the symbols which +can occur of as the rightmost symbol in a parsing expression derived from +A. If the LAST set contains A itself then that nonterminal is called +[term right-recursive]. + +[enum] +The [term {FOLLOW set}] of a nonterminal A contains all the symbols which +can occur after A in a parsing expression derived from the start +expression. + +[enum] +A nonterminal (or parsing expression) is called [term nullable] if the +empty sentence can be derived from it. + +[list_end] +[para] + +And based on the above definitions for grammars: + +[list_begin enumerated] +[enum] +A grammar G is [term recursive] if and only if it contains a nonterminal +A which is recursive. The terms [term left-] and [term right-recursive], +and [term useful] are analogously defined. + +[enum] +A grammar is [term minimal] if it contains only [term reachable] and +[term useful] nonterminals. + +[enum] +A grammar is [term wellformed] if it is not left-recursive. Such +grammars are also [term complete], which means that they always succeed +or fail on all input sentences. For an incomplete grammar on the +other hand input sentences exist for which an attempt to match them +against the grammar will not terminate. + +[enum] +As we wish to allow ourselves to build a grammar incrementally in a +container object we will encounter stages where the RHS of one or more +rules reference symbols which are not yet known to the container. Such +a grammar we call [term invalid]. + +We cannot use the term [term incomplete] as this term is already +taken, see the last item. + +[list_end] +[para] + +[subsection {CONTAINER CLASS API}] + +The package exports the API described here. + +[list_begin definitions] + +[call [cmd ::grammar::peg] [arg pegName] \ + [opt "[const =]|[const :=]|[const <--]|[const as]|[const deserialize] [arg src]"]] + +The command creates a new container object for a parsing expression +grammar and returns the fully qualified name of the object command as +its result. The API the returned command is following is described in +the section [sectref {CONTAINER OBJECT API}]. It may be used to invoke +various operations on the container and the grammar within. + +[para] + +The new container, i.e. grammar will be empty if no [arg src] is +specified. Otherwise it will contain a copy of the grammar contained +in the [arg src]. + +The [arg src] has to be a container object reference for all operators +except [const deserialize]. + +The [const deserialize] operator requires [arg src] to be the +serialization of a parsing expression grammar instead. + +[para] + +An empty grammar has no nonterminal symbols, and the start expression +is the empty expression, i.e. epsilon. It is [term valid], but not +[term useful]. + +[list_end] + +[subsection {CONTAINER OBJECT API}] +[para] + +All grammar container objects provide the following methods for the +manipulation of their contents: + +[list_begin definitions] + +[call [arg pegName] [method destroy]] + +Destroys the grammar, including its storage space and associated +command. + +[call [arg pegName] [method clear]] + +Clears out the definition of the grammar contained in [arg pegName], +but does [emph not] destroy the object. + +[call [arg pegName] [method =] [arg srcPEG]] + +Assigns the contents of the grammar contained in [arg srcPEG] to +[arg pegName], overwriting any existing definition. + +This is the assignment operator for grammars. It copies the grammar +contained in the grammar object [arg srcPEG] over the grammar +definition in [arg pegName]. The old contents of [arg pegName] are +deleted by this operation. + +[para] + +This operation is in effect equivalent to +[para] +[example_begin] + [arg pegName] [method deserialize] [lb][arg srcPEG] [method serialize][rb] +[example_end] + +[call [arg pegName] [method -->] [arg dstPEG]] + +This is the reverse assignment operator for grammars. It copies the +automation contained in the object [arg pegName] over the grammar +definition in the object [arg dstPEG]. + +The old contents of [arg dstPEG] are deleted by this operation. + +[para] + +This operation is in effect equivalent to +[para] +[example_begin] + [arg dstPEG] [method deserialize] [lb][arg pegName] [method serialize][rb] +[example_end] + +[call [arg pegName] [method serialize]] + +This method serializes the grammar stored in [arg pegName]. In other +words it returns a tcl [emph value] completely describing that +grammar. + +This allows, for example, the transfer of grammars over arbitrary +channels, persistence, etc. + +This method is also the basis for both the copy constructor and the +assignment operator. + +[para] + +The result of this method has to be semantically identical over all +implementations of the [package grammar::peg] interface. This is what +will enable us to copy grammars between different implementations of +the same interface. + +[para] + +The result is a list of four elements with the following structure: + +[list_begin enumerated] +[enum] +The constant string [const grammar::peg]. + +[enum] +A dictionary. Its keys are the names of all known nonterminal symbols, +and their associated values are the parsing expressions describing +their sentennial structure. + +[enum] +A dictionary. Its keys are the names of all known nonterminal symbols, +and their associated values hints to a matcher regarding the semantic +values produced by the symbol. + +[enum] +The last item is a parsing expression, the [term {start expression}] +of the grammar. + +[list_end] +[para] + +Assuming the following PEG for simple mathematical expressions + +[para] +[example { + Digit <- '0'/'1'/'2'/'3'/'4'/'5'/'6'/'7'/'8'/'9' + Sign <- '+' / '-' + Number <- Sign? Digit+ + Expression <- '(' Expression ')' / (Factor (MulOp Factor)*) + MulOp <- '*' / '/' + Factor <- Term (AddOp Term)* + AddOp <- '+'/'-' + Term <- Number +}] +[para] + +a possible serialization is + +[para] +[example { + grammar::peg \\ + {Expression {/ {x ( Expression )} {x Factor {* {x MulOp Factor}}}} \\ + Factor {x Term {* {x AddOp Term}}} \\ + Term Number \\ + MulOp {/ * /} \\ + AddOp {/ + -} \\ + Number {x {? Sign} {+ Digit}} \\ + Sign {/ + -} \\ + Digit {/ 0 1 2 3 4 5 6 7 8 9} \\ + } \\ + {Expression value Factor value \\ + Term value MulOp value \\ + AddOp value Number value \\ + Sign value Digit value \\ + } + Expression +}] +[para] + +A possible one, because the order of the nonterminals in the +dictionary is not relevant. + +[call [arg pegName] [method deserialize] [arg serialization]] + +This is the complement to [method serialize]. It replaces the grammar +definition in [arg pegName] with the grammar described by the +[arg serialization] value. The old contents of [arg pegName] are +deleted by this operation. + +[call [arg pegName] [method {is valid}]] + +A predicate. It tests whether the PEG in [arg pegName] is [term valid]. +See section [sectref {TERMS & CONCEPTS}] for the definition of this +grammar property. + +The result is a boolean value. It will be set to [const true] if +the PEG has the tested property, and [const false] otherwise. + +[call [arg pegName] [method start] [opt [arg pe]]] + +This method defines the [term {start expression}] of the grammar. It +replaces the previously defined start expression with the parsing +expression [arg pe]. + +The method fails and throws an error if [arg pe] does not contain a +valid parsing expression as specified in the section +[sectref {PARSING EXPRESSIONS}]. In that case the existing start +expression is not changed. + +The method returns the empty string as its result. + +[para] + +If the method is called without an argument it will return the currently +defined start expression. + +[call [arg pegName] [method nonterminals]] + +Returns the set of all nonterminal symbols known to the grammar. + +[call [arg pegName] [method {nonterminal add}] [arg nt] [arg pe]] + +This method adds the nonterminal [arg nt] and its associated parsing +expression [arg pe] to the set of nonterminal symbols and rules of the +PEG contained in the object [arg pegName]. + +The method fails and throws an error if either the string [arg nt] is +already known as a symbol of the grammar, or if [arg pe] does not +contain a valid parsing expression as specified in the section +[sectref {PARSING EXPRESSIONS}]. In that case the current set of +nonterminal symbols and rules is not changed. + +The method returns the empty string as its result. + +[call [arg pegName] [method {nonterminal delete}] [arg nt1] [opt "[arg nt2] ..."]] + +This method removes the named symbols [arg nt1], [arg nt2] from the +set of nonterminal symbols of the PEG contained in the object +[arg pegName]. + +The method fails and throws an error if any of the strings is not +known as a nonterminal symbol. In that case the current set of +nonterminal symbols is not changed. + +The method returns the empty string as its result. + +[para] + +The stored grammar becomes invalid if the deleted nonterminals are +referenced by the RHS of still-known rules. + +[call [arg pegName] [method {nonterminal exists}] [arg nt]] + +A predicate. It tests whether the nonterminal symbol [arg nt] is known +to the PEG in [arg pegName]. + +The result is a boolean value. It will be set to [const true] if the +symbol [arg nt] is known, and [const false] otherwise. + +[call [arg pegName] [method {nonterminal rename}] [arg nt] [arg ntnew]] + +This method renames the nonterminal symbol [arg nt] to [arg ntnew]. + +The method fails and throws an error if either [arg nt] is not known +as a nonterminal, or if [arg ntnew] is a known symbol. + +The method returns the empty string as its result. + +[call [arg pegName] [method {nonterminal mode}] [arg nt] [opt [arg mode]]] + +This mode returns or sets the semantic mode associated with the +nonterminal symbol [arg nt]. If no [arg mode] is specified the +current mode of the nonterminal is returned. Otherwise the current +mode is set to [arg mode]. + +The method fails and throws an error if [arg nt] is not known as a +nonterminal. + +The grammar interpreter implemented by the package +[package grammar::peg::interpreter] recognizes the +following modes: + +[list_begin definitions] +[def value] + +The semantic value of the nonterminal is the abstract syntax tree +created from the AST's of the RHS and a node for the nonterminal +itself. + +[def match] + +The semantic value of the nonterminal is an the abstract syntax tree +consisting of single a node for the string matched by the RHS. The ASTs +generated by the RHS are discarded. + +[def leaf] + +The semantic value of the nonterminal is an the abstract syntax tree +consisting of single a node for the nonterminal itself. The ASTs +generated by the RHS are discarded. + +[def discard] + +The nonterminal has no semantic value. The ASTs generated by the RHS +are discarded (as well). + +[list_end] + +[call [arg pegName] [method {nonterminal rule}] [arg nt]] + +This method returns the parsing expression associated with the +nonterminal [arg nt]. + +The method fails and throws an error if [arg nt] is not known as a +nonterminal. + +[call [arg pegName] [method {unknown nonterminals}]] + +This method returns a list containing the names of all nonterminal +symbols which are referenced on the RHS of a grammatical rule, but +have no rule definining their structure. In other words, a list of +the nonterminal symbols which make the grammar invalid. The grammar +is valid if this list is empty. + +[list_end] + +[para] + +[subsection {PARSING EXPRESSIONS}] +[para] + +Various methods of PEG container objects expect a parsing expression +as their argument, or will return such. This section specifies the +format such parsing expressions are in. + +[para] + +[list_begin enumerated] +[enum] +The string [const epsilon] is an atomic parsing expression. It matches +the empty string. + +[enum] +The string [const alnum] is an atomic parsing expression. It matches +any alphanumeric character. + +[enum] +The string [const alpha] is an atomic parsing expression. It matches +any alphabetical character. + +[enum] +The string [const dot] is an atomic parsing expression. It matches +any character. + +[enum] +The expression + [lb]list t [var x][rb] +is an atomic parsing expression. It matches the terminal string [var x]. + +[enum] +The expression + [lb]list n [var A][rb] +is an atomic parsing expression. It matches the nonterminal [var A]. + +[enum] +For parsing expressions [var e1], [var e2], ... the result of + + [lb]list / [var e1] [var e2] ... [rb] + +is a parsing expression as well. + +This is the [term {ordered choice}], aka [term {prioritized choice}]. + +[enum] +For parsing expressions [var e1], [var e2], ... the result of + + [lb]list x [var e1] [var e2] ... [rb] + +is a parsing expression as well. + +This is the [term {sequence}]. + +[enum] +For a parsing expression [var e] the result of + + [lb]list * [var e][rb] + +is a parsing expression as well. + +This is the [term {kleene closure}], describing zero or more +repetitions. + +[enum] +For a parsing expression [var e] the result of + + [lb]list + [var e][rb] + +is a parsing expression as well. + +This is the [term {positive kleene closure}], describing one or more +repetitions. + +[enum] +For a parsing expression [var e] the result of + + [lb]list & [var e][rb] + +is a parsing expression as well. + +This is the [term {and lookahead predicate}]. + +[enum] +For a parsing expression [var e] the result of + + [lb]list ! [var e][rb] + +is a parsing expression as well. + +This is the [term {not lookahead predicate}]. + +[enum] +For a parsing expression [var e] the result of + + [lb]list ? [var e][rb] + +is a parsing expression as well. + +This is the [term {optional input}]. + +[list_end] +[para] + +Examples of parsing expressions where already shown, in the +description of the method [method serialize]. + +[section {PARSING EXPRESSION GRAMMARS}] +[para] + +For the mathematically inclined, a PEG is a 4-tuple (VN,VT,R,eS) where + +[list_begin itemized] +[item] +VN is a set of [term {nonterminal symbols}], + +[item] +VT is a set of [term {terminal symbols}], + +[item] +R is a finite set of rules, where each rule is a pair (A,e), A in VN, +and [term e] a [term {parsing expression}]. + +[item] +eS is a parsing expression, the [term {start expression}]. + +[list_end] +[para] + +Further constraints are + +[list_begin itemized] +[item] +The intersection of VN and VT is empty. + +[item] +For all A in VT exists exactly one pair (A,e) in R. In other words, R +is a function from nonterminal symbols to parsing expressions. + +[list_end] +[para] + +Parsing expression are inductively defined via + +[list_begin itemized] +[item] +The empty string (epsilon) is a parsing expression. + +[item] +A terminal symbol [term a] is a parsing expression. + +[item] +A nonterminal symbol [term A] is a parsing expression. + +[item] +[term e1][term e2] is a parsing expression for parsing expressions +[term e1] and [term 2]. This is called [term sequence]. + +[item] +[term e1]/[term e2] is a parsing expression for parsing expressions +[term e1] and [term 2]. This is called [term {ordered choice}]. + +[item] +[term e]* is a parsing expression for parsing expression +[term e]. This is called [term {zero-or-more repetitions}], also known +as [term {kleene closure}]. + +[item] +[term e]+ is a parsing expression for parsing expression +[term e]. This is called [term {one-or-more repetitions}], also known +as [term {positive kleene closure}]. + +[item] +![term e] is a parsing expression for parsing expression +[term e1]. This is called a [term {not lookahead predicate}]. + +[item] +&[term e] is a parsing expression for parsing expression +[term e1]. This is called an [term {and lookahead predicate}]. + +[list_end] +[para] + +[para] + +PEGs are used to define a grammatical structure for streams of symbols +over VT. They are a modern phrasing of older formalisms invented by +Alexander Birham. These formalisms were called TS (TMG recognition +scheme), and gTS (generalized TS). Later they were renamed to TPDL +(Top-Down Parsing Languages) and gTPDL (generalized TPDL). + +[para] + +They can be easily implemented by recursive descent parsers with +backtracking. This makes them relatives of LL(k) Context-Free +Grammars. + +[section REFERENCES] + +[list_begin enumerated] +[enum] +[uri {http://www.pdos.lcs.mit.edu/~baford/packrat/} \ + {The Packrat Parsing and Parsing Expression Grammars Page}], +by Bryan Ford, Massachusetts Institute of Technology. This is the main +entry page to PEGs, and their realization through Packrat Parsers. + +[enum] +[uri {http://www.cs.vu.nl/~dick/PTAPG.html} \ + {Parsing Techniques - A Practical Guide }], an online book +offering a clear, accessible, and thorough discussion of many +different parsing techniques with their interrelations and +applicabilities, including error recovery techniques. + +[enum] +[uri {http://scifac.ru.ac.za/compilers/} \ + {Compilers and Compiler Generators}], an online book using +CoCo/R, a generator for recursive descent parsers. +[list_end] + +[vset CATEGORY grammar_peg] +[include ../doctools2base/include/feedback.inc] +[manpage_end] |