Merge commit '29ccecd87709feda60d191f6aaba324ccad91f55' as 'tcl8.6'

author: William Joye <wjoye@cfa.harvard.edu> 2017-09-22 18:57:19 (GMT)
committer: William Joye <wjoye@cfa.harvard.edu> 2017-09-22 18:57:19 (GMT)
commit: 2aff4a96fa0286d875bddec0019648e2c6431cbc (patch)
tree: f7a9a4800a3f3ad4b77470b8383529176d8b7181 /tcl8.6/doc/re_syntax.n
parent: 3fa8e6dc88e8041b6cb88d1b1e9c05676d3346b7 (diff)
parent: 29ccecd87709feda60d191f6aaba324ccad91f55 (diff)
download: blt-2aff4a96fa0286d875bddec0019648e2c6431cbc.zip
blt-2aff4a96fa0286d875bddec0019648e2c6431cbc.tar.gz
blt-2aff4a96fa0286d875bddec0019648e2c6431cbc.tar.bz2
1 files changed, 858 insertions, 0 deletions
diff --git a/tcl8.6/doc/re_syntax.n b/tcl8.6/doc/re_syntax.n
new file mode 100644
index 0000000..7988071
--- /dev/null
+++ b/tcl8.6/doc/re_syntax.n
@@ -0,0 +1,858 @@
+'\"
+'\" Copyright (c) 1998 Sun Microsystems, Inc.
+'\" Copyright (c) 1999 Scriptics Corporation
+'\"
+'\" See the file "license.terms" for information on usage and redistribution
+'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
+'\"
+.so man.macros
+.ie '\w'o''\w'\C'^o''' .ds qo \C'^o'
+.el .ds qo u
+.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands"
+.BS
+.SH NAME
+re_syntax \- Syntax of Tcl regular expressions
+.BE
+.SH DESCRIPTION
+.PP
+A \fIregular expression\fR describes strings of characters.
+It's a pattern that matches certain strings and does not match others.
+.SH "DIFFERENT FLAVORS OF REs"
+Regular expressions
+.PQ RE s ,
+as defined by POSIX, come in two flavors: \fIextended\fR REs
+.PQ ERE s
+and \fIbasic\fR REs
+.PQ BRE s .
+EREs are roughly those of the traditional \fIegrep\fR, while BREs are
+roughly those of the traditional \fIed\fR. This implementation adds
+a third flavor, \fIadvanced\fR REs
+.PQ ARE s ,
+basically EREs with some significant extensions.
+.PP
+This manual page primarily describes AREs. BREs mostly exist for
+backward compatibility in some old programs; they will be discussed at
+the end. POSIX EREs are almost an exact subset of AREs. Features of
+AREs that are not present in EREs will be indicated.
+.SH "REGULAR EXPRESSION SYNTAX"
+.PP
+Tcl regular expressions are implemented using the package written by
+Henry Spencer, based on the 1003.2 spec and some (not quite all) of
+the Perl5 extensions (thanks, Henry!). Much of the description of
+regular expressions below is copied verbatim from his manual entry.
+.PP
+An ARE is one or more \fIbranches\fR,
+separated by
+.QW \fB|\fR ,
+matching anything that matches any of the branches.
+.PP
+A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR,
+concatenated.
+It matches a match for the first, followed by a match for the second, etc;
+an empty branch matches the empty string.
+.SS QUANTIFIERS
+A quantified atom is an \fIatom\fR possibly followed
+by a single \fIquantifier\fR.
+Without a quantifier, it matches a single match for the atom.
+The quantifiers,
+and what a so-quantified atom matches, are:
+.RS 2
+.TP 6
+\fB*\fR
+.
+a sequence of 0 or more matches of the atom
+.TP
+\fB+\fR
+.
+a sequence of 1 or more matches of the atom
+.TP
+\fB?\fR
+.
+a sequence of 0 or 1 matches of the atom
+.TP
+\fB{\fIm\fB}\fR
+.
+a sequence of exactly \fIm\fR matches of the atom
+.TP
+\fB{\fIm\fB,}\fR
+.
+a sequence of \fIm\fR or more matches of the atom
+.TP
+\fB{\fIm\fB,\fIn\fB}\fR
+.
+a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom;
+\fIm\fR may not exceed \fIn\fR
+.TP
+\fB*?  +?  ??  {\fIm\fB}?  {\fIm\fB,}?  {\fIm\fB,\fIn\fB}?\fR
+.
+\fInon-greedy\fR quantifiers, which match the same possibilities,
+but prefer the smallest number rather than the largest number
+of matches (see \fBMATCHING\fR)
+.RE
+.PP
+The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The
+numbers \fIm\fR and \fIn\fR are unsigned decimal integers with
+permissible values from 0 to 255 inclusive.
+.SS ATOMS
+An atom is one of:
+.RS 2
+.IP \fB(\fIre\fB)\fR 6
+matches a match for \fIre\fR (\fIre\fR is any regular expression) with
+the match noted for possible reporting
+.IP \fB(?:\fIre\fB)\fR
+as previous, but does no reporting (a
+.QW non-capturing
+set of parentheses)
+.IP \fB()\fR
+matches an empty string, noted for possible reporting
+.IP \fB(?:)\fR
+matches an empty string, without reporting
+.IP \fB[\fIchars\fB]\fR
+a \fIbracket expression\fR, matching any one of the \fIchars\fR (see
+\fBBRACKET EXPRESSIONS\fR for more detail)
+.IP \fB.\fR
+matches any single character
+.IP \fB\e\fIk\fR
+matches the non-alphanumeric character \fIk\fR
+taken as an ordinary character, e.g. \fB\e\e\fR matches a backslash
+character
+.IP \fB\e\fIc\fR
+where \fIc\fR is alphanumeric (possibly followed by other characters),
+an \fIescape\fR (AREs only), see \fBESCAPES\fR below
+.IP \fB{\fR
+when followed by a character other than a digit, matches the
+left-brace character
+.QW \fB{\fR ;
+when followed by a digit, it is the beginning of a \fIbound\fR (see above)
+.IP \fIx\fR
+where \fIx\fR is a single character with no other significance,
+matches that character.
+.RE
+.SS CONSTRAINTS
+A \fIconstraint\fR matches an empty string when specific conditions
+are met. A constraint may not be followed by a quantifier. The
+simple constraints are as follows; some more constraints are described
+later, under \fBESCAPES\fR.
+.RS 2
+.TP 8
+\fB^\fR
+.
+matches at the beginning of a line
+.TP
+\fB$\fR
+.
+matches at the end of a line
+.TP
+\fB(?=\fIre\fB)\fR
+.
+\fIpositive lookahead\fR (AREs only), matches at any point where a
+substring matching \fIre\fR begins
+.TP
+\fB(?!\fIre\fB)\fR
+.
+\fInegative lookahead\fR (AREs only), matches at any point where no
+substring matching \fIre\fR begins
+.RE
+.PP
+The lookahead constraints may not contain back references (see later),
+and all parentheses within them are considered non-capturing.
+.PP
+An RE may not end with
+.QW \fB\e\fR .
+.SH "BRACKET EXPRESSIONS"
+A \fIbracket expression\fR is a list of characters enclosed in
+.QW \fB[\|]\fR .
+It normally matches any single character from the list
+(but see below). If the list begins with
+.QW \fB^\fR ,
+it matches any single character (but see below) \fInot\fR from the
+rest of the list.
+.PP
+If two characters in the list are separated by
+.QW \fB\-\fR ,
+this is shorthand for the full \fIrange\fR of characters between those two
+(inclusive) in the collating sequence, e.g.
+.QW \fB[0\-9]\fR
+in Unicode matches any conventional decimal digit. Two ranges may not share an
+endpoint, so e.g.
+.QW \fBa\-c\-e\fR
+is illegal. Ranges in Tcl always use the
+Unicode collating sequence, but other programs may use other collating
+sequences and this can be a source of incompatibility between programs.
+.PP
+To include a literal \fB]\fR or \fB\-\fR in the list, the simplest
+method is to enclose it in \fB[.\fR and \fB.]\fR to make it a
+collating element (see below). Alternatively, make it the first
+character (following a possible
+.QW \fB^\fR ),
+or (AREs only) precede it with
+.QW \fB\e\fR .
+Alternatively, for
+.QW \fB\-\fR ,
+make it the last character, or the second endpoint of a range. To use
+a literal \fB\-\fR as the first endpoint of a range, make it a
+collating element or (AREs only) precede it with
+.QW \fB\e\fR .
+With the exception of
+these, some combinations using \fB[\fR (see next paragraphs), and
+escapes, all other special characters lose their special significance
+within a bracket expression.
+.SS "CHARACTER CLASSES"
+Within a bracket expression, the name of a \fIcharacter class\fR
+enclosed in \fB[:\fR and \fB:]\fR stands for the list of all
+characters (not all collating elements!) belonging to that class.
+Standard character classes are:
+.IP \fBalpha\fR 8
+A letter.
+.IP \fBupper\fR 8
+An upper-case letter.
+.IP \fBlower\fR 8
+A lower-case letter.
+.IP \fBdigit\fR 8
+A decimal digit.
+.IP \fBxdigit\fR 8
+A hexadecimal digit.
+.IP \fBalnum\fR 8
+An alphanumeric (letter or digit).
+.IP \fBprint\fR 8
+A "printable" (same as graph, except also including space).
+.IP \fBblank\fR 8
+A space or tab character.
+.IP \fBspace\fR 8
+A character producing white space in displayed text.
+.IP \fBpunct\fR 8
+A punctuation character.
+.IP \fBgraph\fR 8
+A character with a visible representation (includes both \fBalnum\fR
+and \fBpunct\fR).
+.IP \fBcntrl\fR 8
+A control character.
+.PP
+A locale may provide others. A character class may not be used as an endpoint
+of a range.
+.RS
+.PP
+(\fINote:\fR the current Tcl implementation has only one locale, the Unicode
+locale, which supports exactly the above classes.)
+.RE
+.SS "BRACKETED CONSTRAINTS"
+There are two special cases of bracket expressions: the bracket
+expressions
+.QW \fB[[:<:]]\fR
+and
+.QW \fB[[:>:]]\fR
+are constraints, matching empty strings at the beginning and end of a word
+respectively.
+.\" note, discussion of escapes below references this definition of word
+A word is defined as a sequence of word characters that is neither preceded
+nor followed by word characters. A word character is an \fIalnum\fR character
+or an underscore
+.PQ \fB_\fR "" .
+These special bracket expressions are deprecated; users of AREs should use
+constraint escapes instead (see below).
+.SS "COLLATING ELEMENTS"
+Within a bracket expression, a collating element (a character, a
+multi-character sequence that collates as if it were a single
+character, or a collating-sequence name for either) enclosed in
+\fB[.\fR and \fB.]\fR stands for the sequence of characters of that
+collating element. The sequence is a single element of the bracket
+expression's list. A bracket expression in a locale that has
+multi-character collating elements can thus match more than one
+character. So (insidiously), a bracket expression that starts with
+\fB^\fR can match multi-character collating elements even if none of
+them appear in the bracket expression!
+.RS
+.PP
+(\fINote:\fR Tcl has no multi-character collating elements. This information
+is only for illustration.)
+.RE
+.PP
+For example, assume the collating sequence includes a \fBch\fR multi-character
+collating element. Then the RE
+.QW \fB[[.ch.]]*c\fR
+(zero or more
+.QW \fBch\fRs
+followed by
+.QW \fBc\fR )
+matches the first five characters of
+.QW \fBchchcc\fR .
+Also, the RE
+.QW \fB[^c]b\fR
+matches all of
+.QW \fBchb\fR
+(because
+.QW \fB[^c]\fR
+matches the multi-character
+.QW \fBch\fR ).
+.SS "EQUIVALENCE CLASSES"
+Within a bracket expression, a collating element enclosed in \fB[=\fR
+and \fB=]\fR is an equivalence class, standing for the sequences of
+characters of all collating elements equivalent to that one, including
+itself. (If there are no other equivalent collating elements, the
+treatment is as if the enclosing delimiters were
+.QW \fB[.\fR \&
+and
+.QW \fB.]\fR .)
+For example, if \fBo\fR and \fB\*(qo\fR are the members of an
+equivalence class, then
+.QW \fB[[=o=]]\fR ,
+.QW \fB[[=\*(qo=]]\fR ,
+and
+.QW \fB[o\*(qo]\fR \&
+are all synonymous. An equivalence class may not be an endpoint of a range.
+.RS
+.PP
+(\fINote:\fR Tcl implements only the Unicode locale. It does not define any
+equivalence classes. The examples above are just illustrations.)
+.RE
+.SH ESCAPES
+Escapes (AREs only), which begin with a \fB\e\fR followed by an
+alphanumeric character, come in several varieties: character entry,
+class shorthands, constraint escapes, and back references. A \fB\e\fR
+followed by an alphanumeric character but not constituting a valid
+escape is illegal in AREs. In EREs, there are no escapes: outside a
+bracket expression, a \fB\e\fR followed by an alphanumeric character
+merely stands for that character as an ordinary character, and inside
+a bracket expression, \fB\e\fR is an ordinary character. (The latter
+is the one actual incompatibility between EREs and AREs.)
+.SS "CHARACTER-ENTRY ESCAPES"
+Character-entry escapes (AREs only) exist to make it easier to specify
+non-printing and otherwise inconvenient characters in REs:
+.RS 2
+.TP 5
+\fB\ea\fR
+.
+alert (bell) character, as in C
+.TP
+\fB\eb\fR
+.
+backspace, as in C
+.TP
+\fB\eB\fR
+.
+synonym for \fB\e\fR to help reduce backslash doubling in some
+applications where there are multiple levels of backslash processing
+.TP
+\fB\ec\fIX\fR
+.
+(where \fIX\fR is any character) the character whose low-order 5 bits
+are the same as those of \fIX\fR, and whose other bits are all zero
+.TP
+\fB\ee\fR
+.
+the character whose collating-sequence name is
+.QW \fBESC\fR ,
+or failing that, the character with octal value 033
+.TP
+\fB\ef\fR
+.
+formfeed, as in C
+.TP
+\fB\en\fR
+.
+newline, as in C
+.TP
+\fB\er\fR
+.
+carriage return, as in C
+.TP
+\fB\et\fR
+.
+horizontal tab, as in C
+.TP
+\fB\eu\fIwxyz\fR
+.
+(where \fIwxyz\fR is one up to four hexadecimal digits) the Unicode
+character \fBU+\fIwxyz\fR in the local byte ordering
+.TP
+\fB\eU\fIstuvwxyz\fR
+.
+(where \fIstuvwxyz\fR is one up to eight hexadecimal digits) reserved
+for a Unicode extension up to 21 bits. The digits are parsed until the
+first non-hexadecimal character is encountered, the maximun of eight
+hexadecimal digits are reached, or an overflow would occur in the maximum
+value of \fBU+\fI10ffff\fR.
+.TP
+\fB\ev\fR
+.
+vertical tab, as in C are all available.
+.TP
+\fB\ex\fIhh\fR
+.
+(where \fIhh\fR is one or two hexadecimal digits) the character
+whose hexadecimal value is \fB0x\fIhh\fR.
+.TP
+\fB\e0\fR
+.
+the character whose value is \fB0\fR
+.TP
+\fB\e\fIxyz\fR
+.
+(where \fIxyz\fR is exactly three octal digits, and is not a \fIback
+reference\fR (see below)) the character whose octal value is
+\fB0\fIxyz\fR. The first digit must be in the range 0-3, otherwise
+the two-digit form is assumed.
+.TP
+\fB\e\fIxy\fR
+.
+(where \fIxy\fR is exactly two octal digits, and is not a \fIback
+reference\fR (see below)) the character whose octal value is
+\fB0\fIxy\fR
+.RE
+.PP
+Hexadecimal digits are
+.QR \fB0\fR \fB9\fR ,
+.QR \fBa\fR \fBf\fR ,
+and
+.QR \fBA\fR \fBF\fR .
+Octal digits are
+.QR \fB0\fR \fB7\fR .
+.PP
+The character-entry escapes are always taken as ordinary characters.
+For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does
+not terminate a bracket expression. Beware, however, that some
+applications (e.g., C compilers and the Tcl interpreter if the regular
+expression is not quoted with braces) interpret such sequences
+themselves before the regular-expression package gets to see them,
+which may require doubling (quadrupling, etc.) the
+.QW \fB\e\fR .
+.SS "CLASS-SHORTHAND ESCAPES"
+Class-shorthand escapes (AREs only) provide shorthands for certain
+commonly-used character classes:
+.RS 2
+.TP 10
+\fB\ed\fR
+.
+\fB[[:digit:]]\fR
+.TP
+\fB\es\fR
+.
+\fB[[:space:]]\fR
+.TP
+\fB\ew\fR
+.
+\fB[[:alnum:]_]\fR (note underscore)
+.TP
+\fB\eD\fR
+.
+\fB[^[:digit:]]\fR
+.TP
+\fB\eS\fR
+.
+\fB[^[:space:]]\fR
+.TP
+\fB\eW\fR
+.
+\fB[^[:alnum:]_]\fR (note underscore)
+.RE
+.PP
+Within bracket expressions,
+.QW \fB\ed\fR ,
+.QW \fB\es\fR ,
+and
+.QW \fB\ew\fR \&
+lose their outer brackets, and
+.QW \fB\eD\fR ,
+.QW \fB\eS\fR ,
+and
+.QW \fB\eW\fR \&
+are illegal. (So, for example,
+.QW \fB[a-c\ed]\fR
+is equivalent to
+.QW \fB[a-c[:digit:]]\fR .
+Also,
+.QW \fB[a-c\eD]\fR ,
+which is equivalent to
+.QW \fB[a-c^[:digit:]]\fR ,
+is illegal.)
+.SS "CONSTRAINT ESCAPES"
+A constraint escape (AREs only) is a constraint, matching the empty
+string if specific conditions are met, written as an escape:
+.RS 2
+.TP 6
+\fB\eA\fR
+.
+matches only at the beginning of the string (see \fBMATCHING\fR,
+below, for how this differs from
+.QW \fB^\fR )
+.TP
+\fB\em\fR
+.
+matches only at the beginning of a word
+.TP
+\fB\eM\fR
+.
+matches only at the end of a word
+.TP
+\fB\ey\fR
+.
+matches only at the beginning or end of a word
+.TP
+\fB\eY\fR
+.
+matches only at a point that is not the beginning or end of a word
+.TP
+\fB\eZ\fR
+.
+matches only at the end of the string (see \fBMATCHING\fR, below, for
+how this differs from
+.QW \fB$\fR )
+.TP
+\fB\e\fIm\fR
+.
+(where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below
+.TP
+\fB\e\fImnn\fR
+.
+(where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits,
+and the decimal value \fImnn\fR is not greater than the number of
+closing capturing parentheses seen so far) a \fIback reference\fR, see
+below
+.RE
+.PP
+A word is defined as in the specification of
+.QW \fB[[:<:]]\fR
+and
+.QW \fB[[:>:]]\fR
+above. Constraint escapes are illegal within bracket expressions.
+.SS "BACK REFERENCES"
+A back reference (AREs only) matches the same string matched by the
+parenthesized subexpression specified by the number, so that (e.g.)
+.QW \fB([bc])\e1\fR
+matches
+.QW \fBbb\fR
+or
+.QW \fBcc\fR
+but not
+.QW \fBbc\fR .
+The subexpression must entirely precede the back reference in the RE.
+Subexpressions are numbered in the order of their leading parentheses.
+Non-capturing parentheses do not define subexpressions.
+.PP
+There is an inherent historical ambiguity between octal
+character-entry escapes and back references, which is resolved by
+heuristics, as hinted at above. A leading zero always indicates an
+octal escape. A single non-zero digit, not followed by another digit,
+is always taken as a back reference. A multi-digit sequence not
+starting with a zero is taken as a back reference if it comes after a
+suitable subexpression (i.e. the number is in the legal range for a
+back reference), and otherwise is taken as octal.
+.SH "METASYNTAX"
+In addition to the main syntax described above, there are some special
+forms and miscellaneous syntactic facilities available.
+.PP
+Normally the flavor of RE being used is specified by
+application-dependent means. However, this can be overridden by a
+\fIdirector\fR. If an RE of any flavor begins with
+.QW \fB***:\fR ,
+the rest of the RE is an ARE. If an RE of any flavor begins with
+.QW \fB***=\fR ,
+the rest of the RE is taken to be a literal string, with
+all characters considered ordinary characters.
+.PP
+An ARE may begin with \fIembedded options\fR: a sequence
+\fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic
+characters) specifies options affecting the rest of the RE. These
+supplement, and can override, any options specified by the
+application. The available option letters are:
+.RS 2
+.TP 3
+\fBb\fR
+.
+rest of RE is a BRE
+.TP 3
+\fBc\fR
+.
+case-sensitive matching (usual default)
+.TP 3
+\fBe\fR
+.
+rest of RE is an ERE
+.TP 3
+\fBi\fR
+.
+case-insensitive matching (see \fBMATCHING\fR, below)
+.TP 3
+\fBm\fR
+.
+historical synonym for \fBn\fR
+.TP 3
+\fBn\fR
+.
+newline-sensitive matching (see \fBMATCHING\fR, below)
+.TP 3
+\fBp\fR
+.
+partial newline-sensitive matching (see \fBMATCHING\fR, below)
+.TP 3
+\fBq\fR
+.
+rest of RE is a literal
+.PQ quoted
+string, all ordinary characters
+.TP 3
+\fBs\fR
+.
+non-newline-sensitive matching (usual default)
+.TP 3
+\fBt\fR
+.
+tight syntax (usual default; see below)
+.TP 3
+\fBw\fR
+.
+inverse partial newline-sensitive
+.PQ weird
+matching (see \fBMATCHING\fR, below)
+.TP 3
+\fBx\fR
+.
+expanded syntax (see below)
+.RE
+.PP
+Embedded options take effect at the \fB)\fR terminating the sequence.
+They are available only at the start of an ARE, and may not be used
+later within it.
+.PP
+In addition to the usual (\fItight\fR) RE syntax, in which all
+characters are significant, there is an \fIexpanded\fR syntax,
+available in all flavors of RE with the \fB\-expanded\fR switch, or in
+AREs with the embedded x option. In the expanded syntax, white-space
+characters are ignored and all characters between a \fB#\fR and the
+following newline (or the end of the RE) are ignored, permitting
+paragraphing and commenting a complex RE. There are three exceptions
+to that basic rule:
+.IP \(bu 3
+a white-space character or
+.QW \fB#\fR
+preceded by
+.QW \fB\e\fR
+is retained
+.IP \(bu 3
+white space or
+.QW \fB#\fR
+within a bracket expression is retained
+.IP \(bu 3
+white space and comments are illegal within multi-character symbols
+like the ARE
+.QW \fB(?:\fR
+or the BRE
+.QW \fB\e(\fR
+.PP
+Expanded-syntax white-space characters are blank, tab, newline, and
+any character that belongs to the \fIspace\fR character class.
+.PP
+Finally, in an ARE, outside bracket expressions, the sequence
+.QW \fB(?#\fIttt\fB)\fR
+(where \fIttt\fR is any text not containing a
+.QW \fB)\fR )
+is a comment, completely ignored. Again, this is not
+allowed between the characters of multi-character symbols like
+.QW \fB(?:\fR .
+Such comments are more a historical artifact than a useful facility,
+and their use is deprecated; use the expanded syntax instead.
+.PP
+\fINone\fR of these metasyntax extensions is available if the
+application (or an initial
+.QW \fB***=\fR
+director) has specified that the
+user's input be treated as a literal string rather than as an RE.
+.SH MATCHING
+In the event that an RE could match more than one substring of a given
+string, the RE matches the one starting earliest in the string. If
+the RE could match more than one substring starting at that point, its
+choice is determined by its \fIpreference\fR: either the longest
+substring, or the shortest.
+.PP
+Most atoms, and all constraints, have no preference. A parenthesized
+RE has the same preference (possibly none) as the RE. A quantified
+atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same
+preference (possibly none) as the atom itself. A quantified atom with
+other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with
+\fIm\fR equal to \fIn\fR) prefers longest match. A quantified atom
+with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR
+with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has
+the same preference as the first quantified atom in it which has a
+preference. An RE consisting of two or more branches connected by the
+\fB|\fR operator prefers longest match.
+.PP
+Subject to the constraints imposed by the rules for matching the whole
+RE, subexpressions also match the longest or shortest possible
+substrings, based on their preferences, with subexpressions starting
+earlier in the RE taking priority over ones starting later. Note that
+outer subexpressions thus take priority over their component
+subexpressions.
+.PP
+The quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to
+force longest and shortest preference, respectively, on a
+subexpression or a whole RE.
+.RS
+.PP
+\fBNOTE:\fR This means that you can usually make a RE be non-greedy overall by
+putting \fB{1,1}?\fR after one of the first non-constraint atoms or
+parenthesized sub-expressions in it. \fIIt pays to experiment\fR with the
+placing of this non-greediness override on a suitable range of input texts
+when you are writing a RE if you are using this level of complexity.
+.PP
+For example, this regular expression is non-greedy, and will match the
+shortest substring possible given that
+.QW \fBabc\fR
+will be matched as early as possible (the quantifier does not change that):
+.PP
+.CS
+ab{1,1}?c.*x.*cba
+.CE
+.PP
+The atom
+.QW \fBa\fR
+has no greediness preference, we explicitly give one for
+.QW \fBb\fR ,
+and the remaining quantifiers are overridden to be non-greedy by the preceding
+non-greedy quantifier.
+.RE
+.PP
+Match lengths are measured in characters, not collating elements. An
+empty string is considered longer than no match at all. For example,
+.QW \fBbb*\fR
+matches the three middle characters of
+.QW \fBabbbc\fR ,
+.QW \fB(week|wee)(night|knights)\fR
+matches all ten characters of
+.QW \fBweeknights\fR ,
+when
+.QW \fB(.*).*\fR
+is matched against
+.QW \fBabc\fR
+the parenthesized subexpression matches all three characters, and when
+.QW \fB(a*)*\fR
+is matched against
+.QW \fBbc\fR
+both the whole RE and the parenthesized subexpression match an empty string.
+.PP
+If case-independent matching is specified, the effect is much as if
+all case distinctions had vanished from the alphabet. When an
+alphabetic that exists in multiple cases appears as an ordinary
+character outside a bracket expression, it is effectively transformed
+into a bracket expression containing both cases, so that \fBx\fR
+becomes
+.QW \fB[xX]\fR .
+When it appears inside a bracket expression,
+all case counterparts of it are added to the bracket expression, so
+that
+.QW \fB[x]\fR
+becomes
+.QW \fB[xX]\fR
+and
+.QW \fB[^x]\fR
+becomes
+.QW \fB[^xX]\fR .
+.PP
+If newline-sensitive matching is specified, \fB.\fR and bracket
+expressions using \fB^\fR will never match the newline character (so
+that matches will never cross newlines unless the RE explicitly
+arranges it) and \fB^\fR and \fB$\fR will match the empty string after
+and before a newline respectively, in addition to matching at
+beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR
+continue to match beginning or end of string \fIonly\fR.
+.PP
+If partial newline-sensitive matching is specified, this affects
+\fB.\fR and bracket expressions as with newline-sensitive matching,
+but not \fB^\fR and \fB$\fR.
+.PP
+If inverse partial newline-sensitive matching is specified, this
+affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but
+not \fB.\fR and bracket expressions. This is not very useful but is
+provided for symmetry.
+.SH "LIMITS AND COMPATIBILITY"
+No particular limit is imposed on the length of REs. Programs
+intended to be highly portable should not employ REs longer than 256
+bytes, as a POSIX-compliant implementation can refuse to accept such
+REs.
+.PP
+The only feature of AREs that is actually incompatible with POSIX EREs
+is that \fB\e\fR does not lose its special significance inside bracket
+expressions. All other ARE features use syntax which is illegal or
+has undefined or unspecified effects in POSIX EREs; the \fB***\fR
+syntax of directors likewise is outside the POSIX syntax for both BREs
+and EREs.
+.PP
+Many of the ARE extensions are borrowed from Perl, but some have been
+changed to clean them up, and a few Perl extensions are not present.
+Incompatibilities of note include
+.QW \fB\eb\fR ,
+.QW \fB\eB\fR ,
+the lack of special treatment for a trailing newline, the addition of
+complemented bracket expressions to the things affected by
+newline-sensitive matching, the restrictions on parentheses and back
+references in lookahead constraints, and the longest/shortest-match
+(rather than first-match) matching semantics.
+.PP
+The matching rules for REs containing both normal and non-greedy
+quantifiers have changed since early beta-test versions of this
+package. (The new rules are much simpler and cleaner, but do not work
+as hard at guessing the user's real intentions.)
+.PP
+Henry Spencer's original 1986 \fIregexp\fR package, still in
+widespread use (e.g., in pre-8.1 releases of Tcl), implemented an
+early version of today's EREs. There are four incompatibilities
+between \fIregexp\fR's near-EREs
+.PQ RREs " for short"
+and AREs. In roughly increasing order of significance:
+.IP \(bu 3
+In AREs, \fB\e\fR followed by an alphanumeric character is either an
+escape or an error, while in RREs, it was just another way of writing
+the alphanumeric. This should not be a problem because there was no
+reason to write such a sequence in RREs.
+.IP \(bu 3
+\fB{\fR followed by a digit in an ARE is the beginning of a bound,
+while in RREs, \fB{\fR was always an ordinary character. Such
+sequences should be rare, and will often result in an error because
+following characters will not look like a valid bound.
+.IP \(bu 3
+In AREs, \fB\e\fR remains a special character within
+.QW \fB[\|]\fR ,
+so a literal \fB\e\fR within \fB[\|]\fR must be written
+.QW \fB\e\e\fR .
+\fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs,
+but only truly paranoid programmers routinely doubled the backslash.
+.IP \(bu 3
+AREs report the longest/shortest match for the RE, rather than the
+first found in a specified search order. This may affect some RREs
+which were written in the expectation that the first match would be
+reported. (The careful crafting of RREs to optimize the search order
+for fast matching is obsolete (AREs examine all possible matches in
+parallel, and their performance is largely insensitive to their
+complexity) but cases where the search order was exploited to
+deliberately find a match which was \fInot\fR the longest/shortest
+will need rewriting.)
+.SH "BASIC REGULAR EXPRESSIONS"
+BREs differ from EREs in several respects.
+.QW \fB|\fR ,
+.QW \fB+\fR ,
+and \fB?\fR are ordinary characters and there is no equivalent for their
+functionality. The delimiters for bounds are \fB\e{\fR and
+.QW \fB\e}\fR ,
+with \fB{\fR and \fB}\fR by themselves ordinary characters. The
+parentheses for nested subexpressions are \fB\e(\fR and
+.QW \fB\e)\fR ,
+with \fB(\fR and \fB)\fR by themselves ordinary
+characters. \fB^\fR is an ordinary character except at the beginning
+of the RE or the beginning of a parenthesized subexpression, \fB$\fR
+is an ordinary character except at the end of the RE or the end of a
+parenthesized subexpression, and \fB*\fR is an ordinary character if
+it appears at the beginning of the RE or the beginning of a
+parenthesized subexpression (after a possible leading
+.QW \fB^\fR ).
+Finally, single-digit back references are available, and \fB\e<\fR and
+\fB\e>\fR are synonyms for
+.QW \fB[[:<:]]\fR
+and
+.QW \fB[[:>:]]\fR
+respectively; no other escapes are available.
+.SH "SEE ALSO"
+RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n)
+.SH KEYWORDS
+match, regular expression, string
+.\" Local Variables:
+.\" mode: nroff
+.\" End:
author	William Joye <wjoye@cfa.harvard.edu>	2017-09-22 18:57:19 (GMT)
committer	William Joye <wjoye@cfa.harvard.edu>	2017-09-22 18:57:19 (GMT)
commit	2aff4a96fa0286d875bddec0019648e2c6431cbc (patch)
tree	f7a9a4800a3f3ad4b77470b8383529176d8b7181 /tcl8.6/doc/re_syntax.n
parent	3fa8e6dc88e8041b6cb88d1b1e9c05676d3346b7 (diff)
parent	29ccecd87709feda60d191f6aaba324ccad91f55 (diff)
download	blt-2aff4a96fa0286d875bddec0019648e2c6431cbc.zip blt-2aff4a96fa0286d875bddec0019648e2c6431cbc.tar.gz blt-2aff4a96fa0286d875bddec0019648e2c6431cbc.tar.bz2