author | William Joye <wjoye@cfa.harvard.edu> | 2017-09-22 18:57:19 (GMT) |
---|---|---|
committer | William Joye <wjoye@cfa.harvard.edu> | 2017-09-22 18:57:19 (GMT) |
commit | 2aff4a96fa0286d875bddec0019648e2c6431cbc (patch) | |
tree | f7a9a4800a3f3ad4b77470b8383529176d8b7181 /tcl8.6/doc/re_syntax.n | |
parent | 3fa8e6dc88e8041b6cb88d1b1e9c05676d3346b7 (diff) | |
parent | 29ccecd87709feda60d191f6aaba324ccad91f55 (diff) | |
Merge commit '29ccecd87709feda60d191f6aaba324ccad91f55' as 'tcl8.6'
Diffstat (limited to 'tcl8.6/doc/re_syntax.n')
-rw-r--r-- | tcl8.6/doc/re_syntax.n | 858 |
1 files changed, 858 insertions, 0 deletions
diff --git a/tcl8.6/doc/re_syntax.n b/tcl8.6/doc/re_syntax.n new file mode 100644 index 0000000..7988071 --- /dev/null +++ b/tcl8.6/doc/re_syntax.n @@ -0,0 +1,858 @@ +'\" +'\" Copyright (c) 1998 Sun Microsystems, Inc. +'\" Copyright (c) 1999 Scriptics Corporation +'\" +'\" See the file "license.terms" for information on usage and redistribution +'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES. +'\" +.so man.macros +.ie '\w'o''\w'\C'^o''' .ds qo \C'^o' +.el .ds qo u +.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands" +.BS +.SH NAME +re_syntax \- Syntax of Tcl regular expressions +.BE +.SH DESCRIPTION +.PP +A \fIregular expression\fR describes strings of characters. +It's a pattern that matches certain strings and does not match others. +.SH "DIFFERENT FLAVORS OF REs" +Regular expressions +.PQ RE s , +as defined by POSIX, come in two flavors: \fIextended\fR REs +.PQ ERE s +and \fIbasic\fR REs +.PQ BRE s . +EREs are roughly those of the traditional \fIegrep\fR, while BREs are +roughly those of the traditional \fIed\fR. This implementation adds +a third flavor, \fIadvanced\fR REs +.PQ ARE s , +basically EREs with some significant extensions. +.PP +This manual page primarily describes AREs. BREs mostly exist for +backward compatibility in some old programs; they will be discussed at +the end. POSIX EREs are almost an exact subset of AREs. Features of +AREs that are not present in EREs will be indicated. +.SH "REGULAR EXPRESSION SYNTAX" +.PP +Tcl regular expressions are implemented using the package written by +Henry Spencer, based on the 1003.2 spec and some (not quite all) of +the Perl5 extensions (thanks, Henry!). Much of the description of +regular expressions below is copied verbatim from his manual entry. +.PP +An ARE is one or more \fIbranches\fR, +separated by +.QW \fB|\fR , +matching anything that matches any of the branches. +.PP +A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR, +concatenated. 
+It matches a match for the first, followed by a match for the second, etc; +an empty branch matches the empty string. +.SS QUANTIFIERS +A quantified atom is an \fIatom\fR possibly followed +by a single \fIquantifier\fR. +Without a quantifier, it matches a single match for the atom. +The quantifiers, +and what a so-quantified atom matches, are: +.RS 2 +.TP 6 +\fB*\fR +. +a sequence of 0 or more matches of the atom +.TP +\fB+\fR +. +a sequence of 1 or more matches of the atom +.TP +\fB?\fR +. +a sequence of 0 or 1 matches of the atom +.TP +\fB{\fIm\fB}\fR +. +a sequence of exactly \fIm\fR matches of the atom +.TP +\fB{\fIm\fB,}\fR +. +a sequence of \fIm\fR or more matches of the atom +.TP +\fB{\fIm\fB,\fIn\fB}\fR +. +a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom; +\fIm\fR may not exceed \fIn\fR +.TP +\fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR +. +\fInon-greedy\fR quantifiers, which match the same possibilities, +but prefer the smallest number rather than the largest number +of matches (see \fBMATCHING\fR) +.RE +.PP +The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The +numbers \fIm\fR and \fIn\fR are unsigned decimal integers with +permissible values from 0 to 255 inclusive. +.SS ATOMS +An atom is one of: +.RS 2 +.IP \fB(\fIre\fB)\fR 6 +matches a match for \fIre\fR (\fIre\fR is any regular expression) with +the match noted for possible reporting +.IP \fB(?:\fIre\fB)\fR +as previous, but does no reporting (a +.QW non-capturing +set of parentheses) +.IP \fB()\fR +matches an empty string, noted for possible reporting +.IP \fB(?:)\fR +matches an empty string, without reporting +.IP \fB[\fIchars\fB]\fR +a \fIbracket expression\fR, matching any one of the \fIchars\fR (see +\fBBRACKET EXPRESSIONS\fR for more detail) +.IP \fB.\fR +matches any single character +.IP \fB\e\fIk\fR +matches the non-alphanumeric character \fIk\fR +taken as an ordinary character, e.g. 
\fB\e\e\fR matches a backslash +character +.IP \fB\e\fIc\fR +where \fIc\fR is alphanumeric (possibly followed by other characters), +an \fIescape\fR (AREs only), see \fBESCAPES\fR below +.IP \fB{\fR +when followed by a character other than a digit, matches the +left-brace character +.QW \fB{\fR ; +when followed by a digit, it is the beginning of a \fIbound\fR (see above) +.IP \fIx\fR +where \fIx\fR is a single character with no other significance, +matches that character. +.RE +.SS CONSTRAINTS +A \fIconstraint\fR matches an empty string when specific conditions +are met. A constraint may not be followed by a quantifier. The +simple constraints are as follows; some more constraints are described +later, under \fBESCAPES\fR. +.RS 2 +.TP 8 +\fB^\fR +. +matches at the beginning of a line +.TP +\fB$\fR +. +matches at the end of a line +.TP +\fB(?=\fIre\fB)\fR +. +\fIpositive lookahead\fR (AREs only), matches at any point where a +substring matching \fIre\fR begins +.TP +\fB(?!\fIre\fB)\fR +. +\fInegative lookahead\fR (AREs only), matches at any point where no +substring matching \fIre\fR begins +.RE +.PP +The lookahead constraints may not contain back references (see later), +and all parentheses within them are considered non-capturing. +.PP +An RE may not end with +.QW \fB\e\fR . +.SH "BRACKET EXPRESSIONS" +A \fIbracket expression\fR is a list of characters enclosed in +.QW \fB[\|]\fR . +It normally matches any single character from the list +(but see below). If the list begins with +.QW \fB^\fR , +it matches any single character (but see below) \fInot\fR from the +rest of the list. +.PP +If two characters in the list are separated by +.QW \fB\-\fR , +this is shorthand for the full \fIrange\fR of characters between those two +(inclusive) in the collating sequence, e.g. +.QW \fB[0\-9]\fR +in Unicode matches any conventional decimal digit. Two ranges may not share an +endpoint, so e.g. +.QW \fBa\-c\-e\fR +is illegal. 
Ranges in Tcl always use the +Unicode collating sequence, but other programs may use other collating +sequences and this can be a source of incompatibility between programs. +.PP +To include a literal \fB]\fR or \fB\-\fR in the list, the simplest +method is to enclose it in \fB[.\fR and \fB.]\fR to make it a +collating element (see below). Alternatively, make it the first +character (following a possible +.QW \fB^\fR ), +or (AREs only) precede it with +.QW \fB\e\fR . +Alternatively, for +.QW \fB\-\fR , +make it the last character, or the second endpoint of a range. To use +a literal \fB\-\fR as the first endpoint of a range, make it a +collating element or (AREs only) precede it with +.QW \fB\e\fR . +With the exception of +these, some combinations using \fB[\fR (see next paragraphs), and +escapes, all other special characters lose their special significance +within a bracket expression. +.SS "CHARACTER CLASSES" +Within a bracket expression, the name of a \fIcharacter class\fR +enclosed in \fB[:\fR and \fB:]\fR stands for the list of all +characters (not all collating elements!) belonging to that class. +Standard character classes are: +.IP \fBalpha\fR 8 +A letter. +.IP \fBupper\fR 8 +An upper-case letter. +.IP \fBlower\fR 8 +A lower-case letter. +.IP \fBdigit\fR 8 +A decimal digit. +.IP \fBxdigit\fR 8 +A hexadecimal digit. +.IP \fBalnum\fR 8 +An alphanumeric (letter or digit). +.IP \fBprint\fR 8 +A "printable" (same as graph, except also including space). +.IP \fBblank\fR 8 +A space or tab character. +.IP \fBspace\fR 8 +A character producing white space in displayed text. +.IP \fBpunct\fR 8 +A punctuation character. +.IP \fBgraph\fR 8 +A character with a visible representation (includes both \fBalnum\fR +and \fBpunct\fR). +.IP \fBcntrl\fR 8 +A control character. +.PP +A locale may provide others. A character class may not be used as an endpoint +of a range. 
+.RS +.PP +(\fINote:\fR the current Tcl implementation has only one locale, the Unicode +locale, which supports exactly the above classes.) +.RE +.SS "BRACKETED CONSTRAINTS" +There are two special cases of bracket expressions: the bracket +expressions +.QW \fB[[:<:]]\fR +and +.QW \fB[[:>:]]\fR +are constraints, matching empty strings at the beginning and end of a word +respectively. +.\" note, discussion of escapes below references this definition of word +A word is defined as a sequence of word characters that is neither preceded +nor followed by word characters. A word character is an \fIalnum\fR character +or an underscore +.PQ \fB_\fR "" . +These special bracket expressions are deprecated; users of AREs should use +constraint escapes instead (see below). +.SS "COLLATING ELEMENTS" +Within a bracket expression, a collating element (a character, a +multi-character sequence that collates as if it were a single +character, or a collating-sequence name for either) enclosed in +\fB[.\fR and \fB.]\fR stands for the sequence of characters of that +collating element. The sequence is a single element of the bracket +expression's list. A bracket expression in a locale that has +multi-character collating elements can thus match more than one +character. So (insidiously), a bracket expression that starts with +\fB^\fR can match multi-character collating elements even if none of +them appear in the bracket expression! +.RS +.PP +(\fINote:\fR Tcl has no multi-character collating elements. This information +is only for illustration.) +.RE +.PP +For example, assume the collating sequence includes a \fBch\fR multi-character +collating element. Then the RE +.QW \fB[[.ch.]]*c\fR +(zero or more +.QW \fBch\fRs +followed by +.QW \fBc\fR ) +matches the first five characters of +.QW \fBchchcc\fR . +Also, the RE +.QW \fB[^c]b\fR +matches all of +.QW \fBchb\fR +(because +.QW \fB[^c]\fR +matches the multi-character +.QW \fBch\fR ). 
+.SS "EQUIVALENCE CLASSES" +Within a bracket expression, a collating element enclosed in \fB[=\fR +and \fB=]\fR is an equivalence class, standing for the sequences of +characters of all collating elements equivalent to that one, including +itself. (If there are no other equivalent collating elements, the +treatment is as if the enclosing delimiters were +.QW \fB[.\fR \& +and +.QW \fB.]\fR .) +For example, if \fBo\fR and \fB\*(qo\fR are the members of an +equivalence class, then +.QW \fB[[=o=]]\fR , +.QW \fB[[=\*(qo=]]\fR , +and +.QW \fB[o\*(qo]\fR \& +are all synonymous. An equivalence class may not be an endpoint of a range. +.RS +.PP +(\fINote:\fR Tcl implements only the Unicode locale. It does not define any +equivalence classes. The examples above are just illustrations.) +.RE +.SH ESCAPES +Escapes (AREs only), which begin with a \fB\e\fR followed by an +alphanumeric character, come in several varieties: character entry, +class shorthands, constraint escapes, and back references. A \fB\e\fR +followed by an alphanumeric character but not constituting a valid +escape is illegal in AREs. In EREs, there are no escapes: outside a +bracket expression, a \fB\e\fR followed by an alphanumeric character +merely stands for that character as an ordinary character, and inside +a bracket expression, \fB\e\fR is an ordinary character. (The latter +is the one actual incompatibility between EREs and AREs.) +.SS "CHARACTER-ENTRY ESCAPES" +Character-entry escapes (AREs only) exist to make it easier to specify +non-printing and otherwise inconvenient characters in REs: +.RS 2 +.TP 5 +\fB\ea\fR +. +alert (bell) character, as in C +.TP +\fB\eb\fR +. +backspace, as in C +.TP +\fB\eB\fR +. +synonym for \fB\e\fR to help reduce backslash doubling in some +applications where there are multiple levels of backslash processing +.TP +\fB\ec\fIX\fR +. 
+(where \fIX\fR is any character) the character whose low-order 5 bits +are the same as those of \fIX\fR, and whose other bits are all zero +.TP +\fB\ee\fR +. +the character whose collating-sequence name is +.QW \fBESC\fR , +or failing that, the character with octal value 033 +.TP +\fB\ef\fR +. +formfeed, as in C +.TP +\fB\en\fR +. +newline, as in C +.TP +\fB\er\fR +. +carriage return, as in C +.TP +\fB\et\fR +. +horizontal tab, as in C +.TP +\fB\eu\fIwxyz\fR +. +(where \fIwxyz\fR is one to four hexadecimal digits) the Unicode +character \fBU+\fIwxyz\fR in the local byte ordering +.TP +\fB\eU\fIstuvwxyz\fR +. +(where \fIstuvwxyz\fR is one to eight hexadecimal digits) reserved +for a Unicode extension up to 21 bits. The digits are parsed until the +first non-hexadecimal character is encountered, the maximum of eight +hexadecimal digits is reached, or a further digit would overflow the maximum +value of \fBU+\fI10ffff\fR. +.TP +\fB\ev\fR +. +vertical tab, as in C +.TP +\fB\ex\fIhh\fR +. +(where \fIhh\fR is one or two hexadecimal digits) the character +whose hexadecimal value is \fB0x\fIhh\fR. +.TP +\fB\e0\fR +. +the character whose value is \fB0\fR +.TP +\fB\e\fIxyz\fR +. +(where \fIxyz\fR is exactly three octal digits, and is not a \fIback +reference\fR (see below)) the character whose octal value is +\fB0\fIxyz\fR. The first digit must be in the range 0-3, otherwise +the two-digit form is assumed. +.TP +\fB\e\fIxy\fR +. +(where \fIxy\fR is exactly two octal digits, and is not a \fIback +reference\fR (see below)) the character whose octal value is +\fB0\fIxy\fR +.RE +.PP +Hexadecimal digits are +.QR \fB0\fR \fB9\fR , +.QR \fBa\fR \fBf\fR , +and +.QR \fBA\fR \fBF\fR . +Octal digits are +.QR \fB0\fR \fB7\fR . +.PP +The character-entry escapes are always taken as ordinary characters. +For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does +not terminate a bracket expression. 
Beware, however, that some +applications (e.g., C compilers and the Tcl interpreter if the regular +expression is not quoted with braces) interpret such sequences +themselves before the regular-expression package gets to see them, +which may require doubling (quadrupling, etc.) the +.QW \fB\e\fR . +.SS "CLASS-SHORTHAND ESCAPES" +Class-shorthand escapes (AREs only) provide shorthands for certain +commonly-used character classes: +.RS 2 +.TP 10 +\fB\ed\fR +. +\fB[[:digit:]]\fR +.TP +\fB\es\fR +. +\fB[[:space:]]\fR +.TP +\fB\ew\fR +. +\fB[[:alnum:]_]\fR (note underscore) +.TP +\fB\eD\fR +. +\fB[^[:digit:]]\fR +.TP +\fB\eS\fR +. +\fB[^[:space:]]\fR +.TP +\fB\eW\fR +. +\fB[^[:alnum:]_]\fR (note underscore) +.RE +.PP +Within bracket expressions, +.QW \fB\ed\fR , +.QW \fB\es\fR , +and +.QW \fB\ew\fR \& +lose their outer brackets, and +.QW \fB\eD\fR , +.QW \fB\eS\fR , +and +.QW \fB\eW\fR \& +are illegal. (So, for example, +.QW \fB[a-c\ed]\fR +is equivalent to +.QW \fB[a-c[:digit:]]\fR . +Also, +.QW \fB[a-c\eD]\fR , +which is equivalent to +.QW \fB[a-c^[:digit:]]\fR , +is illegal.) +.SS "CONSTRAINT ESCAPES" +A constraint escape (AREs only) is a constraint, matching the empty +string if specific conditions are met, written as an escape: +.RS 2 +.TP 6 +\fB\eA\fR +. +matches only at the beginning of the string (see \fBMATCHING\fR, +below, for how this differs from +.QW \fB^\fR ) +.TP +\fB\em\fR +. +matches only at the beginning of a word +.TP +\fB\eM\fR +. +matches only at the end of a word +.TP +\fB\ey\fR +. +matches only at the beginning or end of a word +.TP +\fB\eY\fR +. +matches only at a point that is not the beginning or end of a word +.TP +\fB\eZ\fR +. +matches only at the end of the string (see \fBMATCHING\fR, below, for +how this differs from +.QW \fB$\fR ) +.TP +\fB\e\fIm\fR +. +(where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below +.TP +\fB\e\fImnn\fR +. 
+(where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits, +and the decimal value \fImnn\fR is not greater than the number of +closing capturing parentheses seen so far) a \fIback reference\fR, see +below +.RE +.PP +A word is defined as in the specification of +.QW \fB[[:<:]]\fR +and +.QW \fB[[:>:]]\fR +above. Constraint escapes are illegal within bracket expressions. +.SS "BACK REFERENCES" +A back reference (AREs only) matches the same string matched by the +parenthesized subexpression specified by the number, so that (e.g.) +.QW \fB([bc])\e1\fR +matches +.QW \fBbb\fR +or +.QW \fBcc\fR +but not +.QW \fBbc\fR . +The subexpression must entirely precede the back reference in the RE. +Subexpressions are numbered in the order of their leading parentheses. +Non-capturing parentheses do not define subexpressions. +.PP +There is an inherent historical ambiguity between octal +character-entry escapes and back references, which is resolved by +heuristics, as hinted at above. A leading zero always indicates an +octal escape. A single non-zero digit, not followed by another digit, +is always taken as a back reference. A multi-digit sequence not +starting with a zero is taken as a back reference if it comes after a +suitable subexpression (i.e. the number is in the legal range for a +back reference), and otherwise is taken as octal. +.SH "METASYNTAX" +In addition to the main syntax described above, there are some special +forms and miscellaneous syntactic facilities available. +.PP +Normally the flavor of RE being used is specified by +application-dependent means. However, this can be overridden by a +\fIdirector\fR. If an RE of any flavor begins with +.QW \fB***:\fR , +the rest of the RE is an ARE. If an RE of any flavor begins with +.QW \fB***=\fR , +the rest of the RE is taken to be a literal string, with +all characters considered ordinary characters. 
+.PP +An ARE may begin with \fIembedded options\fR: a sequence +\fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic +characters) specifies options affecting the rest of the RE. These +supplement, and can override, any options specified by the +application. The available option letters are: +.RS 2 +.TP 3 +\fBb\fR +. +rest of RE is a BRE +.TP 3 +\fBc\fR +. +case-sensitive matching (usual default) +.TP 3 +\fBe\fR +. +rest of RE is an ERE +.TP 3 +\fBi\fR +. +case-insensitive matching (see \fBMATCHING\fR, below) +.TP 3 +\fBm\fR +. +historical synonym for \fBn\fR +.TP 3 +\fBn\fR +. +newline-sensitive matching (see \fBMATCHING\fR, below) +.TP 3 +\fBp\fR +. +partial newline-sensitive matching (see \fBMATCHING\fR, below) +.TP 3 +\fBq\fR +. +rest of RE is a literal +.PQ quoted +string, all ordinary characters +.TP 3 +\fBs\fR +. +non-newline-sensitive matching (usual default) +.TP 3 +\fBt\fR +. +tight syntax (usual default; see below) +.TP 3 +\fBw\fR +. +inverse partial newline-sensitive +.PQ weird +matching (see \fBMATCHING\fR, below) +.TP 3 +\fBx\fR +. +expanded syntax (see below) +.RE +.PP +Embedded options take effect at the \fB)\fR terminating the sequence. +They are available only at the start of an ARE, and may not be used +later within it. +.PP +In addition to the usual (\fItight\fR) RE syntax, in which all +characters are significant, there is an \fIexpanded\fR syntax, +available in all flavors of RE with the \fB\-expanded\fR switch, or in +AREs with the embedded x option. In the expanded syntax, white-space +characters are ignored and all characters between a \fB#\fR and the +following newline (or the end of the RE) are ignored, permitting +paragraphing and commenting a complex RE. 
There are three exceptions +to that basic rule: +.IP \(bu 3 +a white-space character or +.QW \fB#\fR +preceded by +.QW \fB\e\fR +is retained +.IP \(bu 3 +white space or +.QW \fB#\fR +within a bracket expression is retained +.IP \(bu 3 +white space and comments are illegal within multi-character symbols +like the ARE +.QW \fB(?:\fR +or the BRE +.QW \fB\e(\fR +.PP +Expanded-syntax white-space characters are blank, tab, newline, and +any character that belongs to the \fIspace\fR character class. +.PP +Finally, in an ARE, outside bracket expressions, the sequence +.QW \fB(?#\fIttt\fB)\fR +(where \fIttt\fR is any text not containing a +.QW \fB)\fR ) +is a comment, completely ignored. Again, this is not +allowed between the characters of multi-character symbols like +.QW \fB(?:\fR . +Such comments are more a historical artifact than a useful facility, +and their use is deprecated; use the expanded syntax instead. +.PP +\fINone\fR of these metasyntax extensions is available if the +application (or an initial +.QW \fB***=\fR +director) has specified that the +user's input be treated as a literal string rather than as an RE. +.SH MATCHING +In the event that an RE could match more than one substring of a given +string, the RE matches the one starting earliest in the string. If +the RE could match more than one substring starting at that point, its +choice is determined by its \fIpreference\fR: either the longest +substring, or the shortest. +.PP +Most atoms, and all constraints, have no preference. A parenthesized +RE has the same preference (possibly none) as the RE. A quantified +atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same +preference (possibly none) as the atom itself. A quantified atom with +other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with +\fIm\fR equal to \fIn\fR) prefers longest match. 
A quantified atom +with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR +with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has +the same preference as the first quantified atom in it which has a +preference. An RE consisting of two or more branches connected by the +\fB|\fR operator prefers longest match. +.PP +Subject to the constraints imposed by the rules for matching the whole +RE, subexpressions also match the longest or shortest possible +substrings, based on their preferences, with subexpressions starting +earlier in the RE taking priority over ones starting later. Note that +outer subexpressions thus take priority over their component +subexpressions. +.PP +The quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to +force longest and shortest preference, respectively, on a +subexpression or a whole RE. +.RS +.PP +\fBNOTE:\fR This means that you can usually make an RE non-greedy overall by +putting \fB{1,1}?\fR after one of the first non-constraint atoms or +parenthesized sub-expressions in it. \fIIt pays to experiment\fR with the +placing of this non-greediness override on a suitable range of input texts +when you are writing an RE if you are using this level of complexity. +.PP +For example, this regular expression is non-greedy, and will match the +shortest substring possible given that +.QW \fBabc\fR +will be matched as early as possible (the quantifier does not change that): +.PP +.CS +ab{1,1}?c.*x.*cba +.CE +.PP +The atom +.QW \fBa\fR +has no greediness preference; we explicitly give one for +.QW \fBb\fR , +and the remaining quantifiers are overridden to be non-greedy by the preceding +non-greedy quantifier. +.RE +.PP +Match lengths are measured in characters, not collating elements. An +empty string is considered longer than no match at all. 
For example, +.QW \fBbb*\fR +matches the three middle characters of +.QW \fBabbbc\fR , +.QW \fB(week|wee)(night|knights)\fR +matches all ten characters of +.QW \fBweeknights\fR , +when +.QW \fB(.*).*\fR +is matched against +.QW \fBabc\fR +the parenthesized subexpression matches all three characters, and when +.QW \fB(a*)*\fR +is matched against +.QW \fBbc\fR +both the whole RE and the parenthesized subexpression match an empty string. +.PP +If case-independent matching is specified, the effect is much as if +all case distinctions had vanished from the alphabet. When an +alphabetic that exists in multiple cases appears as an ordinary +character outside a bracket expression, it is effectively transformed +into a bracket expression containing both cases, so that \fBx\fR +becomes +.QW \fB[xX]\fR . +When it appears inside a bracket expression, +all case counterparts of it are added to the bracket expression, so +that +.QW \fB[x]\fR +becomes +.QW \fB[xX]\fR +and +.QW \fB[^x]\fR +becomes +.QW \fB[^xX]\fR . +.PP +If newline-sensitive matching is specified, \fB.\fR and bracket +expressions using \fB^\fR will never match the newline character (so +that matches will never cross newlines unless the RE explicitly +arranges it) and \fB^\fR and \fB$\fR will match the empty string after +and before a newline respectively, in addition to matching at +beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR +continue to match beginning or end of string \fIonly\fR. +.PP +If partial newline-sensitive matching is specified, this affects +\fB.\fR and bracket expressions as with newline-sensitive matching, +but not \fB^\fR and \fB$\fR. +.PP +If inverse partial newline-sensitive matching is specified, this +affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but +not \fB.\fR and bracket expressions. This is not very useful but is +provided for symmetry. +.SH "LIMITS AND COMPATIBILITY" +No particular limit is imposed on the length of REs. 
Programs +intended to be highly portable should not employ REs longer than 256 +bytes, as a POSIX-compliant implementation can refuse to accept such +REs. +.PP +The only feature of AREs that is actually incompatible with POSIX EREs +is that \fB\e\fR does not lose its special significance inside bracket +expressions. All other ARE features use syntax which is illegal or +has undefined or unspecified effects in POSIX EREs; the \fB***\fR +syntax of directors likewise is outside the POSIX syntax for both BREs +and EREs. +.PP +Many of the ARE extensions are borrowed from Perl, but some have been +changed to clean them up, and a few Perl extensions are not present. +Incompatibilities of note include +.QW \fB\eb\fR , +.QW \fB\eB\fR , +the lack of special treatment for a trailing newline, the addition of +complemented bracket expressions to the things affected by +newline-sensitive matching, the restrictions on parentheses and back +references in lookahead constraints, and the longest/shortest-match +(rather than first-match) matching semantics. +.PP +The matching rules for REs containing both normal and non-greedy +quantifiers have changed since early beta-test versions of this +package. (The new rules are much simpler and cleaner, but do not work +as hard at guessing the user's real intentions.) +.PP +Henry Spencer's original 1986 \fIregexp\fR package, still in +widespread use (e.g., in pre-8.1 releases of Tcl), implemented an +early version of today's EREs. There are four incompatibilities +between \fIregexp\fR's near-EREs +.PQ RREs " for short" +and AREs. In roughly increasing order of significance: +.IP \(bu 3 +In AREs, \fB\e\fR followed by an alphanumeric character is either an +escape or an error, while in RREs, it was just another way of writing +the alphanumeric. This should not be a problem because there was no +reason to write such a sequence in RREs. 
+.IP \(bu 3 +\fB{\fR followed by a digit in an ARE is the beginning of a bound, +while in RREs, \fB{\fR was always an ordinary character. Such +sequences should be rare, and will often result in an error because +following characters will not look like a valid bound. +.IP \(bu 3 +In AREs, \fB\e\fR remains a special character within +.QW \fB[\|]\fR , +so a literal \fB\e\fR within \fB[\|]\fR must be written +.QW \fB\e\e\fR . +\fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs, +but only truly paranoid programmers routinely doubled the backslash. +.IP \(bu 3 +AREs report the longest/shortest match for the RE, rather than the +first found in a specified search order. This may affect some RREs +which were written in the expectation that the first match would be +reported. (The careful crafting of RREs to optimize the search order +for fast matching is obsolete (AREs examine all possible matches in +parallel, and their performance is largely insensitive to their +complexity) but cases where the search order was exploited to +deliberately find a match which was \fInot\fR the longest/shortest +will need rewriting.) +.SH "BASIC REGULAR EXPRESSIONS" +BREs differ from EREs in several respects. +.QW \fB|\fR , +.QW \fB+\fR , +and \fB?\fR are ordinary characters and there is no equivalent for their +functionality. The delimiters for bounds are \fB\e{\fR and +.QW \fB\e}\fR , +with \fB{\fR and \fB}\fR by themselves ordinary characters. The +parentheses for nested subexpressions are \fB\e(\fR and +.QW \fB\e)\fR , +with \fB(\fR and \fB)\fR by themselves ordinary +characters. \fB^\fR is an ordinary character except at the beginning +of the RE or the beginning of a parenthesized subexpression, \fB$\fR +is an ordinary character except at the end of the RE or the end of a +parenthesized subexpression, and \fB*\fR is an ordinary character if +it appears at the beginning of the RE or the beginning of a +parenthesized subexpression (after a possible leading +.QW \fB^\fR ). 
+Finally, single-digit back references are available, and \fB\e<\fR and +\fB\e>\fR are synonyms for +.QW \fB[[:<:]]\fR +and +.QW \fB[[:>:]]\fR +respectively; no other escapes are available. +.SH "SEE ALSO" +RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n) +.SH KEYWORDS +match, regular expression, string +.\" Local Variables: +.\" mode: nroff +.\" End: |
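The back-reference, non-greedy quantifier and lookahead rules described in the man page above can be tried out with a short script. What follows is only a sketch under stated assumptions: it uses Python's `re` module as a stand-in rather than Tcl's `regexp` command, and the helper names (`doubled_letter`, `greedy_and_lazy`, `word_before_colon`, `foo_not_before_bar`) are hypothetical. Python's engine is not an ARE implementation, so the sketch sticks to constructs that behave the same way on these inputs. It leaves out alternation with `|`, the `\m`, `\M` and `\y` escapes, directors, and embedded options such as `(?b)`, which differ or do not exist there.

```python
import re

# Python's `re` stands in for an ARE engine; all helper names are hypothetical.

def doubled_letter(text):
    # ([bc])\1 : a captured 'b' or 'c' followed by the same character again.
    m = re.search(r"([bc])\1", text)
    return m.group(0) if m else None

def greedy_and_lazy(text):
    # a.*c prefers the longest match, a.*?c the shortest.
    greedy = re.search(r"a.*c", text)
    lazy = re.search(r"a.*?c", text)
    return (greedy.group(0) if greedy else None,
            lazy.group(0) if lazy else None)

def word_before_colon(text):
    # (?=:) is a positive lookahead: ':' must follow but is not consumed.
    m = re.search(r"\w+(?=:)", text)
    return m.group(0) if m else None

def foo_not_before_bar(text):
    # (?!bar) is a negative lookahead: "foo" matches only where "bar" does not follow.
    m = re.search(r"foo(?!bar)", text)
    return m.start() if m else None

if __name__ == "__main__":
    print(doubled_letter("abbc"))               # bb
    print(doubled_letter("bc"))                 # None
    print(greedy_and_lazy("abcabc"))            # ('abcabc', 'abc')
    print(word_before_colon("key: v"))          # key
    print(foo_not_before_bar("foobar foobaz"))  # 7
```

The man page's `(week|wee)(night|knights)` example is left out on purpose. It relies on the longest-match rule from the MATCHING section, whereas a backtracking engine such as Python's takes the first alternative that succeeds and stops at the nine-character `weeknight`.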