author | William Joye <wjoye@cfa.harvard.edu> | 2017-09-22 18:51:12 (GMT) |
---|---|---|
committer | William Joye <wjoye@cfa.harvard.edu> | 2017-09-22 18:51:12 (GMT) |
commit | 3fa8e6dc88e8041b6cb88d1b1e9c05676d3346b7 (patch) | |
tree | 69afbb41089c8358615879f7cd3c4cf7997f4c7e /tcl8.6/doc/re_syntax.n | |
parent | a0e17db23c0fd7c771c0afce8cce350c98f90b02 (diff) | |
download | blt-3fa8e6dc88e8041b6cb88d1b1e9c05676d3346b7.zip blt-3fa8e6dc88e8041b6cb88d1b1e9c05676d3346b7.tar.gz blt-3fa8e6dc88e8041b6cb88d1b1e9c05676d3346b7.tar.bz2 |
update to tcl/tk 8.6.7
Diffstat (limited to 'tcl8.6/doc/re_syntax.n')
-rw-r--r-- | tcl8.6/doc/re_syntax.n | 858 |
1 files changed, 0 insertions, 858 deletions
diff --git a/tcl8.6/doc/re_syntax.n b/tcl8.6/doc/re_syntax.n deleted file mode 100644 index 7988071..0000000 --- a/tcl8.6/doc/re_syntax.n +++ /dev/null @@ -1,858 +0,0 @@ -'\" -'\" Copyright (c) 1998 Sun Microsystems, Inc. -'\" Copyright (c) 1999 Scriptics Corporation -'\" -'\" See the file "license.terms" for information on usage and redistribution -'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES. -'\" -.so man.macros -.ie '\w'o''\w'\C'^o''' .ds qo \C'^o' -.el .ds qo u -.TH re_syntax n "8.1" Tcl "Tcl Built-In Commands" -.BS -.SH NAME -re_syntax \- Syntax of Tcl regular expressions -.BE -.SH DESCRIPTION -.PP -A \fIregular expression\fR describes strings of characters. -It's a pattern that matches certain strings and does not match others. -.SH "DIFFERENT FLAVORS OF REs" -Regular expressions -.PQ RE s , -as defined by POSIX, come in two flavors: \fIextended\fR REs -.PQ ERE s -and \fIbasic\fR REs -.PQ BRE s . -EREs are roughly those of the traditional \fIegrep\fR, while BREs are -roughly those of the traditional \fIed\fR. This implementation adds -a third flavor, \fIadvanced\fR REs -.PQ ARE s , -basically EREs with some significant extensions. -.PP -This manual page primarily describes AREs. BREs mostly exist for -backward compatibility in some old programs; they will be discussed at -the end. POSIX EREs are almost an exact subset of AREs. Features of -AREs that are not present in EREs will be indicated. -.SH "REGULAR EXPRESSION SYNTAX" -.PP -Tcl regular expressions are implemented using the package written by -Henry Spencer, based on the 1003.2 spec and some (not quite all) of -the Perl5 extensions (thanks, Henry!). Much of the description of -regular expressions below is copied verbatim from his manual entry. -.PP -An ARE is one or more \fIbranches\fR, -separated by -.QW \fB|\fR , -matching anything that matches any of the branches. -.PP -A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR, -concatenated. 
-It matches a match for the first, followed by a match for the second, etc; -an empty branch matches the empty string. -.SS QUANTIFIERS -A quantified atom is an \fIatom\fR possibly followed -by a single \fIquantifier\fR. -Without a quantifier, it matches a single match for the atom. -The quantifiers, -and what a so-quantified atom matches, are: -.RS 2 -.TP 6 -\fB*\fR -. -a sequence of 0 or more matches of the atom -.TP -\fB+\fR -. -a sequence of 1 or more matches of the atom -.TP -\fB?\fR -. -a sequence of 0 or 1 matches of the atom -.TP -\fB{\fIm\fB}\fR -. -a sequence of exactly \fIm\fR matches of the atom -.TP -\fB{\fIm\fB,}\fR -. -a sequence of \fIm\fR or more matches of the atom -.TP -\fB{\fIm\fB,\fIn\fB}\fR -. -a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom; -\fIm\fR may not exceed \fIn\fR -.TP -\fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR -. -\fInon-greedy\fR quantifiers, which match the same possibilities, -but prefer the smallest number rather than the largest number -of matches (see \fBMATCHING\fR) -.RE -.PP -The forms using \fB{\fR and \fB}\fR are known as \fIbound\fRs. The -numbers \fIm\fR and \fIn\fR are unsigned decimal integers with -permissible values from 0 to 255 inclusive. -.SS ATOMS -An atom is one of: -.RS 2 -.IP \fB(\fIre\fB)\fR 6 -matches a match for \fIre\fR (\fIre\fR is any regular expression) with -the match noted for possible reporting -.IP \fB(?:\fIre\fB)\fR -as previous, but does no reporting (a -.QW non-capturing -set of parentheses) -.IP \fB()\fR -matches an empty string, noted for possible reporting -.IP \fB(?:)\fR -matches an empty string, without reporting -.IP \fB[\fIchars\fB]\fR -a \fIbracket expression\fR, matching any one of the \fIchars\fR (see -\fBBRACKET EXPRESSIONS\fR for more detail) -.IP \fB.\fR -matches any single character -.IP \fB\e\fIk\fR -matches the non-alphanumeric character \fIk\fR -taken as an ordinary character, e.g. 
\fB\e\e\fR matches a backslash -character -.IP \fB\e\fIc\fR -where \fIc\fR is alphanumeric (possibly followed by other characters), -an \fIescape\fR (AREs only), see \fBESCAPES\fR below -.IP \fB{\fR -when followed by a character other than a digit, matches the -left-brace character -.QW \fB{\fR ; -when followed by a digit, it is the beginning of a \fIbound\fR (see above) -.IP \fIx\fR -where \fIx\fR is a single character with no other significance, -matches that character. -.RE -.SS CONSTRAINTS -A \fIconstraint\fR matches an empty string when specific conditions -are met. A constraint may not be followed by a quantifier. The -simple constraints are as follows; some more constraints are described -later, under \fBESCAPES\fR. -.RS 2 -.TP 8 -\fB^\fR -. -matches at the beginning of a line -.TP -\fB$\fR -. -matches at the end of a line -.TP -\fB(?=\fIre\fB)\fR -. -\fIpositive lookahead\fR (AREs only), matches at any point where a -substring matching \fIre\fR begins -.TP -\fB(?!\fIre\fB)\fR -. -\fInegative lookahead\fR (AREs only), matches at any point where no -substring matching \fIre\fR begins -.RE -.PP -The lookahead constraints may not contain back references (see later), -and all parentheses within them are considered non-capturing. -.PP -An RE may not end with -.QW \fB\e\fR . -.SH "BRACKET EXPRESSIONS" -A \fIbracket expression\fR is a list of characters enclosed in -.QW \fB[\|]\fR . -It normally matches any single character from the list -(but see below). If the list begins with -.QW \fB^\fR , -it matches any single character (but see below) \fInot\fR from the -rest of the list. -.PP -If two characters in the list are separated by -.QW \fB\-\fR , -this is shorthand for the full \fIrange\fR of characters between those two -(inclusive) in the collating sequence, e.g. -.QW \fB[0\-9]\fR -in Unicode matches any conventional decimal digit. Two ranges may not share an -endpoint, so e.g. -.QW \fBa\-c\-e\fR -is illegal. 
Ranges in Tcl always use the -Unicode collating sequence, but other programs may use other collating -sequences and this can be a source of incompatibility between programs. -.PP -To include a literal \fB]\fR or \fB\-\fR in the list, the simplest -method is to enclose it in \fB[.\fR and \fB.]\fR to make it a -collating element (see below). Alternatively, make it the first -character (following a possible -.QW \fB^\fR ), -or (AREs only) precede it with -.QW \fB\e\fR . -Alternatively, for -.QW \fB\-\fR , -make it the last character, or the second endpoint of a range. To use -a literal \fB\-\fR as the first endpoint of a range, make it a -collating element or (AREs only) precede it with -.QW \fB\e\fR . -With the exception of -these, some combinations using \fB[\fR (see next paragraphs), and -escapes, all other special characters lose their special significance -within a bracket expression. -.SS "CHARACTER CLASSES" -Within a bracket expression, the name of a \fIcharacter class\fR -enclosed in \fB[:\fR and \fB:]\fR stands for the list of all -characters (not all collating elements!) belonging to that class. -Standard character classes are: -.IP \fBalpha\fR 8 -A letter. -.IP \fBupper\fR 8 -An upper-case letter. -.IP \fBlower\fR 8 -A lower-case letter. -.IP \fBdigit\fR 8 -A decimal digit. -.IP \fBxdigit\fR 8 -A hexadecimal digit. -.IP \fBalnum\fR 8 -An alphanumeric (letter or digit). -.IP \fBprint\fR 8 -A "printable" (same as graph, except also including space). -.IP \fBblank\fR 8 -A space or tab character. -.IP \fBspace\fR 8 -A character producing white space in displayed text. -.IP \fBpunct\fR 8 -A punctuation character. -.IP \fBgraph\fR 8 -A character with a visible representation (includes both \fBalnum\fR -and \fBpunct\fR). -.IP \fBcntrl\fR 8 -A control character. -.PP -A locale may provide others. A character class may not be used as an endpoint -of a range. 
-.RS -.PP -(\fINote:\fR the current Tcl implementation has only one locale, the Unicode -locale, which supports exactly the above classes.) -.RE -.SS "BRACKETED CONSTRAINTS" -There are two special cases of bracket expressions: the bracket -expressions -.QW \fB[[:<:]]\fR -and -.QW \fB[[:>:]]\fR -are constraints, matching empty strings at the beginning and end of a word -respectively. -.\" note, discussion of escapes below references this definition of word -A word is defined as a sequence of word characters that is neither preceded -nor followed by word characters. A word character is an \fIalnum\fR character -or an underscore -.PQ \fB_\fR "" . -These special bracket expressions are deprecated; users of AREs should use -constraint escapes instead (see below). -.SS "COLLATING ELEMENTS" -Within a bracket expression, a collating element (a character, a -multi-character sequence that collates as if it were a single -character, or a collating-sequence name for either) enclosed in -\fB[.\fR and \fB.]\fR stands for the sequence of characters of that -collating element. The sequence is a single element of the bracket -expression's list. A bracket expression in a locale that has -multi-character collating elements can thus match more than one -character. So (insidiously), a bracket expression that starts with -\fB^\fR can match multi-character collating elements even if none of -them appear in the bracket expression! -.RS -.PP -(\fINote:\fR Tcl has no multi-character collating elements. This information -is only for illustration.) -.RE -.PP -For example, assume the collating sequence includes a \fBch\fR multi-character -collating element. Then the RE -.QW \fB[[.ch.]]*c\fR -(zero or more -.QW \fBch\fRs -followed by -.QW \fBc\fR ) -matches the first five characters of -.QW \fBchchcc\fR . -Also, the RE -.QW \fB[^c]b\fR -matches all of -.QW \fBchb\fR -(because -.QW \fB[^c]\fR -matches the multi-character -.QW \fBch\fR ). 
-.SS "EQUIVALENCE CLASSES" -Within a bracket expression, a collating element enclosed in \fB[=\fR -and \fB=]\fR is an equivalence class, standing for the sequences of -characters of all collating elements equivalent to that one, including -itself. (If there are no other equivalent collating elements, the -treatment is as if the enclosing delimiters were -.QW \fB[.\fR \& -and -.QW \fB.]\fR .) -For example, if \fBo\fR and \fB\*(qo\fR are the members of an -equivalence class, then -.QW \fB[[=o=]]\fR , -.QW \fB[[=\*(qo=]]\fR , -and -.QW \fB[o\*(qo]\fR \& -are all synonymous. An equivalence class may not be an endpoint of a range. -.RS -.PP -(\fINote:\fR Tcl implements only the Unicode locale. It does not define any -equivalence classes. The examples above are just illustrations.) -.RE -.SH ESCAPES -Escapes (AREs only), which begin with a \fB\e\fR followed by an -alphanumeric character, come in several varieties: character entry, -class shorthands, constraint escapes, and back references. A \fB\e\fR -followed by an alphanumeric character but not constituting a valid -escape is illegal in AREs. In EREs, there are no escapes: outside a -bracket expression, a \fB\e\fR followed by an alphanumeric character -merely stands for that character as an ordinary character, and inside -a bracket expression, \fB\e\fR is an ordinary character. (The latter -is the one actual incompatibility between EREs and AREs.) -.SS "CHARACTER-ENTRY ESCAPES" -Character-entry escapes (AREs only) exist to make it easier to specify -non-printing and otherwise inconvenient characters in REs: -.RS 2 -.TP 5 -\fB\ea\fR -. -alert (bell) character, as in C -.TP -\fB\eb\fR -. -backspace, as in C -.TP -\fB\eB\fR -. -synonym for \fB\e\fR to help reduce backslash doubling in some -applications where there are multiple levels of backslash processing -.TP -\fB\ec\fIX\fR -. 
-(where \fIX\fR is any character) the character whose low-order 5 bits -are the same as those of \fIX\fR, and whose other bits are all zero -.TP -\fB\ee\fR -. -the character whose collating-sequence name is -.QW \fBESC\fR , -or failing that, the character with octal value 033 -.TP -\fB\ef\fR -. -formfeed, as in C -.TP -\fB\en\fR -. -newline, as in C -.TP -\fB\er\fR -. -carriage return, as in C -.TP -\fB\et\fR -. -horizontal tab, as in C -.TP -\fB\eu\fIwxyz\fR -. -(where \fIwxyz\fR is one up to four hexadecimal digits) the Unicode -character \fBU+\fIwxyz\fR in the local byte ordering -.TP -\fB\eU\fIstuvwxyz\fR -. -(where \fIstuvwxyz\fR is one up to eight hexadecimal digits) reserved -for a Unicode extension up to 21 bits. The digits are parsed until the -first non-hexadecimal character is encountered, the maximun of eight -hexadecimal digits are reached, or an overflow would occur in the maximum -value of \fBU+\fI10ffff\fR. -.TP -\fB\ev\fR -. -vertical tab, as in C are all available. -.TP -\fB\ex\fIhh\fR -. -(where \fIhh\fR is one or two hexadecimal digits) the character -whose hexadecimal value is \fB0x\fIhh\fR. -.TP -\fB\e0\fR -. -the character whose value is \fB0\fR -.TP -\fB\e\fIxyz\fR -. -(where \fIxyz\fR is exactly three octal digits, and is not a \fIback -reference\fR (see below)) the character whose octal value is -\fB0\fIxyz\fR. The first digit must be in the range 0-3, otherwise -the two-digit form is assumed. -.TP -\fB\e\fIxy\fR -. -(where \fIxy\fR is exactly two octal digits, and is not a \fIback -reference\fR (see below)) the character whose octal value is -\fB0\fIxy\fR -.RE -.PP -Hexadecimal digits are -.QR \fB0\fR \fB9\fR , -.QR \fBa\fR \fBf\fR , -and -.QR \fBA\fR \fBF\fR . -Octal digits are -.QR \fB0\fR \fB7\fR . -.PP -The character-entry escapes are always taken as ordinary characters. -For example, \fB\e135\fR is \fB]\fR in Unicode, but \fB\e135\fR does -not terminate a bracket expression. 
Beware, however, that some -applications (e.g., C compilers and the Tcl interpreter if the regular -expression is not quoted with braces) interpret such sequences -themselves before the regular-expression package gets to see them, -which may require doubling (quadrupling, etc.) the -.QW \fB\e\fR . -.SS "CLASS-SHORTHAND ESCAPES" -Class-shorthand escapes (AREs only) provide shorthands for certain -commonly-used character classes: -.RS 2 -.TP 10 -\fB\ed\fR -. -\fB[[:digit:]]\fR -.TP -\fB\es\fR -. -\fB[[:space:]]\fR -.TP -\fB\ew\fR -. -\fB[[:alnum:]_]\fR (note underscore) -.TP -\fB\eD\fR -. -\fB[^[:digit:]]\fR -.TP -\fB\eS\fR -. -\fB[^[:space:]]\fR -.TP -\fB\eW\fR -. -\fB[^[:alnum:]_]\fR (note underscore) -.RE -.PP -Within bracket expressions, -.QW \fB\ed\fR , -.QW \fB\es\fR , -and -.QW \fB\ew\fR \& -lose their outer brackets, and -.QW \fB\eD\fR , -.QW \fB\eS\fR , -and -.QW \fB\eW\fR \& -are illegal. (So, for example, -.QW \fB[a-c\ed]\fR -is equivalent to -.QW \fB[a-c[:digit:]]\fR . -Also, -.QW \fB[a-c\eD]\fR , -which is equivalent to -.QW \fB[a-c^[:digit:]]\fR , -is illegal.) -.SS "CONSTRAINT ESCAPES" -A constraint escape (AREs only) is a constraint, matching the empty -string if specific conditions are met, written as an escape: -.RS 2 -.TP 6 -\fB\eA\fR -. -matches only at the beginning of the string (see \fBMATCHING\fR, -below, for how this differs from -.QW \fB^\fR ) -.TP -\fB\em\fR -. -matches only at the beginning of a word -.TP -\fB\eM\fR -. -matches only at the end of a word -.TP -\fB\ey\fR -. -matches only at the beginning or end of a word -.TP -\fB\eY\fR -. -matches only at a point that is not the beginning or end of a word -.TP -\fB\eZ\fR -. -matches only at the end of the string (see \fBMATCHING\fR, below, for -how this differs from -.QW \fB$\fR ) -.TP -\fB\e\fIm\fR -. -(where \fIm\fR is a nonzero digit) a \fIback reference\fR, see below -.TP -\fB\e\fImnn\fR -. 
-(where \fIm\fR is a nonzero digit, and \fInn\fR is some more digits, -and the decimal value \fImnn\fR is not greater than the number of -closing capturing parentheses seen so far) a \fIback reference\fR, see -below -.RE -.PP -A word is defined as in the specification of -.QW \fB[[:<:]]\fR -and -.QW \fB[[:>:]]\fR -above. Constraint escapes are illegal within bracket expressions. -.SS "BACK REFERENCES" -A back reference (AREs only) matches the same string matched by the -parenthesized subexpression specified by the number, so that (e.g.) -.QW \fB([bc])\e1\fR -matches -.QW \fBbb\fR -or -.QW \fBcc\fR -but not -.QW \fBbc\fR . -The subexpression must entirely precede the back reference in the RE. -Subexpressions are numbered in the order of their leading parentheses. -Non-capturing parentheses do not define subexpressions. -.PP -There is an inherent historical ambiguity between octal -character-entry escapes and back references, which is resolved by -heuristics, as hinted at above. A leading zero always indicates an -octal escape. A single non-zero digit, not followed by another digit, -is always taken as a back reference. A multi-digit sequence not -starting with a zero is taken as a back reference if it comes after a -suitable subexpression (i.e. the number is in the legal range for a -back reference), and otherwise is taken as octal. -.SH "METASYNTAX" -In addition to the main syntax described above, there are some special -forms and miscellaneous syntactic facilities available. -.PP -Normally the flavor of RE being used is specified by -application-dependent means. However, this can be overridden by a -\fIdirector\fR. If an RE of any flavor begins with -.QW \fB***:\fR , -the rest of the RE is an ARE. If an RE of any flavor begins with -.QW \fB***=\fR , -the rest of the RE is taken to be a literal string, with -all characters considered ordinary characters. 
-.PP -An ARE may begin with \fIembedded options\fR: a sequence -\fB(?\fIxyz\fB)\fR (where \fIxyz\fR is one or more alphabetic -characters) specifies options affecting the rest of the RE. These -supplement, and can override, any options specified by the -application. The available option letters are: -.RS 2 -.TP 3 -\fBb\fR -. -rest of RE is a BRE -.TP 3 -\fBc\fR -. -case-sensitive matching (usual default) -.TP 3 -\fBe\fR -. -rest of RE is an ERE -.TP 3 -\fBi\fR -. -case-insensitive matching (see \fBMATCHING\fR, below) -.TP 3 -\fBm\fR -. -historical synonym for \fBn\fR -.TP 3 -\fBn\fR -. -newline-sensitive matching (see \fBMATCHING\fR, below) -.TP 3 -\fBp\fR -. -partial newline-sensitive matching (see \fBMATCHING\fR, below) -.TP 3 -\fBq\fR -. -rest of RE is a literal -.PQ quoted -string, all ordinary characters -.TP 3 -\fBs\fR -. -non-newline-sensitive matching (usual default) -.TP 3 -\fBt\fR -. -tight syntax (usual default; see below) -.TP 3 -\fBw\fR -. -inverse partial newline-sensitive -.PQ weird -matching (see \fBMATCHING\fR, below) -.TP 3 -\fBx\fR -. -expanded syntax (see below) -.RE -.PP -Embedded options take effect at the \fB)\fR terminating the sequence. -They are available only at the start of an ARE, and may not be used -later within it. -.PP -In addition to the usual (\fItight\fR) RE syntax, in which all -characters are significant, there is an \fIexpanded\fR syntax, -available in all flavors of RE with the \fB\-expanded\fR switch, or in -AREs with the embedded x option. In the expanded syntax, white-space -characters are ignored and all characters between a \fB#\fR and the -following newline (or the end of the RE) are ignored, permitting -paragraphing and commenting a complex RE. 
There are three exceptions -to that basic rule: -.IP \(bu 3 -a white-space character or -.QW \fB#\fR -preceded by -.QW \fB\e\fR -is retained -.IP \(bu 3 -white space or -.QW \fB#\fR -within a bracket expression is retained -.IP \(bu 3 -white space and comments are illegal within multi-character symbols -like the ARE -.QW \fB(?:\fR -or the BRE -.QW \fB\e(\fR -.PP -Expanded-syntax white-space characters are blank, tab, newline, and -any character that belongs to the \fIspace\fR character class. -.PP -Finally, in an ARE, outside bracket expressions, the sequence -.QW \fB(?#\fIttt\fB)\fR -(where \fIttt\fR is any text not containing a -.QW \fB)\fR ) -is a comment, completely ignored. Again, this is not -allowed between the characters of multi-character symbols like -.QW \fB(?:\fR . -Such comments are more a historical artifact than a useful facility, -and their use is deprecated; use the expanded syntax instead. -.PP -\fINone\fR of these metasyntax extensions is available if the -application (or an initial -.QW \fB***=\fR -director) has specified that the -user's input be treated as a literal string rather than as an RE. -.SH MATCHING -In the event that an RE could match more than one substring of a given -string, the RE matches the one starting earliest in the string. If -the RE could match more than one substring starting at that point, its -choice is determined by its \fIpreference\fR: either the longest -substring, or the shortest. -.PP -Most atoms, and all constraints, have no preference. A parenthesized -RE has the same preference (possibly none) as the RE. A quantified -atom with quantifier \fB{\fIm\fB}\fR or \fB{\fIm\fB}?\fR has the same -preference (possibly none) as the atom itself. A quantified atom with -other normal quantifiers (including \fB{\fIm\fB,\fIn\fB}\fR with -\fIm\fR equal to \fIn\fR) prefers longest match. 
A quantified atom -with other non-greedy quantifiers (including \fB{\fIm\fB,\fIn\fB}?\fR -with \fIm\fR equal to \fIn\fR) prefers shortest match. A branch has -the same preference as the first quantified atom in it which has a -preference. An RE consisting of two or more branches connected by the -\fB|\fR operator prefers longest match. -.PP -Subject to the constraints imposed by the rules for matching the whole -RE, subexpressions also match the longest or shortest possible -substrings, based on their preferences, with subexpressions starting -earlier in the RE taking priority over ones starting later. Note that -outer subexpressions thus take priority over their component -subexpressions. -.PP -The quantifiers \fB{1,1}\fR and \fB{1,1}?\fR can be used to -force longest and shortest preference, respectively, on a -subexpression or a whole RE. -.RS -.PP -\fBNOTE:\fR This means that you can usually make a RE be non-greedy overall by -putting \fB{1,1}?\fR after one of the first non-constraint atoms or -parenthesized sub-expressions in it. \fIIt pays to experiment\fR with the -placing of this non-greediness override on a suitable range of input texts -when you are writing a RE if you are using this level of complexity. -.PP -For example, this regular expression is non-greedy, and will match the -shortest substring possible given that -.QW \fBabc\fR -will be matched as early as possible (the quantifier does not change that): -.PP -.CS -ab{1,1}?c.*x.*cba -.CE -.PP -The atom -.QW \fBa\fR -has no greediness preference, we explicitly give one for -.QW \fBb\fR , -and the remaining quantifiers are overridden to be non-greedy by the preceding -non-greedy quantifier. -.RE -.PP -Match lengths are measured in characters, not collating elements. An -empty string is considered longer than no match at all. 
For example, -.QW \fBbb*\fR -matches the three middle characters of -.QW \fBabbbc\fR , -.QW \fB(week|wee)(night|knights)\fR -matches all ten characters of -.QW \fBweeknights\fR , -when -.QW \fB(.*).*\fR -is matched against -.QW \fBabc\fR -the parenthesized subexpression matches all three characters, and when -.QW \fB(a*)*\fR -is matched against -.QW \fBbc\fR -both the whole RE and the parenthesized subexpression match an empty string. -.PP -If case-independent matching is specified, the effect is much as if -all case distinctions had vanished from the alphabet. When an -alphabetic that exists in multiple cases appears as an ordinary -character outside a bracket expression, it is effectively transformed -into a bracket expression containing both cases, so that \fBx\fR -becomes -.QW \fB[xX]\fR . -When it appears inside a bracket expression, -all case counterparts of it are added to the bracket expression, so -that -.QW \fB[x]\fR -becomes -.QW \fB[xX]\fR -and -.QW \fB[^x]\fR -becomes -.QW \fB[^xX]\fR . -.PP -If newline-sensitive matching is specified, \fB.\fR and bracket -expressions using \fB^\fR will never match the newline character (so -that matches will never cross newlines unless the RE explicitly -arranges it) and \fB^\fR and \fB$\fR will match the empty string after -and before a newline respectively, in addition to matching at -beginning and end of string respectively. ARE \fB\eA\fR and \fB\eZ\fR -continue to match beginning or end of string \fIonly\fR. -.PP -If partial newline-sensitive matching is specified, this affects -\fB.\fR and bracket expressions as with newline-sensitive matching, -but not \fB^\fR and \fB$\fR. -.PP -If inverse partial newline-sensitive matching is specified, this -affects \fB^\fR and \fB$\fR as with newline-sensitive matching, but -not \fB.\fR and bracket expressions. This is not very useful but is -provided for symmetry. -.SH "LIMITS AND COMPATIBILITY" -No particular limit is imposed on the length of REs. 
Programs -intended to be highly portable should not employ REs longer than 256 -bytes, as a POSIX-compliant implementation can refuse to accept such -REs. -.PP -The only feature of AREs that is actually incompatible with POSIX EREs -is that \fB\e\fR does not lose its special significance inside bracket -expressions. All other ARE features use syntax which is illegal or -has undefined or unspecified effects in POSIX EREs; the \fB***\fR -syntax of directors likewise is outside the POSIX syntax for both BREs -and EREs. -.PP -Many of the ARE extensions are borrowed from Perl, but some have been -changed to clean them up, and a few Perl extensions are not present. -Incompatibilities of note include -.QW \fB\eb\fR , -.QW \fB\eB\fR , -the lack of special treatment for a trailing newline, the addition of -complemented bracket expressions to the things affected by -newline-sensitive matching, the restrictions on parentheses and back -references in lookahead constraints, and the longest/shortest-match -(rather than first-match) matching semantics. -.PP -The matching rules for REs containing both normal and non-greedy -quantifiers have changed since early beta-test versions of this -package. (The new rules are much simpler and cleaner, but do not work -as hard at guessing the user's real intentions.) -.PP -Henry Spencer's original 1986 \fIregexp\fR package, still in -widespread use (e.g., in pre-8.1 releases of Tcl), implemented an -early version of today's EREs. There are four incompatibilities -between \fIregexp\fR's near-EREs -.PQ RREs " for short" -and AREs. In roughly increasing order of significance: -.IP \(bu 3 -In AREs, \fB\e\fR followed by an alphanumeric character is either an -escape or an error, while in RREs, it was just another way of writing -the alphanumeric. This should not be a problem because there was no -reason to write such a sequence in RREs. 
-.IP \(bu 3 -\fB{\fR followed by a digit in an ARE is the beginning of a bound, -while in RREs, \fB{\fR was always an ordinary character. Such -sequences should be rare, and will often result in an error because -following characters will not look like a valid bound. -.IP \(bu 3 -In AREs, \fB\e\fR remains a special character within -.QW \fB[\|]\fR , -so a literal \fB\e\fR within \fB[\|]\fR must be written -.QW \fB\e\e\fR . -\fB\e\e\fR also gives a literal \fB\e\fR within \fB[\|]\fR in RREs, -but only truly paranoid programmers routinely doubled the backslash. -.IP \(bu 3 -AREs report the longest/shortest match for the RE, rather than the -first found in a specified search order. This may affect some RREs -which were written in the expectation that the first match would be -reported. (The careful crafting of RREs to optimize the search order -for fast matching is obsolete (AREs examine all possible matches in -parallel, and their performance is largely insensitive to their -complexity) but cases where the search order was exploited to -deliberately find a match which was \fInot\fR the longest/shortest -will need rewriting.) -.SH "BASIC REGULAR EXPRESSIONS" -BREs differ from EREs in several respects. -.QW \fB|\fR , -.QW \fB+\fR , -and \fB?\fR are ordinary characters and there is no equivalent for their -functionality. The delimiters for bounds are \fB\e{\fR and -.QW \fB\e}\fR , -with \fB{\fR and \fB}\fR by themselves ordinary characters. The -parentheses for nested subexpressions are \fB\e(\fR and -.QW \fB\e)\fR , -with \fB(\fR and \fB)\fR by themselves ordinary -characters. \fB^\fR is an ordinary character except at the beginning -of the RE or the beginning of a parenthesized subexpression, \fB$\fR -is an ordinary character except at the end of the RE or the end of a -parenthesized subexpression, and \fB*\fR is an ordinary character if -it appears at the beginning of the RE or the beginning of a -parenthesized subexpression (after a possible leading -.QW \fB^\fR ). 
-Finally, single-digit back references are available, and \fB\e<\fR and -\fB\e>\fR are synonyms for -.QW \fB[[:<:]]\fR -and -.QW \fB[[:>:]]\fR -respectively; no other escapes are available. -.SH "SEE ALSO" -RegExp(3), regexp(n), regsub(n), lsearch(n), switch(n), text(n) -.SH KEYWORDS -match, regular expression, string -.\" Local Variables: -.\" mode: nroff -.\" End: |