diff options
author | jpeek <jpeek> | 1999-06-24 21:15:13 (GMT) |
---|---|---|
committer | jpeek <jpeek> | 1999-06-24 21:15:13 (GMT) |
commit | 9f6f5c519abd3e4640e5fed701205a348b0bf006 (patch) | |
tree | eefaa1eee86447d2b23facd48fa29800b023a37d /doc/regexp.n | |
parent | 79d36b8166d773a6f2740a59820c30748c102226 (diff) | |
download | tcl-9f6f5c519abd3e4640e5fed701205a348b0bf006.zip tcl-9f6f5c519abd3e4640e5fed701205a348b0bf006.tar.gz tcl-9f6f5c519abd3e4640e5fed701205a348b0bf006.tar.bz2 |
Moved description of regular expression syntax from regexp.n into a
new re_syntax.n page. Modified other pages' references to regexp(n).
Added a few new "see also" entries pointing to re_syntax(n).
Diffstat (limited to 'doc/regexp.n')
-rw-r--r-- | doc/regexp.n | 906 |
1 files changed, 5 insertions, 901 deletions
diff --git a/doc/regexp.n b/doc/regexp.n index 2f6c4b6..03010e3 100644 --- a/doc/regexp.n +++ b/doc/regexp.n @@ -4,7 +4,7 @@ '\" See the file "license.terms" for information on usage and redistribution '\" of this file, and for a DISCLAIMER OF ALL WARRANTIES. '\" -'\" RCS: @(#) $Id: regexp.n,v 1.5 1999/05/06 19:14:42 stanton Exp $ +'\" RCS: @(#) $Id: regexp.n,v 1.6 1999/06/24 21:15:13 jpeek Exp $ '\" .so man.macros .TH regexp n 8.1 Tcl "Tcl Built-In Commands" @@ -21,6 +21,8 @@ regexp \- Match a regular expression against a string .PP Determines whether the regular expression \fIexp\fR matches part or all of \fIstring\fR and returns 1 if it does, 0 if it doesn't. +(Regular expression matching is described in the \fBre_syntax\fR +reference page.) .LP If additional arguments are specified after \fIstring\fR then they are treated as the names of variables in which to return @@ -93,906 +95,8 @@ portion of the expression that wasn't matched), then the corresponding \fIsubMatchVar\fR will be set to ``\fB\-1 \-1\fR'' if \fB\-indices\fR has been specified or to an empty string otherwise. -.SH "DIFFERENT FLAVORS OF REs" -.VS 8.1 -Regular expressions (``RE''s), as defined by POSIX, come in two -flavors: \fIextended\fR REs (``EREs'') and \fIbasic\fR REs (``BREs''). -EREs are roughly those of the traditional \fIegrep\fR, while BREs are -roughly those of the traditional \fIed\fR . This implementation adds -a third flavor, \fIadvanced\fR REs (``AREs''), basically EREs with -some significant extensions. -.PP -This manual page primarily describes AREs. BREs mostly exist for -backward compatibility in some old programs; they will be discussed at -the end. POSIX EREs are almost an exact subset of AREs. Features of -AREs that are not present in EREs will be indicated. - -.SH "REGULAR EXPRESSION SYNTAX" -.PP -Tcl regular expressions are implemented using the package written by -Henry Spencer, based on the 1003.2 spec and some (not quite all) of -the Perl5 extensions (thanks, Henry!). Much of the description of -regular expressions below is copied verbatim from his manual entry. -.PP -An ARE is one or more \fIbranches\fR, -separated by `\fB|\fR', -matching anything that matches any of the branches. -.PP -A branch is zero or more \fIconstraints\fR or \fIquantified atoms\fR, -concatenated. -It matches a match for the first, followed by a match for the second, etc; -an empty branch matches the empty string. -.PP -A quantified atom is an \fIatom\fR possibly followed -by a single \fIquantifier\fR. -Without a quantifier, it matches a match for the atom. -The quantifiers, -and what a so-quantified atom matches, are: -.RS 2 -.TP 6 -\fB*\fR -a sequence of 0 or more matches of the atom -.TP -\fB+\fR -a sequence of 1 or more matches of the atom -.TP -\fB?\fR -a sequence of 0 or 1 matches of the atom -.TP -\fB{\fIm\fB}\fR -a sequence of exactly \fIm\fR matches of the atom -.TP -\fB{\fIm\fB,}\fR -a sequence of \fIm\fR or more matches of the atom -.TP -\fB{\fIm\fB,\fIn\fB}\fR -a sequence of \fIm\fR through \fIn\fR (inclusive) matches of the atom; -\fIm\fR may not exceed \fIn\fR -.TP -\fB*? +? ?? {\fIm\fB}? {\fIm\fB,}? {\fIm\fB,\fIn\fB}?\fR -\fInon-greedy\fR quantifiers, -which match the same possibilities, -but prefer the smallest number rather than the largest number -of matches (see MATCHING) -.RE -.PP -The forms using -\fB{\fR and \fB}\fR -are known as \fIbound\fRs. -The numbers -\fIm\fR and \fIn\fR are unsigned decimal integers -with permissible values from 0 to 255 inclusive. -.PP -An atom is one of: -.RS 2 -.TP 6 -\fB(\fIre\fB)\fR -(where \fIre\fR is any regular expression) -matches a match for -\fIre\fR, with the match noted for possible reporting -.TP -\fB(?:\fIre\fB)\fR -as previous, -but does no reporting -(a ``non-capturing'' set of parentheses) -.TP -\fB()\fR -matches an empty string, -noted for possible reporting -.TP -\fB(?:)\fR -matches an empty string, -without reporting -.TP -\fB[\fIchars\fB]\fR -a \fIbracket expression\fR, -matching any one of the \fIchars\fR (see BRACKET EXPRESSIONS for more detail) -.TP - \fB.\fR -matches any single character -.TP -\fB\e\fIk\fR -(where \fIk\fR is a non-alphanumeric character) -matches that character taken as an ordinary character, -e.g. \e\e matches a backslash character -.TP -\fB\e\fIc\fR -where \fIc\fR is alphanumeric -(possibly followed by other characters), -an \fIescape\fR (AREs only), -see ESCAPES below -.TP -\fB{\fR -when followed by a character other than a digit, -matches the left-brace character `\fB{\fR'; -when followed by a digit, it is the beginning of a -\fIbound\fR (see above) -.TP -\fIx\fR -where \fIx\fR is -a single character with no other significance, matches that character. -.RE -.PP -A \fIconstraint\fR matches an empty string when specific conditions -are met. -A constraint may not be followed by a quantifier. -The simple constraints are as follows; some more constraints are -described later, under ESCAPES. -.RS 2 -.TP 8 -\fB^\fR -matches at the beginning of a line -.TP -\fB$\fR -matches at the end of a line -.TP -\fB(?=\fIre\fB)\fR -\fIpositive lookahead\fR (AREs only), matches at any point -where a substring matching \fIre\fR begins -.TP -\fB(?!\fIre\fB)\fR -\fInegative lookahead\fR (AREs only), matches at any point -where no substring matching \fIre\fR begins -.RE -.PP -The lookahead constraints may not contain back references (see later), -and all parentheses within them are considered non-capturing. -.PP -An RE may not end with `\fB\e\fR'. - -.SH "BRACKET EXPRESSIONS" -A \fIbracket expression\fR is a list of characters enclosed in `\fB[\|]\fR'. -It normally matches any single character from the list (but see below). -If the list begins with `\fB^\fR', -it matches any single character -(but see below) \fInot\fR from the rest of the list. -.PP -If two characters in the list are separated by `\fB\-\fR', -this is shorthand -for the full \fIrange\fR of characters between those two (inclusive) in the -collating sequence, -e.g. -\fB[0\-9]\fR -in ASCII matches any decimal digit. -Two ranges may not share an -endpoint, so e.g. -\fBa\-c\-e\fR -is illegal. -Ranges are very collating-sequence-dependent, -and portable programs should avoid relying on them. -.PP -To include a literal -\fB]\fR -or -\fB\-\fR -in the list, -the simplest method is to -enclose it in -\fB[.\fR -and -\fB.]\fR -to make it a collating element (see below). -Alternatively, -make it the first character -(following a possible `\fB^\fR'), -or (AREs only) precede it with `\fB\e\fR'. -Alternatively, for `\fB\-\fR', -make it the last character, -or the second endpoint of a range. -To use a literal -\fB\-\fR -as the first endpoint of a range, -make it a collating element -or (AREs only) precede it with `\fB\e\fR'. -With the exception of these, some combinations using -\fB[\fR -(see next -paragraphs), and escapes, -all other special characters lose their -special significance within a bracket expression. -.PP -Within a bracket expression, a collating element (a character, -a multi-character sequence that collates as if it were a single character, -or a collating-sequence name for either) -enclosed in -\fB[.\fR -and -\fB.]\fR -stands for the -sequence of characters of that collating element. -The sequence is a single element of the bracket expression's list. -A bracket expression in a locale which has -multi-character collating elements -can thus match more than one character. -Most insidiously, if -\fB^\fR -is used, -this can happen even if no multi-character collating -elements appear in the bracket expression! -If the collating sequence includes a -\fBch\fR -multi-character collating element, -then the RE -\fB[[.ch.]]*c\fR -matches the first five characters -of `\fBchchcc\fR', -and the RE -\fB[^c]b\fR -matches all of `\fBchb\fR'. -.PP -Within a bracket expression, a collating element enclosed in -\fB[=\fR -and -\fB=]\fR -is an equivalence class, standing for the sequences of characters -of all collating elements equivalent to that one, including itself. -(If there are no other equivalent collating elements, -the treatment is as if the enclosing delimiters were `\fB[.\fR'\& -and `\fB.]\fR'.) -For example, if -\fBo\fR -and -\fB\o'o^'\fR -are the members of an equivalence class, -then `\fB[[=o=]]\fR', `\fB[[=\o'o^'=]]\fR', -and `\fB[o\o'o^']\fR'\& -are all synonymous. -An equivalence class may not be an endpoint -of a range. -.PP -Within a bracket expression, the name of a \fIcharacter class\fR enclosed -in -\fB[:\fR -and -\fB:]\fR -stands for the list of all characters -(not all collating elements!) -belonging to that -class. -Standard character class names are: -.PP -.RS -.ne 5 -.nf -.ta 3c 6c 9c -\fBalnum digit punct -alpha graph space -blank lower upper -cntrl print xdigit\fR -.fi -.RE -.PP -These stand for the character classes defined in -\fIctype\fR(3). -A locale may provide others. -A character class may not be used as an endpoint of a range. -.PP -There are two special cases of bracket expressions: -the bracket expressions -\fB[[:<:]]\fR -and -\fB[[:>:]]\fR -are constraints, matching empty strings at -the beginning and end of a word respectively. -'\" note, discussion of escapes below references this definition of word -A word is defined as a sequence of -word characters -which is neither preceded nor followed by -word characters. -A word character is an -\fIalnum\fR -character (as defined by -\fIctype\fR(3)) -or an underscore -(\fB_\fR). -These special bracket expressions are deprecated; -users of AREs should use constraint escapes instead (see below). -.SH ESCAPES -Escapes (AREs only), which begin with a -\fB\e\fR -followed by an alphanumeric character, -come in several varieties: -character entry, class shorthands, constraint escapes, and back references. -A -\fB\e\fR -followed by an alphanumeric character but not constituting -a valid escape is illegal in AREs. -In EREs, there are no escapes: -outside a bracket expression, -a -\fB\e\fR -followed by an alphanumeric character merely stands for that -character as an ordinary character, -and inside a bracket expression, -\fB\e\fR -is an ordinary character. -(The latter is the one actual incompatibility between EREs and AREs.) -.PP -Character-entry escapes (AREs only) exist to make it easier to specify -non-printing and otherwise inconvenient characters in REs: -.RS 2 -.TP 5 -\fB\ea\fR -alert, aka bell, character, as in C -.TP -\fB\eb\fR -backspace, as in C -.TP -\fB\eB\fR -synonym for -\fB\e\fR -to help reduce backslash doubling in some -applications where there are multiple levels of backslash processing -.TP -\fB\ec\fIX\fR -(where X is any character) the character whose -low-order 5 bits are the same as those of -\fIX\fR, -and whose other bits are all zero -.TP -\fB\ee\fR -the character whose collating-sequence name -is `\fBESC\fR', -or failing that, the character with octal value 033 -.TP -\fB\ef\fR -formfeed, as in C -.TP -\fB\en\fR -newline, as in C -.TP -\fB\er\fR -carriage return, as in C -.TP -\fB\et\fR -horizontal tab, as in C -.TP -\fB\eu\fIwxyz\fR -(where -\fIwxyz\fR -is exactly four hexadecimal digits) -the Unicode character -\fBU+\fIwxyz\fR -in the local byte ordering -.TP -\fB\eU\fIstuvwxyz\fR -(where -\fIstuvwxyz\fR -is exactly eight hexadecimal digits) -reserved for a somewhat-hypothetical Unicode extension to 32 bits -.TP -\fB\ev\fR -vertical tab, as in C -are all available. -.TP -\fB\ex\fIhhh\fR -(where -\fIhhh\fR -is any sequence of hexadecimal digits) -the character whose hexadecimal value is -\fB0x\fIhhh\fR -(a single character no matter how many hexadecimal digits are used). -.TP -\fB\e0\fR -the character whose value is -\fB0\fR -.TP -\fB\e\fIxy\fR -(where -\fIxy\fR -is exactly two octal digits, -and is not a -\fIback reference\fR (see below)) -the character whose octal value is -\fB0\fIxy\fR -.TP -\fB\e\fIxyz\fR -(where -\fIxyz\fR -is exactly three octal digits, -and is not a -back reference (see below)) -the character whose octal value is -\fB0\fIxyz\fR -.RE -.PP -Hexadecimal digits are `\fB0\fR'-`\fB9\fR', `\fBa\fR'-`\fBf\fR', -and `\fBA\fR'-`\fBF\fR'. -Octal digits are `\fB0\fR'-`\fB7\fR'. -.PP -The character-entry escapes are always taken as ordinary characters. -For example, -\fB\e135\fR -is -\fB]\fR -in ASCII, -but -\fB\e135\fR -does not terminate a bracket expression. -Beware, however, that some applications (e.g., C compilers) interpret -such sequences themselves before the regular-expression package -gets to see them, which may require doubling (quadrupling, etc.) the `\fB\e\fR'. -.PP -Class-shorthand escapes (AREs only) provide shorthands for certain commonly-used -character classes: -.RS 2 -.TP 10 -\fB\ed\fR -\fB[[:digit:]]\fR -.TP -\fB\es\fR -\fB[[:space:]]\fR -.TP -\fB\ew\fR -\fB[[:alnum:]_]\fR -(note underscore) -.TP -\fB\eD\fR -\fB[^[:digit:]]\fR -.TP -\fB\eS\fR -\fB[^[:space:]]\fR -.TP -\fB\eW\fR -\fB[^[:alnum:]_]\fR -(note underscore) -.RE -.PP -Within bracket expressions, `\fB\ed\fR', `\fB\es\fR', -and `\fB\ew\fR'\& -lose their outer brackets, -and `\fB\eD\fR', `\fB\eS\fR', -and `\fB\eW\fR'\& -are illegal. -.PP -A constraint escape (AREs only) is a constraint, -matching the empty string if specific conditions are met, -written as an escape: -.RS 2 -.TP 6 -\fB\eA\fR -matches only at the beginning of the string -(see MATCHING, below, for how this differs from `\fB^\fR') -.TP -\fB\em\fR -matches only at the beginning of a word -.TP -\fB\eM\fR -matches only at the end of a word -.TP -\fB\ey\fR -matches only at the beginning or end of a word -.TP -\fB\eY\fR -matches only at a point which is not the beginning or end of a word -.TP -\fB\eZ\fR -matches only at the end of the string -(see MATCHING, below, for how this differs from `\fB$\fR') -.TP -\fB\e\fIm\fR -(where -\fIm\fR -is a nonzero digit) a \fIback reference\fR, see below -.TP -\fB\e\fImnn\fR -(where -\fIm\fR -is a nonzero digit, and -\fInn\fR -is some more digits, -and the decimal value -\fImnn\fR -is not greater than the number of closing capturing parentheses seen so far) -a \fIback reference\fR, see below -.RE -.PP -A word is defined as in the specification of -\fB[[:<:]]\fR -and -\fB[[:>:]]\fR -above. -Constraint escapes are illegal within bracket expressions. -.PP -A back reference (AREs only) matches the same string matched by the parenthesized -subexpression specified by the number, -so that (e.g.) -\fB([bc])\e1\fR -matches -\fBbb\fR -or -\fBcc\fR -but not `\fBbc\fR'. -The subexpression must entirely precede the back reference in the RE. -Subexpressions are numbered in the order of their leading parentheses. -Non-capturing parentheses do not define subexpressions. -.PP -There is an inherent historical ambiguity between octal character-entry -escapes and back references, which is resolved by heuristics, -as hinted at above. -A leading zero always indicates an octal escape. -A single non-zero digit, not followed by another digit, -is always taken as a back reference. -A multi-digit sequence not starting with a zero is taken as a back -reference if it comes after a suitable subexpression -(i.e. the number is in the legal range for a back reference), -and otherwise is taken as octal. -.SH "METASYNTAX" -In addition to the main syntax described above, there are some special -forms and miscellaneous syntactic facilities available. -.PP -Normally the flavor of RE being used is specified by -application-dependent means. -However, this can be overridden by a \fIdirector\fR. -If an RE of any flavor begins with `\fB***:\fR', -the rest of the RE is an ARE. -If an RE of any flavor begins with `\fB***=\fR', -the rest of the RE is taken to be a literal string, -with all characters considered ordinary characters. -.PP -An ARE may begin with \fIembedded options\fR: -a sequence -\fB(?\fIxyz\fB)\fR -(where -\fIxyz\fR -is one or more alphabetic characters) -specifies options affecting the rest of the RE. -These supplement, and can override, -any options specified by the application. -The available option letters are: -.RS 2 -.TP 3 -\fBb\fR -rest of RE is a BRE -.TP 3 -\fBc\fR -case-sensitive matching (usual default) -.TP 3 -\fBe\fR -rest of RE is an ERE -.TP 3 -\fBi\fR -case-insensitive matching (see MATCHING, below) -.TP 3 -\fBm\fR -historical synonym for -\fBn\fR -.TP 3 -\fBn\fR -newline-sensitive matching (see MATCHING, below) -.TP 3 -\fBp\fR -partial newline-sensitive matching (see MATCHING, below) -.TP 3 -\fBq\fR -rest of RE is a literal (``quoted'') string, all ordinary characters -.TP 3 -\fBs\fR -non-newline-sensitive matching (usual default) -.TP 3 -\fBt\fR -tight syntax (usual default; see below) -.TP 3 -\fBw\fR -inverse partial newline-sensitive (``weird'') matching (see MATCHING, below) -.TP 3 -\fBx\fR -expanded syntax (see below) -.RE -.PP -Embedded options take effect at the -\fB)\fR -terminating the sequence. -They are available only at the start of an ARE, -and may not be used later within it. -.PP -In addition to the usual (\fItight\fR) RE syntax, in which all characters are -significant, there is an \fIexpanded\fR syntax, -available in all flavors of RE -with the \fB-expanded\fR switch, or in AREs with the embedded x option. -In the expanded syntax, -white-space characters are ignored -and all characters between a -\fB#\fR -and the following newline (or the end of the RE) are ignored, -permitting paragraphing and commenting a complex RE. -There are three exceptions to that basic rule: -.RS 2 -.PP -a white-space character or `\fB#\fR' preceded by `\fB\e\fR' is retained -.PP -white space or `\fB#\fR' within a bracket expression is retained -.PP -white space and comments are illegal within multi-character symbols -like the ARE `\fB(?:\fR' or the BRE `\fB\e(\fR' -.RE -.PP -Expanded-syntax -white-space characters are blank, tab, newline, etc. (any character -defined as \fIspace\fR by -\fIctype\fR(3)). -Exactly how a multi-line expanded-syntax RE -can be entered interactively by a user, -if at all, is application-specific; -expanded syntax is primarily a scripting facility. -.PP -Finally, in an ARE, -outside bracket expressions, the sequence `\fB(?#\fIttt\fB)\fR' -(where -\fIttt\fR -is any text not containing a `\fB)\fR') -is a comment, -completely ignored. -Again, this is not allowed between the characters of -multi-character symbols like `\fB(?:\fR'. -Such comments are more a historical artifact than a useful facility, -and their use is deprecated; -use the expanded syntax instead. -.PP -\fINone\fR of these metasyntax extensions is available if the application -(or an initial -\fB***=\fR -director) -has specified that the user's input be treated as a literal string -rather than as an RE. -.SH MATCHING -In the event that an RE could match more than one substring of a given -string, -the RE matches the one starting earliest in the string. -If the RE could match more than one substring starting at that point, -its choice is determined by its \fIpreference\fR: -either the longest substring, or the shortest. -.PP -Most atoms, and all constraints, have no preference. -A parenthesized RE has the same preference (possibly none) as the RE. -A quantified atom with quantifier -\fB{\fIm\fB}\fR -or -\fB{\fIm\fB}?\fR -has the same preference (possibly none) as the atom itself. -A quantified atom with other normal quantifiers (including -\fB{\fIm\fB,\fIn\fB}\fR -with -\fIm\fR -equal to -\fIn\fR) -prefers longest match. -A quantified atom with other non-greedy quantifiers (including -\fB{\fIm\fB,\fIn\fB}?\fR -with -\fIm\fR -equal to -\fIn\fR) -prefers shortest match. -A branch has the same preference as the first quantified atom in it -which has a preference. -An RE consisting of two or more branches connected by the -\fB|\fR -operator prefers longest match. -.PP -Subject to the constraints imposed by the rules for matching the whole RE, -subexpressions also match the longest or shortest possible substrings, -based on their preferences, -with subexpressions starting earlier in the RE taking priority over -ones starting later. -Note that outer subexpressions thus take priority over -their component subexpressions. -.PP -Note that the quantifiers -\fB{1,1}\fR -and -\fB{1,1}?\fR -can be used to force longest and shortest preference, respectively, -on a subexpression or a whole RE. -.PP -Match lengths are measured in characters, not collating elements. -An empty string is considered longer than no match at all. -For example, -\fBbb*\fR -matches the three middle characters of `\fBabbbc\fR', -\fB(week|wee)(night|knights)\fR -matches all ten characters of `\fBweeknights\fR', -when -\fB(.*).*\fR -is matched against -\fBabc\fR -the parenthesized subexpression -matches all three characters, and -when -\fB(a*)*\fR -is matched against -\fBbc\fR -both the whole RE and the parenthesized -subexpression match an empty string. -.PP -If case-independent matching is specified, -the effect is much as if all case distinctions had vanished from the -alphabet. -When an alphabetic that exists in multiple cases appears as an -ordinary character outside a bracket expression, it is effectively -transformed into a bracket expression containing both cases, -so that -\fBx\fR -becomes `\fB[xX]\fR'. -When it appears inside a bracket expression, all case counterparts -of it are added to the bracket expression, so that -\fB[x]\fR -becomes -\fB[xX]\fR -and -\fB[^x]\fR -becomes `\fB[^xX]\fR'. -.PP -If newline-sensitive matching is specified, -\fB.\fR -and bracket expressions using -\fB^\fR -will never match the newline character -(so that matches will never cross newlines unless the RE -explicitly arranges it) -and -\fB^\fR -and -\fB$\fR -will match the empty string after and before a newline -respectively, in addition to matching at beginning and end of string -respectively. -ARE -\fB\eA\fR -and -\fB\eZ\fR -continue to match beginning or end of string \fIonly\fR. -.PP -If partial newline-sensitive matching is specified, -this affects -\fB.\fR -and bracket expressions -as with newline-sensitive matching, but not -\fB^\fR -and `\fB$\fR'. -.PP -If inverse partial newline-sensitive matching is specified, -this affects -\fB^\fR -and -\fB$\fR -as with -newline-sensitive matching, -but not -\fB.\fR -and bracket expressions. -This isn't very useful but is provided for symmetry. -.SH "LIMITS AND COMPATIBILITY" -No particular limit is imposed on the length of REs. -Programs intended to be highly portable should not employ REs longer -than 256 bytes, -as a POSIX-compliant implementation can refuse to accept such REs. -.PP -The only feature of AREs that is actually incompatible with -POSIX EREs is that -\fB\e\fR -does not lose its special -significance inside bracket expressions. -All other ARE features use syntax which is illegal or has -undefined or unspecified effects in POSIX EREs; -the -\fB***\fR -syntax of directors likewise is outside the POSIX -syntax for both BREs and EREs. -.PP -Many of the ARE extensions are borrowed from Perl, but some have -been changed to clean them up, and a few Perl extensions are not present. -Incompatibilities of note include `\fB\eb\fR', `\fB\eB\fR', -the lack of special treatment for a trailing newline, -the addition of complemented bracket expressions to the things -affected by newline-sensitive matching, -the restrictions on parentheses and back references in lookahead constraints, -and the longest/shortest-match (rather than first-match) matching semantics. -.PP -The matching rules for REs containing both normal and non-greedy quantifiers -have changed since early beta-test versions of this package. -(The new rules are much simpler and cleaner, -but don't work as hard at guessing the user's real intentions.) -.PP -Henry Spencer's original 1986 \fIregexp\fR package, -still in widespread use (e.g., in pre-8.1 releases of Tcl), -implemented an early version of today's EREs. -There are four incompatibilities between \fIregexp\fR's near-EREs -(`RREs' for short) and AREs. -In roughly increasing order of significance: -.PP -.RS -In AREs, -\fB\e\fR -followed by an alphanumeric character is either an -escape or an error, -while in RREs, it was just another way of writing the -alphanumeric. -This should not be a problem because there was no reason to write -such a sequence in RREs. -.PP -\fB{\fR -followed by a digit in an ARE is the beginning of a bound, -while in RREs, -\fB{\fR -was always an ordinary character. -Such sequences should be rare, -and will often result in an error because following characters -will not look like a valid bound. -.PP -In AREs, -\fB\e\fR -remains a special character within `\fB[\|]\fR', -so a literal -\fB\e\fR -within -\fB[\|]\fR -must be written `\fB\e\e\fR'. -\fB\e\e\fR -also gives a literal -\fB\e\fR -within -\fB[\|]\fR -in RREs, -but only truly paranoid programmers routinely doubled the backslash. -.PP -AREs report the longest/shortest match for the RE, -rather than the first found in a specified search order. -This may affect some RREs which were written in the expectation that -the first match would be reported. -(The careful crafting of RREs to optimize the search order for fast -matching is obsolete (AREs examine all possible matches -in parallel, and their performance is largely insensitive to their -complexity) but cases where the search order was exploited to deliberately -find a match which was \fInot\fR the longest/shortest will need rewriting.) -.RE - -.SH "BASIC REGULAR EXPRESSIONS" -BREs differ from EREs in several respects. `\fB|\fR', `\fB+\fR', -and -\fB?\fR -are ordinary characters and there is no equivalent -for their functionality. -The delimiters for bounds are -\fB\e{\fR -and `\fB\e}\fR', -with -\fB{\fR -and -\fB}\fR -by themselves ordinary characters. -The parentheses for nested subexpressions are -\fB\e(\fR -and `\fB\e)\fR', -with -\fB(\fR -and -\fB)\fR -by themselves ordinary characters. -\fB^\fR -is an ordinary character except at the beginning of the -RE or the beginning of a parenthesized subexpression, -\fB$\fR -is an ordinary character except at the end of the -RE or the end of a parenthesized subexpression, -and -\fB*\fR -is an ordinary character if it appears at the beginning of the -RE or the beginning of a parenthesized subexpression -(after a possible leading `\fB^\fR'). -Finally, -single-digit back references are available, -and -\fB\e<\fR -and -\fB\e>\fR -are synonyms for -\fB[[:<:]]\fR -and -\fB[[:>:]]\fR -respectively; -no other escapes are available. +.SH "SEE ALSO" +re_syntax(n) -.VE 8.1 .SH KEYWORDS match, regular expression, string |