author: rjohnson <rjohnson> 1998-03-26 14:45:59 (GMT)
committer: rjohnson <rjohnson> 1998-03-26 14:45:59 (GMT)
commit: 2b5738da524e944cda39e24c0a87b745a43bd8c3
tree: 6e8c9473978f6dab66c601e911721a7bd9d70b1b /doc/regexp.n
parent: c6a259aeeca4814a97cf6694814c63e74e4e18fa
Initial revision
Diffstat (limited to 'doc/regexp.n')
-rw-r--r-- doc/regexp.n 145
1 files changed, 145 insertions, 0 deletions
diff --git a/doc/regexp.n b/doc/regexp.n
new file mode 100644
index 0000000..f3951ee
--- /dev/null
+++ b/doc/regexp.n
@@ -0,0 +1,145 @@
+'\"
+'\" Copyright (c) 1993 The Regents of the University of California.
+'\" Copyright (c) 1994-1996 Sun Microsystems, Inc.
+'\"
+'\" See the file "license.terms" for information on usage and redistribution
+'\" of this file, and for a DISCLAIMER OF ALL WARRANTIES.
+'\"
+'\" SCCS: @(#) regexp.n 1.12 96/08/26 13:00:10
+'\"
+.so man.macros
+.TH regexp n "" Tcl "Tcl Built-In Commands"
+.BS
+'\" Note: do not modify the .SH NAME line immediately below!
+.SH NAME
+regexp \- Match a regular expression against a string
+.SH SYNOPSIS
+\fBregexp \fR?\fIswitches\fR? \fIexp string \fR?\fImatchVar\fR? ?\fIsubMatchVar subMatchVar ...\fR?
+.BE
+
+.SH DESCRIPTION
+.PP
+Determines whether the regular expression \fIexp\fR matches part or
+all of \fIstring\fR and returns 1 if it does, 0 if it doesn't.
+.LP
+If additional arguments are specified after \fIstring\fR then they
+are treated as the names of variables in which to return
+information about which part(s) of \fIstring\fR matched \fIexp\fR.
+\fIMatchVar\fR will be set to the range of \fIstring\fR that
+matched all of \fIexp\fR. The first \fIsubMatchVar\fR will contain
+the characters in \fIstring\fR that matched the leftmost parenthesized
+subexpression within \fIexp\fR, the next \fIsubMatchVar\fR will
+contain the characters that matched the next parenthesized
+subexpression to the right in \fIexp\fR, and so on.
+.LP
+If the initial arguments to \fBregexp\fR start with \fB\-\fR then
+they are treated as switches. The following switches are
+currently supported:
+.TP 10
+\fB\-nocase\fR
+Causes upper-case characters in \fIstring\fR to be treated as
+lower case during the matching process.
+.TP 10
+\fB\-indices\fR
+Changes what is stored in the \fIsubMatchVar\fRs.
+Instead of storing the matching characters from \fIstring\fR,
+each variable
+will contain a list of two decimal strings giving the indices
+in \fIstring\fR of the first and last characters in the matching
+range of characters.
+.TP 10
+\fB\-\|\-\fR
+Marks the end of switches. The argument following this one will
+be treated as \fIexp\fR even if it starts with a \fB\-\fR.
+.LP
+If there are more \fIsubMatchVar\fRs than parenthesized
+subexpressions within \fIexp\fR, or if a particular subexpression
+in \fIexp\fR doesn't match the string (e.g. because it was in a
+portion of the expression that wasn't matched), then the corresponding
+\fIsubMatchVar\fR will be set to ``\fB\-1 \-1\fR'' if \fB\-indices\fR
+has been specified or to an empty string otherwise.
+
+.SH "REGULAR EXPRESSIONS"
+.PP
+Regular expressions are implemented using Henry Spencer's package
+(thanks, Henry!),
+and much of the description of regular expressions below is copied verbatim
+from his manual entry.
+.PP
+A regular expression is zero or more \fIbranches\fR, separated by ``|''.
+It matches anything that matches one of the branches.
+.PP
+A branch is zero or more \fIpieces\fR, concatenated.
+It matches a match for the first, followed by a match for the second, etc.
+.PP
+A piece is an \fIatom\fR possibly followed by ``*'', ``+'', or ``?''.
+An atom followed by ``*'' matches a sequence of 0 or more matches of the atom.
+An atom followed by ``+'' matches a sequence of 1 or more matches of the atom.
+An atom followed by ``?'' matches a match of the atom, or the null string.
+.PP
+An atom is a regular expression in parentheses (matching a match for the
+regular expression), a \fIrange\fR (see below), ``.''
+(matching any single character), ``^'' (matching the null string at the
+beginning of the input string), ``$'' (matching the null string at the
+end of the input string), a ``\e'' followed by a single character (matching
+that character), or a single character with no other significance
+(matching that character).
+.PP
+A \fIrange\fR is a sequence of characters enclosed in ``[]''.
+It normally matches any single character from the sequence.
+If the sequence begins with ``^'',
+it matches any single character \fInot\fR from the rest of the sequence.
+If two characters in the sequence are separated by ``\-'', this is shorthand
+for the full list of ASCII characters between them
+(e.g. ``[0-9]'' matches any decimal digit).
+To include a literal ``]'' in the sequence, make it the first character
+(following a possible ``^'').
+To include a literal ``\-'', make it the first or last character.
+
+.SH "CHOOSING AMONG ALTERNATIVE MATCHES"
+.PP
+In general there may be more than one way to match a regular expression
+to an input string. For example, consider the command
+.CS
+\fBregexp (a*)b* aabaaabb x y\fR
+.CE
+Considering only the rules given so far, \fBx\fR and \fBy\fR could
+end up with the values \fBaabb\fR and \fBaa\fR, \fBaaab\fR and \fBaaa\fR,
+\fBab\fR and \fBa\fR, or any of several other combinations.
+To resolve this potential ambiguity \fBregexp\fR chooses among
+alternatives using the rule ``first then longest''.
+In other words, it considers the possible matches in order working
+from left to right across the input string and the pattern, and it
+attempts to match longer pieces of the input string before shorter
+ones. More specifically, the following rules apply in decreasing
+order of priority:
+.IP [1]
+If a regular expression could match two different parts of an input string
+then it will match the one that begins earliest.
+.IP [2]
+If a regular expression contains \fB|\fR operators then the leftmost
+matching sub-expression is chosen.
+.IP [3]
+In \fB*\fR, \fB+\fR, and \fB?\fR constructs, longer matches are chosen
+in preference to shorter ones.
+.IP [4]
+In sequences of expression components the components are considered
+from left to right.
+.LP
+In the example above, \fB(a*)b*\fR matches \fBaab\fR: the \fB(a*)\fR
+portion of the pattern is matched first and it consumes the leading
+\fBaa\fR; then the \fBb*\fR portion of the pattern consumes the
+next \fBb\fR. Or, consider the following example:
+.CS
+\fBregexp (ab|a)(b*)c abc x y z\fR
+.CE
+After this command \fBx\fR will be \fBabc\fR, \fBy\fR will be
+\fBab\fR, and \fBz\fR will be an empty string.
+Rule 4 specifies that \fB(ab|a)\fR gets first shot at the input
+string and Rule 2 specifies that the \fBab\fR sub-expression
+is checked before the \fBa\fR sub-expression.
+Thus the \fBb\fR has already been claimed before the \fB(b*)\fR
+component is checked and \fB(b*)\fR must match an empty string.
+
+.SH KEYWORDS
+match, regular expression, string
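
Illustration (not part of the commit): the behavior documented in the added man page can be sketched as a short Tcl script. It reuses the two patterns from the page's own examples and adds a few cases for the switches and for unmatched subexpressions. All variable names are arbitrary, and the expected values in the comments assume a tclsh whose regexp behaves as this page describes. Later Tcl releases replaced the regular expression engine, and these particular cases are expected to give the same results there.

```tcl
# Return value only: 1 if exp matches part or all of string, else 0.
set found [regexp {b+} aabaaabb]
puts $found                                  ;# 1

# matchVar and one subMatchVar, using the page's first example.
regexp {(a*)b*} aabaaabb whole sub
puts "whole=$whole sub=$sub"                 ;# whole=aab sub=aa

# -indices stores first/last character indices instead of the text.
regexp -indices {(a*)b*} aabaaabb iWhole iSub
puts "iWhole={$iWhole} iSub={$iSub}"         ;# iWhole={0 2} iSub={0 1}

# -nocase ignores case differences while matching.
regexp -nocase {hello} {Say HELLO} greeting
puts $greeting                               ;# HELLO

# -- ends the switches, so a pattern may begin with "-".
regexp -- {-x} a-xb dash
puts $dash                                   ;# -x

# A subexpression that takes no part in the match yields an empty
# string, or "-1 -1" when -indices is given.
regexp {(a)|(b)} b m s1 s2
puts "m=$m s1='$s1' s2=$s2"                  ;# m=b s1='' s2=b
regexp -indices {(a)|(b)} b im is1 is2
puts "is1={$is1} is2={$is2}"                 ;# is1={-1 -1} is2={0 0}

# The page's second example: (ab|a) claims "ab" first, leaving
# nothing for (b*).
regexp {(ab|a)(b*)c} abc x y z
puts "x=$x y=$y z='$z'"                      ;# x=abc y=ab z=''
```

The patterns are wrapped in braces so that the Tcl parser passes them to regexp unchanged. The unbraced forms shown in the man page also work for these patterns, because they contain no characters that Tcl would substitute.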