summaryrefslogtreecommitdiffstats
path: root/Doc/lib/libregex.tex
diff options
context:
space:
mode:
authorThomas Wouters <thomas@python.org>2006-04-21 10:40:58 (GMT)
committerThomas Wouters <thomas@python.org>2006-04-21 10:40:58 (GMT)
commit49fd7fa4431da299196d74087df4a04f99f9c46f (patch)
tree35ace5fe78d3d52c7a9ab356ab9f6dbf8d4b71f4 /Doc/lib/libregex.tex
parent9ada3d6e29d5165dadacbe6be07bcd35cfbef59d (diff)
downloadcpython-49fd7fa4431da299196d74087df4a04f99f9c46f.zip
cpython-49fd7fa4431da299196d74087df4a04f99f9c46f.tar.gz
cpython-49fd7fa4431da299196d74087df4a04f99f9c46f.tar.bz2
Merge p3yk branch with the trunk up to revision 45595. This breaks a fair
number of tests, all because of the codecs/_multibytecodecs issue described here (it's not a Py3K issue, just something Py3K discovers): http://mail.python.org/pipermail/python-dev/2006-April/064051.html Hye-Shik Chang promised to look for a fix, so no need to fix it here. The tests that are expected to break are: test_codecencodings_cn test_codecencodings_hk test_codecencodings_jp test_codecencodings_kr test_codecencodings_tw test_codecs test_multibytecodec This merge fixes an actual test failure (test_weakref) in this branch, though, so I believe merging is the right thing to do anyway.
Diffstat (limited to 'Doc/lib/libregex.tex')
-rw-r--r--Doc/lib/libregex.tex370
1 files changed, 0 insertions, 370 deletions
diff --git a/Doc/lib/libregex.tex b/Doc/lib/libregex.tex
deleted file mode 100644
index 0982f81..0000000
--- a/Doc/lib/libregex.tex
+++ /dev/null
@@ -1,370 +0,0 @@
-\section{\module{regex} ---
- Regular expression operations}
-\declaremodule{builtin}{regex}
-
-\modulesynopsis{Regular expression search and match operations.
- \strong{Obsolete!}}
-
-
-This module provides regular expression matching operations similar to
-those found in Emacs.
-
-\strong{Obsolescence note:}
-This module is obsolete as of Python version 1.5; it is still being
-maintained because much existing code still uses it. All new code in
-need of regular expressions should use the new
-\code{re}\refstmodindex{re} module, which supports the more powerful
-and regular Perl-style regular expressions. Existing code should be
-converted. The standard library module
-\code{reconvert}\refstmodindex{reconvert} helps in converting
-\code{regex} style regular expressions to \code{re}\refstmodindex{re}
-style regular expressions. (For more conversion help, see Andrew
-Kuchling's\index{Kuchling, Andrew} ``\module{regex-to-re} HOWTO'' at
-\url{http://www.python.org/doc/howto/regex-to-re/}.)
-
-By default the patterns are Emacs-style regular expressions
-(with one exception). There is
-a way to change the syntax to match that of several well-known
-\UNIX{} utilities. The exception is that Emacs' \samp{\e s}
-pattern is not supported, since the original implementation references
-the Emacs syntax tables.
-
-This module is 8-bit clean: both patterns and strings may contain null
-bytes and characters whose high bit is set.
-
-\strong{Please note:} There is a little-known fact about Python string
-literals which means that you don't usually have to worry about
-doubling backslashes, even though they are used to escape special
-characters in string literals as well as in regular expressions. This
-is because Python doesn't remove backslashes from string literals if
-they are followed by an unrecognized escape character.
-\emph{However}, if you want to include a literal \dfn{backslash} in a
-regular expression represented as a string literal, you have to
-\emph{quadruple} it or enclose it in a singleton character class.
-E.g.\ to extract \LaTeX\ \samp{\e section\{\textrm{\ldots}\}} headers
-from a document, you can use this pattern:
-\code{'[\e ]section\{\e (.*\e )\}'}. \emph{Another exception:}
-the escape sequence \samp{\e b} is significant in string literals
-(where it means the ASCII bell character) as well as in Emacs regular
-expressions (where it stands for a word boundary), so in order to
-search for a word boundary, you should use the pattern \code{'\e \e b'}.
-Similarly, a backslash followed by a digit 0-7 should be doubled to
-avoid interpretation as an octal escape.
-
-\subsection{Regular Expressions}
-
-A regular expression (or RE) specifies a set of strings that matches
-it; the functions in this module let you check if a particular string
-matches a given regular expression (or if a given regular expression
-matches a particular string, which comes down to the same thing).
-
-Regular expressions can be concatenated to form new regular
-expressions; if \emph{A} and \emph{B} are both regular expressions,
-then \emph{AB} is also an regular expression. If a string \emph{p}
-matches A and another string \emph{q} matches B, the string \emph{pq}
-will match AB. Thus, complex expressions can easily be constructed
-from simpler ones like the primitives described here. For details of
-the theory and implementation of regular expressions, consult almost
-any textbook about compiler construction.
-
-% XXX The reference could be made more specific, say to
-% "Compilers: Principles, Techniques and Tools", by Alfred V. Aho,
-% Ravi Sethi, and Jeffrey D. Ullman, or some FA text.
-
-A brief explanation of the format of regular expressions follows.
-
-Regular expressions can contain both special and ordinary characters.
-Ordinary characters, like '\code{A}', '\code{a}', or '\code{0}', are
-the simplest regular expressions; they simply match themselves. You
-can concatenate ordinary characters, so '\code{last}' matches the
-characters 'last'. (In the rest of this section, we'll write RE's in
-\code{this special font}, usually without quotes, and strings to be
-matched 'in single quotes'.)
-
-Special characters either stand for classes of ordinary characters, or
-affect how the regular expressions around them are interpreted.
-
-The special characters are:
-\begin{itemize}
-\item[\code{.}] (Dot.) Matches any character except a newline.
-\item[\code{\^}] (Caret.) Matches the start of the string.
-\item[\code{\$}] Matches the end of the string.
-\code{foo} matches both 'foo' and 'foobar', while the regular
-expression '\code{foo\$}' matches only 'foo'.
-\item[\code{*}] Causes the resulting RE to
-match 0 or more repetitions of the preceding RE. \code{ab*} will
-match 'a', 'ab', or 'a' followed by any number of 'b's.
-\item[\code{+}] Causes the
-resulting RE to match 1 or more repetitions of the preceding RE.
-\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
-will not match just 'a'.
-\item[\code{?}] Causes the resulting RE to
-match 0 or 1 repetitions of the preceding RE. \code{ab?} will
-match either 'a' or 'ab'.
-
-\item[\code{\e}] Either escapes special characters (permitting you to match
-characters like '*?+\&\$'), or signals a special sequence; special
-sequences are discussed below. Remember that Python also uses the
-backslash as an escape sequence in string literals; if the escape
-sequence isn't recognized by Python's parser, the backslash and
-subsequent character are included in the resulting string. However,
-if Python would recognize the resulting sequence, the backslash should
-be repeated twice.
-
-\item[\code{[]}] Used to indicate a set of characters. Characters can
-be listed individually, or a range is indicated by giving two
-characters and separating them by a '-'. Special characters are
-not active inside sets. For example, \code{[akm\$]}
-will match any of the characters 'a', 'k', 'm', or '\$'; \code{[a-z]} will
-match any lowercase letter.
-
-If you want to include a \code{]} inside a
-set, it must be the first character of the set; to include a \code{-},
-place it as the first or last character.
-
-Characters \emph{not} within a range can be matched by including a
-\code{\^} as the first character of the set; \code{\^} elsewhere will
-simply match the '\code{\^}' character.
-\end{itemize}
-
-The special sequences consist of '\code{\e}' and a character
-from the list below. If the ordinary character is not on the list,
-then the resulting RE will match the second character. For example,
-\code{\e\$} matches the character '\$'. Ones where the backslash
-should be doubled in string literals are indicated.
-
-\begin{itemize}
-\item[\code{\e|}]\code{A\e|B}, where A and B can be arbitrary REs,
-creates a regular expression that will match either A or B. This can
-be used inside groups (see below) as well.
-%
-\item[\code{\e( \e)}] Indicates the start and end of a group; the
-contents of a group can be matched later in the string with the
-\code{\e [1-9]} special sequence, described next.
-\end{itemize}
-
-\begin{fulllineitems}
-\item[\code{\e \e 1, ... \e \e 7, \e 8, \e 9}]
-Matches the contents of the group of the same
-number. For example, \code{\e (.+\e ) \e \e 1} matches 'the the' or
-'55 55', but not 'the end' (note the space after the group). This
-special sequence can only be used to match one of the first 9 groups;
-groups with higher numbers can be matched using the \code{\e v}
-sequence. (\code{\e 8} and \code{\e 9} don't need a double backslash
-because they are not octal digits.)
-\end{fulllineitems}
-
-\begin{itemize}
-\item[\code{\e \e b}] Matches the empty string, but only at the
-beginning or end of a word. A word is defined as a sequence of
-alphanumeric characters, so the end of a word is indicated by
-whitespace or a non-alphanumeric character.
-%
-\item[\code{\e B}] Matches the empty string, but when it is \emph{not} at the
-beginning or end of a word.
-%
-\item[\code{\e v}] Must be followed by a two digit decimal number, and
-matches the contents of the group of the same number. The group
-number must be between 1 and 99, inclusive.
-%
-\item[\code{\e w}]Matches any alphanumeric character; this is
-equivalent to the set \code{[a-zA-Z0-9]}.
-%
-\item[\code{\e W}] Matches any non-alphanumeric character; this is
-equivalent to the set \code{[\^a-zA-Z0-9]}.
-\item[\code{\e <}] Matches the empty string, but only at the beginning of a
-word. A word is defined as a sequence of alphanumeric characters, so
-the end of a word is indicated by whitespace or a non-alphanumeric
-character.
-\item[\code{\e >}] Matches the empty string, but only at the end of a
-word.
-
-\item[\code{\e \e \e \e}] Matches a literal backslash.
-
-% In Emacs, the following two are start of buffer/end of buffer. In
-% Python they seem to be synonyms for ^$.
-\item[\code{\e `}] Like \code{\^}, this only matches at the start of the
-string.
-\item[\code{\e \e '}] Like \code{\$}, this only matches at the end of
-the string.
-% end of buffer
-\end{itemize}
-
-\subsection{Module Contents}
-\nodename{Contents of Module regex}
-
-The module defines these functions, and an exception:
-
-
-\begin{funcdesc}{match}{pattern, string}
- Return how many characters at the beginning of \var{string} match
- the regular expression \var{pattern}. Return \code{-1} if the
- string does not match the pattern (this is different from a
- zero-length match!).
-\end{funcdesc}
-
-\begin{funcdesc}{search}{pattern, string}
- Return the first position in \var{string} that matches the regular
- expression \var{pattern}. Return \code{-1} if no position in the string
- matches the pattern (this is different from a zero-length match
- anywhere!).
-\end{funcdesc}
-
-\begin{funcdesc}{compile}{pattern\optional{, translate}}
- Compile a regular expression pattern into a regular expression
- object, which can be used for matching using its \code{match()} and
- \code{search()} methods, described below. The optional argument
- \var{translate}, if present, must be a 256-character string
- indicating how characters (both of the pattern and of the strings to
- be matched) are translated before comparing them; the \var{i}-th
- element of the string gives the translation for the character with
- \ASCII{} code \var{i}. This can be used to implement
- case-insensitive matching; see the \code{casefold} data item below.
-
- The sequence
-
-\begin{verbatim}
-prog = regex.compile(pat)
-result = prog.match(str)
-\end{verbatim}
-%
-is equivalent to
-
-\begin{verbatim}
-result = regex.match(pat, str)
-\end{verbatim}
-
-but the version using \code{compile()} is more efficient when multiple
-regular expressions are used concurrently in a single program. (The
-compiled version of the last pattern passed to \code{regex.match()} or
-\code{regex.search()} is cached, so programs that use only a single
-regular expression at a time needn't worry about compiling regular
-expressions.)
-\end{funcdesc}
-
-\begin{funcdesc}{set_syntax}{flags}
- Set the syntax to be used by future calls to \code{compile()},
- \code{match()} and \code{search()}. (Already compiled expression
- objects are not affected.) The argument is an integer which is the
- OR of several flag bits. The return value is the previous value of
- the syntax flags. Names for the flags are defined in the standard
- module \code{regex_syntax}\refstmodindex{regex_syntax}; read the
- file \file{regex_syntax.py} for more information.
-\end{funcdesc}
-
-\begin{funcdesc}{get_syntax}{}
- Returns the current value of the syntax flags as an integer.
-\end{funcdesc}
-
-\begin{funcdesc}{symcomp}{pattern\optional{, translate}}
-This is like \code{compile()}, but supports symbolic group names: if a
-parenthesis-enclosed group begins with a group name in angular
-brackets, e.g. \code{'\e(<id>[a-z][a-z0-9]*\e)'}, the group can
-be referenced by its name in arguments to the \code{group()} method of
-the resulting compiled regular expression object, like this:
-\code{p.group('id')}. Group names may contain alphanumeric characters
-and \code{'_'} only.
-\end{funcdesc}
-
-\begin{excdesc}{error}
- Exception raised when a string passed to one of the functions here
- is not a valid regular expression (e.g., unmatched parentheses) or
- when some other error occurs during compilation or matching. (It is
- never an error if a string contains no match for a pattern.)
-\end{excdesc}
-
-\begin{datadesc}{casefold}
-A string suitable to pass as the \var{translate} argument to
-\code{compile()} to map all upper case characters to their lowercase
-equivalents.
-\end{datadesc}
-
-\noindent
-Compiled regular expression objects support these methods:
-
-\setindexsubitem{(regex method)}
-\begin{funcdesc}{match}{string\optional{, pos}}
- Return how many characters at the beginning of \var{string} match
- the compiled regular expression. Return \code{-1} if the string
- does not match the pattern (this is different from a zero-length
- match!).
-
- The optional second parameter, \var{pos}, gives an index in the string
- where the search is to start; it defaults to \code{0}. This is not
- completely equivalent to slicing the string; the \code{'\^'} pattern
- character matches at the real beginning of the string and at positions
- just after a newline, not necessarily at the index where the search
- is to start.
-\end{funcdesc}
-
-\begin{funcdesc}{search}{string\optional{, pos}}
- Return the first position in \var{string} that matches the regular
- expression \code{pattern}. Return \code{-1} if no position in the
- string matches the pattern (this is different from a zero-length
- match anywhere!).
-
- The optional second parameter has the same meaning as for the
- \code{match()} method.
-\end{funcdesc}
-
-\begin{funcdesc}{group}{index, index, ...}
-This method is only valid when the last call to the \code{match()}
-or \code{search()} method found a match. It returns one or more
-groups of the match. If there is a single \var{index} argument,
-the result is a single string; if there are multiple arguments, the
-result is a tuple with one item per argument. If the \var{index} is
-zero, the corresponding return value is the entire matching string; if
-it is in the inclusive range [1..99], it is the string matching the
-corresponding parenthesized group (using the default syntax,
-groups are parenthesized using \code{{\e}(} and \code{{\e})}). If no
-such group exists, the corresponding result is \code{None}.
-
-If the regular expression was compiled by \code{symcomp()} instead of
-\code{compile()}, the \var{index} arguments may also be strings
-identifying groups by their group name.
-\end{funcdesc}
-
-\noindent
-Compiled regular expressions support these data attributes:
-
-\setindexsubitem{(regex attribute)}
-
-\begin{datadesc}{regs}
-When the last call to the \code{match()} or \code{search()} method found a
-match, this is a tuple of pairs of indexes corresponding to the
-beginning and end of all parenthesized groups in the pattern. Indices
-are relative to the string argument passed to \code{match()} or
-\code{search()}. The 0-th tuple gives the beginning and end or the
-whole pattern. When the last match or search failed, this is
-\code{None}.
-\end{datadesc}
-
-\begin{datadesc}{last}
-When the last call to the \code{match()} or \code{search()} method found a
-match, this is the string argument passed to that method. When the
-last match or search failed, this is \code{None}.
-\end{datadesc}
-
-\begin{datadesc}{translate}
-This is the value of the \var{translate} argument to
-\code{regex.compile()} that created this regular expression object. If
-the \var{translate} argument was omitted in the \code{regex.compile()}
-call, this is \code{None}.
-\end{datadesc}
-
-\begin{datadesc}{givenpat}
-The regular expression pattern as passed to \code{compile()} or
-\code{symcomp()}.
-\end{datadesc}
-
-\begin{datadesc}{realpat}
-The regular expression after stripping the group names for regular
-expressions compiled with \code{symcomp()}. Same as \code{givenpat}
-otherwise.
-\end{datadesc}
-
-\begin{datadesc}{groupindex}
-A dictionary giving the mapping from symbolic group names to numerical
-group indexes for regular expressions compiled with \code{symcomp()}.
-\code{None} otherwise.
-\end{datadesc}