Commit the howto source to the main Python repository, with Fred's approval

author: Andrew M. Kuchling <amk@amk.ca> 2005-08-30 01:25:05 (GMT)
committer: Andrew M. Kuchling <amk@amk.ca> 2005-08-30 01:25:05 (GMT)
commit: e8f44d683e79c7a9659a4480736d55193da4a7b1 (patch)
tree: 37e8b05066aa1caf85f6b25d52f1576366e45e8e /Doc/howto/regex.tex
parent: f1b2ba6aa1751c5325e8fb87a28e54a857796bfa (diff)
1 files changed, 1466 insertions, 0 deletions
diff --git a/Doc/howto/regex.tex b/Doc/howto/regex.tex
new file mode 100644
index 0000000..5a65064
--- /dev/null
+++ b/Doc/howto/regex.tex
@@ -0,0 +1,1466 @@
+\documentclass{howto}
+
+% TODO:
+% Document lookbehind assertions
+% Better way of displaying a RE, a string, and what it matches
+% Mention optional argument to match.groups()
+% Unicode (at least a reference)
+
+\title{Regular Expression HOWTO}
+
+\release{0.05}
+
+\author{A.M. Kuchling}
+\authoraddress{\email{amk@amk.ca}}
+
+\begin{document}
+\maketitle
+
+\begin{abstract}
+\noindent
+This document is an introductory tutorial to using regular expressions
+in Python with the \module{re} module.  It provides a gentler
+introduction than the corresponding section in the Library Reference.
+
+This document is available from 
+\url{http://www.amk.ca/python/howto}.
+
+\end{abstract}
+
+\tableofcontents
+
+\section{Introduction}
+
+The \module{re} module was added in Python 1.5, and provides
+Perl-style regular expression patterns.  Earlier versions of Python
+came with the \module{regex} module, which provides Emacs-style
+patterns.  Emacs-style patterns are slightly less readable and
+don't provide as many features, so there's not much reason to use
+the \module{regex} module when writing new code, though you might
+encounter old code that uses it.
+
+Regular expressions (or REs) are essentially a tiny, highly
+specialized programming language embedded inside Python and made
+available through the \module{re} module.  Using this little language,
+you specify the rules for the set of possible strings that you want to
+match; this set might contain English sentences, or e-mail addresses,
+or TeX commands, or anything you like.  You can then ask questions
+such as ``Does this string match the pattern?'', or ``Is there a match
+for the pattern anywhere in this string?''.  You can also use REs to
+modify a string or to split it apart in various ways.
+
+Regular expression patterns are compiled into a series of bytecodes
+which are then executed by a matching engine written in C.  For
+advanced use, it may be necessary to pay careful attention to how the
+engine will execute a given RE, and write the RE in a certain way in
+order to produce bytecode that runs faster.  Optimization isn't
+covered in this document, because it requires that you have a good
+understanding of the matching engine's internals.
+
+The regular expression language is relatively small and restricted, so
+not all possible string processing tasks can be done using regular
+expressions.  There are also tasks that \emph{can} be done with
+regular expressions, but the expressions turn out to be very
+complicated.  In these cases, you may be better off writing Python
+code to do the processing; while Python code will be slower than an
+elaborate regular expression, it will also probably be more understandable.
+
+\section{Simple Patterns}
+
+We'll start by learning about the simplest possible regular
+expressions.  Since regular expressions are used to operate on
+strings, we'll begin with the most common task: matching characters.
+
+For a detailed explanation of the computer science underlying regular
+expressions (deterministic and non-deterministic finite automata), you
+can refer to almost any textbook on writing compilers.
+
+\subsection{Matching Characters}
+
+Most letters and characters will simply match themselves.  For
+example, the regular expression \regexp{test} will match the string
+\samp{test} exactly.  (You can enable a case-insensitive mode that
+would let this RE match \samp{Test} or \samp{TEST} as well; more
+about this later.)  
+
+There are exceptions to this rule; some characters are special
+\dfn{metacharacters}, and don't match themselves.  Instead, they signal
+that some out-of-the-ordinary thing should be matched, or they affect
+other portions of the RE by repeating them.  Much of this document is
+devoted to discussing various metacharacters and what they do.
+
+Here's a complete list of the metacharacters; their meanings will be
+discussed in the rest of this HOWTO.
+
+\begin{verbatim}
+. ^ $ * + ? { } [ ] \ | ( )
+\end{verbatim}
+% $
+
+The first metacharacters we'll look at are \samp{[} and \samp{]}.
+They're used for specifying a character class, which is a set of
+characters that you wish to match.  Characters can be listed
+individually, or a range of characters can be indicated by giving two
+characters and separating them by a \character{-}.  For example,
+\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
+\samp{c}; this is the same as
+\regexp{[a-c]}, which uses a range to express the same set of
+characters.  If you wanted to match only lowercase letters, your
+RE would be \regexp{[a-z]}.
+
+Metacharacters are not active inside classes.  For example,
+\regexp{[akm\$]} will match any of the characters \character{a},
+\character{k}, \character{m}, or \character{\$}; \character{\$} is
+usually a metacharacter, but inside a character class it's stripped of
+its special nature.
+
+You can match the characters not listed within the class by
+\dfn{complementing} the set.  This is indicated by including a
+\character{\^} as the first character of the class; a \character{\^}
+anywhere else in the class will simply match the \character{\^}
+character.  For example, \verb|[^5]| will match any character except
+\character{5}.
+
+Perhaps the most important metacharacter is the backslash, \samp{\e}.  
+As in Python string literals, the backslash can be followed by various
+characters to signal various special sequences.  It's also used to escape
+all the metacharacters so you can still match them in patterns; for
+example, if you need to match a \samp{[} or 
+\samp{\e}, you can precede them with a backslash to remove their
+special meaning: \regexp{\e[} or \regexp{\e\e}.
+
+Some of the special sequences beginning with \character{\e} represent
+predefined sets of characters that are often useful, such as the set
+of digits, the set of letters, or the set of anything that isn't
+whitespace.  The following predefined special sequences are available:
+
+\begin{itemize}
+\item[\code{\e d}]Matches any decimal digit; this is
+equivalent to the class \regexp{[0-9]}.
+
+\item[\code{\e D}]Matches any non-digit character; this is
+equivalent to the class \verb|[^0-9]|.
+
+\item[\code{\e s}]Matches any whitespace character; this is
+equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
+
+\item[\code{\e S}]Matches any non-whitespace character; this is
+equivalent to the class \verb|[^ \t\n\r\f\v]|.
+
+\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
+\regexp{[a-zA-Z0-9_]}.  
+
+\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
+\verb|[^a-zA-Z0-9_]|.   
+\end{itemize}
+
+These sequences can be included inside a character class.  For
+example, \regexp{[\e s,.]} is a character class that will match any
+whitespace character, or \character{,} or \character{.}.
+
+The final metacharacter in this section is \regexp{.}.  It matches
+anything except a newline character, and there's an alternate mode
+(\code{re.DOTALL}) where it will match even a newline.  \character{.}
+is often used where you want to match ``any character''.  
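+
+The character classes and special sequences described above can be
+tried out directly.  The following is only a minimal sketch: it is
+written for a current Python 3 interpreter (an assumption; the
+transcripts elsewhere in this HOWTO use the older \code{print}
+statement), and the sample strings and variable names are made up for
+illustration.

```python
import re

# Raw strings keep the backslashes intact for the re module.
letters = re.findall(r"[a-c]", "abcdef")        # ['a', 'b', 'c']
dollar = re.findall(r"[akm$]", "a$k m")         # '$' is literal inside a class
not_five = re.findall(r"[^5]", "1556")          # complemented class: ['1', '6']
digits = re.findall(r"\d", "a1b22")             # ['1', '2', '2']
separators = re.findall(r"[\s,.]", "a, b.c")    # whitespace, comma, or period
dot = re.findall(r".", "a\nb")                  # '.' skips the newline
dot_all = re.findall(r".", "a\nb", re.DOTALL)   # with DOTALL it does not

print(letters, dollar, not_five, digits, separators, dot, dot_all)
```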
+
+\subsection{Repeating Things}
+
+Being able to match varying sets of characters is the first thing
+regular expressions can do that isn't already possible with the
+methods available on strings.  However, if that were the only
+additional capability of regexes, they wouldn't be much of an advance.
+Another capability is that you can specify that portions of the RE
+must be repeated a certain number of times.
+
+The first metacharacter for repeating things that we'll look at is
+\regexp{*}.  \regexp{*} doesn't match the literal character \samp{*};
+instead, it specifies that the previous character can be matched zero
+or more times, instead of exactly once.
+
+For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
+characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
+characters), and so forth.  The RE engine has various internal
+limitations stemming from the size of C's \code{int} type that will
+prevent it from matching over 2 billion \samp{a} characters; you
+probably don't have enough memory to construct a string that large, so
+you shouldn't run into that limit.
+
+Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
+the matching engine will try to repeat it as many times as possible.
+If later portions of the pattern don't match, the matching engine will
+then back up and try again with fewer repetitions.
+
+A step-by-step example will make this more obvious.  Let's consider
+the expression \regexp{a[bcd]*b}.  This matches the letter
+\character{a}, zero or more letters from the class \code{[bcd]}, and
+finally ends with a \character{b}.  Now imagine matching this RE
+against the string \samp{abcbd}.  
+
+\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
+\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
+\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
+it can, which is to the end of the string.}
+\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
+current position is at the end of the string, so it fails.}
+\lineiii{4}{\code{abcb}}{Back up, so that  \regexp{[bcd]*} matches
+one less character.}
+\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
+current position is at the last character, which is a \character{d}.}
+\lineiii{6}{\code{abc}}{Back up again, so that  \regexp{[bcd]*} is
+only matching \samp{bc}.}
+\lineiii{7}{\code{abcb}}{Try \regexp{b} again.  This time 
+the character at the current position is \character{b}, so it succeeds.}
+\end{tableiii}
+
+The end of the RE has now been reached, and it has matched
+\samp{abcb}.  This demonstrates how the matching engine goes as far as
+it can at first, and if no match is found it will then progressively
+back up and retry the rest of the RE again and again.  It will back up
+until it has tried zero matches for \regexp{[bcd]*}, and if that
+subsequently fails, the engine will conclude that the string doesn't
+match the RE at all.
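+
+The outcome of the walk-through above can be checked from the
+interpreter.  This is a small sketch in Python 3 syntax (an
+assumption, as are the variable names and the second test string
+\samp{acd}); it shows only the final result of the backtracking, not
+the engine's intermediate steps.

```python
import re

# Greedy [bcd]* first grabs 'bcbd', then backs up until a final 'b' fits.
m = re.match(r"a[bcd]*b", "abcbd")
matched = m.group()   # 'abcb'
span = m.span()       # (0, 4)

# With no 'b' available, every amount of backing up fails, down to zero
# repetitions, so the match fails altogether.
no_match = re.match(r"a[bcd]*b", "acd")   # None

print(matched, span, no_match)
```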
+
+Another repeating metacharacter is \regexp{+}, which matches one or
+more times.  Pay careful attention to the difference between
+\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
+times, so whatever's being repeated may not be present at all, while
+\regexp{+} requires at least \emph{one} occurrence.  To use a similar
+example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
+\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.
+
+There are two more repeating qualifiers.  The question mark character,
+\regexp{?}, matches either once or zero times; you can think of it as
+marking something as being optional.  For example, \regexp{home-?brew}
+matches either \samp{homebrew} or \samp{home-brew}.  
+
+The most complicated repeated qualifier is
+\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
+integers.  This qualifier means there must be at least \var{m}
+repetitions, and at most \var{n}.  For example, \regexp{a/\{1,3\}b}
+will match \samp{a/b}, \samp{a//b}, and \samp{a///b}.  It won't match
+\samp{ab}, which has no slashes, or \samp{a////b}, which has four.
+
+You can omit either \var{m} or \var{n}; in that case, a reasonable
+value is assumed for the missing value.  Omitting \var{m} is
+interpreted as a lower limit of 0, while omitting \var{n} results in an
+upper bound of infinity --- actually, the 2 billion limit mentioned
+earlier, but that might as well be infinity.  
+
+Readers of a reductionist bent may notice that the three other qualifiers
+can all be expressed using this notation.  \regexp{\{0,\}} is the same
+as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
+\regexp{\{0,1\}} is the same as \regexp{?}.  It's better to use
+\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
+they're shorter and easier to read.
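+
+As a quick check of the qualifiers discussed in this section, the
+sketch below tests the \regexp{a/\{1,3\}b} example and compares each
+short qualifier with its brace form.  It assumes Python 3.4 or later,
+where \function{re.fullmatch()} is available; the test strings and
+variable names are illustrative only.

```python
import re

# {1,3}: between one and three slashes are accepted.
candidates = ["ab", "a/b", "a//b", "a///b", "a////b"]
slash_cases = {s: bool(re.fullmatch(r"a/{1,3}b", s)) for s in candidates}

# Each short qualifier paired with its {m,n} spelling.
pairs = [
    (r"ca*t", r"ca{0,}t"),
    (r"ca+t", r"ca{1,}t"),
    (r"home-?brew", r"home-{0,1}brew"),
]
strings = ["ct", "cat", "caaat", "homebrew", "home-brew", "home--brew"]

# True if both spellings accept and reject exactly the same test strings.
equivalent = all(
    bool(re.fullmatch(short, s)) == bool(re.fullmatch(braces, s))
    for short, braces in pairs
    for s in strings
)

print(slash_cases, equivalent)
```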
+
+\section{Using Regular Expressions}
+
+Now that we've looked at some simple regular expressions, how do we
+actually use them in Python?  The \module{re} module provides an
+interface to the regular expression engine, allowing you to compile
+REs into objects and then perform matches with them.
+
+\subsection{Compiling Regular Expressions}
+
+Regular expressions are compiled into \class{RegexObject} instances,
+which have methods for various operations such as searching for
+pattern matches or performing string substitutions.
+
+\begin{verbatim}
+>>> import re
+>>> p = re.compile('ab*')
+>>> print p
+<re.RegexObject instance at 80b4150>
+\end{verbatim}
+
+\function{re.compile()} also accepts an optional \var{flags}
+argument, used to enable various special features and syntax
+variations.  We'll go over the available settings later, but for now a
+single example will do:
+
+\begin{verbatim}
+>>> p = re.compile('ab*', re.IGNORECASE)
+\end{verbatim}
+
+The RE is passed to \function{re.compile()} as a string.  REs are
+handled as strings because regular expressions aren't part of the core
+Python language, and no special syntax was created for expressing
+them.  (There are applications that don't need REs at all, so there's
+no need to bloat the language specification by including them.)
+Instead, the \module{re} module is simply a C extension module
+included with Python, just like the \module{socket} or \module{zlib}
+module.
+
+Putting REs in strings keeps the Python language simpler, but has one
+disadvantage which is the topic of the next section.
+
+\subsection{The Backslash Plague}
+
+As stated earlier, regular expressions use the backslash
+character (\character{\e}) to indicate special forms or to allow
+special characters to be used without invoking their special meaning.
+This conflicts with Python's usage of the same character for the same
+purpose in string literals.
+
+Let's say you want to write a RE that matches the string
+\samp{{\e}section}, which might be found in a \LaTeX\ file.  To figure
+out what to write in the program code, start with the desired string
+to be matched.  Next, you must escape any backslashes and other
+metacharacters by preceding them with a backslash, resulting in the
+string \samp{\e\e section}.  The string passed to
+\function{re.compile()} must therefore be \verb|\\section|.  However, to
+express this as a Python string literal, both backslashes must be
+escaped \emph{again}.
+
+\begin{tableii}{c|l}{code}{Characters}{Stage}
+  \lineii{\e section}{Text string to be matched}
+  \lineii{\e\e section}{Escaped backslash for \function{re.compile}}
+  \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
+\end{tableii}
+
+In short, to match a literal backslash, one has to write
+\code{'\e\e\e\e'} as the RE string, because the regular expression
+must be \samp{\e\e}, and each backslash must be expressed as
+\samp{\e\e} inside a regular Python string literal.  In REs that
+feature backslashes repeatedly, this leads to lots of repeated
+backslashes and makes the resulting strings difficult to understand.
+
+The solution is to use Python's raw string notation for regular
+expressions; backslashes are not handled in any special way in
+a string literal prefixed with \character{r}, so \code{r"\e n"} is a
+two-character string containing \character{\e} and \character{n},
+while \code{"\e n"} is a one-character string containing a newline.
+Frequently regular expressions will be expressed in Python
+code using this raw string notation.  
+
+\begin{tableii}{c|c}{code}{Regular string}{Raw string}
+  \lineii{"ab*"}{\code{r"ab*"}}
+  \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
+  \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
+\end{tableii}
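+
+The equivalence between the two notations can be confirmed by
+comparing the strings themselves.  The following sketch uses Python 3
+syntax (an assumption), and the sample text being searched is invented
+for the example.

```python
import re

# Both literals denote the same 9-character pattern: \\section
regular = "\\\\section"
raw = r"\\section"
same = regular == raw            # True
pattern_length = len(raw)        # 9

# r"\n" is a backslash followed by 'n'; "\n" is a single newline.
raw_len, plain_len = len(r"\n"), len("\n")   # 2, 1

# The pattern matches one literal backslash followed by 'section'.
m = re.search(raw, r"see \section{Intro}")
found = m.group()                # a backslash followed by 'section'
where = m.span()                 # (4, 12)

print(same, pattern_length, raw_len, plain_len, found, where)
```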
+
+\subsection{Performing Matches}
+
+Once you have an object representing a compiled regular expression,
+what do you do with it?  \class{RegexObject} instances have several
+methods and attributes.  Only the most significant ones will be
+covered here; consult \ulink{the Library
+Reference}{http://www.python.org/doc/lib/module-re.html} for a
+complete listing.
+
+\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
+  \lineii{match()}{Determine if the RE matches at the beginning of
+  the string.}
+  \lineii{search()}{Scan through a string, looking for any location
+  where this RE matches.}
+  \lineii{findall()}{Find all substrings where the RE matches,
+and return them as a list.}
+  \lineii{finditer()}{Find all substrings where the RE matches,
+and return them as an iterator.}
+\end{tableii}
+
+\method{match()} and \method{search()} return \code{None} if no match
+can be found.  If they're successful, a \code{MatchObject} instance is
+returned, containing information about the match: where it starts and
+ends, the substring it matched, and more.
+
+You can learn about this by interactively experimenting with the
+\module{re} module.  If you have Tkinter available, you may also want
+to look at \file{Tools/scripts/redemo.py}, a demonstration program
+included with the Python distribution.  It allows you to enter REs and
+strings, and displays whether the RE matches or fails.
+\file{redemo.py} can be quite useful when trying to debug a
+complicated RE.  Phil Schwartz's
+\ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive
+tool for developing and testing RE patterns.  This HOWTO will use the
+standard Python interpreter for its examples.
+
+First, run the Python interpreter, import the \module{re} module, and
+compile a RE:
+
+\begin{verbatim}
+Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
+>>> import re
+>>> p = re.compile('[a-z]+')
+>>> p
+<_sre.SRE_Pattern object at 80c3c28>
+\end{verbatim}
+
+Now, you can try matching various strings against the RE
+\regexp{[a-z]+}.  An empty string shouldn't match at all, since
+\regexp{+} means ``one or more repetitions''.  \method{match()} should
+return \code{None} in this case, which will cause the interpreter to
+print no output.  You can explicitly print the result of
+\method{match()} to make this clear.
+
+\begin{verbatim}
+>>> p.match("")
+>>> print p.match("")
+None
+\end{verbatim}
+
+Now, let's try it on a string that it should match, such as
+\samp{tempo}.  In this case, \method{match()} will return a
+\class{MatchObject}, so you should store the result in a variable for
+later use.
+
+\begin{verbatim}
+>>> m = p.match('tempo')
+>>> print m
+<_sre.SRE_Match object at 80c4f68>
+\end{verbatim}
+
+Now you can query the \class{MatchObject} for information about the
+matching string.   \class{MatchObject} instances also have several
+methods and attributes; the most important ones are:
+
+\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
+  \lineii{group()}{Return the string matched by the RE}
+  \lineii{start()}{Return the starting position of the match}
+  \lineii{end()}{Return the ending position of the match}
+  \lineii{span()}{Return a tuple containing the (start, end) positions 
+                  of the match}
+\end{tableii}
+
+Trying these methods will soon clarify their meaning:
+
+\begin{verbatim}
+>>> m.group()
+'tempo'
+>>> m.start(), m.end()
+(0, 5)
+>>> m.span()
+(0, 5)
+\end{verbatim}
+
+\method{group()} returns the substring that was matched by the
+RE.  \method{start()} and \method{end()} return the starting and
+ending index of the match. \method{span()} returns both start and end
+indexes in a single tuple.  Since the \method{match} method only
+checks if the RE matches at the start of a string,
+\method{start()} will always be zero.  However, the \method{search}
+method of \class{RegexObject} instances scans through the string, so 
+the match may not start at zero in that case.
+
+\begin{verbatim}
+>>> print p.match('::: message')
+None
+>>> m = p.search('::: message') ; print m
+<re.MatchObject instance at 80c9650>
+>>> m.group()
+'message'
+>>> m.span()
+(4, 11)
+\end{verbatim}
+
+In actual programs, the most common style is to store the
+\class{MatchObject} in a variable, and then check if it was
+\code{None}.  This usually looks like:
+
+\begin{verbatim}
+p = re.compile( ... )
+m = p.match( 'string goes here' )
+if m:
+    print 'Match found: ', m.group()
+else:
+    print 'No match'
+\end{verbatim}
+
+Two \class{RegexObject} methods return all of the matches for a pattern.
+\method{findall()} returns a list of matching strings:
+
+\begin{verbatim}
+>>> p = re.compile('\d+')
+>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
+['12', '11', '10']
+\end{verbatim}
+
+\method{findall()} has to create the entire list before it can be
+returned as the result.  In Python 2.2 and later, the \method{finditer()}
+method is also available, returning a sequence of \class{MatchObject}
+instances as an iterator.
+
+\begin{verbatim}
+>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
+>>> iterator
+<callable-iterator object at 0x401833ac>
+>>> for match in iterator:
+...     print match.span()
+...
+(0, 2)
+(22, 24)
+(29, 31)
+\end{verbatim}
+
+
+\subsection{Module-Level Functions}
+
+You don't have to produce a \class{RegexObject} and call its methods;
+the \module{re} module also provides top-level functions called
+\function{match()}, \function{search()}, \function{sub()}, and so
+forth.  These functions take the same arguments as the corresponding
+\class{RegexObject} method, with the RE string added as the first
+argument, and still return either \code{None} or a \class{MatchObject}
+instance.
+
+\begin{verbatim}
+>>> print re.match(r'From\s+', 'Fromage amk')
+None
+>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
+<re.MatchObject instance at 80c5978>
+\end{verbatim}
+
+Under the hood, these functions simply produce a \class{RegexObject}
+for you and call the appropriate method on it.  They also store the
+compiled object in a cache, so future calls using the same
+RE are faster.  
+
+Should you use these module-level functions, or should you get the
+\class{RegexObject} and call its methods yourself?  That choice
+depends on how frequently the RE will be used, and on your personal
+coding style.  If a RE is being used at only one point in the code,
+then the module functions are probably more convenient.  If a program
+contains a lot of regular expressions, or re-uses the same ones in
+several locations, then it might be worthwhile to collect all the
+definitions in one place, in a section of code that compiles all the
+REs ahead of time.  To take an example from the standard library,
+here's an extract from \file{xmllib.py}:
+
+\begin{verbatim}
+ref = re.compile( ... )
+entityref = re.compile( ... )
+charref = re.compile( ... )
+starttagopen = re.compile( ... )
+\end{verbatim}
+
+I generally prefer to work with the compiled object, even for
+one-time uses, but few people will be as much of a purist about this
+as I am.
+
+\subsection{Compilation Flags}
+
+Compilation flags let you modify some aspects of how regular
+expressions work.  Flags are available in the \module{re} module under
+two names, a long name such as \constant{IGNORECASE}, and a short,
+one-letter form such as \constant{I}.  (If you're familiar with Perl's
+pattern modifiers, the one-letter forms use the same letters; the
+short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
+Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
+re.M} sets both the \constant{I} and \constant{M} flags, for example.
+
+Here's a table of the available flags, followed by
+a more detailed explanation of each one.
+
+\begin{tableii}{c|l}{}{Flag}{Meaning}
+  \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
+  character, including newlines}
+  \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
+  \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
+  \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
+  affecting \regexp{\^} and \regexp{\$}}
+  \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
+  which can be organized more cleanly and understandably.}
+\end{tableii}
+
+\begin{datadesc}{I}
+\dataline{IGNORECASE}
+Perform case-insensitive matching; character classes and literal strings
+will match letters by ignoring case.  For example, \regexp{[A-Z]} will match
+lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
+\samp{spam}, or \samp{spAM}.
+This lowercasing doesn't take the current locale into account; it will
+if you also set the \constant{LOCALE} flag.
+\end{datadesc}
+
+\begin{datadesc}{L}
+\dataline{LOCALE}
+Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
+and \regexp{\e B} dependent on the current locale.  
+
+Locales are a feature of the C library intended to help in writing
+programs that take account of language differences.  For example, if
+you're processing French text, you'd want to be able to write
+\regexp{\e w+} to match words, but \regexp{\e w} only matches the
+character class \regexp{[A-Za-z0-9_]}; it won't match \character{\'e} or
+\character{\c c}.  If your system is configured properly and a French
+locale is selected, certain C functions will tell the program that
+\character{\'e} should also be considered a letter.  Setting the
+\constant{LOCALE} flag when compiling a regular expression will cause the
+resulting compiled object to use these C functions for \regexp{\e w};
+this is slower, but also enables \regexp{\e w+} to match French words as
+you'd expect.
+\end{datadesc}
+
+\begin{datadesc}{M}
+\dataline{MULTILINE}
+(\regexp{\^} and \regexp{\$} haven't been explained yet; 
+they'll be introduced in section~\ref{more-metacharacters}.)
+
+Usually \regexp{\^} matches only at the beginning of the string, and
+\regexp{\$} matches only at the end of the string and immediately before the
+newline (if any) at the end of the string. When this flag is
+specified, \regexp{\^} matches at the beginning of the string and at
+the beginning of each line within the string, immediately following
+each newline.  Similarly, the \regexp{\$} metacharacter matches at
+the end of the string and at the end of each line (immediately
+preceding each newline).
+
+\end{datadesc}
+
+\begin{datadesc}{S}
+\dataline{DOTALL}
+Makes the \character{.} special character match any character at all,
+including a newline; without this flag, \character{.} will match
+anything \emph{except} a newline.
+\end{datadesc}
+
+\begin{datadesc}{X}
+\dataline{VERBOSE} This flag allows you to write regular expressions
+that are more readable by granting you more flexibility in how you can
+format them.  When this flag has been specified, whitespace within the
+RE string is ignored, except when the whitespace is in a character
+class or preceded by an unescaped backslash; this lets you organize
+and indent the RE more clearly.  It also enables you to put comments
+within a RE that will be ignored by the engine; comments are marked by
+a \character{\#} that's neither in a character class nor preceded by an
+unescaped backslash.
+
+For example, here's a RE that uses \constant{re.VERBOSE}; see how
+much easier it is to read?
+
+\begin{verbatim}
+charref = re.compile(r"""
+ &[#]                # Start of a numeric entity reference
+ (
+   [0-9]+[^0-9]      # Decimal form
+   | 0[0-7]+[^0-7]   # Octal form
+   | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form
+ )
+""", re.VERBOSE)
+\end{verbatim}
+
+Without the verbose setting, the RE would look like this:
+\begin{verbatim}
+charref = re.compile("&#([0-9]+[^0-9]"
+                     "|0[0-7]+[^0-7]"
+                     "|x[0-9a-fA-F]+[^0-9a-fA-F])")
+\end{verbatim}
+
+In the above example, Python's automatic concatenation of string
+literals has been used to break up the RE into smaller pieces, but
+it's still more difficult to understand than the version using
+\constant{re.VERBOSE}.
+
+\end{datadesc}
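+
+To see several of these flags in action, here is a short sketch in
+Python 3 syntax (an assumption, as are the two-line sample text and
+the variable names).  It compares \regexp{\^} and \regexp{\$} with and
+without \constant{MULTILINE}, combines two flags with bitwise OR, and
+checks the effect of \constant{DOTALL}.

```python
import re

text = "first line\nSecond LINE\n"

# Without MULTILINE, ^ matches only at the very start of the string.
plain = re.findall(r"^\w+", text)                   # ['first']
multi = re.findall(r"^\w+", text, re.MULTILINE)     # ['first', 'Second']

# Flags are combined with bitwise OR.
both = re.findall(r"^second", text, re.I | re.M)    # ['Second']

# In MULTILINE mode, $ also matches just before each newline.
ends = re.findall(r"\w+$", text, re.M)              # ['line', 'LINE']

# '.' crosses the newline only when DOTALL is set.
without_dotall = re.search(r"line.Second", text)          # None
with_dotall = re.search(r"line.Second", text, re.DOTALL)  # a match object

print(plain, multi, both, ends, without_dotall, with_dotall is not None)
```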
+
+\section{More Pattern Power}
+
+So far we've only covered a part of the features of regular
+expressions.  In this section, we'll cover some new metacharacters,
+and how to use groups to retrieve portions of the text that was matched.
+
+\subsection{More Metacharacters\label{more-metacharacters}}
+
+There are some metacharacters that we haven't covered yet.  Most of
+them will be covered in this section.
+
+Some of the remaining metacharacters to be discussed are
+\dfn{zero-width assertions}.  They don't cause the engine to advance
+through the string; instead, they consume no characters at all,
+and simply succeed or fail.  For example, \regexp{\e b} is an
+assertion that the current position is located at a word boundary; the
+position isn't changed by the \regexp{\e b} at all.  This means that
+zero-width assertions should never be repeated, because if they match
+once at a given location, they can obviously be matched an infinite
+number of times.
+
+\begin{list}{}{}
+
+\item[\regexp{|}] 
+Alternation, or the ``or'' operator.  
+If A and B are regular expressions, 
+\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
+\regexp{|} has very low precedence in order to make it work reasonably when
+you're alternating multi-character strings.
+\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
+\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.
+
+To match a literal \character{|},
+use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
+
+\item[\regexp{\^}] Matches at the beginning of lines.  Unless the
+\constant{MULTILINE} flag has been set, this will only match at the
+beginning of the string.  In \constant{MULTILINE} mode, this also
+matches immediately after each newline within the string.  
+
+For example, if you wish to match the word \samp{From} only at the
+beginning of a line, the RE to use is \verb|^From|.
+
+\begin{verbatim}
+>>> print re.search('^From', 'From Here to Eternity')
+<re.MatchObject instance at 80c1520>
+>>> print re.search('^From', 'Reciting From Memory')
+None
+\end{verbatim}
+
+%To match a literal \character{\^}, use \regexp{\e\^} or enclose it
+%inside a character class, as in \regexp{[{\e}\^]}.
+
+\item[\regexp{\$}] Matches at the end of a line, which is defined as
+either the end of the string, or any location followed by a newline
+character.    
+
+\begin{verbatim}
+>>> print re.search('}$', '{block}')
+<re.MatchObject instance at 80adfa8>
+>>> print re.search('}$', '{block} ')
+None
+>>> print re.search('}$', '{block}\n')
+<re.MatchObject instance at 80adfa8>
+\end{verbatim}
+% $
+
+To match a literal \character{\$}, use \regexp{\e\$} or enclose it
+inside a character class, as in  \regexp{[\$]}.
+
+\item[\regexp{\e A}] Matches only at the start of the string.  When
+not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
+effectively the same.  In \constant{MULTILINE} mode, however, they're
+different; \regexp{\e A} still matches only at the beginning of the
+string, but \regexp{\^} may match at any location inside the string
+that follows a newline character.
+
+\item[\regexp{\e Z}]Matches only at the end of the string.  
+
+\item[\regexp{\e b}] Word boundary.  
+This is a zero-width assertion that matches only at the
+beginning or end of a word.  A word is defined as a sequence of
+alphanumeric characters, so the end of a word is indicated by
+whitespace or a non-alphanumeric character.  
+
+The following example matches \samp{class} only when it's a complete
+word; it won't match when it's contained inside another word.
+
+\begin{verbatim}
+>>> p = re.compile(r'\bclass\b')
+>>> print p.search('no class at all')
+<re.MatchObject instance at 80c8f28>
+>>> print p.search('the declassified algorithm')
+None
+>>> print p.search('one subclass is')
+None
+\end{verbatim}
+
+There are two subtleties you should remember when using this special
+sequence.  First, this is the worst collision between Python's string
+literals and regular expression sequences.  In Python's string
+literals, \samp{\e b} is the backspace character, ASCII value 8.  If
+you're not using raw strings, then Python will convert the \samp{\e b} to
+a backspace, and your RE won't match as you expect it to.  The
+following example looks the same as our previous RE, but omits
+the \character{r} in front of the RE string.
+
+\begin{verbatim}
+>>> p = re.compile('\bclass\b')
+>>> print p.search('no class at all')
+None
+>>> print p.search('\b' + 'class' + '\b')  
+<re.MatchObject instance at 80c3ee0>
+\end{verbatim}
+
+Second, inside a character class, where there's no use for this
+assertion, \regexp{\e b} represents the backspace character, for
+compatibility with Python's string literals.
+
+\item[\regexp{\e B}] Another zero-width assertion, this is the
+opposite of \regexp{\e b}, only matching when the current
+position is not at a word boundary.
+
+\end{list}
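+
+As a rough sketch of how these anchors differ once
+\constant{MULTILINE} is involved, the following compares \regexp{\^},
+\regexp{\e A}, \regexp{\$} and \regexp{\e Z} on a three-line string.
+The sample text and the variable names are made up for illustration.
+

```python
import re

text = "From here\nFrom there\nto eternity"

# Without MULTILINE, ^ matches only at the very start of the string.
plain = re.findall(r'^From', text)                    # ['From']

# With MULTILINE, ^ also matches immediately after each newline.
multi = re.findall(r'^From', text, re.MULTILINE)      # ['From', 'From']

# \A ignores MULTILINE and still matches only at the start of the string.
anchored = re.findall(r'\AFrom', text, re.MULTILINE)  # ['From']

# In MULTILINE mode $ matches before each newline and at the end of the
# string, while \Z matches only at the very end.
line_ends = re.findall(r'\w+$', text, re.MULTILINE)   # ['here', 'there', 'eternity']
string_end = re.findall(r'\w+\Z', text, re.MULTILINE) # ['eternity']
```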
+
+\subsection{Grouping}
+
+Frequently you need to obtain more information than just whether the
+RE matched or not.  Regular expressions are often used to dissect
+strings by writing a RE divided into several subgroups which
+match different components of interest.  For example, an RFC-822
+header line is divided into a header name and a value, separated by a
+\character{:}.  This can be handled by writing a regular expression
+which matches an entire header line, and has one group which matches the
+header name, and another group which matches the header's value.
+
+Groups are marked by the \character{(}, \character{)} metacharacters.
+\character{(} and \character{)} have much the same meaning as they do
+in mathematical expressions; they group together the expressions
+contained inside them.  You can also repeat the contents of a
+group with a repeating qualifier, such as \regexp{*}, \regexp{+},
+\regexp{?}, or \regexp{\{\var{m},\var{n}\}}.  For example,
+\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
+
+\begin{verbatim}
+>>> p = re.compile('(ab)*')
+>>> print p.match('ababababab').span()
+(0, 10)
+\end{verbatim}
+
+Groups indicated with \character{(}, \character{)} also capture the
+starting and ending index of the text that they match; this can be
+retrieved by passing an argument to \method{group()},
+\method{start()}, \method{end()}, and \method{span()}.  Groups are
+numbered starting with 0.  Group 0 is always present; it's the whole
+RE, so \class{MatchObject} methods all have group 0 as their default
+argument.  Later we'll see how to express groups that don't capture
+the span of text that they match.
+
+\begin{verbatim}
+>>> p = re.compile('(a)b')
+>>> m = p.match('ab')
+>>> m.group()
+'ab'
+>>> m.group(0)
+'ab'
+\end{verbatim}
+
+Subgroups are numbered from left to right, from 1 upward.  Groups can
+be nested; to determine the number, just count the opening parenthesis
+characters, going from left to right.
+
+\begin{verbatim}
+>>> p = re.compile('(a(b)c)d')
+>>> m = p.match('abcd')
+>>> m.group(0)
+'abcd'
+>>> m.group(1)
+'abc'
+>>> m.group(2)
+'b'
+\end{verbatim}
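+
+The same group numbers can be passed to \method{start()},
+\method{end()}, and \method{span()}.  A minimal sketch, reusing the
+nested pattern above (the variable names are only illustrative):
+

```python
import re

m = re.match('(a(b)c)d', 'abcd')

whole = m.span()         # group 0 by default: (0, 4)
outer = m.span(1)        # 'abc' occupies (0, 3)
inner = m.span(2)        # 'b' occupies (1, 2)
inner_start = m.start(2) # 1
inner_end = m.end(2)     # 2
```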
+
+\method{group()} can be passed multiple group numbers at a time, in
+which case it will return a tuple containing the corresponding values
+for those groups.
+
+\begin{verbatim}  
+>>> m.group(2,1,2)
+('b', 'abc', 'b')
+\end{verbatim}  
+
+The \method{groups()} method returns a tuple containing the strings
+for all the subgroups, from 1 up to however many there are.
+
+\begin{verbatim}  
+>>> m.groups()
+('abc', 'b')
+\end{verbatim}  
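+
+Returning to the RFC-822 header line mentioned at the start of this
+section, one possible sketch uses one group for the header name and
+another for the value.  The pattern below is a simplified assumption
+for illustration, not a complete RFC-822 parser, and the sample header
+is invented.
+

```python
import re

# Group 1: everything up to the first colon (the header name).
# Group 2: the rest of the line after the colon and any whitespace.
header_re = re.compile(r'([^:]+):\s*(.*)')

m = header_re.match('Subject: Regular Expression HOWTO')
name, value = m.groups()   # ('Subject', 'Regular Expression HOWTO')
```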
+
+Backreferences in a pattern allow you to specify that the contents of
+an earlier capturing group must also be found at the current location
+in the string.  For example, \regexp{\e 1} will succeed if the exact
+contents of group 1 can be found at the current position, and fails
+otherwise.  Remember that Python's string literals also use a
+backslash followed by numbers to allow including arbitrary characters
+in a string, so be sure to use a raw string when incorporating
+backreferences in a RE.
+
+For example, the following RE detects doubled words in a string.
+
+\begin{verbatim}
+>>> p = re.compile(r'(\b\w+)\s+\1')
+>>> p.search('Paris in the the spring').group()
+'the the'
+\end{verbatim}
+
+Backreferences like this aren't often useful for just searching
+through a string --- there are few text formats which repeat data in
+this way --- but you'll soon find out that they're \emph{very} useful
+when performing string substitutions.
+
+\subsection{Non-capturing and Named Groups}
+
+Elaborate REs may use many groups, both to capture substrings of
+interest, and to group and structure the RE itself.  In complex REs,
+it becomes difficult to keep track of the group numbers.  There are
+two features which help with this problem.  Both of them use a common
+syntax for regular expression extensions, so we'll look at that first.
+
+Perl 5 added several additional features to standard regular
+expressions, and the Python \module{re} module supports most of them.
+It would have been difficult to choose new single-keystroke
+metacharacters or new special sequences beginning with \samp{\e} to
+represent the new features without making Perl's regular expressions
+confusingly different from standard REs.  If you chose \samp{\&} as a
+new metacharacter, for example, old expressions would be assuming that
+\samp{\&} was a regular character and wouldn't have escaped it by
+writing \regexp{\e \&} or \regexp{[\&]}.  
+
+The solution chosen by the Perl developers was to use \regexp{(?...)}
+as the extension syntax.  \samp{?} immediately after a parenthesis was
+a syntax error because the \samp{?} would have nothing to repeat, so
+this didn't introduce any compatibility problems.  The characters
+immediately after the \samp{?}  indicate what extension is being used,
+so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
+\regexp{(?:foo)} is something else (a non-capturing group containing
+the subexpression \regexp{foo}).
+
+Python adds an extension syntax to Perl's extension syntax.  If the
+first character after the question mark is a \samp{P}, you know that
+it's an extension that's specific to Python.  Currently there are two
+such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
+and \regexp{(?P=\var{name})} is a backreference to a named group.  If
+future versions of Perl 5 add similar features using a different
+syntax, the \module{re} module will be changed to support the new
+syntax, while preserving the Python-specific syntax for
+compatibility's sake.
+
+Now that we've looked at the general extension syntax, we can return
+to the features that simplify working with groups in complex REs.
+Since groups are numbered from left to right and a complex expression
+may use many groups, it can become difficult to keep track of the
+correct numbering, and modifying such a complex RE is annoying.
+Insert a new group near the beginning, and you change the numbers of
+everything that follows it.
+
+First, sometimes you'll want to use a group to collect a part of a
+regular expression, but aren't interested in retrieving the group's
+contents.  You can make this fact explicit by using a non-capturing
+group: \regexp{(?:...)}, where you can put any other regular
+expression inside the parentheses.  
+
+\begin{verbatim}
+>>> m = re.match("([abc])+", "abc")
+>>> m.groups()
+('c',)
+>>> m = re.match("(?:[abc])+", "abc")
+>>> m.groups()
+()
+\end{verbatim}
+
+Except for the fact that you can't retrieve the contents of what the
+group matched, a non-capturing group behaves exactly the same as a
+capturing group; you can put anything inside it, repeat it with a
+repetition metacharacter such as \samp{*}, and nest it within other
+groups (capturing or non-capturing).  \regexp{(?:...)} is particularly
+useful when modifying an existing pattern, since you can add new groups
+without changing how all the other groups are numbered.  It should be
+mentioned that there's no performance difference in searching between
+capturing and non-capturing groups; neither form is any faster than
+the other.
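+
+To see the numbering issue in practice, here is a small sketch.  The
+page-range patterns and the sample text are hypothetical; the point is
+only that adding a capturing group at the front shifts the existing
+group numbers, while a non-capturing group leaves them alone.
+

```python
import re

text = 'pages 12-15'

# Original pattern: group 1 is the first page, group 2 the last page.
original = re.compile(r'(\d+)-(\d+)')

# Adding a capturing group at the front renumbers everything after it.
capturing = re.compile(r'(pages?|pp\.)\s*(\d+)-(\d+)')

# A non-capturing group adds structure without changing the numbering.
non_capturing = re.compile(r'(?:pages?|pp\.)\s*(\d+)-(\d+)')

first_original = original.search(text).group(1)            # '12'
first_capturing = capturing.search(text).group(1)          # 'pages'
first_non_capturing = non_capturing.search(text).group(1)  # '12'
```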
+
+The second, and more significant, feature is named groups; instead of
+referring to them by numbers, groups can be referenced by a name.
+
+The syntax for a named group is one of the Python-specific extensions:
+\regexp{(?P<\var{name}>...)}.  \var{name} is, obviously, the name of
+the group.  Except for associating a name with a group, named groups
+also behave identically to capturing groups.  The \class{MatchObject}
+methods that deal with capturing groups all accept either integers, to
+refer to groups by number, or a string containing the group name.
+Named groups are still given numbers, so you can retrieve information
+about a group in two ways:
+
+\begin{verbatim}
+>>> p = re.compile(r'(?P<word>\b\w+\b)')
+>>> m = p.search( '(((( Lots of punctuation )))' )
+>>> m.group('word')
+'Lots'
+>>> m.group(1)
+'Lots'
+\end{verbatim}
+
+Named groups are handy because they let you use easily-remembered
+names, instead of having to remember numbers.  Here's an example RE
+from the \module{imaplib} module:
+
+\begin{verbatim}
+InternalDate = re.compile(r'INTERNALDATE "'
+        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
+        r'(?P<year>[0-9][0-9][0-9][0-9])'
+        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
+        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
+        r'"')
+\end{verbatim}
+
+It's obviously much easier to retrieve \code{m.group('zonem')},
+instead of having to remember to retrieve group 9.
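+
+As a sketch of how such a pattern might be used, the following applies
+the same RE to a sample string.  The input line is an invented example
+and is not taken from \module{imaplib} itself.
+

```python
import re

InternalDate = re.compile(r'INTERNALDATE "'
        r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
        r'(?P<year>[0-9][0-9][0-9][0-9])'
        r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
        r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
        r'"')

# Hypothetical input in the format the pattern expects.
m = InternalDate.match('INTERNALDATE "17-Jul-1996 02:44:25 -0700"')

by_name = m.group('zonem')      # '00'
by_number = m.group(9)          # the same group, found by counting
date_parts = m.group('day', 'mon', 'year')   # ('17', 'Jul', '1996')
```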
+
+Since the syntax for backreferences, in an expression like
+\regexp{(...)\e 1}, refers to the number of the group, there's
+naturally a variant that uses the group name instead of the number.
+This is also a Python extension: \regexp{(?P=\var{name})} indicates
+that the contents of the group called \var{name} should again be found
+at the current point.  The regular expression for finding doubled
+words, \regexp{(\e b\e w+)\e s+\e 1}, can also be written as
+\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
+
+\begin{verbatim}
+>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
+>>> p.search('Paris in the the spring').group()
+'the the'
+\end{verbatim}
+
+\subsection{Lookahead Assertions}
+
+Another zero-width assertion is the lookahead assertion.  Lookahead
+assertions are available in both positive and negative form, and 
+look like this:
+
+\begin{itemize}
+\item[\regexp{(?=...)}] Positive lookahead assertion.  This succeeds
+if the contained regular expression, represented here by \code{...},
+successfully matches at the current location, and fails otherwise.
+But, once the contained expression has been tried, the matching engine
+doesn't advance at all; the rest of the pattern is tried right where
+the assertion started.
+
+\item[\regexp{(?!...)}] Negative lookahead assertion.  This is the
+opposite of the positive assertion; it succeeds if the contained expression
+\emph{doesn't} match at the current position in the string.
+\end{itemize}
+
+An example will help make this concrete by demonstrating a case
+where a lookahead is useful.  Consider a simple pattern to match a
+filename and split it apart into a base name and an extension,
+separated by a \samp{.}.  For example, in \samp{news.rc}, \samp{news}
+is the base name, and \samp{rc} is the filename's extension.  
+
+The pattern to match this is quite simple: 
+
+\regexp{.*[.].*\$}
+
+Notice that the \samp{.} needs to be treated specially because it's a
+metacharacter; I've put it inside a character class.  Also notice the
+trailing \regexp{\$}; this is added to ensure that all the rest of the
+string must be included in the extension.  This regular expression
+matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
+\samp{printers.conf}.
+
+Now, consider complicating the problem a bit; what if you want to
+match filenames where the extension is not \samp{bat}?
+Some incorrect attempts:
+
+\verb|.*[.][^b].*$|
+% $
+
+The first attempt above tries to exclude \samp{bat} by requiring that
+the first character of the extension is not a \samp{b}.  This is
+wrong, because the pattern also doesn't match \samp{foo.bar}.
+
+% Messes up the HTML without the curly braces around \^
+\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}
+
+The expression gets messier when you try to patch up the first
+solution by requiring one of the following cases to match: the first
+character of the extension isn't \samp{b}; the second character isn't
+\samp{a}; or the third character isn't \samp{t}.  This accepts
+\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
+three-letter extension and won't accept a filename with a two-letter
+extension such as \samp{sendmail.cf}.  We'll complicate the pattern
+again in an effort to fix it.
+
+\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}
+
+In the third attempt, the second and third letters are both made
+optional in order to allow matching extensions shorter than three
+characters, such as \samp{sendmail.cf}.
+
+The pattern's getting really complicated now, which makes it hard to
+read and understand.  Worse, if the problem changes and you want to
+exclude both \samp{bat} and \samp{exe} as extensions, the pattern
+would get even more complicated and confusing.
+
+A negative lookahead cuts through all this:
+
+\regexp{.*[.](?!bat\$).*\$}
+% $
+
+The lookahead means: if the expression \regexp{bat\$} doesn't match at
+this point, try the rest of the pattern; if \regexp{bat\$} does match,
+the whole pattern will fail.  The trailing \regexp{\$} is required to
+ensure that something like \samp{sample.batch}, where the extension
+only starts with \samp{bat}, will be allowed.
+
+Excluding another filename extension is now easy; simply add it as an
+alternative inside the assertion.  The following pattern excludes
+filenames that end in either \samp{bat} or \samp{exe}:
+
+\regexp{.*[.](?!bat\$|exe\$).*\$}
+% $
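+
+A quick way to check the lookahead version is to run it over a few
+filenames.  This is only a sketch; the list of names is made up, and
+each name contains a single \samp{.} to keep the example simple.
+

```python
import re

pattern = re.compile(r'.*[.](?!bat$|exe$).*$')

names = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
         'sample.batch', 'setup.exe']

# Keep only the names whose extension is neither 'bat' nor 'exe'.
accepted = [name for name in names if pattern.match(name)]
# accepted is ['foo.bar', 'sendmail.cf', 'sample.batch']
```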
+
+
+\section{Modifying Strings}
+
+Up to this point, we've simply performed searches against a static
+string.  Regular expressions are also commonly used to modify a string
+in various ways, using the following \class{RegexObject} methods:
+
+\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
+  \lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
+  \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
+  \lineii{subn()}{Does the same thing as \method{sub()}, 
+   but returns the new string and the number of replacements}
+\end{tableii}
+
+
+\subsection{Splitting Strings}
+
+The \method{split()} method of a \class{RegexObject} splits a string
+apart wherever the RE matches, returning a list of the pieces.
+It's similar to the \method{split()} method of strings but
+provides much more
+generality in the delimiters that you can split by;
+the string method only supports splitting by whitespace or by
+a fixed string.  As you'd expect, there's a module-level
+\function{re.split()} function, too.
+
+\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
+  Split \var{string} by the matches of the regular expression.  If
+  capturing parentheses are used in the RE, then their contents will
+  also be returned as part of the resulting list.  If \var{maxsplit}
+  is nonzero, at most \var{maxsplit} splits are performed.
+\end{methoddesc}
+
+You can limit the number of splits made, by passing a value for
+\var{maxsplit}.  When \var{maxsplit} is nonzero, at most
+\var{maxsplit} splits will be made, and the remainder of the string is
+returned as the final element of the list.  In the following example,
+the delimiter is any sequence of non-alphanumeric characters.
+
+\begin{verbatim}
+>>> p = re.compile(r'\W+')
+>>> p.split('This is a test, short and sweet, of split().')
+['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
+>>> p.split('This is a test, short and sweet, of split().', 3)
+['This', 'is', 'a', 'test, short and sweet, of split().']
+\end{verbatim}
+
+Sometimes you're not only interested in what the text between
+delimiters is, but also need to know what the delimiter was.  If
+capturing parentheses are used in the RE, then their values are also
+returned as part of the list.  Compare the following calls:
+
+\begin{verbatim}
+>>> p = re.compile(r'\W+')
+>>> p2 = re.compile(r'(\W+)')
+>>> p.split('This... is a test.')
+['This', 'is', 'a', 'test', '']
+>>> p2.split('This... is a test.')
+['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
+\end{verbatim}
+
+The module-level function \function{re.split()} adds the RE to be
+used as the first argument, but is otherwise the same.  
+
+\begin{verbatim}
+>>> re.split(r'[\W]+', 'Words, words, words.')
+['Words', 'words', 'words', '']
+>>> re.split(r'([\W]+)', 'Words, words, words.')
+['Words', ', ', 'words', ', ', 'words', '.', '']
+>>> re.split(r'[\W]+', 'Words, words, words.', 1)
+['Words', 'words, words.']
+\end{verbatim}
+
+\subsection{Search and Replace}
+
+Another common task is to find all the matches for a pattern, and
+replace them with a different string.  The \method{sub()} method takes
+a replacement value, which can be either a string or a function, and
+the string to be processed.
+
+\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
+Returns the string obtained by replacing the leftmost non-overlapping
+occurrences of the RE in \var{string} by the replacement
+\var{replacement}.  If the pattern isn't found, \var{string} is returned
+unchanged.  
+
+The optional argument \var{count} is the maximum number of pattern
+occurrences to be replaced; \var{count} must be a non-negative
+integer.  The default value of 0 means to replace all occurrences.
+\end{methoddesc}
+
+Here's a simple example of using the \method{sub()} method.  It
+replaces colour names with the word \samp{colour}:
+
+\begin{verbatim}
+>>> p = re.compile( '(blue|white|red)')
+>>> p.sub( 'colour', 'blue socks and red shoes')
+'colour socks and colour shoes'
+>>> p.sub( 'colour', 'blue socks and red shoes', count=1)
+'colour socks and red shoes'
+\end{verbatim}
+
+The \method{subn()} method does the same work, but returns a 2-tuple
+containing the new string value and the number of replacements 
+that were performed:
+
+\begin{verbatim}
+>>> p = re.compile( '(blue|white|red)')
+>>> p.subn( 'colour', 'blue socks and red shoes')
+('colour socks and colour shoes', 2)
+>>> p.subn( 'colour', 'no colours at all')
+('no colours at all', 0)
+\end{verbatim}
+
+Empty matches are replaced only when they're not
+adjacent to a previous match.  
+
+\begin{verbatim}
+>>> p = re.compile('x*')
+>>> p.sub('-', 'abxd')
+'-a-b-d-'
+\end{verbatim}
+
+If \var{replacement} is a string, any backslash escapes in it are
+processed.  That is, \samp{\e n} is converted to a single newline
+character, \samp{\e r} is converted to a carriage return, and so forth.
+Unknown escapes such as \samp{\e j} are left alone.  Backreferences,
+such as \samp{\e 6}, are replaced with the substring matched by the
+corresponding group in the RE.  This lets you incorporate
+portions of the original text in the resulting
+replacement string.
+
+This example matches the word \samp{section} followed by a string
+enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
+\samp{subsection}:
+
+\begin{verbatim}
+>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
+>>> p.sub(r'subsection{\1}','section{First} section{second}')
+'subsection{First} subsection{second}'
+\end{verbatim}
+
+There's also a syntax for referring to named groups as defined by the
+\regexp{(?P<name>...)} syntax.  \samp{\e g<name>} will use the
+substring matched by the group named \samp{name}, and 
+\samp{\e g<\var{number}>} 
+uses the corresponding group number.  
+\samp{\e g<2>} is therefore equivalent to \samp{\e 2}, 
+but isn't ambiguous in a
+replacement string such as \samp{\e g<2>0}.  (\samp{\e 20} would be
+interpreted as a reference to group 20, not a reference to group 2
+followed by the literal character \character{0}.)  The following
+substitutions are all equivalent, but use all three variations of the
+replacement string.
+
+\begin{verbatim}
+>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
+>>> p.sub(r'subsection{\1}','section{First}')
+'subsection{First}'
+>>> p.sub(r'subsection{\g<1>}','section{First}')
+'subsection{First}'
+>>> p.sub(r'subsection{\g<name>}','section{First}')
+'subsection{First}'
+\end{verbatim}
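+
+The ambiguity that \samp{\e g<2>0} avoids can be demonstrated with a
+two-group pattern.  This is a sketch under the assumption of a current
+\module{re} module, where a reference to a nonexistent group in the
+replacement string raises \code{re.error}; the pattern is invented for
+illustration.
+

```python
import re

p = re.compile(r'(a)(b)')

# \g<2>0 means group 2 followed by a literal '0'.
unambiguous = p.sub(r'\g<2>0', 'ab')   # 'b0'

# \20 is read as a reference to group 20, which this pattern lacks,
# so the replacement string is rejected.
try:
    p.sub(r'\20', 'ab')
    rejected = False
except re.error:
    rejected = True
```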
+
+\var{replacement} can also be a function, which gives you even more
+control.  If \var{replacement} is a function, the function is
+called for every non-overlapping occurrence of \var{pattern}.  On each
+call, the function is 
+passed a \class{MatchObject} argument for the match
+and can use this information to compute the desired replacement string and return it.
+
+In the following example, the replacement function translates 
+decimals into hexadecimal:
+
+\begin{verbatim}
+>>> def hexrepl( match ):
+...     "Return the hex string for a decimal number"
+...     value = int( match.group() )
+...     return hex(value)
+...
+>>> p = re.compile(r'\d+')
+>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
+'Call 0xffd2 for printing, 0xc000 for user code.'
+\end{verbatim}
+
+When using the module-level \function{re.sub()} function, the pattern
+is passed as the first argument.  The pattern may be a string or a
+\class{RegexObject}; if you need to specify regular expression flags,
+you must either use a \class{RegexObject} as the first parameter, or use
+embedded modifiers in the pattern, e.g.  \code{sub("(?i)b+", "x", "bbbb
+BBBB")} returns \code{'x x'}.
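+
+Both approaches described above can be sketched side by side.  The
+variable names are illustrative only.
+

```python
import re

# Embedded modifier: (?i) at the start of the pattern turns on
# case-insensitive matching.
embedded = re.sub(r'(?i)b+', 'x', 'bbbb BBBB')        # 'x x'

# Compiled object: the flag is given once, at compile time.
b_run = re.compile(r'b+', re.IGNORECASE)
via_method = b_run.sub('x', 'bbbb BBBB')              # 'x x'

# The compiled object can also be passed as the first argument
# of the module-level function.
via_function = re.sub(b_run, 'x', 'bbbb BBBB')        # 'x x'
```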
+
+\section{Common Problems}
+
+Regular expressions are a powerful tool for some applications, but in
+some ways their behaviour isn't intuitive and at times they don't
+behave the way you may expect them to.  This section will point out
+some of the most common pitfalls.
+
+\subsection{Use String Methods}
+
+Sometimes using the \module{re} module is a mistake.  If you're
+matching a fixed string, or a single character class, and you're not
+using any \module{re} features such as the \constant{IGNORECASE} flag,
+then the full power of regular expressions may not be required.
+Strings have several methods for performing operations with fixed
+strings and they're usually much faster, because the implementation is
+a single small C loop that's been optimized for the purpose, instead
+of the large, more generalized regular expression engine.
+
+One example might be replacing a single fixed string with another
+one; for example, you might replace \samp{word}
+with \samp{deed}.  \code{re.sub()} seems like the function to use for
+this, but consider the \method{replace()} method.  Note that 
+\method{replace()} will also replace \samp{word} inside
+words, turning \samp{swordfish} into \samp{sdeedfish}, but the 
+na{\"\i}ve RE \regexp{word} would have done that, too.  (To avoid performing
+the substitution on parts of words, the pattern would have to be
+\regexp{\e bword\e b}, in order to require that \samp{word} have a
+word boundary on either side.  This takes the job beyond 
+\method{replace()}'s abilities.)
+
+Another common task is deleting every occurrence of a single character
+from a string or replacing it with another single character.  You
+might do this with something like \code{re.sub('\e n', ' ', S)}, but
+\method{translate()} is capable of doing both tasks
+and will be faster than any regular expression operation can be.
+
+In short, before turning to the \module{re} module, consider whether
+your problem can be solved with a faster and simpler string method.
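+
+The following sketch puts the string methods next to their
+\module{re} equivalents.  It assumes Python 3, where the translation
+table for \method{translate()} is built with \code{str.maketrans()};
+the sample strings are invented.
+

```python
import re

text = 'a word about swordfish'

# replace() substitutes every occurrence, even inside other words.
plain = text.replace('word', 'deed')           # 'a deed about sdeedfish'

# Only the RE with word boundaries leaves 'swordfish' alone.
bounded = re.sub(r'\bword\b', 'deed', text)    # 'a deed about swordfish'

lines = 'line one\nline two\n'

# translate() can replace a single character with another ...
spaced = lines.translate(str.maketrans('\n', ' '))    # 'line one line two '

# ... or delete it entirely (third argument lists characters to drop).
joined = lines.translate(str.maketrans('', '', '\n')) # 'line oneline two'
```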
+
+\subsection{match() versus search()}
+
+The \function{match()} function only checks if the RE matches at
+the beginning of the string while \function{search()} will scan
+forward through the string for a match.
+It's important to keep this distinction in mind.  Remember, 
+\function{match()} will only report a successful match that
+starts at position 0; if the match wouldn't start at zero, 
+\function{match()} will \emph{not} report it.
+
+\begin{verbatim}
+>>> print re.match('super', 'superstition').span()  
+(0, 5)
+>>> print re.match('super', 'insuperable')    
+None
+\end{verbatim}
+
+On the other hand, \function{search()} will scan forward through the
+string, reporting the first match it finds.
+
+\begin{verbatim}
+>>> print re.search('super', 'superstition').span()
+(0, 5)
+>>> print re.search('super', 'insuperable').span()
+(2, 7)
+\end{verbatim}
+
+Sometimes you'll be tempted to keep using \function{re.match()}, and
+just add \regexp{.*} to the front of your RE.  Resist this temptation
+and use \function{re.search()} instead.  The regular expression
+compiler does some analysis of REs in order to speed up the process of
+looking for a match.  One such analysis figures out what the first
+character of a match must be; for example, a pattern starting with
+\regexp{Crow} must match starting with a \character{C}.  The analysis
+lets the engine quickly scan through the string looking for the
+starting character, only trying the full match if a \character{C} is found.
+
+Adding \regexp{.*} defeats this optimization, requiring scanning to
+the end of the string and then backtracking to find a match for the
+rest of the RE.  Use \function{re.search()} instead.
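+
+Apart from the speed concern, the two approaches also report different
+spans.  A small sketch, using the same sample word as above:
+

```python
import re

text = 'insuperable'

# match() with a leading .* succeeds, but the reported span starts
# at 0 because the .* is part of the match.
padded = re.match(r'.*super', text).span()    # (0, 7)

# search() reports where 'super' itself was found.
found = re.search(r'super', text).span()      # (2, 7)
```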
+
+\subsection{Greedy versus Non-Greedy}
+
+When repeating a regular expression, as in \regexp{a*}, the resulting
+action is to consume as much of the string as possible.  This
+behaviour often causes trouble when you're trying to match a pair of
+balanced delimiters, such as the angle brackets surrounding an HTML
+tag.  The na{\"\i}ve pattern for matching a single HTML tag doesn't
+work because of the greedy nature of \regexp{.*}.
+
+\begin{verbatim}
+>>> s = '<html><head><title>Title</title>'
+>>> len(s)
+32
+>>> print re.match('<.*>', s).span()
+(0, 32)
+>>> print re.match('<.*>', s).group()
+<html><head><title>Title</title>
+\end{verbatim}
+
+The RE matches the \character{<} in \samp{<html>}, and the
+\regexp{.*} consumes the rest of the string.  There's still more left
+in the RE, though, and the \regexp{>} can't match at the end of
+the string, so the regular expression engine has to backtrack
+character by character until it finds a match for the \regexp{>}.  
+The final match extends from the \character{<} in \samp{<html>}
+to the \character{>} in \samp{</title>}, which isn't what you want.
+
+In this case, the solution is to use the non-greedy qualifiers
+\regexp{*?}, \regexp{+?}, \regexp{??}, or
+\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
+possible.  In the above example, the \character{>} is tried
+immediately after the first \character{<} matches, and when it fails,
+the engine advances a character at a time, retrying the \character{>}
+at every step.  This produces just the right result:
+
+\begin{verbatim}
+>>> print re.match('<.*?>', s).group()
+<html>
+\end{verbatim}
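+
+The difference is also easy to see with \function{findall()}, which
+returns every non-overlapping match.  A brief sketch using the same
+string \code{s}:
+

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: a single match that swallows the whole string.
greedy = re.findall(r'<.*>', s)
# ['<html><head><title>Title</title>']

# Non-greedy: each tag is matched separately.
lazy = re.findall(r'<.*?>', s)
# ['<html>', '<head>', '<title>', '</title>']
```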
+
+(Note that parsing HTML or XML with regular expressions is painful.
+Quick-and-dirty patterns will handle common cases, but HTML and XML
+have special cases that will break the obvious regular expression; by
+the time you've written a regular expression that handles all of the
+possible cases, the patterns will be \emph{very} complicated.  Use an
+HTML or XML parser module for such tasks.)
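+
+As one possible illustration of the parser approach, the sketch below
+collects start tags with the standard library's HTML parser.  It
+assumes Python 3, where the module is named \module{html.parser}; the
+\class{TagCollector} class is a hypothetical helper written for this
+example.
+

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records the name of every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called by the parser once for each opening tag it encounters.
        self.tags.append(tag)


collector = TagCollector()
collector.feed('<html><head><title>Title</title>')
collector.close()

start_tags = collector.tags    # ['html', 'head', 'title']
```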
+
+\subsection{Not Using re.VERBOSE}
+
+By now you've probably noticed that regular expressions are a very
+compact notation, but they're not terribly readable.  REs of
+moderate complexity can become lengthy collections of backslashes,
+parentheses, and metacharacters, making them difficult to read and
+understand.  
+
+For such REs, specifying the \code{re.VERBOSE} flag when
+compiling the regular expression can be helpful, because it allows
+you to format the regular expression more clearly.
+
+The \code{re.VERBOSE} flag has several effects.  Whitespace in the
+regular expression that \emph{isn't} inside a character class is
+ignored.  This means that an expression such as \regexp{dog | cat} is
+equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
+will still match the characters \character{a}, \character{b}, or a
+space.  In addition, you can also put comments inside a RE; comments
+extend from a \samp{\#} character to the next newline.  When used with
+triple-quoted strings, this enables REs to be formatted more neatly:
+
+\begin{verbatim}
+pat = re.compile(r"""
+ \s*                 # Skip leading whitespace
+ (?P<header>[^:]+)   # Header name
+ \s* :               # Whitespace, and a colon
+ (?P<value>.*?)      # The header's value -- *? used to
+                     # lose the following trailing whitespace
+ \s*$                # Trailing whitespace to end-of-line
+""", re.VERBOSE)
+\end{verbatim}
+% $
+
+This is far more readable than:
+
+\begin{verbatim}
+pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
+\end{verbatim}
+% $
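+
+The two forms compile to equivalent patterns, which can be checked
+with a short sketch.  The header line used here is invented.  Note
+that the value keeps its leading space, because \regexp{.*?} starts
+matching immediately after the colon.
+

```python
import re

verbose = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

line = '  Content-Type: text/plain   '   # hypothetical header line

m = verbose.match(line)
header = m.group('header')    # 'Content-Type'
value = m.group('value')      # ' text/plain'

# Both spellings of the pattern capture the same groups.
same = verbose.match(line).groups() == compact.match(line).groups()
```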
+
+\section{Feedback}
+
+Regular expressions are a complicated topic.  Did this document help
+you understand them?  Were there parts that were unclear, or problems
+you encountered that weren't covered here?  If so, please send
+suggestions for improvements to the author.
+
+The most complete book on regular expressions is almost certainly
+Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
+by O'Reilly.  Unfortunately, it exclusively concentrates on Perl and
+Java's flavours of regular expressions, and doesn't contain any Python
+material at all, so it won't be useful as a reference for programming
+in Python.  (The first edition covered Python's now-obsolete
+\module{regex} module, which won't help you much.)  Consider checking
+it out from your library.
+
+\end{document}
+