path: root/Doc/howto/regex.tex
author Georg Brandl <georg@python.org> 2007-08-15 14:27:07 (GMT)
committer Georg Brandl <georg@python.org> 2007-08-15 14:27:07 (GMT)
commit 739c01d47b9118d04e5722333f0e6b4d0c8bdd9e
tree f82b450d291927fc1758b96d981aa0610947b529 /Doc/howto/regex.tex
parent 2d1649094402ef393ea2b128ba2c08c3937e6b93
Delete the LaTeX doc tree.
Diffstat (limited to 'Doc/howto/regex.tex')
-rw-r--r-- Doc/howto/regex.tex 1476
1 files changed, 0 insertions, 1476 deletions
diff --git a/Doc/howto/regex.tex b/Doc/howto/regex.tex
deleted file mode 100644
index d911be6..0000000
--- a/Doc/howto/regex.tex
+++ /dev/null
@@ -1,1476 +0,0 @@
-\documentclass{howto}
-
-% TODO:
-% Document lookbehind assertions
-% Better way of displaying a RE, a string, and what it matches
-% Mention optional argument to match.groups()
-% Unicode (at least a reference)
-
-\title{Regular Expression HOWTO}
-
-\release{0.05}
-
-\author{A.M. Kuchling}
-\authoraddress{\email{amk@amk.ca}}
-
-\begin{document}
-\maketitle
-
-\begin{abstract}
-\noindent
-This document is an introductory tutorial to using regular expressions
-in Python with the \module{re} module. It provides a gentler
-introduction than the corresponding section in the Library Reference.
-
-This document is available from
-\url{http://www.amk.ca/python/howto}.
-
-\end{abstract}
-
-\tableofcontents
-
-\section{Introduction}
-
-The \module{re} module was added in Python 1.5, and provides
-Perl-style regular expression patterns. Earlier versions of Python
-came with the \module{regex} module, which provided Emacs-style
-patterns. The \module{regex} module was removed completely in Python 2.5.
-
-Regular expressions (called REs, or regexes, or regex patterns) are
-essentially a tiny, highly specialized programming language embedded
-inside Python and made available through the \module{re} module.
-Using this little language, you specify the rules for the set of
-possible strings that you want to match; this set might contain
-English sentences, or e-mail addresses, or TeX commands, or anything
-you like. You can then ask questions such as ``Does this string match
-the pattern?'', or ``Is there a match for the pattern anywhere in this
-string?''. You can also use REs to modify a string or to split it
-apart in various ways.
-
-Regular expression patterns are compiled into a series of bytecodes
-which are then executed by a matching engine written in C. For
-advanced use, it may be necessary to pay careful attention to how the
-engine will execute a given RE, and write the RE in a certain way in
-order to produce bytecode that runs faster. Optimization isn't
-covered in this document, because it requires that you have a good
-understanding of the matching engine's internals.
-
-The regular expression language is relatively small and restricted, so
-not all possible string processing tasks can be done using regular
-expressions. There are also tasks that \emph{can} be done with
-regular expressions, but the expressions turn out to be very
-complicated. In these cases, you may be better off writing Python
-code to do the processing; while Python code will be slower than an
-elaborate regular expression, it will also probably be more understandable.
-
-\section{Simple Patterns}
-
-We'll start by learning about the simplest possible regular
-expressions. Since regular expressions are used to operate on
-strings, we'll begin with the most common task: matching characters.
-
-For a detailed explanation of the computer science underlying regular
-expressions (deterministic and non-deterministic finite automata), you
-can refer to almost any textbook on writing compilers.
-
-\subsection{Matching Characters}
-
-Most letters and characters will simply match themselves. For
-example, the regular expression \regexp{test} will match the string
-\samp{test} exactly. (You can enable a case-insensitive mode that
-would let this RE match \samp{Test} or \samp{TEST} as well; more
-about this later.)
-
-There are exceptions to this rule; some characters are special
-\dfn{metacharacters}, and don't match themselves. Instead, they
-signal that some out-of-the-ordinary thing should be matched, or they
-affect other portions of the RE by repeating them or changing their
-meaning. Much of this document is devoted to discussing various
-metacharacters and what they do.
-
-Here's a complete list of the metacharacters; their meanings will be
-discussed in the rest of this HOWTO.
-
-\begin{verbatim}
-. ^ $ * + ? { [ ] \ | ( )
-\end{verbatim}
-% $
-
-The first metacharacters we'll look at are \samp{[} and \samp{]}.
-They're used for specifying a character class, which is a set of
-characters that you wish to match. Characters can be listed
-individually, or a range of characters can be indicated by giving two
-characters and separating them by a \character{-}. For example,
-\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or
-\samp{c}; this is the same as
-\regexp{[a-c]}, which uses a range to express the same set of
-characters. If you wanted to match only lowercase letters, your
-RE would be \regexp{[a-z]}.
-
-Most metacharacters are not active inside classes. For example,
-\regexp{[akm\$]} will match any of the characters \character{a},
-\character{k}, \character{m}, or \character{\$}; \character{\$} is
-usually a metacharacter, but inside a character class it's stripped of
-its special nature.
-
-You can match the characters not listed within the class by
-\dfn{complementing} the set. This is indicated by including a
-\character{\^} as the first character of the class; a \character{\^}
-anywhere else inside the class will simply match the
-\character{\^} character. For example, \verb|[^5]| will match any
-character except \character{5}.
-
-Perhaps the most important metacharacter is the backslash, \samp{\e}.
-As in Python string literals, the backslash can be followed by various
-characters to signal various special sequences. It's also used to escape
-all the metacharacters so you can still match them in patterns; for
-example, if you need to match a \samp{[} or
-\samp{\e}, you can precede them with a backslash to remove their
-special meaning: \regexp{\e[} or \regexp{\e\e}.
-
-Some of the special sequences beginning with \character{\e} represent
-predefined sets of characters that are often useful, such as the set
-of digits, the set of letters, or the set of anything that isn't
-whitespace. The following predefined special sequences are available:
-
-\begin{itemize}
-\item[\code{\e d}]Matches any decimal digit; this is
-equivalent to the class \regexp{[0-9]}.
-
-\item[\code{\e D}]Matches any non-digit character; this is
-equivalent to the class \verb|[^0-9]|.
-
-\item[\code{\e s}]Matches any whitespace character; this is
-equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}.
-
-\item[\code{\e S}]Matches any non-whitespace character; this is
-equivalent to the class \verb|[^ \t\n\r\f\v]|.
-
-\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class
-\regexp{[a-zA-Z0-9_]}.
-
-\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class
-\verb|[^a-zA-Z0-9_]|.
-\end{itemize}
-
-These sequences can be included inside a character class. For
-example, \regexp{[\e s,.]} is a character class that will match any
-whitespace character, or \character{,} or \character{.}.
-
-The final metacharacter in this section is \regexp{.}. It matches
-anything except a newline character, and there's an alternate mode
-(\code{re.DOTALL}) where it will match even a newline. \character{.}
-is often used where you want to match ``any character''.
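The character-class rules described in this subsection can be checked with a short script. This is only a sketch: it is written for Python 3 (an assumption, since the sessions in this HOWTO use Python 2), and the sample strings and variable names are made up for illustration.

```python
import re

# A class lists individual characters or ranges; [abc] and [a-c] are the same set.
listed = re.findall(r'[abc]', 'abcdef')
ranged = re.findall(r'[a-c]', 'abcdef')

# '$' loses its special meaning inside a class.
dollar_hits = re.findall(r'[akm$]', 'make $5')

# A leading '^' complements the class.
not_five = re.findall(r'[^5]', '1525')

# Predefined sequences work on their own and inside a class.
digits = re.findall(r'\d', 'a1b22')
separators = re.findall(r'[\s,.]', 'a, b.c')

# '.' skips newlines unless re.DOTALL is given.
dot_default = re.match(r'a.b', 'a\nb')
dot_all = re.match(r'a.b', 'a\nb', re.DOTALL)

print(listed, ranged, dollar_hits, not_five, digits, separators)
print(dot_default, dot_all is not None)
```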
-
-\subsection{Repeating Things}
-
-Being able to match varying sets of characters is the first thing
-regular expressions can do that isn't already possible with the
-methods available on strings. However, if that was the only
-additional capability of regexes, they wouldn't be much of an advance.
-Another capability is that you can specify that portions of the RE
-must be repeated a certain number of times.
-
-The first metacharacter for repeating things that we'll look at is
-\regexp{*}. \regexp{*} doesn't match the literal character \samp{*};
-instead, it specifies that the previous character can be matched zero
-or more times, instead of exactly once.
-
-For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a}
-characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a}
-characters), and so forth. The RE engine has various internal
-limitations stemming from the size of C's \code{int} type that will
-prevent it from matching over 2 billion \samp{a} characters; you
-probably don't have enough memory to construct a string that large, so
-you shouldn't run into that limit.
-
-Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE,
-the matching engine will try to repeat it as many times as possible.
-If later portions of the pattern don't match, the matching engine will
-then back up and try again with fewer repetitions.
-
-A step-by-step example will make this more obvious. Let's consider
-the expression \regexp{a[bcd]*b}. This matches the letter
-\character{a}, zero or more letters from the class \code{[bcd]}, and
-finally ends with a \character{b}. Now imagine matching this RE
-against the string \samp{abcbd}.
-
-\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation}
-\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.}
-\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as
-it can, which is to the end of the string.}
-\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the
-current position is at the end of the string, so it fails.}
-\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches
-one less character.}
-\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the
-current position is at the last character, which is a \character{d}.}
-\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is
-only matching \samp{bc}.}
-\lineiii{7}{\code{abcb}}{Try \regexp{b} again. This time
-the character at the current position is \character{b}, so it succeeds.}
-\end{tableiii}
-
-The end of the RE has now been reached, and it has matched
-\samp{abcb}. This demonstrates how the matching engine goes as far as
-it can at first, and if no match is found it will then progressively
-back up and retry the rest of the RE again and again. It will back up
-until it has tried zero matches for \regexp{[bcd]*}, and if that
-subsequently fails, the engine will conclude that the string doesn't
-match the RE at all.
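The end result of the backtracking walk-through can be confirmed directly. The sketch below assumes Python 3; the second string, 'acd', is an invented case in which every backed-up position fails.

```python
import re

pattern = re.compile(r'a[bcd]*b')

# Greedy matching first consumes 'bcbd', then backs up until a 'b' can follow.
m = pattern.match('abcbd')
matched_text = m.group()
matched_span = m.span()

# No position of [bcd]* leaves a 'b' to match, so the whole match fails.
failed = pattern.match('acd')

print(matched_text, matched_span, failed)
```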
-
-Another repeating metacharacter is \regexp{+}, which matches one or
-more times. Pay careful attention to the difference between
-\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more
-times, so whatever's being repeated may not be present at all, while
-\regexp{+} requires at least \emph{one} occurrence. To use a similar
-example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}),
-\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}.
-
-There are two more repeating qualifiers. The question mark character,
-\regexp{?}, matches either once or zero times; you can think of it as
-marking something as being optional. For example, \regexp{home-?brew}
-matches either \samp{homebrew} or \samp{home-brew}.
-
-The most complicated repeating qualifier is
-\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal
-integers. This qualifier means there must be at least \var{m}
-repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b}
-will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match
-\samp{ab}, which has no slashes, or \samp{a////b}, which has four.
-
-You can omit either \var{m} or \var{n}; in that case, a reasonable
-value is assumed for the missing value. Omitting \var{m} is
-interpreted as a lower limit of 0, while omitting \var{n} results in
-an upper bound of infinity --- actually, the upper bound is the
-2-billion limit mentioned earlier, but that might as well be infinity.
-
-Readers of a reductionist bent may notice that the three other qualifiers
-can all be expressed using this notation. \regexp{\{0,\}} is the same
-as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and
-\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use
-\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because
-they're shorter and easier to read.
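The equivalences between the qualifiers can be tested mechanically. In this sketch the helper matches_whole() is hypothetical (it is not part of the re module), and re.fullmatch() assumes Python 3.4 or later.

```python
import re

def matches_whole(pattern, text):
    """Hypothetical helper: True if the pattern matches the entire string."""
    return re.fullmatch(pattern, text) is not None

# {1,3} accepts one to three slashes, nothing else.
slash_results = [matches_whole(r'a/{1,3}b', s)
                 for s in ['ab', 'a/b', 'a//b', 'a///b', 'a////b']]

# Omitting m gives a lower limit of 0.
no_lower_bound = matches_whole(r'a/{,3}b', 'ab')

# The short qualifiers agree with their {m,n} spellings on every sample.
samples = ['ct', 'cat', 'caaat']
star_same = all(matches_whole(r'ca*t', s) == matches_whole(r'ca{0,}t', s) for s in samples)
plus_same = all(matches_whole(r'ca+t', s) == matches_whole(r'ca{1,}t', s) for s in samples)
opt_same = all(matches_whole(r'home-?brew', s) == matches_whole(r'home-{0,1}brew', s)
               for s in ['homebrew', 'home-brew', 'home--brew'])

print(slash_results, no_lower_bound, star_same, plus_same, opt_same)
```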
-
-\section{Using Regular Expressions}
-
-Now that we've looked at some simple regular expressions, how do we
-actually use them in Python? The \module{re} module provides an
-interface to the regular expression engine, allowing you to compile
-REs into objects and then perform matches with them.
-
-\subsection{Compiling Regular Expressions}
-
-Regular expressions are compiled into \class{RegexObject} instances,
-which have methods for various operations such as searching for
-pattern matches or performing string substitutions.
-
-\begin{verbatim}
->>> import re
->>> p = re.compile('ab*')
->>> print p
-<re.RegexObject instance at 80b4150>
-\end{verbatim}
-
-\function{re.compile()} also accepts an optional \var{flags}
-argument, used to enable various special features and syntax
-variations. We'll go over the available settings later, but for now a
-single example will do:
-
-\begin{verbatim}
->>> p = re.compile('ab*', re.IGNORECASE)
-\end{verbatim}
-
-The RE is passed to \function{re.compile()} as a string. REs are
-handled as strings because regular expressions aren't part of the core
-Python language, and no special syntax was created for expressing
-them. (There are applications that don't need REs at all, so there's
-no need to bloat the language specification by including them.)
-Instead, the \module{re} module is simply a C extension module
-included with Python, just like the \module{socket} or \module{zlib}
-modules.
-
-Putting REs in strings keeps the Python language simpler, but has one
-disadvantage which is the topic of the next section.
-
-\subsection{The Backslash Plague}
-
-As stated earlier, regular expressions use the backslash
-character (\character{\e}) to indicate special forms or to allow
-special characters to be used without invoking their special meaning.
-This conflicts with Python's usage of the same character for the same
-purpose in string literals.
-
-Let's say you want to write a RE that matches the string
-\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure
-out what to write in the program code, start with the desired string
-to be matched. Next, you must escape any backslashes and other
-metacharacters by preceding them with a backslash, resulting in the
-string \samp{\e\e section}. The string passed to
-\function{re.compile()} must therefore be \verb|\\section|. However, to
-express this as a Python string literal, both backslashes must be
-escaped \emph{again}.
-
-\begin{tableii}{c|l}{code}{Characters}{Stage}
- \lineii{\e section}{Text string to be matched}
- \lineii{\e\e section}{Escaped backslash for \function{re.compile}}
- \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal}
-\end{tableii}
-
-In short, to match a literal backslash, one has to write
-\code{'\e\e\e\e'} as the RE string, because the regular expression
-must be \samp{\e\e}, and each backslash must be expressed as
-\samp{\e\e} inside a regular Python string literal. In REs that
-feature backslashes repeatedly, this leads to lots of repeated
-backslashes and makes the resulting strings difficult to understand.
-
-The solution is to use Python's raw string notation for regular
-expressions; backslashes are not handled in any special way in
-a string literal prefixed with \character{r}, so \code{r"\e n"} is a
-two-character string containing \character{\e} and \character{n},
-while \code{"\e n"} is a one-character string containing a newline.
-Regular expressions will often be written in Python
-code using this raw string notation.
-
-\begin{tableii}{c|c}{code}{Regular String}{Raw string}
- \lineii{"ab*"}{\code{r"ab*"}}
- \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}}
- \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}}
-\end{tableii}
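The rows of the table can be compared in the interpreter. This is a sketch for Python 3; the LaTeX sample line is invented for illustration.

```python
import re

# A raw string keeps the backslash; a regular string turns \n into a newline.
raw_len = len(r"\n")
regular_len = len("\n")

# Both spellings produce the same three-character-prefixed pattern \\section.
regular_pattern = "\\\\section"
raw_pattern = r"\\section"
same_pattern = regular_pattern == raw_pattern

# The pattern matches a literal backslash followed by 'section'.
latex_line = r"\section{Introduction}"
found = re.search(raw_pattern, latex_line).group()

print(raw_len, regular_len, same_pattern, found)
```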
-
-\subsection{Performing Matches}
-
-Once you have an object representing a compiled regular expression,
-what do you do with it? \class{RegexObject} instances have several
-methods and attributes. Only the most significant ones will be
-covered here; consult \ulink{the Library
-Reference}{http://www.python.org/doc/lib/module-re.html} for a
-complete listing.
-
-\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
- \lineii{match()}{Determine if the RE matches at the beginning of
- the string.}
- \lineii{search()}{Scan through a string, looking for any location
- where this RE matches.}
- \lineii{findall()}{Find all substrings where the RE matches,
-and return them as a list.}
- \lineii{finditer()}{Find all substrings where the RE matches,
-and return them as an iterator.}
-\end{tableii}
-
-\method{match()} and \method{search()} return \code{None} if no match
-can be found. If they're successful, a \code{MatchObject} instance is
-returned, containing information about the match: where it starts and
-ends, the substring it matched, and more.
-
-You can learn about this by interactively experimenting with the
-\module{re} module. If you have Tkinter available, you may also want
-to look at \file{Tools/scripts/redemo.py}, a demonstration program
-included with the Python distribution. It allows you to enter REs and
-strings, and displays whether the RE matches or fails.
-\file{redemo.py} can be quite useful when trying to debug a
-complicated RE. Phil Schwartz's
-\ulink{Kodos}{http://www.phil-schwartz.com/kodos.spy} is also an interactive
-tool for developing and testing RE patterns.
-
-This HOWTO uses the standard Python interpreter for its examples.
-First, run the Python interpreter, import the \module{re} module, and
-compile a RE:
-
-\begin{verbatim}
-Python 2.2.2 (#1, Feb 10 2003, 12:57:01)
->>> import re
->>> p = re.compile('[a-z]+')
->>> p
-<_sre.SRE_Pattern object at 80c3c28>
-\end{verbatim}
-
-Now, you can try matching various strings against the RE
-\regexp{[a-z]+}. An empty string shouldn't match at all, since
-\regexp{+} means ``one or more repetitions''. \method{match()} should
-return \code{None} in this case, which will cause the interpreter to
-print no output. You can explicitly print the result of
-\method{match()} to make this clear.
-
-\begin{verbatim}
->>> p.match("")
->>> print p.match("")
-None
-\end{verbatim}
-
-Now, let's try it on a string that it should match, such as
-\samp{tempo}. In this case, \method{match()} will return a
-\class{MatchObject}, so you should store the result in a variable for
-later use.
-
-\begin{verbatim}
->>> m = p.match('tempo')
->>> print m
-<_sre.SRE_Match object at 80c4f68>
-\end{verbatim}
-
-Now you can query the \class{MatchObject} for information about the
-matching string. \class{MatchObject} instances also have several
-methods and attributes; the most important ones are:
-
-\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
- \lineii{group()}{Return the string matched by the RE}
- \lineii{start()}{Return the starting position of the match}
- \lineii{end()}{Return the ending position of the match}
- \lineii{span()}{Return a tuple containing the (start, end) positions
- of the match}
-\end{tableii}
-
-Trying these methods will soon clarify their meaning:
-
-\begin{verbatim}
->>> m.group()
-'tempo'
->>> m.start(), m.end()
-(0, 5)
->>> m.span()
-(0, 5)
-\end{verbatim}
-
-\method{group()} returns the substring that was matched by the
-RE. \method{start()} and \method{end()} return the starting and
-ending index of the match. \method{span()} returns both start and end
-indexes in a single tuple. Since the \method{match} method only
-checks if the RE matches at the start of a string,
-\method{start()} will always be zero. However, the \method{search}
-method of \class{RegexObject} instances scans through the string, so
-the match may not start at zero in that case.
-
-\begin{verbatim}
->>> print p.match('::: message')
-None
->>> m = p.search('::: message') ; print m
-<re.MatchObject instance at 80c9650>
->>> m.group()
-'message'
->>> m.span()
-(4, 11)
-\end{verbatim}
-
-In actual programs, the most common style is to store the
-\class{MatchObject} in a variable, and then check if it was
-\code{None}. This usually looks like:
-
-\begin{verbatim}
-p = re.compile( ... )
-m = p.match( 'string goes here' )
-if m:
- print 'Match found: ', m.group()
-else:
- print 'No match'
-\end{verbatim}
-
-Two \class{RegexObject} methods return all of the matches for a pattern.
-\method{findall()} returns a list of matching strings:
-
-\begin{verbatim}
->>> p = re.compile(r'\d+')
->>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
-['12', '11', '10']
-\end{verbatim}
-
-\method{findall()} has to create the entire list before it can be
-returned as the result. The \method{finditer()} method returns a
-sequence of \class{MatchObject} instances as an
-iterator.\footnote{Introduced in Python 2.2.2.}
-
-\begin{verbatim}
->>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
->>> iterator
-<callable-iterator object at 0x401833ac>
->>> for match in iterator:
-... print match.span()
-...
-(0, 2)
-(22, 24)
-(29, 31)
-\end{verbatim}
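The difference between the two methods is easy to see side by side. The sketch below assumes Python 3 and uses the full sentence from the findall() example rather than the abbreviated one, so the last span differs from the session above.

```python
import re

p = re.compile(r'\d+')
text = '12 drummers drumming, 11 pipers piping, 10 lords a-leaping'

# findall() builds the complete list of matched strings up front.
numbers = p.findall(text)

# finditer() yields match objects one at a time, so positions are available too.
spans = [match.span() for match in p.finditer(text)]

print(numbers)
print(spans)
```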
-
-
-\subsection{Module-Level Functions}
-
-You don't have to create a \class{RegexObject} and call its methods;
-the \module{re} module also provides top-level functions called
-\function{match()}, \function{search()}, \function{findall()},
-\function{sub()}, and so forth. These functions take the same
-arguments as the corresponding \class{RegexObject} method, with the RE
-string added as the first argument, and return the same kind of result;
-\function{match()} and \function{search()}, for example, still return
-either \code{None} or a \class{MatchObject} instance.
-
-\begin{verbatim}
->>> print re.match(r'From\s+', 'Fromage amk')
-None
->>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
-<re.MatchObject instance at 80c5978>
-\end{verbatim}
-
-Under the hood, these functions simply produce a \class{RegexObject}
-for you and call the appropriate method on it. They also store the
-compiled object in a cache, so future calls using the same
-RE are faster.
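A quick way to see that the two styles are interchangeable is to run both on the same input. This sketch assumes Python 3 and reuses the mailbox-style line from the session above.

```python
import re

line = 'From amk Thu May 14 19:12:10 1998'

# Style 1: compile once, then call the method on the compiled object.
compiled = re.compile(r'From\s+')
via_method = compiled.match(line)

# Style 2: pass the pattern string to the module-level function.
via_function = re.match(r'From\s+', line)

same_span = via_method.span() == via_function.span()
no_match = re.match(r'From\s+', 'Fromage amk')

print(via_method.span(), same_span, no_match)
```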
-
-Should you use these module-level functions, or should you get the
-\class{RegexObject} and call its methods yourself? That choice
-depends on how frequently the RE will be used, and on your personal
-coding style. If the RE is being used at only one point in the code,
-then the module functions are probably more convenient. If a program
-contains a lot of regular expressions, or re-uses the same ones in
-several locations, then it might be worthwhile to collect all the
-definitions in one place, in a section of code that compiles all the
-REs ahead of time. To take an example from the standard library:
-
-\begin{verbatim}
-ref = re.compile( ... )
-entityref = re.compile( ... )
-charref = re.compile( ... )
-starttagopen = re.compile( ... )
-\end{verbatim}
-
-I generally prefer to work with the compiled object, even for
-one-time uses, but few people will be as much of a purist about this
-as I am.
-
-\subsection{Compilation Flags}
-
-Compilation flags let you modify some aspects of how regular
-expressions work. Flags are available in the \module{re} module under
-two names, a long name such as \constant{IGNORECASE} and a short,
-one-letter form such as \constant{I}. (If you're familiar with Perl's
-pattern modifiers, the one-letter forms use the same letters; the
-short form of \constant{re.VERBOSE} is \constant{re.X}, for example.)
-Multiple flags can be specified by bitwise OR-ing them; \code{re.I |
-re.M} sets both the \constant{I} and \constant{M} flags, for example.
-
-Here's a table of the available flags, followed by
-a more detailed explanation of each one.
-
-\begin{tableii}{c|l}{}{Flag}{Meaning}
- \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any
- character, including newlines}
- \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches}
- \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match}
- \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching,
- affecting \regexp{\^} and \regexp{\$}}
- \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs,
- which can be organized more cleanly and understandably.}
-\end{tableii}
-
-\begin{datadesc}{I}
-\dataline{IGNORECASE}
-Perform case-insensitive matching; character class and literal strings
-will match
-letters by ignoring case. For example, \regexp{[A-Z]} will match
-lowercase letters, too, and \regexp{Spam} will match \samp{Spam},
-\samp{spam}, or \samp{spAM}.
-This lowercasing doesn't take the current locale into account; it will
-if you also set the \constant{LOCALE} flag.
-\end{datadesc}
-
-\begin{datadesc}{L}
-\dataline{LOCALE}
-Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b},
-and \regexp{\e B} dependent on the current locale.
-
-Locales are a feature of the C library intended to help in writing
-programs that take account of language differences. For example, if
-you're processing French text, you'd want to be able to write
-\regexp{\e w+} to match words, but \regexp{\e w} only matches the
-character class \regexp{[A-Za-z0-9_]}; it won't match \character{\'e} or
-\character{\c c}. If your system is configured properly and a French
-locale is selected, certain C functions will tell the program that
-\character{\'e} should also be considered a letter. Setting the
-\constant{LOCALE} flag when compiling a regular expression will cause the
-resulting compiled object to use these C functions for \regexp{\e w};
-this is slower, but also enables \regexp{\e w+} to match French words as
-you'd expect.
-\end{datadesc}
-
-\begin{datadesc}{M}
-\dataline{MULTILINE}
-(\regexp{\^} and \regexp{\$} haven't been explained yet;
-they'll be introduced in section~\ref{more-metacharacters}.)
-
-Usually \regexp{\^} matches only at the beginning of the string, and
-\regexp{\$} matches only at the end of the string and immediately before the
-newline (if any) at the end of the string. When this flag is
-specified, \regexp{\^} matches at the beginning of the string and at
-the beginning of each line within the string, immediately following
-each newline. Similarly, the \regexp{\$} metacharacter matches both at
-the end of the string and at the end of each line (immediately
-preceding each newline).
-
-\end{datadesc}
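The effect of MULTILINE, and of combining flags with bitwise OR, can be sketched as follows. Python 3 is assumed, and the three-line sample text is invented.

```python
import re

text = 'first line\nsecond line\nthird line'

# Without MULTILINE, '^' and '$' only anchor at the ends of the whole string.
starts_default = re.findall(r'^\w+', text)
ends_default = re.findall(r'\w+$', text)

# With MULTILINE, they also anchor at the start and end of every line.
starts_multi = re.findall(r'^\w+', text, re.MULTILINE)
ends_multi = re.findall(r'\w+$', text, re.MULTILINE)

# Flags are combined with '|'; here case-insensitive plus multi-line.
combined = re.findall(r'^second', text.upper(), re.I | re.M)

print(starts_default, ends_default)
print(starts_multi, ends_multi, combined)
```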
-
-\begin{datadesc}{S}
-\dataline{DOTALL}
-Makes the \character{.} special character match any character at all,
-including a newline; without this flag, \character{.} will match
-anything \emph{except} a newline.
-\end{datadesc}
-
-\begin{datadesc}{X}
-\dataline{VERBOSE} This flag allows you to write regular expressions
-that are more readable by granting you more flexibility in how you can
-format them. When this flag has been specified, whitespace within the
-RE string is ignored, except when the whitespace is in a character
-class or preceded by an unescaped backslash; this lets you organize
-and indent the RE more clearly. This flag also lets you put comments
-within a RE that will be ignored by the engine; comments are marked by
-a \character{\#} that's neither in a character class nor preceded by an
-unescaped backslash.
-
-For example, here's a RE that uses \constant{re.VERBOSE}; see how
-much easier it is to read?
-
-\begin{verbatim}
-charref = re.compile(r"""
- &[#] # Start of a numeric entity reference
- (
- 0[0-7]+ # Octal form
- | [0-9]+ # Decimal form
- | x[0-9a-fA-F]+ # Hexadecimal form
- )
- ; # Trailing semicolon
-""", re.VERBOSE)
-\end{verbatim}
-
-Without the verbose setting, the RE would look like this:
-\begin{verbatim}
-charref = re.compile("&#(0[0-7]+"
- "|[0-9]+"
- "|x[0-9a-fA-F]+);")
-\end{verbatim}
-
-In the above example, Python's automatic concatenation of string
-literals has been used to break up the RE into smaller pieces, but
-it's still more difficult to understand than the version using
-\constant{re.VERBOSE}.
-
-\end{datadesc}
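To check that the verbose and compact spellings really describe the same pattern, both can be run over a few inputs. This is a sketch for Python 3; the sample entity strings are made up for the comparison.

```python
import re

verbose_charref = re.compile(r"""
 &[#]                # Start of a numeric entity reference
 (
     0[0-7]+         # Octal form
   | [0-9]+          # Decimal form
   | x[0-9a-fA-F]+   # Hexadecimal form
 )
 ;                   # Trailing semicolon
""", re.VERBOSE)

compact_charref = re.compile("&#(0[0-7]+|[0-9]+|x[0-9a-fA-F]+);")

# Decimal, octal, hexadecimal, then two strings that should be rejected.
samples = ['&#65;', '&#0101;', '&#x41;', '&amp;', '&#xZZ;']
verbose_hits = [bool(verbose_charref.match(s)) for s in samples]
compact_hits = [bool(compact_charref.match(s)) for s in samples]

print(verbose_hits)
print(compact_hits)
```

Note that the \character{\#} in the verbose version sits inside a character class, which is what keeps it from starting a comment.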
-
-\section{More Pattern Power}
-
-So far we've only covered a part of the features of regular
-expressions. In this section, we'll cover some new metacharacters,
-and how to use groups to retrieve portions of the text that was matched.
-
-\subsection{More Metacharacters\label{more-metacharacters}}
-
-There are some metacharacters that we haven't covered yet. Most of
-them will be covered in this section.
-
-Some of the remaining metacharacters to be discussed are
-\dfn{zero-width assertions}. They don't cause the engine to advance
-through the string; instead, they consume no characters at all,
-and simply succeed or fail. For example, \regexp{\e b} is an
-assertion that the current position is located at a word boundary; the
-position isn't changed by the \regexp{\e b} at all. This means that
-zero-width assertions should never be repeated, because if they match
-once at a given location, they can obviously be matched an infinite
-number of times.
-
-\begin{list}{}{}
-
-\item[\regexp{|}]
-Alternation, or the ``or'' operator.
-If A and B are regular expressions,
-\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}.
-\regexp{|} has very low precedence in order to make it work reasonably when
-you're alternating multi-character strings.
-\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not
-\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}.
-
-To match a literal \character{|},
-use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
-
-\item[\regexp{\^}] Matches at the beginning of lines. Unless the
-\constant{MULTILINE} flag has been set, this will only match at the
-beginning of the string. In \constant{MULTILINE} mode, this also
-matches immediately after each newline within the string.
-
-For example, if you wish to match the word \samp{From} only at the
-beginning of a line, the RE to use is \verb|^From|.
-
-\begin{verbatim}
->>> print re.search('^From', 'From Here to Eternity')
-<re.MatchObject instance at 80c1520>
->>> print re.search('^From', 'Reciting From Memory')
-None
-\end{verbatim}
-
-%To match a literal \character{\^}, use \regexp{\e\^} or enclose it
-%inside a character class, as in \regexp{[{\e}\^]}.
-
-\item[\regexp{\$}] Matches at the end of a line, which is defined as
-either the end of the string, or any location followed by a newline
-character.
-
-\begin{verbatim}
->>> print re.search('}$', '{block}')
-<re.MatchObject instance at 80adfa8>
->>> print re.search('}$', '{block} ')
-None
->>> print re.search('}$', '{block}\n')
-<re.MatchObject instance at 80adfa8>
-\end{verbatim}
-% $
-
-To match a literal \character{\$}, use \regexp{\e\$} or enclose it
-inside a character class, as in \regexp{[\$]}.
-
-\item[\regexp{\e A}] Matches only at the start of the string. When
-not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are
-effectively the same. In \constant{MULTILINE} mode, they're
-different: \regexp{\e A} still matches only at the beginning of the
-string, but \regexp{\^} may match at any location inside the string
-that follows a newline character.
-
-\item[\regexp{\e Z}] Matches only at the end of the string.
-
-\item[\regexp{\e b}] Word boundary.
-This is a zero-width assertion that matches only at the
-beginning or end of a word. A word is defined as a sequence of
-alphanumeric characters, so the end of a word is indicated by
-whitespace or a non-alphanumeric character.
-
-The following example matches \samp{class} only when it's a complete
-word; it won't match when it's contained inside another word.
-
-\begin{verbatim}
->>> p = re.compile(r'\bclass\b')
->>> print p.search('no class at all')
-<re.MatchObject instance at 80c8f28>
->>> print p.search('the declassified algorithm')
-None
->>> print p.search('one subclass is')
-None
-\end{verbatim}
-
-There are two subtleties you should remember when using this special
-sequence. First, this is the worst collision between Python's string
-literals and regular expression sequences. In Python's string
-literals, \samp{\e b} is the backspace character, ASCII value 8. If
-you're not using raw strings, then Python will convert the \samp{\e b} to
-a backspace, and your RE won't match as you expect it to. The
-following example looks the same as our previous RE, but omits
-the \character{r} in front of the RE string.
-
-\begin{verbatim}
->>> p = re.compile('\bclass\b')
->>> print p.search('no class at all')
-None
->>> print p.search('\b' + 'class' + '\b')
-<re.MatchObject instance at 80c3ee0>
-\end{verbatim}
-
-Second, inside a character class, where there's no use for this
-assertion, \regexp{\e b} represents the backspace character, for
-compatibility with Python's string literals.
-
-\item[\regexp{\e B}] Another zero-width assertion, this is the
-opposite of \regexp{\e b}, only matching when the current
-position is not at a word boundary.
-
-\end{list}
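-
-The differences between these anchors are easiest to see side by side.
-The following is only a minimal sketch; the sample strings are made up
-for illustration and the variable names are arbitrary.
-
```python
import re

text = 'From here\nFrom there'

# Without MULTILINE, ^ only matches at the start of the string.
plain = re.findall(r'^From', text)

# With MULTILINE, ^ also matches immediately after each newline.
multi = re.findall(r'^From', text, re.MULTILINE)

# \A ignores MULTILINE and still matches only at the start of the string.
start_only = re.findall(r'\AFrom', text, re.MULTILINE)

# $ can match just before a trailing newline; \Z matches only at the
# very end of the string.
dollar = re.search(r'}$', '{block}\n')
strict_end = re.search(r'}\Z', '{block}\n')
```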
-
-\subsection{Grouping}
-
-Frequently you need to obtain more information than just whether the
-RE matched or not. Regular expressions are often used to dissect
-strings by writing a RE divided into several subgroups which
-match different components of interest. For example, an RFC-822
-header line is divided into a header name and a value, separated by a
-\character{:}, like this:
-
-\begin{verbatim}
-From: author@example.com
-User-Agent: Thunderbird 1.5.0.9 (X11/20061227)
-MIME-Version: 1.0
-To: editor@example.com
-\end{verbatim}
-
-This can be handled by writing a regular expression
-which matches an entire header line, and has one group which matches the
-header name, and another group which matches the header's value.
-
-Groups are marked by the \character{(}, \character{)} metacharacters.
-\character{(} and \character{)} have much the same meaning as they do
-in mathematical expressions; they group together the expressions
-contained inside them, and you can repeat the contents of a
-group with a repeating qualifier, such as \regexp{*}, \regexp{+},
-\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example,
-\regexp{(ab)*} will match zero or more repetitions of \samp{ab}.
-
-\begin{verbatim}
->>> p = re.compile('(ab)*')
->>> print p.match('ababababab').span()
-(0, 10)
-\end{verbatim}
-
-Groups indicated with \character{(}, \character{)} also capture the
-starting and ending index of the text that they match; this can be
-retrieved by passing an argument to \method{group()},
-\method{start()}, \method{end()}, and \method{span()}. Groups are
-numbered starting with 0. Group 0 is always present; it's the whole
-RE, so \class{MatchObject} methods all have group 0 as their default
-argument. Later we'll see how to express groups that don't capture
-the span of text that they match.
-
-\begin{verbatim}
->>> p = re.compile('(a)b')
->>> m = p.match('ab')
->>> m.group()
-'ab'
->>> m.group(0)
-'ab'
-\end{verbatim}
-
-Subgroups are numbered from left to right, from 1 upward. Groups can
-be nested; to determine the number, just count the opening parenthesis
-characters, going from left to right.
-
-\begin{verbatim}
->>> p = re.compile('(a(b)c)d')
->>> m = p.match('abcd')
->>> m.group(0)
-'abcd'
->>> m.group(1)
-'abc'
->>> m.group(2)
-'b'
-\end{verbatim}
-
-\method{group()} can be passed multiple group numbers at a time, in
-which case it will return a tuple containing the corresponding values
-for those groups.
-
-\begin{verbatim}
->>> m.group(2,1,2)
-('b', 'abc', 'b')
-\end{verbatim}
-
-The \method{groups()} method returns a tuple containing the strings
-for all the subgroups, from 1 up to however many there are.
-
-\begin{verbatim}
->>> m.groups()
-('abc', 'b')
-\end{verbatim}
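-
-Returning to the header lines shown at the start of this section, the
-two-group approach might be sketched as follows. The pattern here is
-an illustrative assumption, not a complete RFC-822 parser.
-
```python
import re

# Hypothetical pattern: group 1 is the header name, group 2 is its value.
header_re = re.compile(r'([^:\s]+)\s*:\s*(.*)$')

m = header_re.match('User-Agent: Thunderbird 1.5.0.9 (X11/20061227)')

# groups() returns the subgroups in order, so they unpack directly.
name, value = m.groups()
```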
-
-Backreferences in a pattern allow you to specify that the contents of
-an earlier capturing group must also be found at the current location
-in the string. For example, \regexp{\e 1} will succeed if the exact
-contents of group 1 can be found at the current position, and fails
-otherwise. Remember that Python's string literals also use a
-backslash followed by numbers to allow including arbitrary characters
-in a string, so be sure to use a raw string when incorporating
-backreferences in a RE.
-
-For example, the following RE detects doubled words in a string.
-
-\begin{verbatim}
->>> p = re.compile(r'(\b\w+)\s+\1')
->>> p.search('Paris in the the spring').group()
-'the the'
-\end{verbatim}
-
-Backreferences like this aren't often useful for just searching
-through a string --- there are few text formats which repeat data in
-this way --- but you'll soon find out that they're \emph{very} useful
-when performing string substitutions.
-
-\subsection{Non-capturing and Named Groups}
-
-Elaborate REs may use many groups, both to capture substrings of
-interest, and to group and structure the RE itself. In complex REs,
-it becomes difficult to keep track of the group numbers. There are
-two features which help with this problem. Both of them use a common
-syntax for regular expression extensions, so we'll look at that first.
-
-Perl 5 added several additional features to standard regular
-expressions, and the Python \module{re} module supports most of them.
-It would have been difficult to choose new
-single-keystroke metacharacters or new special sequences beginning
-with \samp{\e} to represent the new features without making Perl's
-regular expressions confusingly different from standard REs. If
-\samp{\&} had been chosen as a new metacharacter, for example, existing
-expressions would have treated \samp{\&} as a regular character and
-wouldn't have escaped it by writing \regexp{\e \&} or \regexp{[\&]}.
-
-The solution chosen by the Perl developers was to use \regexp{(?...)}
-as the extension syntax. \samp{?} immediately after a parenthesis was
-a syntax error because the \samp{?} would have nothing to repeat, so
-this didn't introduce any compatibility problems. The characters
-immediately after the \samp{?} indicate what extension is being used,
-so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and
-\regexp{(?:foo)} is something else (a non-capturing group containing
-the subexpression \regexp{foo}).
-
-Python adds an extension syntax to Perl's extension syntax. If the
-first character after the question mark is a \samp{P}, you know that
-it's an extension that's specific to Python. Currently there are two
-such extensions: \regexp{(?P<\var{name}>...)} defines a named group,
-and \regexp{(?P=\var{name})} is a backreference to a named group. If
-future versions of Perl 5 add similar features using a different
-syntax, the \module{re} module will be changed to support the new
-syntax, while preserving the Python-specific syntax for
-compatibility's sake.
-
-Now that we've looked at the general extension syntax, we can return
-to the features that simplify working with groups in complex REs.
-Since groups are numbered from left to right and a complex expression
-may use many groups, it can become difficult to keep track of the
-correct numbering. Modifying such a complex RE is annoying, too:
-insert a new group near the beginning and you change the numbers of
-everything that follows it.
-
-Sometimes you'll want to use a group to collect a part of a regular
-expression, but aren't interested in retrieving the group's contents.
-You can make this fact explicit by using a non-capturing group:
-\regexp{(?:...)}, where you can replace the \regexp{...}
-with any other regular expression.
-
-\begin{verbatim}
->>> m = re.match("([abc])+", "abc")
->>> m.groups()
-('c',)
->>> m = re.match("(?:[abc])+", "abc")
->>> m.groups()
-()
-\end{verbatim}
-
-Except for the fact that you can't retrieve the contents of what the
-group matched, a non-capturing group behaves exactly the same as a
-capturing group; you can put anything inside it, repeat it with a
-repetition metacharacter such as \samp{*}, and nest it within other
-groups (capturing or non-capturing). \regexp{(?:...)} is particularly
-useful when modifying an existing pattern, since you can add new groups
-without changing how all the other groups are numbered. It should be
-mentioned that there's no performance difference in searching between
-capturing and non-capturing groups; neither form is any faster than
-the other.
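-
-To illustrate the point about numbering, here is a small sketch in
-which an optional prefix is added to an existing pattern. Both
-patterns and the sample strings are hypothetical.
-
```python
import re

# An existing pattern with two capturing groups.
original = re.compile(r'(\w+)@(\w+)')

# The optional prefix is added as a non-capturing group, so the
# user and host parts are still groups 1 and 2.
extended = re.compile(r'(?:mailto:)?(\w+)@(\w+)')

m1 = original.match('amk@example')
m2 = extended.match('mailto:amk@example')

same_numbering = (m1.groups() == m2.groups())
```
-
-Had the prefix been written as \regexp{(mailto:)?}, it would have
-become group 1 and pushed the other two groups up by one.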
-
-A more significant feature is named groups: instead of
-referring to them by numbers, groups can be referenced by a name.
-
-The syntax for a named group is one of the Python-specific extensions:
-\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of
-the group. Named groups also behave exactly like capturing groups,
-and additionally associate a name with a group. The
-\class{MatchObject} methods that deal with capturing groups all accept
-either integers that refer to the group by number or strings that
-contain the desired group's name. Named groups are still given
-numbers, so you can retrieve information about a group in two ways:
-
-\begin{verbatim}
->>> p = re.compile(r'(?P<word>\b\w+\b)')
->>> m = p.search( '(((( Lots of punctuation )))' )
->>> m.group('word')
-'Lots'
->>> m.group(1)
-'Lots'
-\end{verbatim}
-
-Named groups are handy because they let you use easily-remembered
-names, instead of having to remember numbers. Here's an example RE
-from the \module{imaplib} module:
-
-\begin{verbatim}
-InternalDate = re.compile(r'INTERNALDATE "'
- r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
- r'(?P<year>[0-9][0-9][0-9][0-9])'
- r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
- r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
- r'"')
-\end{verbatim}
-
-It's obviously much easier to retrieve \code{m.group('zonem')},
-instead of having to remember to retrieve group 9.
-
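-As a rough sketch of the difference, the pattern above can be applied
-to a sample string. The input below is made up for illustration; it
-is not taken from \module{imaplib} itself.
-
```python
import re

# The pattern is copied from the example above.
InternalDate = re.compile(r'INTERNALDATE "'
    r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-'
    r'(?P<year>[0-9][0-9][0-9][0-9])'
    r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])'
    r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])'
    r'"')

# Hypothetical input string.
m = InternalDate.match('INTERNALDATE "17-Jul-1996 02:44:25 -0700"')

by_name = m.group('zonem')   # the name says what is being retrieved
by_number = m.group(9)       # same group, but you have to count to find it
```
-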
-The syntax for backreferences in an expression such as
-\regexp{(...)\e 1} refers to the number of the group. There's
-naturally a variant that uses the group name instead of the number.
-This is another Python extension: \regexp{(?P=\var{name})} indicates
-that the contents of the group called \var{name} should again be matched
-at the current point. The regular expression for finding doubled
-words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as
-\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}:
-
-\begin{verbatim}
->>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)')
->>> p.search('Paris in the the spring').group()
-'the the'
-\end{verbatim}
-
-\subsection{Lookahead Assertions}
-
-Another zero-width assertion is the lookahead assertion. Lookahead
-assertions are available in both positive and negative form, and
-look like this:
-
-\begin{itemize}
-\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds
-if the contained regular expression, represented here by \code{...},
-successfully matches at the current location, and fails otherwise.
-But, once the contained expression has been tried, the matching engine
-doesn't advance at all; the rest of the pattern is tried right where
-the assertion started.
-
-\item[\regexp{(?!...)}] Negative lookahead assertion. This is the
-opposite of the positive assertion; it succeeds if the contained expression
-\emph{doesn't} match at the current position in the string.
-\end{itemize}
-
-To make this concrete, let's look at a case where a lookahead is
-useful. Consider a simple pattern to match a filename and split it
-apart into a base name and an extension, separated by a \samp{.}. For
-example, in \samp{news.rc}, \samp{news} is the base name, and
-\samp{rc} is the filename's extension.
-
-The pattern to match this is quite simple:
-
-\regexp{.*[.].*\$}
-
-Notice that the \samp{.} needs to be treated specially because it's a
-metacharacter; I've put it inside a character class. Also notice the
-trailing \regexp{\$}; this is added to ensure that all the rest of the
-string must be included in the extension. This regular expression
-matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and
-\samp{printers.conf}.
-
-Now, consider complicating the problem a bit; what if you want to
-match filenames where the extension is not \samp{bat}?
-Some incorrect attempts:
-
-\verb|.*[.][^b].*$|
-% $
-
-The first attempt above tries to exclude \samp{bat} by requiring that
-the first character of the extension is not a \samp{b}. This is
-wrong, because the pattern also doesn't match \samp{foo.bar}.
-
-% Messes up the HTML without the curly braces around \^
-\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$}
-
-The expression gets messier when you try to patch up the first
-solution by requiring one of the following cases to match: the first
-character of the extension isn't \samp{b}; the second character isn't
-\samp{a}; or the third character isn't \samp{t}. This accepts
-\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a
-three-letter extension and won't accept a filename with a two-letter
-extension such as \samp{sendmail.cf}. We'll complicate the pattern
-again in an effort to fix it.
-
-\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$}
-
-In the third attempt, the second and third letters are both made
-optional in order to allow matching extensions shorter than three
-characters, such as \samp{sendmail.cf}.
-
-The pattern's getting really complicated now, which makes it hard to
-read and understand. Worse, if the problem changes and you want to
-exclude both \samp{bat} and \samp{exe} as extensions, the pattern
-would get even more complicated and confusing.
-
-A negative lookahead cuts through all this confusion:
-
-\regexp{.*[.](?!bat\$).*\$}
-% $
-
-The negative lookahead means: if the expression \regexp{bat\$} doesn't match at
-this point, try the rest of the pattern; if \regexp{bat\$} does match,
-the whole pattern will fail. The trailing \regexp{\$} is required to
-ensure that something like \samp{sample.batch}, where the extension
-only starts with \samp{bat}, will be allowed.
-
-Excluding another filename extension is now easy; simply add it as an
-alternative inside the assertion. The following pattern excludes
-filenames that end in either \samp{bat} or \samp{exe}:
-
-\regexp{.*[.](?!bat\$|exe\$).*\$}
-% $
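-
-The final pattern can be checked against the filenames used in this
-section. This is just a sketch; \samp{setup.exe} and the variable
-names are assumptions added for the example.
-
```python
import re

# Reject names whose extension is exactly 'bat' or 'exe'.
pat = re.compile(r'.*[.](?!bat$|exe$).*$')

names = ['foo.bar', 'autoexec.bat', 'sendmail.cf',
         'sample.batch', 'setup.exe']

# Keep only the names the pattern accepts.
accepted = [n for n in names if pat.match(n)]
```
-
-Only \samp{autoexec.bat} and \samp{setup.exe} are filtered out;
-\samp{sample.batch} survives because of the \regexp{\$} inside the
-assertion.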
-
-
-\section{Modifying Strings}
-
-Up to this point, we've simply performed searches against a static
-string. Regular expressions are also commonly used to modify strings
-in various ways, using the following \class{RegexObject} methods:
-
-\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose}
- \lineii{split()}{Split the string into a list, splitting it wherever the RE matches}
- \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string}
- \lineii{subn()}{Does the same thing as \method{sub()},
- but returns the new string and the number of replacements}
-\end{tableii}
-
-
-\subsection{Splitting Strings}
-
-The \method{split()} method of a \class{RegexObject} splits a string
-apart wherever the RE matches, returning a list of the pieces.
-It's similar to the \method{split()} method of strings but
-provides much more generality in the delimiters that you can split by;
-the string method only supports splitting by whitespace or by
-a fixed string. As you'd expect, there's a module-level
-\function{re.split()} function, too.
-
-\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}}
- Split \var{string} by the matches of the regular expression. If
- capturing parentheses are used in the RE, then their contents will
- also be returned as part of the resulting list. If \var{maxsplit}
- is nonzero, at most \var{maxsplit} splits are performed.
-\end{methoddesc}
-
-You can limit the number of splits made, by passing a value for
-\var{maxsplit}. When \var{maxsplit} is nonzero, at most
-\var{maxsplit} splits will be made, and the remainder of the string is
-returned as the final element of the list. In the following example,
-the delimiter is any sequence of non-alphanumeric characters.
-
-\begin{verbatim}
->>> p = re.compile(r'\W+')
->>> p.split('This is a test, short and sweet, of split().')
-['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', '']
->>> p.split('This is a test, short and sweet, of split().', 3)
-['This', 'is', 'a', 'test, short and sweet, of split().']
-\end{verbatim}
-
-Sometimes you're not only interested in what the text between
-delimiters is, but also need to know what the delimiter was. If
-capturing parentheses are used in the RE, then their values are also
-returned as part of the list. Compare the following calls:
-
-\begin{verbatim}
->>> p = re.compile(r'\W+')
->>> p2 = re.compile(r'(\W+)')
->>> p.split('This... is a test.')
-['This', 'is', 'a', 'test', '']
->>> p2.split('This... is a test.')
-['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', '']
-\end{verbatim}
-
-The module-level function \function{re.split()} adds the RE to be
-used as the first argument, but is otherwise the same.
-
-\begin{verbatim}
->>> re.split(r'[\W]+', 'Words, words, words.')
-['Words', 'words', 'words', '']
->>> re.split(r'([\W]+)', 'Words, words, words.')
-['Words', ', ', 'words', ', ', 'words', '.', '']
->>> re.split(r'[\W]+', 'Words, words, words.', 1)
-['Words', 'words, words.']
-\end{verbatim}
-
-\subsection{Search and Replace}
-
-Another common task is to find all the matches for a pattern, and
-replace them with a different string. The \method{sub()} method takes
-a replacement value, which can be either a string or a function, and
-the string to be processed.
-
-\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}}
-Returns the string obtained by replacing the leftmost non-overlapping
-occurrences of the RE in \var{string} by the replacement
-\var{replacement}. If the pattern isn't found, \var{string} is returned
-unchanged.
-
-The optional argument \var{count} is the maximum number of pattern
-occurrences to be replaced; \var{count} must be a non-negative
-integer. The default value of 0 means to replace all occurrences.
-\end{methoddesc}
-
-Here's a simple example of using the \method{sub()} method. It
-replaces colour names with the word \samp{colour}:
-
-\begin{verbatim}
->>> p = re.compile( '(blue|white|red)')
->>> p.sub( 'colour', 'blue socks and red shoes')
-'colour socks and colour shoes'
->>> p.sub( 'colour', 'blue socks and red shoes', count=1)
-'colour socks and red shoes'
-\end{verbatim}
-
-The \method{subn()} method does the same work, but returns a 2-tuple
-containing the new string value and the number of replacements
-that were performed:
-
-\begin{verbatim}
->>> p = re.compile( '(blue|white|red)')
->>> p.subn( 'colour', 'blue socks and red shoes')
-('colour socks and colour shoes', 2)
->>> p.subn( 'colour', 'no colours at all')
-('no colours at all', 0)
-\end{verbatim}
-
-Empty matches are replaced only when they're not
-adjacent to a previous match.
-
-\begin{verbatim}
->>> p = re.compile('x*')
->>> p.sub('-', 'abxd')
-'-a-b-d-'
-\end{verbatim}
-
-If \var{replacement} is a string, any backslash escapes in it are
-processed. That is, \samp{\e n} is converted to a single newline
-character, \samp{\e r} is converted to a carriage return, and so forth.
-Unknown escapes such as \samp{\e j} are left alone. Backreferences,
-such as \samp{\e 6}, are replaced with the substring matched by the
-corresponding group in the RE. This lets you incorporate
-portions of the original text in the resulting
-replacement string.
-
-This example matches the word \samp{section} followed by a string
-enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to
-\samp{subsection}:
-
-\begin{verbatim}
->>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE)
->>> p.sub(r'subsection{\1}','section{First} section{second}')
-'subsection{First} subsection{second}'
-\end{verbatim}
-
-There's also a syntax for referring to named groups as defined by the
-\regexp{(?P<name>...)} syntax. \samp{\e g<name>} will use the
-substring matched by the group named \samp{name}, and
-\samp{\e g<\var{number}>}
-uses the corresponding group number.
-\samp{\e g<2>} is therefore equivalent to \samp{\e 2},
-but isn't ambiguous in a
-replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be
-interpreted as a reference to group 20, not a reference to group 2
-followed by the literal character \character{0}.) The following
-substitutions are all equivalent, but use all three variations of the
-replacement string.
-
-\begin{verbatim}
->>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE)
->>> p.sub(r'subsection{\1}','section{First}')
-'subsection{First}'
->>> p.sub(r'subsection{\g<1>}','section{First}')
-'subsection{First}'
->>> p.sub(r'subsection{\g<name>}','section{First}')
-'subsection{First}'
-\end{verbatim}
-
-\var{replacement} can also be a function, which gives you even more
-control. If \var{replacement} is a function, the function is
-called for every non-overlapping occurrence of \var{pattern}. On each
-call, the function is passed a \class{MatchObject} argument for the
-match and can use this information to compute the desired replacement
-string and return it.
-
-In the following example, the replacement function translates
-decimals into hexadecimal:
-
-\begin{verbatim}
->>> def hexrepl( match ):
-... "Return the hex string for a decimal number"
-... value = int( match.group() )
-... return hex(value)
-...
->>> p = re.compile(r'\d+')
->>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.')
-'Call 0xffd2 for printing, 0xc000 for user code.'
-\end{verbatim}
-
-When using the module-level \function{re.sub()} function, the pattern
-is passed as the first argument. The pattern may be a string or a
-\class{RegexObject}; if you need to specify regular expression flags,
-you must either use a \class{RegexObject} as the first parameter, or use
-embedded modifiers in the pattern, e.g. \code{sub("(?i)b+", "x", "bbbb
-BBBB")} returns \code{'x x'}.
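-
-Both options can be sketched as follows, reusing the pattern and
-string from the sentence above. The variable names are arbitrary.
-
```python
import re

# Option 1: an embedded modifier inside the pattern string.
embedded = re.sub("(?i)b+", "x", "bbbb BBBB")

# Option 2: compile the pattern with the flag and pass the
# compiled object as the first argument.
compiled = re.compile("b+", re.IGNORECASE)
via_object = re.sub(compiled, "x", "bbbb BBBB")
```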
-
-\section{Common Problems}
-
-Regular expressions are a powerful tool for some applications, but in
-some ways their behaviour isn't intuitive and at times they don't
-behave the way you may expect them to. This section will point out
-some of the most common pitfalls.
-
-\subsection{Use String Methods}
-
-Sometimes using the \module{re} module is a mistake. If you're
-matching a fixed string, or a single character class, and you're not
-using any \module{re} features such as the \constant{IGNORECASE} flag,
-then the full power of regular expressions may not be required.
-Strings have several methods for performing operations with fixed
-strings and they're usually much faster, because the implementation is
-a single small C loop that's been optimized for the purpose, instead
-of the large, more generalized regular expression engine.
-
-One example might be replacing a single fixed string with another
-one; for example, you might replace \samp{word}
-with \samp{deed}. \code{re.sub()} seems like the function to use for
-this, but consider the \method{replace()} method. Note that
-\method{replace()} will also replace \samp{word} inside
-words, turning \samp{swordfish} into \samp{sdeedfish}, but the
-na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing
-the substitution on parts of words, the pattern would have to be
-\regexp{\e bword\e b}, in order to require that \samp{word} have a
-word boundary on either side. This takes the job beyond
-\method{replace}'s abilities.)
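-
-The three variants discussed above can be compared directly. This is
-a small sketch; the sample sentence is made up.
-
```python
import re

text = 'a word about swordfish'

# str.replace() substitutes every occurrence, even inside other words.
by_replace = text.replace('word', 'deed')

# The naive RE behaves the same way.
by_naive_re = re.sub('word', 'deed', text)

# \b restricts the substitution to whole words.
by_boundary = re.sub(r'\bword\b', 'deed', text)
```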
-
-Another common task is deleting every occurrence of a single character
-from a string or replacing it with another single character. You
-might do this with something like \code{re.sub('\e n', ' ', S)}, but
-\method{translate()} is capable of doing both tasks
-and will be faster than any regular expression operation can be.
-
-In short, before turning to the \module{re} module, consider whether
-your problem can be solved with a faster and simpler string method.
-
-\subsection{match() versus search()}
-
-The \function{match()} function only checks if the RE matches at
-the beginning of the string while \function{search()} will scan
-forward through the string for a match.
-It's important to keep this distinction in mind. Remember,
-\function{match()} only reports a successful match that
-starts at position 0; if the match wouldn't start at position 0,
-\function{match()} will \emph{not} report it.
-
-\begin{verbatim}
->>> print re.match('super', 'superstition').span()
-(0, 5)
->>> print re.match('super', 'insuperable')
-None
-\end{verbatim}
-
-On the other hand, \function{search()} will scan forward through the
-string, reporting the first match it finds.
-
-\begin{verbatim}
->>> print re.search('super', 'superstition').span()
-(0, 5)
->>> print re.search('super', 'insuperable').span()
-(2, 7)
-\end{verbatim}
-
-Sometimes you'll be tempted to keep using \function{re.match()}, and
-just add \regexp{.*} to the front of your RE. Resist this temptation
-and use \function{re.search()} instead. The regular expression
-compiler does some analysis of REs in order to speed up the process of
-looking for a match. One such analysis figures out what the first
-character of a match must be; for example, a pattern starting with
-\regexp{Crow} must match starting with a \character{C}. The analysis
-lets the engine quickly scan through the string looking for the
-starting character, only trying the full match if a \character{C} is found.
-
-Adding \regexp{.*} defeats this optimization, requiring scanning to
-the end of the string and then backtracking to find a match for the
-rest of the RE. Use \function{re.search()} instead.
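-
-Apart from speed, the two approaches also report different spans.
-The following sketch reuses the strings from the examples above.
-
```python
import re

# The leading .* becomes part of the match, so the span starts at 0.
via_match = re.match('.*super', 'insuperable')

# search() reports where the text of interest actually begins.
via_search = re.search('super', 'insuperable')

match_span = via_match.span()
search_span = via_search.span()
```
-
-With \regexp{.*} in front, you would need an extra group around
-\regexp{super} to recover its position.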
-
-\subsection{Greedy versus Non-Greedy}
-
-When repeating a regular expression, as in \regexp{a*}, the resulting
-action is to consume as much of the string as possible. This
-behaviour often bites you when you're trying to match a pair of
-balanced delimiters, such as the angle brackets surrounding an HTML
-tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't
-work because of the greedy nature of \regexp{.*}.
-
-\begin{verbatim}
->>> s = '<html><head><title>Title</title>'
->>> len(s)
-32
->>> print re.match('<.*>', s).span()
-(0, 32)
->>> print re.match('<.*>', s).group()
-<html><head><title>Title</title>
-\end{verbatim}
-
-The RE matches the \character{<} in \samp{<html>}, and the
-\regexp{.*} consumes the rest of the string. There's still more left
-in the RE, though, and the \regexp{>} can't match at the end of
-the string, so the regular expression engine has to backtrack
-character by character until it finds a match for the \regexp{>}.
-The final match extends from the \character{<} in \samp{<html>}
-to the \character{>} in \samp{</title>}, which isn't what you want.
-
-In this case, the solution is to use the non-greedy qualifiers
-\regexp{*?}, \regexp{+?}, \regexp{??}, or
-\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as
-possible. In the above example, the \character{>} is tried
-immediately after the first \character{<} matches, and when it fails,
-the engine advances a character at a time, retrying the \character{>}
-at every step. This produces just the right result:
-
-\begin{verbatim}
->>> print re.match('<.*?>', s).group()
-<html>
-\end{verbatim}
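-
-The contrast is even clearer with \function{findall()}, as in this
-short sketch using the same string \code{s}:
-
```python
import re

s = '<html><head><title>Title</title>'

# Greedy: one match that runs from the first '<' to the last '>'.
greedy = re.findall('<.*>', s)

# Non-greedy: each tag is matched separately.
lazy = re.findall('<.*?>', s)
```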
-
-(Note that parsing HTML or XML with regular expressions is painful.
-Quick-and-dirty patterns will handle common cases, but HTML and XML
-have special cases that will break the obvious regular expression; by
-the time you've written a regular expression that handles all of the
-possible cases, the patterns will be \emph{very} complicated. Use an
-HTML or XML parser module for such tasks.)
-
-\subsection{Not Using re.VERBOSE}
-
-By now you've probably noticed that regular expressions are a very
-compact notation, but they're not terribly readable. REs of
-moderate complexity can become lengthy collections of backslashes,
-parentheses, and metacharacters, making them difficult to read and
-understand.
-
-For such REs, specifying the \code{re.VERBOSE} flag when
-compiling the regular expression can be helpful, because it allows
-you to format the regular expression more clearly.
-
-The \code{re.VERBOSE} flag has several effects. Whitespace in the
-regular expression that \emph{isn't} inside a character class is
-ignored. This means that an expression such as \regexp{dog | cat} is
-equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]}
-will still match the characters \character{a}, \character{b}, or a
-space. In addition, you can also put comments inside a RE; comments
-extend from a \samp{\#} character to the next newline. When used with
-triple-quoted strings, this enables REs to be formatted more neatly:
-
-\begin{verbatim}
-pat = re.compile(r"""
- \s* # Skip leading whitespace
- (?P<header>[^:]+) # Header name
- \s* : # Whitespace, and a colon
- (?P<value>.*?) # The header's value -- *? used to
- # lose the following trailing whitespace
- \s*$ # Trailing whitespace to end-of-line
-""", re.VERBOSE)
-\end{verbatim}
-% $
-
-This is far more readable than:
-
-\begin{verbatim}
-pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")
-\end{verbatim}
-% $
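-
-The two forms compile to equivalent patterns, as the following sketch
-suggests. The sample header line, with its trailing spaces, is an
-assumption made for the example.
-
```python
import re

verbose = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

compact = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$")

# Hypothetical input with trailing whitespace.
line = 'From: author@example.com   '

m1 = verbose.match(line)
m2 = compact.match(line)
same = (m1.groups() == m2.groups())
```
-
-Note that the value keeps the space that follows the colon; only the
-trailing whitespace is dropped.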
-
-\section{Feedback}
-
-Regular expressions are a complicated topic. Did this document help
-you understand them? Were there parts that were unclear, or problems
-you encountered that weren't covered here? If so, please send
-suggestions for improvements to the author.
-
-The most complete book on regular expressions is almost certainly
-Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published
-by O'Reilly. Unfortunately, it exclusively concentrates on Perl and
-Java's flavours of regular expressions, and doesn't contain any Python
-material at all, so it won't be useful as a reference for programming
-in Python. (The first edition covered Python's now-removed
-\module{regex} module, which won't help you much.) Consider checking
-it out from your library.
-
-\end{document}
-