author | Andrew M. Kuchling <amk@amk.ca> | 2005-08-30 01:25:05 (GMT) |
---|---|---|
committer | Andrew M. Kuchling <amk@amk.ca> | 2005-08-30 01:25:05 (GMT) |
commit | e8f44d683e79c7a9659a4480736d55193da4a7b1 (patch) | |
tree | 37e8b05066aa1caf85f6b25d52f1576366e45e8e /Doc/howto/regex.tex | |
parent | f1b2ba6aa1751c5325e8fb87a28e54a857796bfa (diff) | |
Commit the howto source to the main Python repository, with Fred's approval
Diffstat (limited to 'Doc/howto/regex.tex')
-rw-r--r-- | Doc/howto/regex.tex | 1466 |
1 files changed, 1466 insertions, 0 deletions
diff --git a/Doc/howto/regex.tex b/Doc/howto/regex.tex new file mode 100644 index 0000000..5a65064 --- /dev/null +++ b/Doc/howto/regex.tex @@ -0,0 +1,1466 @@ +\documentclass{howto} + +% TODO: +% Document lookbehind assertions +% Better way of displaying a RE, a string, and what it matches +% Mention optional argument to match.groups() +% Unicode (at least a reference) + +\title{Regular Expression HOWTO} + +\release{0.05} + +\author{A.M. Kuchling} +\authoraddress{\email{amk@amk.ca}} + +\begin{document} +\maketitle + +\begin{abstract} +\noindent +This document is an introductory tutorial to using regular expressions +in Python with the \module{re} module. It provides a gentler +introduction than the corresponding section in the Library Reference. + +This document is available from +\url{http://www.amk.ca/python/howto}. + +\end{abstract} + +\tableofcontents + +\section{Introduction} + +The \module{re} module was added in Python 1.5, and provides +Perl-style regular expression patterns. Earlier versions of Python +came with the \module{regex} module, which provides Emacs-style +patterns. Emacs-style patterns are slightly less readable and +don't provide as many features, so there's not much reason to use +the \module{regex} module when writing new code, though you might +encounter old code that uses it. + +Regular expressions (or REs) are essentially a tiny, highly +specialized programming language embedded inside Python and made +available through the \module{re} module. Using this little language, +you specify the rules for the set of possible strings that you want to +match; this set might contain English sentences, or e-mail addresses, +or TeX commands, or anything you like. You can then ask questions +such as ``Does this string match the pattern?'', or ``Is there a match +for the pattern anywhere in this string?''. You can also use REs to +modify a string or to split it apart in various ways. 
+ +Regular expression patterns are compiled into a series of bytecodes +which are then executed by a matching engine written in C. For +advanced use, it may be necessary to pay careful attention to how the +engine will execute a given RE, and write the RE in a certain way in +order to produce bytecode that runs faster. Optimization isn't +covered in this document, because it requires that you have a good +understanding of the matching engine's internals. + +The regular expression language is relatively small and restricted, so +not all possible string processing tasks can be done using regular +expressions. There are also tasks that \emph{can} be done with +regular expressions, but the expressions turn out to be very +complicated. In these cases, you may be better off writing Python +code to do the processing; while Python code will be slower than an +elaborate regular expression, it will also probably be more understandable. + +\section{Simple Patterns} + +We'll start by learning about the simplest possible regular +expressions. Since regular expressions are used to operate on +strings, we'll begin with the most common task: matching characters. + +For a detailed explanation of the computer science underlying regular +expressions (deterministic and non-deterministic finite automata), you +can refer to almost any textbook on writing compilers. + +\subsection{Matching Characters} + +Most letters and characters will simply match themselves. For +example, the regular expression \regexp{test} will match the string +\samp{test} exactly. (You can enable a case-insensitive mode that +would let this RE match \samp{Test} or \samp{TEST} as well; more +about this later.) + +There are exceptions to this rule; some characters are +special, and don't match themselves. Instead, they signal that some +out-of-the-ordinary thing should be matched, or they affect other +portions of the RE by repeating them. 
Much of this document is +devoted to discussing various metacharacters and what they do. + +Here's a complete list of the metacharacters; their meanings will be +discussed in the rest of this HOWTO. + +\begin{verbatim} +. ^ $ * + ? { [ ] \ | ( ) +\end{verbatim} +% $ + +The first metacharacters we'll look at are \samp{[} and \samp{]}. +They're used for specifying a character class, which is a set of +characters that you wish to match. Characters can be listed +individually, or a range of characters can be indicated by giving two +characters and separating them by a \character{-}. For example, +\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or +\samp{c}; this is the same as +\regexp{[a-c]}, which uses a range to express the same set of +characters. If you wanted to match only lowercase letters, your +RE would be \regexp{[a-z]}. + +Metacharacters are not active inside classes. For example, +\regexp{[akm\$]} will match any of the characters \character{a}, +\character{k}, \character{m}, or \character{\$}; \character{\$} is +usually a metacharacter, but inside a character class it's stripped of +its special nature. + +You can match the characters not within a range by \dfn{complementing} +the set. This is indicated by including a \character{\^} as the first +character of the class; \character{\^} elsewhere will simply match the +\character{\^} character. For example, \verb|[^5]| will match any +character except \character{5}. + +Perhaps the most important metacharacter is the backslash, \samp{\e}. +As in Python string literals, the backslash can be followed by various +characters to signal various special sequences. It's also used to escape +all the metacharacters so you can still match them in patterns; for +example, if you need to match a \samp{[} or +\samp{\e}, you can precede them with a backslash to remove their +special meaning: \regexp{\e[} or \regexp{\e\e}. 
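The character-class and backslash rules described above can be checked with a short script. This is only an illustrative sketch: it is written for Python 3 (the HOWTO's own sessions use the Python 2 `print` statement), and the sample strings and variable names are made up for the example.

```python
import re

# [abc] and [a-c] describe the same set of characters.
listed = re.findall(r"[abc]", "aXbYcZ")
ranged = re.findall(r"[a-c]", "aXbYcZ")

# Inside a class, $ loses its special meaning and matches a literal dollar sign.
dollar_class = re.findall(r"[akm$]", "make $5")

# A leading ^ complements the class: everything except '5'.
not_five = re.findall(r"[^5]", "1525")

# Outside a class, a backslash escapes a metacharacter so it matches literally.
# The pattern below matches either a literal '[' or a literal backslash.
literal = re.findall(r"\[|\\", r"a[b\c")

print(listed, ranged, dollar_class, not_five, literal)
```

Raw strings (the `r` prefix) are used here so that the backslashes reach the `re` module unchanged.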
+ +Some of the special sequences beginning with \character{\e} represent +predefined sets of characters that are often useful, such as the set +of digits, the set of letters, or the set of anything that isn't +whitespace. The following predefined special sequences are available: + +\begin{itemize} +\item[\code{\e d}]Matches any decimal digit; this is +equivalent to the class \regexp{[0-9]}. + +\item[\code{\e D}]Matches any non-digit character; this is +equivalent to the class \verb|[^0-9]|. + +\item[\code{\e s}]Matches any whitespace character; this is +equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}. + +\item[\code{\e S}]Matches any non-whitespace character; this is +equivalent to the class \verb|[^ \t\n\r\f\v]|. + +\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class +\regexp{[a-zA-Z0-9_]}. + +\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class +\verb|[^a-zA-Z0-9_]|. +\end{itemize} + +These sequences can be included inside a character class. For +example, \regexp{[\e s,.]} is a character class that will match any +whitespace character, or \character{,} or \character{.}. + +The final metacharacter in this section is \regexp{.}. It matches +anything except a newline character, and there's an alternate mode +(\code{re.DOTALL}) where it will match even a newline. \character{.} +is often used where you want to match ``any character''. + +\subsection{Repeating Things} + +Being able to match varying sets of characters is the first thing +regular expressions can do that isn't already possible with the +methods available on strings. However, if that was the only +additional capability of regexes, they wouldn't be much of an advance. +Another capability is that you can specify that portions of the RE +must be repeated a certain number of times. + +The first metacharacter for repeating things that we'll look at is +\regexp{*}. 
\regexp{*} doesn't match the literal character \samp{*}; +instead, it specifies that the previous character can be matched zero +or more times, instead of exactly once. + +For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a} +characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a} +characters), and so forth. The RE engine has various internal +limitations stemming from the size of C's \code{int} type that will +prevent it from matching over 2 billion \samp{a} characters; you +probably don't have enough memory to construct a string that large, so +you shouldn't run into that limit. + +Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE, +the matching engine will try to repeat it as many times as possible. +If later portions of the pattern don't match, the matching engine will +then back up and try again with fewer repetitions. + +A step-by-step example will make this more obvious. Let's consider +the expression \regexp{a[bcd]*b}. This matches the letter +\character{a}, zero or more letters from the class \code{[bcd]}, and +finally ends with a \character{b}. Now imagine matching this RE +against the string \samp{abcbd}. + +\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation} +\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.} +\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as +it can, which is to the end of the string.} +\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the +current position is at the end of the string, so it fails.} +\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches +one fewer character.} +\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the +current position is at the last character, which is a \character{d}.} +\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is +only matching \samp{bc}.} +\lineiii{7}{\code{abcb}}{Try \regexp{b} again. 
This time +the character at the current position is \character{b}, so it succeeds.} +\end{tableiii} + +The end of the RE has now been reached, and it has matched +\samp{abcb}. This demonstrates how the matching engine goes as far as +it can at first, and if no match is found it will then progressively +back up and retry the rest of the RE again and again. It will back up +until it has tried zero matches for \regexp{[bcd]*}, and if that +subsequently fails, the engine will conclude that the string doesn't +match the RE at all. + +Another repeating metacharacter is \regexp{+}, which matches one or +more times. Pay careful attention to the difference between +\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more +times, so whatever's being repeated may not be present at all, while +\regexp{+} requires at least \emph{one} occurrence. To use a similar +example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}), +\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}. + +There are two more repeating qualifiers. The question mark character, +\regexp{?}, matches either once or zero times; you can think of it as +marking something as being optional. For example, \regexp{home-?brew} +matches either \samp{homebrew} or \samp{home-brew}. + +The most complicated repeating qualifier is +\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal +integers. This qualifier means there must be at least \var{m} +repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b} +will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match +\samp{ab}, which has no slashes, or \samp{a////b}, which has four. + +You can omit either \var{m} or \var{n}; in that case, a reasonable +value is assumed for the missing value. Omitting \var{m} is +interpreted as a lower limit of 0, while omitting \var{n} results in an +upper bound of infinity --- actually, the 2 billion limit mentioned +earlier, but that might as well be infinity. 
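The repeating qualifiers discussed above can be compared side by side. The sketch below is an illustration only: it targets Python 3, the helper `matching()` is a hypothetical name introduced for this example, and it relies on `re.fullmatch()`, which the HOWTO itself does not cover (it requires the whole string to match).

```python
import re

def matching(pattern, candidates):
    """Return the candidates that the pattern matches in their entirety."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

words = ["ct", "cat", "caaat"]
star = matching(r"ca*t", words)      # zero or more 'a' characters
plus = matching(r"ca+t", words)      # at least one 'a'

optional = matching(r"home-?brew", ["homebrew", "home-brew", "home--brew"])

slashes = ["ab", "a/b", "a//b", "a///b", "a////b"]
bounded = matching(r"a/{1,3}b", slashes)     # between one and three slashes
open_lower = matching(r"a/{,3}b", slashes)   # omitted m is treated as 0

# Greedy matching with backtracking: [bcd]* first grabs 'bcbd',
# then backs up until the final 'b' of the pattern can match.
greedy = re.match(r"a[bcd]*b", "abcbd").group()

print(star, plus, optional, bounded, open_lower, greedy)
```

The last line reproduces the step-by-step table: the engine ends up matching `abcb`, not the whole string.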
+ +Readers of a reductionist bent may notice that the three other qualifiers +can all be expressed using this notation. \regexp{\{0,\}} is the same +as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and +\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use +\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because +they're shorter and easier to read. + +\section{Using Regular Expressions} + +Now that we've looked at some simple regular expressions, how do we +actually use them in Python? The \module{re} module provides an +interface to the regular expression engine, allowing you to compile +REs into objects and then perform matches with them. + +\subsection{Compiling Regular Expressions} + +Regular expressions are compiled into \class{RegexObject} instances, +which have methods for various operations such as searching for +pattern matches or performing string substitutions. + +\begin{verbatim} +>>> import re +>>> p = re.compile('ab*') +>>> print p +<re.RegexObject instance at 80b4150> +\end{verbatim} + +\function{re.compile()} also accepts an optional \var{flags} +argument, used to enable various special features and syntax +variations. We'll go over the available settings later, but for now a +single example will do: + +\begin{verbatim} +>>> p = re.compile('ab*', re.IGNORECASE) +\end{verbatim} + +The RE is passed to \function{re.compile()} as a string. REs are +handled as strings because regular expressions aren't part of the core +Python language, and no special syntax was created for expressing +them. (There are applications that don't need REs at all, so there's +no need to bloat the language specification by including them.) +Instead, the \module{re} module is simply a C extension module +included with Python, just like the \module{socket} or \module{zlib} +module. + +Putting REs in strings keeps the Python language simpler, but has one +disadvantage which is the topic of the next section. 
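As a small illustration of the `flags` argument to `re.compile()`, the following sketch compiles the same pattern with and without `re.IGNORECASE`. It assumes Python 3, and the sample strings and variable names are invented for the example.

```python
import re

p = re.compile("ab*")
p_ignore_case = re.compile("ab*", re.IGNORECASE)

# Without the flag, an uppercase 'A' does not match the pattern's 'a'.
case_sensitive = p.match("ABBB")

# With the flag, letters match regardless of case.
case_insensitive = p_ignore_case.match("ABBB").group()

# The unflagged pattern still matches lowercase text as usual.
lowercase = p.match("abbbc").group()

print(case_sensitive, case_insensitive, lowercase)
```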
+ +\subsection{The Backslash Plague} + +As stated earlier, regular expressions use the backslash +character (\character{\e}) to indicate special forms or to allow +special characters to be used without invoking their special meaning. +This conflicts with Python's usage of the same character for the same +purpose in string literals. + +Let's say you want to write a RE that matches the string +\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure +out what to write in the program code, start with the desired string +to be matched. Next, you must escape any backslashes and other +metacharacters by preceding them with a backslash, resulting in the +string \samp{\e\e section}. The resulting string that must be passed +to \function{re.compile()} must be \verb|\\section|. However, to +express this as a Python string literal, both backslashes must be +escaped \emph{again}. + +\begin{tableii}{c|l}{code}{Characters}{Stage} + \lineii{\e section}{Text string to be matched} + \lineii{\e\e section}{Escaped backslash for \function{re.compile}} + \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal} +\end{tableii} + +In short, to match a literal backslash, one has to write +\code{'\e\e\e\e'} as the RE string, because the regular expression +must be \samp{\e\e}, and each backslash must be expressed as +\samp{\e\e} inside a regular Python string literal. In REs that +feature backslashes repeatedly, this leads to lots of repeated +backslashes and makes the resulting strings difficult to understand. + +The solution is to use Python's raw string notation for regular +expressions; backslashes are not handled in any special way in +a string literal prefixed with \character{r}, so \code{r"\e n"} is a +two-character string containing \character{\e} and \character{n}, +while \code{"\e n"} is a one-character string containing a newline. +Frequently regular expressions will be expressed in Python +code using this raw string notation. 
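The difference between ordinary and raw string literals can be verified directly. This is a sketch for Python 3; the variable names and the sample text are assumptions made for the example.

```python
import re

# "\n" is a single newline character; r"\n" is a backslash followed by 'n'.
newline_length = len("\n")
raw_length = len(r"\n")

# Both literals below spell the same two-backslash RE string.
same_pattern = ("\\\\section" == r"\\section")

# The text to search contains one backslash; the RE needs two to match it.
text = r"\section{Introduction}"
matched = re.search(r"\\section", text).group()

print(newline_length, raw_length, same_pattern, matched)
```

The raw form keeps the pattern close to what the regular expression engine actually sees, which is why it is the usual choice for REs.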
+ +\begin{tableii}{c|c}{code}{Regular string}{Raw string} + \lineii{"ab*"}{\code{r"ab*"}} + \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}} + \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}} +\end{tableii} + +\subsection{Performing Matches} + +Once you have an object representing a compiled regular expression, +what do you do with it? \class{RegexObject} instances have several +methods and attributes. Only the most significant ones will be +covered here; consult \ulink{the Library +Reference}{http://www.python.org/doc/lib/module-re.html} for a +complete listing. + +\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} + \lineii{match()}{Determine if the RE matches at the beginning of + the string.} + \lineii{search()}{Scan through a string, looking for any location + where this RE matches.} + \lineii{findall()}{Find all substrings where the RE matches, +and return them as a list.} + \lineii{finditer()}{Find all substrings where the RE matches, +and return them as an iterator.} +\end{tableii} + +\method{match()} and \method{search()} return \code{None} if no match +can be found. If they're successful, a \code{MatchObject} instance is +returned, containing information about the match: where it starts and +ends, the substring it matched, and more. + +You can learn about this by interactively experimenting with the +\module{re} module. If you have Tkinter available, you may also want +to look at \file{Tools/scripts/redemo.py}, a demonstration program +included with the Python distribution. It allows you to enter REs and +strings, and displays whether the RE matches or fails. +\file{redemo.py} can be quite useful when trying to debug a +complicated RE. Phil Schwartz's +\ulink{Kodos}{http://kodos.sourceforge.net} is also an interactive +tool for developing and testing RE patterns. This HOWTO will use the +standard Python interpreter for its examples. 
+ +First, run the Python interpreter, import the \module{re} module, and +compile a RE: + +\begin{verbatim} +Python 2.2.2 (#1, Feb 10 2003, 12:57:01) +>>> import re +>>> p = re.compile('[a-z]+') +>>> p +<_sre.SRE_Pattern object at 80c3c28> +\end{verbatim} + +Now, you can try matching various strings against the RE +\regexp{[a-z]+}. An empty string shouldn't match at all, since +\regexp{+} means 'one or more repetitions'. \method{match()} should +return \code{None} in this case, which will cause the interpreter to +print no output. You can explicitly print the result of +\method{match()} to make this clear. + +\begin{verbatim} +>>> p.match("") +>>> print p.match("") +None +\end{verbatim} + +Now, let's try it on a string that it should match, such as +\samp{tempo}. In this case, \method{match()} will return a +\class{MatchObject}, so you should store the result in a variable for +later use. + +\begin{verbatim} +>>> m = p.match( 'tempo') +>>> print m +<_sre.SRE_Match object at 80c4f68> +\end{verbatim} + +Now you can query the \class{MatchObject} for information about the +matching string. \class{MatchObject} instances also have several +methods and attributes; the most important ones are: + +\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} + \lineii{group()}{Return the string matched by the RE} + \lineii{start()}{Return the starting position of the match} + \lineii{end()}{Return the ending position of the match} + \lineii{span()}{Return a tuple containing the (start, end) positions + of the match} +\end{tableii} + +Trying these methods will soon clarify their meaning: + +\begin{verbatim} +>>> m.group() +'tempo' +>>> m.start(), m.end() +(0, 5) +>>> m.span() +(0, 5) +\end{verbatim} + +\method{group()} returns the substring that was matched by the +RE. \method{start()} and \method{end()} return the starting and +ending index of the match. \method{span()} returns both start and end +indexes in a single tuple. 
Since the \method{match} method only +checks if the RE matches at the start of a string, +\method{start()} will always be zero. However, the \method{search} +method of \class{RegexObject} instances scans through the string, so +the match may not start at zero in that case. + +\begin{verbatim} +>>> print p.match('::: message') +None +>>> m = p.search('::: message') ; print m +<re.MatchObject instance at 80c9650> +>>> m.group() +'message' +>>> m.span() +(4, 11) +\end{verbatim} + +In actual programs, the most common style is to store the +\class{MatchObject} in a variable, and then check if it was +\code{None}. This usually looks like: + +\begin{verbatim} +p = re.compile( ... ) +m = p.match( 'string goes here' ) +if m: + print 'Match found: ', m.group() +else: + print 'No match' +\end{verbatim} + +Two \class{RegexObject} methods return all of the matches for a pattern. +\method{findall()} returns a list of matching strings: + +\begin{verbatim} +>>> p = re.compile('\d+') +>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') +['12', '11', '10'] +\end{verbatim} + +\method{findall()} has to create the entire list before it can be +returned as the result. In Python 2.2, the \method{finditer()} method +is also available, returning a sequence of \class{MatchObject} instances +as an iterator. + +\begin{verbatim} +>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') +>>> iterator +<callable-iterator object at 0x401833ac> +>>> for match in iterator: +... print match.span() +... +(0, 2) +(22, 24) +(29, 31) +\end{verbatim} + + +\subsection{Module-Level Functions} + +You don't have to produce a \class{RegexObject} and call its methods; +the \module{re} module also provides top-level functions called +\function{match()}, \function{search()}, \function{sub()}, and so +forth. 
These functions take the same arguments as the corresponding +\class{RegexObject} method, with the RE string added as the first +argument, and still return either \code{None} or a \class{MatchObject} +instance. + +\begin{verbatim} +>>> print re.match(r'From\s+', 'Fromage amk') +None +>>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') +<re.MatchObject instance at 80c5978> +\end{verbatim} + +Under the hood, these functions simply produce a \class{RegexObject} +for you and call the appropriate method on it. They also store the +compiled object in a cache, so future calls using the same +RE are faster. + +Should you use these module-level functions, or should you get the +\class{RegexObject} and call its methods yourself? That choice +depends on how frequently the RE will be used, and on your personal +coding style. If a RE is being used at only one point in the code, +then the module functions are probably more convenient. If a program +contains a lot of regular expressions, or re-uses the same ones in +several locations, then it might be worthwhile to collect all the +definitions in one place, in a section of code that compiles all the +REs ahead of time. To take an example from the standard library, +here's an extract from \file{xmllib.py}: + +\begin{verbatim} +ref = re.compile( ... ) +entityref = re.compile( ... ) +charref = re.compile( ... ) +starttagopen = re.compile( ... ) +\end{verbatim} + +I generally prefer to work with the compiled object, even for +one-time uses, but few people will be as much of a purist about this +as I am. + +\subsection{Compilation Flags} + +Compilation flags let you modify some aspects of how regular +expressions work. Flags are available in the \module{re} module under +two names, a long name such as \constant{IGNORECASE}, and a short, +one-letter form such as \constant{I}. 
(If you're familiar with Perl's +pattern modifiers, the one-letter forms use the same letters; the +short form of \constant{re.VERBOSE} is \constant{re.X}, for example.) +Multiple flags can be specified by bitwise OR-ing them; \code{re.I | +re.M} sets both the \constant{I} and \constant{M} flags, for example. + +Here's a table of the available flags, followed by +a more detailed explanation of each one. + +\begin{tableii}{c|l}{}{Flag}{Meaning} + \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any + character, including newlines} + \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches} + \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match} + \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching, + affecting \regexp{\^} and \regexp{\$}} + \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs, + which can be organized more cleanly and understandably.} +\end{tableii} + +\begin{datadesc}{I} +\dataline{IGNORECASE} +Perform case-insensitive matching; character class and literal strings +will match +letters by ignoring case. For example, \regexp{[A-Z]} will match +lowercase letters, too, and \regexp{Spam} will match \samp{Spam}, +\samp{spam}, or \samp{spAM}. +This lowercasing doesn't take the current locale into account; it will +if you also set the \constant{LOCALE} flag. +\end{datadesc} + +\begin{datadesc}{L} +\dataline{LOCALE} +Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, +and \regexp{\e B}, dependent on the current locale. + +Locales are a feature of the C library intended to help in writing +programs that take account of language differences. For example, if +you're processing French text, you'd want to be able to write +\regexp{\e w+} to match words, but \regexp{\e w} only matches the +character class \regexp{[A-Za-z]}; it won't match \character{\'e} or +\character{\c c}. 
If your system is configured properly and a French +locale is selected, certain C functions will tell the program that +\character{\'e} should also be considered a letter. Setting the +\constant{LOCALE} flag when compiling a regular expression will cause the +resulting compiled object to use these C functions for \regexp{\e w}; +this is slower, but also enables \regexp{\e w+} to match French words as +you'd expect. +\end{datadesc} + +\begin{datadesc}{M} +\dataline{MULTILINE} +(\regexp{\^} and \regexp{\$} haven't been explained yet; +they'll be introduced in section~\ref{more-metacharacters}.) + +Usually \regexp{\^} matches only at the beginning of the string, and +\regexp{\$} matches only at the end of the string and immediately before the +newline (if any) at the end of the string. When this flag is +specified, \regexp{\^} matches at the beginning of the string and at +the beginning of each line within the string, immediately following +each newline. Similarly, the \regexp{\$} metacharacter matches both at +the end of the string and at the end of each line (immediately +preceding each newline). + +\end{datadesc} + +\begin{datadesc}{S} +\dataline{DOTALL} +Makes the \character{.} special character match any character at all, +including a newline; without this flag, \character{.} will match +anything \emph{except} a newline. +\end{datadesc} + +\begin{datadesc}{X} +\dataline{VERBOSE} This flag allows you to write regular expressions +that are more readable by granting you more flexibility in how you can +format them. When this flag has been specified, whitespace within the +RE string is ignored, except when the whitespace is in a character +class or preceded by an unescaped backslash; this lets you organize +and indent the RE more clearly. It also enables you to put comments +within a RE that will be ignored by the engine; comments are marked by +a \character{\#} that's neither in a character class nor preceded by an +unescaped backslash. 
+ +For example, here's a RE that uses \constant{re.VERBOSE}; see how +much easier it is to read? + +\begin{verbatim} +charref = re.compile(r""" + &[#] # Start of a numeric entity reference + ( + [0-9]+[^0-9] # Decimal form + | 0[0-7]+[^0-7] # Octal form + | x[0-9a-fA-F]+[^0-9a-fA-F] # Hexadecimal form + ) +""", re.VERBOSE) +\end{verbatim} + +Without the verbose setting, the RE would look like this: +\begin{verbatim} +charref = re.compile("&#([0-9]+[^0-9]" + "|0[0-7]+[^0-7]" + "|x[0-9a-fA-F]+[^0-9a-fA-F])") +\end{verbatim} + +In the above example, Python's automatic concatenation of string +literals has been used to break up the RE into smaller pieces, but +it's still more difficult to understand than the version using +\constant{re.VERBOSE}. + +\end{datadesc} + +\section{More Pattern Power} + +So far we've only covered a part of the features of regular +expressions. In this section, we'll cover some new metacharacters, +and how to use groups to retrieve portions of the text that was matched. + +\subsection{More Metacharacters\label{more-metacharacters}} + +There are some metacharacters that we haven't covered yet. Most of +them will be covered in this section. + +Some of the remaining metacharacters to be discussed are +\dfn{zero-width assertions}. They don't cause the engine to advance +through the string; instead, they consume no characters at all, +and simply succeed or fail. For example, \regexp{\e b} is an +assertion that the current position is located at a word boundary; the +position isn't changed by the \regexp{\e b} at all. This means that +zero-width assertions should never be repeated, because if they match +once at a given location, they can obviously be matched an infinite +number of times. + +\begin{list}{}{} + +\item[\regexp{|}] +Alternation, or the ``or'' operator. +If A and B are regular expressions, +\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}. 
+\regexp{|} has very low precedence in order to make it work reasonably when +you're alternating multi-character strings. +\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not +\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}. + +To match a literal \character{|}, +use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}. + +\item[\regexp{\^}] Matches at the beginning of lines. Unless the +\constant{MULTILINE} flag has been set, this will only match at the +beginning of the string. In \constant{MULTILINE} mode, this also +matches immediately after each newline within the string. + +For example, if you wish to match the word \samp{From} only at the +beginning of a line, the RE to use is \verb|^From|. + +\begin{verbatim} +>>> print re.search('^From', 'From Here to Eternity') +<re.MatchObject instance at 80c1520> +>>> print re.search('^From', 'Reciting From Memory') +None +\end{verbatim} + +%To match a literal \character{\^}, use \regexp{\e\^} or enclose it +%inside a character class, as in \regexp{[{\e}\^]}. + +\item[\regexp{\$}] Matches at the end of a line, which is defined as +either the end of the string, or any location followed by a newline +character. + +\begin{verbatim} +>>> print re.search('}$', '{block}') +<re.MatchObject instance at 80adfa8> +>>> print re.search('}$', '{block} ') +None +>>> print re.search('}$', '{block}\n') +<re.MatchObject instance at 80adfa8> +\end{verbatim} +% $ + +To match a literal \character{\$}, use \regexp{\e\$} or enclose it +inside a character class, as in \regexp{[\$]}. + +\item[\regexp{\e A}] Matches only at the start of the string. When +not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are +effectively the same. In \constant{MULTILINE} mode, however, they're +different; \regexp{\e A} still matches only at the beginning of the +string, but \regexp{\^} may match at any location inside the string +that follows a newline character. 
+ +\item[\regexp{\e Z}]Matches only at the end of the string. + +\item[\regexp{\e b}] Word boundary. +This is a zero-width assertion that matches only at the +beginning or end of a word. A word is defined as a sequence of +alphanumeric characters, so the end of a word is indicated by +whitespace or a non-alphanumeric character. + +The following example matches \samp{class} only when it's a complete +word; it won't match when it's contained inside another word. + +\begin{verbatim} +>>> p = re.compile(r'\bclass\b') +>>> print p.search('no class at all') +<re.MatchObject instance at 80c8f28> +>>> print p.search('the declassified algorithm') +None +>>> print p.search('one subclass is') +None +\end{verbatim} + +There are two subtleties you should remember when using this special +sequence. First, this is the worst collision between Python's string +literals and regular expression sequences. In Python's string +literals, \samp{\e b} is the backspace character, ASCII value 8. If +you're not using raw strings, then Python will convert the \samp{\e b} to +a backspace, and your RE won't match as you expect it to. The +following example looks the same as our previous RE, but omits +the \character{r} in front of the RE string. + +\begin{verbatim} +>>> p = re.compile('\bclass\b') +>>> print p.search('no class at all') +None +>>> print p.search('\b' + 'class' + '\b') +<re.MatchObject instance at 80c3ee0> +\end{verbatim} + +Second, inside a character class, where there's no use for this +assertion, \regexp{\e b} represents the backspace character, for +compatibility with Python's string literals. + +\item[\regexp{\e B}] Another zero-width assertion, this is the +opposite of \regexp{\e b}, only matching when the current +position is not at a word boundary. + +\end{list} + +\subsection{Grouping} + +Frequently you need to obtain more information than just whether the +RE matched or not. 
Regular expressions are often used to dissect +strings by writing a RE divided into several subgroups which +match different components of interest. For example, an RFC-822 +header line is divided into a header name and a value, separated by a +\character{:}. This can be handled by writing a regular expression +which matches an entire header line, and has one group which matches the +header name, and another group which matches the header's value. + +Groups are marked by the \character{(}, \character{)} metacharacters. +\character{(} and \character{)} have much the same meaning as they do +in mathematical expressions; they group together the expressions +contained inside them. For example, you can repeat the contents of a +group with a repeating qualifier, such as \regexp{*}, \regexp{+}, +\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example, +\regexp{(ab)*} will match zero or more repetitions of \samp{ab}. + +\begin{verbatim} +>>> p = re.compile('(ab)*') +>>> print p.match('ababababab').span() +(0, 10) +\end{verbatim} + +Groups indicated with \character{(}, \character{)} also capture the +starting and ending index of the text that they match; this can be +retrieved by passing an argument to \method{group()}, +\method{start()}, \method{end()}, and \method{span()}. Groups are +numbered starting with 0. Group 0 is always present; it's the whole +RE, so \class{MatchObject} methods all have group 0 as their default +argument. Later we'll see how to express groups that don't capture +the span of text that they match. + +\begin{verbatim} +>>> p = re.compile('(a)b') +>>> m = p.match('ab') +>>> m.group() +'ab' +>>> m.group(0) +'ab' +\end{verbatim} + +Subgroups are numbered from left to right, from 1 upward. Groups can +be nested; to determine the number, just count the opening parenthesis +characters, going from left to right. 
+ +\begin{verbatim} +>>> p = re.compile('(a(b)c)d') +>>> m = p.match('abcd') +>>> m.group(0) +'abcd' +>>> m.group(1) +'abc' +>>> m.group(2) +'b' +\end{verbatim} + +\method{group()} can be passed multiple group numbers at a time, in +which case it will return a tuple containing the corresponding values +for those groups. + +\begin{verbatim} +>>> m.group(2,1,2) +('b', 'abc', 'b') +\end{verbatim} + +The \method{groups()} method returns a tuple containing the strings +for all the subgroups, from 1 up to however many there are. + +\begin{verbatim} +>>> m.groups() +('abc', 'b') +\end{verbatim} + +Backreferences in a pattern allow you to specify that the contents of +an earlier capturing group must also be found at the current location +in the string. For example, \regexp{\e 1} will succeed if the exact +contents of group 1 can be found at the current position, and fails +otherwise. Remember that Python's string literals also use a +backslash followed by numbers to allow including arbitrary characters +in a string, so be sure to use a raw string when incorporating +backreferences in a RE. + +For example, the following RE detects doubled words in a string. + +\begin{verbatim} +>>> p = re.compile(r'(\b\w+)\s+\1') +>>> p.search('Paris in the the spring').group() +'the the' +\end{verbatim} + +Backreferences like this aren't often useful for just searching +through a string --- there are few text formats which repeat data in +this way --- but you'll soon find out that they're \emph{very} useful +when performing string substitutions. + +\subsection{Non-capturing and Named Groups} + +Elaborate REs may use many groups, both to capture substrings of +interest, and to group and structure the RE itself. In complex REs, +it becomes difficult to keep track of the group numbers. There are +two features which help with this problem. Both of them use a common +syntax for regular expression extensions, so we'll look at that first. 
+ +Perl 5 added several additional features to standard regular +expressions, and the Python \module{re} module supports most of them. +It would have been difficult to choose new single-keystroke +metacharacters or new special sequences beginning with \samp{\e} to +represent the new features without making Perl's regular expressions +confusingly different from standard REs. If you chose \samp{\&} as a +new metacharacter, for example, old expressions would be assuming that +\samp{\&} was a regular character and wouldn't have escaped it by +writing \regexp{\e \&} or \regexp{[\&]}. + +The solution chosen by the Perl developers was to use \regexp{(?...)} +as the extension syntax. \samp{?} immediately after a parenthesis was +a syntax error because the \samp{?} would have nothing to repeat, so +this didn't introduce any compatibility problems. The characters +immediately after the \samp{?} indicate what extension is being used, +so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and +\regexp{(?:foo)} is something else (a non-capturing group containing +the subexpression \regexp{foo}). + +Python adds an extension syntax to Perl's extension syntax. If the +first character after the question mark is a \samp{P}, you know that +it's an extension that's specific to Python. Currently there are two +such extensions: \regexp{(?P<\var{name}>...)} defines a named group, +and \regexp{(?P=\var{name})} is a backreference to a named group. If +future versions of Perl 5 add similar features using a different +syntax, the \module{re} module will be changed to support the new +syntax, while preserving the Python-specific syntax for +compatibility's sake. + +Now that we've looked at the general extension syntax, we can return +to the features that simplify working with groups in complex REs. 
+Since groups are numbered from left to right and a complex expression +may use many groups, it can become difficult to keep track of the +correct numbering, and modifying such a complex RE is annoying. +Insert a new group near the beginning, and you change the numbers of +everything that follows it. + +First, sometimes you'll want to use a group to collect a part of a +regular expression, but aren't interested in retrieving the group's +contents. You can make this fact explicit by using a non-capturing +group: \regexp{(?:...)}, where you can put any other regular +expression inside the parentheses. + +\begin{verbatim} +>>> m = re.match("([abc])+", "abc") +>>> m.groups() +('c',) +>>> m = re.match("(?:[abc])+", "abc") +>>> m.groups() +() +\end{verbatim} + +Except for the fact that you can't retrieve the contents of what the +group matched, a non-capturing group behaves exactly the same as a +capturing group; you can put anything inside it, repeat it with a +repetition metacharacter such as \samp{*}, and nest it within other +groups (capturing or non-capturing). \regexp{(?:...)} is particularly +useful when modifying an existing pattern, since you can add new groups +without changing how all the other groups are numbered. It should be +mentioned that there's no performance difference in searching between +capturing and non-capturing groups; neither form is any faster than +the other. + +The second, and more significant, feature is named groups; instead of +referring to them by numbers, groups can be referenced by a name. + +The syntax for a named group is one of the Python-specific extensions: +\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of +the group. Except for associating a name with a group, named groups +also behave identically to capturing groups. The \class{MatchObject} +methods that deal with capturing groups all accept either integers, to +refer to groups by number, or a string containing the group name. 
+Named groups are still given numbers, so you can retrieve information +about a group in two ways: + +\begin{verbatim} +>>> p = re.compile(r'(?P<word>\b\w+\b)') +>>> m = p.search( '(((( Lots of punctuation )))' ) +>>> m.group('word') +'Lots' +>>> m.group(1) +'Lots' +\end{verbatim} + +Named groups are handy because they let you use easily-remembered +names, instead of having to remember numbers. Here's an example RE +from the \module{imaplib} module: + +\begin{verbatim} +InternalDate = re.compile(r'INTERNALDATE "' + r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-' + r'(?P<year>[0-9][0-9][0-9][0-9])' + r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])' + r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])' + r'"') +\end{verbatim} + +It's obviously much easier to retrieve \code{m.group('zonem')}, +instead of having to remember to retrieve group 9. + +Since the syntax for backreferences, in an expression like +\regexp{(...)\e 1}, refers to the number of the group there's +naturally a variant that uses the group name instead of the number. +This is also a Python extension: \regexp{(?P=\var{name})} indicates +that the contents of the group called \var{name} should again be found +at the current point. The regular expression for finding doubled +words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as +\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}: + +\begin{verbatim} +>>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') +>>> p.search('Paris in the the spring').group() +'the the' +\end{verbatim} + +\subsection{Lookahead Assertions} + +Another zero-width assertion is the lookahead assertion. Lookahead +assertions are available in both positive and negative form, and +look like this: + +\begin{itemize} +\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds +if the contained regular expression, represented here by \code{...}, +successfully matches at the current location, and fails otherwise. 
+But, once the contained expression has been tried, the matching engine +doesn't advance at all; the rest of the pattern is tried right where +the assertion started. + +\item[\regexp{(?!...)}] Negative lookahead assertion. This is the +opposite of the positive assertion; it succeeds if the contained expression +\emph{doesn't} match at the current position in the string. +\end{itemize} + +An example will help make this concrete by demonstrating a case +where a lookahead is useful. Consider a simple pattern to match a +filename and split it apart into a base name and an extension, +separated by a \samp{.}. For example, in \samp{news.rc}, \samp{news} +is the base name, and \samp{rc} is the filename's extension. + +The pattern to match this is quite simple: + +\regexp{.*[.].*\$} + +Notice that the \samp{.} needs to be treated specially because it's a +metacharacter; I've put it inside a character class. Also notice the +trailing \regexp{\$}; this is added to ensure that all the rest of the +string must be included in the extension. This regular expression +matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and +\samp{printers.conf}. + +Now, consider complicating the problem a bit; what if you want to +match filenames where the extension is not \samp{bat}? +Some incorrect attempts: + +\verb|.*[.][^b].*$| +% $ + +The first attempt above tries to exclude \samp{bat} by requiring that +the first character of the extension is not a \samp{b}. This is +wrong, because the pattern also doesn't match \samp{foo.bar}. + +% Messes up the HTML without the curly braces around \^ +\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$} + +The expression gets messier when you try to patch up the first +solution by requiring one of the following cases to match: the first +character of the extension isn't \samp{b}; the second character isn't +\samp{a}; or the third character isn't \samp{t}. 
This accepts +\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a +three-letter extension and won't accept a filename with a two-letter +extension such as \samp{sendmail.cf}. We'll complicate the pattern +again in an effort to fix it. + +\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$} + +In the third attempt, the second and third letters are all made +optional in order to allow matching extensions shorter than three +characters, such as \samp{sendmail.cf}. + +The pattern's getting really complicated now, which makes it hard to +read and understand. Worse, if the problem changes and you want to +exclude both \samp{bat} and \samp{exe} as extensions, the pattern +would get even more complicated and confusing. + +A negative lookahead cuts through all this: + +\regexp{.*[.](?!bat\$).*\$} +% $ + +The lookahead means: if the expression \regexp{bat} doesn't match at +this point, try the rest of the pattern; if \regexp{bat\$} does match, +the whole pattern will fail. The trailing \regexp{\$} is required to +ensure that something like \samp{sample.batch}, where the extension +only starts with \samp{bat}, will be allowed. + +Excluding another filename extension is now easy; simply add it as an +alternative inside the assertion. The following pattern excludes +filenames that end in either \samp{bat} or \samp{exe}: + +\regexp{.*[.](?!bat\$|exe\$).*\$} +% $ + + +\section{Modifying Strings} + +Up to this point, we've simply performed searches against a static +string. 
Regular expressions are also commonly used to modify a string +in various ways, using the following \class{RegexObject} methods: + +\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} + \lineii{split()}{Split the string into a list, splitting it wherever the RE matches} + \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string} + \lineii{subn()}{Does the same thing as \method{sub()}, + but returns the new string and the number of replacements} +\end{tableii} + + +\subsection{Splitting Strings} + +The \method{split()} method of a \class{RegexObject} splits a string +apart wherever the RE matches, returning a list of the pieces. +It's similar to the \method{split()} method of strings but +provides much more +generality in the delimiters that you can split by; +\method{split()} only supports splitting by whitespace or by +a fixed string. As you'd expect, there's a module-level +\function{re.split()} function, too. + +\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}} + Split \var{string} by the matches of the regular expression. If + capturing parentheses are used in the RE, then their contents will + also be returned as part of the resulting list. If \var{maxsplit} + is nonzero, at most \var{maxsplit} splits are performed. +\end{methoddesc} + +You can limit the number of splits made, by passing a value for +\var{maxsplit}. When \var{maxsplit} is nonzero, at most +\var{maxsplit} splits will be made, and the remainder of the string is +returned as the final element of the list. In the following example, +the delimiter is any sequence of non-alphanumeric characters. 
+ +\begin{verbatim} +>>> p = re.compile(r'\W+') +>>> p.split('This is a test, short and sweet, of split().') +['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] +>>> p.split('This is a test, short and sweet, of split().', 3) +['This', 'is', 'a', 'test, short and sweet, of split().'] +\end{verbatim} + +Sometimes you're not only interested in what the text between +delimiters is, but also need to know what the delimiter was. If +capturing parentheses are used in the RE, then their values are also +returned as part of the list. Compare the following calls: + +\begin{verbatim} +>>> p = re.compile(r'\W+') +>>> p2 = re.compile(r'(\W+)') +>>> p.split('This... is a test.') +['This', 'is', 'a', 'test', ''] +>>> p2.split('This... is a test.') +['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] +\end{verbatim} + +The module-level function \function{re.split()} adds the RE to be +used as the first argument, but is otherwise the same. + +\begin{verbatim} +>>> re.split('[\W]+', 'Words, words, words.') +['Words', 'words', 'words', ''] +>>> re.split('([\W]+)', 'Words, words, words.') +['Words', ', ', 'words', ', ', 'words', '.', ''] +>>> re.split('[\W]+', 'Words, words, words.', 1) +['Words', 'words, words.'] +\end{verbatim} + +\subsection{Search and Replace} + +Another common task is to find all the matches for a pattern, and +replace them with a different string. The \method{sub()} method takes +a replacement value, which can be either a string or a function, and +the string to be processed. + +\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}} +Returns the string obtained by replacing the leftmost non-overlapping +occurrences of the RE in \var{string} by the replacement +\var{replacement}. If the pattern isn't found, \var{string} is returned +unchanged. + +The optional argument \var{count} is the maximum number of pattern +occurrences to be replaced; \var{count} must be a non-negative +integer. 
The default value of 0 means to replace all occurrences. +\end{methoddesc} + +Here's a simple example of using the \method{sub()} method. It +replaces colour names with the word \samp{colour}: + +\begin{verbatim} +>>> p = re.compile( '(blue|white|red)') +>>> p.sub( 'colour', 'blue socks and red shoes') +'colour socks and colour shoes' +>>> p.sub( 'colour', 'blue socks and red shoes', count=1) +'colour socks and red shoes' +\end{verbatim} + +The \method{subn()} method does the same work, but returns a 2-tuple +containing the new string value and the number of replacements +that were performed: + +\begin{verbatim} +>>> p = re.compile( '(blue|white|red)') +>>> p.subn( 'colour', 'blue socks and red shoes') +('colour socks and colour shoes', 2) +>>> p.subn( 'colour', 'no colours at all') +('no colours at all', 0) +\end{verbatim} + +Empty matches are replaced only when they're not +adjacent to a previous match. + +\begin{verbatim} +>>> p = re.compile('x*') +>>> p.sub('-', 'abxd') +'-a-b-d-' +\end{verbatim} + +If \var{replacement} is a string, any backslash escapes in it are +processed. That is, \samp{\e n} is converted to a single newline +character, \samp{\e r} is converted to a carriage return, and so forth. +Unknown escapes such as \samp{\e j} are left alone. Backreferences, +such as \samp{\e 6}, are replaced with the substring matched by the +corresponding group in the RE. This lets you incorporate +portions of the original text in the resulting +replacement string. + +This example matches the word \samp{section} followed by a string +enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to +\samp{subsection}: + +\begin{verbatim} +>>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) +>>> p.sub(r'subsection{\1}','section{First} section{second}') +'subsection{First} subsection{second}' +\end{verbatim} + +There's also a syntax for referring to named groups as defined by the +\regexp{(?P<name>...)} syntax. 
\samp{\e g<name>} will use the +substring matched by the group named \samp{name}, and +\samp{\e g<\var{number}>} +uses the corresponding group number. +\samp{\e g<2>} is therefore equivalent to \samp{\e 2}, +but isn't ambiguous in a +replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be +interpreted as a reference to group 20, not a reference to group 2 +followed by the literal character \character{0}.) The following +substitutions are all equivalent, but use all three variations of the +replacement string. + +\begin{verbatim} +>>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) +>>> p.sub(r'subsection{\1}','section{First}') +'subsection{First}' +>>> p.sub(r'subsection{\g<1>}','section{First}') +'subsection{First}' +>>> p.sub(r'subsection{\g<name>}','section{First}') +'subsection{First}' +\end{verbatim} + +\var{replacement} can also be a function, which gives you even more +control. If \var{replacement} is a function, the function is +called for every non-overlapping occurrence of \var{pattern}. On each +call, the function is +passed a \class{MatchObject} argument for the match +and can use this information to compute the desired replacement string and return it. + +In the following example, the replacement function translates +decimals into hexadecimal: + +\begin{verbatim} +>>> def hexrepl( match ): +... "Return the hex string for a decimal number" +... value = int( match.group() ) +... return hex(value) +... +>>> p = re.compile(r'\d+') +>>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') +'Call 0xffd2 for printing, 0xc000 for user code.' +\end{verbatim} + +When using the module-level \function{re.sub()} function, the pattern +is passed as the first argument. The pattern may be a string or a +\class{RegexObject}; if you need to specify regular expression flags, +you must either use a \class{RegexObject} as the first parameter, or use +embedded modifiers in the pattern, e.g. 
\code{sub("(?i)b+", "x", "bbbb +BBBB")} returns \code{'x x'}. + +\section{Common Problems} + +Regular expressions are a powerful tool for some applications, but in +some ways their behaviour isn't intuitive and at times they don't +behave the way you may expect them to. This section will point out +some of the most common pitfalls. + +\subsection{Use String Methods} + +Sometimes using the \module{re} module is a mistake. If you're +matching a fixed string, or a single character class, and you're not +using any \module{re} features such as the \constant{IGNORECASE} flag, +then the full power of regular expressions may not be required. +Strings have several methods for performing operations with fixed +strings and they're usually much faster, because the implementation is +a single small C loop that's been optimized for the purpose, instead +of the large, more generalized regular expression engine. + +One example might be replacing a single fixed string with another +one; for example, you might replace \samp{word} +with \samp{deed}. \code{re.sub()} seems like the function to use for +this, but consider the \method{replace()} method. Note that +\function{replace()} will also replace \samp{word} inside +words, turning \samp{swordfish} into \samp{sdeedfish}, but the +na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing +the substitution on parts of words, the pattern would have to be +\regexp{\e bword\e b}, in order to require that \samp{word} have a +word boundary on either side. This takes the job beyond +\method{replace}'s abilities.) + +Another common task is deleting every occurrence of a single character +from a string or replacing it with another single character. You +might do this with something like \code{re.sub('\e n', ' ', S)}, but +\method{translate()} is capable of doing both tasks +and will be faster than any regular expression operation can be. 
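As a rough illustration of the string-method advice above, the following sketch sets \method{replace()} and \method{translate()} beside their \function{re.sub()} equivalents. It is written for Python 3, where the translation table is built with \code{str.maketrans()}, rather than for the Python 2 sessions shown elsewhere in this document; the sample text and the variable names are invented for the example.

```python
import re

# Hypothetical sample text, chosen so that 'word' appears both on its
# own and inside a longer word.
text = "swordfish: one word\nper line\n"

# Fixed-string replacement needs no regular expression at all.
plain = text.replace("word", "deed")

# The naive RE does exactly the same job, including inside 'swordfish'.
naive_re = re.sub("word", "deed", text)

# Only the whole-word variant is beyond replace()'s abilities.
whole_words = re.sub(r"\bword\b", "deed", text)

# Replacing one character with another: translate() with a mapping table.
spaced = text.translate(str.maketrans("\n", " "))

# Deleting every occurrence of a character: the third argument of
# str.maketrans() lists the characters to drop.
deleted = text.translate(str.maketrans("", "", "\n"))

print(repr(plain))
print(repr(whole_words))
print(repr(spaced))
print(repr(deleted))
```

In this sketch the string methods and the equivalent \function{re.sub()} calls return identical results; only the \regexp{\e bword\e b} case produces something a plain string method cannot.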
+ +In short, before turning to the \module{re} module, consider whether +your problem can be solved with a faster and simpler string method. + +\subsection{match() versus search()} + +The \function{match()} function only checks if the RE matches at +the beginning of the string while \function{search()} will scan +forward through the string for a match. +It's important to keep this distinction in mind. Remember, +\function{match()} will only report a successful match that +starts at position 0; if the match wouldn't start at position 0, +\function{match()} will \emph{not} report it. + +\begin{verbatim} +>>> print re.match('super', 'superstition').span() +(0, 5) +>>> print re.match('super', 'insuperable') +None +\end{verbatim} + +On the other hand, \function{search()} will scan forward through the +string, reporting the first match it finds. + +\begin{verbatim} +>>> print re.search('super', 'superstition').span() +(0, 5) +>>> print re.search('super', 'insuperable').span() +(2, 7) +\end{verbatim} + +Sometimes you'll be tempted to keep using \function{re.match()}, and +just add \regexp{.*} to the front of your RE. Resist this temptation +and use \function{re.search()} instead. The regular expression +compiler does some analysis of REs in order to speed up the process of +looking for a match. One such analysis figures out what the first +character of a match must be; for example, a pattern starting with +\regexp{Crow} must match starting with a \character{C}. The analysis +lets the engine quickly scan through the string looking for the +starting character, only trying the full match if a \character{C} is found. + +Adding \regexp{.*} defeats this optimization, requiring scanning to +the end of the string and then backtracking to find a match for the +rest of the RE. Use \function{re.search()} instead. + +\subsection{Greedy versus Non-Greedy} + +When repeating a regular expression, as in \regexp{a*}, the resulting +action is to consume as much of the string as possible. 
This +fact often bites you when you're trying to match a pair of +balanced delimiters, such as the angle brackets surrounding an HTML +tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't +work because of the greedy nature of \regexp{.*}. + +\begin{verbatim} +>>> s = '<html><head><title>Title</title>' +>>> len(s) +32 +>>> print re.match('<.*>', s).span() +(0, 32) +>>> print re.match('<.*>', s).group() +<html><head><title>Title</title> +\end{verbatim} + +The RE matches the \character{<} in \samp{<html>}, and the +\regexp{.*} consumes the rest of the string. There's still more left +in the RE, though, and the \regexp{>} can't match at the end of +the string, so the regular expression engine has to backtrack +character by character until it finds a match for the \regexp{>}. +The final match extends from the \character{<} in \samp{<html>} +to the \character{>} in \samp{</title>}, which isn't what you want. + +In this case, the solution is to use the non-greedy qualifiers +\regexp{*?}, \regexp{+?}, \regexp{??}, or +\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as +possible. In the above example, the \character{>} is tried +immediately after the first \character{<} matches, and when it fails, +the engine advances a character at a time, retrying the \character{>} +at every step. This produces just the right result: + +\begin{verbatim} +>>> print re.match('<.*?>', s).group() +<html> +\end{verbatim} + +(Note that parsing HTML or XML with regular expressions is painful. +Quick-and-dirty patterns will handle common cases, but HTML and XML +have special cases that will break the obvious regular expression; by +the time you've written a regular expression that handles all of the +possible cases, the patterns will be \emph{very} complicated. Use an +HTML or XML parser module for such tasks.) 
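The greedy and non-greedy behaviour described above can be checked directly. The following is a small sketch in Python 3 syntax that reuses the string \code{s} from the example. The negated character class \verb|<[^>]*>| is an extra alternative that the text above does not discuss; it is included only for comparison, and the variable names are made up for the illustration.

```python
import re

s = '<html><head><title>Title</title>'

# Greedy: .* runs to the end of the string, then backtracks to the
# last '>' it can find.
greedy = re.match('<.*>', s)
print(greedy.span())

# Non-greedy: .*? grows one character at a time and stops at the first
# '>' that lets the whole pattern succeed.
lazy = re.match('<.*?>', s)
print(lazy.group())

# Alternative not covered in the text: forbid '>' inside the repeated
# part, so even a greedy repeat cannot run past the end of the tag.
negated = re.match('<[^>]*>', s)
print(negated.group())

# With the non-greedy form, findall() picks out each tag separately.
tags = re.findall('<.*?>', s)
print(tags)
```

As the note above says, none of these patterns is a substitute for a real HTML or XML parser; they only show how the choice of qualifier changes what a single pattern matches.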
+ +\subsection{Not Using re.VERBOSE} + +By now you've probably noticed that regular expressions are a very +compact notation, but they're not terribly readable. REs of +moderate complexity can become lengthy collections of backslashes, +parentheses, and metacharacters, making them difficult to read and +understand. + +For such REs, specifying the \code{re.VERBOSE} flag when +compiling the regular expression can be helpful, because it allows +you to format the regular expression more clearly. + +The \code{re.VERBOSE} flag has several effects. Whitespace in the +regular expression that \emph{isn't} inside a character class is +ignored. This means that an expression such as \regexp{dog | cat} is +equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]} +will still match the characters \character{a}, \character{b}, or a +space. In addition, you can put comments inside a RE; comments +extend from a \samp{\#} character to the next newline. When used with +triple-quoted strings, this enables REs to be formatted more neatly: + +\begin{verbatim} +pat = re.compile(r""" + \s* # Skip leading whitespace + (?P<header>[^:]+) # Header name + \s* : # Whitespace, and a colon + (?P<value>.*?) # The header's value -- *? used to + # lose the following trailing whitespace + \s*$ # Trailing whitespace to end-of-line +""", re.VERBOSE) +\end{verbatim} +% $ + +This is far more readable than: + +\begin{verbatim} +pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") +\end{verbatim} +% $ + +\section{Feedback} + +Regular expressions are a complicated topic. Did this document help +you understand them? Were there parts that were unclear, or problems +you encountered that weren't covered here? If so, please send +suggestions for improvements to the author. + +The most complete book on regular expressions is almost certainly +Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published +by O'Reilly. 
Unfortunately, it exclusively concentrates on Perl and +Java's flavours of regular expressions, and doesn't contain any Python +material at all, so it won't be useful as a reference for programming +in Python. (The first edition covered Python's now-obsolete +\module{regex} module, which won't help you much.) Consider checking +it out from your library. + +\end{document} + |