Diffstat (limited to 'Doc/howto/regex.tex')
-rw-r--r-- | Doc/howto/regex.tex | 1476 |
1 files changed, 0 insertions, 1476 deletions
diff --git a/Doc/howto/regex.tex b/Doc/howto/regex.tex deleted file mode 100644 index d911be6..0000000 --- a/Doc/howto/regex.tex +++ /dev/null @@ -1,1476 +0,0 @@ -\documentclass{howto} - -% TODO: -% Document lookbehind assertions -% Better way of displaying a RE, a string, and what it matches -% Mention optional argument to match.groups() -% Unicode (at least a reference) - -\title{Regular Expression HOWTO} - -\release{0.05} - -\author{A.M. Kuchling} -\authoraddress{\email{amk@amk.ca}} - -\begin{document} -\maketitle - -\begin{abstract} -\noindent -This document is an introductory tutorial to using regular expressions -in Python with the \module{re} module. It provides a gentler -introduction than the corresponding section in the Library Reference. - -This document is available from -\url{http://www.amk.ca/python/howto}. - -\end{abstract} - -\tableofcontents - -\section{Introduction} - -The \module{re} module was added in Python 1.5, and provides -Perl-style regular expression patterns. Earlier versions of Python -came with the \module{regex} module, which provided Emacs-style -patterns. The \module{regex} module was removed completely in Python 2.5. - -Regular expressions (called REs, or regexes, or regex patterns) are -essentially a tiny, highly specialized programming language embedded -inside Python and made available through the \module{re} module. -Using this little language, you specify the rules for the set of -possible strings that you want to match; this set might contain -English sentences, or e-mail addresses, or TeX commands, or anything -you like. You can then ask questions such as ``Does this string match -the pattern?'', or ``Is there a match for the pattern anywhere in this -string?''. You can also use REs to modify a string or to split it -apart in various ways. - -Regular expression patterns are compiled into a series of bytecodes -which are then executed by a matching engine written in C. 
For -advanced use, it may be necessary to pay careful attention to how the -engine will execute a given RE, and write the RE in a certain way in -order to produce bytecode that runs faster. Optimization isn't -covered in this document, because it requires that you have a good -understanding of the matching engine's internals. - -The regular expression language is relatively small and restricted, so -not all possible string processing tasks can be done using regular -expressions. There are also tasks that \emph{can} be done with -regular expressions, but the expressions turn out to be very -complicated. In these cases, you may be better off writing Python -code to do the processing; while Python code will be slower than an -elaborate regular expression, it will also probably be more understandable. - -\section{Simple Patterns} - -We'll start by learning about the simplest possible regular -expressions. Since regular expressions are used to operate on -strings, we'll begin with the most common task: matching characters. - -For a detailed explanation of the computer science underlying regular -expressions (deterministic and non-deterministic finite automata), you -can refer to almost any textbook on writing compilers. - -\subsection{Matching Characters} - -Most letters and characters will simply match themselves. For -example, the regular expression \regexp{test} will match the string -\samp{test} exactly. (You can enable a case-insensitive mode that -would let this RE match \samp{Test} or \samp{TEST} as well; more -about this later.) - -There are exceptions to this rule; some characters are special -\dfn{metacharacters}, and don't match themselves. Instead, they -signal that some out-of-the-ordinary thing should be matched, or they -affect other portions of the RE by repeating them or changing their -meaning. Much of this document is devoted to discussing various -metacharacters and what they do. 
- -Here's a complete list of the metacharacters; their meanings will be -discussed in the rest of this HOWTO. - -\begin{verbatim} -. ^ $ * + ? { [ ] \ | ( ) -\end{verbatim} -% $ - -The first metacharacters we'll look at are \samp{[} and \samp{]}. -They're used for specifying a character class, which is a set of -characters that you wish to match. Characters can be listed -individually, or a range of characters can be indicated by giving two -characters and separating them by a \character{-}. For example, -\regexp{[abc]} will match any of the characters \samp{a}, \samp{b}, or -\samp{c}; this is the same as -\regexp{[a-c]}, which uses a range to express the same set of -characters. If you wanted to match only lowercase letters, your -RE would be \regexp{[a-z]}. - -Metacharacters are not active inside classes. For example, -\regexp{[akm\$]} will match any of the characters \character{a}, -\character{k}, \character{m}, or \character{\$}; \character{\$} is -usually a metacharacter, but inside a character class it's stripped of -its special nature. - -You can match the characters not listed within the class by -\dfn{complementing} the set. This is indicated by including a -\character{\^} as the first character of the class; a \character{\^} -anywhere else in the class will simply match the -\character{\^} character. For example, \verb|[^5]| will match any -character except \character{5}. - -Perhaps the most important metacharacter is the backslash, \samp{\e}. -As in Python string literals, the backslash can be followed by various -characters to signal various special sequences. It's also used to escape -all the metacharacters so you can still match them in patterns; for -example, if you need to match a \samp{[} or -\samp{\e}, you can precede them with a backslash to remove their -special meaning: \regexp{\e[} or \regexp{\e\e}. 
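The character-class rules above can be checked with a short script. This is only a minimal sketch: it is written in Python 3 syntax (an assumption, since the transcripts in this HOWTO use Python 2), and the sample strings and variable names are made up for illustration.

```python
import re

# A class lists the characters it accepts; a range is shorthand for the same set.
listed = re.findall(r"[abc]", "a1b2c3d4")
ranged = re.findall(r"[a-c]", "a1b2c3d4")

# '$' is stripped of its special meaning inside a class.
dollars = re.findall(r"[akm$]", "ask me for $5")

# A leading '^' complements the class: every character except '5'.
not_five = re.findall(r"[^5]", "1525")

# A backslash escapes a metacharacter so that it matches literally.
brackets = re.findall(r"\[", "x[0][1]")
backslashes = re.findall(r"\\", r"C:\temp\file")

print(listed, ranged, dollars, not_five, brackets, backslashes)
```

Raw string literals are used for the patterns so that the backslashes reach the regular expression engine unchanged.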
- -Some of the special sequences beginning with \character{\e} represent -predefined sets of characters that are often useful, such as the set -of digits, the set of letters, or the set of anything that isn't -whitespace. The following predefined special sequences are available: - -\begin{itemize} -\item[\code{\e d}]Matches any decimal digit; this is -equivalent to the class \regexp{[0-9]}. - -\item[\code{\e D}]Matches any non-digit character; this is -equivalent to the class \verb|[^0-9]|. - -\item[\code{\e s}]Matches any whitespace character; this is -equivalent to the class \regexp{[ \e t\e n\e r\e f\e v]}. - -\item[\code{\e S}]Matches any non-whitespace character; this is -equivalent to the class \verb|[^ \t\n\r\f\v]|. - -\item[\code{\e w}]Matches any alphanumeric character; this is equivalent to the class -\regexp{[a-zA-Z0-9_]}. - -\item[\code{\e W}]Matches any non-alphanumeric character; this is equivalent to the class -\verb|[^a-zA-Z0-9_]|. -\end{itemize} - -These sequences can be included inside a character class. For -example, \regexp{[\e s,.]} is a character class that will match any -whitespace character, or \character{,} or \character{.}. - -The final metacharacter in this section is \regexp{.}. It matches -anything except a newline character, and there's an alternate mode -(\code{re.DOTALL}) where it will match even a newline. \character{.} -is often used where you want to match ``any character''. - -\subsection{Repeating Things} - -Being able to match varying sets of characters is the first thing -regular expressions can do that isn't already possible with the -methods available on strings. However, if that was the only -additional capability of regexes, they wouldn't be much of an advance. -Another capability is that you can specify that portions of the RE -must be repeated a certain number of times. - -The first metacharacter for repeating things that we'll look at is -\regexp{*}. 
\regexp{*} doesn't match the literal character \samp{*}; -instead, it specifies that the previous character can be matched zero -or more times, instead of exactly once. - -For example, \regexp{ca*t} will match \samp{ct} (0 \samp{a} -characters), \samp{cat} (1 \samp{a}), \samp{caaat} (3 \samp{a} -characters), and so forth. The RE engine has various internal -limitations stemming from the size of C's \code{int} type that will -prevent it from matching over 2 billion \samp{a} characters; you -probably don't have enough memory to construct a string that large, so -you shouldn't run into that limit. - -Repetitions such as \regexp{*} are \dfn{greedy}; when repeating a RE, -the matching engine will try to repeat it as many times as possible. -If later portions of the pattern don't match, the matching engine will -then back up and try again with fewer repetitions. - -A step-by-step example will make this more obvious. Let's consider -the expression \regexp{a[bcd]*b}. This matches the letter -\character{a}, zero or more letters from the class \code{[bcd]}, and -finally ends with a \character{b}. Now imagine matching this RE -against the string \samp{abcbd}. - -\begin{tableiii}{c|l|l}{}{Step}{Matched}{Explanation} -\lineiii{1}{\code{a}}{The \regexp{a} in the RE matches.} -\lineiii{2}{\code{abcbd}}{The engine matches \regexp{[bcd]*}, going as far as -it can, which is to the end of the string.} -\lineiii{3}{\emph{Failure}}{The engine tries to match \regexp{b}, but the -current position is at the end of the string, so it fails.} -\lineiii{4}{\code{abcb}}{Back up, so that \regexp{[bcd]*} matches -one less character.} -\lineiii{5}{\emph{Failure}}{Try \regexp{b} again, but the -current position is at the last character, which is a \character{d}.} -\lineiii{6}{\code{abc}}{Back up again, so that \regexp{[bcd]*} is -only matching \samp{bc}.} -\lineiii{7}{\code{abcb}}{Try \regexp{b} again. 
This time -the character at the current position is \character{b}, so it succeeds.} -\end{tableiii} - -The end of the RE has now been reached, and it has matched -\samp{abcb}. This demonstrates how the matching engine goes as far as -it can at first, and if no match is found it will then progressively -back up and retry the rest of the RE again and again. It will back up -until it has tried zero matches for \regexp{[bcd]*}, and if that -subsequently fails, the engine will conclude that the string doesn't -match the RE at all. - -Another repeating metacharacter is \regexp{+}, which matches one or -more times. Pay careful attention to the difference between -\regexp{*} and \regexp{+}; \regexp{*} matches \emph{zero} or more -times, so whatever's being repeated may not be present at all, while -\regexp{+} requires at least \emph{one} occurrence. To use a similar -example, \regexp{ca+t} will match \samp{cat} (1 \samp{a}), -\samp{caaat} (3 \samp{a}'s), but won't match \samp{ct}. - -There are two more repeating qualifiers. The question mark character, -\regexp{?}, matches either once or zero times; you can think of it as -marking something as being optional. For example, \regexp{home-?brew} -matches either \samp{homebrew} or \samp{home-brew}. - -The most complicated repeating qualifier is -\regexp{\{\var{m},\var{n}\}}, where \var{m} and \var{n} are decimal -integers. This qualifier means there must be at least \var{m} -repetitions, and at most \var{n}. For example, \regexp{a/\{1,3\}b} -will match \samp{a/b}, \samp{a//b}, and \samp{a///b}. It won't match -\samp{ab}, which has no slashes, or \samp{a////b}, which has four. - -You can omit either \var{m} or \var{n}; in that case, a reasonable -value is assumed for the missing value. Omitting \var{m} is -interpreted as a lower limit of 0, while omitting \var{n} results in -an upper bound of infinity --- actually, the upper bound is the -2-billion limit mentioned earlier, but that might as well be infinity. 
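The repeating qualifiers described so far can be tried out with a short script. This is a minimal sketch in Python 3 syntax (an assumption, since the transcripts in this HOWTO use Python 2); it relies on \function{re.fullmatch()}, which exists only in newer Python versions, and the helper name \code{fully\_matches} is made up for illustration.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# '*' allows zero or more repetitions, '+' requires at least one.
star = [fully_matches(r"ca*t", s) for s in ("ct", "cat", "caaat")]
plus = [fully_matches(r"ca+t", s) for s in ("ct", "cat", "caaat")]

# '?' marks the preceding character as optional.
optional = [fully_matches(r"home-?brew", s)
            for s in ("homebrew", "home-brew", "home--brew")]

# {m,n} bounds the repetition count; either bound may be omitted.
bounded = [fully_matches(r"a/{1,3}b", s)
           for s in ("ab", "a/b", "a//b", "a///b", "a////b")]
no_upper_bound = fully_matches(r"a/{2,}b", "a/////b")
no_lower_bound = fully_matches(r"a/{,2}b", "ab")

# Greedy matching with backtracking: [bcd]* first grabs 'bcbd',
# then gives characters back until the final 'b' can match.
greedy = re.match(r"a[bcd]*b", "abcbd").group()

print(star, plus, optional, bounded, no_upper_bound, no_lower_bound, greedy)
```

The last line reproduces the step-by-step table: the match that survives the backtracking is \samp{abcb}.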
- -Readers of a reductionist bent may notice that the three other qualifiers -can all be expressed using this notation. \regexp{\{0,\}} is the same -as \regexp{*}, \regexp{\{1,\}} is equivalent to \regexp{+}, and -\regexp{\{0,1\}} is the same as \regexp{?}. It's better to use -\regexp{*}, \regexp{+}, or \regexp{?} when you can, simply because -they're shorter and easier to read. - -\section{Using Regular Expressions} - -Now that we've looked at some simple regular expressions, how do we -actually use them in Python? The \module{re} module provides an -interface to the regular expression engine, allowing you to compile -REs into objects and then perform matches with them. - -\subsection{Compiling Regular Expressions} - -Regular expressions are compiled into \class{RegexObject} instances, -which have methods for various operations such as searching for -pattern matches or performing string substitutions. - -\begin{verbatim} ->>> import re ->>> p = re.compile('ab*') ->>> print p -<re.RegexObject instance at 80b4150> -\end{verbatim} - -\function{re.compile()} also accepts an optional \var{flags} -argument, used to enable various special features and syntax -variations. We'll go over the available settings later, but for now a -single example will do: - -\begin{verbatim} ->>> p = re.compile('ab*', re.IGNORECASE) -\end{verbatim} - -The RE is passed to \function{re.compile()} as a string. REs are -handled as strings because regular expressions aren't part of the core -Python language, and no special syntax was created for expressing -them. (There are applications that don't need REs at all, so there's -no need to bloat the language specification by including them.) -Instead, the \module{re} module is simply a C extension module -included with Python, just like the \module{socket} or \module{zlib} -modules. - -Putting REs in strings keeps the Python language simpler, but has one -disadvantage which is the topic of the next section. 
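To make the effect of the \var{flags} argument concrete, here is a minimal sketch that compiles the same RE with and without \code{re.IGNORECASE}. It uses Python 3 syntax (an assumption, since the transcripts in this HOWTO use Python 2), and the variable names are illustrative only.

```python
import re

# The same pattern string, compiled with and without a flag.
p = re.compile("ab*")
p_nocase = re.compile("ab*", re.IGNORECASE)

# Without the flag, uppercase text does not match.
strict_result = p.match("ABBB")

# With the flag, the same text matches in full.
relaxed_result = p_nocase.match("ABBB")
relaxed_text = relaxed_result.group()

# The compiled object remembers the string it was built from.
source = p.pattern

print(strict_result, relaxed_text, source)
```

Both objects are built from the same pattern string; only the flag changes how the engine compares letters.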
- -\subsection{The Backslash Plague} - -As stated earlier, regular expressions use the backslash -character (\character{\e}) to indicate special forms or to allow -special characters to be used without invoking their special meaning. -This conflicts with Python's usage of the same character for the same -purpose in string literals. - -Let's say you want to write a RE that matches the string -\samp{{\e}section}, which might be found in a \LaTeX\ file. To figure -out what to write in the program code, start with the desired string -to be matched. Next, you must escape any backslashes and other -metacharacters by preceding them with a backslash, resulting in the -string \samp{\e\e section}. The resulting string that must be passed -to \function{re.compile()} must be \verb|\\section|. However, to -express this as a Python string literal, both backslashes must be -escaped \emph{again}. - -\begin{tableii}{c|l}{code}{Characters}{Stage} - \lineii{\e section}{Text string to be matched} - \lineii{\e\e section}{Escaped backslash for \function{re.compile}} - \lineii{"\e\e\e\e section"}{Escaped backslashes for a string literal} -\end{tableii} - -In short, to match a literal backslash, one has to write -\code{'\e\e\e\e'} as the RE string, because the regular expression -must be \samp{\e\e}, and each backslash must be expressed as -\samp{\e\e} inside a regular Python string literal. In REs that -feature backslashes repeatedly, this leads to lots of repeated -backslashes and makes the resulting strings difficult to understand. - -The solution is to use Python's raw string notation for regular -expressions; backslashes are not handled in any special way in -a string literal prefixed with \character{r}, so \code{r"\e n"} is a -two-character string containing \character{\e} and \character{n}, -while \code{"\e n"} is a one-character string containing a newline. -Regular expressions will often be written in Python -code using this raw string notation. 
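The difference between regular and raw string literals can be verified directly. This is a minimal sketch in Python 3 syntax (an assumption, since the transcripts in this HOWTO use Python 2); the sample text is made up for illustration.

```python
import re

# Both literals produce the same two-backslash RE string: \\section
regular = "\\\\section"
raw = r"\\section"
same_pattern = (regular == raw)

# "\n" is a single newline character; r"\n" is a backslash followed by 'n'.
newline_length = len("\n")
raw_length = len(r"\n")

# Either form matches a literal backslash followed by 'section'.
match = re.search(raw, "\\section{Introduction}")
found = match.group()

print(same_pattern, newline_length, raw_length, found)
```

The raw form needs half as many backslashes, which is why it is the usual choice for REs.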
- -\begin{tableii}{c|c}{code}{Regular string}{Raw string} - \lineii{"ab*"}{\code{r"ab*"}} - \lineii{"\e\e\e\e section"}{\code{r"\e\e section"}} - \lineii{"\e\e w+\e\e s+\e\e 1"}{\code{r"\e w+\e s+\e 1"}} -\end{tableii} - -\subsection{Performing Matches} - -Once you have an object representing a compiled regular expression, -what do you do with it? \class{RegexObject} instances have several -methods and attributes. Only the most significant ones will be -covered here; consult \ulink{the Library -Reference}{http://www.python.org/doc/lib/module-re.html} for a -complete listing. - -\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} - \lineii{match()}{Determine if the RE matches at the beginning of - the string.} - \lineii{search()}{Scan through a string, looking for any location - where this RE matches.} - \lineii{findall()}{Find all substrings where the RE matches, -and return them as a list.} - \lineii{finditer()}{Find all substrings where the RE matches, -and return them as an iterator.} -\end{tableii} - -\method{match()} and \method{search()} return \code{None} if no match -can be found. If they're successful, a \code{MatchObject} instance is -returned, containing information about the match: where it starts and -ends, the substring it matched, and more. - -You can learn about this by interactively experimenting with the -\module{re} module. If you have Tkinter available, you may also want -to look at \file{Tools/scripts/redemo.py}, a demonstration program -included with the Python distribution. It allows you to enter REs and -strings, and displays whether the RE matches or fails. -\file{redemo.py} can be quite useful when trying to debug a -complicated RE. Phil Schwartz's -\ulink{Kodos}{http://www.phil-schwartz.com/kodos.spy} is also an interactive -tool for developing and testing RE patterns. - -This HOWTO uses the standard Python interpreter for its examples. 
-First, run the Python interpreter, import the \module{re} module, and -compile a RE: - -\begin{verbatim} -Python 2.2.2 (#1, Feb 10 2003, 12:57:01) ->>> import re ->>> p = re.compile('[a-z]+') ->>> p -<_sre.SRE_Pattern object at 80c3c28> -\end{verbatim} - -Now, you can try matching various strings against the RE -\regexp{[a-z]+}. An empty string shouldn't match at all, since -\regexp{+} means 'one or more repetitions'. \method{match()} should -return \code{None} in this case, which will cause the interpreter to -print no output. You can explicitly print the result of -\method{match()} to make this clear. - -\begin{verbatim} ->>> p.match("") ->>> print p.match("") -None -\end{verbatim} - -Now, let's try it on a string that it should match, such as -\samp{tempo}. In this case, \method{match()} will return a -\class{MatchObject}, so you should store the result in a variable for -later use. - -\begin{verbatim} ->>> m = p.match('tempo') ->>> print m -<_sre.SRE_Match object at 80c4f68> -\end{verbatim} - -Now you can query the \class{MatchObject} for information about the -matching string. \class{MatchObject} instances also have several -methods and attributes; the most important ones are: - -\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} - \lineii{group()}{Return the string matched by the RE} - \lineii{start()}{Return the starting position of the match} - \lineii{end()}{Return the ending position of the match} - \lineii{span()}{Return a tuple containing the (start, end) positions - of the match} -\end{tableii} - -Trying these methods will soon clarify their meaning: - -\begin{verbatim} ->>> m.group() -'tempo' ->>> m.start(), m.end() -(0, 5) ->>> m.span() -(0, 5) -\end{verbatim} - -\method{group()} returns the substring that was matched by the -RE. \method{start()} and \method{end()} return the starting and -ending index of the match. \method{span()} returns both start and end -indexes in a single tuple. 
Since the \method{match} method only -checks if the RE matches at the start of a string, -\method{start()} will always be zero. However, the \method{search} -method of \class{RegexObject} instances scans through the string, so -the match may not start at zero in that case. - -\begin{verbatim} ->>> print p.match('::: message') -None ->>> m = p.search('::: message') ; print m -<re.MatchObject instance at 80c9650> ->>> m.group() -'message' ->>> m.span() -(4, 11) -\end{verbatim} - -In actual programs, the most common style is to store the -\class{MatchObject} in a variable, and then check if it was -\code{None}. This usually looks like: - -\begin{verbatim} -p = re.compile( ... ) -m = p.match( 'string goes here' ) -if m: - print 'Match found: ', m.group() -else: - print 'No match' -\end{verbatim} - -Two \class{RegexObject} methods return all of the matches for a pattern. -\method{findall()} returns a list of matching strings: - -\begin{verbatim} ->>> p = re.compile('\d+') ->>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') -['12', '11', '10'] -\end{verbatim} - -\method{findall()} has to create the entire list before it can be -returned as the result. The \method{finditer()} method returns a -sequence of \class{MatchObject} instances as an -iterator.\footnote{Introduced in Python 2.2.2.} - -\begin{verbatim} ->>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') ->>> iterator -<callable-iterator object at 0x401833ac> ->>> for match in iterator: -... print match.span() -... -(0, 2) -(22, 24) -(29, 31) -\end{verbatim} - - -\subsection{Module-Level Functions} - -You don't have to create a \class{RegexObject} and call its methods; -the \module{re} module also provides top-level functions called -\function{match()}, \function{search()}, \function{findall()}, -\function{sub()}, and so forth. 
These functions take the same -arguments as the corresponding \class{RegexObject} method, with the RE -string added as the first argument, and still return either -\code{None} or a \class{MatchObject} instance. - -\begin{verbatim} ->>> print re.match(r'From\s+', 'Fromage amk') -None ->>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') -<re.MatchObject instance at 80c5978> -\end{verbatim} - -Under the hood, these functions simply produce a \class{RegexObject} -for you and call the appropriate method on it. They also store the -compiled object in a cache, so future calls using the same -RE are faster. - -Should you use these module-level functions, or should you get the -\class{RegexObject} and call its methods yourself? That choice -depends on how frequently the RE will be used, and on your personal -coding style. If the RE is being used at only one point in the code, -then the module functions are probably more convenient. If a program -contains a lot of regular expressions, or re-uses the same ones in -several locations, then it might be worthwhile to collect all the -definitions in one place, in a section of code that compiles all the -REs ahead of time. To take an example from the standard library: - -\begin{verbatim} -ref = re.compile( ... ) -entityref = re.compile( ... ) -charref = re.compile( ... ) -starttagopen = re.compile( ... ) -\end{verbatim} - -I generally prefer to work with the compiled object, even for -one-time uses, but few people will be as much of a purist about this -as I am. - -\subsection{Compilation Flags} - -Compilation flags let you modify some aspects of how regular -expressions work. Flags are available in the \module{re} module under -two names, a long name such as \constant{IGNORECASE} and a short, -one-letter form such as \constant{I}. (If you're familiar with Perl's -pattern modifiers, the one-letter forms use the same letters; the -short form of \constant{re.VERBOSE} is \constant{re.X}, for example.) 
-Multiple flags can be specified by bitwise OR-ing them; \code{re.I | -re.M} sets both the \constant{I} and \constant{M} flags, for example. - -Here's a table of the available flags, followed by -a more detailed explanation of each one. - -\begin{tableii}{c|l}{}{Flag}{Meaning} - \lineii{\constant{DOTALL}, \constant{S}}{Make \regexp{.} match any - character, including newlines} - \lineii{\constant{IGNORECASE}, \constant{I}}{Do case-insensitive matches} - \lineii{\constant{LOCALE}, \constant{L}}{Do a locale-aware match} - \lineii{\constant{MULTILINE}, \constant{M}}{Multi-line matching, - affecting \regexp{\^} and \regexp{\$}} - \lineii{\constant{VERBOSE}, \constant{X}}{Enable verbose REs, - which can be organized more cleanly and understandably.} -\end{tableii} - -\begin{datadesc}{I} -\dataline{IGNORECASE} -Perform case-insensitive matching; character class and literal strings -will match -letters by ignoring case. For example, \regexp{[A-Z]} will match -lowercase letters, too, and \regexp{Spam} will match \samp{Spam}, -\samp{spam}, or \samp{spAM}. -This lowercasing doesn't take the current locale into account; it will -if you also set the \constant{LOCALE} flag. -\end{datadesc} - -\begin{datadesc}{L} -\dataline{LOCALE} -Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, -and \regexp{\e B}, dependent on the current locale. - -Locales are a feature of the C library intended to help in writing -programs that take account of language differences. For example, if -you're processing French text, you'd want to be able to write -\regexp{\e w+} to match words, but \regexp{\e w} only matches the -character class \regexp{[A-Za-z]}; it won't match \character{\'e} or -\character{\c c}. If your system is configured properly and a French -locale is selected, certain C functions will tell the program that -\character{\'e} should also be considered a letter. 
Setting the -\constant{LOCALE} flag when compiling a regular expression will cause the -resulting compiled object to use these C functions for \regexp{\e w}; -this is slower, but also enables \regexp{\e w+} to match French words as -you'd expect. -\end{datadesc} - -\begin{datadesc}{M} -\dataline{MULTILINE} -(\regexp{\^} and \regexp{\$} haven't been explained yet; -they'll be introduced in section~\ref{more-metacharacters}.) - -Usually \regexp{\^} matches only at the beginning of the string, and -\regexp{\$} matches only at the end of the string and immediately before the -newline (if any) at the end of the string. When this flag is -specified, \regexp{\^} matches at the beginning of the string and at -the beginning of each line within the string, immediately following -each newline. Similarly, the \regexp{\$} metacharacter matches both at -the end of the string and at the end of each line (immediately -preceding each newline). - -\end{datadesc} - -\begin{datadesc}{S} -\dataline{DOTALL} -Makes the \character{.} special character match any character at all, -including a newline; without this flag, \character{.} will match -anything \emph{except} a newline. -\end{datadesc} - -\begin{datadesc}{X} -\dataline{VERBOSE} This flag allows you to write regular expressions -that are more readable by granting you more flexibility in how you can -format them. When this flag has been specified, whitespace within the -RE string is ignored, except when the whitespace is in a character -class or preceded by an unescaped backslash; this lets you organize -and indent the RE more clearly. This flag also lets you put comments -within a RE that will be ignored by the engine; comments are marked by -a \character{\#} that's neither in a character class nor preceded by an -unescaped backslash. - -For example, here's a RE that uses \constant{re.VERBOSE}; see how -much easier it is to read? 
- -\begin{verbatim} -charref = re.compile(r""" - &[#] # Start of a numeric entity reference - ( - 0[0-7]+ # Octal form - | [0-9]+ # Decimal form - | x[0-9a-fA-F]+ # Hexadecimal form - ) - ; # Trailing semicolon -""", re.VERBOSE) -\end{verbatim} - -Without the verbose setting, the RE would look like this: -\begin{verbatim} -charref = re.compile("&#(0[0-7]+" - "|[0-9]+" - "|x[0-9a-fA-F]+);") -\end{verbatim} - -In the above example, Python's automatic concatenation of string -literals has been used to break up the RE into smaller pieces, but -it's still more difficult to understand than the version using -\constant{re.VERBOSE}. - -\end{datadesc} - -\section{More Pattern Power} - -So far we've only covered a part of the features of regular -expressions. In this section, we'll cover some new metacharacters, -and how to use groups to retrieve portions of the text that was matched. - -\subsection{More Metacharacters\label{more-metacharacters}} - -There are some metacharacters that we haven't covered yet. Most of -them will be covered in this section. - -Some of the remaining metacharacters to be discussed are -\dfn{zero-width assertions}. They don't cause the engine to advance -through the string; instead, they consume no characters at all, -and simply succeed or fail. For example, \regexp{\e b} is an -assertion that the current position is located at a word boundary; the -position isn't changed by the \regexp{\e b} at all. This means that -zero-width assertions should never be repeated, because if they match -once at a given location, they can obviously be matched an infinite -number of times. - -\begin{list}{}{} - -\item[\regexp{|}] -Alternation, or the ``or'' operator. -If A and B are regular expressions, -\regexp{A|B} will match any string that matches either \samp{A} or \samp{B}. -\regexp{|} has very low precedence in order to make it work reasonably when -you're alternating multi-character strings. 
-\regexp{Crow|Servo} will match either \samp{Crow} or \samp{Servo}, not -\samp{Cro}, a \character{w} or an \character{S}, and \samp{ervo}. - -To match a literal \character{|}, -use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}. - -\item[\regexp{\^}] Matches at the beginning of lines. Unless the -\constant{MULTILINE} flag has been set, this will only match at the -beginning of the string. In \constant{MULTILINE} mode, this also -matches immediately after each newline within the string. - -For example, if you wish to match the word \samp{From} only at the -beginning of a line, the RE to use is \verb|^From|. - -\begin{verbatim} ->>> print re.search('^From', 'From Here to Eternity') -<re.MatchObject instance at 80c1520> ->>> print re.search('^From', 'Reciting From Memory') -None -\end{verbatim} - -%To match a literal \character{\^}, use \regexp{\e\^} or enclose it -%inside a character class, as in \regexp{[{\e}\^]}. - -\item[\regexp{\$}] Matches at the end of a line, which is defined as -either the end of the string, or any location followed by a newline -character. - -\begin{verbatim} ->>> print re.search('}$', '{block}') -<re.MatchObject instance at 80adfa8> ->>> print re.search('}$', '{block} ') -None ->>> print re.search('}$', '{block}\n') -<re.MatchObject instance at 80adfa8> -\end{verbatim} -% $ - -To match a literal \character{\$}, use \regexp{\e\$} or enclose it -inside a character class, as in \regexp{[\$]}. - -\item[\regexp{\e A}] Matches only at the start of the string. When -not in \constant{MULTILINE} mode, \regexp{\e A} and \regexp{\^} are -effectively the same. In \constant{MULTILINE} mode, they're -different: \regexp{\e A} still matches only at the beginning of the -string, but \regexp{\^} may match at any location inside the string -that follows a newline character. - -\item[\regexp{\e Z}] Matches only at the end of the string. - -\item[\regexp{\e b}] Word boundary. 
-This is a zero-width assertion that matches only at the -beginning or end of a word. A word is defined as a sequence of -alphanumeric characters, so the end of a word is indicated by -whitespace or a non-alphanumeric character. - -The following example matches \samp{class} only when it's a complete -word; it won't match when it's contained inside another word. - -\begin{verbatim} ->>> p = re.compile(r'\bclass\b') ->>> print p.search('no class at all') -<re.MatchObject instance at 80c8f28> ->>> print p.search('the declassified algorithm') -None ->>> print p.search('one subclass is') -None -\end{verbatim} - -There are two subtleties you should remember when using this special -sequence. First, this is the worst collision between Python's string -literals and regular expression sequences. In Python's string -literals, \samp{\e b} is the backspace character, ASCII value 8. If -you're not using raw strings, then Python will convert the \samp{\e b} to -a backspace, and your RE won't match as you expect it to. The -following example looks the same as our previous RE, but omits -the \character{r} in front of the RE string. - -\begin{verbatim} ->>> p = re.compile('\bclass\b') ->>> print p.search('no class at all') -None ->>> print p.search('\b' + 'class' + '\b') -<re.MatchObject instance at 80c3ee0> -\end{verbatim} - -Second, inside a character class, where there's no use for this -assertion, \regexp{\e b} represents the backspace character, for -compatibility with Python's string literals. - -\item[\regexp{\e B}] Another zero-width assertion, this is the -opposite of \regexp{\e b}, only matching when the current -position is not at a word boundary. - -\end{list} - -\subsection{Grouping} - -Frequently you need to obtain more information than just whether the -RE matched or not. Regular expressions are often used to dissect -strings by writing a RE divided into several subgroups which -match different components of interest. 
For example, an RFC-822 -header line is divided into a header name and a value, separated by a -\character{:}, like this: - -\begin{verbatim} -From: author@example.com -User-Agent: Thunderbird 1.5.0.9 (X11/20061227) -MIME-Version: 1.0 -To: editor@example.com -\end{verbatim} - -This can be handled by writing a regular expression -which matches an entire header line, and has one group which matches the -header name, and another group which matches the header's value. - -Groups are marked by the \character{(}, \character{)} metacharacters. -\character{(} and \character{)} have much the same meaning as they do -in mathematical expressions; they group together the expressions -contained inside them, and you can repeat the contents of a -group with a repeating qualifier, such as \regexp{*}, \regexp{+}, -\regexp{?}, or \regexp{\{\var{m},\var{n}\}}. For example, -\regexp{(ab)*} will match zero or more repetitions of \samp{ab}. - -\begin{verbatim} ->>> p = re.compile('(ab)*') ->>> print p.match('ababababab').span() -(0, 10) -\end{verbatim} - -Groups indicated with \character{(}, \character{)} also capture the -starting and ending index of the text that they match; this can be -retrieved by passing an argument to \method{group()}, -\method{start()}, \method{end()}, and \method{span()}. Groups are -numbered starting with 0. Group 0 is always present; it's the whole -RE, so \class{MatchObject} methods all have group 0 as their default -argument. Later we'll see how to express groups that don't capture -the span of text that they match. - -\begin{verbatim} ->>> p = re.compile('(a)b') ->>> m = p.match('ab') ->>> m.group() -'ab' ->>> m.group(0) -'ab' -\end{verbatim} - -Subgroups are numbered from left to right, from 1 upward. Groups can -be nested; to determine the number, just count the opening parenthesis -characters, going from left to right. 
- -\begin{verbatim} ->>> p = re.compile('(a(b)c)d') ->>> m = p.match('abcd') ->>> m.group(0) -'abcd' ->>> m.group(1) -'abc' ->>> m.group(2) -'b' -\end{verbatim} - -\method{group()} can be passed multiple group numbers at a time, in -which case it will return a tuple containing the corresponding values -for those groups. - -\begin{verbatim} ->>> m.group(2,1,2) -('b', 'abc', 'b') -\end{verbatim} - -The \method{groups()} method returns a tuple containing the strings -for all the subgroups, from 1 up to however many there are. - -\begin{verbatim} ->>> m.groups() -('abc', 'b') -\end{verbatim} - -Backreferences in a pattern allow you to specify that the contents of -an earlier capturing group must also be found at the current location -in the string. For example, \regexp{\e 1} will succeed if the exact -contents of group 1 can be found at the current position, and fails -otherwise. Remember that Python's string literals also use a -backslash followed by numbers to allow including arbitrary characters -in a string, so be sure to use a raw string when incorporating -backreferences in a RE. - -For example, the following RE detects doubled words in a string. - -\begin{verbatim} ->>> p = re.compile(r'(\b\w+)\s+\1') ->>> p.search('Paris in the the spring').group() -'the the' -\end{verbatim} - -Backreferences like this aren't often useful for just searching -through a string --- there are few text formats which repeat data in -this way --- but you'll soon find out that they're \emph{very} useful -when performing string substitutions. - -\subsection{Non-capturing and Named Groups} - -Elaborate REs may use many groups, both to capture substrings of -interest, and to group and structure the RE itself. In complex REs, -it becomes difficult to keep track of the group numbers. There are -two features which help with this problem. Both of them use a common -syntax for regular expression extensions, so we'll look at that first. 
- -Perl 5 added several additional features to standard regular -expressions, and the Python \module{re} module supports most of them. -It would have been difficult to choose new -single-keystroke metacharacters or new special sequences beginning -with \samp{\e} to represent the new features without making Perl's -regular expressions confusingly different from standard REs. If you -chose \samp{\&} as a new metacharacter, for example, old expressions -would be assuming that -\samp{\&} was a regular character and wouldn't have escaped it by -writing \regexp{\e \&} or \regexp{[\&]}. - -The solution chosen by the Perl developers was to use \regexp{(?...)} -as the extension syntax. \samp{?} immediately after a parenthesis was -a syntax error because the \samp{?} would have nothing to repeat, so -this didn't introduce any compatibility problems. The characters -immediately after the \samp{?} indicate what extension is being used, -so \regexp{(?=foo)} is one thing (a positive lookahead assertion) and -\regexp{(?:foo)} is something else (a non-capturing group containing -the subexpression \regexp{foo}). - -Python adds an extension syntax to Perl's extension syntax. If the -first character after the question mark is a \samp{P}, you know that -it's an extension that's specific to Python. Currently there are two -such extensions: \regexp{(?P<\var{name}>...)} defines a named group, -and \regexp{(?P=\var{name})} is a backreference to a named group. If -future versions of Perl 5 add similar features using a different -syntax, the \module{re} module will be changed to support the new -syntax, while preserving the Python-specific syntax for -compatibility's sake. - -Now that we've looked at the general extension syntax, we can return -to the features that simplify working with groups in complex REs. -Since groups are numbered from left to right and a complex expression -may use many groups, it can become difficult to keep track of the -correct numbering. 
Modifying such a complex RE is annoying, too: -insert a new group near the beginning and you change the numbers of -everything that follows it. - -Sometimes you'll want to use a group to collect a part of a regular -expression, but aren't interested in retrieving the group's contents. -You can make this fact explicit by using a non-capturing group: -\regexp{(?:...)}, where you can replace the \regexp{...} -with any other regular expression. - -\begin{verbatim} ->>> m = re.match("([abc])+", "abc") ->>> m.groups() -('c',) ->>> m = re.match("(?:[abc])+", "abc") ->>> m.groups() -() -\end{verbatim} - -Except for the fact that you can't retrieve the contents of what the -group matched, a non-capturing group behaves exactly the same as a -capturing group; you can put anything inside it, repeat it with a -repetition metacharacter such as \samp{*}, and nest it within other -groups (capturing or non-capturing). \regexp{(?:...)} is particularly -useful when modifying an existing pattern, since you can add new groups -without changing how all the other groups are numbered. It should be -mentioned that there's no performance difference in searching between -capturing and non-capturing groups; neither form is any faster than -the other. - -A more significant feature is named groups: instead of -referring to them by numbers, groups can be referenced by a name. - -The syntax for a named group is one of the Python-specific extensions: -\regexp{(?P<\var{name}>...)}. \var{name} is, obviously, the name of -the group. Named groups also behave exactly like capturing groups, -and additionally associate a name with a group. The -\class{MatchObject} methods that deal with capturing groups all accept -either integers that refer to the group by number or strings that -contain the desired group's name. 
Named groups are still given -numbers, so you can retrieve information about a group in two ways: - -\begin{verbatim} ->>> p = re.compile(r'(?P<word>\b\w+\b)') ->>> m = p.search( '(((( Lots of punctuation )))' ) ->>> m.group('word') -'Lots' ->>> m.group(1) -'Lots' -\end{verbatim} - -Named groups are handy because they let you use easily-remembered -names, instead of having to remember numbers. Here's an example RE -from the \module{imaplib} module: - -\begin{verbatim} -InternalDate = re.compile(r'INTERNALDATE "' - r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-' - r'(?P<year>[0-9][0-9][0-9][0-9])' - r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])' - r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])' - r'"') -\end{verbatim} - -It's obviously much easier to retrieve \code{m.group('zonem')}, -instead of having to remember to retrieve group 9. - -The syntax for backreferences in an expression such as -\regexp{(...)\e 1} refers to the number of the group. There's -naturally a variant that uses the group name instead of the number. -This is another Python extension: \regexp{(?P=\var{name})} indicates -that the contents of the group called \var{name} should again be matched -at the current point. The regular expression for finding doubled -words, \regexp{(\e b\e w+)\e s+\e 1} can also be written as -\regexp{(?P<word>\e b\e w+)\e s+(?P=word)}: - -\begin{verbatim} ->>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') ->>> p.search('Paris in the the spring').group() -'the the' -\end{verbatim} - -\subsection{Lookahead Assertions} - -Another zero-width assertion is the lookahead assertion. Lookahead -assertions are available in both positive and negative form, and -look like this: - -\begin{itemize} -\item[\regexp{(?=...)}] Positive lookahead assertion. This succeeds -if the contained regular expression, represented here by \code{...}, -successfully matches at the current location, and fails otherwise. 
-But, once the contained expression has been tried, the matching engine -doesn't advance at all; the rest of the pattern is tried right where -the assertion started. - -\item[\regexp{(?!...)}] Negative lookahead assertion. This is the -opposite of the positive assertion; it succeeds if the contained expression -\emph{doesn't} match at the current position in the string. -\end{itemize} - -To make this concrete, let's look at a case where a lookahead is -useful. Consider a simple pattern to match a filename and split it -apart into a base name and an extension, separated by a \samp{.}. For -example, in \samp{news.rc}, \samp{news} is the base name, and -\samp{rc} is the filename's extension. - -The pattern to match this is quite simple: - -\regexp{.*[.].*\$} - -Notice that the \samp{.} needs to be treated specially because it's a -metacharacter; I've put it inside a character class. Also notice the -trailing \regexp{\$}; this is added to ensure that all the rest of the -string must be included in the extension. This regular expression -matches \samp{foo.bar} and \samp{autoexec.bat} and \samp{sendmail.cf} and -\samp{printers.conf}. - -Now, consider complicating the problem a bit; what if you want to -match filenames where the extension is not \samp{bat}? -Some incorrect attempts: - -\verb|.*[.][^b].*$| -% $ - -The first attempt above tries to exclude \samp{bat} by requiring that -the first character of the extension is not a \samp{b}. This is -wrong, because the pattern also doesn't match \samp{foo.bar}. - -% Messes up the HTML without the curly braces around \^ -\regexp{.*[.]([{\^}b]..|.[{\^}a].|..[{\^}t])\$} - -The expression gets messier when you try to patch up the first -solution by requiring one of the following cases to match: the first -character of the extension isn't \samp{b}; the second character isn't -\samp{a}; or the third character isn't \samp{t}. 
This accepts -\samp{foo.bar} and rejects \samp{autoexec.bat}, but it requires a -three-letter extension and won't accept a filename with a two-letter -extension such as \samp{sendmail.cf}. We'll complicate the pattern -again in an effort to fix it. - -\regexp{.*[.]([{\^}b].?.?|.[{\^}a]?.?|..?[{\^}t]?)\$} - -In the third attempt, the second and third letters are both made -optional in order to allow matching extensions shorter than three -characters, such as \samp{sendmail.cf}. - -The pattern's getting really complicated now, which makes it hard to -read and understand. Worse, if the problem changes and you want to -exclude both \samp{bat} and \samp{exe} as extensions, the pattern -would get even more complicated and confusing. - -A negative lookahead cuts through all this confusion: - -\regexp{.*[.](?!bat\$).*\$} -% $ - -The negative lookahead means: if the expression \regexp{bat\$} doesn't match at -this point, try the rest of the pattern; if \regexp{bat\$} does match, -the whole pattern will fail. The trailing \regexp{\$} inside the assertion is required to -ensure that something like \samp{sample.batch}, where the extension -only starts with \samp{bat}, will be allowed. - -Excluding another filename extension is now easy; simply add it as an -alternative inside the assertion. The following pattern excludes -filenames that end in either \samp{bat} or \samp{exe}: - -\regexp{.*[.](?!bat\$|exe\$).*\$} -% $ - - -\section{Modifying Strings} - -Up to this point, we've simply performed searches against a static -string.
Regular expressions are also commonly used to modify strings -in various ways, using the following \class{RegexObject} methods: - -\begin{tableii}{c|l}{code}{Method/Attribute}{Purpose} - \lineii{split()}{Split the string into a list, splitting it wherever the RE matches} - \lineii{sub()}{Find all substrings where the RE matches, and replace them with a different string} - \lineii{subn()}{Does the same thing as \method{sub()}, - but returns the new string and the number of replacements} -\end{tableii} - - -\subsection{Splitting Strings} - -The \method{split()} method of a \class{RegexObject} splits a string -apart wherever the RE matches, returning a list of the pieces. -It's similar to the \method{split()} method of strings but -provides much more -generality in the delimiters that you can split by; -the string method only supports splitting by whitespace or by -a fixed string. As you'd expect, there's a module-level -\function{re.split()} function, too. - -\begin{methoddesc}{split}{string \optional{, maxsplit\code{ = 0}}} - Split \var{string} by the matches of the regular expression. If - capturing parentheses are used in the RE, then their contents will - also be returned as part of the resulting list. If \var{maxsplit} - is nonzero, at most \var{maxsplit} splits are performed. -\end{methoddesc} - -You can limit the number of splits made by passing a value for -\var{maxsplit}. When \var{maxsplit} is nonzero, at most -\var{maxsplit} splits will be made, and the remainder of the string is -returned as the final element of the list. In the following example, -the delimiter is any sequence of non-alphanumeric characters.
- -\begin{verbatim} ->>> p = re.compile(r'\W+') ->>> p.split('This is a test, short and sweet, of split().') -['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] ->>> p.split('This is a test, short and sweet, of split().', 3) -['This', 'is', 'a', 'test, short and sweet, of split().'] -\end{verbatim} - -Sometimes you're not only interested in what the text between -delimiters is, but also need to know what the delimiter was. If -capturing parentheses are used in the RE, then their values are also -returned as part of the list. Compare the following calls: - -\begin{verbatim} ->>> p = re.compile(r'\W+') ->>> p2 = re.compile(r'(\W+)') ->>> p.split('This... is a test.') -['This', 'is', 'a', 'test', ''] ->>> p2.split('This... is a test.') -['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] -\end{verbatim} - -The module-level function \function{re.split()} adds the RE to be -used as the first argument, but is otherwise the same. - -\begin{verbatim} ->>> re.split('[\W]+', 'Words, words, words.') -['Words', 'words', 'words', ''] ->>> re.split('([\W]+)', 'Words, words, words.') -['Words', ', ', 'words', ', ', 'words', '.', ''] ->>> re.split('[\W]+', 'Words, words, words.', 1) -['Words', 'words, words.'] -\end{verbatim} - -\subsection{Search and Replace} - -Another common task is to find all the matches for a pattern, and -replace them with a different string. The \method{sub()} method takes -a replacement value, which can be either a string or a function, and -the string to be processed. - -\begin{methoddesc}{sub}{replacement, string\optional{, count\code{ = 0}}} -Returns the string obtained by replacing the leftmost non-overlapping -occurrences of the RE in \var{string} by the replacement -\var{replacement}. If the pattern isn't found, \var{string} is returned -unchanged. - -The optional argument \var{count} is the maximum number of pattern -occurrences to be replaced; \var{count} must be a non-negative -integer. 
The default value of 0 means to replace all occurrences. -\end{methoddesc} - -Here's a simple example of using the \method{sub()} method. It -replaces colour names with the word \samp{colour}: - -\begin{verbatim} ->>> p = re.compile( '(blue|white|red)') ->>> p.sub( 'colour', 'blue socks and red shoes') -'colour socks and colour shoes' ->>> p.sub( 'colour', 'blue socks and red shoes', count=1) -'colour socks and red shoes' -\end{verbatim} - -The \method{subn()} method does the same work, but returns a 2-tuple -containing the new string value and the number of replacements -that were performed: - -\begin{verbatim} ->>> p = re.compile( '(blue|white|red)') ->>> p.subn( 'colour', 'blue socks and red shoes') -('colour socks and colour shoes', 2) ->>> p.subn( 'colour', 'no colours at all') -('no colours at all', 0) -\end{verbatim} - -Empty matches are replaced only when they're not -adjacent to a previous match. - -\begin{verbatim} ->>> p = re.compile('x*') ->>> p.sub('-', 'abxd') -'-a-b-d-' -\end{verbatim} - -If \var{replacement} is a string, any backslash escapes in it are -processed. That is, \samp{\e n} is converted to a single newline -character, \samp{\e r} is converted to a carriage return, and so forth. -Unknown escapes such as \samp{\e j} are left alone. Backreferences, -such as \samp{\e 6}, are replaced with the substring matched by the -corresponding group in the RE. This lets you incorporate -portions of the original text in the resulting -replacement string. - -This example matches the word \samp{section} followed by a string -enclosed in \samp{\{}, \samp{\}}, and changes \samp{section} to -\samp{subsection}: - -\begin{verbatim} ->>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) ->>> p.sub(r'subsection{\1}','section{First} section{second}') -'subsection{First} subsection{second}' -\end{verbatim} - -There's also a syntax for referring to named groups as defined by the -\regexp{(?P<name>...)} syntax. 
\samp{\e g<name>} will use the -substring matched by the group named \samp{name}, and -\samp{\e g<\var{number}>} -uses the corresponding group number. -\samp{\e g<2>} is therefore equivalent to \samp{\e 2}, -but isn't ambiguous in a -replacement string such as \samp{\e g<2>0}. (\samp{\e 20} would be -interpreted as a reference to group 20, not a reference to group 2 -followed by the literal character \character{0}.) The following -substitutions are all equivalent, but use all three variations of the -replacement string. - -\begin{verbatim} ->>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) ->>> p.sub(r'subsection{\1}','section{First}') -'subsection{First}' ->>> p.sub(r'subsection{\g<1>}','section{First}') -'subsection{First}' ->>> p.sub(r'subsection{\g<name>}','section{First}') -'subsection{First}' -\end{verbatim} - -\var{replacement} can also be a function, which gives you even more -control. If \var{replacement} is a function, the function is -called for every non-overlapping occurrence of \var{pattern}. On each -call, the function is -passed a \class{MatchObject} argument for the match -and can use this information to compute the desired replacement string and return it. - -In the following example, the replacement function translates -decimals into hexadecimal: - -\begin{verbatim} ->>> def hexrepl( match ): -... "Return the hex string for a decimal number" -... value = int( match.group() ) -... return hex(value) -... ->>> p = re.compile(r'\d+') ->>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') -'Call 0xffd2 for printing, 0xc000 for user code.' -\end{verbatim} - -When using the module-level \function{re.sub()} function, the pattern -is passed as the first argument. The pattern may be a string or a -\class{RegexObject}; if you need to specify regular expression flags, -you must either use a \class{RegexObject} as the first parameter, or use -embedded modifiers in the pattern, e.g. 
\code{sub("(?i)b+", "x", "bbbb -BBBB")} returns \code{'x x'}. - -\section{Common Problems} - -Regular expressions are a powerful tool for some applications, but in -some ways their behaviour isn't intuitive and at times they don't -behave the way you may expect them to. This section will point out -some of the most common pitfalls. - -\subsection{Use String Methods} - -Sometimes using the \module{re} module is a mistake. If you're -matching a fixed string, or a single character class, and you're not -using any \module{re} features such as the \constant{IGNORECASE} flag, -then the full power of regular expressions may not be required. -Strings have several methods for performing operations with fixed -strings and they're usually much faster, because the implementation is -a single small C loop that's been optimized for the purpose, instead -of the large, more generalized regular expression engine. - -One example might be replacing a single fixed string with another -one; for example, you might replace \samp{word} -with \samp{deed}. \code{re.sub()} seems like the function to use for -this, but consider the \method{replace()} method. Note that -\function{replace()} will also replace \samp{word} inside -words, turning \samp{swordfish} into \samp{sdeedfish}, but the -na{\"\i}ve RE \regexp{word} would have done that, too. (To avoid performing -the substitution on parts of words, the pattern would have to be -\regexp{\e bword\e b}, in order to require that \samp{word} have a -word boundary on either side. This takes the job beyond -\method{replace}'s abilities.) - -Another common task is deleting every occurrence of a single character -from a string or replacing it with another single character. You -might do this with something like \code{re.sub('\e n', ' ', S)}, but -\method{translate()} is capable of doing both tasks -and will be faster than any regular expression operation can be. 
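
As a rough sketch of the string-method alternatives described above, the
following compares \method{replace()} and \method{translate()} with their
\function{re.sub()} counterparts. It assumes Python 3, whereas the
interactive sessions in this document use Python 2 syntax; the sample
strings and variable names are made up for illustration.

\begin{verbatim}
import re

text = 'swordfish and word games'

# Fixed-string replacement: replace() and the naive RE give the same
# result, and both also rewrite 'word' inside 'swordfish'.
replaced = text.replace('word', 'deed')
assert replaced == re.sub('word', 'deed', text)

# Whole-word replacement needs a word boundary, which replace()
# cannot express, so this job does call for the re module.
whole_words = re.sub(r'\bword\b', 'deed', text)

lines = 'one\ntwo\nthree'

# Replace every newline with a space. In Python 3, translate() takes a
# mapping built here with str.maketrans().
spaced = lines.translate(str.maketrans('\n', ' '))
assert spaced == re.sub('\n', ' ', lines)

# Delete every newline by mapping its code point to None.
joined = lines.translate({ord('\n'): None})
\end{verbatim}

Here \code{replaced} is \code{'sdeedfish and deed games'}, while the
word-boundary version leaves \samp{swordfish} untouched.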
- -In short, before turning to the \module{re} module, consider whether -your problem can be solved with a faster and simpler string method. - -\subsection{match() versus search()} - -The \function{match()} function only checks if the RE matches at -the beginning of the string while \function{search()} will scan -forward through the string for a match. -It's important to keep this distinction in mind. Remember, -\function{match()} reports only a match that -starts at position 0; if the match would start anywhere else, -\function{match()} will \emph{not} report it. - -\begin{verbatim} ->>> print re.match('super', 'superstition').span() -(0, 5) ->>> print re.match('super', 'insuperable') -None -\end{verbatim} - -On the other hand, \function{search()} will scan forward through the -string, reporting the first match it finds. - -\begin{verbatim} ->>> print re.search('super', 'superstition').span() -(0, 5) ->>> print re.search('super', 'insuperable').span() -(2, 7) -\end{verbatim} - -Sometimes you'll be tempted to keep using \function{re.match()}, and -just add \regexp{.*} to the front of your RE. Resist this temptation -and use \function{re.search()} instead. The regular expression -compiler does some analysis of REs in order to speed up the process of -looking for a match. One such analysis figures out what the first -character of a match must be; for example, a pattern starting with -\regexp{Crow} must match starting with a \character{C}. The analysis -lets the engine quickly scan through the string looking for the -starting character, only trying the full match if a \character{C} is found. - -Adding \regexp{.*} defeats this optimization, requiring scanning to -the end of the string and then backtracking to find a match for the -rest of the RE. Use \function{re.search()} instead. - -\subsection{Greedy versus Non-Greedy} - -When repeating a regular expression, as in \regexp{a*}, the resulting -action is to consume as much of the string as possible.
This -fact often bites you when you're trying to match a pair of -balanced delimiters, such as the angle brackets surrounding an HTML -tag. The na{\"\i}ve pattern for matching a single HTML tag doesn't -work because of the greedy nature of \regexp{.*}. - -\begin{verbatim} ->>> s = '<html><head><title>Title</title>' ->>> len(s) -32 ->>> print re.match('<.*>', s).span() -(0, 32) ->>> print re.match('<.*>', s).group() -<html><head><title>Title</title> -\end{verbatim} - -The RE matches the \character{<} in \samp{<html>}, and the -\regexp{.*} consumes the rest of the string. There's still more left -in the RE, though, and the \regexp{>} can't match at the end of -the string, so the regular expression engine has to backtrack -character by character until it finds a match for the \regexp{>}. -The final match extends from the \character{<} in \samp{<html>} -to the \character{>} in \samp{</title>}, which isn't what you want. - -In this case, the solution is to use the non-greedy qualifiers -\regexp{*?}, \regexp{+?}, \regexp{??}, or -\regexp{\{\var{m},\var{n}\}?}, which match as \emph{little} text as -possible. In the above example, the \character{>} is tried -immediately after the first \character{<} matches, and when it fails, -the engine advances a character at a time, retrying the \character{>} -at every step. This produces just the right result: - -\begin{verbatim} ->>> print re.match('<.*?>', s).group() -<html> -\end{verbatim} - -(Note that parsing HTML or XML with regular expressions is painful. -Quick-and-dirty patterns will handle common cases, but HTML and XML -have special cases that will break the obvious regular expression; by -the time you've written a regular expression that handles all of the -possible cases, the patterns will be \emph{very} complicated. Use an -HTML or XML parser module for such tasks.) 
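
A minimal sketch of the parser-based approach, assuming Python 3 and its
standard \module{html.parser} module, which this document does not
otherwise cover. The \class{TagCollector} class is a hypothetical name
chosen for illustration; the string is the one from the example above.

\begin{verbatim}
import re
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record the name of every opening tag the parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Called once per opening tag; attrs is a list of (name, value).
        self.tags.append(tag)


s = '<html><head><title>Title</title>'

collector = TagCollector()
collector.feed(s)
collector.close()
start_tags = collector.tags      # ['html', 'head', 'title']

# For comparison, the quick-and-dirty non-greedy pattern returns raw
# tag text, closing tags included, and knows nothing about HTML syntax.
non_greedy = re.findall('<.*?>', s)
\end{verbatim}

The parser reports tag names and attributes separately, so cases such as
a \character{>} inside an attribute value are handled by the parser
rather than by an increasingly complicated pattern.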
- -\subsection{Not Using re.VERBOSE} - -By now you've probably noticed that regular expressions are a very -compact notation, but they're not terribly readable. REs of -moderate complexity can become lengthy collections of backslashes, -parentheses, and metacharacters, making them difficult to read and -understand. - -For such REs, specifying the \code{re.VERBOSE} flag when -compiling the regular expression can be helpful, because it allows -you to format the regular expression more clearly. - -The \code{re.VERBOSE} flag has several effects. Whitespace in the -regular expression that \emph{isn't} inside a character class is -ignored. This means that an expression such as \regexp{dog | cat} is -equivalent to the less readable \regexp{dog|cat}, but \regexp{[a b]} -will still match the characters \character{a}, \character{b}, or a -space. You can also put comments inside a RE; comments -extend from a \samp{\#} character to the next newline. When used with -triple-quoted strings, this enables REs to be formatted more neatly: - -\begin{verbatim} -pat = re.compile(r""" - \s* # Skip leading whitespace - (?P<header>[^:]+) # Header name - \s* : # Whitespace, and a colon - (?P<value>.*?) # The header's value -- *? used to - # lose the following trailing whitespace - \s*$ # Trailing whitespace to end-of-line -""", re.VERBOSE) -\end{verbatim} -% $ - -This is far more readable than: - -\begin{verbatim} -pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") -\end{verbatim} -% $ - -\section{Feedback} - -Regular expressions are a complicated topic. Did this document help -you understand them? Were there parts that were unclear, or problems -you encountered that weren't covered here? If so, please send -suggestions for improvements to the author. - -The most complete book on regular expressions is almost certainly -Jeffrey Friedl's \citetitle{Mastering Regular Expressions}, published -by O'Reilly.
Unfortunately, it exclusively concentrates on Perl and -Java's flavours of regular expressions, and doesn't contain any Python -material at all, so it won't be useful as a reference for programming -in Python. (The first edition covered Python's now-removed -\module{regex} module, which won't help you much.) Consider checking -it out from your library. - -\end{document} - |