author | Georg Brandl <georg@python.org> | 2007-08-15 14:28:22 (GMT) |
---|---|---|
committer | Georg Brandl <georg@python.org> | 2007-08-15 14:28:22 (GMT) |
commit | 116aa62bf54a39697e25f21d6cf6799f7faa1349 (patch) | |
tree | 8db5729518ed4ca88e26f1e26cc8695151ca3eb3 /Doc/howto/regex.rst | |
parent | 739c01d47b9118d04e5722333f0e6b4d0c8bdd9e (diff) | |
Move the 3k reST doc tree in place.
Diffstat (limited to 'Doc/howto/regex.rst')
-rw-r--r-- | Doc/howto/regex.rst | 1377 |
1 files changed, 1377 insertions, 0 deletions
diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst new file mode 100644 index 0000000..b200764 --- /dev/null +++ b/Doc/howto/regex.rst @@ -0,0 +1,1377 @@ +**************************** + Regular Expression HOWTO +**************************** + +:Author: A.M. Kuchling +:Release: 0.05 + +.. % TODO: +.. % Document lookbehind assertions +.. % Better way of displaying a RE, a string, and what it matches +.. % Mention optional argument to match.groups() +.. % Unicode (at least a reference) + + +.. topic:: Abstract + + This document is an introductory tutorial to using regular expressions in Python + with the :mod:`re` module. It provides a gentler introduction than the + corresponding section in the Library Reference. + + +Introduction +============ + +The :mod:`re` module was added in Python 1.5, and provides Perl-style regular +expression patterns. Earlier versions of Python came with the :mod:`regex` +module, which provided Emacs-style patterns. The :mod:`regex` module was +removed completely in Python 2.5. + +Regular expressions (called REs, or regexes, or regex patterns) are essentially +a tiny, highly specialized programming language embedded inside Python and made +available through the :mod:`re` module. Using this little language, you specify +the rules for the set of possible strings that you want to match; this set might +contain English sentences, or e-mail addresses, or TeX commands, or anything you +like. You can then ask questions such as "Does this string match the pattern?", +or "Is there a match for the pattern anywhere in this string?". You can also +use REs to modify a string or to split it apart in various ways. + +Regular expression patterns are compiled into a series of bytecodes which are +then executed by a matching engine written in C. For advanced use, it may be +necessary to pay careful attention to how the engine will execute a given RE, +and write the RE in a certain way in order to produce bytecode that runs faster. 
+Optimization isn't covered in this document, because it requires that you have a +good understanding of the matching engine's internals. + +The regular expression language is relatively small and restricted, so not all +possible string processing tasks can be done using regular expressions. There +are also tasks that *can* be done with regular expressions, but the expressions +turn out to be very complicated. In these cases, you may be better off writing +Python code to do the processing; while Python code will be slower than an +elaborate regular expression, it will also probably be more understandable. + + +Simple Patterns +=============== + +We'll start by learning about the simplest possible regular expressions. Since +regular expressions are used to operate on strings, we'll begin with the most +common task: matching characters. + +For a detailed explanation of the computer science underlying regular +expressions (deterministic and non-deterministic finite automata), you can refer +to almost any textbook on writing compilers. + + +Matching Characters +------------------- + +Most letters and characters will simply match themselves. For example, the +regular expression ``test`` will match the string ``test`` exactly. (You can +enable a case-insensitive mode that would let this RE match ``Test`` or ``TEST`` +as well; more about this later.) + +There are exceptions to this rule; some characters are special +:dfn:`metacharacters`, and don't match themselves. Instead, they signal that +some out-of-the-ordinary thing should be matched, or they affect other portions +of the RE by repeating them or changing their meaning. Much of this document is +devoted to discussing various metacharacters and what they do. + +Here's a complete list of the metacharacters; their meanings will be discussed +in the rest of this HOWTO. :: + + . ^ $ * + ? { [ ] \ | ( ) + +The first metacharacters we'll look at are ``[`` and ``]``. 
They're used for +specifying a character class, which is a set of characters that you wish to +match. Characters can be listed individually, or a range of characters can be +indicated by giving two characters and separating them by a ``'-'``. For +example, ``[abc]`` will match any of the characters ``a``, ``b``, or ``c``; this +is the same as ``[a-c]``, which uses a range to express the same set of +characters. If you wanted to match only lowercase letters, your RE would be +``[a-z]``. + +.. % $ + +Metacharacters are not active inside classes. For example, ``[akm$]`` will +match any of the characters ``'a'``, ``'k'``, ``'m'``, or ``'$'``; ``'$'`` is +usually a metacharacter, but inside a character class it's stripped of its +special nature. + +You can match the characters not listed within the class by :dfn:`complementing` +the set. This is indicated by including a ``'^'`` as the first character of the +class; ``'^'`` outside a character class will simply match the ``'^'`` +character. For example, ``[^5]`` will match any character except ``'5'``. + +Perhaps the most important metacharacter is the backslash, ``\``. As in Python +string literals, the backslash can be followed by various characters to signal +various special sequences. It's also used to escape all the metacharacters so +you can still match them in patterns; for example, if you need to match a ``[`` +or ``\``, you can precede them with a backslash to remove their special +meaning: ``\[`` or ``\\``. + +Some of the special sequences beginning with ``'\'`` represent predefined sets +of characters that are often useful, such as the set of digits, the set of +letters, or the set of anything that isn't whitespace. The following predefined +special sequences are available: + +``\d`` + Matches any decimal digit; this is equivalent to the class ``[0-9]``. + +``\D`` + Matches any non-digit character; this is equivalent to the class ``[^0-9]``. 
+ +``\s`` + Matches any whitespace character; this is equivalent to the class ``[ + \t\n\r\f\v]``. + +``\S`` + Matches any non-whitespace character; this is equivalent to the class ``[^ + \t\n\r\f\v]``. + +``\w`` + Matches any alphanumeric character; this is equivalent to the class + ``[a-zA-Z0-9_]``. + +``\W`` + Matches any non-alphanumeric character; this is equivalent to the class + ``[^a-zA-Z0-9_]``. + +These sequences can be included inside a character class. For example, +``[\s,.]`` is a character class that will match any whitespace character, or +``','`` or ``'.'``. + +The final metacharacter in this section is ``.``. It matches anything except a +newline character, and there's an alternate mode (``re.DOTALL``) where it will +match even a newline. ``'.'`` is often used where you want to match "any +character". + + +Repeating Things +---------------- + +Being able to match varying sets of characters is the first thing regular +expressions can do that isn't already possible with the methods available on +strings. However, if that was the only additional capability of regexes, they +wouldn't be much of an advance. Another capability is that you can specify that +portions of the RE must be repeated a certain number of times. + +The first metacharacter for repeating things that we'll look at is ``*``. ``*`` +doesn't match the literal character ``*``; instead, it specifies that the +previous character can be matched zero or more times, instead of exactly once. + +For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``), +``caaat`` (3 ``a`` characters), and so forth. The RE engine has various +internal limitations stemming from the size of C's ``int`` type that will +prevent it from matching over 2 billion ``a`` characters; you probably don't +have enough memory to construct a string that large, so you shouldn't run into +that limit. 
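The ``ca*t`` example is easy to check for yourself. The following is only a minimal sketch, written for Python 3 (the interactive sessions in this HOWTO use Python 2 syntax); the string ``cot`` is an extra test case added here for contrast.

```python
import re

# ``ca*t`` allows zero or more ``a`` characters between ``c`` and ``t``.
pattern = re.compile(r'ca*t')

# fullmatch() succeeds only if the whole string matches the pattern.
results = {s: pattern.fullmatch(s) is not None
           for s in ['ct', 'cat', 'caaat', 'cot']}
print(results)
```

Only ``cot`` is rejected, because ``o`` is neither an ``a`` nor the final ``t``.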
+
+Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
+engine will try to repeat it as many times as possible.  If later portions of the
+pattern don't match, the matching engine will then back up and try again with
+fewer repetitions.
+
+A step-by-step example will make this more obvious.  Let's consider the
+expression ``a[bcd]*b``.  This matches the letter ``'a'``, zero or more letters
+from the class ``[bcd]``, and finally ends with a ``'b'``.  Now imagine matching
+this RE against the string ``abcbd``.
+
++------+-----------+---------------------------------+
+| Step | Matched   | Explanation                     |
++======+===========+=================================+
+| 1    | ``a``     | The ``a`` in the RE matches.    |
++------+-----------+---------------------------------+
+| 2    | ``abcbd`` | The engine matches ``[bcd]*``,  |
+|      |           | going as far as it can, which   |
+|      |           | is to the end of the string.    |
++------+-----------+---------------------------------+
+| 3    | *Failure* | The engine tries to match       |
+|      |           | ``b``, but the current position |
+|      |           | is at the end of the string, so |
+|      |           | it fails.                       |
++------+-----------+---------------------------------+
+| 4    | ``abcb``  | Back up, so that ``[bcd]*``     |
+|      |           | matches one less character.     |
++------+-----------+---------------------------------+
+| 5    | *Failure* | Try ``b`` again, but the        |
+|      |           | current position is at the last |
+|      |           | character, which is a ``'d'``.  |
++------+-----------+---------------------------------+
+| 6    | ``abc``   | Back up again, so that          |
+|      |           | ``[bcd]*`` is only matching     |
+|      |           | ``bc``.                         |
++------+-----------+---------------------------------+
+| 7    | ``abcb``  | Try ``b`` again.  This time     |
+|      |           | the character at the            |
+|      |           | current position is ``'b'``, so |
+|      |           | it succeeds.                    |
++------+-----------+---------------------------------+
+
+The end of the RE has now been reached, and it has matched ``abcb``.
This +demonstrates how the matching engine goes as far as it can at first, and if no +match is found it will then progressively back up and retry the rest of the RE +again and again. It will back up until it has tried zero matches for +``[bcd]*``, and if that subsequently fails, the engine will conclude that the +string doesn't match the RE at all. + +Another repeating metacharacter is ``+``, which matches one or more times. Pay +careful attention to the difference between ``*`` and ``+``; ``*`` matches +*zero* or more times, so whatever's being repeated may not be present at all, +while ``+`` requires at least *one* occurrence. To use a similar example, +``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match +``ct``. + +There are two more repeating qualifiers. The question mark character, ``?``, +matches either once or zero times; you can think of it as marking something as +being optional. For example, ``home-?brew`` matches either ``homebrew`` or +``home-brew``. + +The most complicated repeated qualifier is ``{m,n}``, where *m* and *n* are +decimal integers. This qualifier means there must be at least *m* repetitions, +and at most *n*. For example, ``a/{1,3}b`` will match ``a/b``, ``a//b``, and +``a///b``. It won't match ``ab``, which has no slashes, or ``a////b``, which +has four. + +You can omit either *m* or *n*; in that case, a reasonable value is assumed for +the missing value. Omitting *m* is interpreted as a lower limit of 0, while +omitting *n* results in an upper bound of infinity --- actually, the upper bound +is the 2-billion limit mentioned earlier, but that might as well be infinity. + +Readers of a reductionist bent may notice that the three other qualifiers can +all be expressed using this notation. ``{0,}`` is the same as ``*``, ``{1,}`` +is equivalent to ``+``, and ``{0,1}`` is the same as ``?``. It's better to use +``*``, ``+``, or ``?`` when you can, simply because they're shorter and easier +to read. 
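The qualifiers described in this section can be tried out in a few lines. This is a rough sketch for Python 3 rather than part of the original HOWTO; the helper ``matches`` is a hypothetical convenience function, and ``home--brew`` is an extra test string added for contrast.

```python
import re

def matches(pattern, text):
    """Hypothetical helper: True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# ``+`` needs at least one ``a``, so ``ct`` is rejected.
plus = [matches(r'ca+t', s) for s in ['ct', 'cat', 'caaat']]

# ``?`` makes the hyphen optional, but does not allow two of them.
optional = [matches(r'home-?brew', s)
            for s in ['homebrew', 'home-brew', 'home--brew']]

# ``{1,3}`` accepts between one and three slashes.
bounded = [matches(r'a/{1,3}b', s)
           for s in ['ab', 'a/b', 'a//b', 'a///b', 'a////b']]

# Greedy matching with backtracking: the engine ends up matching ``abcb``.
greedy = re.match(r'a[bcd]*b', 'abcbd').group()

print(plus, optional, bounded, greedy)
```

The last line reproduces the step-by-step ``a[bcd]*b`` example: ``match()`` reports the text that remains matched once the engine has backed up far enough for the final ``b`` to succeed.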
+ + +Using Regular Expressions +========================= + +Now that we've looked at some simple regular expressions, how do we actually use +them in Python? The :mod:`re` module provides an interface to the regular +expression engine, allowing you to compile REs into objects and then perform +matches with them. + + +Compiling Regular Expressions +----------------------------- + +Regular expressions are compiled into :class:`RegexObject` instances, which have +methods for various operations such as searching for pattern matches or +performing string substitutions. :: + + >>> import re + >>> p = re.compile('ab*') + >>> print p + <re.RegexObject instance at 80b4150> + +:func:`re.compile` also accepts an optional *flags* argument, used to enable +various special features and syntax variations. We'll go over the available +settings later, but for now a single example will do:: + + >>> p = re.compile('ab*', re.IGNORECASE) + +The RE is passed to :func:`re.compile` as a string. REs are handled as strings +because regular expressions aren't part of the core Python language, and no +special syntax was created for expressing them. (There are applications that +don't need REs at all, so there's no need to bloat the language specification by +including them.) Instead, the :mod:`re` module is simply a C extension module +included with Python, just like the :mod:`socket` or :mod:`zlib` modules. + +Putting REs in strings keeps the Python language simpler, but has one +disadvantage which is the topic of the next section. + + +The Backslash Plague +-------------------- + +As stated earlier, regular expressions use the backslash character (``'\'``) to +indicate special forms or to allow special characters to be used without +invoking their special meaning. This conflicts with Python's usage of the same +character for the same purpose in string literals. + +Let's say you want to write a RE that matches the string ``\section``, which +might be found in a LaTeX file. 
To figure out what to write in the program +code, start with the desired string to be matched. Next, you must escape any +backslashes and other metacharacters by preceding them with a backslash, +resulting in the string ``\\section``. The resulting string that must be passed +to :func:`re.compile` must be ``\\section``. However, to express this as a +Python string literal, both backslashes must be escaped *again*. + ++-------------------+------------------------------------------+ +| Characters | Stage | ++===================+==========================================+ +| ``\section`` | Text string to be matched | ++-------------------+------------------------------------------+ +| ``\\section`` | Escaped backslash for :func:`re.compile` | ++-------------------+------------------------------------------+ +| ``"\\\\section"`` | Escaped backslashes for a string literal | ++-------------------+------------------------------------------+ + +In short, to match a literal backslash, one has to write ``'\\\\'`` as the RE +string, because the regular expression must be ``\\``, and each backslash must +be expressed as ``\\`` inside a regular Python string literal. In REs that +feature backslashes repeatedly, this leads to lots of repeated backslashes and +makes the resulting strings difficult to understand. + +The solution is to use Python's raw string notation for regular expressions; +backslashes are not handled in any special way in a string literal prefixed with +``'r'``, so ``r"\n"`` is a two-character string containing ``'\'`` and ``'n'``, +while ``"\n"`` is a one-character string containing a newline. Regular +expressions will often be written in Python code using this raw string notation. 
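The difference between the two notations can be checked directly. The following is a small sketch for Python 3; the sample text ``see \section{intro}`` is made up for illustration.

```python
import re

# A regular string interprets the escape; a raw string keeps the backslash.
newline = "\n"
two_chars = r"\n"
print(len(newline), len(two_chars))

# Both literals below denote the same RE text: two backslashes, then "section".
regular_form = "\\\\section"
raw_form = r"\\section"
print(regular_form == raw_form)

# Hypothetical sample text containing a LaTeX command.
text = r"see \section{intro}"
found = re.search(raw_form, text).group()
print(found)
```

The matched text is the eight characters ``\section``: the RE's doubled backslash stands for a single literal backslash in the input.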
+ ++-------------------+------------------+ +| Regular String | Raw string | ++===================+==================+ +| ``"ab*"`` | ``r"ab*"`` | ++-------------------+------------------+ +| ``"\\\\section"`` | ``r"\\section"`` | ++-------------------+------------------+ +| ``"\\w+\\s+\\1"`` | ``r"\w+\s+\1"`` | ++-------------------+------------------+ + + +Performing Matches +------------------ + +Once you have an object representing a compiled regular expression, what do you +do with it? :class:`RegexObject` instances have several methods and attributes. +Only the most significant ones will be covered here; consult `the Library +Reference <http://www.python.org/doc/lib/module-re.html>`_ for a complete +listing. + ++------------------+-----------------------------------------------+ +| Method/Attribute | Purpose | ++==================+===============================================+ +| ``match()`` | Determine if the RE matches at the beginning | +| | of the string. | ++------------------+-----------------------------------------------+ +| ``search()`` | Scan through a string, looking for any | +| | location where this RE matches. | ++------------------+-----------------------------------------------+ +| ``findall()`` | Find all substrings where the RE matches, and | +| | returns them as a list. | ++------------------+-----------------------------------------------+ +| ``finditer()`` | Find all substrings where the RE matches, and | +| | returns them as an iterator. | ++------------------+-----------------------------------------------+ + +:meth:`match` and :meth:`search` return ``None`` if no match can be found. If +they're successful, a ``MatchObject`` instance is returned, containing +information about the match: where it starts and ends, the substring it matched, +and more. + +You can learn about this by interactively experimenting with the :mod:`re` +module. 
If you have Tkinter available, you may also want to look at +:file:`Tools/scripts/redemo.py`, a demonstration program included with the +Python distribution. It allows you to enter REs and strings, and displays +whether the RE matches or fails. :file:`redemo.py` can be quite useful when +trying to debug a complicated RE. Phil Schwartz's `Kodos +<http://www.phil-schwartz.com/kodos.spy>`_ is also an interactive tool for +developing and testing RE patterns. + +This HOWTO uses the standard Python interpreter for its examples. First, run the +Python interpreter, import the :mod:`re` module, and compile a RE:: + + Python 2.2.2 (#1, Feb 10 2003, 12:57:01) + >>> import re + >>> p = re.compile('[a-z]+') + >>> p + <_sre.SRE_Pattern object at 80c3c28> + +Now, you can try matching various strings against the RE ``[a-z]+``. An empty +string shouldn't match at all, since ``+`` means 'one or more repetitions'. +:meth:`match` should return ``None`` in this case, which will cause the +interpreter to print no output. You can explicitly print the result of +:meth:`match` to make this clear. :: + + >>> p.match("") + >>> print p.match("") + None + +Now, let's try it on a string that it should match, such as ``tempo``. In this +case, :meth:`match` will return a :class:`MatchObject`, so you should store the +result in a variable for later use. :: + + >>> m = p.match('tempo') + >>> print m + <_sre.SRE_Match object at 80c4f68> + +Now you can query the :class:`MatchObject` for information about the matching +string. 
:class:`MatchObject` instances also have several methods and +attributes; the most important ones are: + ++------------------+--------------------------------------------+ +| Method/Attribute | Purpose | ++==================+============================================+ +| ``group()`` | Return the string matched by the RE | ++------------------+--------------------------------------------+ +| ``start()`` | Return the starting position of the match | ++------------------+--------------------------------------------+ +| ``end()`` | Return the ending position of the match | ++------------------+--------------------------------------------+ +| ``span()`` | Return a tuple containing the (start, end) | +| | positions of the match | ++------------------+--------------------------------------------+ + +Trying these methods will soon clarify their meaning:: + + >>> m.group() + 'tempo' + >>> m.start(), m.end() + (0, 5) + >>> m.span() + (0, 5) + +:meth:`group` returns the substring that was matched by the RE. :meth:`start` +and :meth:`end` return the starting and ending index of the match. :meth:`span` +returns both start and end indexes in a single tuple. Since the :meth:`match` +method only checks if the RE matches at the start of a string, :meth:`start` +will always be zero. However, the :meth:`search` method of :class:`RegexObject` +instances scans through the string, so the match may not start at zero in that +case. :: + + >>> print p.match('::: message') + None + >>> m = p.search('::: message') ; print m + <re.MatchObject instance at 80c9650> + >>> m.group() + 'message' + >>> m.span() + (4, 11) + +In actual programs, the most common style is to store the :class:`MatchObject` +in a variable, and then check if it was ``None``. This usually looks like:: + + p = re.compile( ... ) + m = p.match( 'string goes here' ) + if m: + print 'Match found: ', m.group() + else: + print 'No match' + +Two :class:`RegexObject` methods return all of the matches for a pattern. 
+:meth:`findall` returns a list of matching strings:: + + >>> p = re.compile('\d+') + >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') + ['12', '11', '10'] + +:meth:`findall` has to create the entire list before it can be returned as the +result. The :meth:`finditer` method returns a sequence of :class:`MatchObject` +instances as an iterator. [#]_ :: + + >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') + >>> iterator + <callable-iterator object at 0x401833ac> + >>> for match in iterator: + ... print match.span() + ... + (0, 2) + (22, 24) + (29, 31) + + +Module-Level Functions +---------------------- + +You don't have to create a :class:`RegexObject` and call its methods; the +:mod:`re` module also provides top-level functions called :func:`match`, +:func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions +take the same arguments as the corresponding :class:`RegexObject` method, with +the RE string added as the first argument, and still return either ``None`` or a +:class:`MatchObject` instance. :: + + >>> print re.match(r'From\s+', 'Fromage amk') + None + >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') + <re.MatchObject instance at 80c5978> + +Under the hood, these functions simply produce a :class:`RegexObject` for you +and call the appropriate method on it. They also store the compiled object in a +cache, so future calls using the same RE are faster. + +Should you use these module-level functions, or should you get the +:class:`RegexObject` and call its methods yourself? That choice depends on how +frequently the RE will be used, and on your personal coding style. If the RE is +being used at only one point in the code, then the module functions are probably +more convenient. 
If a program contains a lot of regular expressions, or re-uses +the same ones in several locations, then it might be worthwhile to collect all +the definitions in one place, in a section of code that compiles all the REs +ahead of time. To take an example from the standard library, here's an extract +from :file:`xmllib.py`:: + + ref = re.compile( ... ) + entityref = re.compile( ... ) + charref = re.compile( ... ) + starttagopen = re.compile( ... ) + +I generally prefer to work with the compiled object, even for one-time uses, but +few people will be as much of a purist about this as I am. + + +Compilation Flags +----------------- + +Compilation flags let you modify some aspects of how regular expressions work. +Flags are available in the :mod:`re` module under two names, a long name such as +:const:`IGNORECASE` and a short, one-letter form such as :const:`I`. (If you're +familiar with Perl's pattern modifiers, the one-letter forms use the same +letters; the short form of :const:`re.VERBOSE` is :const:`re.X`, for example.) +Multiple flags can be specified by bitwise OR-ing them; ``re.I | re.M`` sets +both the :const:`I` and :const:`M` flags, for example. + +Here's a table of the available flags, followed by a more detailed explanation +of each one. 
+ ++---------------------------------+--------------------------------------------+ +| Flag | Meaning | ++=================================+============================================+ +| :const:`DOTALL`, :const:`S` | Make ``.`` match any character, including | +| | newlines | ++---------------------------------+--------------------------------------------+ +| :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches | ++---------------------------------+--------------------------------------------+ +| :const:`LOCALE`, :const:`L` | Do a locale-aware match | ++---------------------------------+--------------------------------------------+ +| :const:`MULTILINE`, :const:`M` | Multi-line matching, affecting ``^`` and | +| | ``$`` | ++---------------------------------+--------------------------------------------+ +| :const:`VERBOSE`, :const:`X` | Enable verbose REs, which can be organized | +| | more cleanly and understandably. | ++---------------------------------+--------------------------------------------+ + + +.. data:: I + IGNORECASE + :noindex: + + Perform case-insensitive matching; character class and literal strings will + match letters by ignoring case. For example, ``[A-Z]`` will match lowercase + letters, too, and ``Spam`` will match ``Spam``, ``spam``, or ``spAM``. This + lowercasing doesn't take the current locale into account; it will if you also + set the :const:`LOCALE` flag. + + +.. data:: L + LOCALE + :noindex: + + Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale. + + Locales are a feature of the C library intended to help in writing programs that + take account of language differences. For example, if you're processing French + text, you'd want to be able to write ``\w+`` to match words, but ``\w`` only + matches the character class ``[A-Za-z]``; it won't match ``'é'`` or ``'ç'``. 
If + your system is configured properly and a French locale is selected, certain C + functions will tell the program that ``'é'`` should also be considered a letter. + Setting the :const:`LOCALE` flag when compiling a regular expression will cause + the resulting compiled object to use these C functions for ``\w``; this is + slower, but also enables ``\w+`` to match French words as you'd expect. + + +.. data:: M + MULTILINE + :noindex: + + (``^`` and ``$`` haven't been explained yet; they'll be introduced in section + :ref:`more-metacharacters`.) + + Usually ``^`` matches only at the beginning of the string, and ``$`` matches + only at the end of the string and immediately before the newline (if any) at the + end of the string. When this flag is specified, ``^`` matches at the beginning + of the string and at the beginning of each line within the string, immediately + following each newline. Similarly, the ``$`` metacharacter matches both at + the end of the string and at the end of each line (immediately preceding each + newline). + + +.. data:: S + DOTALL + :noindex: + + Makes the ``'.'`` special character match any character at all, including a + newline; without this flag, ``'.'`` will match anything *except* a newline. + + +.. data:: X + VERBOSE + :noindex: + + This flag allows you to write regular expressions that are more readable by + granting you more flexibility in how you can format them. When this flag has + been specified, whitespace within the RE string is ignored, except when the + whitespace is in a character class or preceded by an unescaped backslash; this + lets you organize and indent the RE more clearly. This flag also lets you put + comments within a RE that will be ignored by the engine; comments are marked by + a ``'#'`` that's neither in a character class nor preceded by an unescaped + backslash. + + For example, here's a RE that uses :const:`re.VERBOSE`; see how much easier it + is to read?
:: + + charref = re.compile(r""" + &[#] # Start of a numeric entity reference + ( + 0[0-7]+ # Octal form + | [0-9]+ # Decimal form + | x[0-9a-fA-F]+ # Hexadecimal form + ) + ; # Trailing semicolon + """, re.VERBOSE) + + Without the verbose setting, the RE would look like this:: + + charref = re.compile("&#(0[0-7]+" + "|[0-9]+" + "|x[0-9a-fA-F]+);") + + In the above example, Python's automatic concatenation of string literals has + been used to break up the RE into smaller pieces, but it's still more difficult + to understand than the version using :const:`re.VERBOSE`. + + +More Pattern Power +================== + +So far we've only covered a part of the features of regular expressions. In +this section, we'll cover some new metacharacters, and how to use groups to +retrieve portions of the text that was matched. + + +.. _more-metacharacters: + +More Metacharacters +------------------- + +There are some metacharacters that we haven't covered yet. Most of them will be +covered in this section. + +Some of the remaining metacharacters to be discussed are :dfn:`zero-width +assertions`. They don't cause the engine to advance through the string; +instead, they consume no characters at all, and simply succeed or fail. For +example, ``\b`` is an assertion that the current position is located at a word +boundary; the position isn't changed by the ``\b`` at all. This means that +zero-width assertions should never be repeated, because if they match once at a +given location, they can obviously be matched an infinite number of times. + +``|`` + Alternation, or the "or" operator. If A and B are regular expressions, + ``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very + low precedence in order to make it work reasonably when you're alternating + multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``, + not ``Cro``, a ``'w'`` or an ``'S'``, and ``ervo``. 
+ + To match a literal ``'|'``, use ``\|``, or enclose it inside a character class, + as in ``[|]``. + +``^`` + Matches at the beginning of lines. Unless the :const:`MULTILINE` flag has been + set, this will only match at the beginning of the string. In :const:`MULTILINE` + mode, this also matches immediately after each newline within the string. + + For example, if you wish to match the word ``From`` only at the beginning of a + line, the RE to use is ``^From``. :: + + >>> print re.search('^From', 'From Here to Eternity') + <re.MatchObject instance at 80c1520> + >>> print re.search('^From', 'Reciting From Memory') + None + + .. % To match a literal \character{\^}, use \regexp{\e\^} or enclose it + .. % inside a character class, as in \regexp{[{\e}\^]}. + +``$`` + Matches at the end of a line, which is defined as either the end of the string, + or any location followed by a newline character. :: + + >>> print re.search('}$', '{block}') + <re.MatchObject instance at 80adfa8> + >>> print re.search('}$', '{block} ') + None + >>> print re.search('}$', '{block}\n') + <re.MatchObject instance at 80adfa8> + + To match a literal ``'$'``, use ``\$`` or enclose it inside a character class, + as in ``[$]``. + + .. % $ + +``\A`` + Matches only at the start of the string. When not in :const:`MULTILINE` mode, + ``\A`` and ``^`` are effectively the same. In :const:`MULTILINE` mode, they're + different: ``\A`` still matches only at the beginning of the string, but ``^`` + may match at any location inside the string that follows a newline character. + +``\Z`` + Matches only at the end of the string. + +``\b`` + Word boundary. This is a zero-width assertion that matches only at the + beginning or end of a word. A word is defined as a sequence of alphanumeric + characters, so the end of a word is indicated by whitespace or a + non-alphanumeric character. 
+ + The following example matches ``class`` only when it's a complete word; it won't + match when it's contained inside another word. :: + + >>> p = re.compile(r'\bclass\b') + >>> print p.search('no class at all') + <re.MatchObject instance at 80c8f28> + >>> print p.search('the declassified algorithm') + None + >>> print p.search('one subclass is') + None + + There are two subtleties you should remember when using this special sequence. + First, this is the worst collision between Python's string literals and regular + expression sequences. In Python's string literals, ``\b`` is the backspace + character, ASCII value 8. If you're not using raw strings, then Python will + convert the ``\b`` to a backspace, and your RE won't match as you expect it to. + The following example looks the same as our previous RE, but omits the ``'r'`` + in front of the RE string. :: + + >>> p = re.compile('\bclass\b') + >>> print p.search('no class at all') + None + >>> print p.search('\b' + 'class' + '\b') + <re.MatchObject instance at 80c3ee0> + + Second, inside a character class, where there's no use for this assertion, + ``\b`` represents the backspace character, for compatibility with Python's + string literals. + +``\B`` + Another zero-width assertion, this is the opposite of ``\b``, only matching when + the current position is not at a word boundary. + + +Grouping +-------- + +Frequently you need to obtain more information than just whether the RE matched +or not. Regular expressions are often used to dissect strings by writing a RE +divided into several subgroups which match different components of interest. 
+For example, an RFC-822 header line is divided into a header name and a value, +separated by a ``':'``, like this:: + + From: author@example.com + User-Agent: Thunderbird 1.5.0.9 (X11/20061227) + MIME-Version: 1.0 + To: editor@example.com + +This can be handled by writing a regular expression which matches an entire +header line, and has one group which matches the header name, and another group +which matches the header's value. + +Groups are marked by the ``'('``, ``')'`` metacharacters. ``'('`` and ``')'`` +have much the same meaning as they do in mathematical expressions; they group +together the expressions contained inside them, and you can repeat the contents +of a group with a repeating qualifier, such as ``*``, ``+``, ``?``, or +``{m,n}``. For example, ``(ab)*`` will match zero or more repetitions of +``ab``. :: + + >>> p = re.compile('(ab)*') + >>> print p.match('ababababab').span() + (0, 10) + +Groups indicated with ``'('``, ``')'`` also capture the starting and ending +index of the text that they match; this can be retrieved by passing an argument +to :meth:`group`, :meth:`start`, :meth:`end`, and :meth:`span`. Groups are +numbered starting with 0. Group 0 is always present; it's the whole RE, so +:class:`MatchObject` methods all have group 0 as their default argument. Later +we'll see how to express groups that don't capture the span of text that they +match. :: + + >>> p = re.compile('(a)b') + >>> m = p.match('ab') + >>> m.group() + 'ab' + >>> m.group(0) + 'ab' + +Subgroups are numbered from left to right, from 1 upward. Groups can be nested; +to determine the number, just count the opening parenthesis characters, going +from left to right. :: + + >>> p = re.compile('(a(b)c)d') + >>> m = p.match('abcd') + >>> m.group(0) + 'abcd' + >>> m.group(1) + 'abc' + >>> m.group(2) + 'b' + +:meth:`group` can be passed multiple group numbers at a time, in which case it +will return a tuple containing the corresponding values for those groups. 
:: + + >>> m.group(2,1,2) + ('b', 'abc', 'b') + +The :meth:`groups` method returns a tuple containing the strings for all the +subgroups, from 1 up to however many there are. :: + + >>> m.groups() + ('abc', 'b') + +Backreferences in a pattern allow you to specify that the contents of an earlier +capturing group must also be found at the current location in the string. For +example, ``\1`` will succeed if the exact contents of group 1 can be found at +the current position, and fails otherwise. Remember that Python's string +literals also use a backslash followed by numbers to allow including arbitrary +characters in a string, so be sure to use a raw string when incorporating +backreferences in a RE. + +For example, the following RE detects doubled words in a string. :: + + >>> p = re.compile(r'(\b\w+)\s+\1') + >>> p.search('Paris in the the spring').group() + 'the the' + +Backreferences like this aren't often useful for just searching through a string +--- there are few text formats which repeat data in this way --- but you'll soon +find out that they're *very* useful when performing string substitutions. + + +Non-capturing and Named Groups +------------------------------ + +Elaborate REs may use many groups, both to capture substrings of interest, and +to group and structure the RE itself. In complex REs, it becomes difficult to +keep track of the group numbers. There are two features which help with this +problem. Both of them use a common syntax for regular expression extensions, so +we'll look at that first. + +Perl 5 added several additional features to standard regular expressions, and +the Python :mod:`re` module supports most of them. It would have been +difficult to choose new single-keystroke metacharacters or new special sequences +beginning with ``\`` to represent the new features without making Perl's regular +expressions confusingly different from standard REs. 
If you chose ``&`` as a +new metacharacter, for example, old expressions would be assuming that ``&`` was +a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``. + +The solution chosen by the Perl developers was to use ``(?...)`` as the +extension syntax. ``?`` immediately after a parenthesis was a syntax error +because the ``?`` would have nothing to repeat, so this didn't introduce any +compatibility problems. The characters immediately after the ``?`` indicate +what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead +assertion) and ``(?:foo)`` is something else (a non-capturing group containing +the subexpression ``foo``). + +Python adds an extension syntax to Perl's extension syntax. If the first +character after the question mark is a ``P``, you know that it's an extension +that's specific to Python. Currently there are two such extensions: +``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to +a named group. If future versions of Perl 5 add similar features using a +different syntax, the :mod:`re` module will be changed to support the new +syntax, while preserving the Python-specific syntax for compatibility's sake. + +Now that we've looked at the general extension syntax, we can return to the +features that simplify working with groups in complex REs. Since groups are +numbered from left to right and a complex expression may use many groups, it can +become difficult to keep track of the correct numbering. Modifying such a +complex RE is annoying, too: insert a new group near the beginning and you +change the numbers of everything that follows it. + +Sometimes you'll want to use a group to collect a part of a regular expression, +but aren't interested in retrieving the group's contents. You can make this fact +explicit by using a non-capturing group: ``(?:...)``, where you can replace the +``...`` with any other regular expression. 
:: + + >>> m = re.match("([abc])+", "abc") + >>> m.groups() + ('c',) + >>> m = re.match("(?:[abc])+", "abc") + >>> m.groups() + () + +Except for the fact that you can't retrieve the contents of what the group +matched, a non-capturing group behaves exactly the same as a capturing group; +you can put anything inside it, repeat it with a repetition metacharacter such +as ``*``, and nest it within other groups (capturing or non-capturing). +``(?:...)`` is particularly useful when modifying an existing pattern, since you +can add new groups without changing how all the other groups are numbered. It +should be mentioned that there's no performance difference in searching between +capturing and non-capturing groups; neither form is any faster than the other. + +A more significant feature is named groups: instead of referring to them by +numbers, groups can be referenced by a name. + +The syntax for a named group is one of the Python-specific extensions: +``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups +also behave exactly like capturing groups, and additionally associate a name +with a group. The :class:`MatchObject` methods that deal with capturing groups +all accept either integers that refer to the group by number or strings that +contain the desired group's name. Named groups are still given numbers, so you +can retrieve information about a group in two ways:: + + >>> p = re.compile(r'(?P<word>\b\w+\b)') + >>> m = p.search( '(((( Lots of punctuation )))' ) + >>> m.group('word') + 'Lots' + >>> m.group(1) + 'Lots' + +Named groups are handy because they let you use easily-remembered names, instead +of having to remember numbers. 
Here's an example RE from the :mod:`imaplib` +module:: + + InternalDate = re.compile(r'INTERNALDATE "' + r'(?P<day>[ 123][0-9])-(?P<mon>[A-Z][a-z][a-z])-' + r'(?P<year>[0-9][0-9][0-9][0-9])' + r' (?P<hour>[0-9][0-9]):(?P<min>[0-9][0-9]):(?P<sec>[0-9][0-9])' + r' (?P<zonen>[-+])(?P<zoneh>[0-9][0-9])(?P<zonem>[0-9][0-9])' + r'"') + +It's obviously much easier to retrieve ``m.group('zonem')``, instead of having +to remember to retrieve group 9. + +The syntax for backreferences in an expression such as ``(...)\1`` refers to the +number of the group. There's naturally a variant that uses the group name +instead of the number. This is another Python extension: ``(?P=name)`` indicates +that the contents of the group called *name* should again be matched at the +current point. The regular expression for finding doubled words, +``(\b\w+)\s+\1`` can also be written as ``(?P<word>\b\w+)\s+(?P=word)``:: + + >>> p = re.compile(r'(?P<word>\b\w+)\s+(?P=word)') + >>> p.search('Paris in the the spring').group() + 'the the' + + +Lookahead Assertions +-------------------- + +Another zero-width assertion is the lookahead assertion. Lookahead assertions +are available in both positive and negative form, and look like this: + +``(?=...)`` + Positive lookahead assertion. This succeeds if the contained regular + expression, represented here by ``...``, successfully matches at the current + location, and fails otherwise. But, once the contained expression has been + tried, the matching engine doesn't advance at all; the rest of the pattern is + tried right where the assertion started. + +``(?!...)`` + Negative lookahead assertion. This is the opposite of the positive assertion; + it succeeds if the contained expression *doesn't* match at the current position + in the string. + +To make this concrete, let's look at a case where a lookahead is useful. +Consider a simple pattern to match a filename and split it apart into a base +name and an extension, separated by a ``.``. 
For example, in ``news.rc``, +``news`` is the base name, and ``rc`` is the filename's extension. + +The pattern to match this is quite simple: + +``.*[.].*$`` + +Notice that the ``.`` needs to be treated specially because it's a +metacharacter; I've put it inside a character class. Also notice the trailing +``$``; this is added to ensure that all the rest of the string must be included +in the extension. This regular expression matches ``foo.bar`` and +``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``. + +Now, consider complicating the problem a bit; what if you want to match +filenames where the extension is not ``bat``? Some incorrect attempts: + +``.*[.][^b].*$`` + +The first attempt above tries to exclude ``bat`` by requiring +that the first character of the extension is not a ``b``. This is wrong, +because the pattern also doesn't match ``foo.bar``. + +.. % $ + +``.*[.]([^b]..|.[^a].|..[^t])$`` + +.. % Messes up the HTML without the curly braces around \^ + +The expression gets messier when you try to patch up the first solution by +requiring one of the following cases to match: the first character of the +extension isn't ``b``; the second character isn't ``a``; or the third character +isn't ``t``. This accepts ``foo.bar`` and rejects ``autoexec.bat``, but it +requires a three-letter extension and won't accept a filename with a two-letter +extension such as ``sendmail.cf``. We'll complicate the pattern again in an +effort to fix it. + +``.*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$`` + +In the third attempt, the second and third letters are both made optional in +order to allow matching extensions shorter than three characters, such as +``sendmail.cf``. + +The pattern's getting really complicated now, which makes it hard to read and +understand. Worse, if the problem changes and you want to exclude both ``bat`` +and ``exe`` as extensions, the pattern would get even more complicated and +confusing. 
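The three attempts above can be checked against the sample filenames with a short script. This is a minimal sketch, written for Python 3 rather than the Python 2 syntax of the interactive examples in this document; the helper name ``accepted`` and the ``filenames`` list are illustrative and do not come from the original text.

```python
import re

# The three incorrect attempts from the text, in order.
attempts = [
    r".*[.][^b].*$",
    r".*[.]([^b]..|.[^a].|..[^t])$",
    r".*[.]([^b].?.?|.[^a]?.?|..?[^t]?)$",
]

# Sample filenames discussed in the text (hypothetical test data).
filenames = ["foo.bar", "autoexec.bat", "sendmail.cf"]


def accepted(pattern, names):
    """Return the names that the pattern matches from the start of the string."""
    return [name for name in names if re.match(pattern, name)]


results = [accepted(pattern, filenames) for pattern in attempts]
for pattern, names in zip(attempts, results):
    print(pattern, "accepts", names)
```

With these inputs, the first attempt accepts only ``sendmail.cf``, the second accepts only ``foo.bar``, and the third accepts both while still rejecting ``autoexec.bat``.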
+ +A negative lookahead cuts through all this confusion: + +``.*[.](?!bat$).*$`` + +The negative lookahead means: if the expression ``bat$`` +doesn't match at this point, try the rest of the pattern; if ``bat$`` does +match, the whole pattern will fail. The trailing ``$`` is required to ensure +that something like ``sample.batch``, where the extension only starts with +``bat``, will be allowed. + +.. % $ + +Excluding another filename extension is now easy; simply add it as an +alternative inside the assertion. The following pattern excludes filenames that +end in either ``bat`` or ``exe``: + +``.*[.](?!bat$|exe$).*$`` + +.. % $ + + +Modifying Strings +================= + +Up to this point, we've simply performed searches against a static string. +Regular expressions are also commonly used to modify strings in various ways, +using the following :class:`RegexObject` methods: + ++------------------+-----------------------------------------------+ +| Method/Attribute | Purpose | ++==================+===============================================+ +| ``split()`` | Split the string into a list, splitting it | +| | wherever the RE matches | ++------------------+-----------------------------------------------+ +| ``sub()`` | Find all substrings where the RE matches, and | +| | replace them with a different string | ++------------------+-----------------------------------------------+ +| ``subn()`` | Does the same thing as :meth:`sub`, but | +| | returns the new string and the number of | +| | replacements | ++------------------+-----------------------------------------------+ + + +Splitting Strings +----------------- + +The :meth:`split` method of a :class:`RegexObject` splits a string apart +wherever the RE matches, returning a list of the pieces. It's similar to the +:meth:`split` method of strings but provides much more generality in the +delimiters that you can split by; the string method only supports splitting by +whitespace or by a fixed string. 
As you'd expect, there's a module-level +:func:`re.split` function, too. + + +.. method:: .split(string [, maxsplit=0]) + :noindex: + + Split *string* by the matches of the regular expression. If capturing + parentheses are used in the RE, then their contents will also be returned as + part of the resulting list. If *maxsplit* is nonzero, at most *maxsplit* splits + are performed. + +You can limit the number of splits made, by passing a value for *maxsplit*. +When *maxsplit* is nonzero, at most *maxsplit* splits will be made, and the +remainder of the string is returned as the final element of the list. In the +following example, the delimiter is any sequence of non-alphanumeric characters. +:: + + >>> p = re.compile(r'\W+') + >>> p.split('This is a test, short and sweet, of split().') + ['This', 'is', 'a', 'test', 'short', 'and', 'sweet', 'of', 'split', ''] + >>> p.split('This is a test, short and sweet, of split().', 3) + ['This', 'is', 'a', 'test, short and sweet, of split().'] + +Sometimes you're not only interested in what the text between delimiters is, but +also need to know what the delimiter was. If capturing parentheses are used in +the RE, then their values are also returned as part of the list. Compare the +following calls:: + + >>> p = re.compile(r'\W+') + >>> p2 = re.compile(r'(\W+)') + >>> p.split('This... is a test.') + ['This', 'is', 'a', 'test', ''] + >>> p2.split('This... is a test.') + ['This', '... ', 'is', ' ', 'a', ' ', 'test', '.', ''] + +The module-level function :func:`re.split` adds the RE to be used as the first +argument, but is otherwise the same. 
:: + + >>> re.split('[\W]+', 'Words, words, words.') + ['Words', 'words', 'words', ''] + >>> re.split('([\W]+)', 'Words, words, words.') + ['Words', ', ', 'words', ', ', 'words', '.', ''] + >>> re.split('[\W]+', 'Words, words, words.', 1) + ['Words', 'words, words.'] + + +Search and Replace +------------------ + +Another common task is to find all the matches for a pattern, and replace them +with a different string. The :meth:`sub` method takes a replacement value, +which can be either a string or a function, and the string to be processed. + + +.. method:: .sub(replacement, string[, count=0]) + :noindex: + + Returns the string obtained by replacing the leftmost non-overlapping + occurrences of the RE in *string* by the replacement *replacement*. If the + pattern isn't found, *string* is returned unchanged. + + The optional argument *count* is the maximum number of pattern occurrences to be + replaced; *count* must be a non-negative integer. The default value of 0 means + to replace all occurrences. + +Here's a simple example of using the :meth:`sub` method. It replaces colour +names with the word ``colour``:: + + >>> p = re.compile( '(blue|white|red)') + >>> p.sub( 'colour', 'blue socks and red shoes') + 'colour socks and colour shoes' + >>> p.sub( 'colour', 'blue socks and red shoes', count=1) + 'colour socks and red shoes' + +The :meth:`subn` method does the same work, but returns a 2-tuple containing the +new string value and the number of replacements that were performed:: + + >>> p = re.compile( '(blue|white|red)') + >>> p.subn( 'colour', 'blue socks and red shoes') + ('colour socks and colour shoes', 2) + >>> p.subn( 'colour', 'no colours at all') + ('no colours at all', 0) + +Empty matches are replaced only when they're not adjacent to a previous match. +:: + + >>> p = re.compile('x*') + >>> p.sub('-', 'abxd') + '-a-b-d-' + +If *replacement* is a string, any backslash escapes in it are processed. 
That +is, ``\n`` is converted to a single newline character, ``\r`` is converted to a +carriage return, and so forth. Unknown escapes such as ``\j`` are left alone. +Backreferences, such as ``\6``, are replaced with the substring matched by the +corresponding group in the RE. This lets you incorporate portions of the +original text in the resulting replacement string. + +This example matches the word ``section`` followed by a string enclosed in +``{``, ``}``, and changes ``section`` to ``subsection``:: + + >>> p = re.compile('section{ ( [^}]* ) }', re.VERBOSE) + >>> p.sub(r'subsection{\1}','section{First} section{second}') + 'subsection{First} subsection{second}' + +There's also a syntax for referring to named groups as defined by the +``(?P<name>...)`` syntax. ``\g<name>`` will use the substring matched by the +group named ``name``, and ``\g<number>`` uses the corresponding group number. +``\g<2>`` is therefore equivalent to ``\2``, but isn't ambiguous in a +replacement string such as ``\g<2>0``. (``\20`` would be interpreted as a +reference to group 20, not a reference to group 2 followed by the literal +character ``'0'``.) The following substitutions are all equivalent, but use all +three variations of the replacement string. :: + + >>> p = re.compile('section{ (?P<name> [^}]* ) }', re.VERBOSE) + >>> p.sub(r'subsection{\1}','section{First}') + 'subsection{First}' + >>> p.sub(r'subsection{\g<1>}','section{First}') + 'subsection{First}' + >>> p.sub(r'subsection{\g<name>}','section{First}') + 'subsection{First}' + +*replacement* can also be a function, which gives you even more control. If +*replacement* is a function, the function is called for every non-overlapping +occurrence of *pattern*. On each call, the function is passed a +:class:`MatchObject` argument for the match and can use this information to +compute the desired replacement string and return it. 
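As a minimal sketch of the callable form of *replacement* (written for Python 3; the function name ``double`` and the sample text are illustrative assumptions, not part of the original document), the following doubles every integer found in a string:

```python
import re


def double(match):
    """Return the matched integer doubled, as a string."""
    # match.group() is the full text of the current match, e.g. '12'.
    return str(int(match.group()) * 2)


pattern = re.compile(r"\d+")

# sub() calls double() once per non-overlapping match and splices in
# whatever string the function returns.
result = pattern.sub(double, "3 apples and 12 pears")
print(result)
```

The function must return a string; here ``str()`` converts the computed number back before it is substituted into the result.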
+ +In the following example, the replacement function translates decimals into +hexadecimal:: + + >>> def hexrepl( match ): + ... "Return the hex string for a decimal number" + ... value = int( match.group() ) + ... return hex(value) + ... + >>> p = re.compile(r'\d+') + >>> p.sub(hexrepl, 'Call 65490 for printing, 49152 for user code.') + 'Call 0xffd2 for printing, 0xc000 for user code.' + +When using the module-level :func:`re.sub` function, the pattern is passed as +the first argument. The pattern may be a string or a :class:`RegexObject`; if +you need to specify regular expression flags, you must either use a +:class:`RegexObject` as the first parameter, or use embedded modifiers in the +pattern, e.g. ``sub("(?i)b+", "x", "bbbb BBBB")`` returns ``'x x'``. + + +Common Problems +=============== + +Regular expressions are a powerful tool for some applications, but in some ways +their behaviour isn't intuitive and at times they don't behave the way you may +expect them to. This section will point out some of the most common pitfalls. + + +Use String Methods +------------------ + +Sometimes using the :mod:`re` module is a mistake. If you're matching a fixed +string, or a single character class, and you're not using any :mod:`re` features +such as the :const:`IGNORECASE` flag, then the full power of regular expressions +may not be required. Strings have several methods for performing operations with +fixed strings and they're usually much faster, because the implementation is a +single small C loop that's been optimized for the purpose, instead of the large, +more generalized regular expression engine. + +One example might be replacing a single fixed string with another one; for +example, you might replace ``word`` with ``deed``. ``re.sub()`` seems like the +function to use for this, but consider the :meth:`replace` method. 
Note that +:func:`replace` will also replace ``word`` inside words, turning ``swordfish`` +into ``sdeedfish``, but the naive RE ``word`` would have done that, too. (To +avoid performing the substitution on parts of words, the pattern would have to +be ``\bword\b``, in order to require that ``word`` have a word boundary on +either side. This takes the job beyond :meth:`replace`'s abilities.) + +Another common task is deleting every occurrence of a single character from a +string or replacing it with another single character. You might do this with +something like ``re.sub('\n', ' ', S)``, but :meth:`translate` is capable of +doing both tasks and will be faster than any regular expression operation can +be. + +In short, before turning to the :mod:`re` module, consider whether your problem +can be solved with a faster and simpler string method. + + +match() versus search() +----------------------- + +The :func:`match` function only checks if the RE matches at the beginning of the +string while :func:`search` will scan forward through the string for a match. +It's important to keep this distinction in mind. Remember, :func:`match` will +only report a successful match which will start at 0; if the match wouldn't +start at zero, :func:`match` will *not* report it. :: + + >>> print re.match('super', 'superstition').span() + (0, 5) + >>> print re.match('super', 'insuperable') + None + +On the other hand, :func:`search` will scan forward through the string, +reporting the first match it finds. :: + + >>> print re.search('super', 'superstition').span() + (0, 5) + >>> print re.search('super', 'insuperable').span() + (2, 7) + +Sometimes you'll be tempted to keep using :func:`re.match`, and just add ``.*`` +to the front of your RE. Resist this temptation and use :func:`re.search` +instead. The regular expression compiler does some analysis of REs in order to +speed up the process of looking for a match. 
One such analysis figures out what +the first character of a match must be; for example, a pattern starting with +``Crow`` must match starting with a ``'C'``. The analysis lets the engine +quickly scan through the string looking for the starting character, only trying +the full match if a ``'C'`` is found. + +Adding ``.*`` defeats this optimization, requiring scanning to the end of the +string and then backtracking to find a match for the rest of the RE. Use +:func:`re.search` instead. + + +Greedy versus Non-Greedy +------------------------ + +When repeating a regular expression, as in ``a*``, the resulting action is to +match as much of the string as possible. This fact often bites you when +you're trying to match a pair of balanced delimiters, such as the angle brackets +surrounding an HTML tag. The naive pattern for matching a single HTML tag +doesn't work because of the greedy nature of ``.*``. :: + + >>> s = '<html><head><title>Title</title>' + >>> len(s) + 32 + >>> print re.match('<.*>', s).span() + (0, 32) + >>> print re.match('<.*>', s).group() + <html><head><title>Title</title> + +The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of +the string. There's still more left in the RE, though, and the ``>`` can't +match at the end of the string, so the regular expression engine has to +backtrack character by character until it finds a match for the ``>``. The +final match extends from the ``'<'`` in ``<html>`` to the ``'>'`` in +``</title>``, which isn't what you want. + +In this case, the solution is to use the non-greedy qualifiers ``*?``, ``+?``, +``??``, or ``{m,n}?``, which match as *little* text as possible. In the above +example, the ``'>'`` is tried immediately after the first ``'<'`` matches, and +when it fails, the engine advances a character at a time, retrying the ``'>'`` +at every step. 
This produces just the right result:: + + >>> print re.match('<.*?>', s).group() + <html> + +(Note that parsing HTML or XML with regular expressions is painful. +Quick-and-dirty patterns will handle common cases, but HTML and XML have special +cases that will break the obvious regular expression; by the time you've written +a regular expression that handles all of the possible cases, the patterns will +be *very* complicated. Use an HTML or XML parser module for such tasks.) + + +Not Using re.VERBOSE +-------------------- + +By now you've probably noticed that regular expressions are a very compact +notation, but they're not terribly readable. REs of moderate complexity can +become lengthy collections of backslashes, parentheses, and metacharacters, +making them difficult to read and understand. + +For such REs, specifying the ``re.VERBOSE`` flag when compiling the regular +expression can be helpful, because it allows you to format the regular +expression more clearly. + +The ``re.VERBOSE`` flag has several effects. Whitespace in the regular +expression that *isn't* inside a character class is ignored. This means that an +expression such as ``dog | cat`` is equivalent to the less readable ``dog|cat``, +but ``[a b]`` will still match the characters ``'a'``, ``'b'``, or a space. In +addition, you can also put comments inside a RE; comments extend from a ``#`` +character to the next newline. When used with triple-quoted strings, this +enables REs to be formatted more neatly:: + + pat = re.compile(r""" + \s* # Skip leading whitespace + (?P<header>[^:]+) # Header name + \s* : # Whitespace, and a colon + (?P<value>.*?) # The header's value -- *? used to + # lose the following trailing whitespace + \s*$ # Trailing whitespace to end-of-line + """, re.VERBOSE) + +This is far more readable than: + +.. % $ + +:: + + pat = re.compile(r"\s*(?P<header>[^:]+)\s*:(?P<value>.*?)\s*$") + +.. % $ + + +Feedback +======== + +Regular expressions are a complicated topic. 
Did this document help you +understand them? Were there parts that were unclear, or problems you +encountered that weren't covered here? If so, please send suggestions for +improvements to the author. + +The most complete book on regular expressions is almost certainly Jeffrey +Friedl's Mastering Regular Expressions, published by O'Reilly. Unfortunately, +it exclusively concentrates on Perl and Java's flavours of regular expressions, +and doesn't contain any Python material at all, so it won't be useful as a +reference for programming in Python. (The first edition covered Python's +now-removed :mod:`regex` module, which won't help you much.) Consider checking +it out from your library. + + +.. rubric:: Footnotes + +.. [#] Introduced in Python 2.2.2. +