Diffstat (limited to 'Doc/howto/regex.rst')
-rw-r--r-- | Doc/howto/regex.rst | 430 |
1 files changed, 208 insertions, 222 deletions
diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst index d574c37..81c0495 100644 --- a/Doc/howto/regex.rst +++ b/Doc/howto/regex.rst @@ -23,6 +23,11 @@ Introduction ============ +The :mod:`re` module was added in Python 1.5, and provides Perl-style regular +expression patterns. Earlier versions of Python came with the :mod:`regex` +module, which provided Emacs-style patterns. The :mod:`regex` module was +removed completely in Python 2.5. + Regular expressions (called REs, or regexes, or regex patterns) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the :mod:`re` module. Using this little language, you specify @@ -107,25 +112,13 @@ you can still match them in patterns; for example, if you need to match a ``[`` or ``\``, you can precede them with a backslash to remove their special meaning: ``\[`` or ``\\``. -Some of the special sequences beginning with ``'\'`` represent -predefined sets of characters that are often useful, such as the set -of digits, the set of letters, or the set of anything that isn't -whitespace. - -Let's take an example: ``\w`` matches any alphanumeric character. If -the regex pattern is expressed in bytes, this is equivalent to the -class ``[a-zA-Z0-9_]``. If the regex pattern is a string, ``\w`` will -match all the characters marked as letters in the Unicode database -provided by the :mod:`unicodedata` module. You can use the more -restricted definition of ``\w`` in a string pattern by supplying the -:const:`re.ASCII` flag when compiling the regular expression. - -The following list of special sequences isn't complete. For a complete -list of sequences and expanded class definitions for Unicode string -patterns, see the last part of :ref:`Regular Expression Syntax -<re-syntax>` in the Standard Library reference. In general, the -Unicode versions match any character that's in the appropriate -category in the Unicode database. 
+Some of the special sequences beginning with ``'\'`` represent predefined sets +of characters that are often useful, such as the set of digits, the set of +letters, or the set of anything that isn't whitespace. The following predefined +special sequences are a subset of those available. The equivalent classes are +for byte string patterns. For a complete list of sequences and expanded class +definitions for Unicode string patterns, see the last part of +:ref:`Regular Expression Syntax <re-syntax>`. ``\d`` Matches any decimal digit; this is equivalent to the class ``[0-9]``. @@ -154,8 +147,8 @@ These sequences can be included inside a character class. For example, ``','`` or ``'.'``. The final metacharacter in this section is ``.``. It matches anything except a -newline character, and there's an alternate mode (:const:`re.DOTALL`) where it will -match even a newline. ``.`` is often used where you want to match "any +newline character, and there's an alternate mode (``re.DOTALL``) where it will +match even a newline. ``'.'`` is often used where you want to match "any character". @@ -169,11 +162,15 @@ wouldn't be much of an advance. Another capability is that you can specify that portions of the RE must be repeated a certain number of times. The first metacharacter for repeating things that we'll look at is ``*``. ``*`` -doesn't match the literal character ``'*'``; instead, it specifies that the +doesn't match the literal character ``*``; instead, it specifies that the previous character can be matched zero or more times, instead of exactly once. -For example, ``ca*t`` will match ``'ct'`` (0 ``'a'`` characters), ``'cat'`` (1 ``'a'``), -``'caaat'`` (3 ``'a'`` characters), and so forth. +For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``), +``caaat`` (3 ``a`` characters), and so forth. 
The RE engine has various +internal limitations stemming from the size of C's ``int`` type that will +prevent it from matching over 2 billion ``a`` characters; you probably don't +have enough memory to construct a string that large, so you shouldn't run into +that limit. Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching engine will try to repeat it as many times as possible. If later portions of the @@ -183,7 +180,7 @@ fewer repetitions. A step-by-step example will make this more obvious. Let's consider the expression ``a[bcd]*b``. This matches the letter ``'a'``, zero or more letters from the class ``[bcd]``, and finally ends with a ``'b'``. Now imagine matching -this RE against the string ``'abcbd'``. +this RE against the string ``abcbd``. +------+-----------+---------------------------------+ | Step | Matched | Explanation | @@ -216,7 +213,7 @@ this RE against the string ``'abcbd'``. | | | it succeeds. | +------+-----------+---------------------------------+ -The end of the RE has now been reached, and it has matched ``'abcb'``. This +The end of the RE has now been reached, and it has matched ``abcb``. This demonstrates how the matching engine goes as far as it can at first, and if no match is found it will then progressively back up and retry the rest of the RE again and again. It will back up until it has tried zero matches for @@ -227,23 +224,24 @@ Another repeating metacharacter is ``+``, which matches one or more times. Pay careful attention to the difference between ``*`` and ``+``; ``*`` matches *zero* or more times, so whatever's being repeated may not be present at all, while ``+`` requires at least *one* occurrence. To use a similar example, -``ca+t`` will match ``'cat'`` (1 ``'a'``), ``'caaat'`` (3 ``'a'``\ s), but won't -match ``'ct'``. +``ca+t`` will match ``cat`` (1 ``a``), ``caaat`` (3 ``a``'s), but won't match +``ct``. There are two more repeating qualifiers. 
The question mark character, ``?``, matches either once or zero times; you can think of it as marking something as -being optional. For example, ``home-?brew`` matches either ``'homebrew'`` or -``'home-brew'``. +being optional. For example, ``home-?brew`` matches either ``homebrew`` or +``home-brew``. The most complicated repeated qualifier is ``{m,n}``, where *m* and *n* are decimal integers. This qualifier means there must be at least *m* repetitions, -and at most *n*. For example, ``a/{1,3}b`` will match ``'a/b'``, ``'a//b'``, and -``'a///b'``. It won't match ``'ab'``, which has no slashes, or ``'a////b'``, which +and at most *n*. For example, ``a/{1,3}b`` will match ``a/b``, ``a//b``, and +``a///b``. It won't match ``ab``, which has no slashes, or ``a////b``, which has four. You can omit either *m* or *n*; in that case, a reasonable value is assumed for the missing value. Omitting *m* is interpreted as a lower limit of 0, while -omitting *n* results in an upper bound of infinity. +omitting *n* results in an upper bound of infinity --- actually, the upper bound +is the 2-billion limit mentioned earlier, but that might as well be infinity. Readers of a reductionist bent may notice that the three other qualifiers can all be expressed using this notation. ``{0,}`` is the same as ``*``, ``{1,}`` @@ -270,8 +268,8 @@ performing string substitutions. :: >>> import re >>> p = re.compile('ab*') - >>> p - re.compile('ab*') + >>> p #doctest: +ELLIPSIS + <_sre.SRE_Pattern object at 0x...> :func:`re.compile` also accepts an optional *flags* argument, used to enable various special features and syntax variations. We'll go over the available @@ -290,8 +288,6 @@ Putting REs in strings keeps the Python language simpler, but has one disadvantage which is the topic of the next section. -.. 
_the-backslash-plague: - The Backslash Plague -------------------- @@ -330,13 +326,6 @@ backslashes are not handled in any special way in a string literal prefixed with while ``"\n"`` is a one-character string containing a newline. Regular expressions will often be written in Python code using this raw string notation. -In addition, special escape sequences that are valid in regular expressions, -but not valid as Python string literals, now result in a -:exc:`DeprecationWarning` and will eventually become a :exc:`SyntaxError`, -which means the sequences will be invalid if raw string notation or escaping -the backslashes isn't used. - - +-------------------+------------------+ | Regular String | Raw string | +===================+==================+ @@ -372,46 +361,49 @@ for a complete listing. | | returns them as an :term:`iterator`. | +------------------+-----------------------------------------------+ -:meth:`~re.Pattern.match` and :meth:`~re.Pattern.search` return ``None`` if no match can be found. If +:meth:`match` and :meth:`search` return ``None`` if no match can be found. If they're successful, a :ref:`match object <match-objects>` instance is returned, containing information about the match: where it starts and ends, the substring it matched, and more. You can learn about this by interactively experimenting with the :mod:`re` -module. If you have :mod:`tkinter` available, you may also want to look at -:source:`Tools/demo/redemo.py`, a demonstration program included with the +module. If you have Tkinter available, you may also want to look at +:source:`Tools/scripts/redemo.py`, a demonstration program included with the Python distribution. It allows you to enter REs and strings, and displays whether the RE matches or fails. :file:`redemo.py` can be quite useful when -trying to debug a complicated RE. +trying to debug a complicated RE. Phil Schwartz's `Kodos +<http://kodos.sourceforge.net/>`_ is also an interactive tool for developing and +testing RE patterns. 
This HOWTO uses the standard Python interpreter for its examples. First, run the Python interpreter, import the :mod:`re` module, and compile a RE:: + Python 2.2.2 (#1, Feb 10 2003, 12:57:01) >>> import re >>> p = re.compile('[a-z]+') - >>> p - re.compile('[a-z]+') + >>> p #doctest: +ELLIPSIS + <_sre.SRE_Pattern object at 0x...> Now, you can try matching various strings against the RE ``[a-z]+``. An empty string shouldn't match at all, since ``+`` means 'one or more repetitions'. -:meth:`~re.Pattern.match` should return ``None`` in this case, which will cause the +:meth:`match` should return ``None`` in this case, which will cause the interpreter to print no output. You can explicitly print the result of -:meth:`!match` to make this clear. :: +:meth:`match` to make this clear. :: >>> p.match("") - >>> print(p.match("")) + >>> print p.match("") None Now, let's try it on a string that it should match, such as ``tempo``. In this -case, :meth:`~re.Pattern.match` will return a :ref:`match object <match-objects>`, so you +case, :meth:`match` will return a :ref:`match object <match-objects>`, so you should store the result in a variable for later use. :: >>> m = p.match('tempo') - >>> m - <re.Match object; span=(0, 5), match='tempo'> + >>> m #doctest: +ELLIPSIS + <_sre.SRE_Match object at 0x...> Now you can query the :ref:`match object <match-objects>` for information -about the matching string. Match object instances +about the matching string. :ref:`match object <match-objects>` instances also have several methods and attributes; the most important ones are: +------------------+--------------------------------------------+ @@ -436,18 +428,18 @@ Trying these methods will soon clarify their meaning:: >>> m.span() (0, 5) -:meth:`~re.Match.group` returns the substring that was matched by the RE. :meth:`~re.Match.start` -and :meth:`~re.Match.end` return the starting and ending index of the match. :meth:`~re.Match.span` -returns both start and end indexes in a single tuple. 
Since the :meth:`~re.Pattern.match` -method only checks if the RE matches at the start of a string, :meth:`!start` -will always be zero. However, the :meth:`~re.Pattern.search` method of patterns +:meth:`group` returns the substring that was matched by the RE. :meth:`start` +and :meth:`end` return the starting and ending index of the match. :meth:`span` +returns both start and end indexes in a single tuple. Since the :meth:`match` +method only checks if the RE matches at the start of a string, :meth:`start` +will always be zero. However, the :meth:`search` method of patterns scans through the string, so the match may not start at zero in that case. :: - >>> print(p.match('::: message')) + >>> print p.match('::: message') None - >>> m = p.search('::: message'); print(m) - <re.Match object; span=(4, 11), match='message'> + >>> m = p.search('::: message'); print m #doctest: +ELLIPSIS + <_sre.SRE_Match object at 0x...> >>> m.group() 'message' >>> m.span() @@ -460,32 +452,26 @@ In actual programs, the most common style is to store the p = re.compile( ... ) m = p.match( 'string goes here' ) if m: - print('Match found: ', m.group()) + print 'Match found: ', m.group() else: - print('No match') + print 'No match' Two pattern methods return all of the matches for a pattern. -:meth:`~re.Pattern.findall` returns a list of matching strings:: +:meth:`findall` returns a list of matching strings:: - >>> p = re.compile(r'\d+') + >>> p = re.compile('\d+') >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping') ['12', '11', '10'] -The ``r`` prefix, making the literal a raw string literal, is needed in this -example because escape sequences in a normal "cooked" string literal that are -not recognized by Python, as opposed to regular expressions, now result in a -:exc:`DeprecationWarning` and will eventually become a :exc:`SyntaxError`. See -:ref:`the-backslash-plague`. 
- -:meth:`~re.Pattern.findall` has to create the entire list before it can be returned as the -result. The :meth:`~re.Pattern.finditer` method returns a sequence of -:ref:`match object <match-objects>` instances as an :term:`iterator`:: +:meth:`findall` has to create the entire list before it can be returned as the +result. The :meth:`finditer` method returns a sequence of +:ref:`match object <match-objects>` instances as an :term:`iterator`. [#]_ :: >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...') >>> iterator #doctest: +ELLIPSIS - <callable_iterator object at 0x...> + <callable-iterator object at 0x...> >>> for match in iterator: - ... print(match.span()) + ... print match.span() ... (0, 2) (22, 24) @@ -496,27 +482,38 @@ Module-Level Functions ---------------------- You don't have to create a pattern object and call its methods; the -:mod:`re` module also provides top-level functions called :func:`~re.match`, -:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions -take the same arguments as the corresponding pattern method with +:mod:`re` module also provides top-level functions called :func:`match`, +:func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions +take the same arguments as the corresponding pattern method, with the RE string added as the first argument, and still return either ``None`` or a :ref:`match object <match-objects>` instance. :: - >>> print(re.match(r'From\s+', 'Fromage amk')) + >>> print re.match(r'From\s+', 'Fromage amk') None >>> re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998') #doctest: +ELLIPSIS - <re.Match object; span=(0, 5), match='From '> + <_sre.SRE_Match object at 0x...> Under the hood, these functions simply create a pattern object for you -and call the appropriate method on it. They also store the compiled -object in a cache, so future calls using the same RE won't need to -parse the pattern again and again. +and call the appropriate method on it. 
They also store the compiled object in a +cache, so future calls using the same RE are faster. Should you use these module-level functions, or should you get the -pattern and call its methods yourself? If you're accessing a regex -within a loop, pre-compiling it will save a few function calls. -Outside of loops, there's not much difference thanks to the internal -cache. +pattern and call its methods yourself? That choice depends on how +frequently the RE will be used, and on your personal coding style. If the RE is +being used at only one point in the code, then the module functions are probably +more convenient. If a program contains a lot of regular expressions, or re-uses +the same ones in several locations, then it might be worthwhile to collect all +the definitions in one place, in a section of code that compiles all the REs +ahead of time. To take an example from the standard library, here's an extract +from the deprecated :mod:`xmllib` module:: + + ref = re.compile( ... ) + entityref = re.compile( ... ) + charref = re.compile( ... ) + starttagopen = re.compile( ... ) + +I generally prefer to work with the compiled object, even for one-time uses, but +few people will be as much of a purist about this as I am. Compilation Flags @@ -536,22 +533,22 @@ of each one. +---------------------------------+--------------------------------------------+ | Flag | Meaning | +=================================+============================================+ -| :const:`ASCII`, :const:`A` | Makes several escapes like ``\w``, ``\b``, | -| | ``\s`` and ``\d`` match only on ASCII | -| | characters with the respective property. | -+---------------------------------+--------------------------------------------+ | :const:`DOTALL`, :const:`S` | Make ``.`` match any character, including | -| | newlines. | +| | newlines | +---------------------------------+--------------------------------------------+ -| :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches. 
| +| :const:`IGNORECASE`, :const:`I` | Do case-insensitive matches | +---------------------------------+--------------------------------------------+ -| :const:`LOCALE`, :const:`L` | Do a locale-aware match. | +| :const:`LOCALE`, :const:`L` | Do a locale-aware match | +---------------------------------+--------------------------------------------+ | :const:`MULTILINE`, :const:`M` | Multi-line matching, affecting ``^`` and | -| | ``$``. | +| | ``$`` | +---------------------------------+--------------------------------------------+ | :const:`VERBOSE`, :const:`X` | Enable verbose REs, which can be organized | -| (for 'extended') | more cleanly and understandably. | +| | more cleanly and understandably. | ++---------------------------------+--------------------------------------------+ +| :const:`UNICODE`, :const:`U` | Makes several escapes like ``\w``, ``\b``, | +| | ``\s`` and ``\d`` dependent on the Unicode | +| | character database. | +---------------------------------+--------------------------------------------+ @@ -561,41 +558,26 @@ of each one. Perform case-insensitive matching; character class and literal strings will match letters by ignoring case. For example, ``[A-Z]`` will match lowercase - letters, too. Full Unicode matching also works unless the :const:`ASCII` - flag is used to disable non-ASCII matches. When the Unicode patterns - ``[a-z]`` or ``[A-Z]`` are used in combination with the :const:`IGNORECASE` - flag, they will match the 52 ASCII letters and 4 additional non-ASCII - letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, - Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and - 'K' (U+212A, Kelvin sign). ``Spam`` will match ``'Spam'``, ``'spam'``, - ``'spAM'``, or ``'ſpam'`` (the latter is matched only in Unicode mode). - This lowercasing doesn't take the current locale into account; - it will if you also set the :const:`LOCALE` flag. 
+ letters, too, and ``Spam`` will match ``Spam``, ``spam``, or ``spAM``. This + lowercasing doesn't take the current locale into account; it will if you also + set the :const:`LOCALE` flag. .. data:: L LOCALE :noindex: - Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching dependent - on the current locale instead of the Unicode database. - - Locales are a feature of the C library intended to help in writing programs - that take account of language differences. For example, if you're - processing encoded French text, you'd want to be able to write ``\w+`` to - match words, but ``\w`` only matches the character class ``[A-Za-z]`` in - bytes patterns; it won't match bytes corresponding to ``é`` or ``ç``. - If your system is configured properly and a French locale is selected, - certain C functions will tell the program that the byte corresponding to - ``é`` should also be considered a letter. + Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale. + + Locales are a feature of the C library intended to help in writing programs that + take account of language differences. For example, if you're processing French + text, you'd want to be able to write ``\w+`` to match words, but ``\w`` only + matches the character class ``[A-Za-z]``; it won't match ``'é'`` or ``'ç'``. If + your system is configured properly and a French locale is selected, certain C + functions will tell the program that ``'é'`` should also be considered a letter. Setting the :const:`LOCALE` flag when compiling a regular expression will cause the resulting compiled object to use these C functions for ``\w``; this is slower, but also enables ``\w+`` to match French words as you'd expect. - The use of this flag is discouraged in Python 3 as the locale mechanism - is very unreliable, it only handles one "culture" at a time, and it only - works with 8-bit locales. 
Unicode matching is already enabled by default - in Python 3 for Unicode (str) patterns, and it is able to handle different - locales/languages. .. data:: M @@ -622,13 +604,12 @@ of each one. newline; without this flag, ``'.'`` will match anything *except* a newline. -.. data:: A - ASCII +.. data:: U + UNICODE :noindex: - Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` perform ASCII-only - matching instead of full Unicode matching. This is only meaningful for - Unicode patterns, and is ignored for byte patterns. + Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` + dependent on the Unicode character properties database. .. data:: X @@ -693,11 +674,11 @@ zero-width assertions should never be repeated, because if they match once at a given location, they can obviously be matched an infinite number of times. ``|`` - Alternation, or the "or" operator. If *A* and *B* are regular expressions, - ``A|B`` will match any string that matches either *A* or *B*. ``|`` has very + Alternation, or the "or" operator. If A and B are regular expressions, + ``A|B`` will match any string that matches either ``A`` or ``B``. ``|`` has very low precedence in order to make it work reasonably when you're alternating - multi-character strings. ``Crow|Servo`` will match either ``'Crow'`` or ``'Servo'``, - not ``'Cro'``, a ``'w'`` or an ``'S'``, and ``'ervo'``. + multi-character strings. ``Crow|Servo`` will match either ``Crow`` or ``Servo``, + not ``Cro``, a ``'w'`` or an ``'S'``, and ``ervo``. To match a literal ``'|'``, use ``\|``, or enclose it inside a character class, as in ``[|]``. @@ -710,23 +691,24 @@ given location, they can obviously be matched an infinite number of times. For example, if you wish to match the word ``From`` only at the beginning of a line, the RE to use is ``^From``. 
:: - >>> print(re.search('^From', 'From Here to Eternity')) #doctest: +ELLIPSIS - <re.Match object; span=(0, 4), match='From'> - >>> print(re.search('^From', 'Reciting From Memory')) + >>> print re.search('^From', 'From Here to Eternity') #doctest: +ELLIPSIS + <_sre.SRE_Match object at 0x...> + >>> print re.search('^From', 'Reciting From Memory') None - To match a literal ``'^'``, use ``\^``. + .. To match a literal \character{\^}, use \regexp{\e\^} or enclose it + .. inside a character class, as in \regexp{[{\e}\^]}. ``$`` Matches at the end of a line, which is defined as either the end of the string, or any location followed by a newline character. :: - >>> print(re.search('}$', '{block}')) #doctest: +ELLIPSIS - <re.Match object; span=(6, 7), match='}'> - >>> print(re.search('}$', '{block} ')) + >>> print re.search('}$', '{block}') #doctest: +ELLIPSIS + <_sre.SRE_Match object at 0x...> + >>> print re.search('}$', '{block} ') None - >>> print(re.search('}$', '{block}\n')) #doctest: +ELLIPSIS - <re.Match object; span=(6, 7), match='}'> + >>> print re.search('}$', '{block}\n') #doctest: +ELLIPSIS + <_sre.SRE_Match object at 0x...> To match a literal ``'$'``, use ``\$`` or enclose it inside a character class, as in ``[$]``. @@ -750,11 +732,11 @@ given location, they can obviously be matched an infinite number of times. match when it's contained inside another word. :: >>> p = re.compile(r'\bclass\b') - >>> print(p.search('no class at all')) - <re.Match object; span=(3, 8), match='class'> - >>> print(p.search('the declassified algorithm')) + >>> print p.search('no class at all') #doctest: +ELLIPSIS + <_sre.SRE_Match object at 0x...> + >>> print p.search('the declassified algorithm') None - >>> print(p.search('one subclass is')) + >>> print p.search('one subclass is') None There are two subtleties you should remember when using this special sequence. @@ -766,10 +748,10 @@ given location, they can obviously be matched an infinite number of times. 
in front of the RE string. :: >>> p = re.compile('\bclass\b') - >>> print(p.search('no class at all')) + >>> print p.search('no class at all') None - >>> print(p.search('\b' + 'class' + '\b')) - <re.Match object; span=(0, 7), match='\x08class\x08'> + >>> print p.search('\b' + 'class' + '\b') #doctest: +ELLIPSIS + <_sre.SRE_Match object at 0x...> Second, inside a character class, where there's no use for this assertion, ``\b`` represents the backspace character, for compatibility with Python's @@ -787,9 +769,7 @@ Frequently you need to obtain more information than just whether the RE matched or not. Regular expressions are often used to dissect strings by writing a RE divided into several subgroups which match different components of interest. For example, an RFC-822 header line is divided into a header name and a value, -separated by a ``':'``, like this: - -.. code-block:: none +separated by a ``':'``, like this:: From: author@example.com User-Agent: Thunderbird 1.5.0.9 (X11/20061227) @@ -808,13 +788,12 @@ of a group with a repeating qualifier, such as ``*``, ``+``, ``?``, or ``ab``. :: >>> p = re.compile('(ab)*') - >>> print(p.match('ababababab').span()) + >>> print p.match('ababababab').span() (0, 10) Groups indicated with ``'('``, ``')'`` also capture the starting and ending index of the text that they match; this can be retrieved by passing an argument -to :meth:`~re.Match.group`, :meth:`~re.Match.start`, :meth:`~re.Match.end`, and -:meth:`~re.Match.span`. Groups are +to :meth:`group`, :meth:`start`, :meth:`end`, and :meth:`span`. Groups are numbered starting with 0. Group 0 is always present; it's the whole RE, so :ref:`match object <match-objects>` methods all have group 0 as their default argument. Later we'll see how to express groups that don't capture the span @@ -840,13 +819,13 @@ from left to right. 
:: >>> m.group(2) 'b' -:meth:`~re.Match.group` can be passed multiple group numbers at a time, in which case it +:meth:`group` can be passed multiple group numbers at a time, in which case it will return a tuple containing the corresponding values for those groups. :: >>> m.group(2,1,2) ('b', 'abc', 'b') -The :meth:`~re.Match.groups` method returns a tuple containing the strings for all the +The :meth:`groups` method returns a tuple containing the strings for all the subgroups, from 1 up to however many there are. :: >>> m.groups() @@ -880,10 +859,11 @@ keep track of the group numbers. There are two features which help with this problem. Both of them use a common syntax for regular expression extensions, so we'll look at that first. -Perl 5 is well known for its powerful additions to standard regular expressions. -For these new features the Perl developers couldn't choose new single-keystroke metacharacters -or new special sequences beginning with ``\`` without making Perl's regular -expressions confusingly different from standard REs. If they chose ``&`` as a +Perl 5 added several additional features to standard regular expressions, and +the Python :mod:`re` module supports most of them. It would have been +difficult to choose new single-keystroke metacharacters or new special sequences +beginning with ``\`` to represent the new features without making Perl's regular +expressions confusingly different from standard REs. If you chose ``&`` as a new metacharacter, for example, old expressions would be assuming that ``&`` was a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``. @@ -895,15 +875,22 @@ what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead assertion) and ``(?:foo)`` is something else (a non-capturing group containing the subexpression ``foo``). -Python supports several of Perl's extensions and adds an extension -syntax to Perl's extension syntax. 
If the first character after the -question mark is a ``P``, you know that it's an extension that's -specific to Python. - -Now that we've looked at the general extension syntax, we can return -to the features that simplify working with groups in complex REs. - -Sometimes you'll want to use a group to denote a part of a regular expression, +Python adds an extension syntax to Perl's extension syntax. If the first +character after the question mark is a ``P``, you know that it's an extension +that's specific to Python. Currently there are two such extensions: +``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to +a named group. If future versions of Perl 5 add similar features using a +different syntax, the :mod:`re` module will be changed to support the new +syntax, while preserving the Python-specific syntax for compatibility's sake. + +Now that we've looked at the general extension syntax, we can return to the +features that simplify working with groups in complex REs. Since groups are +numbered from left to right and a complex expression may use many groups, it can +become difficult to keep track of the correct numbering. Modifying such a +complex RE is annoying, too: insert a new group near the beginning and you +change the numbers of everything that follows it. + +Sometimes you'll want to use a group to collect a part of a regular expression, but aren't interested in retrieving the group's contents. You can make this fact explicit by using a non-capturing group: ``(?:...)``, where you can replace the ``...`` with any other regular expression. :: @@ -929,7 +916,7 @@ numbers, groups can be referenced by a name. The syntax for a named group is one of the Python-specific extensions: ``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups -behave exactly like capturing groups, and additionally associate a name +also behave exactly like capturing groups, and additionally associate a name with a group. 
The :ref:`match object <match-objects>` methods that deal with capturing groups all accept either integers that refer to the group by number or strings that contain the desired group's name. Named groups are still @@ -942,13 +929,6 @@ given numbers, so you can retrieve information about a group in two ways:: >>> m.group(1) 'Lots' -Additionally, you can retrieve named groups as a dictionary with -:meth:`~re.Match.groupdict`:: - - >>> m = re.match(r'(?P<first>\w+) (?P<last>\w+)', 'Jane Doe') - >>> m.groupdict() - {'first': 'Jane', 'last': 'Doe'} - Named groups are handy because they let you use easily-remembered names, instead of having to remember numbers. Here's an example RE from the :mod:`imaplib` module:: @@ -1003,10 +983,9 @@ The pattern to match this is quite simple: ``.*[.].*$`` Notice that the ``.`` needs to be treated specially because it's a -metacharacter, so it's inside a character class to only match that -specific character. Also notice the trailing ``$``; this is added to -ensure that all the rest of the string must be included in the -extension. This regular expression matches ``foo.bar`` and +metacharacter; I've put it inside a character class. Also notice the trailing +``$``; this is added to ensure that all the rest of the string must be included +in the extension. This regular expression matches ``foo.bar`` and ``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``. 
Now, consider complicating the problem a bit; what if you want to match @@ -1069,7 +1048,7 @@ using the following pattern methods: | ``sub()`` | Find all substrings where the RE matches, and | | | replace them with a different string | +------------------+-----------------------------------------------+ -| ``subn()`` | Does the same thing as :meth:`!sub`, but | +| ``subn()`` | Does the same thing as :meth:`sub`, but | | | returns the new string and the number of | | | replacements | +------------------+-----------------------------------------------+ @@ -1078,10 +1057,10 @@ using the following pattern methods: Splitting Strings ----------------- -The :meth:`~re.Pattern.split` method of a pattern splits a string apart +The :meth:`split` method of a pattern splits a string apart wherever the RE matches, returning a list of the pieces. It's similar to the -:meth:`~str.split` method of strings but provides much more generality in the -delimiters that you can split by; string :meth:`!split` only supports splitting by +:meth:`split` method of strings but provides much more generality in the +delimiters that you can split by; :meth:`split` only supports splitting by whitespace or by a fixed string. As you'd expect, there's a module-level :func:`re.split` function, too. @@ -1121,11 +1100,11 @@ following calls:: The module-level function :func:`re.split` adds the RE to be used as the first argument, but is otherwise the same. 
:: - >>> re.split(r'[\W]+', 'Words, words, words.') + >>> re.split('[\W]+', 'Words, words, words.') ['Words', 'words', 'words', ''] - >>> re.split(r'([\W]+)', 'Words, words, words.') + >>> re.split('([\W]+)', 'Words, words, words.') ['Words', ', ', 'words', ', ', 'words', '.', ''] - >>> re.split(r'[\W]+', 'Words, words, words.', 1) + >>> re.split('[\W]+', 'Words, words, words.', 1) ['Words', 'words, words.'] @@ -1133,9 +1112,10 @@ Search and Replace ------------------ Another common task is to find all the matches for a pattern, and replace them -with a different string. The :meth:`~re.Pattern.sub` method takes a replacement value, +with a different string. The :meth:`sub` method takes a replacement value, which can be either a string or a function, and the string to be processed. + .. method:: .sub(replacement, string[, count=0]) :noindex: @@ -1147,7 +1127,7 @@ which can be either a string or a function, and the string to be processed. replaced; *count* must be a non-negative integer. The default value of 0 means to replace all occurrences. -Here's a simple example of using the :meth:`~re.Pattern.sub` method. It replaces colour +Here's a simple example of using the :meth:`sub` method. It replaces colour names with the word ``colour``:: >>> p = re.compile('(blue|white|red)') @@ -1156,7 +1136,7 @@ names with the word ``colour``:: >>> p.sub('colour', 'blue socks and red shoes', count=1) 'colour socks and red shoes' -The :meth:`~re.Pattern.subn` method does the same work, but returns a 2-tuple containing the +The :meth:`subn` method does the same work, but returns a 2-tuple containing the new string value and the number of replacements that were performed:: >>> p = re.compile('(blue|white|red)') @@ -1165,16 +1145,16 @@ new string value and the number of replacements that were performed:: >>> p.subn('colour', 'no colours at all') ('no colours at all', 0) -Empty matches are replaced only when they're not adjacent to a previous empty match. 
+Empty matches are replaced only when they're not adjacent to a previous match. :: >>> p = re.compile('x*') >>> p.sub('-', 'abxd') - '-a-b--d-' + '-a-b-d-' If *replacement* is a string, any backslash escapes in it are processed. That is, ``\n`` is converted to a single newline character, ``\r`` is converted to a -carriage return, and so forth. Unknown escapes such as ``\&`` are left alone. +carriage return, and so forth. Unknown escapes such as ``\j`` are left alone. Backreferences, such as ``\6``, are replaced with the substring matched by the corresponding group in the RE. This lets you incorporate portions of the original text in the resulting replacement string. @@ -1241,24 +1221,24 @@ Use String Methods Sometimes using the :mod:`re` module is a mistake. If you're matching a fixed string, or a single character class, and you're not using any :mod:`re` features -such as the :const:`~re.IGNORECASE` flag, then the full power of regular expressions +such as the :const:`IGNORECASE` flag, then the full power of regular expressions may not be required. Strings have several methods for performing operations with fixed strings and they're usually much faster, because the implementation is a single small C loop that's been optimized for the purpose, instead of the large, more generalized regular expression engine. One example might be replacing a single fixed string with another one; for -example, you might replace ``word`` with ``deed``. :func:`re.sub` seems like the -function to use for this, but consider the :meth:`~str.replace` method. Note that -:meth:`!replace` will also replace ``word`` inside words, turning ``swordfish`` +example, you might replace ``word`` with ``deed``. ``re.sub()`` seems like the +function to use for this, but consider the :meth:`replace` method. Note that +:func:`replace` will also replace ``word`` inside words, turning ``swordfish`` into ``sdeedfish``, but the naive RE ``word`` would have done that, too. 
(To avoid performing the substitution on parts of words, the pattern would have to be ``\bword\b``, in order to require that ``word`` have a word boundary on -either side. This takes the job beyond :meth:`!replace`'s abilities.) +either side. This takes the job beyond :meth:`replace`'s abilities.) Another common task is deleting every occurrence of a single character from a string or replacing it with another single character. You might do this with -something like ``re.sub('\n', ' ', S)``, but :meth:`~str.translate` is capable of +something like ``re.sub('\n', ' ', S)``, but :meth:`translate` is capable of doing both tasks and will be faster than any regular expression operation can be. @@ -1269,23 +1249,23 @@ can be solved with a faster and simpler string method. match() versus search() ----------------------- -The :func:`~re.match` function only checks if the RE matches at the beginning of the -string while :func:`~re.search` will scan forward through the string for a match. -It's important to keep this distinction in mind. Remember, :func:`!match` will +The :func:`match` function only checks if the RE matches at the beginning of the +string while :func:`search` will scan forward through the string for a match. +It's important to keep this distinction in mind. Remember, :func:`match` will only report a successful match which will start at 0; if the match wouldn't -start at zero, :func:`!match` will *not* report it. :: +start at zero, :func:`match` will *not* report it. :: - >>> print(re.match('super', 'superstition').span()) + >>> print re.match('super', 'superstition').span() (0, 5) - >>> print(re.match('super', 'insuperable')) + >>> print re.match('super', 'insuperable') None -On the other hand, :func:`~re.search` will scan forward through the string, +On the other hand, :func:`search` will scan forward through the string, reporting the first match it finds. 
:: - >>> print(re.search('super', 'superstition').span()) + >>> print re.search('super', 'superstition').span() (0, 5) - >>> print(re.search('super', 'insuperable').span()) + >>> print re.search('super', 'insuperable').span() (2, 7) Sometimes you'll be tempted to keep using :func:`re.match`, and just add ``.*`` @@ -1314,17 +1294,17 @@ doesn't work because of the greedy nature of ``.*``. :: >>> s = '<html><head><title>Title</title>' >>> len(s) 32 - >>> print(re.match('<.*>', s).span()) + >>> print re.match('<.*>', s).span() (0, 32) - >>> print(re.match('<.*>', s).group()) + >>> print re.match('<.*>', s).group() <html><head><title>Title</title> -The RE matches the ``'<'`` in ``'<html>'``, and the ``.*`` consumes the rest of +The RE matches the ``'<'`` in ``<html>``, and the ``.*`` consumes the rest of the string. There's still more left in the RE, though, and the ``>`` can't match at the end of the string, so the regular expression engine has to backtrack character by character until it finds a match for the ``>``. The -final match extends from the ``'<'`` in ``'<html>'`` to the ``'>'`` in -``'</title>'``, which isn't what you want. +final match extends from the ``'<'`` in ``<html>`` to the ``'>'`` in +``</title>``, which isn't what you want. In this case, the solution is to use the non-greedy qualifiers ``*?``, ``+?``, ``??``, or ``{m,n}?``, which match as *little* text as possible. In the above @@ -1332,7 +1312,7 @@ example, the ``'>'`` is tried immediately after the first ``'<'`` matches, and when it fails, the engine advances a character at a time, retrying the ``'>'`` at every step. This produces just the right result:: - >>> print(re.match('<.*?>', s).group()) + >>> print re.match('<.*?>', s).group() <html> (Note that parsing HTML or XML with regular expressions is painful. @@ -1343,14 +1323,14 @@ be *very* complicated. Use an HTML or XML parser module for such tasks.) 
Using re.VERBOSE ----------------- +-------------------- By now you've probably noticed that regular expressions are a very compact notation, but they're not terribly readable. REs of moderate complexity can become lengthy collections of backslashes, parentheses, and metacharacters, making them difficult to read and understand. -For such REs, specifying the :const:`re.VERBOSE` flag when compiling the regular +For such REs, specifying the ``re.VERBOSE`` flag when compiling the regular expression can be helpful, because it allows you to format the regular expression more clearly. @@ -1389,5 +1369,11 @@ Friedl's Mastering Regular Expressions, published by O'Reilly. Unfortunately, it exclusively concentrates on Perl and Java's flavours of regular expressions, and doesn't contain any Python material at all, so it won't be useful as a reference for programming in Python. (The first edition covered Python's -now-removed :mod:`!regex` module, which won't help you much.) Consider checking +now-removed :mod:`regex` module, which won't help you much.) Consider checking it out from your library. + + +.. rubric:: Footnotes + +.. [#] Introduced in Python 2.2.2. +