author: Andrew Kuchling <amk@amk.ca> 2013-08-18 22:57:22 (GMT)
committer: Andrew Kuchling <amk@amk.ca> 2013-08-18 22:57:22 (GMT)
commit: 3f4f3ba1a86c21dc2aacb09b7b9e1bfb5fa746a3
tree: 1dd2255ab51197365b4deccd8c903a7a18c85b22 /Doc/howto
parent: ba5d8f33ec9538797665ded0b051b3cc6ab52d5c
#18562: various revisions to the regex howto for 3.x
* describe how \w is different when used in bytes and Unicode patterns.
* describe re.ASCII flag to change that behaviour.
* remove personal references ('I generally prefer...')
* add some more links to the re module in the library reference
* various small edits and re-wording.
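
The first two items of the commit message can be illustrated with a short sketch. It is not part of the patch; the sample string and variable names are chosen here only for illustration. It shows ``\w`` in a string pattern, in a string pattern compiled with ``re.ASCII``, and in a bytes pattern.

```python
import re

# Illustrative sample text (not taken from the patch).
text = "naïve café 42"

# String pattern: \w follows the Unicode definition of word characters.
unicode_words = re.findall(r"\w+", text)

# String pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Bytes pattern: \w is also restricted to [a-zA-Z0-9_].
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)  # ['naïve', 'café', '42']
print(ascii_words)    # ['na', 've', 'caf', '42']
print(bytes_words)    # [b'na', b've', b'caf', b'42']
```

With ``re.ASCII``, the accented letters no longer count as word characters, so they split the surrounding words in the same way the bytes pattern does.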
Diffstat (limited to 'Doc/howto')
-rw-r--r-- Doc/howto/regex.rst 132
1 files changed, 62 insertions, 70 deletions
diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 9adfa85..5203e53 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -104,13 +104,25 @@ you can still match them in patterns; for example, if you need to match a ``[``
or ``\``, you can precede them with a backslash to remove their special
meaning: ``\[`` or ``\\``.
-Some of the special sequences beginning with ``'\'`` represent predefined sets
-of characters that are often useful, such as the set of digits, the set of
-letters, or the set of anything that isn't whitespace. The following predefined
-special sequences are a subset of those available. The equivalent classes are
-for bytes patterns. For a complete list of sequences and expanded class
-definitions for Unicode string patterns, see the last part of
-:ref:`Regular Expression Syntax <re-syntax>`.
+Some of the special sequences beginning with ``'\'`` represent
+predefined sets of characters that are often useful, such as the set
+of digits, the set of letters, or the set of anything that isn't
+whitespace.
+
+Let's take an example: ``\w`` matches any alphanumeric character. If
+the regex pattern is expressed in bytes, this is equivalent to the
+class ``[a-zA-Z0-9_]``. If the regex pattern is a string, ``\w`` will
+match all the characters marked as letters in the Unicode database
+provided by the :mod:`unicodedata` module. You can use the more
+restricted definition of ``\w`` in a string pattern by supplying the
+:const:`re.ASCII` flag when compiling the regular expression.
+
+The following list of special sequences isn't complete. For a complete
+list of sequences and expanded class definitions for Unicode string
+patterns, see the last part of :ref:`Regular Expression Syntax
+<re-syntax>` in the Standard Library reference. In general, the
+Unicode versions match any character that's in the appropriate
+category in the Unicode database.
``\d``
Matches any decimal digit; this is equivalent to the class ``[0-9]``.
@@ -160,9 +172,8 @@ previous character can be matched zero or more times, instead of exactly once.
For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``),
``caaat`` (3 ``a`` characters), and so forth. The RE engine has various
internal limitations stemming from the size of C's ``int`` type that will
-prevent it from matching over 2 billion ``a`` characters; you probably don't
-have enough memory to construct a string that large, so you shouldn't run into
-that limit.
+prevent it from matching over 2 billion ``a`` characters; patterns
+are usually not written to match that much data.
Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
engine will try to repeat it as many times as possible. If later portions of the
@@ -353,7 +364,7 @@ for a complete listing.
| | returns them as an :term:`iterator`. |
+------------------+-----------------------------------------------+
-:meth:`match` and :meth:`search` return ``None`` if no match can be found. If
+:meth:`~re.regex.match` and :meth:`~re.regex.search` return ``None`` if no match can be found. If
they're successful, a :ref:`match object <match-objects>` instance is returned,
containing information about the match: where it starts and ends, the substring
it matched, and more.
@@ -419,8 +430,8 @@ Trying these methods will soon clarify their meaning::
>>> m.span()
(0, 5)
-:meth:`group` returns the substring that was matched by the RE. :meth:`start`
-and :meth:`end` return the starting and ending index of the match. :meth:`span`
+:meth:`~re.match.group` returns the substring that was matched by the RE. :meth:`~re.match.start`
+and :meth:`~re.match.end` return the starting and ending index of the match. :meth:`~re.match.span`
returns both start and end indexes in a single tuple. Since the :meth:`match`
method only checks if the RE matches at the start of a string, :meth:`start`
will always be zero. However, the :meth:`search` method of patterns
@@ -448,14 +459,14 @@ In actual programs, the most common style is to store the
print('No match')
Two pattern methods return all of the matches for a pattern.
-:meth:`findall` returns a list of matching strings::
+:meth:`~re.regex.findall` returns a list of matching strings::
>>> p = re.compile('\d+')
>>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
['12', '11', '10']
:meth:`findall` has to create the entire list before it can be returned as the
-result. The :meth:`finditer` method returns a sequence of
+result. The :meth:`~re.regex.finditer` method returns a sequence of
:ref:`match object <match-objects>` instances as an :term:`iterator`::
>>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
@@ -473,9 +484,9 @@ Module-Level Functions
----------------------
You don't have to create a pattern object and call its methods; the
-:mod:`re` module also provides top-level functions called :func:`match`,
-:func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions
-take the same arguments as the corresponding pattern method, with
+:mod:`re` module also provides top-level functions called :func:`~re.match`,
+:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions
+take the same arguments as the corresponding pattern method with
the RE string added as the first argument, and still return either ``None`` or a
:ref:`match object <match-objects>` instance. ::
@@ -485,26 +496,15 @@ the RE string added as the first argument, and still return either ``None`` or a
<_sre.SRE_Match object at 0x...>
Under the hood, these functions simply create a pattern object for you
-and call the appropriate method on it. They also store the compiled object in a
-cache, so future calls using the same RE are faster.
+and call the appropriate method on it. They also store the compiled
+object in a cache, so future calls using the same RE won't need to
+parse the pattern again and again.
Should you use these module-level functions, or should you get the
-pattern and call its methods yourself? That choice depends on how
-frequently the RE will be used, and on your personal coding style. If the RE is
-being used at only one point in the code, then the module functions are probably
-more convenient. If a program contains a lot of regular expressions, or re-uses
-the same ones in several locations, then it might be worthwhile to collect all
-the definitions in one place, in a section of code that compiles all the REs
-ahead of time. To take an example from the standard library, here's an extract
-from the now-defunct Python 2 standard :mod:`xmllib` module::
-
- ref = re.compile( ... )
- entityref = re.compile( ... )
- charref = re.compile( ... )
- starttagopen = re.compile( ... )
-
-I generally prefer to work with the compiled object, even for one-time uses, but
-few people will be as much of a purist about this as I am.
+pattern and call its methods yourself? If you're accessing a regex
+within a loop, pre-compiling it will save a few function calls.
+Outside of loops, there's not much difference thanks to the internal
+cache.
Compilation Flags
@@ -524,6 +524,10 @@ of each one.
+---------------------------------+--------------------------------------------+
| Flag | Meaning |
+=================================+============================================+
+| :const:`ASCII`, :const:`A` | Makes several escapes like ``\w``, ``\b``, |
+| | ``\s`` and ``\d`` match only on ASCII |
+| | characters with the respective property. |
++---------------------------------+--------------------------------------------+
| :const:`DOTALL`, :const:`S` | Make ``.`` match any character, including |
| | newlines |
+---------------------------------+--------------------------------------------+
@@ -535,11 +539,7 @@ of each one.
| | ``$`` |
+---------------------------------+--------------------------------------------+
| :const:`VERBOSE`, :const:`X` | Enable verbose REs, which can be organized |
-| | more cleanly and understandably. |
-+---------------------------------+--------------------------------------------+
-| :const:`ASCII`, :const:`A` | Makes several escapes like ``\w``, ``\b``, |
-| | ``\s`` and ``\d`` match only on ASCII |
-| | characters with the respective property. |
+| (for 'extended') | more cleanly and understandably. |
+---------------------------------+--------------------------------------------+
@@ -558,7 +558,8 @@ of each one.
LOCALE
:noindex:
- Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale.
+ Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale
+ instead of the Unicode database.
Locales are a feature of the C library intended to help in writing programs that
take account of language differences. For example, if you're processing French
@@ -851,11 +852,10 @@ keep track of the group numbers. There are two features which help with this
problem. Both of them use a common syntax for regular expression extensions, so
we'll look at that first.
-Perl 5 added several additional features to standard regular expressions, and
-the Python :mod:`re` module supports most of them. It would have been
-difficult to choose new single-keystroke metacharacters or new special sequences
-beginning with ``\`` to represent the new features without making Perl's regular
-expressions confusingly different from standard REs. If you chose ``&`` as a
+Perl 5 is well-known for its powerful additions to standard regular expressions.
+For these new features the Perl developers couldn't choose new single-keystroke metacharacters
+or new special sequences beginning with ``\`` without making Perl's regular
+expressions confusingly different from standard REs. If they chose ``&`` as a
new metacharacter, for example, old expressions would be assuming that ``&`` was
a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``.
@@ -867,22 +867,15 @@ what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead
assertion) and ``(?:foo)`` is something else (a non-capturing group containing
the subexpression ``foo``).
-Python adds an extension syntax to Perl's extension syntax. If the first
-character after the question mark is a ``P``, you know that it's an extension
-that's specific to Python. Currently there are two such extensions:
-``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to
-a named group. If future versions of Perl 5 add similar features using a
-different syntax, the :mod:`re` module will be changed to support the new
-syntax, while preserving the Python-specific syntax for compatibility's sake.
-
-Now that we've looked at the general extension syntax, we can return to the
-features that simplify working with groups in complex REs. Since groups are
-numbered from left to right and a complex expression may use many groups, it can
-become difficult to keep track of the correct numbering. Modifying such a
-complex RE is annoying, too: insert a new group near the beginning and you
-change the numbers of everything that follows it.
-
-Sometimes you'll want to use a group to collect a part of a regular expression,
+Python supports several of Perl's extensions and adds an extension
+syntax to Perl's extension syntax. If the first character after the
+question mark is a ``P``, you know that it's an extension that's
+specific to Python.
+
+Now that we've looked at the general extension syntax, we can return
+to the features that simplify working with groups in complex REs.
+
+Sometimes you'll want to use a group to denote a part of a regular expression,
but aren't interested in retrieving the group's contents. You can make this fact
explicit by using a non-capturing group: ``(?:...)``, where you can replace the
``...`` with any other regular expression. ::
@@ -908,7 +901,7 @@ numbers, groups can be referenced by a name.
The syntax for a named group is one of the Python-specific extensions:
``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups
-also behave exactly like capturing groups, and additionally associate a name
+behave exactly like capturing groups, and additionally associate a name
with a group. The :ref:`match object <match-objects>` methods that deal with
capturing groups all accept either integers that refer to the group by number
or strings that contain the desired group's name. Named groups are still
@@ -975,9 +968,10 @@ The pattern to match this is quite simple:
``.*[.].*$``
Notice that the ``.`` needs to be treated specially because it's a
-metacharacter; I've put it inside a character class. Also notice the trailing
-``$``; this is added to ensure that all the rest of the string must be included
-in the extension. This regular expression matches ``foo.bar`` and
+metacharacter, so it's inside a character class to only match that
+specific character. Also notice the trailing ``$``; this is added to
+ensure that all the rest of the string must be included in the
+extension. This regular expression matches ``foo.bar`` and
``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``.
Now, consider complicating the problem a bit; what if you want to match
@@ -1051,7 +1045,7 @@ Splitting Strings
The :meth:`split` method of a pattern splits a string apart
wherever the RE matches, returning a list of the pieces. It's similar to the
:meth:`split` method of strings but provides much more generality in the
-delimiters that you can split by; :meth:`split` only supports splitting by
+delimiters that you can split by; string :meth:`split` only supports splitting by
whitespace or by a fixed string. As you'd expect, there's a module-level
:func:`re.split` function, too.
@@ -1106,7 +1100,6 @@ Another common task is to find all the matches for a pattern, and replace them
with a different string. The :meth:`sub` method takes a replacement value,
which can be either a string or a function, and the string to be processed.
-
.. method:: .sub(replacement, string[, count=0])
:noindex:
@@ -1362,4 +1355,3 @@ and doesn't contain any Python material at all, so it won't be useful as a
reference for programming in Python. (The first edition covered Python's
now-removed :mod:`regex` module, which won't help you much.) Consider checking
it out from your library.
-
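
The revised "Module-Level Functions" text in this patch says that the module-level functions build a pattern object and call the matching method on it. A minimal sketch of that equivalence follows; it is not part of the patch, and the variable names are illustrative. The input string is the one used in the howto's ``findall`` example.

```python
import re

text = "12 drummers drumming, 11 pipers piping, 10 lords a-leaping"

# Pre-compiled pattern object, then a method call on it.
pattern = re.compile(r"\d+")
compiled_result = pattern.findall(text)

# Module-level function: the RE string is passed as the first argument.
module_result = re.findall(r"\d+", text)

# finditer yields match objects lazily; span() gives (start, end) for each.
spans = [m.span() for m in pattern.finditer(text)]

print(compiled_result == module_result)  # True
print(spans[0])                          # (0, 2)
```

Both calls return the same list here. Which form to prefer is the judgment call discussed in the revised paragraph.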