author | Andrew Kuchling <amk@amk.ca> | 2013-08-18 22:57:22 (GMT) |
---|---|---|
committer | Andrew Kuchling <amk@amk.ca> | 2013-08-18 22:57:22 (GMT) |
commit | 3f4f3ba1a86c21dc2aacb09b7b9e1bfb5fa746a3 (patch) | |
tree | 1dd2255ab51197365b4deccd8c903a7a18c85b22 /Doc | |
parent | ba5d8f33ec9538797665ded0b051b3cc6ab52d5c (diff) | |
#18562: various revisions to the regex howto for 3.x
* describe how \w is different when used in bytes and Unicode patterns.
* describe re.ASCII flag to change that behaviour.
* remove personal references ('I generally prefer...')
* add some more links to the re module in the library reference
* various small edits and re-wording.
Diffstat (limited to 'Doc')
-rw-r--r-- | Doc/howto/regex.rst | 132 |
1 files changed, 62 insertions, 70 deletions
diff --git a/Doc/howto/regex.rst b/Doc/howto/regex.rst
index 9adfa85..5203e53 100644
--- a/Doc/howto/regex.rst
+++ b/Doc/howto/regex.rst
@@ -104,13 +104,25 @@ you can still match them in patterns; for example, if you need to match a ``[``
 or ``\``, you can precede them with a backslash to remove their special meaning:
 ``\[`` or ``\\``.
 
-Some of the special sequences beginning with ``'\'`` represent predefined sets
-of characters that are often useful, such as the set of digits, the set of
-letters, or the set of anything that isn't whitespace. The following predefined
-special sequences are a subset of those available. The equivalent classes are
-for bytes patterns. For a complete list of sequences and expanded class
-definitions for Unicode string patterns, see the last part of
-:ref:`Regular Expression Syntax <re-syntax>`.
+Some of the special sequences beginning with ``'\'`` represent
+predefined sets of characters that are often useful, such as the set
+of digits, the set of letters, or the set of anything that isn't
+whitespace.
+
+Let's take an example: ``\w`` matches any alphanumeric character. If
+the regex pattern is expressed in bytes, this is equivalent to the
+class ``[a-zA-Z0-9_]``. If the regex pattern is a string, ``\w`` will
+match all the characters marked as letters in the Unicode database
+provided by the :mod:`unicodedata` module. You can use the more
+restricted definition of ``\w`` in a string pattern by supplying the
+:const:`re.ASCII` flag when compiling the regular expression.
+
+The following list of special sequences isn't complete. For a complete
+list of sequences and expanded class definitions for Unicode string
+patterns, see the last part of :ref:`Regular Expression Syntax
+<re-syntax>` in the Standard Library reference. In general, the
+Unicode versions match any character that's in the appropriate
+category in the Unicode database.
 
 ``\d``
    Matches any decimal digit; this is equivalent to the class ``[0-9]``.
@@ -160,9 +172,8 @@ previous character can be matched zero or more times, instead of exactly once.
 For example, ``ca*t`` will match ``ct`` (0 ``a`` characters), ``cat`` (1 ``a``),
 ``caaat`` (3 ``a`` characters), and so forth. The RE engine has various
 internal limitations stemming from the size of C's ``int`` type that will
-prevent it from matching over 2 billion ``a`` characters; you probably don't
-have enough memory to construct a string that large, so you shouldn't run into
-that limit.
+prevent it from matching over 2 billion ``a`` characters; patterns
+are usually not written to match that much data.
 
 Repetitions such as ``*`` are :dfn:`greedy`; when repeating a RE, the matching
 engine will try to repeat it as many times as possible. If later portions of the
@@ -353,7 +364,7 @@ for a complete listing.
 |                  | returns them as an :term:`iterator`.          |
 +------------------+-----------------------------------------------+
 
-:meth:`match` and :meth:`search` return ``None`` if no match can be found. If
+:meth:`~re.regex.match` and :meth:`~re.regex.search` return ``None`` if no match can be found. If
 they're successful, a :ref:`match object <match-objects>` instance is returned,
 containing information about the match: where it starts and ends, the substring
 it matched, and more.
@@ -419,8 +430,8 @@ Trying these methods will soon clarify their meaning::
    >>> m.span()
    (0, 5)
 
-:meth:`group` returns the substring that was matched by the RE. :meth:`start`
-and :meth:`end` return the starting and ending index of the match. :meth:`span`
+:meth:`~re.match.group` returns the substring that was matched by the RE. :meth:`~re.match.start`
+and :meth:`~re.match.end` return the starting and ending index of the match. :meth:`~re.match.span`
 returns both start and end indexes in a single tuple. Since the :meth:`match`
 method only checks if the RE matches at the start of a string, :meth:`start`
 will always be zero. However, the :meth:`search` method of patterns
@@ -448,14 +459,14 @@ In actual programs, the most common style is to store the
        print('No match')
 
 Two pattern methods return all of the matches for a pattern.
-:meth:`findall` returns a list of matching strings::
+:meth:`~re.regex.findall` returns a list of matching strings::
 
    >>> p = re.compile('\d+')
    >>> p.findall('12 drummers drumming, 11 pipers piping, 10 lords a-leaping')
    ['12', '11', '10']
 
 :meth:`findall` has to create the entire list before it can be returned as the
-result. The :meth:`finditer` method returns a sequence of
+result. The :meth:`~re.regex.finditer` method returns a sequence of
 :ref:`match object <match-objects>` instances as an :term:`iterator`::
 
    >>> iterator = p.finditer('12 drummers drumming, 11 ... 10 ...')
@@ -473,9 +484,9 @@ Module-Level Functions
 ----------------------
 
 You don't have to create a pattern object and call its methods; the
-:mod:`re` module also provides top-level functions called :func:`match`,
-:func:`search`, :func:`findall`, :func:`sub`, and so forth. These functions
-take the same arguments as the corresponding pattern method, with
+:mod:`re` module also provides top-level functions called :func:`~re.match`,
+:func:`~re.search`, :func:`~re.findall`, :func:`~re.sub`, and so forth. These functions
+take the same arguments as the corresponding pattern method with
 the RE string added as the first argument, and still return either ``None`` or a
 :ref:`match object <match-objects>` instance. ::
 
@@ -485,26 +496,15 @@ the RE string added as the first argument, and still return either ``None`` or a
    <_sre.SRE_Match object at 0x...>
 
 Under the hood, these functions simply create a pattern object for you
-and call the appropriate method on it. They also store the compiled object in a
-cache, so future calls using the same RE are faster.
+and call the appropriate method on it. They also store the compiled
+object in a cache, so future calls using the same RE won't need to
+parse the pattern again and again.
 
 Should you use these module-level functions, or should you get the
-pattern and call its methods yourself? That choice depends on how
-frequently the RE will be used, and on your personal coding style. If the RE is
-being used at only one point in the code, then the module functions are probably
-more convenient. If a program contains a lot of regular expressions, or re-uses
-the same ones in several locations, then it might be worthwhile to collect all
-the definitions in one place, in a section of code that compiles all the REs
-ahead of time. To take an example from the standard library, here's an extract
-from the now-defunct Python 2 standard :mod:`xmllib` module::
-
-   ref = re.compile( ... )
-   entityref = re.compile( ... )
-   charref = re.compile( ... )
-   starttagopen = re.compile( ... )
-
-I generally prefer to work with the compiled object, even for one-time uses, but
-few people will be as much of a purist about this as I am.
+pattern and call its methods yourself? If you're accessing a regex
+within a loop, pre-compiling it will save a few function calls.
+Outside of loops, there's not much difference thanks to the internal
+cache.
 
 
 Compilation Flags
@@ -524,6 +524,10 @@ of each one.
 +---------------------------------+--------------------------------------------+
 | Flag                            | Meaning                                    |
 +=================================+============================================+
+| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
+|                                 | ``\s`` and ``\d`` match only on ASCII      |
+|                                 | characters with the respective property.   |
++---------------------------------+--------------------------------------------+
 | :const:`DOTALL`, :const:`S`     | Make ``.`` match any character, including  |
 |                                 | newlines                                   |
 +---------------------------------+--------------------------------------------+
@@ -535,11 +539,7 @@ of each one.
 |                                 | ``$``                                      |
 +---------------------------------+--------------------------------------------+
 | :const:`VERBOSE`, :const:`X`    | Enable verbose REs, which can be organized |
-|                                 | more cleanly and understandably.           |
-+---------------------------------+--------------------------------------------+
-| :const:`ASCII`, :const:`A`      | Makes several escapes like ``\w``, ``\b``, |
-|                                 | ``\s`` and ``\d`` match only on ASCII      |
-|                                 | characters with the respective property.   |
+| (for 'extended')                | more cleanly and understandably.           |
 +---------------------------------+--------------------------------------------+
 
 
@@ -558,7 +558,8 @@ of each one.
           LOCALE
    :noindex:
 
-   Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale.
+   Make ``\w``, ``\W``, ``\b``, and ``\B``, dependent on the current locale
+   instead of the Unicode database.
 
    Locales are a feature of the C library intended to help in writing programs that
    take account of language differences. For example, if you're processing French
@@ -851,11 +852,10 @@ keep track of the group numbers. There are two features which help with this
 problem. Both of them use a common syntax for regular expression extensions, so
 we'll look at that first.
 
-Perl 5 added several additional features to standard regular expressions, and
-the Python :mod:`re` module supports most of them. It would have been
-difficult to choose new single-keystroke metacharacters or new special sequences
-beginning with ``\`` to represent the new features without making Perl's regular
-expressions confusingly different from standard REs. If you chose ``&`` as a
+Perl 5 is well-known for its powerful additions to standard regular expressions.
+For these new features the Perl developers couldn't choose new single-keystroke metacharacters
+or new special sequences beginning with ``\`` without making Perl's regular
+expressions confusingly different from standard REs. If they chose ``&`` as a
 new metacharacter, for example, old expressions would be assuming that ``&`` was
 a regular character and wouldn't have escaped it by writing ``\&`` or ``[&]``.
 
@@ -867,22 +867,15 @@ what extension is being used, so ``(?=foo)`` is one thing (a positive lookahead
 assertion) and ``(?:foo)`` is something else (a non-capturing group containing
 the subexpression ``foo``).
 
-Python adds an extension syntax to Perl's extension syntax. If the first
-character after the question mark is a ``P``, you know that it's an extension
-that's specific to Python. Currently there are two such extensions:
-``(?P<name>...)`` defines a named group, and ``(?P=name)`` is a backreference to
-a named group. If future versions of Perl 5 add similar features using a
-different syntax, the :mod:`re` module will be changed to support the new
-syntax, while preserving the Python-specific syntax for compatibility's sake.
-
-Now that we've looked at the general extension syntax, we can return to the
-features that simplify working with groups in complex REs. Since groups are
-numbered from left to right and a complex expression may use many groups, it can
-become difficult to keep track of the correct numbering. Modifying such a
-complex RE is annoying, too: insert a new group near the beginning and you
-change the numbers of everything that follows it.
-
-Sometimes you'll want to use a group to collect a part of a regular expression,
+Python supports several of Perl's extensions and adds an extension
+syntax to Perl's extension syntax. If the first character after the
+question mark is a ``P``, you know that it's an extension that's
+specific to Python.
+
+Now that we've looked at the general extension syntax, we can return
+to the features that simplify working with groups in complex REs.
+
+Sometimes you'll want to use a group to denote a part of a regular expression,
 but aren't interested in retrieving the group's contents. You can make this fact
 explicit by using a non-capturing group: ``(?:...)``, where you can replace the
 ``...`` with any other regular expression. ::
@@ -908,7 +901,7 @@ numbers, groups can be referenced by a name.
 
 The syntax for a named group is one of the Python-specific extensions:
 ``(?P<name>...)``. *name* is, obviously, the name of the group. Named groups
-also behave exactly like capturing groups, and additionally associate a name
+behave exactly like capturing groups, and additionally associate a name
 with a group. The :ref:`match object <match-objects>` methods that deal with
 capturing groups all accept either integers that refer to the group by number
 or strings that contain the desired group's name. Named groups are still
@@ -975,9 +968,10 @@ The pattern to match this is quite simple:
 ``.*[.].*$``
 
 Notice that the ``.`` needs to be treated specially because it's a
-metacharacter; I've put it inside a character class. Also notice the trailing
-``$``; this is added to ensure that all the rest of the string must be included
-in the extension. This regular expression matches ``foo.bar`` and
+metacharacter, so it's inside a character class to only match that
+specific character. Also notice the trailing ``$``; this is added to
+ensure that all the rest of the string must be included in the
+extension. This regular expression matches ``foo.bar`` and
 ``autoexec.bat`` and ``sendmail.cf`` and ``printers.conf``.
 
 Now, consider complicating the problem a bit; what if you want to match
@@ -1051,7 +1045,7 @@ Splitting Strings
 The :meth:`split` method of a pattern splits a string apart
 wherever the RE matches, returning a list of the pieces. It's similar to the
 :meth:`split` method of strings but provides much more generality in the
-delimiters that you can split by; :meth:`split` only supports splitting by
+delimiters that you can split by; string :meth:`split` only supports splitting by
 whitespace or by a fixed string. As you'd expect, there's a module-level
 :func:`re.split` function, too.
 
@@ -1106,7 +1100,6 @@ Another common task is to find all the matches for a pattern, and replace them
 with a different string. The :meth:`sub` method takes a replacement value,
 which can be either a string or a function, and the string to be processed.
 
-
 .. method:: .sub(replacement, string[, count=0])
    :noindex:
 
@@ -1362,4 +1355,3 @@ and doesn't contain any Python material at all, so it won't be useful as a
 reference for programming in Python. (The first edition covered Python's
 now-removed :mod:`regex` module, which won't help you much.) Consider checking
 it out from your library.
-
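
The behaviour described by the first hunk and the new ``ASCII`` table row can be checked with a short script. This is only an illustrative sketch and not part of the commit; the sample text and the variable names are made up for the example.

```python
import re

# Hypothetical sample text containing two non-ASCII letters.
text = "na\u00efve caf\u00e9 42"

# str pattern: \w follows the Unicode definition, so accented letters match.
unicode_words = re.findall(r"\w+", text)

# str pattern with re.ASCII: \w is restricted to [a-zA-Z0-9_].
ascii_words = re.findall(r"\w+", text, re.ASCII)

# bytes pattern: \w only ever matches the ASCII class [a-zA-Z0-9_].
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))

print(unicode_words)
print(ascii_words)
print(bytes_words)
```

With ``re.ASCII`` or a bytes pattern, each word is split at the accented letter, which is the "more restricted definition" the revised paragraph refers to.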