Diffstat (limited to 'Doc')
-rw-r--r-- | Doc/library/re.rst | 127 |
1 files changed, 77 insertions, 50 deletions
diff --git a/Doc/library/re.rst b/Doc/library/re.rst
index a6ebc22..f6f0d89 100644
--- a/Doc/library/re.rst
+++ b/Doc/library/re.rst
@@ -11,9 +11,13 @@
 
 
 This module provides regular expression matching operations similar to
-those found in Perl. Both patterns and strings to be searched can be
-Unicode strings as well as 8-bit strings. The :mod:`re` module is
-always available.
+those found in Perl. The :mod:`re` module is always available.
+
+Both patterns and strings to be searched can be Unicode strings as well as
+8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
+that is, you cannot match an Unicode string with a byte pattern or
+vice-versa; similarly, when asking for a substition, the replacement
+string must be of the same type as both the pattern and the search string.
 
 Regular expressions use the backslash character (``'\'``) to indicate
 special forms or to allow special characters to be used without invoking
@@ -212,12 +216,12 @@ The special characters are:
    group; ``(?P<name>...)`` is the only exception to this rule. Following are the
    currently supported extensions.
 
-``(?iLmsux)``
-   (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
-   ``'u'``, ``'x'``.) The group matches the empty string; the letters
-   set the corresponding flags: :const:`re.I` (ignore case),
-   :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
-   :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
+``(?aiLmsux)``
+   (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
+   ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
+   letters set the corresponding flags: :const:`re.a` (ASCII-only matching),
+   :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
+   :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
    and :const:`re.X` (verbose), for the entire regular expression. (The flags
    are described in :ref:`contents-of-module-re`.) This is useful if you wish
    to include the flags as part of the regular
@@ -324,56 +328,62 @@ the second character. For example, ``\$`` matches the character ``'$'``.
    word is indicated by whitespace or a non-alphanumeric, non-underscore character.
    Note that ``\b`` is defined as the boundary between ``\w`` and ``\ W``, so the
    precise set of characters deemed to be alphanumeric depends on the values of the
-   ``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
+   ``ASCII`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
    the backspace character, for compatibility with Python's string literals.
 
 ``\B``
    Matches the empty string, but only when it is *not* at the beginning or end of a
    word. This is just the opposite of ``\b``, so is also subject to the settings
-   of ``LOCALE`` and ``UNICODE``.
+   of ``ASCII`` and ``LOCALE`` .
 
 ``\d``
-   When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
-   is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
-   whatever is classified as a digit in the Unicode character properties database.
+   For Unicode (str) patterns:
+      When the :const:`ASCII` flag is specified, matches any decimal digit; this
+      is equivalent to the set ``[0-9]``. Otherwise, it will match whatever
+      is classified as a digit in the Unicode character properties database
+      (but this does include the standard ASCII digits and is thus a superset
+      of [0-9]).
+   For 8-bit (bytes) patterns:
+      Matches any decimal digit; this is equivalent to the set ``[0-9]``.
 
 ``\D``
-   When the :const:`UNICODE` flag is not specified, matches any non-digit
-   character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
-   will match anything other than character marked as digits in the Unicode
-   character properties database.
+   Matches any character which is not a decimal digit. This is the
+   opposite of ``\d`` and is therefore similarly subject to the settings of
+   ``ASCII`` and ``LOCALE``.
 
 ``\s``
-   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
-   any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
-   :const:`LOCALE`, it will match this set plus whatever characters are defined as
-   space for the current locale. If :const:`UNICODE` is set, this will match the
-   characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
-   character properties database.
+   For Unicode (str) patterns:
+      When the :const:`ASCII` flag is specified, matches only ASCII whitespace
+      characters; this is equivalent to the set ``[ \t\n\r\f\v]``. Otherwise,
+      it will match this set whatever is classified as space in the Unicode
+      character properties database (including for example the non-breaking
+      spaces mandated by typography rules in many languages).
+   For 8-bit (bytes) patterns:
+      Matches characters considered whitespace in the ASCII character set;
+      this is equivalent to the set ``[ \t\n\r\f\v]``.
 
 ``\S``
-   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
-   any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``
-   With :const:`LOCALE`, it will match any character not in this set, and not
-   defined as space in the current locale. If :const:`UNICODE` is set, this will
-   match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
-   the Unicode character properties database.
+   Matches any character which is not a whitespace character. This is the
+   opposite of ``\s`` and is therefore similarly subject to the settings of
+   ``ASCII`` and ``LOCALE``.
 
 ``\w``
-   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
-   any alphanumeric character and the underscore; this is equivalent to the set
-   ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
-   whatever characters are defined as alphanumeric for the current locale. If
-   :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
-   is classified as alphanumeric in the Unicode character properties database.
+   For Unicode (str) patterns:
+      When the :const:`ASCII` flag is specified, this is equivalent to the set
+      ``[a-zA-Z0-9_]``. Otherwise, it will match whatever is classified as
+      alphanumeric in the Unicode character properties database (it will
+      include most characters that can be part of a word in whatever language,
+      as well as numbers and the underscore sign).
+   For 8-bit (bytes) patterns:
+      Matches characters considered alphanumeric in the ASCII character set;
+      this is equivalent to the set ``[a-zA-Z0-9_]``. With :const:`LOCALE`,
+      it will additionally match whatever characters are defined as
+      alphanumeric for the current locale.
 
 ``\W``
-   When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
-   any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
-   With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
-   not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
-   this will match anything other than ``[0-9_]`` and characters marked as
-   alphanumeric in the Unicode character properties database.
+   Matches any character which is not an alphanumeric character. This is the
+   opposite of ``\w`` and is therefore similarly subject to the settings of
+   ``ASCII`` and ``LOCALE``.
 
 ``\Z``
    Matches only at the end of the string.
@@ -454,6 +464,25 @@ form.
    expression at a time needn't worry about compiling regular expressions.)
 
 
+.. data:: A
+          ASCII
+
+   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` perform ASCII-only
+   matching instead of full Unicode matching. This is only meaningful for
+   Unicode patterns, and is ignored for byte patterns.
+
+   Note that the :const:`re.U` flag still exists (as well as its synonym
+   :const:`re.UNICODE` and its embedded counterpart ``(?u)``), but it has
+   become useless in Python 3.0.
+   In previous Python versions, it was used to specify that
+   matching had to be Unicode dependent (the default was ASCII matching in
+   all circumstances). Starting from Python 3.0, the default is Unicode
+   matching for Unicode strings (which can be changed by specifying the
+   ``'a'`` flag), and ASCII matching for 8-bit strings. Further, Unicode
+   dependent matching for 8-bit strings isn't allowed anymore and results
+   in a ValueError.
+
+
 .. data:: I
           IGNORECASE
 
@@ -465,7 +494,10 @@ form.
           LOCALE
 
    Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
-   current locale.
+   current locale. The use of this flag is discouraged as the locale mechanism
+   is very unreliable, and it only handles one "culture" at a time anyway;
+   you should use Unicode matching instead, which is the default in Python 3.0
+   for Unicode (str) patterns.
 
 
 .. data:: M
@@ -486,13 +518,6 @@ form.
    newline; without this flag, ``'.'`` will match anything *except* a newline.
 
 
-.. data:: U
-          UNICODE
-
-   Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
-   on the Unicode character properties database.
-
-
 .. data:: X
           VERBOSE
 
@@ -511,6 +536,8 @@ form.
    b = re.compile(r"\d+\.\d*")
 
 
+
+
 .. function:: search(pattern, string[, flags])
 
    Scan through *string* looking for a location where the regular expression
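
Illustration only, not part of the patch: a minimal sketch of the behaviour the new text describes, assuming a Python 3 interpreter where the flag is spelled ``re.ASCII`` (short form ``re.A``). The sample string and variable names are made up for the example.

```python
import re

# Hypothetical sample: 'café', then the digits 1, ARABIC-INDIC DIGIT TWO, 3.
text = "caf\u00e9 1\u06623"

# str patterns match using Unicode character properties by default.
unicode_words = re.findall(r"\w+", text)
unicode_digits = re.findall(r"\d", text)

# re.ASCII restricts \w and \d to [a-zA-Z0-9_] and [0-9].
ascii_words = re.findall(r"\w+", text, re.ASCII)
ascii_digits = re.findall(r"\d", text, re.ASCII)

# The embedded (?a) form sets the same flag from inside the pattern.
inline_ascii_words = re.findall(r"(?a)\w+", text)

# bytes patterns always use ASCII matching (no LOCALE flag given here).
bytes_words = re.findall(rb"\w+", text.encode("utf-8"))

# A bytes pattern cannot be applied to a str.
try:
    re.search(rb"\d", text)
    mixing_error = None
except TypeError as exc:
    mixing_error = type(exc).__name__

# Requesting Unicode-dependent matching for a bytes pattern is rejected.
try:
    re.compile(rb"\w", re.UNICODE)
    unicode_bytes_error = None
except ValueError as exc:
    unicode_bytes_error = type(exc).__name__
```

Here ``ascii_words`` drops the accented letter and the non-ASCII digit, while ``unicode_words`` keeps both. The two ``try`` blocks correspond to the "cannot be mixed" paragraph and the ValueError mentioned in the note on ``re.U``.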