Diffstat (limited to 'Doc')
-rw-r--r-- Doc/library/re.rst 127
1 files changed, 77 insertions, 50 deletions
diff --git a/Doc/library/re.rst b/Doc/library/re.rst
index a6ebc22..f6f0d89 100644
--- a/Doc/library/re.rst
+++ b/Doc/library/re.rst
@@ -11,9 +11,13 @@
This module provides regular expression matching operations similar to
-those found in Perl. Both patterns and strings to be searched can be
-Unicode strings as well as 8-bit strings. The :mod:`re` module is
-always available.
+those found in Perl. The :mod:`re` module is always available.
+
+Both patterns and strings to be searched can be Unicode strings as well as
+8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
+that is, you cannot match a Unicode string with a byte pattern or
+vice-versa; similarly, when asking for a substitution, the replacement
+string must be of the same type as both the pattern and the search string.
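
A minimal sketch of the type rule above, assuming a current Python 3 interpreter, where mixing the two types raises ``TypeError`` (the exception type is not stated in the text, and the helper name ``can_search`` is made up for this illustration):

```python
import re


def can_search(pattern, text):
    """Return True if re.search accepts this pattern/text pair."""
    try:
        re.search(pattern, text)
        return True
    except TypeError:
        # Raised when a str pattern meets bytes input, or the reverse.
        return False


str_with_str = can_search(r"\d+", "room 101")        # same type: accepted
bytes_with_bytes = can_search(rb"\d+", b"room 101")  # same type: accepted
bytes_with_str = can_search(rb"\d+", "room 101")     # mixed: rejected
str_with_bytes = can_search(r"\d+", b"room 101")     # mixed: rejected
```

The two same-type calls succeed; the two mixed calls are rejected before any matching is attempted.
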
Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
@@ -212,12 +216,12 @@ The special characters are:
group; ``(?P<name>...)`` is the only exception to this rule. Following are the
currently supported extensions.
-``(?iLmsux)``
- (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
- ``'u'``, ``'x'``.) The group matches the empty string; the letters
- set the corresponding flags: :const:`re.I` (ignore case),
- :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
- :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
+``(?aiLmsux)``
+ (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
+ ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
+ letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
+ :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
+ :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
and :const:`re.X` (verbose), for the entire regular expression. (The
flags are described in :ref:`contents-of-module-re`.) This
is useful if you wish to include the flags as part of the regular
@@ -324,56 +328,62 @@ the second character. For example, ``\$`` matches the character ``'$'``.
word is indicated by whitespace or a non-alphanumeric, non-underscore character.
Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
precise set of characters deemed to be alphanumeric depends on the values of the
- ``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
+ ``ASCII`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
the backspace character, for compatibility with Python's string literals.
``\B``
Matches the empty string, but only when it is *not* at the beginning or end of a
word. This is just the opposite of ``\b``, so is also subject to the settings
- of ``LOCALE`` and ``UNICODE``.
+ of ``ASCII`` and ``LOCALE``.
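
The dependence of ``\b`` on the ``ASCII`` flag can be sketched as follows. This is an illustration only; the sample word is an arbitrary choice, not taken from the documentation:

```python
import re

word = "caf\u00e9"  # "café": the final letter is not an ASCII character

# Default (Unicode) matching: "é" counts as a word character, so there is
# no word boundary between "f" and "é" and the pattern cannot match.
unicode_hit = re.search(r"\bcaf\b", word)

# ASCII matching: "é" is not a word character, so a boundary follows "f".
ascii_hit = re.search(r"\bcaf\b", word, re.ASCII)
```

Here ``unicode_hit`` is ``None`` while ``ascii_hit`` is a match object covering ``caf``.
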
``\d``
- When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
- is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
- whatever is classified as a digit in the Unicode character properties database.
+ For Unicode (str) patterns:
+ When the :const:`ASCII` flag is specified, matches any decimal digit; this
+ is equivalent to the set ``[0-9]``. Otherwise, it will match whatever
+ is classified as a digit in the Unicode character properties database
+ (but this does include the standard ASCII digits and is thus a superset
+ of ``[0-9]``).
+ For 8-bit (bytes) patterns:
+ Matches any decimal digit; this is equivalent to the set ``[0-9]``.
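
A short sketch of the three ``\d`` cases described above. The sample string, which mixes an ASCII digit with ARABIC-INDIC DIGIT THREE (U+0663), is an assumption chosen for illustration:

```python
import re

text = "4\u0663"  # ASCII "4" followed by ARABIC-INDIC DIGIT THREE

# str pattern, default flags: any Unicode decimal digit matches.
unicode_digits = re.findall(r"\d", text)

# str pattern with ASCII: only [0-9] matches.
ascii_digits = re.findall(r"\d", text, re.ASCII)

# bytes pattern: always equivalent to [0-9].
bytes_digits = re.findall(rb"\d", text.encode("utf-8"))
```
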
``\D``
- When the :const:`UNICODE` flag is not specified, matches any non-digit
- character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
- will match anything other than character marked as digits in the Unicode
- character properties database.
+ Matches any character which is not a decimal digit. This is the
+ opposite of ``\d`` and is therefore similarly subject to the settings of
+ ``ASCII`` and ``LOCALE``.
``\s``
- When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
- any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
- :const:`LOCALE`, it will match this set plus whatever characters are defined as
- space for the current locale. If :const:`UNICODE` is set, this will match the
- characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
- character properties database.
+ For Unicode (str) patterns:
+ When the :const:`ASCII` flag is specified, matches only ASCII whitespace
+ characters; this is equivalent to the set ``[ \t\n\r\f\v]``. Otherwise,
+ it will match this set plus whatever is classified as space in the Unicode
+ character properties database (including for example the non-breaking
+ spaces mandated by typography rules in many languages).
+ For 8-bit (bytes) patterns:
+ Matches characters considered whitespace in the ASCII character set;
+ this is equivalent to the set ``[ \t\n\r\f\v]``.
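
The non-breaking space mentioned above makes a convenient test case. The following sketch, with an arbitrary sample string, splits on ``\s`` under each mode:

```python
import re

text = "10\u00a0km"  # NO-BREAK SPACE (U+00A0) between "10" and "km"

# str pattern, default flags: U+00A0 is Unicode whitespace, so it splits.
unicode_parts = re.split(r"\s", text)

# str pattern with ASCII: U+00A0 is outside [ \t\n\r\f\v], so no split.
ascii_parts = re.split(r"\s", text, flags=re.ASCII)

# bytes pattern: only ASCII whitespace bytes count, so no split either.
bytes_parts = re.split(rb"\s", text.encode("utf-8"))
```
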
``\S``
- When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
- any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``
- With :const:`LOCALE`, it will match any character not in this set, and not
- defined as space in the current locale. If :const:`UNICODE` is set, this will
- match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
- the Unicode character properties database.
+ Matches any character which is not a whitespace character. This is the
+ opposite of ``\s`` and is therefore similarly subject to the settings of
+ ``ASCII`` and ``LOCALE``.
``\w``
- When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
- any alphanumeric character and the underscore; this is equivalent to the set
- ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
- whatever characters are defined as alphanumeric for the current locale. If
- :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
- is classified as alphanumeric in the Unicode character properties database.
+ For Unicode (str) patterns:
+ When the :const:`ASCII` flag is specified, this is equivalent to the set
+ ``[a-zA-Z0-9_]``. Otherwise, it will match whatever is classified as
+ alphanumeric in the Unicode character properties database (it will
+ include most characters that can be part of a word in whatever language,
+ as well as numbers and the underscore sign).
+ For 8-bit (bytes) patterns:
+ Matches characters considered alphanumeric in the ASCII character set;
+ this is equivalent to the set ``[a-zA-Z0-9_]``. With :const:`LOCALE`,
+ it will additionally match whatever characters are defined as
+ alphanumeric for the current locale.
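
As a rough illustration of the ``\w`` rules above (the sample text is an assumption, and the ``LOCALE`` case is left out because its result depends on the system locale):

```python
import re

text = "caf\u00e9 na\u00efve_1"  # "café naïve_1"

# str pattern, default flags: accented letters are word characters.
unicode_words = re.findall(r"\w+", text)

# str pattern with ASCII: only [a-zA-Z0-9_] counts, so words are cut
# at each accented letter.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# bytes pattern: the UTF-8 bytes of "é" are not ASCII alphanumerics.
bytes_words = re.findall(rb"\w+", "caf\u00e9".encode("utf-8"))
```
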
``\W``
- When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
- any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
- With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
- not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
- this will match anything other than ``[0-9_]`` and characters marked as
- alphanumeric in the Unicode character properties database.
+ Matches any character which is not an alphanumeric character. This is the
+ opposite of ``\w`` and is therefore similarly subject to the settings of
+ ``ASCII`` and ``LOCALE``.
``\Z``
Matches only at the end of the string.
@@ -454,6 +464,25 @@ form.
expression at a time needn't worry about compiling regular expressions.)
+.. data:: A
+ ASCII
+
+ Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
+ perform ASCII-only matching instead of full Unicode matching. This is only
+ meaningful for Unicode patterns, and is ignored for byte patterns.
+
+ Note that the :const:`re.U` flag still exists (as well as its synonym
+ :const:`re.UNICODE` and its embedded counterpart ``(?u)``), but it has
+ become useless in Python 3.0.
+ In previous Python versions, it was used to specify that
+ matching had to be Unicode dependent (the default was ASCII matching in
+ all circumstances). Starting from Python 3.0, the default is Unicode
+ matching for Unicode strings (which can be changed by specifying the
+ ``'a'`` flag), and ASCII matching for 8-bit strings. Further, Unicode
+ dependent matching for 8-bit strings isn't allowed anymore and results
+ in a ValueError.
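
The behaviour described in this entry can be sketched as follows, assuming a current Python 3 interpreter; the sample text and variable names are illustrative only:

```python
import re

text = "na\u00efve"  # "naïve"

# The flag argument and the embedded ``(?a)`` form behave the same way.
flag_words = re.findall(r"\w+", text, re.ASCII)
inline_words = re.findall(r"(?a)\w+", text)

# Without the flag, str patterns use Unicode matching.
default_words = re.findall(r"\w+", text)

# A compiled str pattern reports re.UNICODE even when it was not requested,
# which is why passing re.U explicitly changes nothing.
implicit_unicode = bool(re.compile(r"\w+").flags & re.UNICODE)

# Requesting Unicode matching for a bytes pattern is rejected.
try:
    re.compile(rb"\w+", re.UNICODE)
    bytes_unicode_error = None
except ValueError as exc:
    bytes_unicode_error = exc
```
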
+
+
.. data:: I
IGNORECASE
@@ -465,7 +494,10 @@ form.
LOCALE
Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
- current locale.
+ current locale. The use of this flag is discouraged as the locale mechanism
+ is very unreliable, and it only handles one "culture" at a time anyway;
+ you should use Unicode matching instead, which is the default in Python 3.0
+ for Unicode (str) patterns.
.. data:: M
@@ -486,13 +518,6 @@ form.
newline; without this flag, ``'.'`` will match anything *except* a newline.
-.. data:: U
- UNICODE
-
- Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
- on the Unicode character properties database.
-
-
.. data:: X
VERBOSE
@@ -511,6 +536,8 @@ form.
b = re.compile(r"\d+\.\d*")
+
+
.. function:: search(pattern, string[, flags])
Scan through *string* looking for a location where the regular expression