author Antoine Pitrou <solipsis@pitrou.net> 2008-08-19 17:56:33 (GMT)
committer Antoine Pitrou <solipsis@pitrou.net> 2008-08-19 17:56:33 (GMT)
commit fd036451bf0e0ade8783e21df801abf7be96d020
tree e70ff65a9e641d8e790bc091f0dc2507baf344ca /Doc
parent 3ad7ba10a20827b24d4b1aa9dd49474db8affbdd
#2834: Change re module semantics, so that str and bytes mixing is forbidden,
and str (unicode) patterns get full unicode matching by default. The re.ASCII flag is also introduced to ask for ASCII matching instead.
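
The behaviour described in this commit message can be sketched as follows. This is a minimal illustration, assuming a recent Python 3 interpreter (``re.fullmatch`` was added after this commit); the helper names ``mixing_is_rejected`` and ``unicode_flag_rejected_for_bytes`` and the sample word are hypothetical and not part of the patch.

```python
import re

WORD = "caf\u00e9"  # "café": the last letter is outside the ASCII range


def mixing_is_rejected(pattern, string):
    """Return True if re.search refuses this pattern/string combination."""
    try:
        re.search(pattern, string)
    except TypeError:
        return True
    return False


def unicode_flag_rejected_for_bytes():
    """Return True if compiling a bytes pattern with re.UNICODE fails."""
    try:
        re.compile(rb"\w+", re.UNICODE)
    except ValueError:
        return True
    return False


# str patterns use full Unicode matching by default.
unicode_default = re.fullmatch(r"\w+", WORD) is not None

# re.ASCII (or the inline (?a) flag) restricts \w to [a-zA-Z0-9_].
ascii_flag = re.fullmatch(r"\w+", WORD, re.ASCII) is not None
ascii_inline = re.fullmatch(r"(?a)\w+", WORD) is not None

# bytes patterns always use ASCII matching (unless re.LOCALE is given).
bytes_default = re.fullmatch(rb"\w+", WORD.encode("utf-8")) is not None

print(unicode_default, ascii_flag, ascii_inline, bytes_default)
print(mixing_is_rejected(b"abc", "abc"), mixing_is_rejected("abc", b"abc"))
print(unicode_flag_rejected_for_bytes())
```

The helpers return booleans rather than letting the exceptions propagate, so the sketch runs to completion and each rule can be checked independently.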
Diffstat (limited to 'Doc')
-rw-r--r-- Doc/library/re.rst | 127
1 files changed, 77 insertions, 50 deletions
diff --git a/Doc/library/re.rst b/Doc/library/re.rst
index a6ebc22..f6f0d89 100644
--- a/Doc/library/re.rst
+++ b/Doc/library/re.rst
@@ -11,9 +11,13 @@
This module provides regular expression matching operations similar to
-those found in Perl. Both patterns and strings to be searched can be
-Unicode strings as well as 8-bit strings. The :mod:`re` module is
-always available.
+those found in Perl. The :mod:`re` module is always available.
+
+Both patterns and strings to be searched can be Unicode strings as well as
+8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed:
+that is, you cannot match a Unicode string with a byte pattern or
+vice-versa; similarly, when asking for a substitution, the replacement
+string must be of the same type as both the pattern and the search string.
Regular expressions use the backslash character (``'\'``) to indicate
special forms or to allow special characters to be used without invoking
@@ -212,12 +216,12 @@ The special characters are:
group; ``(?P<name>...)`` is the only exception to this rule. Following are the
currently supported extensions.
-``(?iLmsux)``
- (One or more letters from the set ``'i'``, ``'L'``, ``'m'``, ``'s'``,
- ``'u'``, ``'x'``.) The group matches the empty string; the letters
- set the corresponding flags: :const:`re.I` (ignore case),
- :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
- :const:`re.S` (dot matches all), :const:`re.U` (Unicode dependent),
+``(?aiLmsux)``
+ (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
+ ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
+ letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
+ :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
+ :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
and :const:`re.X` (verbose), for the entire regular expression. (The
flags are described in :ref:`contents-of-module-re`.) This
is useful if you wish to include the flags as part of the regular
@@ -324,56 +328,62 @@ the second character. For example, ``\$`` matches the character ``'$'``.
word is indicated by whitespace or a non-alphanumeric, non-underscore character.
Note that ``\b`` is defined as the boundary between ``\w`` and ``\W``, so the
precise set of characters deemed to be alphanumeric depends on the values of the
- ``UNICODE`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
+ ``ASCII`` and ``LOCALE`` flags. Inside a character range, ``\b`` represents
the backspace character, for compatibility with Python's string literals.
``\B``
Matches the empty string, but only when it is *not* at the beginning or end of a
word. This is just the opposite of ``\b``, so is also subject to the settings
- of ``LOCALE`` and ``UNICODE``.
+ of ``ASCII`` and ``LOCALE``.
``\d``
- When the :const:`UNICODE` flag is not specified, matches any decimal digit; this
- is equivalent to the set ``[0-9]``. With :const:`UNICODE`, it will match
- whatever is classified as a digit in the Unicode character properties database.
+ For Unicode (str) patterns:
+ When the :const:`ASCII` flag is specified, matches any decimal digit; this
+ is equivalent to the set ``[0-9]``. Otherwise, it will match whatever
+ is classified as a digit in the Unicode character properties database
+ (this includes the standard ASCII digits, so it is a superset of
+ ``[0-9]``).
+ For 8-bit (bytes) patterns:
+ Matches any decimal digit; this is equivalent to the set ``[0-9]``.
``\D``
- When the :const:`UNICODE` flag is not specified, matches any non-digit
- character; this is equivalent to the set ``[^0-9]``. With :const:`UNICODE`, it
- will match anything other than character marked as digits in the Unicode
- character properties database.
+ Matches any character which is not a decimal digit. This is the
+ opposite of ``\d`` and is therefore similarly subject to the setting of
+ the ``ASCII`` flag.
``\s``
- When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
- any whitespace character; this is equivalent to the set ``[ \t\n\r\f\v]``. With
- :const:`LOCALE`, it will match this set plus whatever characters are defined as
- space for the current locale. If :const:`UNICODE` is set, this will match the
- characters ``[ \t\n\r\f\v]`` plus whatever is classified as space in the Unicode
- character properties database.
+ For Unicode (str) patterns:
+ When the :const:`ASCII` flag is specified, matches only ASCII whitespace
+ characters; this is equivalent to the set ``[ \t\n\r\f\v]``. Otherwise,
+ it will match this set plus whatever is classified as space in the Unicode
+ character properties database (including for example the non-breaking
+ spaces mandated by typography rules in many languages).
+ For 8-bit (bytes) patterns:
+ Matches characters considered whitespace in the ASCII character set;
+ this is equivalent to the set ``[ \t\n\r\f\v]``.
``\S``
- When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
- any non-whitespace character; this is equivalent to the set ``[^ \t\n\r\f\v]``
- With :const:`LOCALE`, it will match any character not in this set, and not
- defined as space in the current locale. If :const:`UNICODE` is set, this will
- match anything other than ``[ \t\n\r\f\v]`` and characters marked as space in
- the Unicode character properties database.
+ Matches any character which is not a whitespace character. This is the
+ opposite of ``\s`` and is therefore similarly subject to the settings of
+ ``ASCII`` and ``LOCALE``.
``\w``
- When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
- any alphanumeric character and the underscore; this is equivalent to the set
- ``[a-zA-Z0-9_]``. With :const:`LOCALE`, it will match the set ``[0-9_]`` plus
- whatever characters are defined as alphanumeric for the current locale. If
- :const:`UNICODE` is set, this will match the characters ``[0-9_]`` plus whatever
- is classified as alphanumeric in the Unicode character properties database.
+ For Unicode (str) patterns:
+ When the :const:`ASCII` flag is specified, this is equivalent to the set
+ ``[a-zA-Z0-9_]``. Otherwise, it will match whatever is classified as
+ alphanumeric in the Unicode character properties database (it will
+ include most characters that can be part of a word in any language,
+ as well as numbers and the underscore).
+ For 8-bit (bytes) patterns:
+ Matches characters considered alphanumeric in the ASCII character set;
+ this is equivalent to the set ``[a-zA-Z0-9_]``. With :const:`LOCALE`,
+ it will additionally match whatever characters are defined as
+ alphanumeric for the current locale.
``\W``
- When the :const:`LOCALE` and :const:`UNICODE` flags are not specified, matches
- any non-alphanumeric character; this is equivalent to the set ``[^a-zA-Z0-9_]``.
- With :const:`LOCALE`, it will match any character not in the set ``[0-9_]``, and
- not defined as alphanumeric for the current locale. If :const:`UNICODE` is set,
- this will match anything other than ``[0-9_]`` and characters marked as
- alphanumeric in the Unicode character properties database.
+ Matches any character which is not an alphanumeric character. This is the
+ opposite of ``\w`` and is therefore similarly subject to the settings of
+ ``ASCII`` and ``LOCALE``.
``\Z``
Matches only at the end of the string.
@@ -454,6 +464,25 @@ form.
expression at a time needn't worry about compiling regular expressions.)
+.. data:: A
+ ASCII
+
+ Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
+ perform ASCII-only matching instead of full Unicode matching. This is only
+ meaningful for Unicode patterns, and is ignored for byte patterns.
+
+ Note that the :const:`re.U` flag still exists (as well as its synonym
+ :const:`re.UNICODE` and its embedded counterpart ``(?u)``), but it has
+ become useless in Python 3.0.
+ In previous Python versions, it was used to specify that
+ matching had to be Unicode dependent (the default was ASCII matching in
+ all circumstances). Starting from Python 3.0, the default is Unicode
+ matching for Unicode strings (which can be changed by specifying the
+ ``'a'`` flag), and ASCII matching for 8-bit strings. Furthermore,
+ Unicode-dependent matching for 8-bit strings is no longer allowed and
+ raises a :exc:`ValueError`.
+
+
.. data:: I
IGNORECASE
@@ -465,7 +494,10 @@ form.
LOCALE
Make ``\w``, ``\W``, ``\b``, ``\B``, ``\s`` and ``\S`` dependent on the
- current locale.
+ current locale. The use of this flag is discouraged as the locale mechanism
+ is very unreliable, and it only handles one "culture" at a time anyway;
+ you should use Unicode matching instead, which is the default in Python 3.0
+ for Unicode (str) patterns.
.. data:: M
@@ -486,13 +518,6 @@ form.
newline; without this flag, ``'.'`` will match anything *except* a newline.
-.. data:: U
- UNICODE
-
- Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` dependent
- on the Unicode character properties database.
-
-
.. data:: X
VERBOSE
@@ -511,6 +536,8 @@ form.
b = re.compile(r"\d+\.\d*")
+
+
.. function:: search(pattern, string[, flags])
Scan through *string* looking for a location where the regular expression