author: Adam Turner <9087854+AA-Turner@users.noreply.github.com> 2024-01-11 23:56:10 (GMT)
committer: GitHub <noreply@github.com> 2024-01-11 23:56:10 (GMT)
commit: c9b8a22f3404d59e2c4950715f8c29413a349b8e (patch)
tree: 5319484c42d2a59dda9aa0b43522778f3b3d1ebc /Doc/library/re.rst
parent: b4d4aa9e8d61476267951c72321fadffc2d82227
GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (#107679)
Diffstat (limited to 'Doc/library/re.rst')
-rw-r--r-- Doc/library/re.rst 237
1 files changed, 145 insertions, 92 deletions
diff --git a/Doc/library/re.rst b/Doc/library/re.rst
index 302f722..5bb9339 100644
--- a/Doc/library/re.rst
+++ b/Doc/library/re.rst
@@ -17,7 +17,7 @@ those found in Perl.
Both patterns and strings to be searched can be Unicode strings (:class:`str`)
as well as 8-bit strings (:class:`bytes`).
However, Unicode strings and 8-bit strings cannot be mixed:
-that is, you cannot match a Unicode string with a byte pattern or
+that is, you cannot match a Unicode string with a bytes pattern or
vice-versa; similarly, when asking for a substitution, the replacement
string must be of the same type as both the pattern and the search string.
@@ -257,8 +257,7 @@ The special characters are:
.. index:: single: \ (backslash); in regular expressions
* Character classes such as ``\w`` or ``\S`` (defined below) are also accepted
- inside a set, although the characters they match depends on whether
- :const:`ASCII` or :const:`LOCALE` mode is in force.
+ inside a set, although the characters they match depend on the flags_ used.
.. index:: single: ^ (caret); in regular expressions
@@ -326,18 +325,24 @@ The special characters are:
currently supported extensions.
``(?aiLmsux)``
- (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
- ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the
- letters set the corresponding flags: :const:`re.A` (ASCII-only matching),
- :const:`re.I` (ignore case), :const:`re.L` (locale dependent),
- :const:`re.M` (multi-line), :const:`re.S` (dot matches all),
- :const:`re.U` (Unicode matching), and :const:`re.X` (verbose),
- for the entire regular expression.
+ (One or more letters from the set
+ ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.)
+ The group matches the empty string;
+ the letters set the corresponding flags for the entire regular expression:
+
+ * :const:`re.A` (ASCII-only matching)
+ * :const:`re.I` (ignore case)
+ * :const:`re.L` (locale dependent)
+ * :const:`re.M` (multi-line)
+ * :const:`re.S` (dot matches all)
+ * :const:`re.U` (Unicode matching)
+ * :const:`re.X` (verbose)
+
(The flags are described in :ref:`contents-of-module-re`.)
This is useful if you wish to include the flags as part of the
regular expression, instead of passing a *flag* argument to the
- :func:`re.compile` function. Flags should be used first in the
- expression string.
+ :func:`re.compile` function.
+ Flags should be used first in the expression string.
.. versionchanged:: 3.11
This construction can only be used at the start of the expression.
@@ -351,14 +356,20 @@ The special characters are:
pattern.
``(?aiLmsux-imsx:...)``
- (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``,
- ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by
+ (Zero or more letters from the set
+ ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``,
+ optionally followed by ``'-'`` followed by
one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.)
- The letters set or remove the corresponding flags:
- :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case),
- :const:`re.L` (locale dependent), :const:`re.M` (multi-line),
- :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching),
- and :const:`re.X` (verbose), for the part of the expression.
+ The letters set or remove the corresponding flags for the part of the expression:
+
+ * :const:`re.A` (ASCII-only matching)
+ * :const:`re.I` (ignore case)
+ * :const:`re.L` (locale dependent)
+ * :const:`re.M` (multi-line)
+ * :const:`re.S` (dot matches all)
+ * :const:`re.U` (Unicode matching)
+ * :const:`re.X` (verbose)
+
(The flags are described in :ref:`contents-of-module-re`.)
The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used
@@ -366,7 +377,7 @@ The special characters are:
when one of them appears in an inline group, it overrides the matching mode
in the enclosing group. In Unicode patterns ``(?a:...)`` switches to
ASCII-only matching, and ``(?u:...)`` switches to Unicode matching
- (default). In byte pattern ``(?L:...)`` switches to locale depending
+ (default). In bytes patterns ``(?L:...)`` switches to locale dependent
matching, and ``(?a:...)`` switches to ASCII-only matching (default).
This override is only in effect for the narrow inline group, and the
original matching mode is restored outside of the group.
@@ -529,47 +540,61 @@ character ``'$'``.
``\b``
Matches the empty string, but only at the beginning or end of a word.
- A word is defined as a sequence of word characters. Note that formally,
- ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character
- (or vice versa), or between ``\w`` and the beginning/end of the string.
- This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``,
- ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``.
-
- By default Unicode alphanumerics are the ones used in Unicode patterns, but
- this can be changed by using the :const:`ASCII` flag. Word boundaries are
- determined by the current locale if the :const:`LOCALE` flag is used.
- Inside a character range, ``\b`` represents the backspace character, for
- compatibility with Python's string literals.
+ A word is defined as a sequence of word characters.
+ Note that formally, ``\b`` is defined as the boundary
+ between a ``\w`` and a ``\W`` character (or vice versa),
+ or between ``\w`` and the beginning or end of the string.
+ This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``,
+ and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``.
+
+ The default word characters in Unicode (str) patterns
+ are Unicode alphanumerics and the underscore,
+ but this can be changed by using the :py:const:`~re.ASCII` flag.
+ Word boundaries are determined by the current locale
+ if the :py:const:`~re.LOCALE` flag is used.
+
+ .. note::
+
+ Inside a character range, ``\b`` represents the backspace character,
+ for compatibility with Python's string literals.
.. index:: single: \B; in regular expressions
``\B``
- Matches the empty string, but only when it is *not* at the beginning or end
- of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``,
- ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``.
- ``\B`` is just the opposite of ``\b``, so word characters in Unicode
- patterns are Unicode alphanumerics or the underscore, although this can
- be changed by using the :const:`ASCII` flag. Word boundaries are
- determined by the current locale if the :const:`LOCALE` flag is used.
+ Matches the empty string,
+ but only when it is *not* at the beginning or end of a word.
+ This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``,
+ ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``.
+ ``\B`` is the opposite of ``\b``,
+ so word characters in Unicode (str) patterns
+ are Unicode alphanumerics or the underscore,
+ although this can be changed by using the :py:const:`~re.ASCII` flag.
+ Word boundaries are determined by the current locale
+ if the :py:const:`~re.LOCALE` flag is used.
.. index:: single: \d; in regular expressions
``\d``
For Unicode (str) patterns:
- Matches any Unicode decimal digit (that is, any character in
- Unicode character category [Nd]). This includes ``[0-9]``, and
- also many other digit characters. If the :const:`ASCII` flag is
- used only ``[0-9]`` is matched.
+ Matches any Unicode decimal digit
+ (that is, any character in Unicode character category `[Nd]`__).
+ This includes ``[0-9]``, and also many other digit characters.
+
+ Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used.
+
+ __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153
For 8-bit (bytes) patterns:
- Matches any decimal digit; this is equivalent to ``[0-9]``.
+ Matches any decimal digit in the ASCII character set;
+ this is equivalent to ``[0-9]``.
.. index:: single: \D; in regular expressions
``\D``
- Matches any character which is not a decimal digit. This is
- the opposite of ``\d``. If the :const:`ASCII` flag is used this
- becomes the equivalent of ``[^0-9]``.
+ Matches any character which is not a decimal digit.
+ This is the opposite of ``\d``.
+
+ Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used.
.. index:: single: \s; in regular expressions
@@ -578,8 +603,9 @@ character ``'$'``.
Matches Unicode whitespace characters (which includes
``[ \t\n\r\f\v]``, and also many other characters, for example the
non-breaking spaces mandated by typography rules in many
- languages). If the :const:`ASCII` flag is used, only
- ``[ \t\n\r\f\v]`` is matched.
+ languages).
+
+ Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
For 8-bit (bytes) patterns:
Matches characters considered whitespace in the ASCII character set;
@@ -589,30 +615,39 @@ character ``'$'``.
``\S``
Matches any character which is not a whitespace character. This is
- the opposite of ``\s``. If the :const:`ASCII` flag is used this
- becomes the equivalent of ``[^ \t\n\r\f\v]``.
+ the opposite of ``\s``.
+
+ Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used.
.. index:: single: \w; in regular expressions
``\w``
For Unicode (str) patterns:
- Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`)
+ Matches Unicode word characters;
+ this includes all Unicode alphanumeric characters
+ (as defined by :py:meth:`str.isalnum`),
as well as the underscore (``_``).
- If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched.
+
+ Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
For 8-bit (bytes) patterns:
Matches characters considered alphanumeric in the ASCII character set;
- this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
- used, matches characters considered alphanumeric in the current locale
- and the underscore.
+ this is equivalent to ``[a-zA-Z0-9_]``.
+ If the :py:const:`~re.LOCALE` flag is used,
+ matches characters considered alphanumeric in the current locale and the underscore.
.. index:: single: \W; in regular expressions
``\W``
- Matches any character which is not a word character. This is
- the opposite of ``\w``. If the :const:`ASCII` flag is used this
- becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is
- used, matches characters which are neither alphanumeric in the current locale
+ Matches any character which is not a word character.
+ This is the opposite of ``\w``.
+ By default, matches non-underscore (``_``) characters
+ for which :py:meth:`str.isalnum` returns ``False``.
+
+ Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used.
+
+ If the :py:const:`~re.LOCALE` flag is used,
+ matches characters which are neither alphanumeric in the current locale
nor the underscore.
.. index:: single: \Z; in regular expressions
@@ -644,9 +679,11 @@ string literals are also accepted by the regular expression parser::
(Note that ``\b`` is used to represent word boundaries, and means "backspace"
only inside character classes.)
-``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode
-patterns. In bytes patterns they are errors. Unknown escapes of ASCII
-letters are reserved for future use and treated as errors.
+``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are
+only recognized in Unicode (str) patterns.
+In bytes patterns they are errors.
+Unknown escapes of ASCII letters are reserved
+for future use and treated as errors.
Octal escapes are included in a limited form. If the first digit is a 0, or if
there are three octal digits, it is considered an octal escape. Otherwise, it is
@@ -694,30 +731,37 @@ Flags
Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S``
perform ASCII-only matching instead of full Unicode matching. This is only
- meaningful for Unicode patterns, and is ignored for byte patterns.
+ meaningful for Unicode (str) patterns, and is ignored for bytes patterns.
+
Corresponds to the inline flag ``(?a)``.
- Note that for backward compatibility, the :const:`re.U` flag still
- exists (as well as its synonym :const:`re.UNICODE` and its embedded
- counterpart ``(?u)``), but these are redundant in Python 3 since
- matches are Unicode by default for strings (and Unicode matching
- isn't allowed for bytes).
+ .. note::
+
+ The :py:const:`~re.U` flag still exists for backward compatibility,
+ but is redundant in Python 3 since
+ matches are Unicode by default for ``str`` patterns,
+ and Unicode matching isn't allowed for bytes patterns.
+ :py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant.
.. data:: DEBUG
Display debug information about compiled expression.
+
No corresponding inline flag.
.. data:: I
IGNORECASE
- Perform case-insensitive matching; expressions like ``[A-Z]`` will also
- match lowercase letters. Full Unicode matching (such as ``Ü`` matching
- ``ü``) also works unless the :const:`re.ASCII` flag is used to disable
- non-ASCII matches. The current locale does not change the effect of this
- flag unless the :const:`re.LOCALE` flag is also used.
+ Perform case-insensitive matching;
+ expressions like ``[A-Z]`` will also match lowercase letters.
+ Full Unicode matching (such as ``Ü`` matching ``ü``)
+ also works unless the :py:const:`~re.ASCII` flag
+ is used to disable non-ASCII matches.
+ The current locale does not change the effect of this flag
+ unless the :py:const:`~re.LOCALE` flag is also used.
+
Corresponds to the inline flag ``(?i)``.
Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in
@@ -725,29 +769,35 @@ Flags
letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital
letter I with dot above), 'ı' (U+0131, Latin small letter dotless i),
'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign).
- If the :const:`ASCII` flag is used, only letters 'a' to 'z'
+ If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z'
and 'A' to 'Z' are matched.
.. data:: L
LOCALE
Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching
- dependent on the current locale. This flag can be used only with bytes
- patterns. The use of this flag is discouraged as the locale mechanism
- is very unreliable, it only handles one "culture" at a time, and it only
- works with 8-bit locales. Unicode matching is already enabled by default
- in Python 3 for Unicode (str) patterns, and it is able to handle different
- locales/languages.
+ dependent on the current locale.
+ This flag can be used only with bytes patterns.
+
Corresponds to the inline flag ``(?L)``.
+ .. warning::
+
+ This flag is discouraged; consider Unicode matching instead.
+ The locale mechanism is very unreliable
+ as it only handles one "culture" at a time
+ and only works with 8-bit locales.
+ Unicode matching is enabled by default for Unicode (str) patterns
+ and it is able to handle different locales and languages.
+
.. versionchanged:: 3.6
- :const:`re.LOCALE` can be used only with bytes patterns and is
- not compatible with :const:`re.ASCII`.
+ :py:const:`~re.LOCALE` can be used only with bytes patterns
+ and is not compatible with :py:const:`~re.ASCII`.
.. versionchanged:: 3.7
- Compiled regular expression objects with the :const:`re.LOCALE` flag no
- longer depend on the locale at compile time. Only the locale at
- matching time affects the result of matching.
+ Compiled regular expression objects with the :py:const:`~re.LOCALE` flag
+ no longer depend on the locale at compile time.
+ Only the locale at matching time affects the result of matching.
.. data:: M
@@ -759,6 +809,7 @@ Flags
end of each line (immediately preceding each newline). By default, ``'^'``
matches only at the beginning of the string, and ``'$'`` only at the end of the
string and immediately before the newline (if any) at the end of the string.
+
Corresponds to the inline flag ``(?m)``.
.. data:: NOFLAG
@@ -778,19 +829,19 @@ Flags
Make the ``'.'`` special character match any character at all, including a
newline; without this flag, ``'.'`` will match anything *except* a newline.
+
Corresponds to the inline flag ``(?s)``.
.. data:: U
UNICODE
- In Python 2, this flag made :ref:`special sequences <re-special-sequences>`
- include Unicode characters in matches. Since Python 3, Unicode characters
- are matched by default.
-
- See :const:`A` for restricting matching on ASCII characters instead.
+ In Python 3, Unicode characters are matched by default
+ for ``str`` patterns.
+ This flag is therefore redundant with **no effect**
+ and is only kept for backward compatibility.
- This flag is only kept for backward compatibility.
+ See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead.
.. data:: X
VERBOSE
@@ -914,6 +965,8 @@ Functions
Empty matches for the pattern split the string only when not adjacent
to a previous empty match.
+ .. code:: pycon
+
>>> re.split(r'\b', 'Words, words, words.')
['', 'Words', ', ', 'words', ', ', 'words', '.']
>>> re.split(r'\W*', '...words...')
@@ -1237,7 +1290,7 @@ Regular Expression Objects
The regex matching flags. This is a combination of the flags given to
:func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit
- flags such as :data:`UNICODE` if the pattern is a Unicode string.
+ flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string.
.. attribute:: Pattern.groups
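
The behaviour described in the patch above can be checked with a short sketch. This is an illustrative example only, not part of the commit: the sample strings and variable names are assumptions chosen for the demonstration, and it uses only documented ``re`` functions and flags.

```python
import re

# Sample data chosen for this sketch (not taken from the patch).
ARABIC_INDIC_THREE = "\u0663"   # a decimal digit in category Nd, outside [0-9]
TEXT = "na\u00efve caf\u00e9"   # "naive cafe" with two non-ASCII letters

# \d matches any Unicode decimal digit in str patterns,
# but only [0-9] when the ASCII flag is set.
unicode_digit = re.fullmatch(r"\d", ARABIC_INDIC_THREE)
ascii_digit = re.fullmatch(r"\d", ARABIC_INDIC_THREE, flags=re.ASCII)

# \w matches Unicode alphanumerics and the underscore by default;
# re.ASCII (or the inline group (?a:...)) restricts it to [a-zA-Z0-9_].
unicode_words = re.findall(r"\w+", TEXT)
ascii_words = re.findall(r"\w+", TEXT, flags=re.ASCII)
inline_ascii_words = re.findall(r"(?a:\w+)", TEXT)

# \b only matches between a word character and a non-word character
# (or the start/end of the string), so 'attempt' and 'atlas' are skipped.
boundary_hits = re.findall(r"\bat\b", "at. (at) attempt atlas")

# re.UNICODE is implied for str patterns and absent when re.ASCII is used.
implicit_unicode = bool(re.compile(r"\w").flags & re.UNICODE)
unicode_with_ascii = bool(re.compile(r"\w", re.ASCII).flags & re.UNICODE)

# A bytes pattern cannot be applied to a str (and vice versa).
try:
    re.search(rb"\d", "123")
    mixed_types_rejected = False
except TypeError:
    mixed_types_rejected = True

print(bool(unicode_digit), bool(ascii_digit))
print(unicode_words, ascii_words, inline_ascii_words)
print(boundary_hits)
print(implicit_unicode, unicode_with_ascii, mixed_types_rejected)
```

Escaped code points are used instead of literal accented characters so that the result does not depend on how the source file is normalised or encoded.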