author | Adam Turner <9087854+AA-Turner@users.noreply.github.com> | 2024-01-11 23:56:10 (GMT) |
---|---|---|
committer | GitHub <noreply@github.com> | 2024-01-11 23:56:10 (GMT) |
commit | c9b8a22f3404d59e2c4950715f8c29413a349b8e | |
tree | 5319484c42d2a59dda9aa0b43522778f3b3d1ebc /Doc/library/re.rst | |
parent | b4d4aa9e8d61476267951c72321fadffc2d82227 | |
GH-107678: Improve Unicode handling clarity in ``library/re.rst`` (#107679)
Diffstat (limited to 'Doc/library/re.rst')
-rw-r--r-- | Doc/library/re.rst | 237 |
1 files changed, 145 insertions, 92 deletions
diff --git a/Doc/library/re.rst b/Doc/library/re.rst index 302f722..5bb9339 100644 --- a/Doc/library/re.rst +++ b/Doc/library/re.rst @@ -17,7 +17,7 @@ those found in Perl. Both patterns and strings to be searched can be Unicode strings (:class:`str`) as well as 8-bit strings (:class:`bytes`). However, Unicode strings and 8-bit strings cannot be mixed: -that is, you cannot match a Unicode string with a byte pattern or +that is, you cannot match a Unicode string with a bytes pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string. @@ -257,8 +257,7 @@ The special characters are: .. index:: single: \ (backslash); in regular expressions * Character classes such as ``\w`` or ``\S`` (defined below) are also accepted - inside a set, although the characters they match depends on whether - :const:`ASCII` or :const:`LOCALE` mode is in force. + inside a set, although the characters they match depend on the flags_ used. .. index:: single: ^ (caret); in regular expressions @@ -326,18 +325,24 @@ The special characters are: currently supported extensions. ``(?aiLmsux)`` - (One or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``, - ``'s'``, ``'u'``, ``'x'``.) The group matches the empty string; the - letters set the corresponding flags: :const:`re.A` (ASCII-only matching), - :const:`re.I` (ignore case), :const:`re.L` (locale dependent), - :const:`re.M` (multi-line), :const:`re.S` (dot matches all), - :const:`re.U` (Unicode matching), and :const:`re.X` (verbose), - for the entire regular expression. + (One or more letters from the set + ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``.) 
+ The group matches the empty string; + the letters set the corresponding flags for the entire regular expression: + + * :const:`re.A` (ASCII-only matching) + * :const:`re.I` (ignore case) + * :const:`re.L` (locale dependent) + * :const:`re.M` (multi-line) + * :const:`re.S` (dot matches all) + * :const:`re.U` (Unicode matching) + * :const:`re.X` (verbose) + (The flags are described in :ref:`contents-of-module-re`.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a *flag* argument to the - :func:`re.compile` function. Flags should be used first in the - expression string. + :func:`re.compile` function. + Flags should be used first in the expression string. .. versionchanged:: 3.11 This construction can only be used at the start of the expression. @@ -351,14 +356,20 @@ The special characters are: pattern. ``(?aiLmsux-imsx:...)`` - (Zero or more letters from the set ``'a'``, ``'i'``, ``'L'``, ``'m'``, - ``'s'``, ``'u'``, ``'x'``, optionally followed by ``'-'`` followed by + (Zero or more letters from the set + ``'a'``, ``'i'``, ``'L'``, ``'m'``, ``'s'``, ``'u'``, ``'x'``, + optionally followed by ``'-'`` followed by one or more letters from the ``'i'``, ``'m'``, ``'s'``, ``'x'``.) - The letters set or remove the corresponding flags: - :const:`re.A` (ASCII-only matching), :const:`re.I` (ignore case), - :const:`re.L` (locale dependent), :const:`re.M` (multi-line), - :const:`re.S` (dot matches all), :const:`re.U` (Unicode matching), - and :const:`re.X` (verbose), for the part of the expression. + The letters set or remove the corresponding flags for the part of the expression: + + * :const:`re.A` (ASCII-only matching) + * :const:`re.I` (ignore case) + * :const:`re.L` (locale dependent) + * :const:`re.M` (multi-line) + * :const:`re.S` (dot matches all) + * :const:`re.U` (Unicode matching) + * :const:`re.X` (verbose) + (The flags are described in :ref:`contents-of-module-re`.) 
The letters ``'a'``, ``'L'`` and ``'u'`` are mutually exclusive when used @@ -366,7 +377,7 @@ The special characters are: when one of them appears in an inline group, it overrides the matching mode in the enclosing group. In Unicode patterns ``(?a:...)`` switches to ASCII-only matching, and ``(?u:...)`` switches to Unicode matching - (default). In byte pattern ``(?L:...)`` switches to locale depending + (default). In bytes patterns ``(?L:...)`` switches to locale dependent matching, and ``(?a:...)`` switches to ASCII-only matching (default). This override is only in effect for the narrow inline group, and the original matching mode is restored outside of the group. @@ -529,47 +540,61 @@ character ``'$'``. ``\b`` Matches the empty string, but only at the beginning or end of a word. - A word is defined as a sequence of word characters. Note that formally, - ``\b`` is defined as the boundary between a ``\w`` and a ``\W`` character - (or vice versa), or between ``\w`` and the beginning/end of the string. - This means that ``r'\bfoo\b'`` matches ``'foo'``, ``'foo.'``, ``'(foo)'``, - ``'bar foo baz'`` but not ``'foobar'`` or ``'foo3'``. - - By default Unicode alphanumerics are the ones used in Unicode patterns, but - this can be changed by using the :const:`ASCII` flag. Word boundaries are - determined by the current locale if the :const:`LOCALE` flag is used. - Inside a character range, ``\b`` represents the backspace character, for - compatibility with Python's string literals. + A word is defined as a sequence of word characters. + Note that formally, ``\b`` is defined as the boundary + between a ``\w`` and a ``\W`` character (or vice versa), + or between ``\w`` and the beginning or end of the string. + This means that ``r'\bat\b'`` matches ``'at'``, ``'at.'``, ``'(at)'``, + and ``'as at ay'`` but not ``'attempt'`` or ``'atlas'``. 
+ + The default word characters in Unicode (str) patterns + are Unicode alphanumerics and the underscore, + but this can be changed by using the :py:const:`~re.ASCII` flag. + Word boundaries are determined by the current locale + if the :py:const:`~re.LOCALE` flag is used. + + .. note:: + + Inside a character range, ``\b`` represents the backspace character, + for compatibility with Python's string literals. .. index:: single: \B; in regular expressions ``\B`` - Matches the empty string, but only when it is *not* at the beginning or end - of a word. This means that ``r'py\B'`` matches ``'python'``, ``'py3'``, - ``'py2'``, but not ``'py'``, ``'py.'``, or ``'py!'``. - ``\B`` is just the opposite of ``\b``, so word characters in Unicode - patterns are Unicode alphanumerics or the underscore, although this can - be changed by using the :const:`ASCII` flag. Word boundaries are - determined by the current locale if the :const:`LOCALE` flag is used. + Matches the empty string, + but only when it is *not* at the beginning or end of a word. + This means that ``r'at\B'`` matches ``'athens'``, ``'atom'``, + ``'attorney'``, but not ``'at'``, ``'at.'``, or ``'at!'``. + ``\B`` is the opposite of ``\b``, + so word characters in Unicode (str) patterns + are Unicode alphanumerics or the underscore, + although this can be changed by using the :py:const:`~re.ASCII` flag. + Word boundaries are determined by the current locale + if the :py:const:`~re.LOCALE` flag is used. .. index:: single: \d; in regular expressions ``\d`` For Unicode (str) patterns: - Matches any Unicode decimal digit (that is, any character in - Unicode character category [Nd]). This includes ``[0-9]``, and - also many other digit characters. If the :const:`ASCII` flag is - used only ``[0-9]`` is matched. + Matches any Unicode decimal digit + (that is, any character in Unicode character category `[Nd]`__). + This includes ``[0-9]``, and also many other digit characters. 
+ + Matches ``[0-9]`` if the :py:const:`~re.ASCII` flag is used. + + __ https://www.unicode.org/versions/Unicode15.0.0/ch04.pdf#G134153 For 8-bit (bytes) patterns: - Matches any decimal digit; this is equivalent to ``[0-9]``. + Matches any decimal digit in the ASCII character set; + this is equivalent to ``[0-9]``. .. index:: single: \D; in regular expressions ``\D`` - Matches any character which is not a decimal digit. This is - the opposite of ``\d``. If the :const:`ASCII` flag is used this - becomes the equivalent of ``[^0-9]``. + Matches any character which is not a decimal digit. + This is the opposite of ``\d``. + + Matches ``[^0-9]`` if the :py:const:`~re.ASCII` flag is used. .. index:: single: \s; in regular expressions @@ -578,8 +603,9 @@ character ``'$'``. Matches Unicode whitespace characters (which includes ``[ \t\n\r\f\v]``, and also many other characters, for example the non-breaking spaces mandated by typography rules in many - languages). If the :const:`ASCII` flag is used, only - ``[ \t\n\r\f\v]`` is matched. + languages). + + Matches ``[ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used. For 8-bit (bytes) patterns: Matches characters considered whitespace in the ASCII character set; @@ -589,30 +615,39 @@ character ``'$'``. ``\S`` Matches any character which is not a whitespace character. This is - the opposite of ``\s``. If the :const:`ASCII` flag is used this - becomes the equivalent of ``[^ \t\n\r\f\v]``. + the opposite of ``\s``. + + Matches ``[^ \t\n\r\f\v]`` if the :py:const:`~re.ASCII` flag is used. .. index:: single: \w; in regular expressions ``\w`` For Unicode (str) patterns: - Matches Unicode word characters; this includes alphanumeric characters (as defined by :meth:`str.isalnum`) + Matches Unicode word characters; + this includes all Unicode alphanumeric characters + (as defined by :py:meth:`str.isalnum`), as well as the underscore (``_``). - If the :const:`ASCII` flag is used, only ``[a-zA-Z0-9_]`` is matched. 
+ + Matches ``[a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used. For 8-bit (bytes) patterns: Matches characters considered alphanumeric in the ASCII character set; - this is equivalent to ``[a-zA-Z0-9_]``. If the :const:`LOCALE` flag is - used, matches characters considered alphanumeric in the current locale - and the underscore. + this is equivalent to ``[a-zA-Z0-9_]``. + If the :py:const:`~re.LOCALE` flag is used, + matches characters considered alphanumeric in the current locale and the underscore. .. index:: single: \W; in regular expressions ``\W`` - Matches any character which is not a word character. This is - the opposite of ``\w``. If the :const:`ASCII` flag is used this - becomes the equivalent of ``[^a-zA-Z0-9_]``. If the :const:`LOCALE` flag is - used, matches characters which are neither alphanumeric in the current locale + Matches any character which is not a word character. + This is the opposite of ``\w``. + By default, matches non-underscore (``_``) characters + for which :py:meth:`str.isalnum` returns ``False``. + + Matches ``[^a-zA-Z0-9_]`` if the :py:const:`~re.ASCII` flag is used. + + If the :py:const:`~re.LOCALE` flag is used, + matches characters which are neither alphanumeric in the current locale nor the underscore. .. index:: single: \Z; in regular expressions @@ -644,9 +679,11 @@ string literals are also accepted by the regular expression parser:: (Note that ``\b`` is used to represent word boundaries, and means "backspace" only inside character classes.) -``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are only recognized in Unicode -patterns. In bytes patterns they are errors. Unknown escapes of ASCII -letters are reserved for future use and treated as errors. +``'\u'``, ``'\U'``, and ``'\N'`` escape sequences are +only recognized in Unicode (str) patterns. +In bytes patterns they are errors. +Unknown escapes of ASCII letters are reserved +for future use and treated as errors. Octal escapes are included in a limited form. 
If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is @@ -694,30 +731,37 @@ Flags Make ``\w``, ``\W``, ``\b``, ``\B``, ``\d``, ``\D``, ``\s`` and ``\S`` perform ASCII-only matching instead of full Unicode matching. This is only - meaningful for Unicode patterns, and is ignored for byte patterns. + meaningful for Unicode (str) patterns, and is ignored for bytes patterns. + Corresponds to the inline flag ``(?a)``. - Note that for backward compatibility, the :const:`re.U` flag still - exists (as well as its synonym :const:`re.UNICODE` and its embedded - counterpart ``(?u)``), but these are redundant in Python 3 since - matches are Unicode by default for strings (and Unicode matching - isn't allowed for bytes). + .. note:: + + The :py:const:`~re.U` flag still exists for backward compatibility, + but is redundant in Python 3 since + matches are Unicode by default for ``str`` patterns, + and Unicode matching isn't allowed for bytes patterns. + :py:const:`~re.UNICODE` and the inline flag ``(?u)`` are similarly redundant. .. data:: DEBUG Display debug information about compiled expression. + No corresponding inline flag. .. data:: I IGNORECASE - Perform case-insensitive matching; expressions like ``[A-Z]`` will also - match lowercase letters. Full Unicode matching (such as ``Ü`` matching - ``ü``) also works unless the :const:`re.ASCII` flag is used to disable - non-ASCII matches. The current locale does not change the effect of this - flag unless the :const:`re.LOCALE` flag is also used. + Perform case-insensitive matching; + expressions like ``[A-Z]`` will also match lowercase letters. + Full Unicode matching (such as ``Ü`` matching ``ü``) + also works unless the :py:const:`~re.ASCII` flag + is used to disable non-ASCII matches. + The current locale does not change the effect of this flag + unless the :py:const:`~re.LOCALE` flag is also used. + Corresponds to the inline flag ``(?i)``. 
Note that when the Unicode patterns ``[a-z]`` or ``[A-Z]`` are used in @@ -725,29 +769,35 @@ Flags letters and 4 additional non-ASCII letters: 'İ' (U+0130, Latin capital letter I with dot above), 'ı' (U+0131, Latin small letter dotless i), 'ſ' (U+017F, Latin small letter long s) and 'K' (U+212A, Kelvin sign). - If the :const:`ASCII` flag is used, only letters 'a' to 'z' + If the :py:const:`~re.ASCII` flag is used, only letters 'a' to 'z' and 'A' to 'Z' are matched. .. data:: L LOCALE Make ``\w``, ``\W``, ``\b``, ``\B`` and case-insensitive matching - dependent on the current locale. This flag can be used only with bytes - patterns. The use of this flag is discouraged as the locale mechanism - is very unreliable, it only handles one "culture" at a time, and it only - works with 8-bit locales. Unicode matching is already enabled by default - in Python 3 for Unicode (str) patterns, and it is able to handle different - locales/languages. + dependent on the current locale. + This flag can be used only with bytes patterns. + Corresponds to the inline flag ``(?L)``. + .. warning:: + + This flag is discouraged; consider Unicode matching instead. + The locale mechanism is very unreliable + as it only handles one "culture" at a time + and only works with 8-bit locales. + Unicode matching is enabled by default for Unicode (str) patterns + and it is able to handle different locales and languages. + .. versionchanged:: 3.6 - :const:`re.LOCALE` can be used only with bytes patterns and is - not compatible with :const:`re.ASCII`. + :py:const:`~re.LOCALE` can be used only with bytes patterns + and is not compatible with :py:const:`~re.ASCII`. .. versionchanged:: 3.7 - Compiled regular expression objects with the :const:`re.LOCALE` flag no - longer depend on the locale at compile time. Only the locale at - matching time affects the result of matching. + Compiled regular expression objects with the :py:const:`~re.LOCALE` flag + no longer depend on the locale at compile time. 
+ Only the locale at matching time affects the result of matching. .. data:: M @@ -759,6 +809,7 @@ Flags end of each line (immediately preceding each newline). By default, ``'^'`` matches only at the beginning of the string, and ``'$'`` only at the end of the string and immediately before the newline (if any) at the end of the string. + Corresponds to the inline flag ``(?m)``. .. data:: NOFLAG @@ -778,19 +829,19 @@ Flags Make the ``'.'`` special character match any character at all, including a newline; without this flag, ``'.'`` will match anything *except* a newline. + Corresponds to the inline flag ``(?s)``. .. data:: U UNICODE - In Python 2, this flag made :ref:`special sequences <re-special-sequences>` - include Unicode characters in matches. Since Python 3, Unicode characters - are matched by default. - - See :const:`A` for restricting matching on ASCII characters instead. + In Python 3, Unicode characters are matched by default + for ``str`` patterns. + This flag is therefore redundant with **no effect** + and is only kept for backward compatibility. - This flag is only kept for backward compatibility. + See :py:const:`~re.ASCII` to restrict matching to ASCII characters instead. .. data:: X VERBOSE @@ -914,6 +965,8 @@ Functions Empty matches for the pattern split the string only when not adjacent to a previous empty match. + .. code:: pycon + >>> re.split(r'\b', 'Words, words, words.') ['', 'Words', ', ', 'words', ', ', 'words', '.'] >>> re.split(r'\W*', '...words...') @@ -1237,7 +1290,7 @@ Regular Expression Objects The regex matching flags. This is a combination of the flags given to :func:`.compile`, any ``(?...)`` inline flags in the pattern, and implicit - flags such as :data:`UNICODE` if the pattern is a Unicode string. + flags such as :py:const:`~re.UNICODE` if the pattern is a Unicode string. .. attribute:: Pattern.groups |
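As a rough illustration of the behaviour the revised wording describes (this is not part of the patch), the sketch below compares str and bytes patterns with and without the ASCII flag. The sample strings and variable names are invented for the example; only the ``re`` functions, flags, and inline-flag syntax are taken from the documentation text in the diff.

```python
import re

# U+0663 (ARABIC-INDIC DIGIT THREE) belongs to Unicode category Nd.
digits_text = "\u0663 apples and 3 pears"

# str pattern, default mode: \d matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d", digits_text)

# str pattern with re.ASCII, or the inline flag (?a): \d is limited to [0-9].
ascii_digits = re.findall(r"\d", digits_text, flags=re.ASCII)
inline_ascii_digits = re.findall(r"(?a)\d", digits_text)

# bytes pattern: \d is always equivalent to [0-9].
bytes_digits = re.findall(rb"\d", digits_text.encode("utf-8"))

# \w follows the same split: Unicode alphanumerics plus "_" by default,
# [a-zA-Z0-9_] once re.ASCII is in force.
words_text = "na\u00efve caf\u00e9"
unicode_words = re.findall(r"\w+", words_text)
ascii_words = re.findall(r"\w+", words_text, flags=re.ASCII)

# \b sits between a \w and a \W character (or the string edge), so
# r"\bat\b" finds the standalone "at" but not the prefix of "attempt".
boundary_hits = re.findall(r"\bat\b", "at. (at) attempt atlas")

# Pattern.flags reports the implicit UNICODE flag for str patterns only.
str_is_unicode = bool(re.compile(r"\d").flags & re.UNICODE)
bytes_is_unicode = bool(re.compile(rb"\d").flags & re.UNICODE)

print(unicode_digits, ascii_digits, inline_ascii_digits, bytes_digits)
print(unicode_words, ascii_words)
print(boundary_hits, str_is_unicode, bytes_is_unicode)
```

Passing ``re.ASCII`` and writing ``(?a)`` at the start of the pattern are interchangeable here, which matches the "Corresponds to the inline flag" notes added by the patch.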