diff options
author | Miss Islington (bot) <31488909+miss-islington@users.noreply.github.com> | 2022-05-09 03:20:07 (GMT) |
---|---|---|
committer | GitHub <noreply@github.com> | 2022-05-09 03:20:07 (GMT) |
commit | 03aa75221bf0bda469343de5af26766f09e904fe (patch) | |
tree | 876d3ab2acb8659dca22a7d3f1f580109e0afe61 /Doc | |
parent | bf5fc2adb7c6896cd6da59ce823098e611b16f90 (diff) | |
download | cpython-03aa75221bf0bda469343de5af26766f09e904fe.zip cpython-03aa75221bf0bda469343de5af26766f09e904fe.tar.gz cpython-03aa75221bf0bda469343de5af26766f09e904fe.tar.bz2 |
bpo-38056: overhaul Error Handlers section in codecs documentation (GH-15732)
* Some handlers were wrongly described as text-encoding only, but actually they can also be used in text-decoding.
* Add more description to each handler.
* Add two REPL examples.
* Add indexes for Error Handler's name.
Co-authored-by: Kyle Stanley <aeros167@gmail.com>
Co-authored-by: Victor Stinner <vstinner@python.org>
Co-authored-by: Jelle Zijlstra <jelle.zijlstra@gmail.com>
(cherry picked from commit 5bc2390229bbcb4f13359e867fd8a140a1d5496b)
Co-authored-by: Ma Lin <animalize@users.noreply.github.com>
Diffstat (limited to 'Doc')
-rw-r--r-- | Doc/glossary.rst | 11 | ||||
-rw-r--r-- | Doc/library/codecs.rst | 189 |
2 files changed, 126 insertions, 74 deletions
diff --git a/Doc/glossary.rst b/Doc/glossary.rst index f759c3f..b08e395 100644 --- a/Doc/glossary.rst +++ b/Doc/glossary.rst @@ -1072,7 +1072,16 @@ Glossary as :keyword:`if`, :keyword:`while` or :keyword:`for`. text encoding - A codec which encodes Unicode strings to bytes. + A string in Python is a sequence of Unicode code points (in range + ``U+0000``--``U+10FFFF``). To store or transfer a string, it needs to be + serialized as a sequence of bytes. + + Serializing a string into a sequence of bytes is known as "encoding", and + recreating the string from the sequence of bytes is known as "decoding". + + There are a variety of different text serialization + :ref:`codecs <standard-encodings>`, which are collectively referred to as + "text encodings". text file A :term:`file object` able to read and write :class:`str` objects. diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst index e8a0a79..d4ea2b3 100644 --- a/Doc/library/codecs.rst +++ b/Doc/library/codecs.rst @@ -23,11 +23,11 @@ This module defines base classes for standard Python codecs (encoders and decoders) and provides access to the internal Python codec registry, which manages the codec and error handling lookup process. Most standard codecs -are :term:`text encodings <text encoding>`, which encode text to bytes, -but there are also codecs provided that encode text to text, and bytes to -bytes. Custom codecs may encode and decode between arbitrary types, but some -module features are restricted to use specifically with -:term:`text encodings <text encoding>`, or with codecs that encode to +are :term:`text encodings <text encoding>`, which encode text to bytes (and +decode bytes to text), but there are also codecs provided that encode text to +text, and bytes to bytes. Custom codecs may encode and decode between arbitrary +types, but some module features are restricted to be used specifically with +:term:`text encodings <text encoding>` or with codecs that encode to :class:`bytes`. The module defines the following functions for encoding and decoding with @@ -294,58 +294,56 @@ codec will handle encoding and decoding errors. Error Handlers ^^^^^^^^^^^^^^ -To simplify and standardize error handling, -codecs may implement different error handling schemes by -accepting the *errors* string argument. The following string values are -defined and implemented by all standard Python codecs: +To simplify and standardize error handling, codecs may implement different +error handling schemes by accepting the *errors* string argument: -.. tabularcolumns:: |l|L| - -+-------------------------+-----------------------------------------------+ -| Value | Meaning | -+=========================+===============================================+ -| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass); | -| | this is the default. Implemented in | -| | :func:`strict_errors`. | -+-------------------------+-----------------------------------------------+ -| ``'ignore'`` | Ignore the malformed data and continue | -| | without further notice. Implemented in | -| | :func:`ignore_errors`. | -+-------------------------+-----------------------------------------------+ - -The following error handlers are only applicable to -:term:`text encodings <text encoding>`: + >>> 'German ß, ♬'.encode(encoding='ascii', errors='backslashreplace') + b'German \\xdf, \\u266c' + >>> 'German ß, ♬'.encode(encoding='ascii', errors='xmlcharrefreplace') + b'German ß, ♬' .. index:: + pair: strict; error handler's name + pair: ignore; error handler's name + pair: replace; error handler's name + pair: backslashreplace; error handler's name + pair: surrogateescape; error handler's name single: ? (question mark); replacement character single: \ (backslash); escape sequence single: \x; escape sequence single: \u; escape sequence single: \U; escape sequence - single: \N; escape sequence + +The following error handlers can be used with all Python +:ref:`standard-encodings` codecs: + +.. tabularcolumns:: |l|L| +-------------------------+-----------------------------------------------+ | Value | Meaning | +=========================+===============================================+ -| ``'replace'`` | Replace with a suitable replacement | -| | marker; Python will use the official | -| | ``U+FFFD`` REPLACEMENT CHARACTER for the | -| | built-in codecs on decoding, and '?' on | -| | encoding. Implemented in | -| | :func:`replace_errors`. | +| ``'strict'`` | Raise :exc:`UnicodeError` (or a subclass), | +| | this is the default. Implemented in | +| | :func:`strict_errors`. | +-------------------------+-----------------------------------------------+ -| ``'xmlcharrefreplace'`` | Replace with the appropriate XML character | -| | reference (only for encoding). Implemented | -| | in :func:`xmlcharrefreplace_errors`. | +| ``'ignore'`` | Ignore the malformed data and continue without| +| | further notice. Implemented in | +| | :func:`ignore_errors`. | ++-------------------------+-----------------------------------------------+ +| ``'replace'`` | Replace with a replacement marker. On | +| | encoding, use ``?`` (ASCII character). On | +| | decoding, use ``�`` (U+FFFD, the official | +| | REPLACEMENT CHARACTER). Implemented in | +| | :func:`replace_errors`. | +-------------------------+-----------------------------------------------+ | ``'backslashreplace'`` | Replace with backslashed escape sequences. | +| | On encoding, use hexadecimal form of Unicode | +| | code point with formats ``\xhh`` ``\uxxxx`` | +| | ``\Uxxxxxxxx``. On decoding, use hexadecimal | +| | form of byte value with format ``\xhh``. | | | Implemented in | | | :func:`backslashreplace_errors`. | +-------------------------+-----------------------------------------------+ -| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences | -| | (only for encoding). Implemented in | -| | :func:`namereplace_errors`. | -+-------------------------+-----------------------------------------------+ | ``'surrogateescape'`` | On decoding, replace byte with individual | | | surrogate code ranging from ``U+DC80`` to | | | ``U+DCFF``. This code will then be turned | @@ -355,27 +353,55 @@ The following error handlers are only applicable to | | more.) | +-------------------------+-----------------------------------------------+ +.. index:: + pair: xmlcharrefreplace; error handler's name + pair: namereplace; error handler's name + single: \N; escape sequence + +The following error handlers are only applicable to encoding (within +:term:`text encodings <text encoding>`): + ++-------------------------+-----------------------------------------------+ +| Value | Meaning | ++=========================+===============================================+ +| ``'xmlcharrefreplace'`` | Replace with XML/HTML numeric character | +| | reference, which is a decimal form of Unicode | +| | code point with format ``&#num;`` Implemented | +| | in :func:`xmlcharrefreplace_errors`. | ++-------------------------+-----------------------------------------------+ +| ``'namereplace'`` | Replace with ``\N{...}`` escape sequences, | +| | what appears in the braces is the Name | +| | property from Unicode Character Database. | +| | Implemented in :func:`namereplace_errors`. | ++-------------------------+-----------------------------------------------+ + +.. index:: + pair: surrogatepass; error handler's name + In addition, the following error handler is specific to the given codecs: +-------------------+------------------------+-------------------------------------------+ | Value | Codecs | Meaning | +===================+========================+===========================================+ -|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding of surrogate | -| | utf-16-be, utf-16-le, | codes. These codecs normally treat the | -| | utf-32-be, utf-32-le | presence of surrogates as an error. | +|``'surrogatepass'``| utf-8, utf-16, utf-32, | Allow encoding and decoding surrogate code| +| | utf-16-be, utf-16-le, | point (``U+D800`` - ``U+DFFF``) as normal | +| | utf-32-be, utf-32-le | code point. Otherwise these codecs treat | +| | | the presence of surrogate code point in | +| | | :class:`str` as an error. | +-------------------+------------------------+-------------------------------------------+ .. versionadded:: 3.1 The ``'surrogateescape'`` and ``'surrogatepass'`` error handlers. .. versionchanged:: 3.4 - The ``'surrogatepass'`` error handlers now works with utf-16\* and utf-32\* codecs. + The ``'surrogatepass'`` error handler now works with utf-16\* and utf-32\* + codecs. .. versionadded:: 3.5 The ``'namereplace'`` error handler. .. versionchanged:: 3.5 - The ``'backslashreplace'`` error handlers now works with decoding and + The ``'backslashreplace'`` error handler now works with decoding and translating. The set of allowed values can be extended by registering a new named error @@ -418,42 +444,59 @@ functions: .. function:: strict_errors(exception) - Implements the ``'strict'`` error handling: each encoding or - decoding error raises a :exc:`UnicodeError`. + Implements the ``'strict'`` error handling. + Each encoding or decoding error raises a :exc:`UnicodeError`. -.. function:: replace_errors(exception) - Implements the ``'replace'`` error handling (for :term:`text encodings - <text encoding>` only): substitutes ``'?'`` for encoding errors - (to be encoded by the codec), and ``'\ufffd'`` (the Unicode replacement - character) for decoding errors. +.. function:: ignore_errors(exception) + Implements the ``'ignore'`` error handling. -.. function:: ignore_errors(exception) + Malformed data is ignored; encoding or decoding is continued without + further notice. - Implements the ``'ignore'`` error handling: malformed data is ignored and - encoding or decoding is continued without further notice. +.. function:: replace_errors(exception) -.. function:: xmlcharrefreplace_errors(exception) + Implements the ``'replace'`` error handling. - Implements the ``'xmlcharrefreplace'`` error handling (for encoding with - :term:`text encodings <text encoding>` only): the - unencodable character is replaced by an appropriate XML character reference. + Substitutes ``?`` (ASCII character) for encoding errors or ``�`` (U+FFFD, + the official REPLACEMENT CHARACTER) for decoding errors. .. function:: backslashreplace_errors(exception) - Implements the ``'backslashreplace'`` error handling (for - :term:`text encodings <text encoding>` only): malformed data is - replaced by a backslashed escape sequence. + Implements the ``'backslashreplace'`` error handling. + + Malformed data is replaced by a backslashed escape sequence. + On encoding, use the hexadecimal form of Unicode code point with formats + ``\xhh`` ``\uxxxx`` ``\Uxxxxxxxx``. On decoding, use the hexadecimal form of + byte value with format ``\xhh``. + + .. versionchanged:: 3.5 + Works with decoding and translating. + + +.. function:: xmlcharrefreplace_errors(exception) + + Implements the ``'xmlcharrefreplace'`` error handling (for encoding within + :term:`text encoding` only). + + The unencodable character is replaced by an appropriate XML/HTML numeric + character reference, which is a decimal form of Unicode code point with + format ``&#num;`` . + .. function:: namereplace_errors(exception) - Implements the ``'namereplace'`` error handling (for encoding with - :term:`text encodings <text encoding>` only): the - unencodable character is replaced by a ``\N{...}`` escape sequence. + Implements the ``'namereplace'`` error handling (for encoding within + :term:`text encoding` only). + + The unencodable character is replaced by a ``\N{...}`` escape sequence. The + set of characters that appear in the braces is the Name property from + Unicode Character Database. For example, the German lowercase letter ``'ß'`` + will be converted to byte sequence ``\N{LATIN SMALL LETTER SHARP S}`` . .. versionadded:: 3.5 @@ -467,7 +510,7 @@ The base :class:`Codec` class defines these methods which also define the function interfaces of the stateless encoder and decoder: -.. method:: Codec.encode(input[, errors]) +.. method:: Codec.encode(input, errors='strict') Encodes the object *input* and returns a tuple (output object, length consumed). For instance, :term:`text encoding` converts @@ -485,7 +528,7 @@ function interfaces of the stateless encoder and decoder: of the output object type in this situation. -.. method:: Codec.decode(input[, errors]) +.. method:: Codec.decode(input, errors='strict') Decodes the object *input* and returns a tuple (output object, length consumed). For instance, for a :term:`text encoding`, decoding converts @@ -552,7 +595,7 @@ define in order to be compatible with the Python codec registry. object. - .. method:: encode(object[, final]) + .. method:: encode(object, final=False) Encodes *object* (taking the current state of the encoder into account) and returns the resulting encoded object. If this is the last call to @@ -609,7 +652,7 @@ define in order to be compatible with the Python codec registry. object. - .. method:: decode(object[, final]) + .. method:: decode(object, final=False) Decodes *object* (taking the current state of the decoder into account) and returns the resulting decoded object. If this is the last call to @@ -743,7 +786,7 @@ compatible with the Python codec registry. :func:`register_error`. - .. method:: read([size[, chars, [firstline]]]) + .. method:: read(size=-1, chars=-1, firstline=False) Decodes data from the stream and returns the resulting object. @@ -769,7 +812,7 @@ compatible with the Python codec registry. available on the stream, these should be read too. - .. method:: readline([size[, keepends]]) + .. method:: readline(size=None, keepends=True) Read one line from the input stream and return the decoded data. @@ -780,7 +823,7 @@ compatible with the Python codec registry. returned. - .. method:: readlines([sizehint[, keepends]]) + .. method:: readlines(sizehint=None, keepends=True) Read all lines available on the input stream and return them as a list of lines. @@ -871,7 +914,7 @@ Encodings and Unicode --------------------- Strings are stored internally as sequences of code points in -range ``0x0``--``0x10FFFF``. (See :pep:`393` for +range ``U+0000``--``U+10FFFF``. (See :pep:`393` for more details about the implementation.) Once a string object is used outside of CPU and memory, endianness and how these arrays are stored as bytes become an issue. As with other @@ -952,7 +995,7 @@ encoding was used for encoding a string. Each charmap encoding can decode any random byte sequence. However that's not possible with UTF-8, as UTF-8 byte sequences have a structure that doesn't allow arbitrary byte sequences. To increase the reliability with which a UTF-8 encoding can be -detected, Microsoft invented a variant of UTF-8 (that Python 2.5 calls +detected, Microsoft invented a variant of UTF-8 (that Python calls ``"utf-8-sig"``) for its Notepad program: Before any of the Unicode characters is written to the file, a UTF-8 encoded BOM (which looks like this as a byte sequence: ``0xef``, ``0xbb``, ``0xbf``) is written. As it's rather improbable |