author | Serhiy Storchaka <storchaka@gmail.com> | 2015-01-18 09:28:37 (GMT) |
---|---|---|
committer | Serhiy Storchaka <storchaka@gmail.com> | 2015-01-18 09:28:37 (GMT) |
commit | d3faf43f9ba7da0ae504c9186b10d0fa3a8eb300 (patch) | |
tree | 1e6ffa87d3c0087fc5bb3509bc9ef455ce98c7ff /Doc/library/codecs.rst | |
parent | b2653b344e01710a99bc782a0b9e4b4d0f8eed5f (diff) | |
Issue #23181: More "codepoint" -> "code point".
Diffstat (limited to 'Doc/library/codecs.rst')
-rw-r--r-- | Doc/library/codecs.rst | 12 |
1 files changed, 6 insertions, 6 deletions
diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 8be5d18..dd44cb2 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -827,7 +827,7 @@ methods and attributes from the underlying stream.
 Encodings and Unicode
 ---------------------
 
-Strings are stored internally as sequences of codepoints in
+Strings are stored internally as sequences of code points in
 range ``0x0``-``0x10FFFF``. (See :pep:`393` for
 more details about the implementation.)
 Once a string object is used outside of CPU and memory, endianness
@@ -838,23 +838,23 @@ There are a variety of different text serialisation codecs, which are
 collectivity referred to as :term:`text encodings <text encoding>`.
 
 The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
-the codepoints 0-255 to the bytes ``0x0``-``0xff``, which means that a string
-object that contains codepoints above ``U+00FF`` can't be encoded with this
+the code points 0-255 to the bytes ``0x0``-``0xff``, which means that a string
+object that contains code points above ``U+00FF`` can't be encoded with this
 codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
 like the following (although the details of the error message may differ):
 ``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
 position 3: ordinal not in range(256)``.
 
 There's another group of encodings (the so called charmap encodings) that choose
-a different subset of all Unicode code points and how these codepoints are
+a different subset of all Unicode code points and how these code points are
 mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.
 
-All of these encodings can only encode 256 of the 1114112 codepoints
+All of these encodings can only encode 256 of the 1114112 code points
 defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as four consecutive bytes. There are two
+code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These
 two encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
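
The documentation text touched by this diff describes how ``'latin-1'``, the charmap encodings, and ``UTF-32-BE``/``UTF-32-LE`` map code points to bytes. The following is a minimal Python sketch of that behaviour. It is not part of the commit; the sample strings and variable names are chosen here purely for illustration.

```python
# Illustrative sketch only; the sample strings and names are assumptions.

# latin-1 maps code points 0-255 directly to the bytes 0x00-0xff.
latin1_bytes = "caf\u00e9".encode("latin-1")

# A code point above U+00FF cannot be encoded with latin-1.
try:
    "abc\u1234".encode("latin-1")
    latin1_failed = False
except UnicodeEncodeError:
    latin1_failed = True

# A charmap encoding such as cp1252 picks a different subset of code points:
# here the euro sign U+20AC is mapped to the single byte 0x80.
cp1252_euro = "\u20ac".encode("cp1252")

# UTF-32 stores every code point as four bytes, in big or little endian order.
utf32_be = "\u1234".encode("utf-32-be")
utf32_le = "\u1234".encode("utf-32-le")

# The range 0x0-0x10FFFF contains 1114112 code points.
total_code_points = 0x10FFFF + 1

print(latin1_bytes, latin1_failed, cp1252_euro)
print(utf32_be.hex(), utf32_le.hex(), total_code_points)
```

The two UTF-32 results contain the same four bytes in opposite order, which is the byte-order issue the surrounding documentation paragraph goes on to discuss.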