From 3be472b5f777fe5ebc0c1f4b6c0d96c73352db9c Mon Sep 17 00:00:00 2001
From: Georg Brandl
Date: Wed, 14 Jan 2015 08:26:30 +0100
Subject: Closes #23181: codepoint -> code point

---
 Doc/c-api/unicode.rst           |  2 +-
 Doc/library/codecs.rst          | 12 ++++++------
 Doc/library/email.mime.rst      |  2 +-
 Doc/library/functions.rst       |  2 +-
 Doc/library/html.entities.rst   |  4 ++--
 Doc/tutorial/datastructures.rst |  2 +-
 Doc/whatsnew/3.3.rst            | 12 ++++++------
 7 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/Doc/c-api/unicode.rst b/Doc/c-api/unicode.rst
index ed74f45..00063d0 100644
--- a/Doc/c-api/unicode.rst
+++ b/Doc/c-api/unicode.rst
@@ -1141,7 +1141,7 @@ These are the UTF-32 codec APIs:
    mark (U+FEFF). In the other two modes, no BOM mark is prepended.
 
    If *Py_UNICODE_WIDE* is not defined, surrogate pairs will be output
-   as a single codepoint.
+   as a single code point.
 
    Return *NULL* if an exception was raised by the codec.
 
diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index b67e653..3510f69 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -841,7 +841,7 @@ methods and attributes from the underlying stream.
 Encodings and Unicode
 ---------------------
 
-Strings are stored internally as sequences of codepoints in
+Strings are stored internally as sequences of code points in
 range ``0x0``-``0x10FFFF``. (See :pep:`393` for
 more details about the implementation.)
 Once a string object is used outside of CPU and memory, endianness
@@ -852,23 +852,23 @@ There are a variety of different text serialisation codecs, which are
 collectivity referred to as :term:`text encodings <text encoding>`.
 
 The simplest text encoding (called ``'latin-1'`` or ``'iso-8859-1'``) maps
-the codepoints 0-255 to the bytes ``0x0``-``0xff``, which means that a string
-object that contains codepoints above ``U+00FF`` can't be encoded with this
+the code points 0-255 to the bytes ``0x0``-``0xff``, which means that a string
+object that contains code points above ``U+00FF`` can't be encoded with this
 codec. Doing so will raise a :exc:`UnicodeEncodeError` that looks
 like the following (although the details of the error message may differ):
 ``UnicodeEncodeError: 'latin-1' codec can't encode character '\u1234' in
 position 3: ordinal not in range(256)``.
 
 There's another group of encodings (the so called charmap encodings) that choose
-a different subset of all Unicode code points and how these codepoints are
+a different subset of all Unicode code points and how these code points are
 mapped to the bytes ``0x0``-``0xff``. To see how this is done simply open
 e.g. :file:`encodings/cp1252.py` (which is an encoding that is used primarily on
 Windows). There's a string constant with 256 characters that shows you which
 character is mapped to which byte value.
 
-All of these encodings can only encode 256 of the 1114112 codepoints
+All of these encodings can only encode 256 of the 1114112 code points
 defined in Unicode. A simple and straightforward way that can store each Unicode
-code point, is to store each codepoint as four consecutive bytes. There are two
+code point, is to store each code point as four consecutive bytes. There are two
 possibilities: store the bytes in big endian or in little endian order. These two
 encodings are called ``UTF-32-BE`` and ``UTF-32-LE`` respectively. Their
 disadvantage is that if e.g. you use ``UTF-32-BE`` on a little endian machine you
diff --git a/Doc/library/email.mime.rst b/Doc/library/email.mime.rst
index 950b1c6..67d0a67 100644
--- a/Doc/library/email.mime.rst
+++ b/Doc/library/email.mime.rst
@@ -194,7 +194,7 @@ Here are the classes:
    minor type and defaults to :mimetype:`plain`. *_charset* is the character
    set of the text and is passed as an argument to the
    :class:`~email.mime.nonmultipart.MIMENonMultipart` constructor; it defaults
-   to ``us-ascii`` if the string contains only ``ascii`` codepoints, and
+   to ``us-ascii`` if the string contains only ``ascii`` code points, and
    ``utf-8`` otherwise. The *_charset* parameter accepts either a string or a
    :class:`~email.charset.Charset` instance.
 
diff --git a/Doc/library/functions.rst b/Doc/library/functions.rst
index 8a0c336..c6b66b5 100644
--- a/Doc/library/functions.rst
+++ b/Doc/library/functions.rst
@@ -156,7 +156,7 @@ are always available. They are listed here in alphabetical order.
 
 .. function:: chr(i)
 
-   Return the string representing a character whose Unicode codepoint is the
+   Return the string representing a character whose Unicode code point is the
    integer *i*. For example, ``chr(97)`` returns the string ``'a'``, while
    ``chr(931)`` returns the string ``'Σ'``. This is the inverse of :func:`ord`.
 
diff --git a/Doc/library/html.entities.rst b/Doc/library/html.entities.rst
index 09b0abc..e10e46e 100644
--- a/Doc/library/html.entities.rst
+++ b/Doc/library/html.entities.rst
@@ -33,12 +33,12 @@ This module defines four dictionaries, :data:`html5`,
 
 .. data:: name2codepoint
 
-   A dictionary that maps HTML entity names to the Unicode codepoints.
+   A dictionary that maps HTML entity names to the Unicode code points.
 
 
 .. data:: codepoint2name
 
-   A dictionary that maps Unicode codepoints to HTML entity names.
+   A dictionary that maps Unicode code points to HTML entity names.
 
 
 .. rubric:: Footnotes
diff --git a/Doc/tutorial/datastructures.rst b/Doc/tutorial/datastructures.rst
index 6dc17aa..a2031ed 100644
--- a/Doc/tutorial/datastructures.rst
+++ b/Doc/tutorial/datastructures.rst
@@ -685,7 +685,7 @@ the same type, the lexicographical comparison is carried out recursively. If
 all items of two sequences compare equal, the sequences are considered equal.
 If one sequence is an initial sub-sequence of the other, the shorter sequence is
 the smaller (lesser) one. Lexicographical ordering for strings uses the Unicode
-codepoint number to order individual characters. Some examples of comparisons
+code point number to order individual characters. Some examples of comparisons
 between sequences of the same type::
 
    (1, 2, 3) < (1, 2, 4)
diff --git a/Doc/whatsnew/3.3.rst b/Doc/whatsnew/3.3.rst
index f8c3ca5..1d4ce72 100644
--- a/Doc/whatsnew/3.3.rst
+++ b/Doc/whatsnew/3.3.rst
@@ -228,7 +228,7 @@ Functionality
 
 Changes introduced by :pep:`393` are the following:
 
-* Python now always supports the full range of Unicode codepoints, including
+* Python now always supports the full range of Unicode code points, including
   non-BMP ones (i.e. from ``U+0000`` to ``U+10FFFF``). The distinction between
   narrow and wide builds no longer exists and Python now behaves like a wide
   build, even under Windows.
@@ -246,7 +246,7 @@ Changes introduced by :pep:`393` are the following:
     so ``'\U0010FFFF'[0]`` now returns ``'\U0010FFFF'`` and not ``'\uDBFF'``;
 
   * all other functions in the standard library now correctly handle
-    non-BMP codepoints.
+    non-BMP code points.
 
 * The value of :data:`sys.maxunicode` is now always ``1114111`` (``0x10FFFF``
   in hexadecimal). The :c:func:`PyUnicode_GetMax` function still returns
@@ -258,13 +258,13 @@ Changes introduced by :pep:`393` are the following:
 Performance and resource usage
 ------------------------------
 
-The storage of Unicode strings now depends on the highest codepoint in the string:
+The storage of Unicode strings now depends on the highest code point in the string:
 
-* pure ASCII and Latin1 strings (``U+0000-U+00FF``) use 1 byte per codepoint;
+* pure ASCII and Latin1 strings (``U+0000-U+00FF``) use 1 byte per code point;
 
-* BMP strings (``U+0000-U+FFFF``) use 2 bytes per codepoint;
+* BMP strings (``U+0000-U+FFFF``) use 2 bytes per code point;
 
-* non-BMP strings (``U+10000-U+10FFFF``) use 4 bytes per codepoint.
+* non-BMP strings (``U+10000-U+10FFFF``) use 4 bytes per code point.
 
 The net effect is that for most applications, memory usage of string
 storage should decrease significantly - especially compared to former
-- 
cgit v0.12
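
The documentation passages touched by this patch describe behaviour that can be checked from an interpreter. The following is a minimal sketch, not part of the patch, assuming CPython 3.3 or later; the helper name ``try_latin1`` and the variable names are hypothetical and chosen only for illustration.

```python
import sys


def try_latin1(text):
    """Hypothetical helper: return the Latin-1 bytes, or the error text on failure."""
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError as exc:
        return str(exc)


# Latin-1 maps code points 0-255 directly onto the bytes 0x00-0xff.
latin1_ok = try_latin1("\u00ff")

# A code point above U+00FF cannot be encoded; the failure is reported
# at index 3, where U+1234 sits in this string.
latin1_error = try_latin1("abc\u1234")

# chr() and ord() convert between an integer code point and a
# one-character string.
sigma = chr(931)  # U+03A3, Greek capital letter sigma

# UTF-32 stores each code point as four bytes; the two variants differ
# only in byte order.
utf32_be = sigma.encode("utf-32-be")
utf32_le = sigma.encode("utf-32-le")

# Since PEP 393, a non-BMP character is a single code point, so indexing
# returns the whole character rather than a surrogate half.
non_bmp = "\U0010FFFF"
first_item = non_bmp[0]
```

The exact wording of the ``UnicodeEncodeError`` message may differ between versions, as the codecs documentation itself notes, so the sketch only relies on the reported position.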