docs 36789: resolve incorrect note regarding UTF-8 (GH-13111)

author: redshiftzero <jen@freedom.press> 2019-05-17 10:44:18 (GMT)
committer: Cheryl Sabella <cheryl.sabella@gmail.com> 2019-05-17 10:44:17 (GMT)
commit: f98c3c59c0930ee41175d8935f72bfeed5fee17a (patch)
tree: 01d820b4d224290c2f3bb28668064c7137e8d703 /Doc/howto/unicode.rst
parent: af8646c8054d0f4180a2013383039b6a472f9698 (diff)
1 files changed, 10 insertions, 5 deletions
diff --git a/Doc/howto/unicode.rst b/Doc/howto/unicode.rst
index 7b2d7b8..24c3235 100644
--- a/Doc/howto/unicode.rst
+++ b/Doc/howto/unicode.rst
@@ -135,17 +135,22 @@ used than UTF-8.)  UTF-8 uses the following rules:
 UTF-8 has several convenient properties:
 
 1. It can handle any Unicode code point.
-2. A Unicode string is turned into a sequence of bytes containing no embedded zero
-   bytes.  This avoids byte-ordering issues, and means UTF-8 strings can be
-   processed by C functions such as ``strcpy()`` and sent through protocols that
-   can't handle zero bytes.
+2. A Unicode string is turned into a sequence of bytes that contains embedded
+   zero bytes only where they represent the null character (U+0000). This means
+   that UTF-8 strings can be processed by C functions such as ``strcpy()`` and sent
+   through protocols that can't handle zero bytes for anything other than
+   end-of-string markers.
 3. A string of ASCII text is also valid UTF-8 text.
 4. UTF-8 is fairly compact; the majority of commonly used characters can be
    represented with one or two bytes.
 5. If bytes are corrupted or lost, it's possible to determine the start of the
    next UTF-8-encoded code point and resynchronize.  It's also unlikely that
    random 8-bit data will look like valid UTF-8.
-
+6. UTF-8 is a byte oriented encoding. The encoding specifies that each
+   character is represented by a specific sequence of one or more bytes. This
+   avoids the byte-ordering issues that can occur with integer and word oriented
+   encodings, like UTF-16 and UTF-32, where the sequence of bytes varies depending
+   on the hardware on which the string was encoded.
 
 
 References
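
The reworded property 2 and the new property 6 in this patch can be checked with a short Python sketch. The sample strings and variable names below are illustrative assumptions and are not part of the patch; the only API used is the built-in ``str.encode``.

```python
# Illustrative sample text (an assumption, not taken from the patch).
text = "caf\u00e9 \u20ac"  # 'café €'

utf8 = text.encode("utf-8")
utf16_le = text.encode("utf-16-le")
utf16_be = text.encode("utf-16-be")

# Property 2: UTF-8 output contains a zero byte only where the input
# contains the null character U+0000.
utf8_has_zero = 0 in utf8                  # False: no U+0000 in the text
nul_utf8 = "a\u0000b".encode("utf-8")      # b'a\x00b': one zero byte, for U+0000
utf16_has_zero = 0 in utf16_le             # True: ASCII letters carry a 0x00 byte

# Property 6: UTF-8 is a sequence of single bytes, so there is only one
# possible byte order. UTF-16 serializes 16-bit units, and the two byte
# orders yield different byte sequences for the same text.
utf16_orders_differ = utf16_le != utf16_be  # True

print(utf8)
print(utf8_has_zero, nul_utf8, utf16_has_zero, utf16_orders_differ)
```

The same comparison can be repeated with ``"utf-32-le"`` and ``"utf-32-be"``, which also differ by byte order, whereas Python offers no byte-order variants of the ``"utf-8"`` codec because none are needed.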