author | Ezio Melotti <ezio.melotti@gmail.com> | 2011-09-01 05:11:28 (GMT) |
---|---|---|
committer | Ezio Melotti <ezio.melotti@gmail.com> | 2011-09-01 05:11:28 (GMT) |
commit | 222b20844f28f37dbe5431eb293ef2b35df71ae7 (patch) | |
tree | 95e015a47eecba6b40ce0dc20c46df28f4ef397b /Doc | |
parent | a9353db2cd0b1c2c08793a18364da6058db50caf (diff) | |
Per RFC 3629, 5- and 6-byte UTF-8 sequences are invalid, so remove them from the doc.
Diffstat (limited to 'Doc')
-rw-r--r-- | Doc/library/codecs.rst | 9 |
1 files changed, 2 insertions, 7 deletions
diff --git a/Doc/library/codecs.rst b/Doc/library/codecs.rst
index 922bcf4..9477133 100644
--- a/Doc/library/codecs.rst
+++ b/Doc/library/codecs.rst
@@ -839,7 +839,7 @@ There's another encoding that is able to encoding the full range of Unicode
 characters: UTF-8. UTF-8 is an 8-bit encoding, which means there are no issues
 with byte order in UTF-8. Each byte in a UTF-8 byte sequence consists of two
 parts: Marker bits (the most significant bits) and payload bits. The marker bits
-are a sequence of zero to six 1 bits followed by a 0 bit. Unicode characters are
+are a sequence of zero to four ``1`` bits followed by a ``0`` bit. Unicode characters are
 encoded like this (with x being payload bits, which when concatenated give the
 Unicode character):
 
@@ -852,12 +852,7 @@ Unicode character):
 +-----------------------------------+----------------------------------------------+
 | ``U-00000800`` ... ``U-0000FFFF`` | 1110xxxx 10xxxxxx 10xxxxxx                   |
 +-----------------------------------+----------------------------------------------+
-| ``U-00010000`` ... ``U-001FFFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
-+-----------------------------------+----------------------------------------------+
-| ``U-00200000`` ... ``U-03FFFFFF`` | 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
-+-----------------------------------+----------------------------------------------+
-| ``U-04000000`` ... ``U-7FFFFFFF`` | 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx |
-|                                   | 10xxxxxx                                     |
+| ``U-00010000`` ... ``U-0010FFFF`` | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx          |
 +-----------------------------------+----------------------------------------------+
 
 The least significant bit of the Unicode character is the rightmost x bit.
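As an illustration of the marker/payload layout that the patched table describes, here is a minimal hand-rolled encoder. It is only a sketch: the name `utf8_encode` is hypothetical and not part of the `codecs` module or of this commit, and the surrogate check follows RFC 3629 rather than anything stated in the diff. It covers only the one- to four-byte forms that remain in the table after this change.

```python
def utf8_encode(cp):
    """Encode a single code point into UTF-8 bytes (illustrative helper only)."""
    if not 0 <= cp <= 0x10FFFF:
        # RFC 3629 caps UTF-8 at U+10FFFF, so 5- and 6-byte forms never occur.
        raise ValueError("code point out of range for UTF-8")
    if 0xD800 <= cp <= 0xDFFF:
        # Assumption taken from RFC 3629: surrogates are not encodable.
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x80:
        # 0xxxxxxx: zero marker 1-bits, 7 payload bits.
        return bytes([cp])
    if cp < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# The result can be compared with the built-in codec for any valid code point.
assert utf8_encode(0x20AC) == "\u20ac".encode("utf-8")
```

The last row of the table tops out at a lead byte of `0xF4` (for `U-0010FFFF`), which is why the marker run is now documented as at most four ``1`` bits.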