Update to properly explain that the default Unicode encoding is ASCII, &c.

author: Ka-Ping Yee <ping@zesty.ca> 2001-02-13 22:20:22 (GMT)
committer: Ka-Ping Yee <ping@zesty.ca> 2001-02-13 22:20:22 (GMT)
commit: 5401996638445a9450b4ef1391eb6102d399cee7 (patch)
tree: edcb211f7d5e8df0c9fa447f7eff0f92bdb74dbb /Doc/tut
parent: ea4f931cb9c8c821c4f99011461f300caeffaad0 (diff)
1 files changed, 40 insertions, 22 deletions
diff --git a/Doc/tut/tut.tex b/Doc/tut/tut.tex
index 8a47b22..4720fed 100644
--- a/Doc/tut/tut.tex
+++ b/Doc/tut/tut.tex
@@ -772,17 +772,17 @@ u'Hello World !'
 \end{verbatim}
 
 The escape sequence \code{\e u0020} indicates to insert the Unicode
-character with the HEX ordinal 0x0020 (the space character) at the
+character with the ordinal value 0x0020 (the space character) at the
 given position.
 
 Other characters are interpreted by using their respective ordinal
-value directly as Unicode ordinal. Due to the fact that the lower 256
-Unicode are the same as the standard Latin-1 encoding used in many
-western countries, the process of entering Unicode is greatly
-simplified.
+values directly as Unicode ordinals.  If you have literal strings
+in the standard Latin-1 encoding that is used in many Western countries,
+you will find it convenient that the lower 256 characters
+of Unicode are the same as the 256 characters of Latin-1.
 
-For experts, there is also a raw mode just like for normal
-strings. You have to prepend the string with a small 'r' to have
+For experts, there is also a raw mode just like the one for normal
+strings. You have to prefix the opening quote with 'ur' to have
 Python use the \emph{Raw-Unicode-Escape} encoding. It will only apply
 the above \code{\e uXXXX} conversion if there is an uneven number of
 backslashes in front of the small 'u'.
@@ -801,32 +801,50 @@ Apart from these standard encodings, Python provides a whole set of
 other ways of creating Unicode strings on the basis of a known
 encoding. 
 
-The built-in function \function{unicode()}\bifuncindex{unicode} provides access
-to all registered Unicode codecs (COders and DECoders). Some of the
-more well known encodings which these codecs can convert are
-\emph{Latin-1}, \emph{ASCII}, \emph{UTF-8} and \emph{UTF-16}. The latter two
-are variable-length encodings which store Unicode characters
-in blocks of 8 or 16 bits. To print a Unicode string or write it to a file,
-you must convert it to a string with the \method{encode()} method.
+The built-in function \function{unicode()}\bifuncindex{unicode} provides
+access to all registered Unicode codecs (COders and DECoders). Some of
+the more well known encodings which these codecs can convert are
+\emph{Latin-1}, \emph{ASCII}, \emph{UTF-8}, and \emph{UTF-16}.
+The latter two are variable-length encodings that store each Unicode
+character in one or more bytes. The default encoding is
+normally set to ASCII, which passes through characters in the range
+0 to 127 and rejects any other characters with an error.
+When a Unicode string is printed, written to a file, or converted
+with \function{str()}, conversion takes place using this default encoding.
+
+\begin{verbatim}
+>>> u"abc"
+u'abc'
+>>> str(u"abc")
+'abc'
+>>> u"äöü"
+u'\xe4\xf6\xfc'
+>>> str(u"äöü")
+Traceback (most recent call last):
+  File "<stdin>", line 1, in ?
+UnicodeError: ASCII encoding error: ordinal not in range(128)
+\end{verbatim}
+
+To convert a Unicode string into an 8-bit string using a specific
+encoding, Unicode objects provide an \function{encode()} method
+that takes one argument, the name of the encoding.  Lowercase names
+for encodings are preferred.
 
 \begin{verbatim}
->>> u"äöü"
-u'\344\366\374'
->>> u"äöü".encode('UTF-8')
-'\303\244\303\266\303\274'
+>>> u"äöü".encode('utf-8')
+'\xc3\xa4\xc3\xb6\xc3\xbc'
 \end{verbatim}
 
 If you have data in a specific encoding and want to produce a
 corresponding Unicode string from it, you can use the
-\function{unicode()} function with the encoding name as second
+\function{unicode()} function with the encoding name as the second
 argument.
 
 \begin{verbatim}
->>> unicode('\303\244\303\266\303\274','UTF-8')
-u'\344\366\374'
+>>> unicode('\xc3\xa4\xc3\xb6\xc3\xbc', 'utf-8')
+u'\xe4\xf6\xfc'
 \end{verbatim}
 
-
 \subsection{Lists \label{lists}}
 
 Python knows a number of \emph{compound} data types, used to group
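
The behaviour described in the patched text can be sketched in a present-day interpreter. This is an illustration only, assuming Python 3, where every str is already Unicode and the u'' and ur'' prefixes discussed in the diff are not needed; it is not the Python 2 session shown in the tutorial. The variable names (text, ascii_ok, utf8_bytes, and so on) are hypothetical and chosen for this example.

```python
# Assumption: Python 3. The commit documents Python 2, where unicode() and
# str() played the roles that str.encode() and bytes.decode() play here.

text = "\u00e4\u00f6\u00fc"  # the three characters used in the tutorial examples

# ASCII passes through code points 0..127 and rejects anything else.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# UTF-8 is variable-length: each of these characters needs two bytes.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")

# The lower 256 Unicode code points coincide with the 256 Latin-1 characters.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches = latin1_text == "".join(chr(i) for i in range(256))

# Raw-Unicode-Escape converts \uXXXX only after an odd number of backslashes.
odd = b"\\u0020".decode("raw_unicode_escape")     # one backslash before 'u'
even = b"\\\\u0020".decode("raw_unicode_escape")  # two backslashes before 'u'

print(ascii_ok, utf8_bytes, round_trip == text, latin1_matches)
print(repr(odd), repr(even))
```

With one backslash the escape is converted to a space; with two, the bytes are kept as literal characters, which matches the rule about an uneven number of backslashes in the tutorial text.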