Diffstat (limited to 'Doc/howto/unicode.rst')
-rw-r--r--  Doc/howto/unicode.rst  341
1 file changed, 183 insertions(+), 158 deletions(-)
diff --git a/Doc/howto/unicode.rst b/Doc/howto/unicode.rst
index be1fefb..5339bf4 100644
--- a/Doc/howto/unicode.rst
+++ b/Doc/howto/unicode.rst
@@ -6,95 +6,48 @@
:Release: 1.12
-This HOWTO discusses Python support for Unicode, and explains
-various problems that people commonly encounter when trying to work
-with Unicode.
+This HOWTO discusses Python's support for the Unicode specification
+for representing textual data, and explains various problems that
+people commonly encounter when trying to work with Unicode.
+
Introduction to Unicode
=======================
-History of Character Codes
---------------------------
-
-In 1968, the American Standard Code for Information Interchange, better known by
-its acronym ASCII, was standardized. ASCII defined numeric codes for various
-characters, with the numeric values running from 0 to 127. For example, the
-lowercase letter 'a' is assigned 97 as its code value.
-
-ASCII was an American-developed standard, so it only defined unaccented
-characters. There was an 'e', but no 'é' or 'Í'. This meant that languages
-which required accented characters couldn't be faithfully represented in ASCII.
-(Actually the missing accents matter for English, too, which contains words such
-as 'naïve' and 'café', and some publications have house styles which require
-spellings such as 'coöperate'.)
-
-For a while people just wrote programs that didn't display accents.
-In the mid-1980s an Apple II BASIC program written by a French speaker
-might have lines like these:
-
-.. code-block:: basic
-
- PRINT "MISE A JOUR TERMINEE"
- PRINT "PARAMETRES ENREGISTRES"
-
-Those messages should contain accents (terminée, paramètre, enregistrés) and
-they just look wrong to someone who can read French.
-
-In the 1980s, almost all personal computers were 8-bit, meaning that bytes could
-hold values ranging from 0 to 255. ASCII codes only went up to 127, so some
-machines assigned values between 128 and 255 to accented characters. Different
-machines had different codes, however, which led to problems exchanging files.
-Eventually various commonly used sets of values for the 128--255 range emerged.
-Some were true standards, defined by the International Organization for
-Standardization, and some were *de facto* conventions that were invented by one
-company or another and managed to catch on.
-
-255 characters aren't very many. For example, you can't fit both the accented
-characters used in Western Europe and the Cyrillic alphabet used for Russian
-into the 128--255 range because there are more than 128 such characters.
-
-You could write files using different codes (all your Russian files in a coding
-system called KOI8, all your French files in a different coding system called
-Latin1), but what if you wanted to write a French document that quotes some
-Russian text? In the 1980s people began to want to solve this problem, and the
-Unicode standardization effort began.
-
-Unicode started out using 16-bit characters instead of 8-bit characters. 16
-bits means you have 2^16 = 65,536 distinct values available, making it possible
-to represent many different characters from many different alphabets; an initial
-goal was to have Unicode contain the alphabets for every single human language.
-It turns out that even 16 bits isn't enough to meet that goal, and the modern
-Unicode specification uses a wider range of codes, 0 through 1,114,111 (
-``0x10FFFF`` in base 16).
-
-There's a related ISO standard, ISO 10646. Unicode and ISO 10646 were
-originally separate efforts, but the specifications were merged with the 1.1
-revision of Unicode.
-
-(This discussion of Unicode's history is highly simplified. The
-precise historical details aren't necessary for understanding how to
-use Unicode effectively, but if you're curious, consult the Unicode
-consortium site listed in the References or
-the `Wikipedia entry for Unicode <https://en.wikipedia.org/wiki/Unicode#History>`_
-for more information.)
-
-
Definitions
-----------
+Today's programs need to be able to handle a wide variety of
+characters. Applications are often internationalized to display
+messages and output in a variety of user-selectable languages; the
+same program might need to output an error message in English, French,
+Japanese, Hebrew, or Russian. Web content can be written in any of
+these languages and can also include a variety of emoji symbols.
+Python's string type uses the Unicode Standard for representing
+characters, which lets Python programs work with all these different
+possible characters.
+
+Unicode (https://www.unicode.org/) is a specification that aims to
+list every character used by human languages and give each character
+its own unique code. The Unicode specifications are continually
+revised and updated to add new languages and symbols.
+
A **character** is the smallest possible component of a text. 'A', 'B', 'C',
-etc., are all different characters. So are 'È' and 'Í'. Characters are
-abstractions, and vary depending on the language or context you're talking
-about. For example, the symbol for ohms (Ω) is usually drawn much like the
-capital letter omega (Ω) in the Greek alphabet (they may even be the same in
-some fonts), but these are two different characters that have different
-meanings.
-
-The Unicode standard describes how characters are represented by **code
-points**. A code point is an integer value, usually denoted in base 16. In the
-standard, a code point is written using the notation ``U+12CA`` to mean the
-character with value ``0x12ca`` (4,810 decimal). The Unicode standard contains
-a lot of tables listing characters and their corresponding code points:
+etc., are all different characters. So are 'È' and 'Í'. Characters vary
+depending on the language or context you're talking
+about. For example, there's a character for "Roman Numeral One", 'Ⅰ', that's
+separate from the uppercase letter 'I'. They'll usually look the same,
+but these are two different characters that have different meanings.
+
+The Unicode standard describes how characters are represented by
+**code points**. A code point value is an integer in the range 0 to
+0x10FFFF (about 1.1 million values, with some 110 thousand assigned so
+far). In the standard and in this document, a code point is written
+using the notation ``U+265E`` to mean the character with value
+``0x265e`` (9,822 in decimal).
+
+The Unicode standard contains a lot of tables listing characters and
+their corresponding code points:
.. code-block:: none
@@ -103,10 +56,21 @@ a lot of tables listing characters and their corresponding code points:
0063 'c'; LATIN SMALL LETTER C
...
007B '{'; LEFT CURLY BRACKET
+ ...
+ 2167 'Ⅷ'; ROMAN NUMERAL EIGHT
+ 2168 'Ⅸ'; ROMAN NUMERAL NINE
+ ...
+ 265E '♞'; BLACK CHESS KNIGHT
+ 265F '♟'; BLACK CHESS PAWN
+ ...
+ 1F600 '😀'; GRINNING FACE
+ 1F609 '😉'; WINKING FACE
+ ...
Strictly, these definitions imply that it's meaningless to say 'this is
-character ``U+12CA``'. ``U+12CA`` is a code point, which represents some particular
-character; in this case, it represents the character 'ETHIOPIC SYLLABLE WI'. In
+character ``U+265E``'. ``U+265E`` is a code point, which represents some particular
+character; in this case, it represents the character 'BLACK CHESS KNIGHT',
+'♞'. In
informal contexts, this distinction between code points and characters will
sometimes be forgotten.
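+
+For example, Python's built-in :func:`chr` and :func:`ord` functions
+convert between a one-character string and its code point (a short
+illustrative snippet)::
+
+    >>> ord('♞')
+    9822
+    >>> chr(0x265E)
+    '♞'
+    >>> '\N{BLACK CHESS KNIGHT}'
+    '♞'
+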
@@ -121,14 +85,17 @@ toolkit or a terminal's font renderer.
Encodings
---------
-To summarize the previous section: a Unicode string is a sequence of code
-points, which are numbers from 0 through ``0x10FFFF`` (1,114,111 decimal). This
-sequence needs to be represented as a set of bytes (meaning, values
-from 0 through 255) in memory. The rules for translating a Unicode string
-into a sequence of bytes are called an **encoding**.
+To summarize the previous section: a Unicode string is a sequence of
+code points, which are numbers from 0 through ``0x10FFFF`` (1,114,111
+decimal). This sequence of code points needs to be represented in
+memory as a set of **code units**, and **code units** are then mapped
+to 8-bit bytes. The rules for translating a Unicode string into a
+sequence of bytes are called a **character encoding**, or just
+an **encoding**.
-The first encoding you might think of is an array of 32-bit integers. In this
-representation, the string "Python" would look like this:
+The first encoding you might think of is using 32-bit integers as the
+code unit, and then using the CPU's representation of 32-bit integers.
+In this representation, the string "Python" might look like this:
.. code-block:: none
@@ -152,40 +119,14 @@ problems.
3. It's not compatible with existing C functions such as ``strlen()``, so a new
family of wide string functions would need to be used.
-4. Many Internet standards are defined in terms of textual data, and can't
- handle content with embedded zero bytes.
-
-Generally people don't use this encoding, instead choosing other
-encodings that are more efficient and convenient. UTF-8 is probably
-the most commonly supported encoding; it will be discussed below.
-
-Encodings don't have to handle every possible Unicode character, and most
-encodings don't. The rules for converting a Unicode string into the ASCII
-encoding, for example, are simple; for each code point:
-
-1. If the code point is < 128, each byte is the same as the value of the code
- point.
+Therefore this encoding isn't used very much, and people instead choose other
+encodings that are more efficient and convenient, such as UTF-8.
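+
+For example, encoding a short string shows how a variable-width
+encoding such as UTF-8 spends extra bytes only on the non-ASCII
+characters (a quick illustration; the rules that produce these bytes
+are described below)::
+
+    >>> 'Pythön'.encode('utf-8')
+    b'Pyth\xc3\xb6n'
+    >>> len('Pythön'), len('Pythön'.encode('utf-8'))
+    (6, 7)
+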
-2. If the code point is 128 or greater, the Unicode string can't be represented
- in this encoding. (Python raises a :exc:`UnicodeEncodeError` exception in this
- case.)
-
-Latin-1, also known as ISO-8859-1, is a similar encoding. Unicode code points
-0--255 are identical to the Latin-1 values, so converting to this encoding simply
-requires converting code points to byte values; if a code point larger than 255
-is encountered, the string can't be encoded into Latin-1.
-
-Encodings don't have to be simple one-to-one mappings like Latin-1. Consider
-IBM's EBCDIC, which was used on IBM mainframes. Letter values weren't in one
-block: 'a' through 'i' had values from 129 to 137, but 'j' through 'r' were 145
-through 153. If you wanted to use EBCDIC as an encoding, you'd probably use
-some sort of lookup table to perform the conversion, but this is largely an
-internal detail.
-
-UTF-8 is one of the most commonly used encodings. UTF stands for "Unicode
-Transformation Format", and the '8' means that 8-bit numbers are used in the
-encoding. (There are also a UTF-16 and UTF-32 encodings, but they are less
-frequently used than UTF-8.) UTF-8 uses the following rules:
+UTF-8 is one of the most commonly used encodings, and Python often
+defaults to using it. UTF stands for "Unicode Transformation Format",
+and the '8' means that 8-bit values are used in the encoding. (There
+are also UTF-16 and UTF-32 encodings, but they are less frequently
+used than UTF-8.) UTF-8 uses the following rules:
1. If the code point is < 128, it's represented by the corresponding byte value.
2. If the code point is >= 128, it's turned into a sequence of two, three, or
@@ -215,6 +156,10 @@ glossary, and PDF versions of the Unicode specification. Be prepared for some
difficult reading. `A chronology <http://www.unicode.org/history/>`_ of the
origin and development of Unicode is also available on the site.
+On the Computerphile YouTube channel, Tom Scott briefly
+`discusses the history of Unicode and UTF-8 <https://www.youtube.com/watch?v=MijmeoH9LT4>`_
+(9 minutes 36 seconds).
+
To help understand the standard, Jukka Korpela has written `an introductory
guide <http://jkorpela.fi/unicode/guide.html>`_ to reading the
Unicode character tables.
@@ -238,7 +183,7 @@ Unicode features.
The String Type
---------------
-Since Python 3.0, the language features a :class:`str` type that contain Unicode
+Since Python 3.0, the language's :class:`str` type contains Unicode
characters, meaning any string created using ``"unicode rocks!"``, ``'unicode
rocks!'``, or the triple-quoted string syntax is stored as Unicode.
@@ -252,11 +197,6 @@ include a Unicode character in a string literal::
# 'File not found' error message.
print("Fichier non trouvé")
-You can use a different encoding from UTF-8 by putting a specially-formatted
-comment as the first or second line of the source code::
-
- # -*- coding: <encoding name> -*-
-
Side note: Python 3 also supports using Unicode characters in identifiers::
répertoire = "/tmp/records.log"
@@ -299,7 +239,7 @@ The following examples show the differences::
>>> b'\x80abc'.decode("utf-8", "ignore")
'abc'
-Encodings are specified as strings containing the encoding's name. Python 3.2
+Encodings are specified as strings containing the encoding's name. Python
comes with roughly 100 different encodings; see the Python Library Reference at
:ref:`standard-encodings` for a list. Some encodings have multiple names; for
example, ``'latin-1'``, ``'iso_8859_1'`` and ``'8859'`` are all synonyms for
@@ -409,12 +349,13 @@ already mentioned. See also :pep:`263` for more information.
Unicode Properties
------------------
-The Unicode specification includes a database of information about code points.
-For each defined code point, the information includes the character's
-name, its category, the numeric value if applicable (Unicode has characters
-representing the Roman numerals and fractions such as one-third and
-four-fifths). There are also properties related to the code point's use in
-bidirectional text and other display-related properties.
+The Unicode specification includes a database of information about
+code points. For each defined code point, the information includes
+the character's name, its category, the numeric value if applicable
+(for characters representing numeric concepts such as the Roman
+numerals, fractions such as one-third and four-fifths, etc.). There
+are also display-related properties, such as how to use the code point
+in bidirectional text.
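+
+For a quick taste of this database, the :mod:`unicodedata` module
+exposes these properties directly (a small illustrative snippet)::
+
+    >>> import unicodedata
+    >>> unicodedata.name('♞')
+    'BLACK CHESS KNIGHT'
+    >>> unicodedata.category('♞')
+    'So'
+    >>> unicodedata.numeric('\N{VULGAR FRACTION ONE THIRD}')
+    0.3333333333333333
+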
The following program displays some information about several characters, and
prints the numeric value of one particular character::
@@ -451,6 +392,88 @@ other". See
list of category codes.
+Comparing Strings
+-----------------
+
+Unicode adds some complication to comparing strings, because the same
+set of characters can be represented by different sequences of code
+points. For example, a letter like 'ê' can be represented as a single
+code point U+00EA, or as U+0065 U+0302, which is the code point for
+'e' followed by a code point for 'COMBINING CIRCUMFLEX ACCENT'. These
+will produce the same output when printed, but one is a string of
+length 1 and the other is of length 2.
+
+One tool for a case-insensitive comparison is the
+:meth:`~str.casefold` string method that converts a string to a
+case-insensitive form following an algorithm described by the Unicode
+Standard. This algorithm has special handling for characters such as
+the German letter 'ß' (code point U+00DF), which becomes the pair of
+lowercase letters 'ss'.
+
+::
+
+ >>> street = 'Gürzenichstraße'
+ >>> street.casefold()
+ 'gürzenichstrasse'
+
+A second tool is the :mod:`unicodedata` module's
+:func:`~unicodedata.normalize` function that converts strings to one
+of several normal forms, where letters followed by a combining
+character are replaced with single characters. :func:`normalize` can
+be used to perform string comparisons that won't falsely report
+inequality if two strings use combining characters differently:
+
+::
+
+    import unicodedata
+
+    def compare_strs(s1, s2):
+        def NFD(s):
+            return unicodedata.normalize('NFD', s)
+
+        return NFD(s1) == NFD(s2)
+
+    single_char = 'ê'
+    multiple_chars = '\N{LATIN SMALL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
+    print('length of first string=', len(single_char))
+    print('length of second string=', len(multiple_chars))
+    print(compare_strs(single_char, multiple_chars))
+
+When run, this outputs:
+
+.. code-block:: shell-session
+
+ $ python3 compare-strs.py
+ length of first string= 1
+ length of second string= 2
+ True
+
+The first argument to the :func:`~unicodedata.normalize` function is a
+string giving the desired normalization form, which can be one of
+'NFC', 'NFKC', 'NFD', and 'NFKD'.
+
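+For instance, the compatibility ('K') forms replace characters such as
+the 'ﬁ' ligature with their plainer equivalents, while 'NFC' composes a
+letter and a following combining character into a single code point
+where one exists (a small illustrative snippet)::
+
+    >>> import unicodedata
+    >>> unicodedata.normalize('NFKC', '\N{LATIN SMALL LIGATURE FI}')
+    'fi'
+    >>> len(unicodedata.normalize('NFC', 'e\N{COMBINING CIRCUMFLEX ACCENT}'))
+    1
+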
+The Unicode Standard also specifies how to do caseless comparisons::
+
+    import unicodedata
+
+    def compare_caseless(s1, s2):
+        def NFD(s):
+            return unicodedata.normalize('NFD', s)
+
+        return NFD(NFD(s1).casefold()) == NFD(NFD(s2).casefold())
+
+    # Example usage
+    single_char = 'ê'
+    multiple_chars = '\N{LATIN CAPITAL LETTER E}\N{COMBINING CIRCUMFLEX ACCENT}'
+
+    print(compare_caseless(single_char, multiple_chars))
+
+This will print ``True``. (Why is :func:`NFD` invoked twice? Because
+there are a few characters that make :meth:`casefold` return a
+non-normalized string, so the result needs to be normalized again. See
+section 3.13 of the Unicode Standard for a discussion and an example.)
+
+
Unicode Regular Expressions
---------------------------
@@ -567,22 +590,22 @@ particular byte ordering and don't skip the BOM.
In some areas, it is also convention to use a "BOM" at the start of UTF-8
encoded files; the name is misleading since UTF-8 is not byte-order dependent.
-The mark simply announces that the file is encoded in UTF-8. Use the
-'utf-8-sig' codec to automatically skip the mark if present for reading such
-files.
+The mark simply announces that the file is encoded in UTF-8. For reading such
+files, use the 'utf-8-sig' codec to automatically skip the mark if present.
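+
+For example (a brief sketch), 'utf-8-sig' strips a leading BOM, while
+plain 'utf-8' decodes it as the character U+FEFF::
+
+    >>> b'\xef\xbb\xbfHello'.decode('utf-8-sig')
+    'Hello'
+    >>> b'\xef\xbb\xbfHello'.decode('utf-8')
+    '\ufeffHello'
+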
Unicode filenames
-----------------
-Most of the operating systems in common use today support filenames that contain
-arbitrary Unicode characters. Usually this is implemented by converting the
-Unicode string into some encoding that varies depending on the system. For
-example, Mac OS X uses UTF-8 while Windows uses a configurable encoding; on
-Windows, Python uses the name "mbcs" to refer to whatever the currently
-configured encoding is. On Unix systems, there will only be a filesystem
-encoding if you've set the ``LANG`` or ``LC_CTYPE`` environment variables; if
-you haven't, the default encoding is UTF-8.
+Most of the operating systems in common use today support filenames
+that contain arbitrary Unicode characters. Usually this is
+implemented by converting the Unicode string into some encoding that
+varies depending on the system. Today Python is converging on using
+UTF-8: Python on macOS has used UTF-8 for several versions, and Python
+3.6 switched to using UTF-8 on Windows as well. On Unix systems,
+there will only be a filesystem encoding if you've set the ``LANG`` or
+``LC_CTYPE`` environment variables; if you haven't, the default
+encoding is again UTF-8.
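+
+For example, :func:`os.fsencode` and :func:`os.fsdecode` convert
+between :class:`str` filenames and their encoded :class:`bytes` form
+(a quick sketch; the exact bytes depend on your platform's filesystem
+encoding, assumed here to be UTF-8)::
+
+    >>> import os
+    >>> os.fsencode('café.txt')
+    b'caf\xc3\xa9.txt'
+    >>> os.fsdecode(b'caf\xc3\xa9.txt')
+    'café.txt'
+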
The :func:`sys.getfilesystemencoding` function returns the encoding to use on
your current system, in case you want to do the encoding manually, but there's
@@ -597,9 +620,9 @@ automatically converted to the right encoding for you::
Functions in the :mod:`os` module such as :func:`os.stat` will also accept Unicode
filenames.
-The :func:`os.listdir` function returns filenames and raises an issue: should it return
+The :func:`os.listdir` function returns filenames, which raises an issue: should it return
the Unicode version of filenames, or should it return bytes containing
-the encoded versions? :func:`os.listdir` will do both, depending on whether you
+the encoded versions? :func:`os.listdir` can do both, depending on whether you
provided the directory path as bytes or a Unicode string. If you pass a
Unicode string as the path, filenames will be decoded using the filesystem's
encoding and a list of Unicode strings will be returned, while passing a byte
@@ -619,16 +642,17 @@ will produce the following output:
.. code-block:: shell-session
- amk:~$ python t.py
+ $ python listdir-test.py
[b'filename\xe4\x94\x80abc', ...]
['filename\u4500abc', ...]
The first list contains UTF-8-encoded filenames, and the second list contains
the Unicode versions.
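+
+A minimal sketch of the calls involved (assuming the current directory
+contains the file shown above)::
+
+    import os
+
+    # A bytes path returns raw encoded filenames;
+    # a str path returns decoded filenames.
+    print(os.listdir(b'.'))
+    print(os.listdir('.'))
+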
-Note that on most occasions, the Unicode APIs should be used. The bytes APIs
-should only be used on systems where undecodable file names can be present,
-i.e. Unix systems.
+Note that on most occasions, you can just stick with using
+Unicode with these APIs. The bytes APIs should only be used on
+systems where undecodable file names can be present; that's
+pretty much only Unix systems now.
Tips for Writing Unicode-aware Programs
@@ -695,10 +719,10 @@ with the ``surrogateescape`` error handler::
f.write(data)
The ``surrogateescape`` error handler will decode any non-ASCII bytes
-as code points in the Unicode Private Use Area ranging from U+DC80 to
-U+DCFF. These private code points will then be turned back into the
-same bytes when the ``surrogateescape`` error handler is used when
-encoding the data and writing it back out.
+as code points in a special range running from U+DC80 to
+U+DCFF. These code points will then turn back into the
+same bytes when the ``surrogateescape`` error handler is used to
+encode the data and write it back out.
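+
+A small demonstration of the round-trip (illustrative; separate from
+the file-handling example above)::
+
+    >>> raw = b'\x80abc'
+    >>> text = raw.decode('ascii', 'surrogateescape')
+    >>> text
+    '\udc80abc'
+    >>> text.encode('ascii', 'surrogateescape') == raw
+    True
+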
References
@@ -730,4 +754,5 @@ Andrew Kuchling, and Ezio Melotti.
Thanks to the following people who have noted errors or offered
suggestions on this article: Éric Araujo, Nicholas Bastin, Nick
Coghlan, Marius Gedminas, Kent Johnson, Ken Krugler, Marc-André
-Lemburg, Martin von Löwis, Terry J. Reedy, Chad Whitacre.
+Lemburg, Martin von Löwis, Terry J. Reedy, Serhiy Storchaka,
+Eryk Sun, Chad Whitacre, Graham Wideman.