summaryrefslogtreecommitdiffstats
path: root/Doc/library/unicodedata.rst
diff options
context:
space:
mode:
Diffstat (limited to 'Doc/library/unicodedata.rst')
-rw-r--r--Doc/library/unicodedata.rst165
1 files changed, 165 insertions, 0 deletions
diff --git a/Doc/library/unicodedata.rst b/Doc/library/unicodedata.rst
new file mode 100644
index 0000000..017d4ee
--- /dev/null
+++ b/Doc/library/unicodedata.rst
@@ -0,0 +1,165 @@
+
+:mod:`unicodedata` --- Unicode Database
+=======================================
+
+.. module:: unicodedata
+ :synopsis: Access the Unicode Database.
+.. moduleauthor:: Marc-Andre Lemburg <mal@lemburg.com>
+.. sectionauthor:: Marc-Andre Lemburg <mal@lemburg.com>
+.. sectionauthor:: Martin v. Löwis <martin@v.loewis.de>
+
+
+.. index::
+ single: Unicode
+ single: character
+ pair: Unicode; database
+
+This module provides access to the Unicode Character Database which defines
+character properties for all Unicode characters. The data in this database is
+based on the :file:`UnicodeData.txt` file version 4.1.0 which is publicly
+available from ftp://ftp.unicode.org/.
+
+The module uses the same names and symbols as defined by the UnicodeData File
+Format 4.1.0 (see http://www.unicode.org/Public/4.1.0/ucd/UCD.html). It defines
+the following functions:
+
+
+.. function:: lookup(name)
+
+ Look up character by name. If a character with the given name is found, return
+ the corresponding Unicode character. If not found, :exc:`KeyError` is raised.
+
+
+.. function:: name(unichr[, default])
+
+ Returns the name assigned to the Unicode character *unichr* as a string. If no
+ name is defined, *default* is returned, or, if not given, :exc:`ValueError` is
+ raised.
+
+
+.. function:: decimal(unichr[, default])
+
+ Returns the decimal value assigned to the Unicode character *unichr* as integer.
+ If no such value is defined, *default* is returned, or, if not given,
+ :exc:`ValueError` is raised.
+
+
+.. function:: digit(unichr[, default])
+
+ Returns the digit value assigned to the Unicode character *unichr* as integer.
+ If no such value is defined, *default* is returned, or, if not given,
+ :exc:`ValueError` is raised.
+
+
+.. function:: numeric(unichr[, default])
+
+ Returns the numeric value assigned to the Unicode character *unichr* as float.
+ If no such value is defined, *default* is returned, or, if not given,
+ :exc:`ValueError` is raised.
+
+
+.. function:: category(unichr)
+
+ Returns the general category assigned to the Unicode character *unichr* as
+ string.
+
+
+.. function:: bidirectional(unichr)
+
+ Returns the bidirectional category assigned to the Unicode character *unichr* as
+ string. If no such value is defined, an empty string is returned.
+
+
+.. function:: combining(unichr)
+
+ Returns the canonical combining class assigned to the Unicode character *unichr*
+ as integer. Returns ``0`` if no combining class is defined.
+
+
+.. function:: east_asian_width(unichr)
+
+ Returns the east asian width assigned to the Unicode character *unichr* as
+ string.
+
+ .. versionadded:: 2.4
+
+
+.. function:: mirrored(unichr)
+
+ Returns the mirrored property assigned to the Unicode character *unichr* as
+ integer. Returns ``1`` if the character has been identified as a "mirrored"
+ character in bidirectional text, ``0`` otherwise.
+
+
+.. function:: decomposition(unichr)
+
+ Returns the character decomposition mapping assigned to the Unicode character
+ *unichr* as string. An empty string is returned in case no such mapping is
+ defined.
+
+
+.. function:: normalize(form, unistr)
+
+ Return the normal form *form* for the Unicode string *unistr*. Valid values for
+ *form* are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
+
+ The Unicode standard defines various normalization forms of a Unicode string,
+ based on the definition of canonical equivalence and compatibility equivalence.
+ In Unicode, several characters can be expressed in various way. For example, the
+ character U+00C7 (LATIN CAPITAL LETTER C WITH CEDILLA) can also be expressed as
+ the sequence U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+
+ For each character, there are two normal forms: normal form C and normal form D.
+ Normal form D (NFD) is also known as canonical decomposition, and translates
+ each character into its decomposed form. Normal form C (NFC) first applies a
+ canonical decomposition, then composes pre-combined characters again.
+
+ In addition to these two forms, there are two additional normal forms based on
+ compatibility equivalence. In Unicode, certain characters are supported which
+ normally would be unified with other characters. For example, U+2160 (ROMAN
+ NUMERAL ONE) is really the same thing as U+0049 (LATIN CAPITAL LETTER I).
+ However, it is supported in Unicode for compatibility with existing character
+ sets (e.g. gb2312).
+
+ The normal form KD (NFKD) will apply the compatibility decomposition, i.e.
+ replace all compatibility characters with their equivalents. The normal form KC
+ (NFKC) first applies the compatibility decomposition, followed by the canonical
+ composition.
+
+ .. versionadded:: 2.3
+
+In addition, the module exposes the following constant:
+
+
+.. data:: unidata_version
+
+ The version of the Unicode database used in this module.
+
+ .. versionadded:: 2.3
+
+
+.. data:: ucd_3_2_0
+
+ This is an object that has the same methods as the entire module, but uses the
+ Unicode database version 3.2 instead, for applications that require this
+ specific version of the Unicode database (such as IDNA).
+
+ .. versionadded:: 2.5
+
+Examples::
+
+ >>> unicodedata.lookup('LEFT CURLY BRACKET')
+ u'{'
+ >>> unicodedata.name(u'/')
+ 'SOLIDUS'
+ >>> unicodedata.decimal(u'9')
+ 9
+ >>> unicodedata.decimal(u'a')
+ Traceback (most recent call last):
+ File "<stdin>", line 1, in ?
+ ValueError: not a decimal
+ >>> unicodedata.category(u'A') # 'L'etter, 'u'ppercase
+ 'Lu'
+ >>> unicodedata.bidirectional(u'\u0660') # 'A'rabic, 'N'umber
+ 'AN'
+