Diffstat (limited to 'Doc')
-rw-r--r-- Doc/lib/libunicodedata.tex 40
1 files changed, 37 insertions, 3 deletions
diff --git a/Doc/lib/libunicodedata.tex b/Doc/lib/libunicodedata.tex
index 5096652..add00c9 100644
--- a/Doc/lib/libunicodedata.tex
+++ b/Doc/lib/libunicodedata.tex
@@ -5,7 +5,7 @@
\modulesynopsis{Access the Unicode Database.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
-
+\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}
\index{Unicode}
\index{character}
@@ -14,10 +14,10 @@
This module provides access to the Unicode Character Database which
defines character properties for all Unicode characters. The data in
this database is based on the \file{UnicodeData.txt} file version
-3.0.0 which is publically available from \url{ftp://ftp.unicode.org/}.
+3.2.0 which is publicly available from \url{ftp://ftp.unicode.org/}.
The module uses the same names and symbols as defined by the
-UnicodeData File Format 3.0.0 (see
+UnicodeData File Format 3.2.0 (see
\url{http://www.unicode.org/Public/UNIDATA/UnicodeData.html}). It
defines the following functions:
@@ -83,3 +83,37 @@ defines the following functions:
character \var{unichr} as string. An empty string is returned in case
no such mapping is defined.
\end{funcdesc}
+
+\begin{funcdesc}{normalize}{form, unistr}
+
+Return the normal form \var{form} for the Unicode string \var{unistr}.
+Valid values for \var{form} are \code{'NFC'}, \code{'NFKC'}, \code{'NFD'}, and \code{'NFKD'}.
+
+The Unicode standard defines various normalization forms of a Unicode
+string, based on the definition of canonical equivalence and
+compatibility equivalence. In Unicode, several characters can be
+expressed in various ways. For example, the character U+00C7 (LATIN
+CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
+U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+
+For each character, there are two normal forms: normal form C and
+normal form D. Normal form D (NFD) is also known as canonical
+decomposition, and translates each character into its decomposed form.
+Normal form C (NFC) first applies a canonical decomposition, then
+composes pre-combined characters again.
+
+In addition to these two forms, there are two additional normal forms
+based on compatibility equivalence. In Unicode, certain characters are
+supported which normally would be unified with other characters. For
+example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
+(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
+compatibility with existing character sets (e.g. gb2312).
+
+The normal form KD (NFKD) will apply the compatibility decomposition,
+i.e. replace all compatibility characters with their equivalents. The
+normal form KC (NFKC) first applies the compatibility decomposition,
+followed by the canonical composition.
+
+\versionadded{2.3}
+\end{funcdesc}
+
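Note on the new text: the behaviour described for normalize() can be checked with a short script. This is only a sketch, not part of the patch. It assumes a current Python 3 interpreter, where plain string literals are Unicode; on the 2.3 release targeted by this change the literals would presumably need the u prefix. The variable names are illustrative.

```python
import unicodedata

# The two spellings of "C with cedilla" from the documentation text.
composed = "\u00c7"          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# NFD splits the precomposed character into base letter plus combining mark.
nfd = unicodedata.normalize("NFD", composed)

# NFC decomposes first, then recomposes, so the sequence collapses to U+00C7.
nfc = unicodedata.normalize("NFC", decomposed)

# U+2160 (ROMAN NUMERAL ONE) is a compatibility character.
roman_one = "\u2160"
nfkd = unicodedata.normalize("NFKD", roman_one)  # replaced by LATIN CAPITAL LETTER I
nfkc = unicodedata.normalize("NFKC", roman_one)  # same replacement, then composition
nfc_roman = unicodedata.normalize("NFC", roman_one)  # canonical forms leave it alone

print([hex(ord(ch)) for ch in nfd])
print(hex(ord(nfc)))
print(nfkd, nfkc, nfc_roman == roman_one)
```

The canonical forms (NFC, NFD) never touch U+2160, because it is only compatibility-equivalent to U+0049; the replacement happens only under the K forms.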