Diffstat (limited to 'Doc')
-rw-r--r-- | Doc/lib/libunicodedata.tex | 40 |
1 files changed, 37 insertions, 3 deletions
diff --git a/Doc/lib/libunicodedata.tex b/Doc/lib/libunicodedata.tex
index 5096652..add00c9 100644
--- a/Doc/lib/libunicodedata.tex
+++ b/Doc/lib/libunicodedata.tex
@@ -5,7 +5,7 @@
 \modulesynopsis{Access the Unicode Database.}
 \moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
 \sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}
-
+\sectionauthor{Martin v. L\"owis}{martin@v.loewis.de}
 \index{Unicode}
 \index{character}
 
@@ -14,10 +14,10 @@
 This module provides access to the Unicode Character Database which
 defines character properties for all Unicode characters. The data in
 this database is based on the \file{UnicodeData.txt} file version
-3.0.0 which is publically available from \url{ftp://ftp.unicode.org/}.
+3.2.0 which is publically available from \url{ftp://ftp.unicode.org/}.
 
 The module uses the same names and symbols as defined by the
-UnicodeData File Format 3.0.0 (see
+UnicodeData File Format 3.2.0 (see
 \url{http://www.unicode.org/Public/UNIDATA/UnicodeData.html}). It
 defines the following functions:
 
@@ -83,3 +83,37 @@ defines the following functions:
 character \var{unichr} as string. An empty string is returned in case
 no such mapping is defined.
 \end{funcdesc}
+
+\begin{funcdesc}{normalize}{form, unistr}
+
+Return the normal form \var{form} for the Unicode string \var{unistr}.
+Valid values for \var{form} are 'NFC', 'NFKC', 'NFD', and 'NFKD'.
+
+The Unicode standard defines various normalization forms of a Unicode
+string, based on the definition of canonical equivalence and
+compatibility equivalence. In Unicode, several characters can be
+expressed in various way. For example, the character U+00C7 (LATIN
+CAPITAL LETTER C WITH CEDILLA) can also be expressed as the sequence
+U+0043 (LATIN CAPITAL LETTER C) U+0327 (COMBINING CEDILLA).
+
+For each character, there are two normal forms: normal form C and
+normal form D. Normal form D (NFD) is also known as canonical
+decomposition, and translates each character into its decomposed form.
+Normal form C (NFC) first applies a canonical decomposition, then
+composes pre-combined characters again.
+
+In addition to these two forms, there two additional normal forms
+based on compatibility equivalence. In Unicode, certain characters are
+supported which normally would be unified with other characters. For
+example, U+2160 (ROMAN NUMERAL ONE) is really the same thing as U+0049
+(LATIN CAPITAL LETTER I). However, it is supported in Unicode for
+compatibility with existing character sets (e.g. gb2312).
+
+The normal form KD (NFKD) will apply the compatibility decomposition,
+i.e. replace all compatibility characters with their equivalents. The
+normal form KC (NFKC) first applies the compatibility decomposition,
+followed by the canonical composition.
+
+\versionadded{2.3}
+\end{funcdesc}
+
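
The following is not part of the patch. It is a minimal sketch of how the documented normalize(form, unistr) function behaves on the two examples the new text uses (U+00C7 and U+2160). It assumes a current Python 3 interpreter, where every str is a Unicode string; on the Python 2.3 release targeted by this change the literals would need the u'' prefix. The variable names are illustrative only.

```python
import unicodedata

# The two spellings of "C with cedilla" mentioned in the documentation text.
composed = "\u00c7"          # LATIN CAPITAL LETTER C WITH CEDILLA
decomposed = "\u0043\u0327"  # LATIN CAPITAL LETTER C + COMBINING CEDILLA

# NFD: canonical decomposition splits the precomposed character.
nfd = unicodedata.normalize("NFD", composed)

# NFC: canonical decomposition followed by canonical composition.
nfc = unicodedata.normalize("NFC", decomposed)

# A compatibility character: ROMAN NUMERAL ONE.
roman_one = "\u2160"

# The canonical forms leave compatibility characters untouched ...
nfc_roman = unicodedata.normalize("NFC", roman_one)

# ... while the K forms replace them with their equivalents.
nfkd_roman = unicodedata.normalize("NFKD", roman_one)
nfkc_roman = unicodedata.normalize("NFKC", roman_one)

for label, value in [
    ("NFD(U+00C7)", nfd),
    ("NFC(U+0043 U+0327)", nfc),
    ("NFC(U+2160)", nfc_roman),
    ("NFKD(U+2160)", nfkd_roman),
    ("NFKC(U+2160)", nfkc_roman),
]:
    # Print code points rather than glyphs so the result is unambiguous.
    print(label, "->", " ".join("U+%04X" % ord(ch) for ch in value))
```

Because the composed and decomposed spellings compare unequal as raw strings, normalizing both sides to the same form before comparison is the usual way to test for canonical equivalence.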