--- BEGIN (CJK.INF VERSION 2.1 07/12/96) 185553 BYTES ---
CJK.INF Version 2.1 (July 12, 1996)

Copyright (C) 1995-1996 Ken Lunde. All Rights Reserved.

CJK is a registered trademark and service mark of The Research
  Libraries Group, Inc.

Online Companion to "Understanding Japanese Information Processing"
- ENGLISH: 1993, O'Reilly & Associates, Inc., ISBN 1-56592-043-0
- JAPANESE: 1995, SOFTBANK Corporation, ISBN 4-89052-708-7


	This online document provides information on CJK (that is,
Chinese, Japanese, and Korean) character set standards and encoding
systems. In short, it provides detailed information on how CJK text is
handled electronically. I am happy to share this information with
others, and I would appreciate any comments/feedback on its content.
The current version (master copy) of this document is maintained at:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/cjk.inf

This file may also be obtained by contacting me directly using one of
the e-mail addresses listed in the CONTACT INFORMATION section.


TABLE OF CONTENTS

  VERSION HISTORY
  RESTRICTIONS
  CONTACT INFORMATION
  WHAT HAPPENED TO JAPAN.INF?
  DISCLAIMER
  CONVENTIONS
  INTRODUCTION
  PART 1: WHAT'S UP WITH UJIP?
  PART 2: CJK CHARACTER SET STANDARDS
    2.1: JAPANESE
      2.1.1: JIS X 0201-1976
      2.1.2: JIS X 0208-1990
      2.1.3: JIS X 0212-1990
      2.1.4: JIS X 0221-1995
      2.1.5: JIS X 0213-199X
      2.1.6: OBSOLETE STANDARDS
    2.2: CHINESE (PRC)
      2.2.1: GB 1988-89
      2.2.2: GB 2312-80
      2.2.3: GB 6345.1-86
      2.2.4: GB 7589-87
      2.2.5: GB 7590-87
      2.2.6: GB 8565.2-88
      2.2.7: GB/T 12345-90
      2.2.8: GB/T 13131-9X
      2.2.9: GB/T 13132-9X
      2.2.10: GB 13000.1-93
      2.2.11: ISO-IR-165:1992
      2.2.12: OBSOLETE STANDARDS
    2.3: CHINESE (TAIWAN)
      2.3.1: BIG FIVE
      2.3.2: CNS 11643-1992
      2.3.3: CNS 5205
      2.3.4: OBSOLETE STANDARDS
    2.4: KOREAN
      2.4.1: KS C 5636-1993
      2.4.2: KS C 5601-1992
      2.4.3: KS C 5657-1991
      2.4.4: GB 12052-89
      2.4.5: KS C 5700-1995
      2.4.6: OBSOLETE STANDARDS
    2.5: CJK
      2.5.1: ISO 10646-1:1993
      2.5.2: CCCII
      2.5.3: ANSI Z39.64-1989
    2.6: OTHER
      2.6.1: GB 8045-87
      2.6.2: TCVN-5773:1993
  PART 3: CJK ENCODING SYSTEMS
    3.1: 7-BIT ISO 2022 ENCODING
      3.1.1: CODE SPACE
      3.1.2: ISO-REGISTERED ESCAPE SEQUENCES
      3.1.3: ISO-2022-JP AND ISO-2022-JP-2
      3.1.4: ISO-2022-KR
      3.1.5: ISO-2022-CN AND ISO-2022-CN-EXT
    3.2: EUC ENCODING
      3.2.1: JAPANESE REPRESENTATION
      3.2.2: CHINESE (PRC) REPRESENTATION
      3.2.3: CHINESE (TAIWAN) REPRESENTATION
      3.2.4: KOREAN REPRESENTATION
    3.3: LOCALE-SPECIFIC ENCODINGS
      3.3.1: SHIFT-JIS
      3.3.2: HZ (HZ-GB-2312)
      3.3.3: zW
      3.3.4: BIG FIVE
      3.3.5: JOHAB
      3.3.6: N-BYTE HANGUL
      3.3.7: UCS-2
      3.3.8: UCS-4
      3.3.9: UTF-7
      3.3.10: UTF-8
      3.3.11: UTF-16
      3.3.12: ANSI Z39.64-1989
      3.3.13: BASE64
      3.3.14: IBM DBCS-HOST
      3.3.15: IBM DBCS-PC
      3.3.16: IBM DBCS-/TBCS-EUC
      3.3.17: UNIFIED HANGUL CODE
      3.3.18: TRON CODE
      3.3.19: GBK
    3.4: CJK CODE PAGES
  PART 4: CJK CHARACTER SET COMPATIBILITY ISSUES
    4.1: JAPANESE
    4.2: CHINESE (PRC)
    4.3: CHINESE (TAIWAN)
    4.4: KOREAN
    4.5: ISO 10646-1:1993
    4.6: UNICODE
    4.7: CODE CONVERSION TIPS
  PART 5: CJK-CAPABLE OPERATING SYSTEMS
    5.1: MS-DOS
    5.2: WINDOWS
    5.3: MACINTOSH
    5.4: UNIX AND X WINDOWS
    5.5: OTHERS
  PART 6: CJK TEXT AND INTERNET SERVICES
    6.1: ELECTRONIC MAIL
    6.2: USENET NEWS
    6.3: GOPHER
    6.4: WORLD-WIDE WEB
    6.5: FILE TRANSFER TIPS
  PART 7: CJK TEXT HANDLING SOFTWARE
    7.1: MULE
    7.2: CNPRINT
    7.3: MASS
    7.4: ADOBE TYPE MANAGER (ATM)
    7.5: MACINTOSH SOFTWARE
    7.6: MACBLUE TELNET
    7.7: CXTERM
    7.8: UW-DBM
    7.9: POSTSCRIPT
    7.10: NJWIN
  PART 8: CJK PROGRAMMING ISSUES
    8.1: C AND C++
    8.2: PERL
    8.3: JAVA
  A FINAL NOTE
  ACKNOWLEDGMENTS
  APPENDIX A: INFORMATION SOURCES
    A.1: USENET NEWSGROUPS AND MAILING LISTS
      A.1.1: USENET NEWSGROUPS
      A.1.2: MAILING LISTS
    A.2: INTERNET RESOURCES
      A.2.1: USEFUL FTP SITES
      A.2.2: USEFUL TELNET SITES
      A.2.3: USEFUL GOPHER SITES
      A.2.4: USEFUL WWW SITES
      A.2.5: USEFUL MAIL SERVERS
    A.3: OTHER RESOURCES
      A.3.1: BOOKS
      A.3.2: MAGAZINES
      A.3.3: JOURNALS
      A.3.4: RFCs
      A.3.5: FAQs


VERSION HISTORY

	The following is a complete listing of the earlier versions of
this document along with their release dates and sizes (in bytes):

  Document   Version  Release Date  Size
  ^^^^^^^^   ^^^^^^^  ^^^^^^^^^^^^  ^^^^
  JAPAN.INF  1.0      Unknown       Unknown
  JAPAN.INF  1.1      08/19/91      101,784
  JAPAN.INF  1.2      03/20/92      166,929 (JIS) or 165,639 (Shift-JIS/EUC)
  CJK.INF    1.0      06/09/95      103,985
  CJK.INF    1.1      06/12/95      112,771
  CJK.INF    1.2      06/14/95      125,275
  CJK.INF    1.3      06/16/95      130,069
  CJK.INF    1.4      06/19/95      142,543
  CJK.INF    1.5      06/22/95      146,064
  CJK.INF    1.6      06/29/95      150,882
  CJK.INF    1.7      08/15/95      153,772
  CJK.INF    1.8      09/11/95      157,295
  CJK.INF    1.9      12/18/95      170,698
  CJK.INF    2.0      03/12/96      175,973

With the release of this version, all of the above are now considered
obsolete. Also, note the three-year gap between the last installment
of JAPAN.INF and the first installment of CJK.INF -- I was writing
UJIP and my PhD dissertation during those three years. Ah, so much for
excuses...


RESTRICTIONS

	This document is provided free-of-charge to *anyone*, but no
person or company is permitted to modify, sell, or otherwise
distribute it for profit or other purposes. This document may be
bundled with commercial products only with the prior consent of the
author, and provided that it is not modified in any way whatsoever.
The point here is that I worked long and hard on this document so that
lots of fine folks and companies can benefit from its contents -- not
profit from it.


CONTACT INFORMATION

	I would enjoy hearing from readers of this document, even if
it is just to say "hello" or whatever. I can be contacted as follows:

  Ken Lunde
  Adobe Systems Incorporated
  1585 Charleston Road
  P.O. Box 7900
  Mountain View, CA 94039-7900 USA
  415-962-3866 (office phone)
  415-960-0886 (facsimile)
  lunde@adobe.com (preferred)
  lunde@ora.com or ujip@ora.com
  WWW Home Page: http://jasper.ora.com/lunde/

If you wonder what I do for my day job, read on.
	I have been working for Adobe Systems for over four years now
(before that I was a graduate student at UW-Madison), and my current
position is Project Manager, CJK Type Development.


WHAT HAPPENED TO JAPAN.INF?

	Put bluntly, JAPAN.INF died. It initially evolved into my first
book entitled "Understanding Japanese Information Processing" (this
book is now into its second printing, and the Japanese translation was
just published). After my book came out, I did attempt to update
JAPAN.INF, but the effort felt a bit futile. I decided that something
fresh was necessary.
	JAPAN.INF also evolved into this document, which breaks the
Japanese barrier by providing similar information on Chinese and
Korean character sets and encodings. It fills the Chinese and Korean
gap, so to speak. My specialty (and hobby, believe it or not) is the
field of CJK character sets and encoding systems, so I felt that
shifting this document more towards those lines was an appropriate use of
my (copious) free time (I wish there were more than 24 hours in a
day!). Besides, this document now becomes useful to a much broader
audience.


DISCLAIMER

	Ah yes, the ever popular disclaimer! Here's mine. Although I
list my address here at Adobe Systems Incorporated for contact
purposes, Adobe Systems does not endorse this document which I have
created, and have continued (and will continue) to update on a regular
basis (uh, yeah, I promise this time!). This document is a personal
endeavor to inform people of how CJK text can be handled on a variety
of platforms.


CONVENTIONS

	The notation that is used for detailing Internet resource
information, such as the Internet protocol type, site name, path, and
file follows the URL (Uniform Resource Locator) notation, namely:

  protocol://site-name/path/file

An example URL is as follows:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/00README

The protocol is FTP, the site-name is ftp.ora.com, the path is pub/
examples/nutshell/ujip/, and the file is 00README. Also note that this
same notation is used for invoking FTP on WWW (World Wide Web)
browsing software, such as Mosaic, Netscape, or Lynx.
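	As a rough illustration of this notation (a sketch in Python,
which is not otherwise used in this document), the standard
urllib.parse module can split the example URL into the same four
parts. The function name split_url and the dictionary keys are
illustrative labels chosen to match the notation above:

```python
from urllib.parse import urlsplit
import posixpath


def split_url(url):
    """Split a URL into protocol, site-name, path, and file."""
    parts = urlsplit(url)
    # posixpath.split separates the directory part from the final name.
    directory, filename = posixpath.split(parts.path)
    return {
        "protocol": parts.scheme,
        "site-name": parts.hostname,
        # Drop the leading slash and keep a trailing one, as in the text.
        "path": directory.lstrip("/") + "/",
        "file": filename,
    }


example = split_url("ftp://ftp.ora.com/pub/examples/nutshell/ujip/00README")
for key, value in example.items():
    print(f"{key}: {value}")
```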
	Note that most references to HTTP documents use the four-
letter file extension ".html". However, some HTTP documents are on
file systems that support only three-letter file extensions (can you
say "MS-DOS"?), so you may encounter just ".htm". This is just to let
you know that what you see is not a typo.
	References to my book "Understanding Japanese Information
Processing" are (affectionately) abbreviated as UJIP. These references
also apply to the Japanese translation (UJIP-J).
	Hexadecimal values are prefixed with 0x, and every two
hexadecimal digits represent a one-byte value. Other values can be
assumed to be in decimal notation.
	Chinese characters are referred to as kanji (Japanese), hanzi
(Chinese), or hanja (Korean), depending on context.
	References to ISO 10646-1:1993 also refer to Unicode
(usually). I have done this so that I do not have to repeat "Unicode"
in the same context as ISO 10646-1:1993. There are times, however,
when I need to distinguish ISO 10646-1:1993 from Unicode.


INTRODUCTION

	Electronic mail (e-mail), just one of the many Internet
resources, has become a very efficient means of communicating both
locally and world-wide. While it is very simple to send text which
uses only the 94 printable ASCII characters, character sets that
contain more than these ASCII characters pose special problems.
	This document is primarily concerned with CJK character set
and encoding issues. Much of this sort of information is not easily
obtained. This represents one person's attempt at making such
information more widely available.


PART 1: WHAT'S UP WITH UJIP?

	UJIP (First Edition) was published in September 1993 by
O'Reilly & Associates, Incorporated. The second printing (*not* the
Second Edition) was subsequently published in March 1994. The page
count for both printings is unchanged at 470.
	The following files contain the latest information about
changes (additions and corrections) made to UJIP and UJIP-J for
various printings, both for those that have taken place (such as for
the second printing of the English edition) and for those that are
planned (the first digit is the edition, and the second is the
printing):

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/errata/ujip-errata-1-2.txt
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/errata/ujip-errata-1-3.txt
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/errata/ujip-j-errata-1-2.txt

I *highly* recommend that all readers of UJIP obtain these errata
files. Those without FTP access can request copies directly from me.
	The Japanese translation of UJIP (UJIP-J), co-published by
O'Reilly & Associates, Incorporated and SOFTBANK Corporation, was just
released. The translation was done by my good friend Jack Halpern,
along with one of his colleagues, Takeo Suzuki. The Japanese edition
incorporates corrections and updates not yet found in the English
edition. The page count is 535.
	Late-breaking news! I am currently working on UJIP Second
Edition (to be retitled as "Understanding CJK Information Processing"
and abbreviated UCJKIP). If all goes well, it should be available by
January 1997, and will be well over 700 pages. If there was something
you wanted to see in UJIP, now's your chance to send me a request...


PART 2: CJK CHARACTER SET STANDARDS

	These sections describe the character sets used in Japan,
China (PRC and Taiwan), and Korea. Exact numbers of characters are
provided for each character set standard (when known), as well as
tidbits of information not otherwise available. This provides the
basic foundations for understanding how CJK scripts are handled on
computer systems.
	The two basic types of characters enumerated by CJK character
set standards are Chinese characters (kanji, hanzi, or hanja), which
number in the thousands (and, in some cases, tens of thousands), and
characters other than Chinese characters (symbols, numerals, kana,
hangul, alphabets, and so on), which usually number in the hundreds
(there are thousands of pre-combined hangul, though).
	If you happen to be running X Windows, it is very easy to
display these CJK character sets (if a bitmapped font for the
character set exists, that is). Here is what I usually do:

o Obtain a BDF (Bitmap Distribution Format) font for the target
  character set. Try the following URLs for starters:

  ftp://cair-archive.kaist.ac.kr/pub/hangul/fonts/
  ftp://etlport.etl.go.jp/pub/mule/fonts/
  ftp://ftp.ifcss.org/pub/software/fonts/{big5,cns,gb,misc,unicode}/bdf/
  ftp://ftp.kuis.kyoto-u.ac.jp/misc/fonts/jisksp-fonts/
  ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/fonts/
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/unix/
  ftp://ftp.technet.sg:/pub/chinese/fonts/
  http://ccic.ifcss.org/www/pub/software/fonts/

  BDF files usually have the string "bdf" somewhere in their file
  name, usually at the end. If the file is compressed (noticing that
  it ends in .gz or .Z is a good indication), decompress it. BDF files
  are text files.

o Convert the BDF file to SNF (Server Natural Format) or PCF (Portable
  Compiled Format) using the programs "bdftosnf" or "bdftopcf,"
  respectively. Example command lines are as follows:

  % bdftopcf jiskan16-1990.bdf > k16-90.pcf
  % bdftosnf jiskan16-1990.bdf > k16-90.snf

  SNF files (and the "bdftosnf" program) are used on X11R4 and
  earlier, and PCF files (and the "bdftopcf" program) are used on
  X11R5 and later.

o Copy the SNF or PCF file to a directory in the font search path (or
  make a new path). Supposing I made a new directory called "fonts" in
  my home directory, I then run "mkfontdir" on the directory
  containing the SNF or PCF files as follows:

  % mkfontdir ~/fonts

  This creates a fonts.dir file in ~/fonts. I can now add this
  directory to my font search path with the following command:

  % xset +fp ~/fonts

o The command "xfd" (X Font Displayer) with the "-fn" switch followed
  by a font name then invokes a window that displays all the
  characters of the font. In the case of two-byte (CJK) fonts, one row
  is displayed at a time. The following is an example command line:

  % xfd -fn -misc-fixed-medium-r-normal--16-150-75-75-c-160-jisx0208.1990-0

  You can create a "fonts.alias" file in the same directory as the
  "fonts.dir" file in order to shorten the name when accessing the
  font. The alias "k16-90" could be used instead if the content of the
  fonts.alias file is as follows:

  k16-90  -misc-fixed-medium-r-normal--16-150-75-75-c-160-jisx0208.1990-0

  Don't forget to execute the following command in order to make the
  X server aware of the new alias:

  % xset fp rehash

  Now you can use a simpler command line for "xfd" as follows:

  % xfd -fn k16-90
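
	The long font name used above follows the X Logical Font
Description (XLFD) convention of 14 hyphen-separated fields. The
following Python sketch splits such a name so that the character set
registry and encoding are easy to inspect, and then builds the
corresponding fonts.alias line. The function names and field labels
are illustrative assumptions, and are not part of any X utility:

```python
# Labels for the 14 XLFD fields, in order (the labels are informal).
XLFD_FIELDS = [
    "foundry", "family", "weight", "slant", "setwidth", "add_style",
    "pixel_size", "point_size", "resolution_x", "resolution_y",
    "spacing", "average_width", "charset_registry", "charset_encoding",
]


def parse_xlfd(name):
    """Split a fully specified XLFD font name into its 14 fields."""
    parts = name.split("-")
    # A full name starts with "-", so the first piece is empty.
    if len(parts) != 15 or parts[0] != "":
        raise ValueError("not a fully specified 14-field XLFD name")
    return dict(zip(XLFD_FIELDS, parts[1:]))


def alias_line(alias, name):
    """Format one fonts.alias entry: the alias, then the full name."""
    return f"{alias}  {name}"


NAME = "-misc-fixed-medium-r-normal--16-150-75-75-c-160-jisx0208.1990-0"
font = parse_xlfd(NAME)
print(font["charset_registry"], font["charset_encoding"])
print(alias_line("k16-90", NAME))
```

For the example name, the pixel size field is 16 and the average width
field is 160 (expressed in tenths of a pixel), which is consistent
with a 16-by-16 two-byte font.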

	The "X Window System User's Guide" (Volume 3 of the X Window
System series by O'Reilly & Associates, Inc.) provides detailed
information on managing fonts under X Windows (pp 123-160). The
article entitled "The X Administrator: Font Formats and Utilities" (pp
14-34 in "The X Resource," Issue 2), describes the BDF, SNF, and PCF
formats in great detail.
	There is another bitmap format called HBF (Hanzi Bitmap
Format), which is similar to BDF, but optimized for fixed-width
(monospaced) fonts. It is described in the article entitled "The HBF
Font Format: Optimizing Fixed-pitch Font Support" (pp 113-123 in "The
X Resource," Issue 10), and also at the following URL:

  ftp://ftp.ifcss.org/pub/software/fonts/hbf-discussion/

HBF fonts can be found at the following URL:

  ftp://ftp.ifcss.org/pub/software/fonts/{big5,cns,gb,misc,unicode}/hbf/

	Lastly, you may wish to check out my newly-developed CJK
Character Set Server, which generates various CJK character sets with
proper encoding applied. It is written in Perl, and accessed through
an HTML form. This server can be considered an upgrade to my JChar
tool (written in C). The URL is:

  http://jasper.ora.com/lunde/cjk-char.html


2.1: JAPANESE

	All (national) character set standards that originate in Japan
have names that begin with the three letters JIS. JIS is short for
"Japanese Industrial Standard." The corresponding manuals, however,
are published by JSA (Japanese Standards Association). Chapter 3 and
Appendixes H and J of UJIP provide more detailed information on
Japanese character set standards.


2.1.1: JIS X 0201-1976

	JIS X 0201-1976 (formerly JIS C 6220-1969; reaffirmed in 1989;
and its revision [with no character set changes] is currently under
public review) enumerates two sets of characters: JIS-Roman and
half-width katakana.
	JIS-Roman is the Japanese equivalent of the ASCII character
set, namely 128 characters consisting of the following:

o 10 numerals
o 52 uppercase and lowercase characters of the Latin alphabet
o 32 symbols (punctuation and so on)
o 34 non-printing characters (white space and control characters)

The term "white space" refers to characters that occupy space, but
have no appearance, such as tabs, spaces, and termination characters
(line feed, carriage return, and form feed).
	So, how are JIS-Roman and ASCII different? The following
three codes are (usually) different:

  Code   ASCII        JIS-Roman
  ^^^^   ^^^^^        ^^^^^^^^^
  0x5C   backslash    yen symbol
  0x7C   broken bar   bar
  0x7E   tilde        overbar

	Half-width katakana consists of 63 characters that provide a
minimal set of characters necessary for expressing Japanese. The
shapes are compressed, and visually occupy a space half that of
*normal* Japanese characters.
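	The two character sets above can be sketched as a small decoder
in Python. Several details here are assumptions that are not stated
in this section: that the half-width katakana occupy 0xA1 through
0xDF in an eight-bit representation, and that Unicode code points
(U+00A5, U+203E, and U+FF61 onward) are acceptable stand-ins for the
JIS-Roman and half-width katakana characters. Code 0x7C is left
unchanged because the difference listed for it is one of glyph shape.
The function name decode_jisx0201 is hypothetical:

```python
# JIS-Roman codes whose meaning differs from ASCII (see the table above).
JIS_ROMAN_OVERRIDES = {
    0x5C: "\u00A5",  # yen symbol instead of backslash
    0x7E: "\u203E",  # overbar instead of tilde
}

# Assumed eight-bit range for half-width katakana.
KATAKANA_FIRST, KATAKANA_LAST = 0xA1, 0xDF
HALF_WIDTH_KATAKANA_COUNT = KATAKANA_LAST - KATAKANA_FIRST + 1


def decode_jisx0201(data):
    """Decode eight-bit JIS X 0201 bytes into a Unicode string."""
    out = []
    for byte in data:
        if byte in JIS_ROMAN_OVERRIDES:
            out.append(JIS_ROMAN_OVERRIDES[byte])
        elif byte < 0x80:
            # All other JIS-Roman codes match ASCII.
            out.append(chr(byte))
        elif KATAKANA_FIRST <= byte <= KATAKANA_LAST:
            # Map onto the Unicode half-width katakana block.
            out.append(chr(0xFF61 + (byte - KATAKANA_FIRST)))
        else:
            raise ValueError(f"0x{byte:02X} is not assigned in this sketch")
    return "".join(out)


print(HALF_WIDTH_KATAKANA_COUNT)
print(decode_jisx0201(b"\x5c100"))
```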


2.1.2: JIS X 0208-1990

	This basic Japanese character set standard enumerates 6,879
characters, 6,355 of which are kanji separated into two levels. Kanji
in the first level are arranged by (most frequent) reading, and those
in the second level are arranged by radical then total number of
(remaining) strokes.

o Row 1: 94 symbols
o Row 2: 53 symbols
o Row 3: 10 numerals and 52 uppercase and lowercase Latin alphabet
o Row 4: 83 hiragana
o Row 5: 86 katakana
o Row 6: 48 uppercase and lowercase Greek alphabet
o Row 7: 66 uppercase and lowercase Cyrillic (Russian) alphabet
o Row 8: 32 line-drawing elements
o Rows 16 through 47: 2,965 kanji (JIS Level 1 Kanji; last is 47-51)
o Rows 48 through 84: 3,390 kanji (JIS Level 2 Kanji; last is 84-06)

Appendix B of UJIP provides a complete illustration of the JIS X
0208-1990 character set standard by KUTEN (row-cell) code. Appendix G
(pp 294-317) of "Developing International Software for Windows 95 and
Windows NT" by Nadine Kano illustrates the JIS X 0208-1990 character
set standard plus the Microsoft extensions by Shift-JIS code
(Microsoft calls this Code Page 932).
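	The row counts above can be cross-checked, and KUTEN values
converted to code points, with a short Python sketch. It assumes the
usual relationship in which each KUTEN value plus 0x20 gives one byte
of the two-byte JIS code, and that setting the high bit of both bytes
gives the EUC representation (see Section 3.2.1). The function names
are hypothetical, and the final line relies on Python's "euc_jp"
codec:

```python
def kuten_to_jis(row, cell):
    """Return the two-byte JIS code for a KUTEN (row-cell) pair."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must each be in 1..94")
    return bytes([row + 0x20, cell + 0x20])


def kuten_to_euc(row, cell):
    """Return the EUC bytes: the JIS code with both high bits set."""
    return bytes(b | 0x80 for b in kuten_to_jis(row, cell))


def count_cells(first_row, last_row, last_cell):
    """Count cells from first_row-01 through last_row-last_cell."""
    return (last_row - first_row) * 94 + last_cell


level1 = count_cells(16, 47, 51)   # JIS Level 1 Kanji
level2 = count_cells(48, 84, 6)    # JIS Level 2 Kanji
non_kanji = 94 + 53 + (10 + 52) + 83 + 86 + 48 + 66 + 32  # Rows 1-8

print(level1, level2, non_kanji, level1 + level2 + non_kanji)
print(kuten_to_jis(16, 1).hex(), kuten_to_euc(16, 1).decode("euc_jp"))
```

The three totals should agree with the 2,965, 3,390, and 6,879 figures
given above.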
	Earlier versions of this standard were dated 1978 (JIS C
6226-1978) and 1983 (JIS X 0208-1983, formerly JIS C 6226-1983).
	JIS X 0208 went through a revision (from November 1995 until
February 1996), and is slated for publication sometime in 1996 (to
become JIS X 0208-1996). More information on this revision is
available at the following URL:

  ftp://ftp.tiu.ac.jp/jis/jisx0208/


2.1.3: JIS X 0212-1990

	This supplemental Japanese character set standard enumerates
6,067 characters, 5,801 of which are kanji ordered by radical then
total number of (remaining) strokes. All 5,801 kanji are unique when
compared to those in JIS X 0208-1990 (see Section 2.1.2). The
remaining 266 characters are categorized as non-kanji.

o Row 2: 21 diacritics and symbols
o Row 6: 21 Greek characters with diacritics
o Row 7: 26 Eastern European characters
o Rows 9 through 11: 198 alphabetic characters
o Rows 16 through 77: 5,801 kanji (last is 77-67)

Appendix C of UJIP provides a complete illustration of the JIS X
0212-1990 character set standard by KUTEN (row-cell) code.
	The only commercial operating system that provides JIS X
0212-1990 support is BTRON by Personal Media Corporation:

  http://www.personal-media.co.jp/

Section 3.3.18 provides information about TRON Code (used by BTRON),
and details how it encodes the JIS X 0212-1990 character set.


2.1.4: JIS X 0221-1995

	This document is, for all practical purposes, the Japanese
translation of ISO 10646-1:1993 (see Section 2.5.1). Like ISO
10646-1:1993, it is based on Unicode Version 1.1.
	It is noteworthy that JIS X 0221-1995 enumerates subsets that
are applicable for Japanese use (a brief description of their contents
in parentheses):

o BASIC JAPANESE (JIS X 0208-1990 and JIS X 0201-1976 -- characters
  that can be created by means of combining are not included -- 6,884
  characters)
o JAPANESE NON IDEOGRAPHICS SUPPLEMENT (1,913 characters: all non-
  kanji of JIS X 0212-1990 plus hundreds of non-JIS characters)
o JAPANESE IDEOGRAPHICS SUPPLEMENT 1 (918 frequently-used kanji from
  JIS X 0212-1990, including 28 that are identical to kanji forms in
  JIS C 6226-1978)
o JAPANESE IDEOGRAPHICS SUPPLEMENT 2 (the remainder of JIS X 0212-
  1990, namely 4,883 kanji)
o JAPANESE IDEOGRAPHICS SUPPLEMENT 3 (the remaining kanji of ISO
  10646-1:1993, namely 8,746 characters)
o FULLWIDTH ALPHANUMERICS (94 characters; for compatibility)
o HALFWIDTH KATAKANA (63 characters; for compatibility)

	Pages 893 through 993 provide Kangxi Zidian (a classic
300-year-old Chinese character dictionary containing approximately
50,000 characters) and Dai Kanwa Jiten (also known as Morohashi)
indexes for the entire Chinese character block, namely from 0x4E00
through 0x9FA5.
	At 25,750 Yen, it is actually cheaper than ISO 10646-1:1993!
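	The kanji counts in the subsets listed above can be checked
against the size of the Chinese character block. The following Python
sketch assumes that the kanji of JIS X 0208-1990 (6,355, from Section
2.1.2) and the three ideographic supplements together account for the
whole block with no overlap; that reading is an inference from the
descriptions above, not a statement taken from the standard:

```python
# Size of the Chinese character block, 0x4E00 through 0x9FA5 inclusive.
cjk_block = 0x9FA5 - 0x4E00 + 1

jisx0208_kanji = 6355   # kanji in BASIC JAPANESE (Section 2.1.2)
supplement_1 = 918      # frequently-used kanji from JIS X 0212-1990
supplement_2 = 4883     # remainder of JIS X 0212-1990
supplement_3 = 8746     # remaining kanji of ISO 10646-1:1993

jisx0212_kanji = supplement_1 + supplement_2
all_kanji = jisx0208_kanji + jisx0212_kanji + supplement_3

print(cjk_block, jisx0212_kanji, all_kanji)
```

Supplements 1 and 2 together give the 5,801 kanji of JIS X 0212-1990,
and the grand total matches the number of code points in the block.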


2.1.5: JIS X 0213-199X

	I recently became aware that JSA plans to publish an extension
to JIS X 0208, containing approximately 2,000 characters (kanji and
non-kanji). A public review of this new standard is planned for Summer
1996. I would expect that its information will eventually be available
at the following URL:

  ftp://ftp.tiu.ac.jp/jis/


2.1.6: OBSOLETE STANDARDS

	JIS C 6226-1978 and JIS X 0208-1983 (formerly JIS C 6226-1983)
have been superseded by JIS X 0208-1990. Section 4.1 provides details
on the changes made between these earlier versions of JIS X 0208.
	JIS X 0221-1995 does not mean the end of JIS X 0201-1976, JIS
X 0208-1990, and JIS X 0212-1990. Instead, it will co-exist with those
standards.


2.2: CHINESE (PRC)

	All character set standards that originate in PRC have
designations that begin with "GB." "GB" is short for "Guo Biao" (which
is, in turn, short for "Guojia Biaojun") and means "National
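Standard." A select few also have "/T" attached. The "T" is short for
"Tuijian" and means "Recommended" (it does not mean "Traditional,"
although several of the GB/T standards described below happen to
enumerate traditional forms). Section 2.2.11 describes
ISO-IR-165:1992, which is a variant of GB 2312-80. It is included here
because of this relationship.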
	Most people correlate GB character set standards with
simplified Chinese, but as you will see below, that is not always the
case.
	There are three basic character sets, each one having a
simplified and traditional version.

  Character Set  Set Number  Character Forms
  ^^^^^^^^^^^^^  ^^^^^^^^^^  ^^^^^^^^^^^^^^^
  GB 2312-80     0           Simplified
  GB/T 12345-90  1           Traditional of GB 2312-80
  GB 7589-87     2           Simplified
  GB/T 13131-9X  3           Traditional of GB 7589-87
  GB 7590-87     4           Simplified
  GB/T 13132-9X  5           Traditional of GB 7590-87


2.2.1: GB 1988-89

	This character set, formerly GB 1988-80 and sometimes referred
to as GB-Roman, is the Chinese analog to ASCII and ISO 646. The main
difference is that the currency symbol (0x24), which is represented as
a dollar sign ($) in ASCII, is represented as a Chinese Yuan
(currency) symbol instead.


2.2.2: GB 2312-80

	This basic (simplified) Chinese character set standard
enumerates 7,445 characters, 6,763 of which are hanzi separated into
two levels. Hanzi in the first level are arranged by reading, and
those in the second level are arranged by radical then total number of
(remaining) strokes. GB 2312-80 is also known as the "Primary Set,"
GB0 (zero), or just GB.

o Row 1: 94 symbols
o Row 2: 72 numerals
o Row 3: 94 full-width GB 1988-89 characters (see Section 2.2.1)
o Row 4: 83 hiragana
o Row 5: 86 katakana
o Row 6: 48 uppercase and lowercase Greek alphabet
o Row 7: 66 uppercase and lowercase Cyrillic (Russian) alphabet
o Row 8: 26 Pinyin and 37 Bopomofo characters
o Row 9: 76 line-drawing elements (09-04 through 09-79)
o Rows 16 through 55: 3,755 hanzi (Level 1 Hanzi; last is 55-89)
o Rows 56 through 87: 3,008 hanzi (Level 2 Hanzi; last is 87-94)

Compare some of the structure with JIS X 0208-1990, and you will find
many similarities, such as:

o Hiragana, katakana, Greek, and Cyrillic characters are in Rows 4, 5,
  6, and 7, respectively
o Chinese characters begin at Row 16
o Chinese characters are separated into two levels
o Level 1 arranged by reading
o Level 2 arranged by radical then total number of strokes

The Japanese standard, JIS C 6226-1978, came out in 1978, which means
that it pre-dates GB 2312-80. The above similarities could not be by
coincidence, but rather by design.
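	The shared layout can be observed directly with a short Python
sketch that decodes the same row-cell positions under both standards.
It assumes that Python's "euc_jp" and "gb2312" codecs are reasonable
stand-ins for JIS X 0208-1990 and GB 2312-80, and that EUC bytes are
formed by adding 0xA0 to the row and cell values. The function names
are hypothetical:

```python
def euc_bytes(row, cell):
    """Return the two EUC bytes for a row-cell pair."""
    return bytes([row + 0xA0, cell + 0xA0])


def decode_cell(row, cell, codec):
    """Decode one row-cell position, or return None if unassigned."""
    try:
        return euc_bytes(row, cell).decode(codec)
    except UnicodeDecodeError:
        return None


def shared_cells(row):
    """List the cells of a row that decode identically in both sets."""
    shared = []
    for cell in range(1, 95):
        jis = decode_cell(row, cell, "euc_jp")
        gb = decode_cell(row, cell, "gb2312")
        if jis is not None and jis == gb:
            shared.append(cell)
    return shared


for row, label in [(4, "hiragana"), (5, "katakana"),
                   (6, "Greek"), (7, "Cyrillic")]:
    print(f"Row {row} ({label}): {len(shared_cells(row))} shared cells")

# Row 16 is where Chinese characters begin in both sets, but the
# characters themselves differ.
print(decode_cell(16, 1, "euc_jp"), decode_cell(16, 1, "gb2312"))
```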
	Appendix G (pp 318-344) of "Developing International Software
for Windows 95 and Windows NT" by Nadine Kano illustrates the GB 2312-
80 character set standard by EUC code (Microsoft calls this Code Page
936). Code Page 936 incorporates the correction of the hanzi at 79-81,
and the correction of the order of 07-22 and 07-23 (see Section 2.2.3
for more details).


2.2.3: GB 6345.1-86

	This document specifies corrections and additions to GB
2312-80 (see Section 2.2.2). The following is a detailed enumeration
of the changes:

o The form of "g" in Row 3 (position 71) was altered
o Row 8 has six additional Pinyin characters (08-27 through 08-32)
o Row 10 contains half-width versions of Row 3 (94 characters)
o Row 11 contains half-width versions of the Pinyin characters from
  Row 8 (32 characters; 11-01 through 11-32)
o The hanzi at 79-81 was corrected to have a simplified left-side
  radical (this was an error in GB 2312-80)

Note that these changes affect the total number of characters in GB
2312-80 -- an increase of 132 characters. This now makes 7,577 as
the total number of characters in GB 2312-80 (7,445 plus 132).
	There was, however, an undocumented correction made in GB
6345.1-86. The order of characters 07-22 and 07-23 (uppercase
Cyrillic) was reversed. This error is apparently in the first and
perhaps second printing of the GB 2312-80 manual, because the copy I
have is from the third printing, and this has been corrected. Page 145
(Figure 113) of John Clews' "Language Automation Worldwide: The
Development of Character Set Standards" illustrates this error.
Developers should take special note of this -- I have seen GB 2312-80
based font products that propagate this ordering error.
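
	One way a developer might test a mapping table or font for this
ordering error is sketched below in Python. It assumes that Python's
"gb2312" codec reflects the corrected order, and that the two
uppercase Cyrillic letters should appear in ascending Unicode order.
The function names, and the idea of passing in a lookup function for
the table under test, are hypothetical:

```python
def gb2312_char(row, cell):
    """Look up a GB 2312-80 character by row-cell via the EUC bytes."""
    return bytes([row + 0xA0, cell + 0xA0]).decode("gb2312")


def cyrillic_order_ok(lookup=gb2312_char):
    """Return True if 07-22 sorts before 07-23 in the given table."""
    return ord(lookup(7, 22)) < ord(lookup(7, 23))


def swapped_lookup(row, cell):
    """Simulate a faulty table in which 07-22 and 07-23 are exchanged."""
    if row == 7 and cell in (22, 23):
        cell = 45 - cell  # 22 becomes 23, and 23 becomes 22
    return gb2312_char(row, cell)


print(cyrillic_order_ok())
print(cyrillic_order_ok(swapped_lookup))
```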


2.2.4: GB 7589-87

	This character set enumerates 7,237 hanzi in Rows 16 through
92 (last is 92-93), and they are ordered by radical then total number
of (remaining) strokes. GB 7589-87 is also known as the "Second
Supplementary Set" or GB2.


2.2.5: GB 7590-87

	This character set enumerates 7,039 hanzi in Rows 16 through
90 (last is 90-83), and they are ordered by radical then total number
of (remaining) strokes. GB 7590-87 is also known as the "Fourth
Supplementary Set" or GB4.


2.2.6: GB 8565.2-88

	This standard makes additions to GB 2312-80 (these additions
are separate from those made in GB 6345.1-86 described in Section
2.2.3). GB 8565.2-88 is also known as GB8. In this case there are 705
additions, indicated as follows:

o Row 13 contains 50 hanzi from GB 7589-87 (last is 13-50)
o Row 14 contains 92 hanzi from GB 7590-87 (last is 14-92)
o Row 15 contains 69 non-hanzi indicating dates and times, plus 24
  miscellaneous hanzi (for personal/place names and radicals; last is
  15-93).
o Rows 90 through 94 contain 470 hanzi from GB 7589-87 (94 each)

GB 8565.2-88 therefore provides a total of 8,150 characters (7,445
plus 705).
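
	The totals above can be verified with a few lines of Python. This
is only a worked check of the arithmetic, using the figures from the
list above; the variable names are illustrative:

```python
additions = {
    "Row 13 (hanzi from GB 7589-87)": 50,
    "Row 14 (hanzi from GB 7590-87)": 92,
    "Row 15 (non-hanzi for dates and times)": 69,
    "Row 15 (miscellaneous hanzi)": 24,
    "Rows 90 through 94 (hanzi from GB 7589-87)": 5 * 94,
}

added = sum(additions.values())
total = 7445 + added      # GB 2312-80 plus the additions
row_15 = 69 + 24          # should match "last is 15-93"

print(added, total, row_15)
```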


2.2.7: GB/T 12345-90

	This character set is nearly identical to GB 2312-80 (see
Section 2.2.2) in terms of the number and arrangement of characters,
but simplified hanzi are replaced by their traditional versions. GB/T
12345-90 is also known as the "Supplementary Set" or GB1.
	The following are some interesting facts about this character
set (some instances of simplified/traditional pairs that appear below
are actually character form differences):

o 29 vertical-use characters (punctuation and parentheses) included in
  Row 6 (06-57 through 06-85).

o 2,118 traditional hanzi replace simplified hanzi in Rows 16 through
  87. The "G1-Unique" appendix of the unofficial version (supplied to
  the CJK-JRG for Han Unification purposes) is missing the following
  four (specifies only 2,114):

  0x5B3B    0x6D2F
  0x5E7C    0x6F71

  But, ISO 10646-1:1993 ended up getting these hanzi included anyway,
  with correct mappings.

o Four simplified/traditional hanzi pairs (eight affected code points)
  in rows 16 through 87 are swapped:

  0x3A73 <-> 0x6161
  0x5577 <-> 0x6167
  0x5360 <-> 0x6245 (see the next bullet)
  0x4334 <-> 0x7761

o One hanzi (0x6245), after being swapped, had its left-side radical
  unsimplified (this character, now at 0x5360, is considered part of
  the 2,118 traditional hanzi from the second bullet):

  0x6245 -> 0x5360

o 103 hanzi included in Rows 88 (94 characters) and 89 (9 characters;
  89-01 through 89-09). These are all related to characters between
  Rows 16 and 87.

  - 41 simplified hanzi from Rows 16 through 87 moved to Rows 88 and
    89 (traditional hanzi are now at the original code points):

    0x3327 -> 0x7827  0x3E5D -> 0x7846  0x4B49 -> 0x7869
    0x3365 -> 0x7828  0x3F64 -> 0x7849  0x4C28 -> 0x786B
    0x3373 -> 0x7829  0x402F -> 0x784B  0x4D3F -> 0x786F
    0x3533 -> 0x782C  0x4030 -> 0x784C  0x4D72 -> 0x7871
    0x356D -> 0x782D  0x406F -> 0x784E  0x5236 -> 0x7878
    0x3637 -> 0x782F  0x4131 -> 0x7850  0x5374 -> 0x7879
    0x3736 -> 0x7832  0x463B -> 0x785C  0x5438 -> 0x787C
    0x3761 -> 0x7833  0x463E -> 0x785D  0x5446 -> 0x787D
    0x3849 -> 0x7835  0x464B -> 0x785E  0x5622 -> 0x7921
    0x3963 -> 0x7838  0x464D -> 0x785F  0x563B -> 0x7923
    0x3B2E -> 0x783B  0x4653 -> 0x7860  0x5656 -> 0x7926
    0x3C38 -> 0x7840  0x4837 -> 0x7866  0x567E -> 0x7928
    0x3C5B -> 0x7842  0x4961 -> 0x7867  0x573C -> 0x7929
    0x3C76 -> 0x7843  0x4A75 -> 0x7868

  - 62 hanzi added to Rows 88 and 89 (the gaps from the above are
    filled in). These were mostly to account for multiple traditional
    hanzi collapsing into a single simplified form.

  - The following code point mappings illustrate how all of these 103
    hanzi are related to hanzi between Rows 16 and 87 (note how many
    of these 103 hanzi map to a single code point):

    0x7821 -> 0x305A  0x7844 -> 0x3D2A  0x7867 -> 0x4961
    0x7822 -> 0x3065  0x7845 -> 0x3E21  0x7868 -> 0x4A75
    0x7823 -> 0x316D  0x7846 -> 0x3E5D  0x7869 -> 0x4B49
    0x7824 -> 0x3170  0x7847 -> 0x3E6D  0x786A -> 0x4B55
    0x7825 -> 0x3237  0x7848 -> 0x3F4B  0x786B -> 0x4C28
    0x7826 -> 0x3245  0x7849 -> 0x3F64  0x786C -> 0x4C28
    0x7827 -> 0x3327  0x784A -> 0x4027  0x786D -> 0x4C28
    0x7828 -> 0x3365  0x784B -> 0x402F  0x786E -> 0x4C33
    0x7829 -> 0x3373  0x784C -> 0x4030  0x786F -> 0x4D3F
    0x782A -> 0x3376  0x784D -> 0x405B  0x7870 -> 0x4D45
    0x782B -> 0x3531  0x784E -> 0x406F  0x7871 -> 0x4D72
    0x782C -> 0x3533  0x784F -> 0x407A  0x7872 -> 0x4F35
    0x782D -> 0x356D  0x7850 -> 0x4131  0x7873 -> 0x4F35
    0x782E -> 0x362C  0x7851 -> 0x414B  0x7874 -> 0x4F4C
    0x782F -> 0x3637  0x7852 -> 0x4231  0x7875 -> 0x4F72
    0x7830 -> 0x3671  0x7853 -> 0x425E  0x7876 -> 0x506B
    0x7831 -> 0x3722  0x7854 -> 0x4339  0x7877 -> 0x5229
    0x7832 -> 0x3736  0x7855 -> 0x4349  0x7878 -> 0x5236
    0x7833 -> 0x3761  0x7856 -> 0x4349  0x7879 -> 0x5374
    0x7834 -> 0x3834  0x7857 -> 0x4349  0x787A -> 0x5379
    0x7835 -> 0x3849  0x7858 -> 0x4356  0x787B -> 0x5375
    0x7836 -> 0x3948  0x7859 -> 0x4366  0x787C -> 0x5438
    0x7837 -> 0x394E  0x785A -> 0x436F  0x787D -> 0x5446
    0x7838 -> 0x3963  0x785B -> 0x3159  0x787E -> 0x5460
    0x7839 -> 0x6358  0x785C -> 0x463B  0x7921 -> 0x5622
    0x783A -> 0x3A7A  0x785D -> 0x463E  0x7922 -> 0x563B
    0x783B -> 0x3B2E  0x785E -> 0x464B  0x7923 -> 0x563B
    0x783C -> 0x3B58  0x785F -> 0x464D  0x7924 -> 0x5642
    0x783D -> 0x3B63  0x7860 -> 0x4653  0x7925 -> 0x5646
    0x783E -> 0x3B71  0x7861 -> 0x4727  0x7926 -> 0x5656
    0x783F -> 0x3C22  0x7862 -> 0x4729  0x7927 -> 0x566C
    0x7840 -> 0x3C38  0x7863 -> 0x4F4B  0x7928 -> 0x567E
    0x7841 -> 0x3C52  0x7864 -> 0x476F  0x7929 -> 0x573C
    0x7842 -> 0x3C5B  0x7865 -> 0x477A
    0x7843 -> 0x3C76  0x7866 -> 0x4837
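
	The row-cell notation (such as 89-01) and the hexadecimal code
points in the tables above are two views of the same value. The
following Python sketch shows the conversion. The function names are
hypothetical, and it assumes the usual convention for 94 x 94
character sets, namely that each byte is the row or cell number plus
0x20:

```python
def to_row_cell(code):
    """Split a two-byte 7-bit code point into (row, cell)."""
    hi, lo = code >> 8, code & 0xFF
    if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
        raise ValueError("not a 94 x 94 code point: 0x%04X" % code)
    return hi - 0x20, lo - 0x20

def from_row_cell(row, cell):
    """Build the two-byte code point for a row and cell (1 through 94)."""
    if not (1 <= row <= 94 and 1 <= cell <= 94):
        raise ValueError("row and cell must be 1 through 94")
    return ((row + 0x20) << 8) | (cell + 0x20)

# First and last code points of the mapping table above.
print("%02d-%02d" % to_row_cell(0x7821))
print("%02d-%02d" % to_row_cell(0x7929))
```

Row 88 is completely filled (94 hanzi) and Row 89 holds the remaining
9, which accounts for the 103 hanzi in the table.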

So, if we total everything up, we see that GB/T 12345-90 has 2,180
hanzi (2,118 are replacements for GB 2312-80 code points, and 62 are
additional) and 29 non-hanzi not found in GB 2312-80.
	Note that the printing of GB/T 12345-90 has some
character-form errors. The errors I am aware of are as follows:

  Code Point  Description of Error
  ^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^
  0x4125      The upper-left element should be "tree" instead of
              "warrior"
  0x596C      The "bird" radical should not include the "fire" element


2.2.8: GB/T 13131-9X

	This character set is identical to GB 7589-87 (see Section
2.2.4) in terms of number of characters, but simplified hanzi are
replaced by their traditional versions. The exact number of such
substitutions is currently unknown to this author. GB/T 13131-9X is
also known as the "Third Supplementary Set" or GB3.


2.2.9: GB/T 13132-9X

	This character set is identical to GB 7590-87 (see Section
2.2.5) in terms of number of characters, but simplified hanzi are
replaced by their traditional versions. The exact number of such
substitutions is currently unknown to this author. GB/T 13132-9X is
also known as the "Fifth Supplementary Set" or GB5.


2.2.10: GB 13000.1-93

	This document is, for all practical purposes, the Chinese
translation of ISO 10646-1:1993 (see Section 2.5.1).


2.2.11: ISO-IR-165:1992

	This standard, also known as the CCITT Chinese Set, is a
variant of GB 2312-80 with the following characteristics:

o GB 6345.1-86 modifications (including the undocumented one) and
  additions, namely 132 characters (see Section 2.2.3)
o GB 8565.2-88 additions, namely 705 characters (see Section 2.2.6)
o Row 6 contains 22 background (shading) characters (06-60 through
  06-81)
o Row 12 contains 94 hanzi
o Row 13 contains 44 additional hanzi (13-51 through 13-94; fills the
  row)
o Row 15 contains 1 additional hanzi (15-94)

ISO-IR-165:1992 can therefore be considered a superset of GB 2312-80,
GB 6345.1-86, and GB 8565.2-88. This means 8,443 total characters
compared to the 7,445 in GB 2312-80, 7,577 in GB 6345.1-86, and the
8,150 in GB 8565.2-88.


2.2.12: OBSOLETE STANDARDS

	Most GB standards seem to be revised through other documents,
so it is hard to point to a standard and claim that it is obsolete.
The only revision I am aware of is GB 1988-89 (the original was
named GB 1988-80).


2.3: CHINESE (TAIWAN)

	The sections below describe two major Taiwanese character
sets, namely Big Five and CNS 11643-1992. As you will learn, they are
somewhat compatible. CCCII, also developed in Taiwan, is described in
Section 2.5.2.


2.3.1: BIG FIVE

	The Big Five character set is composed of 94 rows of 157
characters each (the 157 characters of each row are encoded in an
initial group of 63 codes followed by the remaining 94 codes). The
following is a break-down of its contents:

o Row 1: 157 symbols
o Row 2: 157 symbols
o Row 3: 94 symbols
o Rows 4 through 38: 5,401 hanzi (Level 1 Hanzi; last is 38-63)
o Rows 41 through 89: 7,652 hanzi (Level 2 Hanzi; last is 89-116)

This forms what I consider to be the basic Big Five set. However, two
of the hanzi in Level 2 are duplicates, so there are really only
7,650 unique hanzi in Level 2.
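	The row structure described above (63 codes followed by 94
codes) can be sketched in Python as follows. This is only an
illustration: the function names are hypothetical, and the byte ranges
(lead byte 0xA0 plus the row number, trail bytes 0x40 through 0x7E and
0xA1 through 0xFE) are an assumption taken from the usual description
of Big Five encoding, not from this section:

```python
def big5_code(row, cell):
    """Map a Big Five row (1-94) and cell (1-157) to a two-byte code."""
    if not (1 <= row <= 94 and 1 <= cell <= 157):
        raise ValueError("row or cell out of range")
    lead = 0xA0 + row
    # Assumed layout: the first 63 cells use trail bytes 0x40-0x7E,
    # and the remaining 94 cells use trail bytes 0xA1-0xFE.
    trail = 0x40 + cell - 1 if cell <= 63 else 0xA1 + cell - 64
    return (lead << 8) | trail

def count_through(first_row, last_row, last_cell, width=157):
    """Count characters from first_row, cell 1, up to last_row-last_cell."""
    return (last_row - first_row) * width + last_cell

print(count_through(4, 38, 63))     # Level 1 hanzi
print(count_through(41, 89, 116))   # Level 2 hanzi
print("0x%04X" % big5_code(38, 63))
```

The two counts agree with the 5,401 and 7,652 hanzi given in the
break-down, assuming that both blocks are filled without gaps.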
	There are two major extensions to Big Five. The first really
has no name, and can be considered part of the basic Big Five set as
specified above. It adds the following characters:

o Rows 38-39: 4 Japanese iteration marks, 83 hiragana, 86 katakana, 66
  uppercase and lowercase Cyrillic (Russian) alphabet, 10 circled
  digits, and 10 parenthesized digits

	The other extension was developed by a company called ETen
Information System in Taiwan, and is actually considered to be the
most widely used version of Big Five. It provides the following
extensions to Big Five (different from the above extension):

o Rows 38-40: 10 circled digits, 10 parenthesized digits, 10 lowercase
  Roman numerals, 25 classical radicals, 15 Japanese-specific symbols,
  83 hiragana, 86 katakana, 66 uppercase and lowercase Cyrillic
  (Russian) alphabet, 3 arrows, 10 radical-like hanzi elements, 40
  fraction-like digits, and 7 symbols
o Row 89: 7 hanzi, 33 double-lined line-drawing elements, and a black
  box

	It is *very* important to note that while these two extensions
have many common portions (in particular, hiragana, katakana, the
Cyrillic alphabet, and so on), they do not share the same code points
for such characters.
	Appendix G (pp 407-450) of "Developing International Software
for Windows 95 and Windows NT" by Nadine Kano illustrates the Big Five
character set standard by Big Five code (Microsoft calls this Code
Page 950). Code Page 950 incorporates some of the ETen extensions,
namely those in Row 89.


2.3.2: CNS 11643-1992

	CNS 11643-1992 (also known as CNS 11643 X 5012), by
definition, consists of 16 planes of characters, seven of which have
character assignments. Each plane is a 94-row-by-94-cell matrix
capable of holding a total of 8,836 characters. CNS stands for
"Chinese National Standard."
	CNS 11643-1992 specifies characters only in the first seven
planes. A break-down of characters, by plane, is as follows:

o Plane 1:
  - 438 symbols in Rows 1 through 6
  - 213 classical radicals in Rows 7 through 9
  - 33 graphic representations of control characters in Row 34
  - 5,401 hanzi in Rows 36 through 93 (last is 93-43)
o Plane 2: 7,650 hanzi in Rows 1 through 82 (last is 82-36)
o Plane 3: 6,148 hanzi in Rows 1 through 66 (last is 66-38)
o Plane 4: 7,298 hanzi in Rows 1 through 78 (last is 78-60)
o Plane 5: 8,603 hanzi in Rows 1 through 92 (last is 92-49)
o Plane 6: 6,388 hanzi in Rows 1 through 68 (last is 68-90)
o Plane 7: 6,539 hanzi in Rows 1 through 70 (last is 70-53)

The total number of characters in CNS 11643-1992 is a staggering
48,711 characters, 48,027 of which are hanzi. Also note that the
number of hanzi in Plane 1 is identical to the number of Level 1 hanzi
in Big Five (see Section 2.3.1). The 2 extra Level 2 hanzi in Big Five are
actually redundant, and are therefore not in CNS 11643-1992 Plane 2.
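	The per-plane figures and the two totals can be checked from
the "last is" positions alone. The following Python sketch does so,
under the assumption that every hanzi block is filled contiguously
from the first cell of its first row (the names are hypothetical):

```python
# (first row, last row, last cell) of the hanzi block in each plane,
# copied from the break-down above.
HANZI_BLOCKS = {
    1: (36, 93, 43),
    2: (1, 82, 36),
    3: (1, 66, 38),
    4: (1, 78, 60),
    5: (1, 92, 49),
    6: (1, 68, 90),
    7: (1, 70, 53),
}
# Symbols, classical radicals, and control-character pictures in Plane 1.
NON_HANZI_PLANE_1 = 438 + 213 + 33

def block_size(first_row, last_row, last_cell, cells_per_row=94):
    """Characters from first_row, cell 1, through last_row-last_cell."""
    return (last_row - first_row) * cells_per_row + last_cell

hanzi = {plane: block_size(*span) for plane, span in HANZI_BLOCKS.items()}
total_hanzi = sum(hanzi.values())
print(hanzi)
print(total_hanzi, total_hanzi + NON_HANZI_PLANE_1)
```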
	It is rumored that Plane 8 is currently being defined, and
will add yet more hanzi to this standard.


2.3.3: CNS 5205

	This character set is Taiwan's analog to ASCII and ISO 646,
and is reportedly rarely used. How it differs from ASCII, if at all,
is unknown to this author.


2.3.4: OBSOLETE STANDARDS

	CNS 11643-1986 specified characters only in the first three
planes, as described in Section 2.3.2. Also, Plane 3 of CNS 11643-1992
was called Plane 14 of CNS 11643-1986.


2.4: KOREAN

	The sections below describe the most current Korean character
sets, namely KS C 5636-1993, KS C 5601-1992, KS C 5657-1991, and KS C
5700-1995. "KS" stands for "Korean Standard."


2.4.1: KS C 5636-1993

	This character set (published on January 6, 1993), formerly KS
C 5636-1989 (published on April 22, 1989) and sometimes referred to as
KS-Roman, is the Korean analog to ASCII and ISO 646-1991. The primary
difference is that the ASCII backslash (0x5C) is represented as a Won
symbol.


2.4.2: KS C 5601-1992

	This basic Korean character set standard enumerates 8,224
characters, 4,888 of which are hanja, and 2,350 of which are pre-
combined hangul. The hanja and hangul blocks are arranged by reading.
The following is a break-down of its contents:

o Row 1: 94 symbols
o Row 2: 69 abbreviations and symbols
o Row 3: 94 full-width KS C 5636-1993 characters (see Section 2.4.1)
o Row 4: 94 hangul elements
o Row 5: 68 lowercase and uppercase Roman numerals and lowercase and
  uppercase Greek alphabet
o Row 6: 68 line-drawing elements
o Row 7: 79 abbreviations
o Row 8: 91 phonetic symbols, circled characters, and fractions
o Row 9: 94 phonetic symbols, parenthesized characters, subscripts,
  and superscripts
o Row 10: 83 hiragana
o Row 11: 86 katakana
o Row 12: 66 lowercase and uppercase Cyrillic (Russian) alphabet
o Rows 16 through 40: 2,350 pre-combined hangul (last is 40-94)
o Rows 42 through 93: 4,888 hanja (last is 93-94)

Rows 41 and 94 are designated for user-defined characters.
	There are many similarities with JIS X 0208-1990 and GB
2312-80, such as hiragana, katakana, Greek, and Cyrillic characters,
but they are assigned to different rows.
	There is an interesting note about the hanja block (Rows 42
through 93). Although there are 4,888 hanja, not all are unique. The
hanja block is arranged by reading, and in those cases when a hanja
has more than one reading, that hanja is duplicated (sometimes more
than once) in the same character set. There are 268 such cases of
duplicate hanja in KS C 5601-1992, meaning that it contains 4,620
unique hanja. If you have a copy of the KS C 5601-1992 manual handy,
you can compare the following four code points:

  0x6445
  0x5162
  0x5525
  0x6879

While most of these cases involve two hanja instances, there are four
hanja that have three instances, and one (listed above) that has four!
This is the only CJK character set that has this property of
intentionally duplicating Chinese characters. See Section 4.4 for more
details.
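	If you do not have the manual handy, the four code points
listed above can also be compared by machine. The following is a
minimal Python sketch, not a definitive method: it assumes Python 3
and its bundled "euc_kr" codec, and it relies on Unicode (which this
section does not cover) keeping the duplicate hanja as separate
compatibility characters that normalize to one unified character. The
function name is hypothetical:

```python
import unicodedata

CODE_POINTS = [0x6445, 0x5162, 0x5525, 0x6879]

def ks_to_unicode(code):
    """Decode one KS C 5601 code point by way of its EUC-KR form."""
    euc = bytes([(code >> 8) | 0x80, (code & 0xFF) | 0x80])
    return euc.decode("euc_kr")

for code in CODE_POINTS:
    ch = ks_to_unicode(code)
    # NFC replaces a compatibility ideograph with its unified form.
    unified = unicodedata.normalize("NFC", ch)
    print("0x%04X  U+%04X  unified as U+%04X" % (code, ord(ch), ord(unified)))
```

If the codec behaves as assumed, the middle column shows four
different values while the last column shows the same value four
times.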
	Annex 3 of this standard defines the complete set of 11,172
pre-combined hangul characters, also known as Johab. Johab refers to
the encoding method, and is almost like encoding all possible three-
letter words (meaning that most are nonsense). See Section 3.3.5 for
more details on Johab encoding.
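	The figure of 11,172 is simply the number of possible
combinations of hangul elements. The following Python sketch shows the
arithmetic. The element counts (19 initial consonants, 21 vowels, and
27 final consonants plus the "no final" case) are the standard modern
hangul inventory and are an assumption here, since this section does
not state them. The index function is hypothetical and illustrates
only the combinatorics, not the Johab byte layout:

```python
INITIALS, MEDIALS, FINALS = 19, 21, 27

def precombined_count(initials=INITIALS, medials=MEDIALS, finals=FINALS):
    """Every initial, with every medial, with every final or no final."""
    return initials * medials * (finals + 1)

def syllable_index(initial, medial, final=0):
    """Ordinal of a combination; final 0 stands for "no final"."""
    return (initial * MEDIALS + medial) * (FINALS + 1) + final

print(precombined_count())
print(syllable_index(INITIALS - 1, MEDIALS - 1, FINALS))
```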


2.4.3: KS C 5657-1991

	This character set standard provides supplemental characters
for Korean writing, to include symbols, pre-combined hangul, and
hanja. The following is a break-down of its contents:

o Rows 1 through 7: 613 lowercase and uppercase Latin characters with
  diacritics (see note below)
o Rows 8 through 10: 273 lowercase and uppercase Greek characters with
  diacritics
o Rows 11 through 13: 275 symbols
o Row 14: 27 compound hangul elements
o Rows 16 through 36: 1,930 pre-combined hangul (last is 36-50)
o Rows 37 through 54: 1,675 pre-combined hangul (last is 54-77; see
  note below)
o Rows 55 through 85: 2,856 hanja (last is 85-36)

The KS C 5657-1991 manual has a possible error (or at least an
inconsistency) for Rows 1 through 7. The manual says that there are
615 characters in that range, but I only counted 613. The difference
can be found on page 19 as the following two characters:

  Character Code  Character
  ^^^^^^^^^^^^^^  ^^^^^^^^^
  0x2137          X
  0x217A          TM

An "X" doesn't belong there (it is already in KS C 5601-1992 at code
point 0x2358), and the trademark symbol is also part of KS C 5601-1992
at code point 0x2262. This is why I feel that my count of 613 is more
accurate than what is explicitly stated in the manual on page 2.
	Also, page 2 of the manual says that Rows 37 through 54
contain 1,677 pre-combined hangul, but I only counted 1,675 (17 rows
of 94 characters plus a final row with 77 characters -- do the math
for yourself).
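	For those who would rather let the computer do the math, a
short Python sketch follows (the function name is hypothetical, and it
assumes that the rows are filled without gaps):

```python
def rows_total(full_rows, last_row_cells, cells_per_row=94):
    """Characters in a run of full rows plus one partly filled row."""
    return full_rows * cells_per_row + last_row_cells

counted = rows_total(17, 77)        # Rows 37 through 53, then 54-01..54-77
print(counted, 1677 - counted)      # my count, and the manual's excess
```

The same function reproduces the other two blocks in the break-down:
rows_total(20, 50) for Rows 16 through 36, and rows_total(30, 36) for
the hanja in Rows 55 through 85.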
	Here's another interesting note. My official copy of this
standard has all of its 2,856 hanja hand-written.


2.4.4: GB 12052-89

	You may be asking yourself why a GB standard is listed under
the Korean section of this document. Well, there is a rather large
Korean population in China (Korea was considered part of China before
the 1890s), and they need a character set standard for communicating
using hangul. GB 12052-89 is a Korean character set standard
established by China (PRC), and enumerates a total of 5,979
characters.
	The following is the arrangement of this character set:

o Row 1: 94 symbols
o Row 2: 72 numerals
o Row 3: 94 full-width ASCII characters
o Row 4: 83 hiragana
o Row 5: 86 katakana
o Row 6: 48 uppercase and lowercase Greek alphabet
o Row 7: 66 uppercase and lowercase Cyrillic (Russian) alphabet
o Row 8: 26 Pinyin and 37 Bopomofo characters
o Row 9: 76 line-drawing elements (09-04 through 09-79)
o Rows 16 through 37: 2,068 pre-combined hangul (Level 1 Hangul, Part
  1; last is 37-94)
o Rows 38 through 52: 1,356 pre-combined hangul (Level 1 Hangul, Part
  2; last is 52-40)
o Rows 53 through 71: 1,779 pre-combined hangul (Level 2 Hangul; last
  is 71-87)
o Rows 71 through 72: 94 "Idu" hanja (71-89 through 72-88)

	There are a few interesting notes I can make about this
character set:

o Rows 1 through 9 are identical to the same rows in GB 2312-80,
  except that 03-04 is a dollar sign, not a Chinese Yuan (currency)
  symbol.

o The GB 12052-89 manual states on pp 1 and 3 that Rows 53 through 72
  contain 1,876 characters, but I only counted 1,873 (1,779 hangul
  plus 94 hanja).

o The total number of characters, 5,979, is correctly stated in the
  manual although the hangul count is incorrect.

o The arrangement and ordering of these hangul bear no relationship to
  that of KS C 5601-1992. Both standards order by reading, which is
  the only way in which they are similar.
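
	The counts mentioned in the notes above can be reproduced by
adding up the break-down. The following Python sketch is only a
cross-check of the figures in this section (the names are
hypothetical):

```python
BREAKDOWN = [
    ("Row 1 symbols", 94),
    ("Row 2 numerals", 72),
    ("Row 3 full-width ASCII", 94),
    ("Row 4 hiragana", 83),
    ("Row 5 katakana", 86),
    ("Row 6 Greek", 48),
    ("Row 7 Cyrillic", 66),
    ("Row 8 Pinyin and Bopomofo", 26 + 37),
    ("Row 9 line-drawing elements", 76),
    ("Level 1 Hangul, Part 1", 2068),
    ("Level 1 Hangul, Part 2", 1356),
    ("Level 2 Hangul", 1779),
    ("Idu hanja", 94),
]

counts = dict(BREAKDOWN)
total = sum(counts.values())
rows_53_to_72 = counts["Level 2 Hangul"] + counts["Idu hanja"]
print(total, rows_53_to_72)
```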

	I do not know to what extent this character set is being
used (and who might be using it).


2.4.5: KS C 5700-1995

	Korea has developed a new character set standard called KS C
5700-1995. It is equivalent to ISO 10646-1:1993, but has pre-combined
hangul as provided (and ordered) in Unicode Version 2.0 (meaning that
all 11,172 hangul are in a contiguous block).


2.4.6: OBSOLETE STANDARDS

	KS C 5601-1986, KS C 5601-1987, and KS C 5601-1989 are
identical, character-set wise, to KS C 5601-1992. The 1992 edition
provides more material in the form of annexes. KS C 5601-1982, the
original version, enumerated only the 51 basic hangul elements in a
one-byte 7- and 8-bit encoding. This information is still part of KS C
5601-1992, but in Annex 4.
	There were two earlier multiple-byte standards called KS C
5619-1982 and KIPS. KS C 5619-1982 enumerated 51 hangul elements,
1,316 pre-combined hangul, and 1,672 hanja. KIPS (Korean Information
Processing System) enumerated 2,058 pre-combined hangul and 2,392
hanja. Both have been rendered obsolete by KS C 5601-1987.


2.5: CJK

	The only true CJK character sets available today are CCCII,
ANSI Z39.64-1989 (also known as EACC or REACC), and ISO 10646-1:1993.
ISO 10646-1:1993 is unique in that it goes beyond CJK (Chinese
characters) to provide virtually all commonly-used alphabetic scripts.
	Of these three, only ISO 10646-1:1993 is expected to gain
wide-spread acceptance. CCCII and ANSI Z39.64-1989 are still used
today, but primarily for bibliographic purposes.


2.5.1: ISO 10646-1:1993

	Published by ISO (International Organization for
Standardization) in Switzerland, this character set enumerates over
34,000 characters. Its I-zone ("I" stands for "Ideograph") enumerates
approximately 21,000 Chinese characters, which is the result of a
massive effort by the CJK-JRG (CJK Joint Research Group) called "Han
Unification." The CJK-JRG is now called the IRG (Ideographic
Rapporteur Group), and is off doing additional research for future
Chinese character allocations to ISO 10646-1:1993.
	The Basic Multilingual Plane (BMP) of ISO 10646-1:1993 is
equivalent to Unicode. While Unicode is comprised of a single plane of
characters (which doesn't allow much room for future expansion), ISO
10646-1:1993 contains hundreds of such planes.
	One very nice feature of this standard's manual is the set of
CJK code correspondence tables in Section 26 (pp 262-698). Four columns
are provided for each ISO 10646-1:1993 I-zone code point -- simplified
Chinese, traditional Chinese, Japanese, and Korean. If the ISO
10646-1:1993 Chinese character maps to one of these locales, the
hexadecimal character code, (decimal) row-cell value, and glyph for
that locale are provided. The corresponding tables in Volume 2 of "The
Unicode Standard" provide character codes (sometimes the hexadecimal
character code, and sometimes the row-cell value) and a single
glyph. Quite unfortunate. I hear that a new edition of "The Unicode
Standard" is about to be released. I hope that this problem has been
addressed.
	ISO 10646-1:1993 does not replace existing national character
set standards. It simply provides a single character set that is a
superset of *most* national character sets. For example, only a
fraction of the 48,027 hanzi in CNS 11643-1992 are included in ISO
10646-1:1993. I feel that it is best to think of ISO 10646-1:1993 as
"just another character set." My philosophy is to support as many
character sets and encodings as possible.
	A note about ordering this standard. If you order through ANSI
in the United States, try to get an original manual. It is not easy,
though. You see, ANSI has duplication rights for ISO documents.
Photocopying Section 26 (pp 262-698) doesn't do the Chinese characters
much justice, and some characters become hard-to-read. Unfortunately,
there is no way to indicate that you want an original ISO document
through ANSI's ordering process, so some post-ordering haggling may
become necessary.
	More information on ISO 10646-1:1993 can be found at the
following URL:

  http://www.unicode.org/

	Japan, China (PRC), and Korea have developed their own
national standards that are based on ISO 10646-1:1993. They are
designated as JIS X 0221-1995 (see Section 2.1.4), GB 13000.1-93 (see
Section 2.2.10), and KS C 5700-1995 (see Section 2.4.5), respectively.
	Note that these national standards are aligned with
different versions of Unicode:

  Unicode Version 1.0
  Unicode Version 1.1 <-> ISO 10646-1:1993, JIS X 0221-1995, GB 13000.1-93
  Unicode Version 2.0 <-> KS C 5700-1995

One of the major changes made for Unicode Version 2.0 is the inclusion
of all 11,172 hangul. Version 1.1 has 6,656 hangul.


2.5.2: CCCII

	The Chinese Character Analysis Group in Taiwan developed CCCII
(Chinese Character Code for Information Interchange) in the 1980s.
This character set is composed of 94 planes that have 94 rows and 94
cells (94 x 94 x 94 = 830,584 characters). Furthermore, every six
planes constitute a "layer" (6 x 94 x 94 = 53,016 characters). The
following is the contents of each of the 16 layers (the 16th layer
contains only four planes):

o Layer 1: Symbols and Traditional Chinese characters
o Layer 2: Simplified Chinese characters from PRC
o Layers 3 through 12: Variant Chinese character forms
o Layer 13: Japanese kana and kokuji (Japanese-made kanji)
o Layer 14: Korean hangul
o Layer 15: Reserved
o Layer 16: Miscellaneous characters (Japanese and Korean)

	Layers 1 through 12 have a special meaning and relationship.
The same code point in these layers is designed to hold the same
character, but with different forms. Layer 1 code points contain the
traditional character forms, Layer 2 code points contain the
simplified character forms (if any), and Layers 3 through 12 contain
variant character forms (if any). For example, given a Chinese
character with three forms, its encoding and arrangement may be as
follows:

  Character Form  Code Point  Layer
  ^^^^^^^^^^^^^^  ^^^^^^^^^^  ^^^^^
  Traditional     0x224E41    1
  Simplified      0x284E41    2
  Variant         0x2E4E41    3

Note how the second and third bytes (0x4E41) are identical in all
three instances -- only the first byte's value, which indicates the
layer, differs. Needless to say, this method of arrangement provides
easy access to related Chinese character forms. No wonder it is used
for bibliographic purposes.
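	The relationship between the first byte and the layer can be
sketched in Python as follows. The function names are hypothetical,
and the sketch assumes that the first byte is the plane number plus
0x20 (which is consistent with the three code points in the table
above, but is not stated outright in this section):

```python
PLANES_PER_LAYER = 6

def layer_of(code):
    """Return the layer (1 through 16) of a three-byte CCCII code point."""
    plane = (code >> 16) - 0x20            # plane number, 1 through 94
    if not 1 <= plane <= 94:
        raise ValueError("first byte is outside 0x21-0x7E")
    return (plane - 1) // PLANES_PER_LAYER + 1

def related_form(code, layer):
    """Return the code point for the same character in another layer."""
    shift = (layer - layer_of(code)) * PLANES_PER_LAYER
    return code + (shift << 16)

for code in (0x224E41, 0x284E41, 0x2E4E41):
    print("0x%06X is in Layer %d" % (code, layer_of(code)))
print("0x%06X" % related_form(0x224E41, 3))
```

Moving between related forms changes only the first byte, in steps of
six planes. related_form is meaningful only for Layers 1 through 12,
where this relationship holds.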
	The first layer is composed as follows:

o Plane 1/Row 2: 56 mathematical symbols
o Plane 1/Row 3: The ASCII character set
o Plane 1/Row 11: 35 Chinese punctuation marks
o Plane 1/Rows 12 through 14: 214 classical radicals
o Plane 1/Row 15: 41 Chinese numerical symbols, 37 phonetic symbols,
  and 4 tone marks
o Plane 1/Rows 16 through 67: 4,808 common Chinese characters
o Plane 1/Row 68 through Plane 3/Row 64: 17,032 less common Chinese
  characters
o Plane 3/Row 65 through Plane 6/Row 5: 20,583 rare Chinese characters

Note that Row 1 of all planes is reserved, and never assigned
characters. Take this into account when studying the above table
ranges that span planes (that is, skip Row 1).
	In addition to the above, there are 11,517 simplified Chinese
characters in Layer 2 (3,625 are considered PRC simplified forms, and
the remaining 7,892 are regular simplified forms). This provides a
total of 53,940 Chinese characters.
	Further information on CCCII (to include very interesting
historical notes) can be found on pp 146-149 of John Clews' "Language
Automation Worldwide: The Development of Character Set Standards" and
Chapter 6 of Huang & Huang's "An Introduction to Chinese, Japanese,
and Korean Computing."


2.5.3: ANSI Z39.64-1989

	This national standard is designated as ANSI Z39.64-1989 and
named "East Asian Character Code" (EACC), but was originally known as
REACC (RLIN East Asian Character Code), that is, before it became a
national standard. RLIN stands for "Research Libraries Information
Network," which was developed by the Research Libraries Group (RLG)
located in Mountain View, California.
	RLG's Home Page is at the following URL:

  http://www.rlg.org/

	The structure of ANSI Z39.64-1989 is based on CCCII, but with
a few differences. Many consider it to be superior to and a
replacement for CCCII (see Section 2.5.2).
	The ANSI Z39.64-1989 standard is available through ANSI, but
you should be aware that it is distributed in the form of several
microfiche. Not a terribly useful storage medium these days. I had my
set transformed into tangible printed pages. You can also obtain this
standard through NISO (National Information Standards Organization)
Press Fulfillment. Their URL is:

  http://www.niso.org/

	EACC has been designated by the Library of Congress as a
character set for use in USMARC (United States MAchine-Readable
Cataloging) records, and is used extensively by East Asian libraries
across North America.
	EACC is also being used in Australia for the National CJK
Project. Check out the following URL for more details:

  http://www.nla.gov.au/1/asian/ncjk/cjkhome.html

	Further information on ANSI Z39.64-1989 (to include very
interesting historical notes) can be found on pp 150-156 of John
Clews' "Language Automation Worldwide: The Development of Character
Set Standards" (although a source at RLG tells me that some of Clews'
facts are wrong) and Chapter 6 of Huang & Huang's "An Introduction to
Chinese, Japanese, and Korean Computing."
	The authoritative paper on EACC is "RLIN East Asian Character
Code and the RLIN CJK Thesaurus" by Karen Smith Yoshimura and Alan
Tucker, published in "Proceedings of the Second Asian-Pacific
Conference on Library Science," May 20-24, 1985, Seoul, Korea.


2.6: OTHER

	This section includes character set standards that don't
properly fall under the above sections.


2.6.1: GB 8045-87

	GB 8045-87 is a Mongolian character set standard established
by China (PRC). This standard enumerates 94 Mongolian characters. Of
these 94 characters, 12 are punctuation (vertically-oriented), and the
remaining 82 are characters specific to the Mongolian script.
Mongolian is written vertically like Chinese.
	I do not discuss the encoding for GB 8045-87 in Part 3, so I
will do so here. The GB 8045-87 manual describes a 7- and 8-bit
encoding. The 7-bit encoding puts these 94 characters in the standard
ASCII printable range, namely 0x21 through 0x7E. Code point 0x20 is
marked as "MSP" which stands for "Mongolian space." The 8-bit encoding
puts these 94 characters in the range 0xA1 through 0xFE, with the
"MSP" character at code point 0xA0. The GB 1988-89 set is then encoded
in the range 0x21 through 0x7E.
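	The two encodings differ only in the high bit, as the
following Python sketch illustrates. The function names and the
labels returned are hypothetical, and the sketch is based solely on
the ranges given above:

```python
def to_8bit(code):
    """Map a 7-bit GB 8045-87 code (0x20-0x7E) to its 8-bit form."""
    if not 0x20 <= code <= 0x7E:
        raise ValueError("outside the 7-bit range: 0x%02X" % code)
    return code | 0x80

def classify_8bit(byte):
    """Say which set a byte belongs to in the 8-bit encoding."""
    if byte == 0xA0:
        return "MSP"
    if 0xA1 <= byte <= 0xFE:
        return "Mongolian"
    if 0x21 <= byte <= 0x7E:
        return "GB 1988-89"
    return "other"

print("0x%02X" % to_8bit(0x20), classify_8bit(to_8bit(0x20)))
print("0x%02X" % to_8bit(0x7E), classify_8bit(to_8bit(0x7E)))
```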


2.6.2: TCVN-5773:1993

	TCVN-5773:1993 (also called NSCII, which is short for Nom
Standard Code for Information Interchange) is the Vietnamese analog to
ISO 10646-1:1993, but adds 1,775 Vietnamese-specific Chinese
characters. These 1,775 characters are encoded in the range 0xA000
through 0xA6EE.
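	The range and the character count agree with each other, as
this small Python check shows (the names are hypothetical, and the
check assumes that the range is filled without gaps):

```python
NOM_FIRST, NOM_LAST = 0xA000, 0xA6EE

def is_vietnamese_specific(code):
    """True if a code point falls in the range given above."""
    return NOM_FIRST <= code <= NOM_LAST

print(NOM_LAST - NOM_FIRST + 1)
```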
	More information on TCVN-5773:1993 can be found at the
following URL:

  ftp://unicode.org/pub/MappingTables/EastAsiaMaps/

There are two files at the above URL that pertain to this standard.
The first is a README, and the second is a Macintosh HyperCard stack
(requires HyperCard):

  TCVN-NSCII.README
  TCVN-NSCIIstack_1.0.sea.hqx


PART 3: CJK ENCODING SYSTEMS

	These sections describe the various systems for encoding the
character set standards listed in Part 2. The first two described,
7-bit ISO 2022 and EUC, are not specific to a locale, and in some
cases not specific to CJK.
	The CJK Character Set Server at the following URL can generate
character sets based on encodings described in this section:

  http://jasper.ora.com/lunde/cjk-char.html

I suggest that you use this as a way to obtain files that illustrate
these encodings in action.
	But first, please take a peek at the following table, which is
an attempt to illustrate how two Chinese characters (that stand for
"kanji/hanzi/hanja") are encoded using the various methods presented
in the following sections (character codes as hexadecimal digits, and
escape sequences or shift sequences as printable characters):

o Japanese (JIS X 0208-1990 & JIS X 0201-1976):
  - 7-bit ISO 2022        <ESC> & @ <ESC> $ B 0x3441 0x3B7A <ESC> ( J
  - ISO-2022-JP           <ESC> $ B 0x3441 0x3B7A <ESC> ( J
  - EUC                   0xB4C1 0xBBFA
  - Shift-JIS             0x8ABF 0x8E9A

o Simplified Chinese (GB 2312-80 & GB 1988-89 or ASCII):
  - 7-bit ISO 2022        <ESC> $ A 0x3A3A 0x5756 <ESC> ( T
  - ISO-2022-CN           <ESC> $ ) A <SO> 0x3A3A 0x5756 <SI>
  - EUC                   0xBABA 0xD7D6
  - HZ (HZ-GB-2312)       ~{ 0x3A3A 0x5756 ~}
  - zW                    zW 0x3A3A 0x5756

o Traditional Chinese (CNS 11643-1992):
  - 7-bit ISO 2022        <ESC> $ ( G 0x6947 0x4773 <ESC> ( B
  - ISO-2022-CN           <ESC> $ ) G <SO> 0x6947 0x4773 <SI>
  - EUC                   0xE9C7 0xC7F3 or 0x8EA1E9C7 0x8EA1C7F3

o Traditional Chinese (Big Five):
  - Big Five              0xBA7E 0xA672

o Korean (KS C 5601-1992 & ASCII):
  - 7-bit ISO 2022        <ESC> $ ( C 0x7953 0x6D2E <ESC> ( B
  - ISO-2022-KR           <ESC> $ ) C <SO> 0x7953 0x6D2E <SI>
  - EUC                   0xF9D3 0xEDAE
  - Johab                 0xF7D3 0xF1AE

o CJK (ISO 10646-1:1993, JIS X 0221-1995, GB 13000.1-93, or KS C
  5700-1995):
  - UCS-2                 0x6F22 0x5B57
  - UCS-4                 0x00006F22 0x00005B57

The above should have given you a taste of what information the
following sections provide.
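	Several rows of the above table can be reproduced with the
codecs bundled with Python 3. The following is a sketch under that
assumption: the codec names are Python's own (not names used in this
document), the function names are hypothetical, and only encodings for
which I know of a standard Python codec are included. The simplified
Chinese rows use U+6C49 U+5B57, and all other rows use U+6F22 U+5B57:

```python
SAMPLES = [
    # (label, text, Python codec name, bytes expected from the table)
    ("Japanese EUC",       "\u6f22\u5b57", "euc_jp",    "B4C1BBFA"),
    ("Shift-JIS",          "\u6f22\u5b57", "shift_jis", "8ABF8E9A"),
    ("Simplified EUC",     "\u6c49\u5b57", "gb2312",    "BABAD7D6"),
    ("HZ",                 "\u6c49\u5b57", "hz",        "7E7B3A3A57567E7D"),
    ("Big Five",           "\u6f22\u5b57", "big5",      "BA7EA672"),
    ("Korean EUC",         "\u6f22\u5b57", "euc_kr",    "F9D3EDAE"),
    ("Johab",              "\u6f22\u5b57", "johab",     "F7D3F1AE"),
    ("UCS-2 (big-endian)", "\u6f22\u5b57", "utf_16_be", "6F225B57"),
]

def table_mismatches(samples=SAMPLES):
    """Return the labels whose encoded bytes differ from the table."""
    return [label for label, text, codec, expected in samples
            if text.encode(codec).hex().upper() != expected]

def euc_to_7bit(data):
    """Clear the high bit of every byte of a two-byte EUC string."""
    return bytes(b & 0x7F for b in data)

print(table_mismatches())
print(euc_to_7bit("\u6f22\u5b57".encode("euc_jp")).hex().upper())
```

The second function shows a pattern visible in the table: the EUC
values are the 7-bit ISO 2022 values with the high bit of each byte
set.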


3.1: 7-BIT ISO 2022 ENCODING

	7-bit ISO 2022 is the name commonly given to the encoding
system that uses escape sequences to shift between character sets.
(ISO 2022 encoded Japanese text is also known as "JIS" encoding, but
is different from ISO-2022-JP and ISO-2022-JP-2, and will be explained
in Section 3.1.3.) This encoding comes from the ISO 2022-1993
standard.
	An escape sequence, as the name implies, consists of an escape
character followed by a sequence of one or more characters. These
escape sequences are used to change the character set of the text
stream. This may also mean a shift from one- to two-byte-per-character
mode (or vice versa).
	7-bit ISO 2022 character sets fall into two types: one-byte
and two-byte. CJK character sets, for obvious reasons, fall into the
latter group.
	One advantage that 7-bit ISO 2022 encoding has over other
encoding systems is that its escape sequences specify the character
set, and thus the locale. 7-bit ISO 2022 encoding also encodes
text using only seven-bit bytes, which has the benefit of being able
to survive Internet travel (e-mail).


3.1.1: CODE SPACE

	Each byte in the representation of graphic (printable)
characters falls into the range 0x21 (decimal 33) through 0x7E (decimal
126). For one-byte character sets, this means a maximum of 94
characters. For two-byte character sets, this means a maximum of 8,836
characters (94 x 94 = 8,836).

  One-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  first byte range                              0x21-0x7E

  Two-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  first byte range                              0x21-0x7E
  second byte range                             0x21-0x7E

White space and control characters (of which the "escape" character is
one) are still found in 0x00-0x20 and 0x7F.


3.1.2: ISO-REGISTERED ESCAPE SEQUENCES

	The following is a table that provides the ISO-registered
escape sequences for various one- and two-byte character sets
mentioned in Part 2 of this document (ISO registration numbers
provided in the fourth column):

  One-byte Character Set  Escape Sequence      Hexadecimal     ISO Reg
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^      ^^^^^^^^^^^     ^^^^^^^
  ASCII (ANSI X3.4-1986)  <ESC> ( B            0x1B2842        6
  Half-width katakana     <ESC> ( I            0x1B2849        13
  JIS X 0201-1976 Roman   <ESC> ( J            0x1B284A        14
  GB 1988-89 Roman        <ESC> ( T            0x1B2854        57

  Two-byte Character Set  Escape Sequence      Hexadecimal     ISO Reg
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^      ^^^^^^^^^^^     ^^^^^^^
  JIS C 6226-1978         <ESC> $ @            0x1B2440        42
  GB 2312-80              <ESC> $ A            0x1B2441        58
  JIS X 0208-1983         <ESC> $ B            0x1B2442        87
  KS C 5601-1992          <ESC> $ ( C          0x1B242843      149
  JIS X 0212-1990         <ESC> $ ( D          0x1B242844      159
  ISO-IR-165:1992         <ESC> $ ( E          0x1B242845      165
  JIS X 0208-1990         <ESC> & @ <ESC> $ B  0x1B26401B2442  168
  CNS 11643-1992 Plane 1  <ESC> $ ( G          0x1B242847      171
  CNS 11643-1992 Plane 2  <ESC> $ ( H          0x1B242848      172
  CNS 11643-1992 Plane 3  <ESC> $ ( I          0x1B242849      183
  CNS 11643-1992 Plane 4  <ESC> $ ( J          0x1B24284A      184
  CNS 11643-1992 Plane 5  <ESC> $ ( K          0x1B24284B      185
  CNS 11643-1992 Plane 6  <ESC> $ ( L          0x1B24284C      186
  CNS 11643-1992 Plane 7  <ESC> $ ( M          0x1B24284D      187

Note that the first three two-byte character sets (and JIS X
0208-1990, which reuses the JIS X 0208-1983 sequence) do not use an
opening parenthesis (0x28 or "(") in their escape sequences, which
means that they don't follow the 7-bit ISO 2022 rules precisely. They
are shorter for historical reasons, and are retained for backwards
compatibility.
Also note that not all of the CJK character set standards described in
Part 2 have ISO-registered escape sequences.
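	The escape sequences in the above tables are enough to split a
7-bit ISO 2022 byte stream into runs of a single character set. The
following Python sketch does just that. It is a simplified
illustration, not a complete ISO 2022 parser: the names are
hypothetical, unregistered escape sequences are passed through as
data, and the stream is assumed to begin in ASCII:

```python
import re

ESCAPES = {
    b"\x1b(B": "ASCII",
    b"\x1b(I": "Half-width katakana",
    b"\x1b(J": "JIS X 0201-1976 Roman",
    b"\x1b(T": "GB 1988-89 Roman",
    b"\x1b$@": "JIS C 6226-1978",
    b"\x1b$A": "GB 2312-80",
    b"\x1b$B": "JIS X 0208-1983",
    b"\x1b$(C": "KS C 5601-1992",
    b"\x1b$(D": "JIS X 0212-1990",
    b"\x1b$(E": "ISO-IR-165:1992",
    b"\x1b&@\x1b$B": "JIS X 0208-1990",
    b"\x1b$(G": "CNS 11643-1992 Plane 1",
    b"\x1b$(H": "CNS 11643-1992 Plane 2",
    b"\x1b$(I": "CNS 11643-1992 Plane 3",
    b"\x1b$(J": "CNS 11643-1992 Plane 4",
    b"\x1b$(K": "CNS 11643-1992 Plane 5",
    b"\x1b$(L": "CNS 11643-1992 Plane 6",
    b"\x1b$(M": "CNS 11643-1992 Plane 7",
}

# Longer sequences come first so that the JIS X 0208-1990 pair is
# matched as a whole rather than as its two halves.
_PATTERN = re.compile(b"|".join(
    re.escape(seq) for seq in sorted(ESCAPES, key=len, reverse=True)))

def segments(data, initial="ASCII"):
    """Split a byte string into (character set, bytes) runs."""
    runs, charset, pos = [], initial, 0
    for match in _PATTERN.finditer(data):
        if match.start() > pos:
            runs.append((charset, data[pos:match.start()]))
        charset, pos = ESCAPES[match.group()], match.end()
    if pos < len(data):
        runs.append((charset, data[pos:]))
    return runs

print(segments(b"abc\x1b$A::WV\x1b(Tdef"))
```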
	There are other encoding methods that are similar to 7-bit ISO
2022 in that they are suitable for Internet use, but are locale-
specific. These include HZ and zW encoding, both of which are specific
to the GB 2312-80 character set (see Sections 3.3.2 and 3.3.3).
ISO-2022-JP, ISO-2022-KR, ISO-2022-CN, and ISO-2022-CN-EXT are
described below.


3.1.3: ISO-2022-JP AND ISO-2022-JP-2

	ISO-2022-JP is best described as a subset of 7-bit ISO 2022
encoding for Japanese, and reflects how Japanese text is encoded for
e-mail messages. ISO-2022-JP-2 is an extension that supports
additional character sets.
	There are only four escape sequences permitted in ISO-2022-JP,
indicated as follows:

  One-byte Character Set  Escape Sequence      Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^      ^^^^^^^^^^^
  ASCII (ANSI X3.4-1986)  <ESC> ( B            0x1B2842
  JIS X 0201-1976 Roman   <ESC> ( J            0x1B284A

  Two-byte Character Set  Escape Sequence      Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^      ^^^^^^^^^^^
  JIS C 6226-1978         <ESC> $ @            0x1B2440
  JIS X 0208-1983         <ESC> $ B            0x1B2442

Note the lack of JIS X 0208-1990, JIS X 0212-1990, and half-width
katakana escape sequences. The JIS X 0208-1983 escape sequence is used
to indicate both JIS X 0208-1983 and JIS X 0208-1990 (for practical
reasons).
	ISO-2022-JP-2 permits additional escape sequences, indicated
as follows:

  One-byte Character Set  Escape Sequence      Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^      ^^^^^^^^^^^
  ASCII (ANSI X3.4-1986)  <ESC> ( B            0x1B2842
  JIS X 0201-1976 Roman   <ESC> ( J            0x1B284A

  Two-byte Character Set  Escape Sequence      Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^      ^^^^^^^^^^^
  JIS C 6226-1978         <ESC> $ @            0x1B2440
  JIS X 0208-1983         <ESC> $ B            0x1B2442
  JIS X 0212-1990         <ESC> $ ( D          0x1B242844
  GB 2312-80              <ESC> $ A            0x1B2441
  KS C 5601-1992          <ESC> $ ( C          0x1B242843

With the introduction of ISO-2022-KR (see Section 3.1.4), ISO-2022-CN
(see Section 3.1.5), and ISO-2022-CN-EXT (see Section 3.1.5), the
usefulness of supporting GB 2312-80 and KS C 5601-1992 can be
questioned. However, ISO-2022-JP-2 provides support for JIS X
0212-1990.
	More detailed information on ISO-2022-JP encoding can be found
in RFC 1468. And, more detailed information on ISO-2022-JP-2 encoding
can be found in RFC 1554.
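	The four ISO-2022-JP escape sequences are enough to split a
byte stream into runs by character set. Here is a minimal Python
sketch of that idea. The function name split_iso2022jp is made up for
illustration, the text is assumed to start in ASCII, and any escape
sequence outside the ISO-2022-JP tables is simply rejected:

```python
# Escape sequences permitted in ISO-2022-JP (taken from the tables above).
ESCAPES = {
    b"\x1b(B": "ASCII",
    b"\x1b(J": "JIS X 0201-1976 Roman",
    b"\x1b$@": "JIS C 6226-1978",
    b"\x1b$B": "JIS X 0208-1983",
}


def split_iso2022jp(data: bytes):
    """Split data into (character set, bytes) runs.

    Hypothetical helper: assumes the stream starts in ASCII and raises
    ValueError on any escape sequence that ISO-2022-JP does not permit.
    """
    runs = []
    charset = "ASCII"
    start = i = 0
    while i < len(data):
        if data[i] == 0x1B:
            seq = data[i:i + 3]
            if seq not in ESCAPES:
                raise ValueError("escape sequence not permitted: %r" % seq)
            if i > start:
                runs.append((charset, data[start:i]))
            charset = ESCAPES[seq]
            i += 3
            start = i
        else:
            i += 1
    if start < len(data):
        runs.append((charset, data[start:]))
    return runs
```

Given b"abc\x1b$BF|K\\\x1b(Bxyz", this returns an ASCII run, a
JIS X 0208-1983 run of four bytes (two characters), and a second ASCII
run. The JIS X 0212-1990 sequence from ISO-2022-JP-2 raises ValueError
here, which matches the narrower ISO-2022-JP definition.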


3.1.4: ISO-2022-KR

	ISO-2022-KR is similar to ISO-2022-JP (see Section 3.1.3) in
that it reflects how Korean text is encoded for e-mail messages.
However, its actual implementation is a bit different. Below is a
summary.
	There are only two shift sequences used in ISO-2022-KR,
indicated as follows:

  One-byte Character Set  Shift Sequence       Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^       ^^^^^^^^^^^
  ASCII (ANSI X3.4-1986)  <SI>                 0x0F

  Two-byte Character Set  Shift Sequence       Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^       ^^^^^^^^^^^
  KS C 5601-1992          <SO>                 0x0E

Furthermore, the following designator sequence must appear only once,
at the beginning of a line, before any KS C 5601-1992 characters (this
usually means that it appears by itself on the first line of the
file):

  <ESC> $ ) C             0x1B242943

It almost looks the same as the KS C 5601-1992 escape sequence in
7-bit ISO 2022, but look again. The opening parenthesis (0x28 or "(")
is replaced by a closing parenthesis (0x29 or ")"). This designator
sequence serves a different purpose than an escape sequence. It is
like a flag indicating that "this document contains KS C 5601-1992
characters." The <SO> and <SI> control characters actually perform the
switching between one- (ASCII) and two-byte (KS C 5601-1992) codes.
	More detailed information on ISO-2022-KR encoding can be found
in RFC 1557.
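	The division of labor between the designator sequence and the
SO and SI control characters can be sketched in Python as follows.
This is only an illustration: decode_iso2022kr is a hypothetical
name, the input is assumed to be well-formed, and the sketch leans on
the fact that setting the high bit of each 7-bit KS C 5601-1992 byte
gives the EUC form described in Section 3.2.4, which Python's
"euc_kr" codec can decode:

```python
DESIGNATOR = b"\x1b$)C"   # <ESC> $ ) C
SO, SI = 0x0E, 0x0F       # shift out (two-byte), shift in (ASCII)


def decode_iso2022kr(data: bytes) -> str:
    """Decode ISO-2022-KR by rewriting it as EUC-KR (illustrative only)."""
    # The designator is only a flag; it carries no text of its own.
    data = data.replace(DESIGNATOR, b"", 1)
    out = bytearray()
    two_byte = False
    for byte in data:
        if byte == SO:
            two_byte = True
        elif byte == SI:
            two_byte = False
        elif two_byte and 0x21 <= byte <= 0x7E:
            out.append(byte | 0x80)   # 7-bit code -> EUC-KR byte
        else:
            out.append(byte)
    return out.decode("euc_kr")
```

Note that the designator never changes the decoder's state; only SO
and SI do.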


3.1.5: ISO-2022-CN AND ISO-2022-CN-EXT

	ISO-2022-CN and ISO-2022-CN-EXT are similar to ISO-2022-JP
(see Section 3.1.3) and ISO-2022-KR (see Section 3.1.4) in that they
reflect how Chinese text is encoded for e-mail messages.
	As with ISO-2022-KR, there are only two shift sequences,
indicated as follows:

  One-byte Character Set  Shift Sequence       Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^       ^^^^^^^^^^^
  ASCII (ANSI X3.4-1986)  <SI>                 0x0F

  Two-byte Character Set  Shift Sequence       Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^       ^^^^^^^^^^^
  <Too Many to List>      <SO>                 0x0E

But, unlike ISO-2022-KR, there are single shift sequences. Single
shift means that they are used before every (single) character, not
before sequences of characters.

  Single Shift Type       Shift Sequence       Hexadecimal
  ^^^^^^^^^^^^^^^^^       ^^^^^^^^^^^^^^       ^^^^^^^^^^^
  SS2                     <ESC> N              0x1B4E
  SS3                     <ESC> O (not zero!)  0x1B4F

	ISO-2022-CN supports the following character sets using SO and
SS2 designations:

  Character Set           Type   Designation Sequence  Hexadecimal
  ^^^^^^^^^^^^^           ^^^^   ^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^
  GB 2312-80              SO     <ESC> $ ) A           0x1B242941
  CNS 11643-1992 Plane 1  SO     <ESC> $ ) G           0x1B242947
  CNS 11643-1992 Plane 2  SS2    <ESC> $ * H           0x1B242A48

The designator sequence must appear once on a line before any
instance of the character set it designates. If two lines contain
characters from the same character set, both lines must include the
designator sequence (this is so the text can be displayed correctly
when scrolling back in a window). This is different behavior from
ISO-2022-KR where the designator sequence appears once in the entire
file (this is because ISO-2022-KR supports a single two-byte character
set).
	ISO-2022-CN-EXT supports the following character sets using
SO, SS2, and SS3 designations (notice how ISO-2022-CN is still
supported in the same manner):

  Character Set           Type   Designation Sequence  Hexadecimal
  ^^^^^^^^^^^^^           ^^^^   ^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^
  GB 2312-80              SO     <ESC> $ ) A           0x1B242941
  GB/T 12345-90           SO     NOT REGISTERED
  ISO-IR-165              SO     <ESC> $ ) E           0x1B242945
  CNS 11643-1992 Plane 1  SO     <ESC> $ ) G           0x1B242947
  CNS 11643-1992 Plane 2  SS2    <ESC> $ * H           0x1B242A48
  GB 7589-87              SS2    NOT REGISTERED
  GB/T 13131-9X           SS2    NOT REGISTERED
  CNS 11643-1992 Plane 3  SS3    <ESC> $ + I           0x1B242B49
  CNS 11643-1992 Plane 4  SS3    <ESC> $ + J           0x1B242B4A
  CNS 11643-1992 Plane 5  SS3    <ESC> $ + K           0x1B242B4B
  CNS 11643-1992 Plane 6  SS3    <ESC> $ + L           0x1B242B4C
  CNS 11643-1992 Plane 7  SS3    <ESC> $ + M           0x1B242B4D
  GB 7590-87              SS3    NOT REGISTERED
  GB/T 13132-9X           SS3    NOT REGISTERED

Support for character sets indicated as NOT REGISTERED will be added
once they are ISO-registered.
	More detailed information on ISO-2022-CN and ISO-2022-CN-EXT
encodings can be found in RFC 1922.


3.2: EUC ENCODING

	EUC stands for "Extended UNIX Code," and is a rich encoding
system from ISO 2022-1993 that is designed to handle large or multiple
character sets. It is primarily used on UNIX systems, such as Sun's
Solaris.
	EUC consists of four code sets, numbered 0 through 3. The
only code set that is more or less fixed by definition is code set 0,
which is specified to contain ASCII or a locale's equivalent (such as
JIS X 0201-1976 for Japanese or GB 1988-89 for PRC Chinese).
	It is quite common to append the locale name to "EUC" when
designating a specific instance of EUC encoding. Common designations
include EUC-JP, EUC-CN, EUC-KR, and EUC-TW.


3.2.1: JAPANESE REPRESENTATION

	The following table illustrates the Japanese representation of
EUC packed format:

  EUC Code Sets                                 Encoding Range
  ^^^^^^^^^^^^^                                 ^^^^^^^^^^^^^^
  Code set 0 (ASCII or JIS X 0201-1976 Roman):  0x21-0x7E
  Code set 1 (JIS X 0208):                      0xA1A1-0xFEFE
  Code set 2 (half-width katakana):             0x8EA1-0x8EDF
  Code set 3 (JIS X 0212-1990):                 0x8FA1A1-0x8FFEFE

An earlier version of EUC for Japanese used code set 3 as the user-
defined range.
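	The encoding ranges in the table mean that the first byte of
a character is enough to tell the four code sets apart. The following
Python sketch walks an EUC-JP byte string on that basis. The name
euc_jp_code_sets is hypothetical, and the input is assumed to be
well-formed (no truncated multi-byte characters):

```python
def euc_jp_code_sets(data: bytes):
    """Yield (code set number, bytes) for each character in EUC-JP data."""
    i = 0
    while i < len(data):
        first = data[i]
        if first == 0x8E:               # code set 2: half-width katakana
            code_set, size = 2, 2
        elif first == 0x8F:             # code set 3: JIS X 0212-1990
            code_set, size = 3, 3
        elif 0xA1 <= first <= 0xFE:     # code set 1: JIS X 0208
            code_set, size = 1, 2
        else:                           # code set 0 (and control characters)
            code_set, size = 0, 1
        yield code_set, data[i:i + size]
        i += size
```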


3.2.2: CHINESE (PRC) REPRESENTATION

	The following table illustrates the Chinese (PRC)
representation of EUC packed format:

  EUC Code Sets                                 Encoding Range
  ^^^^^^^^^^^^^                                 ^^^^^^^^^^^^^^
  Code set 0 (ASCII or GB 1988-89):             0x21-0x7E
  Code set 1 (GB 2312-80):                      0xA1A1-0xFEFE
  Code set 2:                                   unused
  Code set 3:                                   unused

Note how code sets 2 and 3 are unused.
	The encoding used on Macintosh is quite similar, but has a
shortened two-byte range (0xA1A1 through 0xFCFE) plus additional
one-byte code points, namely 0x80 ("u" with dieresis), 0xFD
("copyright" symbol: "c" in a circle), 0xFE ("trademark" symbol: "TM"
as a superscript), and 0xFF ("ellipsis" symbol: three dots).


3.2.3: CHINESE (TAIWAN) REPRESENTATION

	The following table illustrates the Chinese (Taiwan)
representation of EUC packed format:

  EUC Code Sets                                 Encoding Range
  ^^^^^^^^^^^^^                                 ^^^^^^^^^^^^^^
  Code set 0 (ASCII):                           0x21-0x7E
  Code set 1 (CNS 11643-1992 Plane 1):          0xA1A1-0xFEFE
  Code set 2 (CNS 11643-1992 Planes 1-16):      0x8EA1A1A1-0x8EB0FEFE
  Code set 3:                                   unused

Note how CNS 11643-1992 Plane 1 is redundantly encoded in code set 1
(two-byte) and code set 2 (four-byte). The second byte of code set 2
indicates the plane number. For example, 0xA1 is Plane 1 and so on up
until 0xB0, which is Plane 16.
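	As a small illustration, the plane number can be recovered
from a single EUC-TW character like this (euc_tw_plane is a made-up
helper name, and only the two forms shown in the table are handled):

```python
def euc_tw_plane(char: bytes) -> int:
    """Return the CNS 11643-1992 plane of one EUC-TW character."""
    if len(char) == 2 and 0xA1 <= char[0] <= 0xFE:
        return 1                      # code set 1 is always Plane 1
    if len(char) == 4 and char[0] == 0x8E and 0xA1 <= char[1] <= 0xB0:
        return char[1] - 0xA0         # 0xA1 -> Plane 1 ... 0xB0 -> Plane 16
    raise ValueError("not a two- or four-byte EUC-TW character")
```

Both b"\xa4\xa1" and b"\x8e\xa1\xa4\xa1" report Plane 1, which is the
redundancy noted above.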


3.2.4: KOREAN REPRESENTATION

	The following table illustrates the Korean representation of
EUC packed format (this is also known as "Wansung" encoding -- the
Korean word "wansung" means "pre-compose"):

  EUC Code Sets                                 Encoding Range
  ^^^^^^^^^^^^^                                 ^^^^^^^^^^^^^^
  Code set 0 (ASCII or KS C 5636-1993):         0x21-0x7E
  Code set 1 (KS C 5601-1992):                  0xA1A1-0xFEFE
  Code set 2:                                   unused
  Code set 3:                                   unused

Note how code sets 2 and 3 are unused.
	The encoding used on Macintosh is quite similar, but has a
shortened two-byte range (0xA1A1 through 0xFDFE) plus additional
one-byte code points, namely 0x81 ("won" symbol), 0x82 (hyphen), 0x83
("copyright" symbol: "c" in a circle), 0xFE ("trademark" symbol: "TM"
as a superscript), and 0xFF ("ellipsis" symbol: three dots).
	See Section 3.3.17 for a description of Microsoft's extension
to this encoding, called Unified Hangul Code.


3.3: LOCALE-SPECIFIC ENCODINGS

	The encoding systems described in the following sections are
considered to be locale-specific, namely that they are used to encode a
specific character set standard. This is not to say that they are not
widely used (actually, some of these are among the most widely used
encoding systems!), but rather that they are tied to a specific
character set.


3.3.1: SHIFT-JIS

	Shift-JIS (also known as MS Kanji, SJIS, or DBCS-PC) is the
encoding system used on machines that support MS-DOS or Windows, and
also for Macintosh (KanjiTalk or Japanese Language Kit). It was
originally developed by Microsoft Corporation as a way to support the
Japanese character set on MS-DOS. The following tables provide the
Shift-JIS encoding ranges:

  Two-byte Standard Characters                  Encoding Ranges
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^^^^
  first byte ranges                             0x81-0x9F, 0xE0-0xEF
  second byte ranges                            0x40-0x7E, 0x80-0xFC

  Two-byte User-defined Characters              Encoding Ranges
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^              ^^^^^^^^^^^^^^^
  first byte range                              0xF0-0xFC
  second byte ranges                            0x40-0x7E, 0x80-0xFC

  One-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  Half-width katakana                           0xA1-0xDF
  ASCII/JIS-Roman                               0x21-0x7E

It is important to note that the user-defined range does not
correspond to code points in other encodings that support Japanese,
such as 7-bit ISO 2022 or EUC. This is a portability problem.
Shift-JIS is also unique among these encodings in that it does not
support the JIS X 0212-1990 character set standard.
	The encoding used on Macintosh is quite similar to the above
table, but has additional one-byte code points, namely 0x80
(backslash), 0xFD ("copyright" symbol: "c" in a circle), 0xFE
("trademark" symbol: "TM" as a superscript), and 0xFF ("ellipsis"
symbol: three dots).
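	The Shift-JIS tables translate directly into range checks on
the first and second bytes. Here is a minimal Python sketch. Both
function names are hypothetical, and the Macintosh additions are
deliberately left out:

```python
def sjis_char_type(first: int) -> str:
    """Classify a Shift-JIS character by the value of its first byte."""
    if 0x81 <= first <= 0x9F or 0xE0 <= first <= 0xEF:
        return "two-byte standard"
    if 0xF0 <= first <= 0xFC:
        return "two-byte user-defined"
    if 0xA1 <= first <= 0xDF:
        return "half-width katakana"
    if 0x21 <= first <= 0x7E:
        return "ASCII/JIS-Roman"
    return "other"


def is_sjis_second_byte(second: int) -> bool:
    """True if the value is valid as the second byte of a two-byte character."""
    return 0x40 <= second <= 0x7E or 0x80 <= second <= 0xFC
```

Note that the second-byte range overlaps the ASCII range, so a
Shift-JIS string must be scanned from the start to find character
boundaries.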


3.3.2: HZ (HZ-GB-2312)

	HZ is a simple yet very powerful and reliable system for
encoding GB 2312-80 text which was developed by Fung Fung Lee
(lee@umunhum.stanford.edu). HZ encoding is commonly used when
exchanging e-mail or posting messages to Usenet News (specifically, to
alt.chinese.text).
	The actual encoding ranges used for one- and two-byte
characters are almost identical to 7-bit ISO 2022 encoding (see Section
3.1.1). The first-byte range is limited to 0x21 through 0x77. But,
instead of using an escape sequence to shift between one- and two-byte
character modes, a simple string of two printable characters is used.

  One-byte Character Set  Shift Sequence       Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^       ^^^^^^^^^^^
  ASCII                   ~}                   0x7E7D

  Two-byte Character Set  Shift Sequence       Hexadecimal
  ^^^^^^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^       ^^^^^^^^^^^
  GB 2312-80              ~{                   0x7E7B

The tilde character (0x7E) is interpreted as an escape character in HZ
encoding, so it has special meaning. If a tilde character is to appear
in one-byte-per-character mode, it must be doubled (so ~~ would appear
as just ~). This means that there are three escape sequences used in
HZ encoding:

  Escape Sequence  Meaning
  ^^^^^^^^^^^^^^^  ^^^^^^^
  ~~               ~ in one-byte-per-character mode
  ~}               Shift into one-byte-per-character mode
  ~{               Shift into two-byte-per-character mode

There is also a fourth escape sequence, namely ~ plus a newline
character (~\n). This escape sequence is a line-continuation marker to
be consumed with no output produced.
	This method works without problems because the shift sequences
represent empty positions in the very last row of the GB 2312-80 table
(actually, the second- and fourth-from-last code points, 0x7E7D and
0x7E7B). HZ encoding makes 87 of the 94 rows accessible (first bytes
0x21 through 0x77), and because there are no defined characters beyond
row 87, this causes no problems.
	The complete HZ specification is part of the HZ package,
described in RFC 1843, and available in HTML format. These are
available at the following URLs:

  ftp://ftp.ifcss.org/pub/software/unix/convert/HZ-2.0.tar.gz
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/Ch9/rfc-1843.txt
  http://umunhum.stanford.edu/~lee/chicomp/HZ_spec.html

In addition, RFC 1842 establishes "HZ-GB-2312" as the "charset"
parameter in MIME-encoded e-mail headers. Its properties are identical
to HZ encoding as described in RFC 1843.
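	The four escape sequences described above are all a decoder
needs. The following Python sketch is one way to write it, not the
code from the HZ package. The name decode_hz is hypothetical, the
input is assumed to be well-formed, and the two-byte text is handed to
Python's "gb2312" codec after setting the high bit of each byte (which
turns the 7-bit codes into the EUC form from Section 3.2.2):

```python
def decode_hz(data: bytes) -> str:
    """Decode well-formed HZ text (illustrative sketch only)."""
    out = bytearray()
    two_byte = False
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if two_byte:
            if pair == b"~}":          # shift into one-byte mode
                two_byte = False
            else:                      # one GB 2312-80 character
                out.extend(byte | 0x80 for byte in pair)
            i += 2
        elif pair == b"~{":            # shift into two-byte mode
            two_byte = True
            i += 2
        elif pair == b"~~":            # a literal tilde
            out.append(0x7E)
            i += 2
        elif pair == b"~\n":           # line continuation, no output
            i += 2
        else:
            out.append(data[i])
            i += 1
    return out.decode("gb2312")
```

In two-byte mode the sketch consumes bytes in pairs, so a tilde that
happens to be the second byte of a character is never mistaken for the
start of an escape sequence.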


3.3.3: zW

	zW encoding, developed by Ya-Gui Wei and Edmund Lai, is older
than and somewhat similar to HZ encoding (HZ is considered to be a
better encoding system, and users are encouraged to switch over to HZ
encoding).
	zW encoding is named by how it encodes each line of GB 2312-80
text, namely lines that contain Chinese text must begin with the two
characters "z" and "W" ("zW"). This encoding method does not permit
the mixture of one- (ASCII) and two-byte (GB 2312-80) characters on a
per-character basis, but rather on a per-line basis. That is, each
line can contain only Chinese or ASCII text, but not both.
	More information on zW encoding can be found as part of the
ZWDOS package available at the following URL:

  ftp://ftp.ifcss.org/pub/software/dos/ZWDOS/


3.3.4: BIG FIVE

	Big Five is the encoding system used on machines that support
MS-DOS or Windows, and also for Macintosh (such as the Chinese
Language Kit or the fully-localized operating system).

  Two-byte Standard Characters                  Encoding Ranges
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^^^^
  first byte range                              0xA1-0xFE
  second byte ranges                            0x40-0x7E, 0xA1-0xFE

  One-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  ASCII                                         0x21-0x7E

	The encoding used on Macintosh is quite similar to the above,
but has a slightly shortened two-byte range (second byte range up to
0xFC only) plus additional one-byte code points, namely 0x80
(backslash), 0xFD ("copyright" symbol: "c" in a circle), 0xFE
("trademark" symbol: "TM" as a superscript), and 0xFF ("ellipsis"
symbol: three dots).


3.3.5: JOHAB

	Korean hangul characters are typically encoded in what is
known as pre-combined form, namely 2 or 3 hangul elements bound into a
single character. KS C 5601-1992 enumerates 2,350 such pre-combined
forms. While this number is felt to be sufficient for most purposes,
it does not account for the total number of possible permutations. The
encoding system that encodes all possible pre-combined hangul is known
as Johab encoding (also known as "two-byte combination code" -- the
Korean word "johab" means "combine"), and is described in Annex 3 of
the KS C 5601-1992 standard. This encoding is almost like encoding all
possible three-letter words in English -- while all combinations are
possible, only a fraction represent *real* words.
	Pre-combined hangul can be composed of 19 different initial,
21 different medial, and 27 different final hangul elements (28,
actually, if you count the placeholder). This provides a maximum of
11,172 pre-combined hangul. Of these 67 hangul elements, 51 are unique
(some can occur in different positions). Each of these positions is
encoded using five bits (five bits can encode up to 32 unique
objects). The encoding array looks as follows:

o Bit 1: always on
o Bits 2-6: initial hangul element
o Bits 7-11: medial hangul element
o Bits 12-16: final hangul element

Initial and final elements are consonants, and the medial elements are
vowels. This encoding must be treated as a 16-bit entity because the
bit array of the medial hangul element spans the first and second byte.
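	The bit layout can be expressed as a pair of shift-and-mask
operations. This Python sketch assumes that "Bit 1" in the list above
is the most significant bit of the 16-bit value. The function names
are made up, and the five-bit values in the example are taken from the
table in Section 3.3.6:

```python
def johab_pack(initial: int, medial: int, final: int) -> int:
    """Combine three five-bit element values into a 16-bit Johab code."""
    for value in (initial, medial, final):
        if not 0 <= value < 32:
            raise ValueError("each element needs a five-bit value")
    # Bit 1 (always on), then bits 2-6, 7-11, and 12-16.
    return 0x8000 | (initial << 10) | (medial << 5) | final


def johab_unpack(code: int):
    """Split a 16-bit Johab hangul code into (initial, medial, final)."""
    return (code >> 10) & 0x1F, (code >> 5) & 0x1F, code & 0x1F


# Initial "g" (00010), medial "a" (00011), final "fill" (00001).
print(hex(johab_pack(0b00010, 0b00011, 0b00001)))   # 0x8861
```

The medial value occupies bits on both sides of the byte boundary,
which is why the code cannot be picked apart one byte at a time.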
	Johab encoding also provides the complete set of KS C 5601-
1992 symbols and hanja, but in different code points. Annex 3 of the
KS C 5601-1992 manual (pp 33-34) contains a complete symbol and hanja
mapping table between EUC and Johab code points. (The KS C 5601-1989
manual did not have this.) The code space ranges for Johab encoding
are as follows:

  One-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  ASCII or KS C 5636-1993                       0x21-0x7E

  Two-byte Pre-combined Hangul                  Encoding Ranges
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^^^^
  first byte range                              0x84-0xD3
  second byte ranges                            0x41-0x7E, 0x81-0xFE

  Two-byte Symbols and Hanja                    Encoding Ranges
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^^^^
  first byte ranges                             0xD8-0xDE, 0xE0-0xF9
  second byte ranges                            0x31-0x7E, 0x91-0xFE

Note that the second byte ranges encode a total of 188 characters, and
that the second byte ranges for hangul and symbols/hanja are slightly
different (yet the same size, namely 188 characters).
	Here is a summary of the above table, which better describes
what is encoded where. Rows 0x84 through 0xD3 provide 80 rows of 188
characters each (15,040 code points, which is more than enough for the
11,172 pre-combined hangul). Row 0xD8 provides 188 user-defined
positions, the same as Rows 41 and 94 in the standard KS C 5601-1992
table. Rows 0xD9 through 0xDE encode Rows 1 through 12 of the standard
KS C 5601-1992 table (symbols). Rows 0xE0 through 0xF9 encode Rows 42
through 93 of the KS C 5601-1992 table (hanja). The following URL
provides a complete mapping table for the KS C 5601-1992 symbols and
hanja:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/non-hangul-codes.txt

The following URLs provide similar information (they are the same
file), but only for the 11,172 pre-combined hangul:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt
  ftp://unicode.org/pub/MappingTables/EastAsiaMaps/hangul-codes.txt

	Of further interest may be that Microsoft designates Johab
encoding as its Code Page 1361. Microsoft is planning to support Johab
encoding for Korean Windows NT.


3.3.6: N-BYTE HANGUL

	In the days before full two-byte capable operating systems,
each of the 51 basic hangul elements was encoded using a single
(7-bit) byte. The encoding range spans 0x40 through 0x7C, but there
are several unassigned gaps. This is known as the "N-byte Hangul"
code, and is described in Annex 4 (page 35) of the KS C 5601-1992
manual.
	The following table illustrates these 51 one-byte code points
(the pronunciation or meaning of the hangul element is provided in
parentheses) and how they map to the three 5-bit arrays in Johab
encoding (expressed as binary patterns):

  Element        Initial  Medial   Final
  ^^^^^^^        ^^^^^^^  ^^^^^^   ^^^^^
  0x40 ("fill")  00001    00010    00001
  0x41 (g)       00010    *****    00010
  0x42 (gg)      00011    *****    00011
  0x43 (gs)      *****    *****    00100
  0x44 (n)       00100    *****    00101
  0x45 (nj)      *****    *****    00110
  0x46 (nh)      *****    *****    00111
  0x47 (d)       00101    *****    01000
  0x48 (dd)      00110    *****    *****
  0x49 (r)       00111    *****    01001
  0x4A (rg)      *****    *****    01010
  0x4B (rm)      *****    *****    01011
  0x4C (rb)      *****    *****    01100
  0x4D (rs)      *****    *****    01101
  0x4E (rt)      *****    *****    01110
  0x4F (rp)      *****    *****    01111
  0x50 (rh)      *****    *****    10000
  0x51 (m)       01000    *****    10001
  0x52 (b)       01001    *****    10011
  0x53 (bb)      01010    *****    *****
  0x54 (bs)      *****    *****    10100
  0x55 (s)       01011    *****    10101
  0x56 (ss)      01100    *****    10110
  0x57 (ng)      01101    *****    10111
  0x58 (j)       01110    *****    11000
  0x59 (jj)      01111    *****    *****
  0x5A (c)       10000    *****    11001
  0x5B (k)       10001    *****    11010
  0x5C (t)       10010    *****    11011
  0x5D (p)       10011    *****    11100
  0x5E (h)       10100    *****    11101
  0x5F UNASSIGNED
  0x60 UNASSIGNED
  0x61 UNASSIGNED
  0x62 (a)       *****    00011    *****
  0x63 (ae)      *****    00100    *****
  0x64 (ya)      *****    00101    *****
  0x65 (yae)     *****    00110    *****
  0x66 (eo)      *****    00111    *****
  0x67 (e)       *****    01010    *****
  0x68 UNASSIGNED
  0x69 UNASSIGNED
  0x6A (yeo)     *****    01011    *****
  0x6B (ye)      *****    01100    *****
  0x6C (o)       *****    01101    *****
  0x6D (wa)      *****    01110    *****
  0x6E (wae)     *****    01111    *****
  0x6F (oe)      *****    10010    *****
  0x70 UNASSIGNED
  0x71 UNASSIGNED
  0x72 (yo)      *****    10011    *****
  0x73 (u)       *****    10100    *****
  0x74 (weo)     *****    10101    *****
  0x75 (we)      *****    10110    *****
  0x76 (wi)      *****    10111    *****
  0x77 (yu)      *****    11010    *****
  0x78 UNASSIGNED
  0x79 UNASSIGNED
  0x7A (eu)      *****    11011    *****
  0x7B (yi)      *****    11100    *****
  0x7C (i)       *****    11101    *****

	There are utilities to convert N-byte Hangul code to other,
more widely-used, encoding methods. Pointers to these and other code
conversion utilities can be found in Section 4.7.
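	To show how the table is meant to be read, here is a small
Python sketch that builds a Johab code from three N-byte Hangul
elements. It is not one of the utilities mentioned above. The
dictionary holds only a few rows of the table, the name nbyte_to_johab
is hypothetical, and the elements are assumed to have been separated
into initial, medial, and final already (how a real N-byte stream marks
syllable boundaries is not covered here):

```python
# Excerpt of the table: N-byte code -> (initial, medial, final) patterns.
# None marks a position the element cannot occupy ("*****" in the table).
NBYTE = {
    0x40: (0b00001, 0b00010, 0b00001),   # "fill"
    0x41: (0b00010, None, 0b00010),      # g
    0x44: (0b00100, None, 0b00101),      # n
    0x5E: (0b10100, None, 0b11101),      # h
    0x62: (None, 0b00011, None),         # a
}


def nbyte_to_johab(initial: int, medial: int, final: int = 0x40) -> int:
    """Map three N-byte element codes to one 16-bit Johab code."""
    i = NBYTE[initial][0]
    m = NBYTE[medial][1]
    f = NBYTE[final][2]
    if None in (i, m, f):
        raise ValueError("element used in a position it cannot occupy")
    return 0x8000 | (i << 10) | (m << 5) | f


print(hex(nbyte_to_johab(0x5E, 0x62, 0x44)))   # h + a + n -> 0xd065
```

When no final element is given, the "fill" code 0x40 supplies the
placeholder pattern.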


3.3.7: UCS-2

	UCS-2 (Universal Character Set containing 2 bytes) encoding is
one way to encode ISO 10646-1:1993 text, and is considered identical
to Unicode encoding. Its encoding range, which is quite simple, is as
follows:

  ISO 10646-1:1993 Characters                   Encoding Range
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^^^
  first byte range                              0x00-0xFF
  second byte range                             0x00-0xFF

Yes, folks, the whole range of 65,536 possible code points is
available for encoding characters. The "signature" that indicates a
file using UCS-2 is as follows:

  0xFEFF

	Escape sequences for UCS-2 have already been registered with
ISO, and are as follows:

  ISO 10646-1:1993        Escape Sequence      Hexadecimal     ISO Reg
  ^^^^^^^^^^^^^^^^        ^^^^^^^^^^^^^^^      ^^^^^^^^^^^     ^^^^^^^
  UCS-2 Level 1           <ESC> % / @          0x1B252F40      162
  UCS-2 Level 2           <ESC> % / C          0x1B252F43      174
  UCS-2 Level 3           <ESC> % / E          0x1B252F45      176

So what do these three levels mean? Level 3 means all characters in
ISO 10646-1:1993 with no restrictions (0x0000 through 0xFFFF).
	Level 2 begins to restrict the character set by not including
the following characters or character ranges:

  0x0300-0x0345  0x09D7         0x0BD7         0x11A8-0x11F9
  0x0360-0x0361  0x0A3C         0x0C55-0x0C56  0x20D0-0x20E1
  0x0483-0x0486  0x0A70-0x0A71  0x0CD5-0x0CD6  0x302A-0x302F
  0x093C         0x0ABC         0x0D57         0x3099-0x309A
  0x0953-0x0954  0x0B3C         0x1100-0x1159  0xFE20-0xFE23
  0x09BC         0x0B56-0x0B57  0x115F-0x11A2

These are all combining characters, and represent 364 code points.
	Level 1 further restricts the character set by not including
the following characters or character ranges:

  0x05B0-0x05B9  0x09BE-0x09C4  0x0B47-0x0B48  0x0D02-0x0D03
  0x05BB-0x05BD  0x09C7-0x09C8  0x0B4B-0x0B4D  0x0D3E-0x0D43
  0x05BF         0x09CB-0x09CD  0x0B82-0x0B83  0x0D46-0x0D48
  0x05C1-0x05C2  0x09E2-0x09E3  0x0BBE-0x0BC2  0x0D4A-0x0D4D
  0x064B-0x0652  0x0A02         0x0BC6-0x0BC8  0x0E31
  0x0670         0x0A3E-0x0A42  0x0BCA-0x0BCD  0x0E34-0x0E3A
  0x06D6-0x06E4  0x0A47-0x0A48  0x0C01-0x0C03  0x0E47-0x0E4E
  0x06E7-0x06E8  0x0A4B-0x0A4D  0x0C3E-0x0C44  0x0EB1
  0x06EA-0x06ED  0x0A81-0x0A83  0x0C46-0x0C48  0x0EB4-0x0EB9
  0x0901-0x0903  0x0ABE-0x0AC5  0x0C4A-0x0C4D  0x0EBB-0x0EBC
  0x093E-0x094D  0x0AC7-0x0AC9  0x0C82-0x0C83  0x0EC8-0x0ECD
  0x0951-0x0952  0x0ACB-0x0ACD  0x0CBE-0x0CC4  0xFB1E
  0x0962-0x0963  0x0B01-0x0B03  0x0CC6-0x0CC8
  0x0981-0x0983  0x0B3E-0x0B43  0x0CCA-0x0CCD

These, too, are all combining characters, and represent 586 code
points (222 above plus the 364 characters from the Level 2
restriction).


3.3.8: UCS-4

	UCS-4 (Universal Character Set containing 4 bytes) encoding is
another way to encode ISO 10646-1:1993 text, and is used for future
expansion of the character set. Its encoding range is as follows:

  ISO 10646-1:1993 Characters                   Encoding Range
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^^^
  first byte range                              0x00-0x7F
  second byte range                             0x00-0xFF
  third byte range                              0x00-0xFF
  fourth byte range                             0x00-0xFF

Note that the first byte range only goes up to 0x7F. This means that
UCS-4 is a 31-bit encoding. And, in case you're wondering, 31 bits
provide 2,147,483,648 code points. The "signature" that indicates a
file using UCS-4 is as follows:

  0x0000 0xFEFF

	Escape sequences for UCS-4 have already been registered with
ISO, and are as follows:

  ISO 10646-1:1993        Escape Sequence      Hexadecimal     ISO Reg
  ^^^^^^^^^^^^^^^^        ^^^^^^^^^^^^^^^      ^^^^^^^^^^^     ^^^^^^^
  UCS-4 Level 1           <ESC> % / A          0x1B252F41      163
  UCS-4 Level 2           <ESC> % / D          0x1B252F44      175
  UCS-4 Level 3           <ESC> % / F          0x1B252F46      177

See the end of Section 3.3.7 for a description of these three levels.
But, in the case of UCS-4, simply prepend "0000" to all the values.
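	The UCS-2 and UCS-4 signatures can be checked with a few
lines of Python. This sketch assumes the bytes are stored most
significant byte first, which is how the signatures are written in
Sections 3.3.7 and 3.3.8. The name ucs_signature is hypothetical:

```python
def ucs_signature(data: bytes):
    """Return "UCS-4", "UCS-2", or None based on the leading signature."""
    # Assumption: most significant byte first, as the signatures are written.
    if data[:4] == b"\x00\x00\xfe\xff":
        return "UCS-4"
    if data[:2] == b"\xfe\xff":
        return "UCS-2"
    return None
```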


3.3.9: UTF-7

	It turns out that *raw* ISO 10646-1:1993 encoding (that is,
UCS-2 or UCS-4) can cause problems because null bytes (0x00) are
possible (and frequent). Several UTFs (UCS Transformation Formats)
have been developed to deal with this and other problems. I must admit
that I don't know too much about UTFs, and what I provide below is
minimal, but does include pointers to more complete descriptions.
	UTF-7 is a mail-safe 7-bit transformation format for UCS-2
(including UTF-16). It uses straight ASCII for many ASCII characters,
and switches into a Base64 encoding of UCS-2 or UTF-16 for everything
else. It was designed to be usable in MIME-compliant e-mail headers as
well as message bodies, and to pass through gateways to non-ASCII mail
systems (like Bitnet). More detailed information on UTF-7 can be found
in RFC 1642, and a UTF-7 converter is available. The following URLs
provide this information:

  http://www.stonehand.com/unicode/standard/utf7.html
  ftp://unicode.org/pub/Programs/ConvertUTF/
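
	To give a rough idea of what the Base64 portion of UTF-7 looks
like, here is a Python sketch that encodes a string as a single
shifted run. The "+" and "-" delimiters and the dropped "=" padding
follow RFC 1642 as I understand it. The name utf7_shifted is
hypothetical, and a full encoder would also pass most ASCII
characters through unchanged:

```python
import base64


def utf7_shifted(text: str) -> bytes:
    """Encode text as one UTF-7 shifted run: "+", Base64 of UTF-16, "-"."""
    raw = text.encode("utf-16-be")               # 16-bit units, high byte first
    body = base64.b64encode(raw).rstrip(b"=")    # UTF-7 omits "=" padding
    return b"+" + body + b"-"


# The three characters U+65E5, U+672C, U+8A9E.
print(utf7_shifted("\u65e5\u672c\u8a9e"))        # b'+ZeVnLIqe-'
```

Python's built-in "utf-7" codec decodes the result back to the
original three characters.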


3.3.10: UTF-8

	UTF-8 (also known as UTF-2 or FSS-UTF -- FSS stands for "file
system safe") can represent any character in UCS-2 and UCS-4, and is
officially an annex to ISO 10646-1:1993. It is different from UTF-7 in
that it encodes character sets into 8-bit bytes. UCS-2 and UCS-4 have
problems with some file systems and utilities, so this UTF was
developed.
	More detailed information on UTF-8 and its relationship with
ISO 10646-1:1993 can be found at the following URLs:

  http://www.stonehand.com/unicode/standard/utf8.html
  ftp://unicode.org/pub/Programs/ConvertUTF/

	X/Open Company Limited also published a document that
describes UTF-8 in detail (they call it FSS-UTF), and you can find
information about it at the following URL:

  http://www.xopen.co.uk/public/pubs/catalog/c501.htm

The new programming language called Java supports Unicode through
UTF-8. More information on Java is at the following URL:

  http://www.javasoft.com/


3.3.11: UTF-16

	UTF-16 (formerly UCS-2E), like UTF-8, is now officially an
annex to ISO 10646-1:1993. From what I've read, UTF-16 transforms
UCS-4 into a 16-bit form. UTF-16 can then be further encoded in UTF-7
or UTF-8 (but doing this is not according to the standard -- there is
little to gain by doing so).
	More detailed information on UTF-16 and its relationship with
ISO 10646-1:1993 can be found at the following URLs:

  http://www.stonehand.com/unicode/standard/utf16.html
  ftp://unicode.org/pub/Programs/ConvertUTF/
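
	As far as I understand the transformation, values that fit in
16 bits pass through unchanged, and values from 0x10000 through
0x10FFFF are split into two 16-bit units. The following Python sketch
shows the arithmetic. The name to_utf16_units is hypothetical, and the
sketch does not check for values that are reserved for the 16-bit
pairs themselves:

```python
def to_utf16_units(code_point: int):
    """Return the list of 16-bit units UTF-16 uses for one code point."""
    if code_point < 0x10000:
        return [code_point]
    if code_point > 0x10FFFF:
        raise ValueError("beyond the range UTF-16 can reach")
    offset = code_point - 0x10000          # 20 bits remain
    high = 0xD800 | (offset >> 10)         # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)        # lower 10 bits
    return [high, low]
```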


3.3.12: ANSI Z39.64-1989

	The encoding used for ANSI Z39.64-1989 (and CCCII) is three-
byte 7-bit ISO 2022, namely the following code space:

  Three-byte ANSI Z39.64-1989                   Encoding Range
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^                   ^^^^^^^^^^^^^^
  first byte range                              0x21-0x7E
  second byte range                             0x21-0x7E
  third byte range                              0x21-0x7E


3.3.13: BASE64

	Base64 encoding is mentioned here only because of its common
usage in e-mail headers, and relationship with MIME (Multi-purpose
Internet Mail Extensions). It is also a source of confusion. Base64 is
a method of encoding arbitrary bytes into the safest 64-character
ASCII subset, and is defined in RFC 1341 (which adapted it from RFC
1113). RFC 1341 was made obsolete by RFC 1521. RFC 1522 also provides
useful information, particularly for handling non-ASCII text, and
obsoletes RFC 1342.
	Here is how it works. Every three bytes are encoded as a
four-byte sequence. That is, the 24 bits that make up the three bytes
are split into four 6-bit segments (6 bits can encode up to 64
characters). Each 6-bit segment is then converted into a character in
the Base64 Alphabet (see below). There is a 65th character, "=", which
has a special purpose (it functions as a "pad" if a full three-byte
sequence is not found). This all may sound a bit like uuencoding, but
it is different. The Base64 Alphabet is as follows:

  ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/

	My name, written in Japanese kanji, is as follows when it is
EUC-encoded (six bytes, expressed as three groups of hexadecimal
values, one group for each character):

  0xBEAE 0xCED3 0xB7F5

When these three EUC-encoded characters are converted to Base64
encoding, they appear as follows (eight bytes):

  vq7O07f1
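
	The three-bytes-to-four-characters procedure described above
can be sketched in Python as follows. This is only a minimal
illustration (the function name b64_encode is my own, and no line
wrapping is attempted); Python's standard base64 module does the same
job and is what one would normally use:

```python
ALPHABET = ("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz"
            "0123456789+/")

def b64_encode(data):
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Pack up to three bytes into 24 bits, zero-filled on the right.
        bits = int.from_bytes(chunk.ljust(3, b"\0"), "big")
        # Split the 24 bits into four 6-bit segments.
        chars = [ALPHABET[(bits >> shift) & 0x3F]
                 for shift in (18, 12, 6, 0)]
        # Segments that carry no input bits become the "=" pad.
        pad = 3 - len(chunk)
        if pad:
            chars[-pad:] = "=" * pad
        out.append("".join(chars))
    return "".join(out)

print(b64_encode(bytes.fromhex("BEAECED3B7F5")))   # prints vq7O07f1
```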

	Base64 encoding is most commonly used for encoding non-ASCII
text that appears in e-mail headers. Of all the portions of an e-mail
message, its header gets manipulated the most during transmission, and
Base64 encoding offers a safe way to further encode non-ASCII text so
that it is not altered by mail-routing software. This is where Base64
encoding can cause confusion. For example, what goes through your mind
when you see the following chunk o' text?

  From: lunde@adobe.com (=?ISO-2022-JP?B?vq7O07f1?=)

Many folks think that they are seeing ISO-2022-JP encoding. Not
true. The "ISO-2022-JP" portion is just a flag that indicates the
original encoding before Base64 encoding was applied. The actual
Base64-encoded portion is enclosed between question marks (?) as
follows:

  From: lunde@adobe.com (=?ISO-2022-JP?B?vq7O07f1?=)
                                        >^^^^^^^^<

The whole string enclosed in parentheses has several components, and
the following explains their purpose and relationships (using the
above string as an example):

  Component      Explanation
  ^^^^^^^^^      ^^^^^^^^^^^
  =?             Signals start of encoded string
  ISO-2022-JP    Charset name ("ISO-2022-JP" is for Japanese)
  ?              Delimiter
  B              Encoding ("B" is for Base64)
  ?              Delimiter
  vq7O07f1       Example string of type "charset" encoded by "encoding"
  ?=             Signals end of encoded string
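
	The components listed above can be pulled apart with a short
regular expression. The following Python sketch is only an
illustration (the pattern and the function name split_encoded_word
are my own, and only the "B" encoding is decoded); it returns the
charset name, the encoding letter, the encoded string, and the raw
bytes recovered from the Base64 portion:

```python
import base64
import re

# =?charset?encoding?encoded-text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([BbQq])\?([^?]*)\?=")

def split_encoded_word(text):
    match = ENCODED_WORD.search(text)
    if match is None:
        return None
    charset, encoding, payload = match.groups()
    # Only Base64 ("B") is handled in this sketch.
    raw = base64.b64decode(payload) if encoding.upper() == "B" else None
    return charset, encoding, payload, raw

header = "From: lunde@adobe.com (=?ISO-2022-JP?B?vq7O07f1?=)"
charset, encoding, payload, raw = split_encoded_word(header)
print(charset, encoding, payload, raw.hex())
```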

	One typically does not need to worry about encoding text as
Base64 (MIME-compliant mailing software usually performs this task for
you). The problem is usually trying to decode Base64-encoded text. A
Base64 decoder is available in Perl at the following URL:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/perl/b64decode.pl

Note that this program takes "raw" Base64 data as input. Any non-
Base64 stuff must be stripped. I usually run this from within Mule
("C-u M-| b64decode.pl") after defining a region around the Base64-
encoded material. I hope to replace this program soon with one that
automatically recognizes the Base64-encoded portions.
	Most MIME-compliant e-mail software can decode Base64-encoded
text.


3.3.14: IBM DBCS-HOST

	The oldest two-byte encoding system is IBM's DBCS-Host. DBCS
stands for Double-Byte Character Set. DBCS-Host is still in use on
IBM's mainframe computer systems (hence the use of "Host").
	DBCS-Host encoding is EBCDIC-based, and uses Shift characters,
0x0E and 0x0F, to switch between one- and two-byte mode. Its encoding
specifications are as follows:

  Two-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  first byte range                              0x41-0xFE
  second byte range                             0x41-0xFE

  Two-byte "Space" Character                    Code Point
  ^^^^^^^^^^^^^^^^^^^^^^^^^^                    ^^^^^^^^^^
  first and second bytes                        0x4040

  One-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  EBCDIC                                        0x41-0xF9

  Shifting Characters                           Code Point
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^
  Two-byte                                      0x0E
  One-byte                                      0x0F

This same encoding specification is shared by all of IBM's CJK
character sets, namely for Japanese, Simplified Chinese, Traditional
Chinese, and Korean.
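
	The shifting behavior described above can be sketched in
Python as follows. This is a minimal illustration, not IBM's
algorithm: the function name is my own, the stream is assumed to begin
in one-byte mode, and no attempt is made to validate the byte ranges
of the characters themselves:

```python
SHIFT_TO_TWO_BYTE = 0x0E
SHIFT_TO_ONE_BYTE = 0x0F

def split_dbcs_host(data):
    """Split a byte string into one- and two-byte characters."""
    units = []
    two_byte = False            # assumption: streams start in one-byte mode
    i = 0
    while i < len(data):
        b = data[i]
        if b == SHIFT_TO_TWO_BYTE:
            two_byte = True
            i += 1
        elif b == SHIFT_TO_ONE_BYTE:
            two_byte = False
            i += 1
        elif two_byte:
            if i + 1 >= len(data):
                raise ValueError("truncated two-byte character")
            units.append(bytes(data[i:i + 2]))
            i += 2
        else:
            units.append(bytes([b]))
            i += 1
    return units

sample = bytes([0xC1, 0x0E, 0x42, 0x43, 0x40, 0x40, 0x0F, 0xC2])
print(split_dbcs_host(sample))
```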


3.3.15: IBM DBCS-PC

	IBM's DBCS-PC encoding is used on IBM personal computers (that
is where the "PC" comes from). DBCS-PC encoding is ASCII-based, and
uses the values of characters' bytes themselves to switch between one-
and two-byte mode. Its encoding specifications are as follows:

  Two-byte Characters                           Encoding Ranges
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^^
  first byte range                              0x81-0xFE
  second byte range                             0x40-0x7E, 0x80-0xFE

  One-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  ASCII                                         0x21-0x7E

This same encoding specification is shared by all of IBM's CJK
character sets, namely for Japanese, Simplified Chinese, Traditional
Chinese, and Korean.
	DBCS-PC encoding for Japanese, although conforming to the
above encoding specifications, actually uses the same encoding
specifications as Shift-JIS, including the full user-defined range
(see Section 3.3.1 for more details on Shift-JIS encoding). One big
accommodation is the half-width katakana range, namely 0xA1 through
0xDF. Further, the DBCS-PC code space that is outside the Shift-JIS
specification is unused.
	DBCS-PC encoding for Korean uses the equivalent of EUC code
set 1 code points (0xA1A1 through 0xFEFE) for those characters that
are common with KS C 5601-1992. Those characters that are not common
with KS C 5601-1992, namely IBM's extensions, are within the DBCS-PC
encoding space, but outside EUC encoding space (0x9A through 0xA0).
Many hanja and pre-combined hangul are part of IBM's Korean extension.
	Note that DBCS-PC is sort of useless without a corresponding
SBCS (Single-Byte Character Set) for the one-byte range. Mixing DBCS
and SBCS results in an MBCS (Multiple-Byte Character Set). How these
are mixed to form MBCSs is detailed in Section 3.4.
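
	Because DBCS-PC has no shift characters, a parser must look at
each byte's value to decide whether it begins a two-byte character.
The following Python sketch follows only the generic ranges in the
table above (the function names are my own); it does not handle the
Japanese accommodation for half-width katakana, and it treats every
byte outside the first-byte range as a one-byte character:

```python
def is_first_byte(b):
    return 0x81 <= b <= 0xFE

def is_second_byte(b):
    return 0x40 <= b <= 0x7E or 0x80 <= b <= 0xFE

def split_dbcs_pc(data):
    """Split a byte string into one- and two-byte characters."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if is_first_byte(b):
            if i + 1 >= len(data) or not is_second_byte(data[i + 1]):
                raise ValueError("invalid two-byte character at %d" % i)
            units.append(bytes(data[i:i + 2]))
            i += 2
        else:
            units.append(bytes([b]))
            i += 1
    return units

print(split_dbcs_pc(b"A\x81\x40B"))
```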


3.3.16: IBM DBCS-/TBCS-EUC

	IBM has also developed DBCS-EUC and TBCS-EUC encodings. TBCS
stands for Triple-Byte Character Set. These essentially follow the EUC
encoding specifications, and were developed for use with IBM's AIX
(Advanced Interactive Executive) operating system, which is
UNIX-based.
	Refer to Section 3.2 for all the details on EUC encoding.


3.3.17: UNIFIED HANGUL CODE

	Microsoft has developed what is called "Unified Hangul Code"
(UHC) for its Windows 95 operating system (this was also known as
"Extended Wansung"). It is the optional, not standard, character set
of Win95K.
	UHC provides full compatibility with KS C 5601-1992 EUC
encoding (see Section 3.2.4), but adds additional encoding ranges for
holding additional pre-combined hangul (more precisely, the 8,822 that
are needed to fully support the Johab character set). The following is
a table that provides the encoding ranges for UHC encoding:

  Two-byte Standard Characters                  Encoding Ranges
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^^^^
  first byte range                              0x81-0xFE
  second byte ranges                            0x41-0x5A, 0x61-0x7A,
                                                and 0x81-0xFE

  One-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  ASCII                                         0x21-0x7E

Note that 0xA1A1 through 0xFEFE in the above encoding is still
identical, in terms of character-to-code allocation, with KS C 5601-
1992 in EUC encoding.
	Appendix G (pp 345-406) of "Developing International Software
for Windows 95 and Windows NT" by Nadine Kano illustrates the KS C
5601-1992 character set standard plus these Microsoft extensions
(8,822 pre-combined hangul) by UHC code (Microsoft calls this Code
Page 949).
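
	The relationship between the UHC code space and KS C 5601-1992
in EUC encoding can be sketched in Python as follows. This is only an
illustration based on the ranges in the table above (the function name
and the returned labels are my own); it says where a two-byte code
falls, not whether a character is actually assigned to it:

```python
def classify_uhc(first, second):
    """Classify a two-byte code by the UHC ranges in this section."""
    if not 0x81 <= first <= 0xFE:
        return None
    # 0xA1A1 through 0xFEFE is shared with KS C 5601-1992 EUC.
    if first >= 0xA1 and 0xA1 <= second <= 0xFE:
        return "KS C 5601-1992"
    if (0x41 <= second <= 0x5A or 0x61 <= second <= 0x7A
            or 0x81 <= second <= 0xFE):
        return "extension"
    return None

print(classify_uhc(0xB0, 0xA1))
print(classify_uhc(0x81, 0x41))
```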


3.3.18: TRON CODE

	TRON (The Real-time Operating system Nucleus) is an OS
developed in Japan some time ago. Personal Media Corporation has done
work to develop BTRON (Business TRON), which is unique in that it is
the only commercially-available OS that supports JIS X 0212-1990.
	TRON Code provides a one- and two-byte encoding space and a
method for switching between them.
	The following is how the two-byte space in TRON Code is
allocated:

  A-Zone (8,836 characters; JIS X 0208-1990)    Encoding Range
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^
  first byte range                              0x21-0x7E
  second byte range                             0x21-0x7E

  B-Zone (11,844 characters; JIS X 0212-1990)   Encoding Range
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^^^
  first byte range                              0x80-0xFD
  second byte range                             0x21-0x7E

  C-Zone (11,844 characters; unassigned)        Encoding Range
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^        ^^^^^^^^^^^^^^
  first byte range                              0x21-0x7E
  second byte range                             0x80-0xFD

  D-Zone (15,876 characters; unassigned)        Encoding Range
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^        ^^^^^^^^^^^^^^
  first byte range                              0x80-0xFD
  second byte range                             0x80-0xFD

Note how the B-Zone is larger than the conventional 94-by-94
matrix. In fact, the JIS X 0212-1990 portion of the B-Zone is
restricted to 0xA121-0xFD7E (93-by-94 matrix -- 0xFE as a first-byte
value is unavailable, and you will see why in a minute).
	TRON Code implements "language specifying codes" consisting of
two bytes as follows:

  Two-byte Japanese                             0xFE21
  One-byte English                              0xFE80

0xFE21 in a one-byte stream invokes two-byte Japanese mode, and 0xFE80
in a two-byte stream invokes one-byte English mode.
	The following is the one-byte encoding range for TRON Code:

  One-byte Characters                           0x21-0x7E and 0x80-0xFD

Control codes are in 0x00-0x20 and 0x7F (the usual ASCII control code
range). Also, 0xA0 is reserved as a fixed-width space character.
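
	The four two-byte zones and their sizes can be checked with a
short Python sketch. The names below are my own, and the sketch only
reflects the ranges given in the tables above:

```python
LOW = range(0x21, 0x7F)     # 0x21-0x7E, 94 values
HIGH = range(0x80, 0xFE)    # 0x80-0xFD, 126 values

def tron_zone(first, second):
    """Return the zone letter for a two-byte TRON code, or None."""
    if first in LOW and second in LOW:
        return "A"
    if first in HIGH and second in LOW:
        return "B"
    if first in LOW and second in HIGH:
        return "C"
    if first in HIGH and second in HIGH:
        return "D"
    return None

ZONE_SIZES = {
    "A": len(LOW) * len(LOW),
    "B": len(HIGH) * len(LOW),
    "C": len(LOW) * len(HIGH),
    "D": len(HIGH) * len(HIGH),
}
print(ZONE_SIZES)
```

Note that the language specifying codes (0xFE21 and 0xFE80) fall
outside all four zones, so tron_zone returns None for them.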


3.3.19: GBK

	GBK is an extension to GB 2312-80 that adds all ISO 10646-
1:1993 (GB 13000.1-93) hanzi not already in GB 2312-80. GBK is defined
as a normative annex of GB 13000.1-93 (see Section 2.2.10). The "K" in
"GBK" is the first sound in the Chinese word meaning "extension" (read
"Kuo Zhan").
	GBK is divided into five levels as follows:

  Level  Encoded Range  Total Code Points  Total Encoded Characters
  ^^^^^  ^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^^^^^^
  GBK/1  0xA1A1-0xA9FE    846                717
  GBK/2  0xB0A1-0xF7FE  6,768              6,763
  GBK/3  0x8140-0xA0FE  6,080              6,080
  GBK/4  0xAA40-0xFEA0  8,160              8,160
  GBK/5  0xA840-0xA9A0    192                166

	There are also 1,894 user-defined code points as follows:

  Encoded Range  Total Code Points
  ^^^^^^^^^^^^^  ^^^^^^^^^^^^^^^^^
  0xAAA1-0xAFFE  564
  0xF8A1-0xFEFE  658
  0xA140-0xA7A0  672

	GBK thus provides a total of 23,940 code points, 21,886 of
which are assigned.
	Each "row" in the GBK code table consists of 190 characters.
The following describes the encoding ranges of GBK in detail:

  Two-byte Standard Characters                  Encoding Ranges
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^                  ^^^^^^^^^^^^^^^
  first byte range                              0x81-0xFE
  second byte ranges                            0x40-0x7E and 0x80-0xFE

  One-byte Characters                           Encoding Range
  ^^^^^^^^^^^^^^^^^^^                           ^^^^^^^^^^^^^^
  ASCII                                         0x21-0x7E

Note that the sub-range 0xA1A1-0xFEFE in the above encoding is still
identical, in terms of character-to-code allocation, with GB 2312-80
in EUC encoding. GBK is therefore backward-compatible with GB 2312-80
and forward-compatible with ISO 10646-1:1993.
	GBK is the standard character set and encoding for the
Simplified Chinese version of Windows 95.
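
	The code point totals in the tables above can be reproduced
with a few lines of Python. This sketch assumes that each encoded
range is a simple rectangle (a first-byte range crossed with a
second-byte range, skipping 0x7F), which is my reading of the tables;
the names are my own:

```python
def second_bytes(lo, hi):
    """Count valid second bytes from lo through hi (0x7F is unused)."""
    return sum(1 for b in range(lo, hi + 1) if b != 0x7F)

def code_points(first, last):
    rows = (last >> 8) - (first >> 8) + 1
    return rows * second_bytes(first & 0xFF, last & 0xFF)

LEVELS = {
    "GBK/1": (0xA1A1, 0xA9FE),
    "GBK/2": (0xB0A1, 0xF7FE),
    "GBK/3": (0x8140, 0xA0FE),
    "GBK/4": (0xAA40, 0xFEA0),
    "GBK/5": (0xA840, 0xA9A0),
}
USER_DEFINED = [(0xAAA1, 0xAFFE), (0xF8A1, 0xFEFE), (0xA140, 0xA7A0)]

level_total = sum(code_points(*r) for r in LEVELS.values())
user_total = sum(code_points(*r) for r in USER_DEFINED)
print(level_total, user_total, level_total + user_total)
```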


3.4: CJK CODE PAGES

	One often encounters references to "Code Pages" in material
about CJK (and other) character sets and encodings. These are
not literal pages, but rather references to a character set and
encoding combination. In the case of CJK Code Pages, they definitely
comprise more than one page!
	Microsoft refers to its supported CJK character sets and
encodings through such Code Page designations. The following is a
listing of several Microsoft CJK Code Pages along with their
characteristics:

  Code Page  Characteristics
  ^^^^^^^^^  ^^^^^^^^^^^^^^^
  932        JIS X 0208-1990 base, Shift-JIS encoding, Microsoft
             extensions (NEC Row 13 and IBM select characters
             redundantly encoded in Rows 89 through 92 and Rows 115
             through 119)
  936        GB 2312-80 base, EUC encoding
  949        KS C 5601-1992 base, Unified Hangul Code encoding,
             remaining 8,822 pre-combined hangul as extension (all of
             this is referred to as Unified Hangul Code)
  950        Big Five base, Big Five encoding, Microsoft extensions
             (actually, the ETen extensions of Row 89)
  1361       Johab base, Johab encoding

	IBM also uses Code Page designations, and, in fact, some
designations (and associated characteristics) are nearly identical to
those in the above table, most notably, Code Pages 932 and 936. IBM's
Code Page 932 does not include NEC Row 13 or IBM select characters in
Rows 89 through 92.
	The best way to describe IBM Code Page designations is by
first listing the SBCS (Single-Byte Character Set) and DBCS (Double-
Byte Character Set) Code Page designations (those designated by "Host"
use EBCDIC-based encodings):

  IBM SBCS Code Page          Characteristics
  ^^^^^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^
  37 (US)                     SBCS-Host
  290 (Japanese)              SBCS-Host
  833 (Korean)                SBCS-Host
  836 (Simplified Chinese)    SBCS-Host
  891 (Korean)                SBCS-PC
  897 (Japanese)              SBCS-PC
  903 (Simplified Chinese)    SBCS-PC
  904 (Traditional Chinese)   SBCS-PC

  IBM DBCS Code Page          Characteristics
  ^^^^^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^
  300 (Japanese)              DBCS-Host
  301 (Japanese)              DBCS-PC
  834 (Korean)                DBCS-Host
  835 (Traditional Chinese)   DBCS-Host
  837 (Simplified Chinese)    DBCS-Host
  926 (Korean)                DBCS-PC
  927 (Traditional Chinese)   DBCS-PC
  928 (Simplified Chinese)    DBCS-PC

So far there appears to be no relationship with Microsoft's CJK Code
Pages, but when we combine the above SBCS and DBCS Code Pages into
MBCS (Multiple-Byte Character Set) Code Pages, things become a bit
more revealing:

  IBM MBCS Code Page          Characteristics
  ^^^^^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^^
  930 (Japanese)              MBCS-Host (Code Pages 300 and 290)
  932 (Japanese)              MBCS-PC (Code Pages 301 and 897)
  933 (Korean)                MBCS-Host (Code Pages 834 and 833)
  934 (Korean)                MBCS-PC (Code Pages 926 and 891)
  938 (Traditional Chinese)   MBCS-PC (Code Pages 927 and 904)
  936 (Simplified Chinese)    MBCS-PC (Code Pages 928 and 903)
  5031 (Simplified Chinese)   MBCS-Host (Code Pages 837 and 836)
  5033 (Traditional Chinese)  MBCS-Host (Code Pages 835 and 37)

So, you can now see that many of Microsoft's CJK Code Pages are
derived from those established by IBM.
	More detailed information on the encoding specifications for
DBCS-Host and DBCS-PC can be found in Sections 3.3.14 and 3.3.15,
respectively.
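
	The way the IBM SBCS and DBCS Code Pages combine into MBCS
Code Pages can be expressed as a small lookup structure. The Python
sketch below simply transcribes the three tables above (the variable
and function names are my own) and checks that each MBCS Code Page
refers to an SBCS and a DBCS Code Page of the same kind (Host or PC):

```python
SBCS = {37: "Host", 290: "Host", 833: "Host", 836: "Host",
        891: "PC", 897: "PC", 903: "PC", 904: "PC"}
DBCS = {300: "Host", 301: "PC", 834: "Host", 835: "Host",
        837: "Host", 926: "PC", 927: "PC", 928: "PC"}

# MBCS Code Page -> (kind, DBCS Code Page, SBCS Code Page)
MBCS = {
    930: ("Host", 300, 290),
    932: ("PC", 301, 897),
    933: ("Host", 834, 833),
    934: ("PC", 926, 891),
    938: ("PC", 927, 904),
    936: ("PC", 928, 903),
    5031: ("Host", 837, 836),
    5033: ("Host", 835, 37),
}

def is_consistent(page):
    kind, dbcs, sbcs = MBCS[page]
    return DBCS.get(dbcs) == kind and SBCS.get(sbcs) == kind

print(all(is_consistent(page) for page in MBCS))
```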


PART 4: CJK CHARACTER SET COMPATIBILITY ISSUES

	The sections below provide detailed information about
compatibility issues between CJK character sets, including tidbits of
useful information.
	One thing to mention first is that conversion to and from
IBM's DBCS-Host (Section 3.3.14) and DBCS-PC (Section 3.3.15)
encodings is table-driven, and fully documented in the following IBM
publication:

o IBM Corporation. "Character Data Representation Architecture - Level
  2, Registry." 1993. IBM order number SC09-1391-01.

Unfortunately, the CJK-related tables are not supplied in machine-
readable format, and must be obtained from IBM directly. The only real
compatibility issue is trying to obtain the conversion tables from
IBM.


4.1: JAPANESE

	In general, when a Japanese character set was revised,
characters were simply added (usually appended at the end). However,
when JIS C 6226-1978 was revised in 1983 (to become JIS X 0208-1983),
a bit more happened (this is still controversial).
	A detailed treatment of the two main transitions, JIS C 6226-
1978 to JIS X 0208-1983 and JIS X 0208-1983 to JIS X 0208-1990, is
covered in Appendix J of UJIP. I provide machine-readable files that
detail these transitions at the following URL:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/AppJ/

	An interesting side note here is that there is a reason why
the many lists that illustrate JIS C 6226-1978 and JIS X 0208-1983
kanji form differences do not all agree. While most share the same
basic set of changes, there are some inconsistencies. It turns out
that JIS C 6226-1978 had ten printings, and not all of them shared the
same kanji forms. If comparisons between JIS C 6226-1978 and JIS X
0208-1983 are made using different printings of the JIS C 6226-1978
manual, the results can differ slightly.
	There are also interesting correspondences between JIS X
0208-1990 and JIS X 0212-1990. 28 kanji that vanished during the JIS C
6226-1978 to JIS X 0208-1983 transition (they were replaced by
simplified versions) were restored in JIS X 0212-1990 (at totally
different code points). Appendix J of UJIP discusses this, and a file
at the following URL details the 28 mappings:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/AppJ/TJ2.jis


4.2: CHINESE (PRC)

	The basic PRC standard, GB 2312-80, has been revised, but not
through a later version of the standard. Instead, the revisions were
carried out in the form of three other documents. Specifically, they
are (in order of publication):

o GB 6345.1-86 (see Section 2.2.3)
o GB 8565.2-88 (see Section 2.2.6)
o GB/T 12345-90 (see Section 2.2.7)

Unless you are aware of these documents, figuring out what has been
corrected or added to GB 2312-80 is nearly impossible.


4.3: CHINESE (TAIWAN)

	The first question people think of with regard to Big Five and
CNS 11643-1992 is compatibility. It turns out that Planes 1 and 2 of
CNS 11643-1992 are more or less equivalent to Big Five, but a handful
of hanzi are in a different order. The following tables detail the
mapping from Big Five (with the ETen extension) to CNS 11643-1992
(when using this conversion table, keep in mind the encoding space
ranges for both Big Five and CNS 11643-1992):

Big Five Level 1 Correspondences to CNS 11643-1992 Plane 1:

  0xA140-0xA1F5 <-> 0x2121-0x2256
         0xA1F6 <-> 0x2258
         0xA1F7 <-> 0x2257
  0xA1F8-0xA2AE <-> 0x2259-0x234E
  0xA2AF-0xA3BF <-> 0x2421-0x2570
  0xA3C0-0xA3E0 <-> 0x4221-0x4241  # Symbols for control characters
  0xA440-0xACFD <-> 0x4421-0x5322  # Level 1 Hanzi BEGIN
         0xACFE <-> 0x5753
  0xAD40-0xAFCF <-> 0x5323-0x5752
  0xAFD0-0xBBC7 <-> 0x5754-0x6B4F
  0xBBC8-0xBE51 <-> 0x6B51-0x6F5B
         0xBE52 <-> 0x6B50
  0xBE53-0xC1AA <-> 0x6F5C-0x7534
  0xC1AB-0xC2CA <-> 0x7536-0x7736
         0xC2CB <-> 0x7535
  0xC2CC-0xC360 <-> 0x7737-0x782C
  0xC361-0xC3B8 <-> 0x782E-0x7863
         0xC3B9 <-> 0x7865
         0xC3BA <-> 0x7864
  0xC3BB-0xC455 <-> 0x7866-0x7961
         0xC456 <-> 0x782D
  0xC457-0xC67E <-> 0x7962-0x7D4B  # Level 1 Hanzi END
  0xC6A1-0xC6AA <-> 0x2621-0x262A  # Circled numerals
  0xC6AB-0xC6B4 <-> 0x262B-0x2634  # Parenthesized numerals
  0xC6B5-0xC6BE <-> 0x2635-0x263E  # Lowercase Roman numerals
  0xC6BF-0xC6C0 <-> 0x2723-0x2724  # 213 radicals BEGIN
  0xC6C1-0xC6C2 <-> 0x2726, 0x2728
  0xC6C3-0xC6C5 <-> 0x272D-0x272F
  0xC6C6-0xC6C7 <-> 0x2734, 0x2737
  0xC6C8-0xC6C9 <-> 0x273A, 0x273C
  0xC6CA-0xC6CB <-> 0x2742, 0x2747
  0xC6CC-0xC6CD <-> 0x274E, 0x2753
  0xC6CE-0xC6CF <-> 0x2754-0x2755
  0xC6D0-0xC6D1 <-> 0x2759-0x275A
  0xC6D2-0xC6D3 <-> 0x2761, 0x2766
  0xC6D4-0xC6D5 <-> 0x2829-0x282A
  0xC6D6-0xC6D7 <-> 0x2863, 0x286C # 213 radicals END
  0xC6D8-0xC6E6  -> ******         # Japanese symbols
  0xC6E7-0xC77A  -> ******         # Hiragana
  0xC77B-0xC7F2  -> ******         # Katakana
  0xC7F3-0xC875  -> ******         # Cyrillic alphabet
  0xC876-0xC878  -> ******         # Symbols
         0xC87A  -> ******         # Hanzi element
         0xC87C  -> ******         # Hanzi element
  0xC87E-0xC8A1  -> ******         # Hanzi elements
  0xC8A3-0xC8A4  -> ******         # Hanzi elements
  0xC8A5-0xC8CC  -> ******         # Combined numerals
  0xC8CD-0xC8D3  -> ******         # Japanese symbols

Big Five Level 1 Correspondences to CNS 11643-1992 Plane 4:

         0xC879 <-> 0x2123         # Hanzi element
         0xC87B <-> 0x2124         # Hanzi element
         0xC87D <-> 0x212A         # Hanzi element
         0xC8A2 <-> 0x2152         # Hanzi element

Big Five Level 2 Correspondence to CNS 11643-1992 Plane 1:

         0xC94A  -> 0x4442         # duplicate of 0xA461

Big Five Level 2 Correspondences to CNS 11643-1992 Plane 2:

  0xC940-0xC949 <-> 0x2121-0x212A  # Level 2 Hanzi BEGIN
  0xC94B-0xC96B <-> 0x212B-0x214B
  0xC96C-0xC9BD <-> 0x214D-0x217C
         0xC9BE <-> 0x214C
  0xC9BF-0xC9EC <-> 0x217D-0x224C
  0xC9ED-0xCAF6 <-> 0x224E-0x2438
         0xCAF7 <-> 0x224D
  0xCAF8-0xD6CB <-> 0x2439-0x376E
         0xD6CC <-> 0x3E63
  0xD6CD-0xD779 <-> 0x3770-0x387D
         0xD77A <-> 0x3F6A
  0xD77B-0xDADE <-> 0x387E-0x3E62
         0xDADF <-> 0x376F
  0xDAE0-0xDBA6 <-> 0x3E64-0x3F69
  0xDBA7-0xDDFB <-> 0x3F6B-0x4423
         0xDDFC  -> 0x4176         # duplicate of 0xDCD1
  0xDDFD-0xE8A2 <-> 0x4424-0x554A
  0xE8A3-0xE975 <-> 0x554C-0x5721
  0xE976-0xEB5A <-> 0x5723-0x5A27
  0xEB5B-0xEBF0 <-> 0x5A29-0x5B3E
         0xEBF1 <-> 0x554B
  0xEBF2-0xECDD <-> 0x5B3F-0x5C69
         0xECDE <-> 0x5722
  0xECDF-0xEDA9 <-> 0x5C6A-0x5D73
  0xEDAA-0xEEEA <-> 0x5D75-0x6038
         0xEEEB <-> 0x642F
  0xEEEC-0xF055 <-> 0x6039-0x6242
         0xF056 <-> 0x5D74
  0xF057-0xF0CA <-> 0x6243-0x6336
         0xF0CB <-> 0x5A28
  0xF0CC-0xF162 <-> 0x6337-0x642E
  0xF163-0xF16A <-> 0x6430-0x6437
         0xF16B <-> 0x6761
  0xF16C-0xF267 <-> 0x6438-0x6572
         0xF268 <-> 0x6934
  0xF269-0xF2C2 <-> 0x6573-0x664C
  0xF2C3-0xF374 <-> 0x664E-0x6760
  0xF375-0xF465 <-> 0x6762-0x6933
  0xF466-0xF4B4 <-> 0x6935-0x6961
         0xF4B5 <-> 0x664D
  0xF4B6-0xF4FC <-> 0x6962-0x6A4A
  0xF4FD-0xF662 <-> 0x6A4C-0x6C51
         0xF663 <-> 0x6A4B
  0xF664-0xF976 <-> 0x6C52-0x7165
  0xF977-0xF9C3 <-> 0x7167-0x7233
         0xF9C4 <-> 0x7166
         0xF9C5 <-> 0x7234
         0xF9C6 <-> 0x7240
  0xF9C7-0xF9D1 <-> 0x7235-0x723F
  0xF9D2-0xF9D5 <-> 0x7241-0x7244  # Level 2 Hanzi END
  0xF9DD-0xF9FE  -> ******         # Symbols

Big Five Level 2 Correspondences to CNS 11643-1992 Plane 3:

         0xF9D6 <-> 0x4337         # ETen-specific hanzi
         0xF9D7 <-> 0x4F50         # ETen-specific hanzi
         0xF9D8 <-> 0x444E         # ETen-specific hanzi
         0xF9D9 <-> 0x504A         # ETen-specific hanzi
         0xF9DA <-> 0x2C5D         # ETen-specific hanzi
         0xF9DB <-> 0x3D7E         # ETen-specific hanzi
         0xF9DC <-> 0x4B5C         # ETen-specific hanzi

I adapted the above from material Ross Paterson (rap@doc.ic.ac.uk)
kindly made available at the following URL:

  http://www.ifcss.org:8001/www/pub/software/info/cjk-codes/

Check it out. Basically, I just changed the CNS 11643-1992 codes from
decimal row-cell values to hexadecimal codes, and corrected the
mappings to correspond to ETen's Big Five (which is considered to be
the most standard).
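	A range in the above tables can be applied by counting how far
a Big Five code is from the start of its range, and then stepping the
same distance into the CNS 11643-1992 range. The Python sketch below
assumes the usual layouts (157 cells per Big Five row, namely 0x40-0x7E
and 0xA1-0xFE, and 94 cells per CNS row); the function names are my
own, and only a handful of table rows are included for illustration:

```python
def big5_offset(code):
    """Linear position of a Big Five code (157 cells per row)."""
    hi, lo = code >> 8, code & 0xFF
    if 0x40 <= lo <= 0x7E:
        cell = lo - 0x40
    elif 0xA1 <= lo <= 0xFE:
        cell = lo - 0xA1 + 63
    else:
        raise ValueError("invalid Big Five second byte")
    return (hi - 0xA1) * 157 + cell

def cns_offset(code):
    return ((code >> 8) - 0x21) * 94 + ((code & 0xFF) - 0x21)

def cns_from_offset(offset):
    return ((offset // 94 + 0x21) << 8) | (offset % 94 + 0x21)

# (Big Five start, Big Five end, CNS plane, CNS start)
RANGES = [
    (0xA140, 0xA1F5, 1, 0x2121),
    (0xA1F6, 0xA1F6, 1, 0x2258),
    (0xA1F7, 0xA1F7, 1, 0x2257),
    (0xA1F8, 0xA2AE, 1, 0x2259),
    (0xA440, 0xACFD, 1, 0x4421),
    (0xC940, 0xC949, 2, 0x2121),
]

def big5_to_cns(code):
    """Return (plane, CNS code), or None if no listed range applies."""
    for start, end, plane, cns_start in RANGES:
        if start <= code <= end:
            delta = big5_offset(code) - big5_offset(start)
            return plane, cns_from_offset(cns_offset(cns_start) + delta)
    return None

plane, cns = big5_to_cns(0xACFD)
print(plane, hex(cns))
```

The remaining rows of the tables would be added to RANGES in the same
way.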
	It turns out that corrections were made to Big Five (at least
in the ETen and Microsoft implementations thereof) which made it a bit
closer to CNS 11643-1992 as far as character ordering is concerned.
The following six lines of code correspondences:

  0xCAF8-0xD6CB <-> 0x2439-0x376E
         0xD6CC <-> 0x3E63
  0xD6CD-0xD779 <-> 0x3770-0x387D
         0xD77A <-> 0x3F6A
  0xD77B-0xDADE <-> 0x387E-0x3E62
         0xDADF <-> 0x376F

together with the 0xDAE0-0xDBA6 <-> 0x3E64-0x3F69 line that follows
them in the table above, can now be expressed as the following three
lines:

  0xCAF8-0xD779 <-> 0x2439-0x387D
         0xD77A <-> 0x3F6A
  0xD77B-0xDBA6 <-> 0x387E-0x3F69

In essence, the ordering of Big Five characters 0xD6CC and 0xDADF was
reversed. This resulted in the same order as found in CNS 11643-1992
Plane 2.
	As for the two duplicate hanzi in Big Five (as indicated in
the above tables), they have been placed into a compatibility zone in
ISO 10646-1:1993 (this allows for round-trip conversion). The mapping
is as follows:

  Big Five  ISO 10646-1:1993
  ^^^^^^^^  ^^^^^^^^^^^^^^^^
  0xC94A -> 0xFA0C
  0xDDFC -> 0xFA0D

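	Speaking of duplicate hanzi, Plane 1 of CNS 11643-1992
contains 213 classical radicals in rows 7 through 9 (0x2721 through
0x2939). However, 187 of them map directly to hanzi code points in
Planes 1, 2, and 3 (and naturally to Big Five). Below is a detailed
mapping of these 213 radicals (a parenthesized digit after a CNS 11643
code gives its plane, with Plane 1 implied when the digit is absent;
"******" means that there is no corresponding code):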

  Radical   CNS 11643   Big Five    Radical   CNS 11643   Big Five
  ^^^^^^^   ^^^^^^^^^   ^^^^^^^^    ^^^^^^^   ^^^^^^^^^   ^^^^^^^^
  0x2721 -> 0x4421      0xA440      0x282E -> 0x4678      0xA5D8
  0x2722 -> 0x2121 (3)  ******      0x282F -> 0x4679      0xA5D9
  0x2723 -> 0x2122 (3)  0xC6BF      0x2830 -> 0x467A      0xA5DA
  0x2724 -> 0x2123 (3)  0xC6C0      0x2831 -> 0x467B      0xA5DB
  0x2725 -> 0x4422      0xA441      0x2832 -> 0x467C      0xA5DC
  0x2726 -> 0x2124 (3)  0xC6C1      0x2833 -> 0x2167 (2)  0xC9A8
  0x2727 -> 0x4428      0xA447      0x2834 -> 0x467D      0xA5DD
  0x2728 -> ******      0xC6C2      0x2835 -> 0x467E      0xA5DE
  0x2729 -> 0x4429      0xA448      0x2836 -> 0x4721      0xA5DF
  0x272A -> 0x442A      0xA449      0x2837 -> 0x484C      0xA6CB
  0x272B -> 0x442B      0xA44A      0x2838 -> 0x484D      0xA6CC
  0x272C -> 0x442C      0xA44B      0x2839 -> 0x484E      0xA6CD
  0x272D -> 0x2127 (3)  0xC6C3      0x283A -> 0x484F      0xA6CE
  0x272E -> 0x2128 (3)  0xC6C4      0x283B -> 0x2269 (2)  0xCA49
  0x272F -> ******      0xC6C5      0x283C -> 0x4850      0xA6CF
  0x2730 -> 0x442D      0xA44C      0x283D -> 0x4851      0xA6D0
  0x2731 -> 0x2123 (2)  0xC942      0x283E -> 0x4852      0xA6D1
  0x2732 -> 0x442E      0xA44D      0x283F -> 0x4854      0xA6D3
  0x2733 -> 0x4430      0xA44F      0x2840 -> 0x4855      0xA6D4
  0x2734 -> ******      0xC6C6      0x2841 -> 0x4856      0xA6D5
  0x2735 -> 0x4431      0xA450      0x2842 -> 0x4857      0xA6D6
  0x2736 -> 0x2124 (2)  0xC943      0x2843 -> 0x4858      0xA6D7
  0x2737 -> 0x2129 (3)  0xC6C7      0x2844 -> 0x485B      0xA6DA
  0x2738 -> 0x4432      0xA451      0x2845 -> 0x485C      0xA6DB
  0x2739 -> 0x4433      0xA452      0x2846 -> 0x485D      0xA6DC
  0x273A -> 0x212A (3)  0xC6C8      0x2847 -> 0x485E      0xA6DD
  0x273B -> 0x2125 (2)  0xC944      0x2848 -> 0x485F      0xA6DE
  0x273C -> 0x212B (3)  0xC6C9      0x2849 -> 0x4860      0xA6DF
  0x273D -> 0x4434      0xA453      0x284A -> 0x4861      0xA6E0
  0x273E -> 0x4447      0xA466      0x284B -> 0x4862      0xA6E1
  0x273F -> 0x212A (2)  0xC949      0x284C -> 0x4863      0xA6E2
  0x2740 -> 0x4448      0xA467      0x284D -> 0x226A (2)  0xCA4A
  0x2741 -> 0x4449      0xA468      0x284E -> 0x226F (2)  0xCA4F
  0x2742 -> 0x213A (3)  0xC6CA      0x284F -> 0x4865      0xA6E4
  0x2743 -> 0x444A      0xA469      0x2850 -> 0x4866      0xA6E5
  0x2744 -> 0x444B      0xA46A      0x2851 -> 0x4867      0xA6E6
  0x2745 -> 0x444C      0xA46B      0x2852 -> 0x4868      0xA6E7
  0x2746 -> 0x444D      0xA46C      0x2853 -> 0x2270 (2)  0xCA50
  0x2747 -> 0x213B (3)  0xC6CB      0x2854 -> 0x4B44      0xA8A3
  0x2748 -> 0x4450      0xA46F      0x2855 -> 0x4B45      0xA8A4
  0x2749 -> 0x4451      0xA470      0x2856 -> 0x4B46      0xA8A5
  0x274A -> 0x4452      0xA471      0x2857 -> 0x4B47      0xA8A6
  0x274B -> 0x4453      0xA472      0x2858 -> 0x4B48      0xA8A7
  0x274C -> 0x212B (2)  0xC94B      0x2859 -> 0x4B49      0xA8A8
  0x274D -> 0x4454      0xA473      0x285A -> 0x2524 (2)  0xCBA4
  0x274E -> 0x213C (3)  0xC6CC      0x285B -> 0x4B4A      0xA8A9
  0x274F -> 0x4456      0xA475      0x285C -> 0x4B4B      0xA8AA
  0x2750 -> 0x4457      0xA476      0x285D -> 0x4B4C      0xA8AB
  0x2751 -> 0x445A      0xA479      0x285E -> 0x4B4D      0xA8AC
  0x2752 -> 0x445B      0xA47A      0x285F -> 0x4B4E      0xA8AD
  0x2753 -> 0x213D (3)  0xC6CD      0x2860 -> 0x4B4F      0xA8AE
  0x2754 -> 0x213E (3)  0xC6CE      0x2861 -> 0x4B50      0xA8AF
  0x2755 -> 0x213F (3)  0xC6CF      0x2862 -> 0x4B51      0xA8B0
  0x2756 -> 0x445C      0xA47B      0x2863 -> 0x272F (3)  0xC6D6
  0x2757 -> 0x445D      0xA47C      0x2864 -> 0x4B57      0xA8B6
  0x2758 -> 0x445E      0xA47D      0x2865 -> 0x4B5C      0xA8BB
  0x2759 -> 0x2140 (3)  0xC6D0      0x2866 -> 0x4B5D      0xA8BC
  0x275A -> 0x2142 (3)  0xC6D1      0x2867 -> 0x4B5E      0xA8BD
  0x275B -> 0x212C (2)  0xC94C      0x2868 -> 0x4F5A      0xAAF7
  0x275C -> 0x4540      0xA4DF      0x2869 -> 0x4F5B      0xAAF8
  0x275D -> 0x4541      0xA4E0      0x286A -> 0x4F5C      0xAAF9
  0x275E -> 0x4542      0xA4E1      0x286B -> 0x4F5D      0xAAFA
  0x275F -> 0x4543      0xA4E2      0x286C -> 0x2A7D (3)  0xC6D7
  0x2760 -> 0x4545      0xA4E4      0x286D -> 0x4F63      0xAB41
  0x2761 -> 0x2167 (3)  0xC6D2      0x286E -> 0x4F64      0xAB42
  0x2762 -> 0x4546      0xA4E5      0x286F -> 0x4F65      0xAB43
  0x2763 -> 0x4547      0xA4E6      0x2870 -> 0x4F66      0xAB44
  0x2764 -> 0x4548      0xA4E7      0x2871 -> 0x5372      0xADB1
  0x2765 -> 0x4549      0xA4E8      0x2872 -> 0x5373      0xADB2
  0x2766 -> 0x2169 (3)  0xC6D3      0x2873 -> 0x5374      0xADB3
  0x2767 -> 0x454A      0xA4E9      0x2874 -> 0x5375      0xADB4
  0x2768 -> 0x454B      0xA4EA      0x2875 -> 0x5376      0xADB5
  0x2769 -> 0x454C      0xA4EB      0x2876 -> 0x5377      0xADB6
  0x276A -> 0x454D      0xA4EC      0x2877 -> 0x5378      0xADB7
  0x276B -> 0x454E      0xA4ED      0x2878 -> 0x5379      0xADB8
  0x276C -> 0x454F      0xA4EE      0x2879 -> 0x537A      0xADB9
  0x276D -> 0x4550      0xA4EF      0x287A -> 0x537B      0xADBA
  0x276E -> 0x213F (2)  0xC95F      0x287B -> 0x537C      0xADBB
  0x276F -> 0x4551      0xA4F0      0x287C -> 0x586B      0xB0A8
  0x2770 -> 0x4552      0xA4F1      0x287D -> 0x586C      0xB0A9
  0x2771 -> 0x4553      0xA4F2      0x287E -> 0x586D      0xB0AA
  0x2772 -> 0x4554      0xA4F3      0x2921 -> 0x334C (2)  0xD449
  0x2773 -> 0x2141 (2)  0xC961      0x2922 -> 0x586E      0xB0AB
  0x2774 -> 0x4555      0xA4F4      0x2923 -> 0x334D (2)  0xD44A
  0x2775 -> 0x4556      0xA4F5      0x2924 -> 0x586F      0xB0AC
  0x2776 -> 0x4557      0xA4F6      0x2925 -> 0x5870      0xB0AD
  0x2777 -> 0x4558      0xA4F7      0x2926 -> 0x5E23      0xB3BD
  0x2778 -> 0x4559      0xA4F8      0x2927 -> 0x5E24      0xB3BE
  0x2779 -> 0x2142 (2)  0xC962      0x2928 -> 0x5E25      0xB3BF
  0x277A -> 0x455A      0xA4F9      0x2929 -> 0x5E26      0xB3C0
  0x277B -> 0x455B      0xA4FA      0x292A -> 0x5E27      0xB3C1
  0x277C -> 0x455C      0xA4FB      0x292B -> 0x5E28      0xB3C2
  0x277D -> 0x455D      0xA4FC      0x292C -> 0x6327      0xB6C0
  0x277E -> 0x4668      0xA5C8      0x292D -> 0x6328      0xB6C1
  0x2821 -> 0x4669      0xA5C9      0x292E -> 0x6329      0xB6C2
  0x2822 -> 0x466A      0xA5CA      0x292F -> 0x4155 (2)  0xDCB0
  0x2823 -> 0x466B      0xA5CB      0x2930 -> 0x4875 (2)  0xE0EF
  0x2824 -> 0x466C      0xA5CC      0x2931 -> 0x676F      0xB9A9
  0x2825 -> 0x466D      0xA5CD      0x2932 -> 0x6770      0xB9AA
  0x2826 -> 0x466E      0xA5CE      0x2933 -> 0x6771      0xB9AB
  0x2827 -> 0x4670      0xA5D0      0x2934 -> 0x6B7C      0xBBF3
  0x2828 -> 0x4674      0xA5D4      0x2935 -> 0x6B7D      0xBBF4
  0x2829 -> 0x225B (3)  0xC6D4      0x2936 -> 0x702F      0xBEA6
  0x282A -> 0x225C (3)  0xC6D5      0x2937 -> 0x733E      0xC073
  0x282B -> 0x4675      0xA5D5      0x2938 -> 0x733F      0xC074
  0x282C -> 0x4676      0xA5D6      0x2939 -> 0x6142 (2)  0xEFB6
  0x282D -> 0x4677      0xA5D7


4.4: KOREAN

	The 268 duplicate hanja in KS C 5601-1992 can cause problems
when converting to and from other CJK character sets. When converting
from KS C 5601-1992, two or more hanja can collapse into a single code
point. When converting these 268 hanja to KS C 5601-1992, a decision
about which KS C 5601-1992 code point to map to must be made. The only
exception to this is mapping to and from ISO 10646-1:1993. That
standard encodes these 268 duplicate hanja in a compatibility zone,
namely from 0xF900 through 0xFA0B.
	The following is a listing of 262 hanja that map to two or
more code points (four map to three code points, and one maps to four:
a total of 268 redundantly-encoded hanja) in KS C 5601-1992:

  Standard  Extra     Standard  Extra     Standard  Extra
  ^^^^^^^^  ^^^^^     ^^^^^^^^  ^^^^^     ^^^^^^^^  ^^^^^
  0x4A39 -> 0x4D4F    0x5573 -> 0x6631    0x573C -> 0x6B29
  0x4B3D -> 0x7A22    0x5574 -> 0x6633    0x573E -> 0x6B3A
  0x4C38 -> 0x7A66    0x5575 -> 0x6637    0x573F -> 0x6B3B
  0x4C5A -> 0x4B56    0x5576 -> 0x6638    0x5740 -> 0x6B3D
  0x4C78 -> 0x5050    0x5579 -> 0x663C    0x5741 -> 0x6B41
  0x4D7A -> 0x4E2D    0x557B -> 0x6646    0x5743 -> 0x6B42
  0x4E29 -> 0x7C29    0x557C -> 0x6647    0x5744 -> 0x6B46
  0x4F23 -> 0x4F7B    0x557E -> 0x6652    0x5745 -> 0x6B47
  0x4F4F -> 0x5022    0x5621 -> 0x6656    0x5747 -> 0x6B4C
            0x5038    0x5622 -> 0x6659    0x5748 -> 0x6B4F
  0x5142 -> 0x4B50    0x5623 -> 0x665F    0x5749 -> 0x6B50
  0x5151 -> 0x505D    0x5624 -> 0x6661    0x574A -> 0x6B51
  0x5159 -> 0x547C    0x5625 -> 0x6665    0x574C -> 0x6B58
  0x5167 -> 0x552B    0x5626 -> 0x6664    0x574D -> 0x5270
  0x522F -> 0x5155    0x5627 -> 0x6666    0x574E -> 0x5271
  0x5233 -> 0x657C    0x5628 -> 0x6668    0x574F -> 0x5272
  0x5234 -> 0x6644    0x562A -> 0x666A    0x5750 -> 0x5273
  0x5235 -> 0x664A    0x562B -> 0x666B    0x5752 -> 0x5274
  0x5236 -> 0x665C    0x562D -> 0x666F    0x5753 -> 0x5275
  0x5237 -> 0x6676    0x562E -> 0x6671    0x5754 -> 0x5277
  0x523A -> 0x6677    0x562F -> 0x6675    0x5755 -> 0x5278
  0x523B -> 0x5638    0x5631 -> 0x6679    0x5757 -> 0x6C26
            0x672C    0x5633 -> 0x6721    0x5759 -> 0x6C27
  0x5241 -> 0x564D    0x5634 -> 0x6726    0x575B -> 0x6C2A
  0x5263 -> 0x6871    0x5635 -> 0x6729    0x575D -> 0x6C30
  0x526E -> 0x6A74    0x5637 -> 0x672A    0x575E -> 0x6C31
  0x526F -> 0x6B2A    0x563A -> 0x672D    0x5762 -> 0x6C35
  0x527A -> 0x6C32    0x563B -> 0x6730    0x5765 -> 0x6C38
  0x527B -> 0x6C49    0x563C -> 0x673F    0x5767 -> 0x6C3A
  0x527C -> 0x6C4A    0x563E -> 0x6746    0x576A -> 0x6C40
  0x527E -> 0x7331    0x5640 -> 0x6747    0x576B -> 0x6C41
  0x5321 -> 0x552E    0x5642 -> 0x674B    0x576C -> 0x6C45
  0x5358 -> 0x7738    0x5643 -> 0x674D    0x576E -> 0x6C46
  0x536B -> 0x7748    0x5644 -> 0x674F    0x5770 -> 0x6C55
  0x5378 -> 0x7674    0x5645 -> 0x6750    0x5772 -> 0x6C5D
  0x5441 -> 0x5466    0x5647 -> 0x6753    0x5773 -> 0x6C5E
  0x5457 -> 0x7753    0x5649 -> 0x675F    0x5774 -> 0x6C61
  0x547A -> 0x5154    0x564A -> 0x6764    0x5776 -> 0x6C64
  0x547B -> 0x5158    0x564B -> 0x6766    0x5777 -> 0x6C67
  0x547D -> 0x515B    0x564C -> 0x523E    0x5778 -> 0x6C68
  0x547E -> 0x515C    0x564F -> 0x5242    0x5779 -> 0x6C77
  0x5521 -> 0x515D    0x5650 -> 0x5243    0x577A -> 0x6C78
  0x5522 -> 0x515E    0x5653 -> 0x5244    0x577C -> 0x6C7A
  0x5523 -> 0x515F    0x5654 -> 0x5246    0x5821 -> 0x6D21
  0x5524 -> 0x5160    0x5655 -> 0x5247    0x5822 -> 0x6D22
  0x5526 -> 0x5163    0x5656 -> 0x5248    0x5823 -> 0x6D23
  0x5527 -> 0x5164    0x5657 -> 0x5249    0x5A72 -> 0x5B64
  0x5528 -> 0x5165    0x5658 -> 0x524A    0x5C56 -> 0x5D25
  0x552A -> 0x5166    0x565A -> 0x524B    0x5C5F -> 0x7870
  0x552C -> 0x5168    0x565B -> 0x524D    0x5C74 -> 0x5D55
  0x552D -> 0x5169    0x565C -> 0x524E    0x5D41 -> 0x5B45
  0x552F -> 0x516A    0x565E -> 0x524F    0x5F2F -> 0x616D
  0x5530 -> 0x516B    0x565F -> 0x5250    0x5F52 -> 0x6D6E
  0x5531 -> 0x516D    0x5660 -> 0x5251    0x5F5D -> 0x5F61
  0x5534 -> 0x516F    0x5661 -> 0x5252    0x5F63 -> 0x5E7E
  0x5535 -> 0x5170    0x5662 -> 0x5253    0x6063 -> 0x612D
  0x5536 -> 0x5172    0x5663 -> 0x5254              0x6672
  0x5539 -> 0x5176    0x5665 -> 0x5255    0x607D -> 0x5F68
  0x553D -> 0x517A    0x5666 -> 0x5256    0x6163 -> 0x574B
  0x5540 -> 0x517C    0x5667 -> 0x5257              0x6B52
  0x5541 -> 0x517D    0x566B -> 0x5259    0x6226 -> 0x5E7C
  0x5543 -> 0x517E    0x566C -> 0x525A    0x6326 -> 0x6429
  0x5544 -> 0x5222    0x566F -> 0x525E    0x635B -> 0x723D
  0x5545 -> 0x5223    0x5670 -> 0x525F    0x6427 -> 0x727A
  0x5546 -> 0x5227    0x5671 -> 0x5261    0x6442 -> 0x6777
  0x5547 -> 0x5228    0x5674 -> 0x5262    0x6445 -> 0x5162
  0x5548 -> 0x5229    0x5675 -> 0x6867              0x5525
  0x5549 -> 0x522A    0x5676 -> 0x6868              0x6879
  0x554D -> 0x522B    0x5677 -> 0x6870    0x6534 -> 0x652E
  0x554E -> 0x522D    0x5679 -> 0x6877    0x6636 -> 0x6C2F
  0x5552 -> 0x5232    0x567A -> 0x687B    0x6728 -> 0x6071
  0x5553 -> 0x6531    0x567B -> 0x687E    0x6856 -> 0x6A41
  0x5554 -> 0x6532    0x567E -> 0x6927    0x6C36 -> 0x5764
  0x5555 -> 0x6539    0x5721 -> 0x692C    0x6C56 -> 0x666C
  0x5557 -> 0x653B    0x5723 -> 0x694C    0x6D29 -> 0x7427
  0x5558 -> 0x653C    0x5724 -> 0x5264    0x6D33 -> 0x6E5B
  0x5559 -> 0x6544    0x5726 -> 0x5265    0x6F37 -> 0x746E
  0x555D -> 0x654E    0x5727 -> 0x5266    0x7263 -> 0x6375
  0x555E -> 0x6550    0x5728 -> 0x5267    0x7333 -> 0x4B67
  0x555F -> 0x6552    0x5729 -> 0x5268    0x7351 -> 0x5F33
  0x5561 -> 0x6556    0x572B -> 0x5269    0x742C -> 0x7676
  0x5564 -> 0x657A    0x572C -> 0x526A    0x7658 -> 0x6421
  0x5565 -> 0x657B    0x5730 -> 0x526B    0x7835 -> 0x5C25
  0x5566 -> 0x657E    0x5731 -> 0x6A65    0x786C -> 0x785B
  0x5569 -> 0x6621    0x5733 -> 0x6A77    0x7932 -> 0x5D74
  0x556B -> 0x6624    0x5735 -> 0x6A7C    0x7A3C -> 0x7A21
  0x556C -> 0x6627    0x5736 -> 0x6A7E    0x7B29 -> 0x6741
  0x556F -> 0x662D    0x5738 -> 0x6B24    0x7C41 -> 0x4D68
  0x5571 -> 0x662F    0x573A -> 0x6B27    0x7D3B -> 0x6977
  0x5572 -> 0x6630

The above table represents a weekend of my time (but time well spent,
in my opinion).
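
	As a minimal sketch of the two conversion directions described
at the start of this section, the following Python fragment uses three
rows copied from the table above (a real converter would need all 262).
The names STANDARD_TO_EXTRA, collapse, and candidates are illustrative
only, not part of any existing tool.

```python
# Three rows copied from the table above: standard code point -> extras.
STANDARD_TO_EXTRA = {
    0x4A39: [0x4D4F],
    0x4F4F: [0x5022, 0x5038],          # one hanja, three code points
    0x6445: [0x5162, 0x5525, 0x6879],  # one hanja, four code points
}

# Inverted view: every extra code point -> its standard code point.
EXTRA_TO_STANDARD = {
    extra: standard
    for standard, extras in STANDARD_TO_EXTRA.items()
    for extra in extras
}


def collapse(code: int) -> int:
    """Converting FROM KS C 5601-1992: fold an extra onto its standard."""
    return EXTRA_TO_STANDARD.get(code, code)


def candidates(standard: int) -> list[int]:
    """Converting TO KS C 5601-1992: every code point for one hanja."""
    return [standard] + STANDARD_TO_EXTRA.get(standard, [])


print(hex(collapse(0x5038)))
print([hex(code) for code in candidates(0x6445)])
```

Which of the candidates to choose when converting to KS C 5601-1992 is
a policy decision; always picking the first (standard) entry is the
simplest choice, but it discards the reading implied by the original
position.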


4.5: ISO 10646-1:1993

	The Chinese character subset of ISO 10646-1:1993
has excellent round-trip conversion capability with the various
national character sets. Those national character sets with duplicate
characters, such as KS C 5601-1992 (268 hanja) and Big Five (2 hanzi),
have corresponding code points in ISO 10646-1:1993 within
a compatibility zone. See Sections 4.3 and 4.4 for more details.
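
	The round-trip property can be observed with a short Python
sketch. It assumes Python's built-in "euc_kr" codec, and it uses one
duplicate pair from Section 4.4 (0x5142 and 0x4B50) in EUC form, that
is, with the high bit of each byte set. The helper in_compat_zone is an
illustrative name.

```python
# One redundantly-encoded hanja from Section 4.4, in EUC-KR form:
# KS C 5601-1992 0x5142 (standard) and 0x4B50 (extra), high bits set.
standard = bytes([0x51 | 0x80, 0x42 | 0x80])
extra = bytes([0x4B | 0x80, 0x50 | 0x80])

u_standard = standard.decode("euc_kr")
u_extra = extra.decode("euc_kr")


def in_compat_zone(ch: str) -> bool:
    """True if ch lies in the compatibility zone 0xF900-0xFA0B."""
    return 0xF900 <= ord(ch) <= 0xFA0B


# The two KS C 5601-1992 code points stay distinct after conversion ...
print(f"U+{ord(u_standard):04X} compat={in_compat_zone(u_standard)}")
print(f"U+{ord(u_extra):04X} compat={in_compat_zone(u_extra)}")

# ... so converting back recovers the original bytes exactly.
assert u_standard.encode("euc_kr") == standard
assert u_extra.encode("euc_kr") == extra
```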
	Other issues regarding ISO 10646-1:1993 have to do with proper
character rendering (that is, how characters are displayed, printed,
or otherwise imaged). Many (sometimes) subtle character form
differences have been collapsed under ISO 10646-1:1993. Language or
locale was not one of the factors used in performing Han Unification.
This means that it is nearly impossible to create a single ISO
10646-1:1993 font that meets the character form criteria of each of
the four CJK locales. An ISO 10646-1:1993 code point alone is not
enough information to render a Chinese character. If the font was
specifically designed for a single locale, this is not a problem, but
if there is any CJK intent, text must be flagged for language or
locale.


4.6: UNICODE

	One of the most interesting (and major) differences between
the current three flavors of Unicode is the number and arrangement of
pre-combined hangul. The following table provides a summary of the
differences:

  Unicode       Number of Pre-combined Hangul   UCS-2 Ranges
  ^^^^^^^       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^   ^^^^^^^^^^^^
  Version 1.0   2,350 Basic Hangul              0x3400-0x3D2D

  Version 1.1   2,350 Basic Hangul              0x3400-0x3D2D
                1,930 Supplemental Hangul A     0x3D2E-0x44B7
                2,376 Supplemental Hangul B     0x44B8-0x4DFF

  Version 2.0  11,172 Hangul                    0xAC00-0xD7A3

Of the above three versions, the most controversial is Version 2.0.
Why? Because it is located in the user-defined range of Unicode
(O-Zone: 16,384 code points in 0xA000-0xDFFF), and occupies
approximately two-thirds of its space.
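
	The Version 2.0 figures can be checked with a little
arithmetic. The Python sketch below does so, and also illustrates the
regular layout of the 11,172 hangul: 19 initial consonants times 21
vowels times 28 finals (counting "no final"). That layout is a property
of Unicode Version 2.0 that is not spelled out above, and the function
name syllable is illustrative.

```python
HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3
O_ZONE_FIRST, O_ZONE_LAST = 0xA000, 0xDFFF

hangul_count = HANGUL_LAST - HANGUL_FIRST + 1
o_zone_size = O_ZONE_LAST - O_ZONE_FIRST + 1

# Number of hangul, size of the O-Zone, and the fraction occupied.
print(hangul_count, o_zone_size, round(hangul_count / o_zone_size, 2))

LEADS, VOWELS, TAILS = 19, 21, 28   # tail index 0 means "no final"


def syllable(lead: int, vowel: int, tail: int = 0) -> str:
    """Return the pre-combined hangul for the given jamo indexes."""
    return chr(HANGUL_FIRST + (lead * VOWELS + vowel) * TAILS + tail)


print(syllable(0, 0))       # first pre-combined hangul (0xAC00)
print(syllable(18, 20, 27)) # last pre-combined hangul (0xD7A3)
```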
	The information in the above table is courtesy of the
following useful document:

  ftp://unicode.org/pub/MappingTables/EastAsiaMaps/Hangul-Codes.txt

The same file is also mirrored at the following URL:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt


4.7: CODE CONVERSION TIPS

	There are two types of conversions that can be performed. The
first type is converting between different encodings for the same
character set. This usually works without problems (but not always). The
second type is converting from one character set to another (it is not
usually relevant whether the underlying encoding has changed or not).
This usually involves the handling of characters that are in one
character set, but not the other. So, what to do?
	I suggest JConv for handling Japanese code conversion (this
means converting between JIS, Shift-JIS, and EUC encodings). This is
in the category of different encodings for the same character set. The
following URLs provide executables or source code:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/mac/jconv-30.hqx
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/mac/jconv-dd-181.hqx
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/dos/jconv.exe
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/src/jconv.c

There are other programs available that do the same basic thing as
JConv, such as kc and nkf. They are available at the following URL:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/unix/
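
	For readers who simply want to see the first type of
conversion in action, here is a minimal Python sketch. It does not
reproduce JConv, kc, or nkf; it assumes Python's built-in codecs named
"iso2022_jp" (JIS), "shift_jis", and "euc_jp", and the helper name
convert is illustrative.

```python
text = "日本語"

jis = text.encode("iso2022_jp")   # 7-bit, uses escape sequences
sjis = text.encode("shift_jis")
euc = text.encode("euc_jp")

for name, data in (("JIS", jis), ("Shift-JIS", sjis), ("EUC", euc)):
    print(f"{name:10} {data.hex(' ')}")


def convert(data: bytes, source: str, target: str) -> bytes:
    """Re-encode the same characters from one encoding to another."""
    return data.decode(source).encode(target)


# Same character set, different encodings: every conversion round-trips.
assert convert(sjis, "shift_jis", "euc_jp") == euc
assert convert(euc, "euc_jp", "iso2022_jp") == jis
assert convert(jis, "iso2022_jp", "shift_jis") == sjis
```

Because all three encodings cover the same character set here, no
character can be lost; that guarantee disappears as soon as the source
and target character sets differ.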

	For software and tables that handle Chinese code conversion
(this includes conversion to and from Japanese), I suggest browsing at
the following URLs:

  ftp://etlport.etl.go.jp/pub/iso-2022-cn/convert/
  ftp://ftp.ifcss.org/pub/software/dos/convert/
  ftp://ftp.ifcss.org/pub/software/mac/convert/
  ftp://ftp.ifcss.org/pub/software/ms-win/convert/
  ftp://ftp.ifcss.org/pub/software/unix/convert/
  ftp://ftp.ifcss.org/pub/software/vms/convert/
  ftp://ftp.net.tsinghua.edu.cn/pub/Chinese/convert/
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/
  ftp://ftp.seed.net.tw/Pub/Chinese/DOS/code-convert/
  http://www.yajima.kuis.kyoto-u.ac.jp/staffs/yasuoka/CJK.html

The last URL has FTP links to tables created by Koichi Yasuoka
(yasuoka@kudpc.kyoto-u.ac.jp).
	The following URLs provide utilities or tables for converting
between various Korean encodings (the last two represent the same file):

  ftp://cair-archive.kaist.ac.kr/pub/hangul/code/
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/non-hangul-codes.txt
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/hangul-codes.txt
  ftp://unicode.org/pub/MappingTables/EastAsiaMaps/Hangul-Codes.txt

A popular Korean code conversion utility seems to be "hcode" by
June-Yub Lee (jylee@cims.nyu.edu).
	Finally, the following URLs provide many Unicode- and CJK-
related mapping tables:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/map/
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/unicode/
  ftp://unicode.org/pub/MappingTables/
  http://www.yajima.kuis.kyoto-u.ac.jp/staffs/yasuoka/CJK.html

Note that the official and authoritative Unicode mapping tables (from
Unicode values to various international, national and vendor
standards) are maintained by the Unicode Consortium at the following
URL:

  ftp://unicode.org/pub/MappingTables/

Version 2.0 of "The Unicode Standard" (to be published by Addison-
Wesley shortly) will include these mapping tables on CD-ROM.


PART 5: CJK-CAPABLE OPERATING SYSTEMS

	The first step in being able to display CJK text is to obtain
an operating system that handles such text (or an application that
sets up its own CJK-capable environment). Below I describe how
different types of machines can handle CJK text.
	Actually, for the first few releases of CJK.INF, these
subsections will be far from complete (some may even be empty!). The
purpose of CJK.INF is to provide detailed information on character set
standards and encoding systems, so I therefore consider this sort of
information secondary.


5.1: MS-DOS

	I am not aware of any CJK-capable MS-DOS operating system, but
localized versions do exist. CJK support has been introduced with
Microsoft's Windows operating system (see Section 5.2).


5.2: WINDOWS

	Microsoft has CJK versions of its Windows operating system
available. The latest versions of their Windows operating system are
called Windows 95 and Windows NT. Windows 95 supports the same
character sets and encodings as in Windows Version 3.1 -- Windows NT
supports Unicode (ISO 10646-1:1993). Contact Microsoft Corporation for
more details. The URL of their WWW Home Page is:

  http://www.microsoft.com/

Nadine Kano's "Developing International Software for Windows 95 and
Windows NT" provides abundant reference material for how CJK is
supported in Windows 95 and Windows NT. Check it out.
	TwinBridge is a package that adds CJK functionality to non-CJK
Windows. Demo versions of TwinBridge for Japanese and Chinese are at
the following URLs:

  ftp://ftp.netcom.com/pub/tw/twinbrg/Japanese/demo/tbjdemo.zip
  ftp://ftp.netcom.com/pub/tw/twinbrg/Chinese/demo/tbcdemo.zip

	Another useful CJK add-on for Windows 95 is NJWIN (see Section
7.10) by Hongbo Data Systems.


5.3: MACINTOSH

	Macintosh is well-known as a computer that was designed to
handle multilingual texts. There are currently fully-localized
operating systems available for Japanese (KanjiTalk), Chinese
(simplified and traditional available), and Korean (HangulTalk). In
addition, Apple has developed "Language Kits" (*LK) for Chinese (CLK)
and Japanese (JLK). A Korean Language Kit (KLK) will be released
shortly.
	These localized operating systems can usually be installed
together in order to make your system CJK-capable.
	The common portion of these CJK-capable operating systems is a
technology Apple calls "WorldScript II" ("WorldScript I" is for one-
byte scripts). It provides the basic one- and two-byte functionality.


5.4: UNIX AND X WINDOWS

	The typical encoding system used on UNIX and X Windows is EUC
(see Section 3.2). Many systems, such as IBM's AIX, can be configured
to handle both EUC and Shift-JIS (for Japanese). In addition, X11R6 (X
Window System, Version 11, Release 6) has many CJK-capable features.
	If you have a fast PC and a good amount of RAM (more than
4MB), you should consider replacing MS-DOS (and Microsoft Windows,
too, if you have it) with Linux, which is a full-blown UNIX operating
system that runs on Intel processors. You can even run X Windows
(X11R6). "Running Linux" by Matt Welsh and Lar Kaufman is an excellent
guide to installing and using Linux. The companion volume, "Linux
Network Administrator's Guide" by Olaf Kirch is also useful. Because
there is a fine line -- or no line at all -- between a user and System
Administrator when using Linux, "Essential System Administration"
Second Edition by AEleen Frisch is a must-have.
	Linux and Linux information are available at the following
URLs:

  ftp://sunsite.unc.edu/pub/Linux/
  http://sunsite.unc.edu/mdw/linux.html

I personally use Linux, and find it quite useful and powerful. My bias
comes from being a UNIX user. But, you can't beat the price (free),
and all of my favorite text-manipulation tools (such as Perl) are
readily available.


5.5: OTHERS

	No information yet.


PART 6: CJK TEXT AND INTERNET SERVICES

	Part 5 described how CJK text is handled on a machine
internally, but this part goes into the implications of handling such
text externally, namely for information interchange purposes. This
boils down to handling CJK text on Internet services.
	For more detailed information on how these and other Internet
services are used, I suggest "The Whole Internet User's Guide &
Catalog" by Ed Krol. For more information on setting up and
maintaining these and other Internet services, I suggest "Managing
Internet Information Services" by Cricket Liu et al.


6.1: ELECTRONIC MAIL

	The most basic Internet service is electronic mail (henceforth
to be called "e-mail"), which is virtually guaranteed to be available
to all users regardless of their system.
	Several Internet standards (called RFCs, short for Request For
Comments) have been developed to describe how CJK text is to be handled
over e-mail systems (see Section A.3.4).
	The bottom-line is that most e-mail systems do not support
8-bit characters (that is, bytes that have their 8th bit set). Some do
offer 8-bit support, but you can never know what path your e-mail
might take while en route to its recipient. This means that 7-bit ISO
2022 (or equivalent) is the ideal encoding to use when sending CJK
text through e-mail. If your operating system processes another
encoding system, you must convert from that encoding to one that is
compatible with 7-bit ISO 2022.
	However, even 7-bit ISO 2022 encoding can get mangled by
mail-routing software -- the escape character, sometimes even part of
the escape sequence (meaning more than just the escape character), is
stripped. The JConv tool described in Section 4.7 restores stripped
escape sequences for Japanese 7-bit ISO 2022.
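
	To give a feel for how such a repair can work, here is a
minimal Python sketch. It is NOT JConv's algorithm; it is a simple
state machine written for illustration, and the name restore_escapes
is hypothetical. It assumes only the four common escape sequences
(ESC $ B, ESC $ @, ESC ( B, and ESC ( J) and only the escape character
itself being stripped. Ordinary ASCII text that happens to contain "$B"
or "$@" would be misread, so treat it as a heuristic.

```python
ESC = b"\x1b"
TWO_BYTE_IN = (b"$B", b"$@")   # remainder of ESC $ B and ESC $ @
ONE_BYTE_IN = (b"(B", b"(J")   # remainder of ESC ( B and ESC ( J


def restore_escapes(data: bytes) -> bytes:
    """Re-insert escape characters stripped from Japanese 7-bit ISO 2022."""
    out = bytearray()
    two_byte = False
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if data[i:i + 1] == ESC:
            # This escape sequence survived intact; copy it through.
            seq = data[i:i + 3]
            out += seq
            if seq[1:] in TWO_BYTE_IN:
                two_byte = True
            elif seq[1:] in ONE_BYTE_IN:
                two_byte = False
            i += 3
        elif not two_byte and pair in TWO_BYTE_IN:
            out += ESC + pair          # entering two-byte mode
            two_byte = True
            i += 2
        elif two_byte and pair in ONE_BYTE_IN:
            out += ESC + pair          # returning to one-byte mode
            two_byte = False
            i += 2
        elif two_byte:
            out += pair                # one two-byte character
            i += 2
        else:
            out += data[i:i + 1]       # one one-byte character
            i += 1
    return bytes(out)


original = "日本語 text".encode("iso2022_jp")
stripped = original.replace(ESC, b"")
print(stripped)
print(restore_escapes(stripped) == original)
```

Checking for "(B" and "(J" only on character boundaries in two-byte
mode is what keeps the sketch from misreading the second byte of one
character and the first byte of the next.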
	If your mailing software is MIME-compliant, there is a means
to identify the character set and encoding of the message using the
"charset" parameter. Some valid "charset" values include the
following:

o iso-2022-jp     (see Section 3.1.3)
o iso-2022-jp-2   (see Section 3.1.3)
o iso-2022-kr     (see Section 3.1.4)
o iso-2022-cn     (see Section 3.1.5)
o iso-2022-cn-ext (see Section 3.1.5)
o iso-8859-1

Insertion of these values should happen automatically.
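
	For those composing messages programmatically, the following
Python sketch shows one way the "charset" parameter ends up in a
message. It assumes the MIMEText class from Python's standard "email"
package; the sample subject and body are made up for illustration.

```python
from email.mime.text import MIMEText

body = "日本語のメッセージ\n"
msg = MIMEText(body, "plain", "iso-2022-jp")
msg["Subject"] = "charset test"

# The MIME headers now identify the character set and encoding.
print(msg.get_content_charset())
print(msg["Content-Transfer-Encoding"])
print(msg.as_string())
```

The body is carried as 7-bit ISO 2022 text, so no 8-bit bytes are
handed to the mail system.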
	A last-ditch effort to send CJK text through e-mail is to use
uuencode or Base64 encoding (see Section 3.3.13). Base64 is something
that is usually done automatically by mailing software -- explicit
Base64 encoding is not common. The recipient must then run uudecode or
a Base64 decoder to get the original file (if such utilities are
available).
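
	The effect of Base64 on 8-bit CJK text can be seen in the
following Python sketch, which assumes the standard "base64" module
and the built-in "euc_kr" codec. The sample text is arbitrary.

```python
import base64

original = "한글".encode("euc_kr")      # every byte has its 8th bit set
armored = base64.b64encode(original)   # safe for 7-bit mail paths
restored = base64.b64decode(armored)   # what the recipient recovers

print(original.hex(" "))
print(armored.decode("ascii"))
print(restored.decode("euc_kr"))
```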


6.2: USENET NEWS

	Usenet News follows many of the same requirements as e-mail,
namely that 7-bit ISO 2022 encoding is ideal. However, some newsgroups
use specific encoding methods, such as:

  alt.chinese.text             (HZ encoding used for Chinese text)
  alt.chinese.text.big5        (Big Five encoding used for Chinese text)
  chinese.flame                (UTF-7)
  chinese.text.unicode         (UTF-8)

Also, the newsgroups in Korean (all begin with "han.*") use EUC (EUC-
KR) because the news-handling software in Korea has been designed to
handle eight-bit characters correctly. Mailing list versions of Korean
newsgroups are likely to use ISO-2022-KR encoding.
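
	The practical difference between these newsgroup encodings is
whether the bytes stay within 7 bits. The Python sketch below encodes
the same two hanzi four ways, assuming the built-in codecs named "hz",
"big5", "utf_7", and "utf_8".

```python
text = "中文"

samples = {
    "HZ": text.encode("hz"),
    "Big Five": text.encode("big5"),
    "UTF-7": text.encode("utf_7"),
    "UTF-8": text.encode("utf_8"),
}

for name, data in samples.items():
    seven_bit = all(byte < 0x80 for byte in data)
    print(f"{name:9} {data!r:32} 7-bit: {seven_bit}")
```

HZ and UTF-7 survive news-handling software that strips the 8th bit;
Big Five and UTF-8 do not.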
	One common problem with Usenet News is that the escape
characters used in 7-bit ISO 2022 encoding are sometimes stripped,
usually by the software used to post the article. This can be quite
annoying. There are programs available, such as JConv, that repair
such files by restoring the escape characters.
	Another common problem is news readers that do not allow
escape characters to function. One simple solution is to "pipe" the
article through a display command, such as "more," "page," "less," or
"cat." This is done by typing a "pipe" character (|) followed by the
command name anywhere within the article being displayed.


6.3: GOPHER

	The World-Wide Web (WWW) has almost eliminated the need for
using Gopher, so I won't discuss it here. Not that I don't appreciate
Gopher servers, but what I mean is that WWW browsing software permits
access to Gopher sites.


6.4: WORLD-WIDE WEB

	First, there are two types of WWW browsers available. The most
common type is the graphics-based browser (examples include Mosaic and
Netscape). Graphics-based browsers have the unfortunate requirement of
a TCP/IP (SLIP and PPP support these protocols) connection. Lynx and
the W3 client for Emacs, which are text-based browsers, can be run
from the host computer through a standard terminal connection. They
don't display all the pretty pictures that folks put into their WWW
documents, but you get all the text (this is, in many ways, a blessing
in disguise -- transferring graphics is what slows down graphics-based
browsers the most). When the W3 client is run using Mule, it becomes a
fully CJK-capable WWW browser. Both Lynx and the W3 client for Emacs
are freely available. A Japanese-capable Lynx is available at the
following URL:

  ftp://ftp.ipc.chiba-u.ac.jp/pub.asada/www/lynx/

There is also a WWW page that provides information on Japanese-capable
Lynx. Its URL is as follows:

  http://www.icsd6.tj.chiba-u.ac.jp/lynx/

	When WWW documents first came online, there was no method for
handling CJK character sets. This has, fortunately, changed. As of
this writing, two commercial WWW browsers support Japanese. They are
Infomosaic by Fujitsu Limited, and Netscape Navigator by Netscape
Communications Corporation (Version 1.1 added Japanese support). Both
are graphics-based browsers. The former can be ordered at the
following URL:

  http://www.fujitsu.co.jp/

The latter can be found at the following URLs:

  http://www.netscape.com/
  ftp://ftp.netscape.com/

	One can also use a delegate server to *filter* Japanese codes
to the one supported by your browser. It is also possible to
"Japanize" existing WWW browsers using assorted tools and patches.
Katsuhiko Momoi (momoi@tigger.stcloud.msus.edu) has authored an
excellent guide to Japanizing WWW browsers. Its URL is:

  http://condor.stcloud.msus.edu:20020/netscape.html

I *highly* suggest reading it.
	Japanese-capable WWW browsers support automatic detection of
the three Japanese encoding methods (JIS, Shift-JIS, and EUC). Hey,
but, what about support for the "C" and "K" of CJK? Attempting to
answer this question provides us an answer to another question: "What
is the best encoding method to use for CJK WWW documents?"
	Encoding methods such as EUC and Shift-JIS provide for mixing
only two character sets. This is because they provide no way to *flag*
or *tag* text for locale (character set) information. Without flagging
information, it is impossible to distinguish Japanese EUC from Chinese
or Korean EUC. However, the escape sequences used in 7-bit ISO 2022
encoding explicitly provide locale information. 7-bit ISO 2022 is
ideal for static documents, which is exactly what one finds on WWW.
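
	Both halves of this argument can be demonstrated in a few
lines of Python. The sketch assumes the built-in codecs "euc_jp",
"euc_kr", "gb2312", "iso2022_jp", and "iso2022_kr"; the Chinese sample
is hand-built because Python has no ISO-2022-CN codec. The ESCAPES
table and the function name locales are illustrative, and the table
covers only a few common designations.

```python
# The same four bytes are valid EUC in three locales, with no flag
# saying which one was intended.
raw = bytes.fromhex("c6fccbdc")
decoded = {codec: raw.decode(codec) for codec in ("euc_jp", "euc_kr", "gb2312")}
for codec, text in decoded.items():
    print(codec, text)

# 7-bit ISO 2022 escape sequences, by contrast, name the character set.
ESCAPES = {
    b"\x1b$B": "Japanese",
    b"\x1b$@": "Japanese",
    b"\x1b$)C": "Korean",
    b"\x1b$)A": "Chinese (PRC)",
}


def locales(data: bytes) -> list[str]:
    """List the locales whose escape sequences occur in data."""
    return sorted({name for esc, name in ESCAPES.items() if esc in data})


print(locales("日本語".encode("iso2022_jp")))
print(locales("한국어".encode("iso2022_kr")))
print(locales(b"\x1b$)A\x0eVPND\x0f"))
```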
	My personal recommendation (for the short-term) is to compose
WWW documents (also called HTML documents; HTML stands for Hyper Text
Markup Language) using 7-bit ISO 2022 encoding. The escape sequences
themselves act as explicit flags that indicate locale. Some WWW
clients are confused by 7-bit ISO 2022 encoding, but the products by
Netscape Communications and Fujitsu Limited prove that this can work.
See the following URL for a description of this problem:

  http://www.ntt.jp/japan/note-on-JP/LibWWW-patch.html

	Check out the following URLs for information on and proposals
for international support for WWW:

  http://www.ebt.com:8080/docs/multilingual-www.html
  http://www.w3.org/hypertext/WWW/International/Overview/

	There is currently an RFC in the works (called an Internet
Draft) to address the problem of internationalizing HTML by using
Unicode. It is very promising. The latest draft is available at the
following URLs:

  ftp://ds.internic.net/internet-drafts/draft-ietf-html-i18n-04.txt.Z
  ftp://ftp.isi.edu/internet-drafts/draft-ietf-html-i18n-04.txt
  ftp://munnari.oz.au/internet-drafts/draft-ietf-html-i18n-04.txt.Z
  ftp://nic.nordu.net/internet-drafts/draft-ietf-html-i18n-04.txt

Note that some have been compressed.


6.5: FILE TRANSFER TIPS

	Although CJK encoding systems such as Shift-JIS and EUC make
extensive use of 8-bit bytes, that does not mean that you need to
treat the data as binary. Such files are simply to be treated as text,
and should be transferred in text mode (for example, FTP's ASCII mode,
which is also called "Type A Transfer").
	When text files are transferred in binary mode (such as FTP's
BINARY mode, which is also called "Type I Transfer"), line termination
characters are left unaltered. For example, when transferring a text
file from UNIX to Macintosh, a text transfer will translate the UNIX
newline (0x0A) characters to Macintosh carriage return (0x0D)
characters, but a binary transfer will make no such modifications.
Text-style conversion is typically desired.
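
	If a file has already been transferred in binary mode, the
line termination characters can be fixed afterwards. The Python sketch
below performs the UNIX-to-Macintosh translation on Shift-JIS text; the
function name unix_to_mac is illustrative. A plain byte replacement is
safe here because 0x0A and 0x0D never occur as part of a two-byte
character in Shift-JIS or EUC.

```python
def unix_to_mac(data: bytes) -> bytes:
    """Translate UNIX newlines (0x0A) to Macintosh returns (0x0D)."""
    return data.replace(b"\x0a", b"\x0d")


sample = "日本語\nテキスト\n".encode("shift_jis")
converted = unix_to_mac(sample)

print(sample.hex(" "))
print(converted.hex(" "))
```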
	The most common types of files that need to be handled as
binary include tar archives (*.tar), compressed files (*.Z, *.gz,
*.zip, *.zoo, *.lzh, and so on), and executables (*.exe, *.bin, and so
on).


PART 7: CJK TEXT HANDLING SOFTWARE

	This section describes various CJK-capable software packages.
I expect this section to grow with future versions of this document. I
define "CJK-capable" as being able to support Chinese, Japanese, and
Korean text.
	The descriptions I provide below are intentionally short. You
are encouraged to use the information pointers to obtain further
information or the software itself.


7.1: MULE

	Mule (multilingual enhancement to GNU Emacs), written by
Kenichi Handa (handa@etl.go.jp), is the first (and only?) CJK-capable
editor for UNIX systems, and is freely available under the terms of
the GNU General Public License. Mule was developed from Nemacs
(Nihongo Emacs).
	Mule is available at the following URL:

  ftp://etlport.etl.go.jp/pub/mule/

	Mule, beginning with Version 2.2, includes handy utilities
(any2ps and m2ps) for printing files in any of the encodings supported
by Mule (which is a lot of encodings, by the way). These programs use
BDF fonts. See the beginning of Part 2 for a list of URLs that have
CJK BDF fonts.
	GNU Emacs is a fine editor, and Mule takes it several steps
further by providing multilingual support. I personally use Mule
together with SKK (for Japanese input) -- it is a superb combination.


7.2: CNPRINT

	CNPRINT, developed by Yidao Cai (cai@neurophys.wisc.edu), is a
utility to print CJK text (or convert it to a PostScript file), and is
available for MS-DOS, VMS, and UNIX systems. CNPRINT supports a wide
range of encoding methods.
	CNPRINT is available at the following URLs:

  ftp://ftp.ifcss.org/pub/software/{dos,unix,vms}/print/
  ftp://neurophys.wisc.edu/[public.cn]/


7.3: MASS

	MASS (Multilingual Application Support Service), developed at
the National University of Singapore, is a suite of software tools
that speed and ease the development of UNIX-based CJK (actually, more
than just CJK) applications. It supports a wide variety of character
sets and encodings, including ISO 10646-1:1993 (UCS-2, UTF-7, and
UTF-8), EACC, and CCCII.
	More information on MASS, including contact information for
its developers, can be found at the following URL:

  http://www.iss.nus.sg/RND/MLP/Projects/MASS/MASS.html


7.4: ADOBE TYPE MANAGER (ATM)

	Adobe Type Manager for Macintosh, beginning with Version 3.8,
is CJK-capable (as long as the underlying operating system is CJK-
capable). Actually, ATM generically supports CID-keyed fonts, which
are based on a newly-developed file specification for fonts with large
numbers of characters (like CJK fonts). See Section 7.9 for more
details.
	ATM is very easy to obtain. It is bundled with fonts and
applications from Adobe Systems (chances are you have ATM if you
recently purchased an Adobe product). But what about Windows? The
Windows version of ATM should soon follow with identical
functionality.


7.5: MACINTOSH SOFTWARE

	WorldScript II, a System Extension introduced with System 7,
provides multi-byte script handling, namely CJK support. If a
Macintosh product claims to support WorldScript II, chances are it is
CJK-capable (provided that your operating system has the necessary
extensions loaded).
	The CJK encodings that are supported by WorldScript II capable
applications are the same as made available by the underlying
Macintosh operating system. No import/export of other encodings is
supported at the operating system level. You must run separate
conversion utilities for both import and export. Anyway, below are
some products that are known to be CJK capable.
	Nisus Writer, written by Nisus Software, is fully CJK-capable
as long as you have the appropriate scripts installed (such as CLK for
Chinese or JLK for Japanese). A "Language Key" (read "dongle") is also
required for Chinese and Korean (and some one-byte scripts such as
Arabic and Hebrew). A demo version of Nisus Writer is available at the
following URL:

  ftp://ftp.nisus-soft.com/pub/nisus/demos/

Give it a try! Updates are also available at the same FTP site. Nisus
Software can be contacted using the following e-mail address or
through their WWW page:

  info@nisus-soft.com
  http://www.nisus-soft.com/

I also suggest reading "The Nisus Way" by Joe Kissell. Chapter 13
provides detailed information about using Nisus Writer with
WorldScript, and includes a CD-ROM containing among other things a
trial (expires after 90 days) version of Nisus Writer and a
non-expiring version of Nisus Compact.
	ClarisWorks by Claris Corporation, beginning with Version 4.0,
is compatible with WorldScript II and all Apple language kits. This
translates into full CJK support. The following URL provides a trial
version of ClarisWorks:

  ftp://ftp.claris.com/pub/USA-Macintosh/Trial_Software/

The following URL has detailed information on this and other Claris
products:

  http://www.claris.com/

	The latest version of WordPerfect by Novell Incorporated is
also compatible with WorldScript II. The following URL has detailed
information:

  http://wp.novell.com/tree.htm


7.6: MACBLUE TELNET

	Although MacBlue Telnet (a modified version of NCSA Telnet) is
Macintosh software, I describe it separately because it does not
require the various Apple Language Kits or localized operating
systems. There are also input methods, adapted from cxterm (see
Section 7.7), available that cover the CJK spectrum (Japanese,
Simplified Chinese, Traditional Chinese, and Korean).
	MacBlue Telnet is available at the following URL:

  ftp://ftp.ifcss.org/pub/software/mac/networking/MacBlueTelnet/

Its associated CJK input methods are at the following URL:

  ftp://ftp.ifcss.org/pub/software/mac/input/


7.7: CXTERM

	This program, cxterm, is a CJK-capable xterm for X Windows
(works with X11R4, X11R5, and X11R6). It is based on the X11R6 xterm.
It is available at the following URL:

  ftp://ftp.ifcss.org/pub/software/x-win/cxterm/

	The following URL is for a program that adds Unicode
capability to cxterm:

  ftp://ftp.ifcss.org/pub/software/unix/convert/hztty-2.0.tar.gz

The following URL adds support for other encodings to cxterm:

  ftp://ftp.ifcss.org/pub/software/unix/convert/BeTTY-1.534.tar.gz


7.8: UW-DBM

	UW-DBM, for Windows 3.1, Windows 95, and Windows NT, is a
program that allows users to handle Chinese (Big Five, GB 2312-80, or
HZ code), Japanese (Shift-JIS), and Korean (KS C 5601-1992)
simultaneously. More information on UW-DBM is available at the
following URL:

  http://www.gy.com/ccd/win95/cjkw95.htm

	A demo version of UW-DBM is available at the following URL:

  ftp://ftp.aimnet.com/pub/users/chinabus/uwdbm40.zip


7.9: POSTSCRIPT

	With the introduction of CID-keyed Font Technology, PostScript
has become fully CJK capable.
	Adobe Systems has developed the following CJK character
collections for CID-keyed fonts (font developers are encouraged to
conform to these specifications):

  Character Collection  CIDs   Supported Character Sets & Encodings
  ^^^^^^^^^^^^^^^^^^^^  ^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  Adobe-GB1-1           9,897  GB 2312-80 and GB/T 12345-90; 7-bit ISO
                               2022 and EUC
  Adobe-CNS1-0         14,099  Big Five (ETen extensions) and CNS
                               11643-1992 Planes 1 and 2; Big Five,
                               7-bit ISO 2022, and EUC
  Adobe-Japan1-2        8,720  JIS X 0208-1990; Shift-JIS, 7-bit ISO
                               2022, and EUC
  Adobe-Japan2-0        6,068  JIS X 0212-1990; 7-bit ISO 2022 and EUC
  Adobe-Korea1-1       18,155  KS C 5601-1992 (Macintosh extensions
                               plus Johab); 7-bit ISO 2022, EUC, UHC,
                               and Johab

Note that Macintosh and Windows do not support any of the encodings
for Adobe-Japan2-0, thus fonts based on that specification are
unusable on those platforms.
	Adobe Systems also has a few things in the works (that is,
they are either proposed or in draft form), all of which are
supplements to the above character collections (that is, they add
CIDs):

  Character Collection  CIDs   Supported Character Sets & Encodings
  ^^^^^^^^^^^^^^^^^^^^  ^^^^   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  Adobe-CNS1-1         +6,018  Add CNS 11643-1992 Plane 3 support (30
                               of the 6,148 hanzi are in Adobe-CNS1-0)

	To find out more about these CJK character collections or
CID-keyed font technology, contact the Adobe Developers Association.
Several CID-related documents have been published. ADA's contact
information is as follows:

  Adobe Developers Association
  Adobe Systems Incorporated
  1585 Charleston Road
  P.O. Box 7900
  Mountain View, CA 94039-7900
  USA
  +1-415-961-4111 (phone)
  +1-415-967-9231 (facsimile)
  devsupp-person@adobe.com
  http://www.adobe.com/Support/

Adobe Systems has recently developed the CID SDK (CID Software
Developers Kit), which is on a single CD-ROM. Contact the Adobe
Developers Association for information on obtaining a copy.
	The complete CID-keyed font file specification and an overview
document are available at the following URLs (as a PostScript or PDF
[Adobe Acrobat] file, respectively):

  ftp://ftp.adobe.com/pub/adobe/DeveloperSupport/TechNotes/PSfiles/
  ftp://ftp.adobe.com/pub/adobe/DeveloperSupport/TechNotes/PDFfiles/

The file names (not provided above due to URL length) are:

  5014.CMap_CIDFont_Spec.ps    (complete CID engineering specification)
  5014.CMap_CIDFont_Spec.pdf
  5092.CID_Overview.ps         (CID technology overview)
  5092.CID_Overview.pdf

Other related files, mostly character collection specifications, are
available only in PDF format at the latter URL indicated above:

  5004.AFM_Spec.pdf            (Includes CID-keyed AFM specification)
  5078b.pdf                    (Adobe-Japan1-2 character collection)
  5079b.pdf                    (Adobe-GB1-0 character collection)
  5080b.pdf                    (Adobe-CNS1-0 character collection)
  5093b.pdf                    (Adobe-Korea1-0 character collection)
  5094.pdf                     (Adobe CJK CMap file descriptions)
  5097b.pdf                    (Adobe-Japan2-0 character collection)

If you do not have Adobe Acrobat, there is a freely available Acrobat
Reader (for Macintosh, Windows, MS-DOS, and UNIX) at the following
URL:

  ftp://ftp.adobe.com/pub/adobe/Applications/Acrobat/

	I have also placed some CJK character collection materials,
including prototype Unicode (UCS-2 and UTF-8) CMap files, at the
following URL:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/adobe/

A sample (Adobe-Korea1-0) CIDFont is also available at the above URL.
	There is also a somewhat brief description of CID-keyed fonts
at the end of Chapter 6 in UJIP.


7.10: NJWIN

	Hongbo Data Systems has recently released a shareware ($49 USD)
product called NJWIN whose purpose is to force the display of CJK text
in non-CJK applications running under US Windows 95. Actually, there
are two versions: full CJK and Japanese only.
	NJWIN and its full description are available at the following
URL:

  http://www.njstar.com.au/njstar/njwin.htm

Other (popular) URLs that carry NJWIN are as follows:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/windows/
  ftp://ftp.cc.monash.edu.au/pub/nihongo/

	Hongbo Data Systems' e-mail address is:

  hongbo@njstar.com.au

Their WWW Home Page is at the following URL:

  http://www.njstar.com.au/


PART 8: CJK PROGRAMMING ISSUES

	This new section describes issues related to using specific
programming languages to process CJK text.


8.1: C AND C++

	At one time I used C on a regular basis for my CJK programming
needs, and released three tools for others to use: JConv, JChar, and
JCode. While these tools are specific to Japanese, they can be easily
adapted for CJK use. Their source code is available at the following
URL:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/src/

	I also provided several C code snippets in Chapter 7 of
UJIP. These are available in machine-readable form at the following
URL:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/Ch7/


8.2: PERL

	Although Perl does not have any special CJK facilities (note
that most implementations of C and C++ do not either), it provides a
powerful programming environment that is useful for many CJK-related
tasks.
	The noteworthy features of Perl are associative arrays and
regular expressions. These are features not found in C or C++, and
allow one to write meaningful code in little time.
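	As a rough illustration of how these two features combine for
a CJK task, the following sketch counts the two-byte characters in
EUC-encoded text. It is written in Python, whose dictionaries and "re"
module correspond to Perl's associative arrays and regular
expressions, and it is not taken from UJIP. The function name
count_two_byte_characters is hypothetical, and the sketch assumes that
the input contains only ASCII and two-byte characters whose bytes both
fall in the range 0xA1 through 0xFE.

```python
import re

# Two bytes, each in the range 0xA1-0xFE: the EUC form of a two-byte
# character. Assumption: the input holds only ASCII and such two-byte
# characters (no single-shift sequences), so matches stay aligned.
TWO_BYTE = re.compile(rb"[\xA1-\xFE][\xA1-\xFE]")


def count_two_byte_characters(data: bytes) -> dict:
    """Return a mapping from each two-byte character to its frequency."""
    counts = {}  # plays the role of a Perl associative array
    for match in TWO_BYTE.finditer(data):
        character = match.group()
        counts[character] = counts.get(character, 0) + 1
    return counts


sample = "CJK: 日本語の日本".encode("euc_jp")
for character, count in count_two_byte_characters(sample).items():
    print(character.decode("euc_jp"), count)
```

The regular expression finds the characters and the dictionary tallies
them, which is the same division of labor one would use in Perl.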
	JPerl is an implementation of Perl that provides two-byte
support for Japanese (EUC or Shift-JIS encoding). It is not ideal
because JPerl scripts often cannot run under (non-Japanese) Perl.
	If you often write programs for internal use, I suggest that
you check out Perl to see if it can offer you something. Chances are
that it can. Good places to start looking at Perl are books on the
subject (see Section A.3.1) and the following URL:

  http://www.perl.com/

	For those who like additional reading, "The Perl Journal" is
starting up, and information is at the following URL:

  http://work.media.mit.edu/the_perl_journal/


8.3: JAVA

	I am just starting to learn about the Java programming
language (and rightly so since my wife is Javanese!). It seems to have
a lot to offer.
	The most interesting aspects of Java are:

o Built-in support for Unicode and UTF-8.
o The programmer must write code in the object-oriented paradigm.
o Provides a portable way to supply compiled code.
o Security features for Internet use.

More information on Java is at the following URLs:

  http://www.gamelan.com/
  http://www.javasoft.com/

Oh, Gamelan is the name of Javanese music.
	Of the books about Java published thus far, the one I consider
to be the best is "Java in a Nutshell" by David Flanagan.
	One programming feature of Perl that I dearly miss in Java is
regexes (regular expressions). Luckily, some kind person wrote a regex
package for Java based on Perl regexes. Information on this Java regex
package is available at the following URL:

  http://www.win.net/~stevesoft/pat/


A FINAL NOTE

	I hope that the information presented here will prove
useful. I would like to keep the electronic version of this document
as up-to-date as possible, and through readers' input, I am able to
do so.
	Many readers will notice that I am heavily into UNIX and
Macintosh (well, I recently got my first PC). If anyone has any
information on CJK-capable interfaces for other platforms, please feel
free to send it to me, and I will be sure to include it in the next
version of CJK.INF. Please include sources for the software or
documentation by providing addresses, phone numbers, FTP sites, and so
on.
	Please do not hesitate to ask me further questions concerning
any subject presented in this document.


ACKNOWLEDGMENTS

	I would like to express my deepest thanks to Kazumasa Utashiro
of Internet Initiative Japan (IIJ). He taught me how to send and
receive Japanese text using the 7-bit ISO 2022 codes back in 1989.
With his help I was able to write JAPAN.INF, my book, and this
document in order to inform others about what he has taught me plus
more.
	Next, I thank all the folks at O'Reilly & Associates for
publishing UJIP. Special thanks to Tim O'Reilly for accepting the book
proposal, and to Peter Mui for guiding me through the process. I have
had nothing but good experiences with "them there fine folks."
	I got to know Jack Halpern through UJIP, and he subsequently
translated it into Japanese. Many thanks to him.
	I am also grateful to my employer, Adobe Systems, for letting
me work on interesting CJK-related projects. I really like what I do
here. In particular, I want to thank Dan Mills, my manager, for
putting up with me for these past four years.
	Lastly, I would also like to thank the countless people who
provided comments on JAPAN.INF, UJIP, and CJK.INF. I hope that this
new document lives up to the spirit of my previous efforts.


APPENDIX A: OTHER INFORMATION SOURCES

	Among the most useful types of information are pointers to
other information sources. This appendix provides just that.


A.1: USENET NEWSGROUPS AND MAILING LISTS

	Appendix L of UJIP provided information on a number of mailing
lists. This section supplements that appendix with information on
other useful mailing lists, and points out which ones in UJIP are
relevant to readers of CJK.INF.


A.1.1: USENET NEWSGROUPS

	The following Usenet Newsgroups typically have postings with
information relevant to issues discussed in CJK.INF (in alphabetical
order):

  alt.chinese.computing
  alt.chinese.text                (HZ encoding used for Chinese text)
  alt.chinese.text.big5           (Big Five encoding used for Chinese text)
  alt.japanese.text               (JIS encoding used for Japanese text)
  chinese.flame                   (UTF-7)
  chinese.text.unicode            (UTF-8)
  comp.lang.c
  comp.lang.c++
  comp.lang.java
  comp.lang.perl.misc
  comp.software.international
  comp.std.internat
  fj.editor.mule                  (JIS encoding used for Japanese text)
  fj.kanji                        (JIS encoding used for Japanese text)
  fj.net.infosystems.www.browsers (JIS encoding used for Japanese text)
  fj.news.reader                  (JIS encoding used for Japanese text)
  han.comp.hangul
  han.sys.mac
  sci.lang.japan                  (JIS encoding used for Japanese text)

	If your local news host does not provide a feed of the fj.*
newsgroups (shame on them!), or if you do not have access to Usenet
News, you can alternatively fetch them from the following URL:

  ftp://kuso.shef.ac.uk/pub/News/

The subdirectories correspond to the newsgroup name, but with the
"dots" being replaced by "slashes." For example, the "fj.binaries.mac"
newsgroup is archived in the "fj/binaries/mac" subdirectory. Many
thanks to Earl Kinmonth (jp1ek@sunc.shef.ac.uk) for this service.
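	The mapping from newsgroup name to archive subdirectory can be
sketched in a few lines of Python. This is only an illustration: the
function names are hypothetical, the base URL is simply the one given
above, and the trailing slash on the resulting URL is an assumption.

```python
# Base URL of the newsgroup archive given in the text above.
ARCHIVE_BASE = "ftp://kuso.shef.ac.uk/pub/News/"


def newsgroup_to_subdirectory(newsgroup: str) -> str:
    """Replace the dots in a newsgroup name with slashes."""
    return newsgroup.replace(".", "/")


def newsgroup_to_archive_url(newsgroup: str, base: str = ARCHIVE_BASE) -> str:
    """Build the archive URL for a newsgroup (hypothetical helper)."""
    return base + newsgroup_to_subdirectory(newsgroup) + "/"


print(newsgroup_to_subdirectory("fj.binaries.mac"))
print(newsgroup_to_archive_url("fj.binaries.mac"))
```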
	There are some sites that carry full feeds of the fj.*
newsgroups, and permit public access (meaning that you can configure
your news reader to point to one of them). The only one I know of thus
far is as follows:

  ume.cc.tsukuba.ac.jp


A.1.2: MAILING LISTS

	The following are mailing lists that should interest readers
of this document (some are more active than others). The first line
after each entry indicates the address (or addresses) that can be used
for subscribing. The second line is the address for posting.

o CCNET-L MAILING LIST
  listserv@uga.uga.edu (or listserv@uga)
  ccnet-l@uga.uga.edu

o China Net Mailing List
  majordomo@lists.mindspring.com
  (See http://www.asia-net.com/ or jobs@asia-net.com)

o EASUG (East Asian Software Users Group) Mailing List
  easug-request@guvax.acc.georgetown.edu
  easug@guvax.acc.georgetown.edu

o EBTI-L (Electronic Buddhist Text Initiative) Mailing List
  ebti-l-request@uxmail.ust.hk
  ebti-l@uxmail.ust.hk

o EFJ (Electronic Frontiers Japan) Mailing List
  majordomo@lists.twics.com
  efj@lists.twics.com

o Hangul Mailing List (han.comp.hangul newsgroup)
  majordomo@cair.kaist.ac.kr
  hangul@cair.kaist.ac.kr

o INSOFT-L Mailing List
  majordomo@trans2.b30.ingr.com
  insoft-l@trans2.b30

o ISO 10646 Mailing List
  listproc@listproc.hcf.jhu.edu
  iso10646@listproc.hcf.jhu.edu

o Japan Net Mailing List
  majordomo@lists.mindspring.com
  (See http://www.asia-net.com/ or jobs@asia-net.com)

o KanjiTalk Mailing List
  kanjitalk-request@cs15.atr-sw.atr.co.jp (or kanjitalk-request@crl.go.jp)
  kanjitalk@cs15.atr-sw.atr.co.jp (or kanjitalk@crl.go.jp)

o Mac Mailing List (han.sys.mac newsgroup)
  majordomo@krnic.net
  mac@krnic.net

o Mule Mailing List
  mule-request@etl.go.jp
  mule@etl.go.jp or mule-jp@etl.go.jp

o NIHONGO Mailing List (sci.lang.japan newsgroup)
  listserv@mitvma.mit.edu (or listserv@mitvma)
  nihongo@mitvma.mit.edu

o Nihongo-Hiroba Mailing List
  listproc@mcfeeley.cc.utexas.edu
  nihongo-hiroba@mcfeeley.cc.utexas.edu

o Nisus Mailing List
  listserv@dartmouth.edu
  nisus@dartmouth.edu

o TLUG (Tokyo Linux User's Group) Mailing List
  majordomo@lists.twics.com
  tlug@lists.twics.com

o Unicode Mailing List
  unicode-request@unicode.org
  unicode@unicode.org

o WNN User Mailing List
  wnn-user-request@wnn.astem.or.jp
  wnn-user-jp@wnn.astem.or.jp

o WWW Multilingual Mailing List
  www-mling-request@square.ntt.jp
  www-mling@square.ntt.jp

If the name of the mailing list is part of the subscription address
(such as "easug-request"), the message body should look like this:

  subscribe

Including your name is optional. If the username in the subscription
address is "listserv" or "majordomo" (these are names of mailing list
managing software), the mailing list name must appear after
"subscribe" in the message body as follows:

  subscribe ccnet-l

Again, including your name is optional.
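	The two subscription rules just described can be expressed as
a small Python sketch. The function name subscription_body is
hypothetical. Addresses that fit neither rule (for example, those
whose username is "listproc") are not covered by the rules above, so
the sketch refuses to guess and raises an error for them.

```python
# Usernames identified above as mailing list managing software.
LIST_MANAGERS = {"listserv", "majordomo"}


def subscription_body(subscribe_address: str, list_name: str) -> str:
    """Return the message body for a subscription request."""
    username = subscribe_address.split("@", 1)[0].lower()
    if username in LIST_MANAGERS:
        # The manager handles many lists, so the list must be named.
        return "subscribe " + list_name
    if list_name.lower() in username:
        # The list name is part of the address (such as "easug-request").
        return "subscribe"
    raise ValueError("no rule for address: " + subscribe_address)


print(subscription_body("listserv@uga.uga.edu", "ccnet-l"))
print(subscription_body("easug-request@guvax.acc.georgetown.edu", "easug"))
```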
	The following URL has information about Japanese-related
mailing lists:

  gopher://gan1.ncc.go.jp/11/INFO/mail-lists/


A.2: INTERNET RESOURCES

	The Internet provides what I would consider to be the greatest
information resources of all. These can be subcategorized into FTP,
Telnet, Gopher, WWW, and e-mail.


A.2.1: USEFUL FTP SITES

	Below are the URLs for useful FTP sites. The directory
specified is the recommended place from which to start poking around
for useful files.

  ftp://cair-archive.kaist.ac.kr/pub/hangul/
  ftp://etlport.etl.go.jp/pub/mule/
  ftp://ftp.adobe.com/pub/adobe/
  ftp://ftp.cc.monash.edu.au/pub/nihongo/
  ftp://ftp.ifcss.org/pub/software/
  ftp://ftp.ora.com/pub/examples/nutshell/ujip/
  ftp://ftp.sra.co.jp/pub/
  ftp://ftp.uwtc.washington.edu/pub/Japanese/
  ftp://kuso.shef.ac.uk/pub/Japanese/
  ftp://unicode.org/pub/

This list is expected to grow.


A.2.2: USEFUL TELNET SITES

	For those who have a NIFTY-Serve account, there is now a very
convenient way to access NIFTY-Serve using telnet. The URL is as
follows:

  telnet://r2.niftyserve.or.jp/

Information about what NIFTY-Serve has to offer (and how to subscribe)
can be found at the following URL:

  http://www.nifty.co.jp/

	Another information service with a similar access mechanism is
CompuServe, whose URL is as follows:

  telnet://compuserve.com/

You will need to press the return key to get the "Host Name:" prompt,
at which time you type "cis" (just follow the menus from this point
on).
	You can also do a search on fj.* newsgroup articles at the
following URL:

  telnet://asahi-net.or.jp/

You log in as "fj-db" once you are connected.


A.2.3: USEFUL GOPHER SITES

	I am not too much of a Gopher user. There, of course, is the
following:

  gopher://gopher.ora.com/

Another Gopher site provides information on Japanese-related mailing
lists:

  gopher://gan1.ncc.go.jp/11/INFO/mail-lists/

If you happen to know of others, please let me know.


A.2.4: USEFUL WWW SITES

	Because the World-Wide Web is a constantly changing place (and
more importantly, because I don't want to issue a new version of
this document every month!), I will maintain links to useful documents
at my WWW Home Page. Its URL is as follows:

  http://jasper.ora.com/lunde/

If you cannot get to my WWW Home Page, you couldn't get to any that I
would list here anyway.


A.2.5: USEFUL MAIL SERVERS

	In the past (that is, in JAPAN.INF) I included a full list of
the domains in the "jp" hierarchy. That list took up a lot of space,
and it changes very rapidly. You can now send a request to a mail
server to retrieve the most current listing. The mail server is:

  mail-server@nic.ad.jp

The most common command is "send," and the following arguments can be
supplied to retrieve specific documents (and should be in the message
body, not on the "Subject:" line):

  send help
  send index
  send jpnic/domain-list.txt
  send jpnic/domain-list-e.txt

The first sends back a help file, the second sends back a complete
index of files that can be retrieved (use this one to see what other
useful stuff is available), and the last two send back a complete
listing of domains in the "jp" hierarchy (the last one sends it back
in English/romanized).
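	A request to the mail server can be composed with Python's
standard "email" package, as in the following sketch. The function
name build_request and the sender address are hypothetical; the sketch
only builds the message and does not send it. The point it illustrates
is that the command belongs in the message body, with the "Subject:"
line left unset.

```python
from email.message import EmailMessage

# Mail server address given in the text above.
MAIL_SERVER = "mail-server@nic.ad.jp"


def build_request(document: str, sender: str) -> EmailMessage:
    """Compose a "send" request for one document (hypothetical helper)."""
    message = EmailMessage()
    message["From"] = sender
    message["To"] = MAIL_SERVER
    # The command goes in the body; no Subject: header is added.
    message.set_content("send " + document + "\n")
    return message


request = build_request("jpnic/domain-list-e.txt", "user@example.com")
print(request.get_content().strip())
```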


A.3: OTHER RESOURCES

	This section provides pointers to specific documentation
available electronically or in print.


A.3.1: BOOKS

	There are other useful reference materials available in print
or online, in addition to the various national and international
standards mentioned throughout this document. The following are books
that I recommend for further reading or mental stimulus. (Sorry for
plugging my own books in this list, but they are relevant.)

o Clews, John. "Language Automation Worldwide: The Development of
  Character Set Standards." SESAME Computer Projects. 1988. ISBN
  1-870095-01-4.

o Flanagan, David. "Java in a Nutshell." O'Reilly & Associates,
  Inc. 1996. ISBN 1-56592-183-6.

o Frisch, AEleen. "Essential System Administration." Second Edition.
  O'Reilly & Associates, Inc. 1995. ISBN 1-56592-127-5.

o Huang, Jack & Timothy Huang. "An Introduction to Chinese, Japanese
  and Korean Computing." World Scientific Computing. 1989. ISBN
  9971-50-664-5.

o IBM Corporation. "Character Data Representation Architecture - Level
  2, Registry." 1993. IBM order number SC09-1391-01.

o Kano, Nadine. "Developing International Software for Windows 95 and
  Windows NT." Microsoft Press. 1995. ISBN 1-55615-840-8.

o Kirch, Olaf. "Linux Network Administrator's Guide." O'Reilly &
  Associates, Inc. 1995. ISBN 1-56592-087-2.

o Kissell, Joe. "The Nisus Way." MIS:Press. 1996. ISBN 1-55828-455-9.

o Krol, Ed. "The Whole Internet User's Guide & Catalog." Second
  Edition. O'Reilly & Associates, Inc. 1994. ISBN 1-56592-063-5.

o Liu, Cricket et al. "Managing Internet Information Services."
  O'Reilly & Associates, Inc. 1994. ISBN 1-56592-062-7.

o Lunde, Ken. "Understanding Japanese Information Processing."
  O'Reilly & Associates, Inc. 1993. ISBN 1-56592-043-0. LCCN
  PL524.5.L86 1993.

o Lunde, Ken. "Nihongo Joho Shori." SOFTBANK Corporation. 1995. ISBN
  4-89052-708-7.

o Luong, Tuoc V. et al. "Internationalization: Developing Software for
  Global Markets." John Wiley & Sons, Incorporated. 1995. ISBN
  0-471-07661-9.

o Schwartz, Randal L. "Learning Perl." O'Reilly & Associates,
  Inc. 1993. ISBN 1-56592-042-2.

o Stallman, Richard M. "GNU Emacs Manual." Tenth edition. Free
  Software Foundation. 1994. ISBN 1-882114-04-3.

o Tuthill, Bill. "Solaris International Developer's Guide." SunSoft
  Press and PTR Prentice Hall. 1993. ISBN 0-13-031063-8.

o Unicode Consortium, The. "The Unicode Standard: Worldwide Character
  Encoding." Version 1.0. Volume 2. Addison-Wesley. 1992. ISBN
  0-201-60845-6.

o Vromans, Johan. "Perl 5 Desktop Reference." O'Reilly & Associates,
  Inc. 1996. ISBN 1-56592-187-9.

o Wall, Larry & Randal L. Schwartz. "Programming Perl." O'Reilly &
  Associates, Inc. 1991. ISBN 0-937175-64-1.

o Welsh, Matt & Lar Kaufman. "Running Linux." O'Reilly & Associates,
  Inc. 1995. ISBN 1-56592-100-3.

	If you want to get your hands on any of the national or
international standards mentioned in this document, I suggest the
following:

o The American National Standards Institute can provide ISO, KS, and
  JIS standards. Bear in mind that ISO standards will most likely
  arrive as a photocopy of the original.

  ANSI
  11 West 42nd Street
  New York, NY 10036
  USA
  +1-212-642-4900 (phone)
  +1-212-302-1286 (facsimile)

o The International Organization for Standardization can provide
  ISO standards.

  ISO
  1, rue de Varembé
  Case postale 56
  CH-1211, Geneva 20
  SWITZERLAND
  +41-22-749-01-11 (phone)
  +41-22-733-34-30 (facsimile)
  central@isocs.iso.ch (e-mail)
  http://www.iso.ch/ (WWW)

o Chinese (GB and CNS) standards are the hardest to obtain. It is
  quite unfortunate.


A.3.2: MAGAZINES

o "Computing Japan," published monthly, ISSN 1340-7228,
  editors@cj.gol.com.

o "MANGAJIN," published 10 times per year, ISSN 1051-8177.

o "Multilingual Communications & Computing," published bi-monthly,
  ISSN 1065-7657, info@multilingual.com.

o "The Perl Journal," published quarterly, ISSN 1087-903X,
  perl-journal-subscriptions@perl.com.


A.3.3: JOURNALS

o "Chinese Information Processing" (CIP), published bi-monthly, ISSN
  1003-9082. (In Chinese.)

o "Computer Processing of Chinese & Oriental Languages" (CPCOL),
  co-published twice a year by World Scientific Publishing and Chinese
  Language Computer Society (CLCS), ISSN 0715-9048.

o "The Electronic Bodhidharma," published by the International
  Research Institute for Zen (IRIZ) Buddhism, Hanazono University,
  Japan. More information on the organization that publishes this
  journal is available at the following URL:

  http://www.iijnet.or.jp/iriz/irizhtml/irizhome.htm


A.3.4: RFCs

	Many RFCs (Request For Comments) are relevant to this
document. They are:

o RFC 1341: "MIME (Multipurpose Internet Mail Extensions): Mechanisms
  for Specifying and Describing the Format of Internet Message
  Bodies," by Nathaniel Borenstein and Ned Freed, June 1992.

o RFC 1342: "Representation of Non-ASCII Text in Internet Message
  Headers," by Keith Moore, June 1992.

o RFC 1468: "Japanese Character Encoding for Internet Messages," by
  Jun Murai et al., June 1993.

o RFC 1521: "MIME (Multipurpose Internet Mail Extensions) Part One:
  Mechanisms for Specifying and Describing the Format of Internet
  Message Bodies," by Nathaniel Borenstein and Ned Freed, September
  1993. Obsoletes RFC 1341.

o RFC 1522: "MIME (Multipurpose Internet Mail Extensions) Part Two:
  Message Header Extensions for Non-ASCII Text," by Keith Moore,
  September 1993. Obsoletes RFC 1342.

o RFC 1554: "ISO-2022-JP-2: Multilingual Extension of ISO-2022-JP," by
  Masataka Ohta and Kenichi Handa, December 1993.

o RFC 1557: "Korean Character Encoding for Internet Messages," by
  Uhhyung Choi et al., December 1993.

o RFC 1642: "UTF-7: A Mail-Safe Transformation Format of Unicode," by
  David Goldsmith and Mark Davis, July 1994.

o RFC 1815: "Character Sets ISO-10646 and ISO-10646-J-1," by Masataka
  Ohta, July 1995.

o RFC 1842: "ASCII Printable Characters-Based Chinese Character
  Encoding for Internet Messages," by Ya-Gui Wei et al., August 1995.

o RFC 1843: "HZ - A Data Format for Exchanging Files of Arbitrarily
  Mixed Chinese and ASCII Characters," by Fung Fung Lee, August 1995.

o RFC 1922: "Chinese Character Encoding for Internet Messages," by
  Haifeng Zhu et al., March 1996.

These RFCs can be obtained from FTP archives that contain all RFC
documents, such as at the following URLs:

  ftp://nic.ddn.mil/rfc/
  ftp://ftp.uu.net/inet/rfc/

But these specific ones are mirrored at the following URL for
convenience:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/Ch9/


A.3.5: FAQs

	There are several FAQ (Frequently Asked Questions) files that
provide useful information. The following is a listing of some along
with their URLs:

o "Japanese Language Information" FAQ (formerly the "sci.lang.japan"
  FAQ) by Rafael Santos (santos@mickey.ai.kyutech.ac.jp) at:

  http://www.mickey.ai.kyutech.ac.jp/cgi-bin/japanese/

  Update announcements are usually posted to the sci.lang.japan
  newsgroup.

o "Programming for Internationalization" FAQ by Michael Gschwind
  (mike@vlsivie.tuwien.ac.at) at:

  ftp://ftp.vlsivie.tuwien.ac.at/pub/8bit/ISO-programming

  Also posted to the comp.software.international newsgroup. This and
  other internationalization documents are also accessible through the
  following URL:

  http://www.vlsivie.tuwien.ac.at/mike/i18n.html

o Three FAQs about Internet Service Providers in Japan by Taki Naruto
  (tn@panix.com), Jesse Casman (jcasman@unm.edu), and Kenji Yoshida
  (kenny@mb.tokyo.infoweb.or.jp), respectively, at:

  http://www.panix.com/~tn/ispj.html
  http://nobunaga.unm.edu/internet.html
  http://cswww2.essex.ac.uk/users/whean/japan/net.html

o "Internationalization Reference List" by Eugene Dorr
  (gdorr@pgh.legent.com) at:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/doc/i18n-books.txt

  Not really a FAQ, but quite useful because it is a very complete
  listing of I18N-related books.

o "INSOFT-L Service" by Brian Tatro (btatro@tatro.com) at:

  http://iquest.com/~btatro/in2.html

  This includes a link to the FAQ for the INSOFT-L Mailing List (see
  Section A.1.2).

o "How to Use Japanese on the Internet with a PC: From Login to WWW"
  by Hideki Hirayama (sgw01623@niftyserve.or.jp) at:

  ftp://ftp.ora.com/pub/examples/nutshell/ujip/faq/jpn-inet.FAQ

o "Hangul and Internet in Korea" FAQ by Jungshik Shin
  (jshin@minerva.cis.yale.edu) at:

  http://pantheon.cis.yale.edu/~jshin/faq/
---  END (CJK.INF VERSION 2.1 07/12/96) 185553 BYTES  ---